For the past week I have been working on a web scraper for the spatial semantic web project. I needed some test data sets of places with location attributes (latitude and longitude) to populate our data store so that they can later be queried with our system. Since spatial data sets are not easy to find, I decided to write a scraper to collect POI data for gas stations in a region from a few websites. The data I obtained is used only for research and testing purposes and is in no way used commercially.
Once again, since I love the .NET platform so much, the scraper is a Windows Forms application written in C# on the .NET Framework.
One thing I found very handy is the XPathNavigator class in the System.Xml.XPath namespace of the .NET Framework. I load a segment of HTML code into an XmlDocument object and create an XPathNavigator object to use as a cursor. With that object, I can traverse the hierarchy of the HTML code. A snippet of the code I used in my project:
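As a rough illustration of this pattern (a minimal sketch rather than the project's exact code), the example below loads a small HTML fragment into an XmlDocument and walks it with an XPathNavigator. The fragment, the `station` class name, and the `ExtractStations` helper are hypothetical. Note that XmlDocument only accepts well-formed XML, so real-world HTML usually has to be trimmed or cleaned up before it will load.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.XPath;

class XPathScrapeSketch
{
    // Hypothetical helper: returns "name | address" for every matching table row.
    public static List<string> ExtractStations(string htmlFragment)
    {
        var doc = new XmlDocument();
        doc.LoadXml(htmlFragment); // throws XmlException if the fragment is not well-formed

        // The navigator acts as a cursor over the document tree.
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator rows = nav.Select("//tr[@class='station']");

        var stations = new List<string>();
        while (rows.MoveNext())
        {
            XPathNavigator row = rows.Current;
            // Relative XPath expressions are evaluated from the current row.
            XPathNavigator name = row.SelectSingleNode("td[1]");
            XPathNavigator address = row.SelectSingleNode("td[2]");
            if (name != null && address != null)
            {
                stations.Add(name.Value + " | " + address.Value);
            }
        }
        return stations;
    }

    static void Main()
    {
        // Made-up fragment standing in for a segment of a scraped page.
        string html =
            "<table>" +
            "<tr class=\"station\"><td>Station A</td><td>123 Main St, Springfield, IL</td></tr>" +
            "<tr class=\"station\"><td>Station B</td><td>456 Oak Ave, Springfield, IL</td></tr>" +
            "</table>";

        foreach (string station in ExtractStations(html))
        {
            Console.WriteLine(station);
        }
    }
}
```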
Geocoding is the process of translating a street address into latitude and longitude values. After obtaining the gas stations from the website, I wanted to find their coordinates and augment my data set. These are the two geocoding services I used:
- Geocoder.us: http://geocoder.us/
- Google Geocoding API: http://code.google.com/apis/maps/documentation/geocoding/
To obtain lat/lon values from the Geocoder.us service, make an HTTP request to http://geocoder.us/service/csv?address=ADDRESS. The following is my Geocoder.us geocoder class:
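A minimal sketch of such a class, not the original listing, could look like this. It assumes that a successful response is a single CSV line beginning with the latitude and longitude (for example `lat,lon,street,city,state,zip`) and that anything else is an error message; that format, along with the class and method names, is an assumption made for illustration.

```csharp
using System;
using System.Globalization;
using System.Net;

class GeocoderUsSketch
{
    // Builds the request URL, percent-encoding the address.
    public static string BuildUrl(string address)
    {
        return "http://geocoder.us/service/csv?address=" + Uri.EscapeDataString(address);
    }

    // Assumed format: "lat,lon,street,city,state,zip" on success.
    // Any response whose first two fields are not numbers is treated as a failure.
    public static bool TryParse(string response, out double lat, out double lon)
    {
        lat = 0;
        lon = 0;
        if (string.IsNullOrEmpty(response)) return false;

        string[] fields = response.Trim().Split(',');
        if (fields.Length < 2) return false;

        return double.TryParse(fields[0], NumberStyles.Float, CultureInfo.InvariantCulture, out lat)
            && double.TryParse(fields[1], NumberStyles.Float, CultureInfo.InvariantCulture, out lon);
    }

    // Performs the HTTP request and parses the body; returns false on any failure.
    public static bool TryGeocode(string address, out double lat, out double lon)
    {
        lat = 0;
        lon = 0;
        try
        {
            using (var client = new WebClient())
            {
                string body = client.DownloadString(BuildUrl(address));
                return TryParse(body, out lat, out lon);
            }
        }
        catch (WebException)
        {
            return false; // network error or non-success HTTP status
        }
    }

    static void Main()
    {
        // Parse a made-up sample line instead of calling the live service.
        double lat, lon;
        string sample = "38.898748,-77.037684,1600 Pennsylvania Ave NW,Washington,DC,20502";
        if (TryParse(sample, out lat, out lon))
        {
            Console.WriteLine(lat.ToString(CultureInfo.InvariantCulture) + ", " +
                              lon.ToString(CultureInfo.InvariantCulture));
        }
    }
}
```

Parsing with `CultureInfo.InvariantCulture` keeps the decimal point handling independent of the machine's regional settings.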
One thing to note about this service is that it is often unable to resolve an address and returns an error message. For this reason I also use the Google geocoder.
I found that Google’s geocoding service is almost always able to resolve an address. It does, however, have a limit of 2,500 requests per day. To use the service, make an HTTP request to http://maps.googleapis.com/maps/api/geocode/output?parameters, where output is the response format (json or xml). The following is my class:
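Again as a sketch rather than the original class, the example below requests the `xml` output format and reads the result with XPathNavigator. The class and method names are hypothetical, and the URL carries only the `address` parameter; any other parameters the service requires (an API key in current versions of the API) are left out here and would need to be added. The sample response in `Main` is hand-written and trimmed to the elements the parser reads.

```csharp
using System;
using System.Globalization;
using System.Xml;
using System.Xml.XPath;

class GoogleGeocoderSketch
{
    // Builds a request URL for the xml output format (additional required parameters omitted).
    public static string BuildUrl(string address)
    {
        return "http://maps.googleapis.com/maps/api/geocode/xml?address=" + Uri.EscapeDataString(address);
    }

    // Reads the status and the first result's location from an XML response.
    public static bool TryParse(string xml, out double lat, out double lng)
    {
        lat = 0;
        lng = 0;

        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XPathNavigator nav = doc.CreateNavigator();

        // Statuses other than "OK" (e.g. ZERO_RESULTS, OVER_QUERY_LIMIT) mean no usable result.
        XPathNavigator status = nav.SelectSingleNode("/GeocodeResponse/status");
        if (status == null || status.Value != "OK") return false;

        const string location = "/GeocodeResponse/result[1]/geometry/location/";
        XPathNavigator latNode = nav.SelectSingleNode(location + "lat");
        XPathNavigator lngNode = nav.SelectSingleNode(location + "lng");
        if (latNode == null || lngNode == null) return false;

        return double.TryParse(latNode.Value, NumberStyles.Float, CultureInfo.InvariantCulture, out lat)
            && double.TryParse(lngNode.Value, NumberStyles.Float, CultureInfo.InvariantCulture, out lng);
    }

    static void Main()
    {
        // Hand-written, trimmed sample response.
        string sample =
            "<GeocodeResponse><status>OK</status><result><geometry><location>" +
            "<lat>37.4217550</lat><lng>-122.0846330</lng>" +
            "</location></geometry></result></GeocodeResponse>";

        double lat, lng;
        if (TryParse(sample, out lat, out lng))
        {
            Console.WriteLine(lat.ToString(CultureInfo.InvariantCulture) + ", " +
                              lng.ToString(CultureInfo.InvariantCulture));
        }
    }
}
```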
Because of the Google service’s daily limit, I first try the Geocoder.us service, and only if it cannot resolve the address do I fall back to Google’s service.
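The fallback order can be sketched as below. The `TryGeocode` delegate and the two stub geocoders are hypothetical stand-ins for the real Geocoder.us and Google classes; the point is only that the second service is called when, and only when, the first one fails, which saves the limited Google quota.

```csharp
using System;

class FallbackGeocoderSketch
{
    // A geocoder takes an address and reports whether it could resolve it.
    public delegate bool TryGeocode(string address, out double lat, out double lon);

    // Tries the primary geocoder first and the secondary only on failure.
    // 'source' records which one produced the result ("primary", "secondary" or "none").
    public static bool Geocode(string address, TryGeocode primary, TryGeocode secondary,
                               out double lat, out double lon, out string source)
    {
        if (primary(address, out lat, out lon))
        {
            source = "primary";
            return true;
        }
        if (secondary(address, out lat, out lon))
        {
            source = "secondary";
            return true;
        }
        source = "none";
        return false;
    }

    // Stub standing in for Geocoder.us: always fails to resolve.
    static bool AlwaysFails(string address, out double lat, out double lon)
    {
        lat = 0;
        lon = 0;
        return false;
    }

    // Stub standing in for Google: always returns fixed, made-up coordinates.
    static bool AlwaysSucceeds(string address, out double lat, out double lon)
    {
        lat = 40.0;
        lon = -75.0;
        return true;
    }

    static void Main()
    {
        double lat, lon;
        string source;
        bool found = Geocode("123 Main St", AlwaysFails, AlwaysSucceeds, out lat, out lon, out source);
        Console.WriteLine(found + " via " + source);
    }
}
```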