A significant amount of our recent work has made use of the plethora of user-generated, geospatial content publicly available online. Web applications such as Yelp, Google Places and Foursquare provide access to tens of millions of points of interest (POI) worldwide. These POI offer descriptive information ranging from semi-structured content such as price range ($$) and ambiance, to unstructured review text and personal user check-ins.
One of the issues that continues to plague our research is the sparsity of the data and the variability of the contributions. For example, one Yelp POI may have 4000 user reviews while another has none. One provider may list 5000 venues in a city such as Santa Barbara while another shows only 400. Additionally, the types of attributes accessible via the provider APIs vary considerably. Yelp, for example, offers over 20 different categories of descriptive attributes, ranging from WiFi access to wheelchair accessibility. Foursquare, on the other hand, provides access to temporal geosocial data such as check-ins and friendship ties.
In order to accurately model placial activities and interactions in the real world, we need access to as many POI (and descriptive attributes of these POI) as possible. Given this need, we propose a weighted multi-attribute matching approach that makes use of the range of properties offered by providers. Our regression-based, weighted-attribute model takes a novel approach to POI matching, incorporating attributes such as name, category, geographic coordinates and descriptive user-generated text.
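To give a rough feel for what multi-attribute matching looks like in practice, here is a minimal Python sketch that scores one candidate pair of POI on name, category and location, then combines the scores with weights. It is an illustration only, not the model from the paper: the function names, the dictionary layout, the 100 m distance scale, the example weights and the two sample records are all made up for this post. The name measure uses `difflib` from the standard library as a simple stand-in for a phonetic measure, and the text-based (topic model) attribute is left out.

```python
import math
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for a phonetic measure)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(min(1.0, h)))


def distance_similarity(meters, scale_m=100.0):
    """Map a distance to (0, 1]; scale_m is an assumed decay length, not a tuned value."""
    return math.exp(-meters / scale_m)


def category_similarity(a, b):
    """Jaccard overlap of two category label collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def attribute_scores(p, q):
    """Per-attribute similarity scores for one candidate pair of POI."""
    meters = haversine_m(p["lat"], p["lon"], q["lat"], q["lon"])
    return {
        "name": name_similarity(p["name"], q["name"]),
        "category": category_similarity(p["categories"], q["categories"]),
        "distance": distance_similarity(meters),
    }


def weighted_score(scores, weights):
    """Weighted average of the attribute scores; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total


# Hypothetical records from two providers (invented for illustration).
poi_a = {"name": "Cafe Luna", "lat": 34.4208, "lon": -119.6982,
         "categories": ["coffee", "bakery"]}
poi_b = {"name": "Café Luna Santa Barbara", "lat": 34.4210, "lon": -119.6985,
         "categories": ["coffee"]}

# Hypothetical weights; in a regression-based model these would be estimated from labeled pairs.
weights = {"name": 0.5, "category": 0.2, "distance": 0.3}
print(weighted_score(attribute_scores(poi_a, poi_b), weights))
```

The point of keeping each attribute as its own score is that the weights can then be learned from known matches rather than picked by hand.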
Using a range of independent measures (e.g., LDA topic modeling, Double Metaphone phonetic similarity) and a binomial probit regression model, we are able to identify the same real-world POI across two different providers with 97% accuracy!
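For readers unfamiliar with probit regression, the idea is that a linear combination of the similarity measures is passed through the standard normal CDF to produce a match probability. The short Python sketch below shows that scoring step only. The coefficient values, feature names and the 0.5 decision threshold are placeholders invented for illustration; they are not the fitted values from the paper, and fitting the model is not shown.

```python
from statistics import NormalDist


def match_probability(features, coefficients, intercept):
    """Probit link: P(match) = Phi(intercept + sum_k beta_k * x_k)."""
    z = intercept + sum(coefficients[k] * features[k] for k in coefficients)
    return NormalDist().cdf(z)  # standard normal CDF


def is_match(features, coefficients, intercept, threshold=0.5):
    """Classify a candidate pair; the threshold is an assumed cutoff."""
    return match_probability(features, coefficients, intercept) >= threshold


# Placeholder coefficients (hypothetical, not estimated from any data).
coefficients = {"name": 2.0, "category": 0.8, "distance": 1.5, "text": 1.0}
intercept = -3.0

# Hypothetical similarity scores in [0, 1] for one candidate pair.
pair = {"name": 0.9, "category": 0.5, "distance": 0.8, "text": 0.6}

print(match_probability(pair, coefficients, intercept))
print(is_match(pair, coefficients, intercept))
```

In a real pipeline the coefficients would come from fitting the model to pairs that have been labeled by hand as match or non-match.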
A draft of the paper recently submitted for publication is accessible here: POI Matching PDF
As usual, we would welcome input and feedback on this work.