The Data That We Dig

Our Odyssey to Acquire 'Good' Data

To explore this problem, we collected restaurant reviews posted on Yelp and TripAdvisor. We chose to look at restaurants because, unlike businesses such as hotels, they attract a mix of both local and non-local reviews. We took reviews from TripAdvisor and Yelp because, according to our research, they are the two most popular reviewing websites (Kolowhich, 2015), and both provide reviewer location. This let us gather a large set of reviews spanning a wide range of reviewer locations. We define “local” geographically in terms of the metropolitan area, using OpenStreetMap rather than Google Maps to resolve addresses to coordinates. We define the Washington metropolitan area as all locations within a 50-mile radius of the District, so that we have a reasonable amount of data to work with. We hypothesize that when searching for destinations and activities, the interests and knowledge of locals and non-locals differ, and that this difference shows up in the average star rating and the type of review written.
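For the curious, here is a minimal sketch of that address-to-coordinates step, assuming the geopy package as a client for OpenStreetMap's Nominatim geocoder. This is not the exact code behind our dataset, just the idea:

```python
# Minimal sketch: resolve an address to (latitude, longitude) using
# OpenStreetMap's Nominatim service via the geopy package.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="dc-review-study")  # any descriptive user agent works

def to_coords(address):
    """Return (latitude, longitude) for an address, or None if it can't be resolved."""
    location = geolocator.geocode(address)
    if location is None:
        return None
    return (location.latitude, location.longitude)

print(to_coords("Washington, DC"))  # roughly (38.895, -77.036)
```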

Yelp has an API, but it did not provide all of the reviews we wanted, so we used a scraper to pull review information. TripAdvisor has no API, so we modified our scraper to pull its reviews as well. In the end, we put together a DC restaurant review dataset that yielded the following number of reviews.
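Before getting to those numbers, here is roughly what the scraping step looks like. This is a stripped-down sketch using requests and BeautifulSoup, not our production scraper; the CSS selectors are placeholders, since both sites change their markup regularly:

```python
# Illustrative scraper skeleton. The selectors below are placeholders and
# would need to be adapted to the current page structure of each site.
import requests
from bs4 import BeautifulSoup

def scrape_reviews(url):
    """Fetch one review page and return a list of (rating, location, text) tuples."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for block in soup.select("div.review"):            # hypothetical selector
        rating = block.select_one(".rating")            # hypothetical selector
        location = block.select_one(".user-location")   # hypothetical selector
        text = block.select_one(".review-text")         # hypothetical selector
        reviews.append((
            rating.get_text(strip=True) if rating else None,
            location.get_text(strip=True) if location else None,
            text.get_text(strip=True) if text else None,
        ))
    return reviews
```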

Since we wanted to compare non-local reviews to those of locals, we took the reviewers' locations and calculated the distance from each reviewer to each restaurant they reviewed. We considered a reviewer to be a local if they lived within fifty miles of the restaurant - close enough that one might say they are from the D.C. area. Over 60% of Yelp's reviews came from local reviewers, while just over 28% of TripAdvisor's did. We believe this disparity is due mostly to the focus of each website:

TripAdvisor is more focused on vacations and destinations, asking "Where are you going?" whereas Yelp is more focused on helping you find "cheap dinner," and already assumes you want to stay in Washington, D.C. rather than go exploring what the rest of the world has to offer.
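The distance calculation behind that fifty-mile cutoff is a standard great-circle (haversine) computation. Here is a minimal sketch, assuming reviewer and restaurant locations have already been geocoded to (latitude, longitude) pairs:

```python
# Sketch of the local/non-local split: haversine distance from reviewer to
# restaurant, with a 50-mile threshold.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def is_local(reviewer_coords, restaurant_coords, threshold_miles=50):
    return haversine_miles(*reviewer_coords, *restaurant_coords) <= threshold_miles

# Example: a reviewer in Baltimore vs. a restaurant near the White House
print(is_local((39.2904, -76.6122), (38.8977, -77.0365)))  # True (~35 miles)
```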

A First Look at our Datasets

We don't like staring at a long list of incomprehensible words and numbers, so we did a bit of visualization to better understand our data. We clustered reviewers by the log of their average distance to restaurants and the log of their number of reviews. Using k-means, we found no significant clusters, but with DBSCAN we were able to identify some core groups. Notice that in the Yelp clustering, the green cluster is a group of local reviewers, while the red and light blue clusters are reviewers from further away. The TripAdvisor clustering has fewer clusters, but the green and red clusters still represent large groups of local and non-local reviewers, respectively.

The Yelp DBSCAN Clustering

The TripAdvisor DBSCAN Clustering
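For those who want to reproduce the idea, here is a sketch of the clustering step with scikit-learn. The eps and min_samples values are illustrative placeholders, not the exact parameters behind the plots above:

```python
# Sketch: DBSCAN over (log average distance, log review count) per reviewer.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def cluster_reviewers(avg_distance_miles, review_counts, eps=0.3, min_samples=10):
    """Return one cluster label per reviewer; -1 marks noise points."""
    features = np.column_stack([
        np.log(np.asarray(avg_distance_miles) + 1.0),  # +1 avoids log(0)
        np.log(np.asarray(review_counts) + 1.0),
    ])
    features = StandardScaler().fit_transform(features)
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)

# Usage: labels = cluster_reviewers(avg_dist_per_reviewer, n_reviews_per_reviewer)
```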

Even these clusterings on a 2-D plot don't tell us much about our data. A great woman named Lisa Singh once said, "Graphs are cool!" We agree. So we built bipartite graphs from our Yelp and TripAdvisor data, with reviewer nodes and restaurant nodes, and an edge between a reviewer node and a restaurant node whenever that reviewer had reviewed that restaurant. In the two graphs below, we highlight the nodes of highest degree in each graph.

The Yelp Graph

The TripAdvisor Graph

Unsurprisingly, these high-degree nodes are all restaurants; it's much harder for one person to eat at and review five hundred restaurants than it is for five hundred people to eat at and review one restaurant.

Notice that the restaurants with the highest degree in each graph are not the same. There is some overlap, but there are also noticeable differences. In TripAdvisor's graph, more upscale restaurants have high degree, while Yelp's graph highlights more everyday eateries such as District Taco.
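For reference, here is a small sketch of how a bipartite review graph like this can be built and its highest-degree nodes pulled out, using networkx. The reviewer and restaurant pairs below are made up, so this reflects the structure rather than our actual data:

```python
# Sketch: bipartite reviewer/restaurant graph plus a highest-degree lookup.
import networkx as nx

def build_review_graph(reviews):
    """reviews: iterable of (reviewer_id, restaurant_name) pairs."""
    G = nx.Graph()
    for reviewer, restaurant in reviews:
        G.add_node(("reviewer", reviewer), bipartite=0)
        G.add_node(("restaurant", restaurant), bipartite=1)
        G.add_edge(("reviewer", reviewer), ("restaurant", restaurant))
    return G

def top_degree_nodes(G, k=5):
    """Return the k nodes with the most incident edges."""
    return sorted(G.degree, key=lambda node_deg: node_deg[1], reverse=True)[:k]

# Toy example (made-up data): District Taco ends up with the highest degree.
G = build_review_graph([
    ("alice", "District Taco"),
    ("bob", "District Taco"),
    ("bob", "Founding Farmers"),
    ("carol", "District Taco"),
])
print(top_degree_nodes(G, k=3))
```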

Now that we've given our data a once-over, let's dig deeper and answer some of the questions that we had!
