Little Facts You Might Like to Know

Our code, datasets, and just about everything else can be found here.


  • Non-random sample: We acknowledge that this is not a random sample, because it covers only Yelp and TripAdvisor reviews. Within those sources, some reviews are also excluded: for Yelp, certain reviews are filtered out because they are deemed non-helpful, and for TripAdvisor, we had to remove reviews that did not have locations.

  • Incomplete user review text: Initially, user review text was cut off for TripAdvisor. Some reviews ended with the word "more", so the recorded review length was not accurate for all TripAdvisor data.

  • Single reviews across multiple rows: We noticed that a small number of user reviews (fewer than 75) in the Yelp data set were broken up into multiple rows. This was likely caused by a newline or EOF character in the review text. Because this was very rare, we skipped these reviews. For example, in yelp_dc_2_cleaned_features, observations 5418 and 5419 both correspond to the same review.

  • Missing location data: As noted above, many TripAdvisor user reviews do not have a location (or have an unrecognized location) and are therefore missing data for the user_is_local variable. We omitted these observations, since our focus was on the differences between local and non-local reviewers.

  • Faulty local/non-local data: A small number (1,627) of Yelp reviews had faulty local vs. non-local data and were therefore omitted.
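
For readers who want to run similar checks on their own copy of the data, here is a minimal sketch of two of the rules above: omitting rows with no usable user_is_local value, and flagging review text that ends with the word "more". It is an illustration, not our actual cleaning code. The column name review_text and the function names are assumptions made for this example; only user_is_local comes from our data. Note that a review that genuinely ends with "more" would also be flagged, so the flag only marks candidates for a closer look.

```python
import re

# Matches text whose last word is "more" (optionally followed by punctuation),
# but not words that merely end in "more", such as "anymore".
TRUNCATION_PATTERN = re.compile(r"\bmore\W*$", re.IGNORECASE)


def looks_truncated(text):
    """Return True if the review text ends with the word "more"."""
    return bool(TRUNCATION_PATTERN.search(text))


def filter_reviews(rows):
    """Omit rows without a usable user_is_local value and flag possible truncation.

    rows is a list of dicts. "review_text" is a hypothetical column name;
    "user_is_local" is the variable described above.
    Returns (kept_rows, number_of_rows_omitted).
    """
    kept = []
    n_missing_local = 0
    for row in rows:
        # Anything other than a true/false (or 1/0) value counts as missing.
        if row.get("user_is_local") not in (True, False):
            n_missing_local += 1
            continue
        flagged = dict(row)  # copy, so the input rows are left unchanged
        flagged["text_truncated"] = looks_truncated(row.get("review_text", ""))
        kept.append(flagged)
    return kept, n_missing_local


if __name__ == "__main__":
    sample = [
        {"review_text": "Loved the rooftop bar and ... more", "user_is_local": True},
        {"review_text": "Not coming here anymore.", "user_is_local": False},
        {"review_text": "Fine.", "user_is_local": None},
    ]
    kept, omitted = filter_reviews(sample)
    print(len(kept), "kept,", omitted, "omitted")
```

Flagging rather than dropping the truncated rows keeps the decision about how to handle them separate from the check itself.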

