Our code, datasets, and just about everything else can be found here.
Non-random sample: We acknowledge that this is not a random sample, because it covers only Yelp and TripAdvisor reviews. On Yelp, for instance, certain reviews are filtered out because they are deemed unhelpful. For TripAdvisor, we had to remove reviews that did not have locations.
Incomplete user review text: Initially, the review text for TripAdvisor was cut off. Some review text ended with the word "more", so the recorded review length was not accurate for all TripAdvisor data.
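A check for this kind of truncation can be sketched as follows. This is only an illustration, not the code used in the project: the helper is_truncated, the regular expression, and the sample reviews are all hypothetical, and the sketch assumes that a cut-off review ends with the bare word "more" followed by nothing but optional whitespace.

```python
import re

# Assumption: a truncated review ends with the word "more", optionally
# followed by whitespace. This pattern is illustrative only.
TRUNCATION_PATTERN = re.compile(r"\bmore\s*$", re.IGNORECASE)


def is_truncated(review_text: str) -> bool:
    """Return True if the text looks like it was cut off at a "more" link."""
    return bool(TRUNCATION_PATTERN.search(review_text))


# Hypothetical sample reviews.
reviews = [
    "Great food and friendly staff.",
    "We arrived late and the kitchen was closing, but the waiter... more",
    "I could not have asked for more.",
]

flags = [is_truncated(text) for text in reviews]
```

A heuristic like this would also flag a complete review that genuinely ends with the word "more" and no punctuation, so flagged rows are better treated as candidates for re-collection than as confirmed truncations.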
Single reviews across multiple rows: We noticed that a small number of user reviews (fewer than 75) in the Yelp dataset were broken up into multiple rows. This was likely caused by a newline or EOF character in the review text. Because this was very rare, these reviews were skipped. For example, in yelp_dc_2_cleaned_features, observations 5418 and 5419 both correspond to the same review.
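One way to spot such split reviews is to compare each row's field count with the header, as in the minimal sketch below. The column names and sample rows are invented for illustration, and the sketch assumes the stray newline falls in a column other than the last one, so that both fragments end up with too few fields. It is not the procedure actually used on the Yelp data.

```python
import csv
import io

# Hypothetical sample: the second review contains an unquoted newline,
# so it spills over two physical rows.
raw = (
    "review_id,review_text,user_is_local\n"
    "1,Nice place,True\n"
    "2,Started well\n"
    "but ended badly,False\n"
    "3,Would return,True\n"
)

reader = csv.reader(io.StringIO(raw))
header = next(reader)

kept, skipped = [], []
for row in reader:
    # A row whose field count differs from the header is treated as a
    # fragment of a broken review and is skipped.
    if len(row) == len(header):
        kept.append(row)
    else:
        skipped.append(row)
```

If the newline sits in the last column, the first fragment still has the full number of fields and this check alone would miss it, so a count of skipped rows is only a lower bound on the affected rows.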
Missing location data: As noted above, many TripAdvisor user reviews do not have a location (or have an unrecognized location) and are therefore missing data for the user_is_local variable. We omitted these observations because our focus was on the differences between local and non-local reviewers.
Faulty local/non-local data: A small number (1,627) of Yelp reviews had faulty local/non-local data and were therefore omitted.
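The omission of reviews with missing or faulty user_is_local values could look roughly like the sketch below. Only the variable name user_is_local comes from the data described above. The helper parse_is_local, the set of accepted values, and the sample records are assumptions made for illustration, since what counted as "faulty" in the Yelp data is not specified here.

```python
# Assumption: only these string values count as valid local/non-local flags.
VALID = {"true": True, "false": False}


def parse_is_local(value):
    """Return True/False for a recognized value, or None if missing or faulty."""
    if value is None:
        return None
    return VALID.get(str(value).strip().lower())


# Hypothetical records; review 2 and 5 are missing the flag, review 4 is faulty.
reviews = [
    {"review_id": 1, "user_is_local": "True"},
    {"review_id": 2, "user_is_local": ""},
    {"review_id": 3, "user_is_local": "false"},
    {"review_id": 4, "user_is_local": "N/A"},
    {"review_id": 5, "user_is_local": None},
]

usable = []
for review in reviews:
    is_local = parse_is_local(review["user_is_local"])
    if is_local is not None:
        # Keep the review, storing the flag as a real boolean.
        usable.append({**review, "user_is_local": is_local})

omitted = len(reviews) - len(usable)
```

Tracking the omitted count alongside the filtered data makes it easy to report how many observations each cleaning step removed.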