Analyzed by Bryan Chang
Nov, 2018
Introduction on Yelp Open Dataset
The Yelp Open Dataset is a subset of Yelp's businesses, reviews, and user data for use in personal, educational, and academic purposes.
The dataset is presented as JSON files, which contain 5,996,996 reviews,
188,593 businesses, 280,992 pictures and so on.
Detailed explanations and the download link are available on the
Yelp Open Dataset website.
The dataset is partitioned into several parts, such as business.json, review.json, user.json, photo.json, etc.
Our research focus is on the business dataset (business.json) and the user dataset (user.json).
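As a minimal sketch of how one of these files can be read, assuming the one-JSON-object-per-line layout that Yelp documents for the dataset, the standard library is enough. The helper name `read_json_lines` and the two sample records are made up for illustration; in practice the function would receive an open file handle on business.json.

```python
import io
import json


def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for lines of business.json.
sample = io.StringIO(
    '{"business_id": "aaaaaaaaaaaaaaaaaaaaaa", "stars": 4.5}\n'
    '{"business_id": "bbbbbbbbbbbbbbbbbbbbbb", "stars": 3.0}\n'
)
businesses = list(read_json_lines(sample))
print(len(businesses), businesses[0]["stars"])
```

Reading line by line keeps memory use low, which matters for the larger files such as review.json.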
Accountability of the Dataset
Source: Where did this data come from?
This dataset was collected by Yelp, and is a subset of its businesses, reviews, and user data.
Accessibility: Who has access to the dataset?
This dataset is publicly available, i.e., accessible to everyone.
Anyone can download, view, and use it, as long as the use is for personal, educational, or academic purposes.
The download link is on the dataset page mentioned in the introduction.
Analysis on Business Dataset
Components of the Dataset
The dataset is named “yelp_academic_dataset_business.json”, which contains the following information:
- business’s ID
- business’s name
- business’s neighborhood’s name
- business’s full address, including street, city and state
- business’s postal code
- business’s latitude and longitude
- business’s star rating
- business’s number of reviews
- whether the business is open or closed
- business’s attributes, e.g., whether it offers parking, etc.
- business’s categories, e.g., “Mexican”, “Burgers”, etc.
- business’s working hours
Inspection on Components
- Business ID
-
All of the values are non-empty.
Most IDs are strings of 22 characters.
However,
some rows have the ID “#NAME?”. We believe this is an error, so those rows are removed.
- Business’s Name
-
All of the values are non-empty.
Most of the names look normal.
However, several names contain strange characters, e.g.,
“H么tel Auberge Montréal Espace Confort”, “L'am猫re 脌 Boire”,
“Crémy P芒tisserie”. We believe these names contain characters with diacritics
that were decoded incorrectly during data collection or processing.
- Business’s Neighborhood’s Name
-
Many rows are missing this attribute.
In addition, several names contain strange characters, such as “Notre-Dame-de-Gr芒ce”.
We believe this can also be attributed to incorrect decoding.
- Business’s Address
-
Many rows are missing this attribute.
Some rows also contain strange characters.
- Business’s City
-
All of the values are non-empty.
Some of the values were decoded incorrectly.
- Business’s State
-
All of the values are valid.
- Business’s Postal Code
-
Some rows are missing this attribute.
The remaining values are valid.
Note that the file contains both US and Canadian postal codes.
- Business’s Latitude and Longitude
-
Some rows are missing this attribute.
The remaining values are valid, as verified with the aid of Google Maps.
- Business’s Stars
-
The following chart shows the distribution of star ratings.
The X-axis is the star rating.
The Y-axis is the number of rows with that rating.
From the chart, we can see that all the values are valid.
The values range from 1.0 to 5.0 in steps of 0.5.
- Business’s Number of Reviews
-
All of the values are valid, i.e., non-negative.
- Business's status (Open or Closed)
-
All of the values are valid.
The value is either 0, meaning the business is closed, or 1, meaning the business is open.
- Business's Attributes
-
Some rows are missing this attribute.
Where values are present, they are all valid.
However, the format varies from row to row. For instance,
some businesses state explicitly whether they provide bike parking, while the rest
say nothing about it. Likewise, some businesses state whether they accept credit
cards, while the rest say nothing about credit cards. What does an omission mean?
Does the business not provide the service, or did it fail to report it?
This is not clear.
Therefore, this attribute is somewhat incoherent.
- Business's Categories
-
Some rows are missing this attribute.
The remaining values are valid.
- Business's Working Hours
-
Some rows are missing this attribute.
The remaining values are valid.
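The checks described in this inspection can be sketched as a single pass over the rows. This is an illustration only, not the exact procedure we ran: the field names (`business_id`, `stars`, `review_count`, `is_open`, `postal_code`) follow Yelp's published schema as we understand it, the function name `inspect_businesses` is hypothetical, and the three sample rows are made up.

```python
from collections import Counter

# Star ratings allowed by the text: 1.0 to 5.0 in steps of 0.5.
VALID_STARS = {half / 2 for half in range(2, 11)}


def is_missing(value):
    """Treat None and empty strings or containers as missing."""
    return value is None or value == "" or value == {} or value == []


def inspect_businesses(rows, fields):
    """Count missing and invalid values; drop rows with a malformed ID."""
    missing = Counter()
    invalid = Counter()
    kept = []
    for row in rows:
        for field in fields:
            if is_missing(row.get(field)):
                missing[field] += 1
        if row.get("stars") not in VALID_STARS:
            invalid["stars"] += 1
        if row.get("is_open") not in (0, 1):
            invalid["is_open"] += 1
        count = row.get("review_count")
        if not isinstance(count, int) or count < 0:
            invalid["review_count"] += 1
        business_id = row.get("business_id")
        if not isinstance(business_id, str) or len(business_id) != 22:
            invalid["business_id"] += 1
            continue  # rows with a malformed ID are removed, as in the text
        kept.append(row)
    return kept, missing, invalid


# Made-up rows: one has the "#NAME?" ID, two lack a postal code.
rows = [
    {"business_id": "a" * 22, "stars": 4.5, "review_count": 10,
     "is_open": 1, "postal_code": "89101"},
    {"business_id": "#NAME?", "stars": 3.0, "review_count": 5,
     "is_open": 0, "postal_code": ""},
    {"business_id": "b" * 22, "stars": 2.0, "review_count": 3,
     "is_open": 1, "postal_code": None},
]
kept, missing, invalid = inspect_businesses(rows, ["business_id", "postal_code"])
print(len(kept), dict(missing), dict(invalid))
```

Only malformed IDs cause a row to be dropped here; other problems are counted so that the decision to discard can be made separately.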
Quality of the Dataset
- Is the Data Complete?
-
No, the data are not complete.
From the analysis above, we know that there are missing values in many fields, such as neighborhood, address and postal code, and that some business IDs are invalid (“#NAME?”).
The concern is that we cannot verify the identity of a business whose ID is invalid.
Such rows may therefore have to be discarded, which may introduce bias.
- Is the Data Coherent?
-
No, the data are not coherent.
From the aforementioned analysis, we know that there are inaccurately decoded values in name, neighborhood, address and city. Moreover, the format for attributes is not coherent.
- Is the Data Correct?
-
Yes, the data are mostly correct.
A minor concern is that if we discard invalid rows, the dataset may become biased.
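Regarding the decoding problem in name, neighborhood, address and city: the garbled samples quoted earlier (for example “H么tel”) look like UTF-8 bytes that were decoded as GBK at some point. That is our assumption, not something the dataset documents. Under that assumption, a character-by-character repair can be sketched as follows; the function names are hypothetical.

```python
def repair_char(ch):
    """Undo one UTF-8-read-as-GBK character, or return it unchanged."""
    if ord(ch) < 0x2E80:
        # ASCII and ordinary accented Latin letters are left alone.
        return ch
    try:
        return ch.encode("gbk").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return ch


def repair_text(text):
    """Apply the per-character repair to a whole string."""
    return "".join(repair_char(ch) for ch in text)


# Simulate the suspected mis-decoding, then reverse it.
garbled = "Hôtel".encode("utf-8").decode("gbk")
restored = repair_text(garbled)
print(restored == "Hôtel")
```

This only handles two-byte UTF-8 sequences, which covers the accented Latin letters seen in the samples, and it would also rewrite genuine Chinese characters that happen to match the pattern. It should therefore be applied only to rows already known to be affected.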
Analysis on User Dataset
Components of the Dataset
The dataset is named “yelp_academic_dataset_user.json”, which contains the following information:
- user id
- user's name
- the number of reviews a user has written
- when the user joined Yelp
- user's friends
- number of useful votes sent by the user
- number of funny votes sent by the user
- number of cool votes sent by the user
- number of fans the user has
- the years the user was elite
- average rating of all the user's reviews
- number of hot compliments received by the user
- number of more compliments received by the user
- number of profile compliments received by the user
- number of cute compliments received by the user
- number of list compliments received by the user
- number of note compliments received by the user
- number of plain compliments received by the user
- number of cool compliments received by the user
- number of funny compliments received by the user
- number of writer compliments received by the user
- number of photo compliments received by the user
Inspection on Components
A similar inspection of the user dataset shows that its quality is much higher than that of the business dataset. Few concerns arose when we explored it.
Since the process is almost identical to what we did for the business dataset, we omit the specific details for brevity.
Quality of the Dataset
- Is the Data Complete?
-
Yes, it is complete.
- Is the Data Coherent?
-
Yes, it is coherent.
- Is the Data Correct?
-
Yes, it is correct.
Potential Research Questions
Location for Restaurants
This question is expected to be answered with the business dataset.
There are many different types of restaurants in different places. We would like to know which places are best
for running a restaurant, and which type of restaurant is most likely to be popular in a given region.
Probability theory and estimation theory are expected to be the foundation of our analysis.
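One possible starting point, offered as a sketch and not as our final method, is to estimate the mean star rating for each (city, category) pair and shrink it toward the global mean, so that pairs with very few businesses are not over-trusted. The `prior_weight` pseudo-count, the function name, the assumption that `categories` is a list, and the sample rows are all illustrative.

```python
from collections import defaultdict


def shrunk_mean_stars(rows, prior_weight=5.0):
    """Estimate mean stars per (city, category), shrunk toward the global mean."""
    global_mean = sum(r["stars"] for r in rows) / len(rows)
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of stars, count]
    for r in rows:
        for category in r["categories"]:
            entry = totals[(r["city"], category)]
            entry[0] += r["stars"]
            entry[1] += 1
    # Posterior-mean style estimate: prior_weight pseudo-observations
    # at the global mean are blended with the observed ratings.
    return {
        key: (star_sum + prior_weight * global_mean) / (count + prior_weight)
        for key, (star_sum, count) in totals.items()
    }


# Made-up rows for illustration.
rows = [
    {"city": "Las Vegas", "categories": ["Mexican"], "stars": 4.5},
    {"city": "Las Vegas", "categories": ["Mexican"], "stars": 4.0},
    {"city": "Las Vegas", "categories": ["Burgers"], "stars": 2.0},
    {"city": "Toronto", "categories": ["Mexican"], "stars": 3.5},
]
estimates = shrunk_mean_stars(rows)
best = max(estimates, key=estimates.get)
print(best)
```

A fuller analysis would probably also weight by the number of reviews and account for whether the business is still open.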
Elite Users
This question is expected to be answered with the user dataset.
Some users are quite popular on Yelp. They have received many comments and stars for their valuable and useful reviews of food, shops, or fashion. These users are defined as Elites.
We would like to know how many types of Elite users there are, and how we can classify them reasonably.
Statistical methods, such as a Gaussian mixture model, may be adopted to achieve our goal.
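To make the idea concrete, the following is a minimal sketch of fitting a one-dimensional Gaussian mixture with the EM algorithm in plain Python. It stands in for a library implementation and is not a finished pipeline. The input values are made up; in practice they might be a per-user feature such as a log-scaled fan count, which is an assumption on our part.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, k=2, iterations=50):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    data = sorted(data)
    n = len(data)
    # Initialise means at evenly spaced quantiles, with a shared variance
    # and equal weights.
    means = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(data) / n
    var0 = sum((x - overall_mean) ** 2 for x in data) / n or 1.0
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, 1e-6)  # floor avoids a zero variance
    return weights, means, variances


# Made-up feature values forming two obvious groups.
values = [0.8, 1.0, 1.2, 0.9, 1.1, 9.8, 10.0, 10.2, 9.9, 10.1]
weights, means, variances = fit_gmm_1d(values, k=2)
print([round(m, 2) for m in means])
```

The number of components `k` corresponds to the number of Elite user types. Choosing it would require a model-selection criterion, which this sketch does not cover.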