Analyzed by Bryan Chang

Nov, 2018

Introduction on Yelp Open Dataset

The Yelp Open Dataset is a subset of Yelp's business, review, and user data, released for personal, educational, and academic use.
The dataset is distributed as JSON files containing 5,996,996 reviews, 188,593 businesses, 280,992 pictures, and more.
For detailed explanations and the download link for Yelp Open Dataset, please click here.

The dataset is split into several files, such as business.json, review.json, user.json, and photo.json.
Our research focuses on the business dataset (business.json) and the user dataset (user.json).
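
As a minimal sketch of how these files might be read, assuming each file stores one JSON object per line (that layout is an assumption here, not something stated above). The helper name `iter_records`, the file path, and the key names in the usage comment are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-blank line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Usage (the path is hypothetical):
# with open("business.json", encoding="utf-8") as f:
#     businesses = list(iter_records(f))
```

Reading line by line keeps memory use low, which matters for the review file with its millions of records.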

Accountability of the Dataset

Source: Where did this data come from?

This dataset was collected by Yelp, and is a subset of its businesses, reviews, and user data.

Accessibility: Who has access to the dataset?

This dataset is publicly available, i.e., accessible to everyone. Anyone can download, view, and use it, as long as the use is for personal, educational, or academic purposes.
The download link can be found on the page linked above.

Analysis on Business Dataset

Components of the Dataset

The dataset is named “yelp_academic_dataset_business.json”, which contains the following information:

business’s ID
business’s name
business’s neighborhood’s name
business’s full address, including street, city and state
business’s postal code
business’s latitude and longitude
business’s star rating
business’s number of reviews
whether the business is open or closed
business’s attributes, e.g., whether it offers parking, etc.
business’s categories, e.g., “Mexican”, “Burgers”, etc.
business’s working hours

Inspection on Components

Business ID

All of the values are non-empty.
Most IDs are strings of 22 characters.
However, some rows have the ID “#NAME?”. We believe this is an error, so those rows were removed.
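
The removal step could be sketched as follows. This version treats the 22-character length as the validity rule, which is stricter than dropping only “#NAME?”. The function names and the `business_id` key are assumptions made for illustration.

```python
def is_valid_business_id(business_id: object) -> bool:
    """Return True for a 22-character string; '#NAME?' and other shapes fail."""
    return isinstance(business_id, str) and len(business_id) == 22


def drop_invalid_ids(rows, key="business_id"):
    """Keep only rows whose ID passes the 22-character check (key name is assumed)."""
    return [row for row in rows if is_valid_business_id(row.get(key))]
```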

Business’s Name

All of the values are non-empty.
Most of the names look normal.
However, several names contain strange characters, e.g., “H么tel Auberge Montréal Espace Confort”, “L'am猫re 脌 Boire”, “Crémy P芒tisserie”. We believe these names originally contained letters with diacritics that were decoded incorrectly during data collection and processing.
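
One way to probe the decoding hypothesis is to attempt a per-character repair. The sketch below assumes the damaged characters are two-byte UTF-8 sequences that were read as GBK. That specific encoding pair is a guess, not something established above, and the function names are hypothetical. Characters that do not fit the pattern are left untouched, so correctly decoded letters such as “é” survive.

```python
def repair_char(ch: str) -> str:
    """Try to undo a UTF-8-read-as-GBK mix-up for a single character."""
    if ord(ch) < 0x80:
        return ch  # plain ASCII is never affected
    try:
        fixed = ch.encode("gbk").decode("utf-8")
    except UnicodeError:
        return ch  # not encodable as GBK, or bytes are not valid UTF-8
    return fixed if len(fixed) == 1 else ch


def repair_text(text: str) -> str:
    """Apply the per-character repair to a whole string."""
    return "".join(repair_char(ch) for ch in text)
```

This is a heuristic: a genuine Chinese character whose GBK bytes happen to form valid UTF-8 would also be rewritten, so repaired names should be reviewed rather than trusted blindly.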

Business’s Neighborhood’s Name

Many rows are missing this attribute.
In addition, several names contain strange characters, such as “Notre-Dame-de-Gr芒ce”. We believe this can also be attributed to incorrect decoding.

Business’s Address

Many rows are missing this attribute.
Some rows also contain strange characters.

Business’s City

All of the values are non-empty.
Some of the values were decoded incorrectly.

Business’s State

All of the values are valid.

Business’s Postal Code

Some rows are missing this attribute.
The remaining values are valid.
Note that the file contains both US and Canadian postal codes.
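
A rough way to separate the two formats is sketched below, assuming US codes are five digits (optionally ZIP+4) and Canadian codes follow the letter-digit-letter digit-letter-digit pattern. The function name and the "other" bucket are illustrative choices.

```python
import re

US_ZIP = re.compile(r"\d{5}(-\d{4})?")
CA_POSTAL = re.compile(r"[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d")


def classify_postal_code(code):
    """Return 'US', 'CA', 'missing', or 'other' for a postal code value."""
    if code is None or not str(code).strip():
        return "missing"
    text = str(code).strip()
    if US_ZIP.fullmatch(text):
        return "US"
    if CA_POSTAL.fullmatch(text):
        return "CA"
    return "other"
```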

Business’s Latitude and Longitude

Some rows are missing this attribute.
The remaining values are valid, which we verified with the aid of Google Maps.
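
A simple range check can complement the manual verification. It only confirms that coordinates are present and geographically possible, not that they point at the right business. The key names `latitude` and `longitude` are assumed.

```python
def has_valid_coordinates(row, lat_key="latitude", lon_key="longitude"):
    """True when both coordinates are numbers inside the geographic range."""
    lat, lon = row.get(lat_key), row.get(lon_key)
    if not isinstance(lat, (int, float)) or not isinstance(lon, (int, float)):
        return False  # missing or non-numeric values fail the check
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
```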

Business’s Stars

The following chart demonstrates the distribution of different values for stars.
The X-axis is the stars.
The Y-axis is the number of rows that take a particular value.

From the chart, we can see that all the values are valid: they range from 1.0 to 5.0 in steps of 0.5.
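
The counts behind such a chart, together with the validity check, could be computed along these lines. The `stars` key and the function name are assumptions. This is a sketch, not the code used to produce the chart.

```python
from collections import Counter

VALID_STARS = {x / 2 for x in range(2, 11)}  # 1.0, 1.5, ..., 5.0


def star_distribution(rows, key="stars"):
    """Count rows per star value and list any value outside the 0.5 grid."""
    counts = Counter(row.get(key) for row in rows)
    invalid = [value for value in counts if value not in VALID_STARS]
    return counts, invalid
```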

Business’s Number of Reviews

All of the values are valid, namely non-negative.

Business's status (Open or Closed)

All of the values are valid.
The value is either 0, meaning the business is closed, or 1, meaning the business is open.

Business's Attributes

Some rows are missing this attribute.
Where values are present, they are all valid.
However, the set of reported attributes varies from row to row. For instance, some businesses explicitly state whether they provide bike parking, while others say nothing about it. Likewise, some businesses state whether they accept credit cards, while others do not mention credit cards at all. When an attribute is absent, it is not clear whether the business does not offer the service or simply did not report it.
Therefore, this attribute is somewhat incoherent.
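
To quantify this ambiguity, one could tally each attribute as reported or unreported, keeping "unreported" separate from an explicit negative. In the sketch below, the `attributes` key and the example attribute name `BikeParking` are assumptions about the file's layout.

```python
from collections import Counter


def attribute_coverage(rows, name, key="attributes"):
    """Tally reported values of one attribute, with 'unreported' kept separate."""
    tally = Counter()
    for row in rows:
        attrs = row.get(key) or {}  # a missing attribute block counts as empty
        if name not in attrs:
            tally["unreported"] += 1
        else:
            tally[str(attrs[name])] += 1
    return tally


# Usage (attribute name is hypothetical):
# attribute_coverage(businesses, "BikeParking")
```

A large "unreported" share would suggest treating the attribute as three-valued rather than boolean in later analysis.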

Business's Categories

Some rows are missing this attribute.
The remaining values are valid.

Business's Working Hours

Some rows are missing this attribute.
The remaining values are valid.

Quality of the Dataset

Is the Data Complete?

No, the data are not complete.
From the analysis above, we know that many fields have missing values, such as neighborhood, address, and postal code, and that some rows carry an invalid business ID (“#NAME?”). For rows with an invalid ID, we cannot verify the identity of the business, so these rows may have to be discarded, which may introduce bias.
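
Completeness can be summarized by counting missing values per field. The sketch below assumes a value is missing when the key is absent or the value is None, an empty string, an empty dict, or an empty list. A value of 0 (for example, a closed business) is not counted as missing. The function name and field names are illustrative.

```python
def missing_counts(rows, fields):
    """Count, per field, the rows where the value is absent, None, or empty."""
    counts = {field: 0 for field in fields}
    for row in rows:
        for field in fields:
            value = row.get(field)
            if value is None or value == "" or value == {} or value == []:
                counts[field] += 1
    return counts
```

Dividing each count by the number of rows gives a missing rate, which helps judge how much bias discarding incomplete rows might introduce.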

Is the Data Coherent?

No, the data are not coherent.
From the aforementioned analysis, we know that there are inaccurately decoded values in name, neighborhood, address and city. Moreover, the format for attributes is not coherent.

Is the Data Correct?

Yes, the data are mostly correct.
A minor concern is that if we discard invalid rows, the dataset may become biased.

Analysis on User Dataset

Components of the Dataset

The dataset is named “yelp_academic_dataset_user.json”, which contains the following information:

user id
user's name
the number of reviews a user has written
when the user joined Yelp
user's friends
number of useful votes sent by the user
number of funny votes sent by the user
number of cool votes sent by the user
number of fans the user has
the years the user was elite
average rating of all the user's reviews
number of hot compliments received by the user
number of more compliments received by the user
number of profile compliments received by the user
number of cute compliments received by the user
number of list compliments received by the user
number of note compliments received by the user
number of plain compliments received by the user
number of cool compliments received by the user
number of funny compliments received by the user
number of writer compliments received by the user
number of photo compliments received by the user

Inspection on Components

We conducted a similar inspection of the user dataset, which shows that its quality is much higher than that of the business dataset. Few concerns arose while we explored it.
Since the process is almost identical to the one used for the business dataset, we omit the details for brevity.

Quality of the Dataset

Is the Data Complete?

Yes, it is complete.

Is the Data Coherent?

Yes, it is coherent.

Is the Data Correct?

Yes, it is correct.

Potential Research Questions

Location for Restaurants

This question is expected to be answered with the business dataset.
There are many different types of restaurants in different places. We would like to know where the best place to run a restaurant is, and what type of restaurant is most likely to be popular in a given region.
Probability theory and estimation theory are expected to be the foundation of our analysis.
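
As one possible starting point, and not a settled method, the share of highly rated businesses per city and category can be estimated with add-alpha smoothing. The sketch assumes "popular" means a star rating of at least 4.0, and that `categories` is either a list or a comma-separated string. The threshold and all key names are assumptions.

```python
from collections import defaultdict


def smoothed_popularity(rows, min_stars=4.0, alpha=1.0):
    """Estimate P(stars >= min_stars) per (city, category) with add-alpha smoothing."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        cats = row.get("categories") or []
        if isinstance(cats, str):  # accept "Mexican, Burgers" style strings
            cats = [c.strip() for c in cats.split(",") if c.strip()]
        stars = row.get("stars")
        for category in cats:
            key = (row.get("city"), category)
            totals[key] += 1
            if stars is not None and stars >= min_stars:
                hits[key] += 1
    # Smoothing keeps groups with very few businesses away from 0 or 1.
    return {k: (hits[k] + alpha) / (totals[k] + 2 * alpha) for k in totals}
```

The smoothing matters because many city and category pairs are likely to contain only a handful of businesses, where a raw proportion would be unreliable.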

Elite Users

This question is expected to be answered with the user dataset.
Some users are quite popular on Yelp: they have received many comments and stars for their valuable and useful reviews of food, shops, or fashion. These users are defined as Elites.
We would like to know how many types of Elite users there are, and how we can classify them reasonably.
Statistical methods, such as the Gaussian Mixture Model, may be adopted to achieve our goal.
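
To illustrate the idea, the sketch below fits a one-dimensional Gaussian mixture with plain expectation-maximization (EM). It is hand-rolled to stay self-contained. In practice a library implementation such as scikit-learn's `GaussianMixture` would be the more usual choice. Which user feature to cluster (for example, the number of fans) and the number of components are open choices, not decisions made above.

```python
import math


def fit_gmm_1d(values, k=2, iterations=50):
    """Fit a k-component 1-D Gaussian mixture with plain EM.

    Returns a list of (weight, mean, variance) tuples sorted by mean.
    """
    xs = sorted(values)
    n = len(xs)
    # Initialise means at evenly spaced quantiles, with a shared variance.
    means = [xs[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(xs) / n
    var0 = max(sum((x - overall_mean) ** 2 for x in xs) / n, 1e-6)
    variances = [var0] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                1e-6,  # variance floor avoids division by zero
            )
    return sorted(zip(weights, means, variances), key=lambda t: t[1])
```

The number of components `k` would itself have to be chosen, for instance by comparing fits for several values, since the number of Elite user types is the open question.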
