Homework-2-1 Clustering Yelp Restaurants


5/5 - (2 votes)


Clustering Yelp Restaurants

In [ ]: print(’Your fitst name + last name’)
In this assignment, we will be working with the Yelp dataset. You can find the format of the dataset
From the Business Objects, let’s try to find culinary districts in Las Vegas. These are characterized
by closeness and similarity of restaurants. Use the “longitude” and “latitude” to cluster closeness. Use
“categories” to cluster for similarity. You may want to use only a subset (15-20) of popular categories.
Note that the spatial coordinates and restaurant categories have different units of scale. Your results
could be arbitrarily skewed if you don’t incorporate some scaling.
Find clusters using the 3 different techniques we discussed in class: k-means++, hierarchical, and GMM.
Explain your data representation and how you determined certain parameters (for example, the number of
clusters in k-means++). (30 pts)
In [ ]:
Visualize the clusters by plotting the longitude/latitude of the restaurants in a scatter plot. Label each
cluster with a category. In a markdown, explain how labels are assigned. (10 pts)
Note that some categories are inherently more common (e.g. “pizza”). When labeling your clusters,
you want to avoid the scenario where all clusters are labeled as “pizza” simply because of the uniformly
large number of these restaurants across all clusters. In other words, we don’t want to point out that pizza
restaurants are pretty much evenly distributed in high quantities everywhere, but rather discover when they,
or another type of restaurant, appear in notably high quantities.
In [ ]:
Now let’s detect outliers. These are the ones who are the farthest from the centroids of their clusters.
Track them down and describe any interesting observations that you can make. (10 pts)
In [ ]:
Give a detailed analysis comparing the results you obtained from the 3 techniques. (10 pts)
In [ ]:

PlaceholderHomework-2-1 Clustering Yelp Restaurants
Open chat
Need help?
Can we help?