Language: Python
Skill: Data Preparation, Data Visualization, Jupyter Notebook, REST API, K-Means Clustering
This project walks through the day-to-day activities of a data science engineer: preparing real-world datasets, visualizing the data, running machine learning algorithms, and presenting the results, all in Python within a Jupyter Notebook environment.
In the fast-paced, effort-intensive atmosphere that the average person lives in, it's common to be too fatigued to prepare a home-cooked supper. Of course, even if one eats home-cooked meals every day, it is not uncommon to desire to go out for a decent dinner for social/recreational reasons every now and again.
Consider a student who has recently relocated to a new location. They already have particular tastes and interests. If the student lived close to their favourite food providers and amenities, it would save both the student and the providers a lot of time and effort. Convenience translates to higher sales for the providers and less time spent by the customer.
This project uses K-Means Clustering to find the best neighbourhoods for incoming students in Waterloo, classifying neighbourhoods according to the students' preferences for amenities and proximity to them.
Clustering is the process of grouping elements so that observations from the same group are more similar than those from different groups. Geolocational analysis is the process of applying geographic models to satellite photos, GPS locations, and street addresses.
The dataset of students: https://www.kaggle.com/borapajo/food-choices
The dataset of neighbourhoods: https://opendata-city-of-waterloo.opendata.arcgis.com/datasets/RMW::census-neighbourhoods-2016/explore?location=43.476769%2C-80.527100%2C10.86&showTable=true
Before moving to the data analysis phase, I need to fetch the data and set up the environment. I imported the data from the food-coded.csv file into the project for further visualization and interpretation.
The dataset contains many columns, but I extracted only the few that are relevant.
Column | Meaning | Content |
---|---|---|
cook | how often do you cook? | 1 - Every day 2 - A couple of times a week 3 - Whenever I can, but that is not very often 4 - I only help a little during holidays 5 - Never |
eating_out | frequency of eating out in a typical week? | 1 - Never 2 - 1-2 times 3 - 2-3 times 4 - 3-5 times 5 - every day |
employment | do you work? | 1 - yes full time 2 - yes part time 3 - no 4 - other |
exercise | how often do you exercise in a regular week? | 1 - Every day 2 - Twice or three times per week 3 - Once a week 4 - Sometimes 5 - Never |
income | can afford more costly apartments? | 1 - less than $15,000 2 - $15,001 to $30,000 3 - $30,001 to $50,000 4 - $50,001 to $70,000 5 - $70,001 to $100,000 6 - higher than $100,000 |
sports | do you do any sporting activity? | 1 - Yes 2 - No 99 - No answer |
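The column selection described above can be sketched as follows, assuming pandas is used for loading. The notebook would read food-coded.csv; here a tiny inline CSV with made-up values stands in so the snippet runs on its own, and `extra_column` is a hypothetical column that gets dropped.

```python
import io

import pandas as pd

# Columns kept for the analysis, as listed in the table above.
RELEVANT_COLUMNS = ["cook", "eating_out", "employment", "exercise", "income", "sports"]


def load_student_data(source):
    """Read the survey CSV and keep only the relevant columns."""
    return pd.read_csv(source, usecols=RELEVANT_COLUMNS)


# Stand-in for food-coded.csv: illustrative values only, not real survey data.
sample_csv = io.StringIO(
    "cook,eating_out,employment,exercise,income,sports,extra_column\n"
    "2,3,3,1,5,1,foo\n"
    "3,2,2,2,4,2,bar\n"
)
students = load_student_data(sample_csv)
print(students.head())
```

In the notebook, `load_student_data("food-coded.csv")` would replace the inline sample.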
Data Cleaning:
Data cleaning is the process of removing or correcting data that is incorrect, incomplete, missing, or duplicated. It improves the quality of the data and of any analysis built on it.
There are many ways to clean data, but I have chosen to drop the rows with missing values.
dropna() is used to delete rows or columns that contain missing (NA) values. Syntax: dataframe.dropna()
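A minimal sketch of dropping rows with missing values is given here. The values are made up for illustration; only the column names come from the table above.

```python
import numpy as np
import pandas as pd

# Illustrative values only; NaN marks a missing answer.
raw = pd.DataFrame({
    "cook": [2, np.nan, 3, 1],
    "eating_out": [3, 2, np.nan, 4],
    "income": [5, 4, 2, 6],
})

# Drop every row that has at least one missing value
# (the defaults are axis=0 and how="any").
cleaned = raw.dropna().reset_index(drop=True)

print(len(raw), "rows before,", len(cleaned), "rows after")
```

Passing `axis=1` would drop columns instead of rows, which is usually too aggressive when only a few answers are missing.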
K-means clustering is a popular unsupervised machine learning technique. It finds the data points that are similar or closest to each other and separates them into different clusters.
I plotted the distortion against k, the number of clusters. Based on the diagram, I chose k = 4 and ran K-means clustering on all the columns.
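The distortion-versus-k computation can be sketched like this, assuming scikit-learn's KMeans. Here "distortion" is taken to be the model's `inertia_` (the sum of squared distances from each point to its nearest centre), which is an assumption about how the notebook measured it. The data is synthetic, and `distortions_for_k` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans


def distortions_for_k(X, k_values, random_state=0):
    """Return the K-means inertia for each candidate number of clusters."""
    scores = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        scores.append(model.inertia_)
    return scores


# Synthetic stand-in data: four blobs in six dimensions (one per survey column).
rng = np.random.default_rng(0)
centres = rng.uniform(1, 6, size=(4, 6))
X = np.vstack([c + rng.normal(scale=0.2, size=(30, 6)) for c in centres])

k_values = range(1, 9)
distortions = distortions_for_k(X, k_values)
for k, d in zip(k_values, distortions):
    print(k, round(d, 1))
```

Plotting `distortions` against `k_values` gives the elbow diagram; the k at which the curve stops dropping sharply is the candidate to pick.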
As the diagram shows, I got 4 clusters of students based on their income status and preference for eating out. I then drew a boxplot for each of these four clusters.
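One way to draw a boxplot per cluster is sketched here, assuming matplotlib and a DataFrame that already carries a cluster label column. The `cluster` column name, the `boxplots_by_cluster` helper, and the demo values are all illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def boxplots_by_cluster(df, label_col="cluster"):
    """Draw one panel per cluster, each with one box per feature column."""
    features = [c for c in df.columns if c != label_col]
    clusters = sorted(df[label_col].unique())
    fig, axes = plt.subplots(
        1, len(clusters), figsize=(4 * len(clusters), 4), sharey=True, squeeze=False
    )
    for ax, cluster in zip(axes[0], clusters):
        subset = df.loc[df[label_col] == cluster, features]
        ax.boxplot([subset[c].to_numpy() for c in features])
        ax.set_xticklabels(features, rotation=45)
        ax.set_title(f"Cluster-{cluster}")
    fig.tight_layout()
    return fig


# Made-up answers for two clusters, just to exercise the function.
demo = pd.DataFrame({
    "cook": [1, 2, 2, 4, 5, 4],
    "eating_out": [2, 3, 2, 4, 5, 5],
    "income": [5, 6, 5, 2, 1, 2],
    "cluster": [0, 0, 0, 1, 1, 1],
})
fig = boxplots_by_cluster(demo)
```

Sharing the y-axis across panels makes it easier to compare the same column between clusters, which is what the descriptions that follow rely on.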
In Cluster-0, students tend to have a high income. They rarely cook or eat out, and their interest in exercise and sports is high.
In Cluster-1, students sometimes cook or eat out. Their income is relatively low compared to the students in the other three clusters, and their interest in exercise and sports is average.
In Cluster-2, students tend to have a high income. They cook quite often and sometimes eat out, and their interest in exercise and sports is average.
In Cluster-3, students tend to have a high income. They rarely cook but eat out quite often, and their interest in exercise and sports is high.
First, I import the dataset of neighbourhoods in the Waterloo region.
The Regional Municipality of Waterloo is a metropolitan area of Southern Ontario, Canada. It contains the cities of Cambridge, Kitchener and Waterloo, and the townships of North Dumfries, Wellesley, Wilmot, and Woolwich. Kitchener, the largest city, is the seat of government (Regional_Municipality_of_Waterloo, 2008).
Then, I cleaned the data to extract the neighbourhood names from it.
Next, I used Nominatim to retrieve the latitude and longitude of each neighbourhood and added them to the dataset.
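The geocoding step might look like the sketch below, assuming Nominatim is accessed through the geopy library (the source does not say which client was used). The `neighbourhood` column name, the address suffix, the user agent string, and both helper names are hypothetical. The geocoder is passed in as a callable so the function can be tried without network access.

```python
import pandas as pd


def add_coordinates(df, geocode, name_col="neighbourhood", suffix=", Ontario, Canada"):
    """Return a copy of df with latitude and longitude columns filled by `geocode`.

    `geocode` is any callable that takes an address string and returns an object
    with .latitude and .longitude attributes, or None when nothing is found.
    """
    lats, lons = [], []
    for name in df[name_col]:
        location = geocode(name + suffix)
        lats.append(location.latitude if location is not None else None)
        lons.append(location.longitude if location is not None else None)
    return df.assign(latitude=lats, longitude=lons)


def make_nominatim_geocoder(user_agent="waterloo-neighbourhoods-demo"):
    """Build a rate-limited Nominatim geocoder (needs geopy and network access)."""
    # Imported lazily so add_coordinates can be used without geopy installed.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Nominatim's public service asks for at most about one request per second.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)
```

In the notebook this would be called as `add_coordinates(neighbourhoods, make_nominatim_geocoder())`. Names that Nominatim cannot resolve end up with missing coordinates and can be dropped or fixed by hand.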
To help these students make better choices when looking for accommodation, we also need more detailed geolocational data.
I created a free Foursquare account and set up the API credentials.
Using the Foursquare API, I added three more columns to the dataset: restaurants, fitness centres, and transport.
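A rough sketch of counting nearby venues for one neighbourhood follows. It assumes the legacy Foursquare v2 `venues/search` endpoint with `client_id`/`client_secret` credentials; Foursquare has changed its API since, so the endpoint, parameter names, and response shape should be checked against the current documentation. The category ID is a placeholder, not a real Foursquare ID, and all function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed legacy endpoint; verify against the current Foursquare documentation.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(lat, lon, category_id, client_id, client_secret,
                     radius=500, version="20210101", limit=50):
    """Build the request URL for venues of one category around a point."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date expected by the v2 API
        "ll": f"{lat},{lon}",
        "radius": radius,        # search radius in metres
        "categoryId": category_id,
        "limit": limit,
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def count_venues(payload):
    """Count the venues in a parsed response of the assumed v2 shape."""
    return len(payload.get("response", {}).get("venues", []))


def fetch_venue_count(lat, lon, category_id, client_id, client_secret):
    """Call the API (network access required) and return the venue count."""
    url = build_search_url(lat, lon, category_id, client_id, client_secret)
    with urllib.request.urlopen(url, timeout=10) as response:
        return count_venues(json.load(response))


# Offline check with a hand-written payload of the assumed shape.
example_payload = {"response": {"venues": [{"name": "A"}, {"name": "B"}]}}
print(count_venues(example_payload))
```

Calling `fetch_venue_count` once per neighbourhood and per category (restaurants, fitness centres, transport) would fill the three new columns.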
Run K-Means clustering on the dataset, with the optimal k value found using the Elbow Method
A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.
Here is what I got after running K-Means clustering on the geolocational dataset.
From the diagram, I chose k = 4 and found the real centroids. Then, I plotted the neighbourhood clusters.
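The clustering step can be sketched as below, assuming scikit-learn. "Real centroids" is read here as the actual neighbourhood closest to each computed cluster centre, which is an interpretation rather than something the write-up states. The column names, demo counts, and the `cluster_neighbourhoods` helper are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min


def cluster_neighbourhoods(df, feature_cols, k=4, random_state=0):
    """Fit K-means and return (labelled copy of df, row positions nearest each centre)."""
    X = df[feature_cols].to_numpy(dtype=float)
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    # For each cluster centre, find the position of the closest actual row in X.
    nearest_rows, _ = pairwise_distances_argmin_min(model.cluster_centers_, X)
    return df.assign(cluster=model.labels_), nearest_rows


# Illustrative venue counts only.
demo = pd.DataFrame({
    "restaurants": [1, 2, 10, 11, 20, 21, 40, 42],
    "fitness_centres": [0, 1, 3, 4, 6, 7, 12, 13],
    "transport": [1, 1, 4, 5, 8, 9, 15, 16],
})
labelled, nearest_rows = cluster_neighbourhoods(demo, list(demo.columns))
print(labelled)
print("Rows closest to each centre:", nearest_rows)
```

K-means numbers its clusters arbitrarily, so the labels need to be interpreted from the boxplots rather than assumed to run from low to high amenities.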
I drew a boxplot for each of these clusters of geolocation data.
In Cluster-0, neighbourhoods tend to have few restaurants and fitness centres nearby, and transportation convenience is low.
In Cluster-1, neighbourhoods tend to have relatively many restaurants and fitness centres nearby, and transportation convenience is relatively high.
In Cluster-2, neighbourhoods tend to have relatively few restaurants and fitness centres nearby, and transportation convenience is relatively low.
In Cluster-3, neighbourhoods tend to have many restaurants and fitness centres nearby, and transportation convenience is high.
Using Folium, which is a Python library for generating maps based on the location it receives, I can plot the clustered locations on a map within the region of Waterloo.
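A minimal sketch of the map step is given here. It assumes a DataFrame with `neighbourhood`, `latitude`, `longitude`, and `cluster` columns (hypothetical names), and the colours are chosen arbitrarily. The default map centre is taken from the coordinates in the neighbourhood dataset URL above.

```python
# One colour per cluster, chosen arbitrarily for illustration.
CLUSTER_COLOURS = ["red", "blue", "green", "purple"]


def cluster_colour(label):
    """Pick a marker colour for a cluster label, cycling if labels outnumber colours."""
    return CLUSTER_COLOURS[int(label) % len(CLUSTER_COLOURS)]


def build_cluster_map(df, centre=(43.476769, -80.527100), zoom_start=11):
    """Return a folium.Map with one circle marker per neighbourhood, coloured by cluster."""
    # Imported lazily so cluster_colour can be used without folium installed.
    import folium

    fmap = folium.Map(location=list(centre), zoom_start=zoom_start)
    for row in df.itertuples(index=False):
        folium.CircleMarker(
            location=[row.latitude, row.longitude],
            radius=6,
            color=cluster_colour(row.cluster),
            fill=True,
            popup=f"{row.neighbourhood} (Cluster-{row.cluster})",
        ).add_to(fmap)
    return fmap
```

In a notebook, leaving the returned map as the last expression of a cell renders it inline; `build_cluster_map(df).save("clusters.html")` writes a standalone page instead.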
In the first cluster diagram, I got 4 clusters of students.
Clusters of Students | Income | Cook | Eat out | Sports Interest |
---|---|---|---|---|
Students-0 | High | Rarely | Rarely | High |
Students-1 | Low | Sometimes | Sometimes | Average |
Students-2 | High | Quite often | Sometimes | Average |
Students-3 | High | Rarely | Quite often | High |
In the second cluster diagram, I got 4 clusters of neighbourhoods.
Clusters of Neighbourhoods | Restaurants | Fitness Centres | Transportation |
---|---|---|---|
Neighbourhoods-0 | Extremely Low | Extremely Low | Extremely Low |
Neighbourhoods-1 | High | High | High |
Neighbourhoods-2 | Low | Low | Low |
Neighbourhoods-3 | Extremely High | Extremely High | Extremely High |
By analysing those eight boxplots together, I can conclude that:
Clusters of Neighbourhoods | Clusters of Students |
---|---|
Neighbourhoods-0 | Students-1 |
Neighbourhoods-1 | Students-0, Students-2 |
Neighbourhoods-2 | Students-1 |
Neighbourhoods-3 | Students-3 |
Students in Students-0 have a high income and a high interest in sports. Although they rarely eat out, I recommend Neighbourhoods-1 because of its many amenities, which should satisfy their needs. Those who want the best living experience can also choose Neighbourhoods-3.
Students in Students-1 have a relatively low income and an average interest in sports, and they sometimes eat out. I consider Neighbourhoods-0 and Neighbourhoods-2 their best choices, because the cost of living there is likely to be much lower, given the low transportation convenience and the small number of nearby restaurants and fitness centres. Students in this cluster who want to save money should choose Neighbourhoods-0; those who want a better quality of life can choose Neighbourhoods-2.
Students in Students-2 have a high income, cook quite often, sometimes eat out, and have an average interest in sports. I recommend Neighbourhoods-1 as their best choice of accommodation because it has relatively more amenities.
Students in Students-3 have a high income, eat out quite often, and have a high interest in sports. I recommend Neighbourhoods-3 as their best choice of accommodation because it has the most restaurants and fitness centres of all the clusters.
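For completeness, the recommendations above can be written down as a small lookup. This is only a transcription of the conclusions into code, with a hypothetical `recommend` helper; the order within each list follows the order in which the options are mentioned.

```python
# Recommended neighbourhood clusters per student cluster,
# transcribed from the conclusions above.
RECOMMENDATIONS = {
    "Students-0": ["Neighbourhoods-1", "Neighbourhoods-3"],
    "Students-1": ["Neighbourhoods-0", "Neighbourhoods-2"],
    "Students-2": ["Neighbourhoods-1"],
    "Students-3": ["Neighbourhoods-3"],
}


def recommend(student_cluster):
    """Return the recommended neighbourhood clusters, or an empty list if unknown."""
    return RECOMMENDATIONS.get(student_cluster, [])


print(recommend("Students-1"))
```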