ExploratoryAnalysisGeolocationData

Exploratory Analysis of Geolocational Data

Language: Python
Skill: Data Preparation, Data Visualization, Jupyter Notebook, REST API, K-Means Clustering

Introduction
Project Steps
Summary
References

Introduction

This project walks through the day-to-day activities of a data science engineer: preparing real-world datasets, visualizing the data, running machine learning algorithms, and presenting the results, all in Python within the Jupyter Notebook development environment.

In the fast-paced, effort-intensive environment most people live in, it's common to be too tired to prepare a home-cooked supper. Even people who eat home-cooked meals every day often want to go out for a good dinner now and then for social or recreational reasons.

Consider a student who has recently relocated to a new location. They already have particular tastes and interests. If the student lived close to their favourite food providers, it would save both the student and the providers a lot of time and effort. Convenience translates to higher sales for the providers and less time spent by the customer.

Objective

This project uses K-Means Clustering to find the best neighbourhoods for incoming students in Waterloo, classifying neighbourhoods according to the students' preferences for amenities and how close those amenities are.

Methods

Clustering is the process of grouping elements so that observations from the same group are more similar than those from different groups. Geolocational analysis is the process of applying geographic models to satellite photos, GPS locations, and street addresses.

Data Sources

The dataset of students: https://www.kaggle.com/borapajo/food-choices
The dataset of neighbourhoods: https://opendata-city-of-waterloo.opendata.arcgis.com/datasets/RMW::census-neighbourhoods-2016/explore?location=43.476769%2C-80.527100%2C10.86&showTable=true

Project Steps

1. Collect Data

Before moving to the data analysis phase, I need to fetch the data and set up the environment. I imported the data from the food-coded.csv file into the project for further visualization and interpretation.

The dataset contains many columns, but I extracted only the relevant ones.

Column	Meaning	Content
cook	how often do you cook?	1 - Every day 2 - A couple of times a week 3 - Whenever I can, but that is not very often 4 - I only help a little during holidays 5 - Never
eating_out	frequency of eating out in a typical week?	1 - Never 2 - 1-2 times 3 - 2-3 times 4 - 3-5 times 5 - every day
employment	do you work?	1 - yes full time 2 - yes part time 3 - no 4 - other
exercise	how often do you exercise in a regular week?	1 - Every day 2 - Twice or three times per week 3 - Once a week 4 - Sometimes 5 - Never
income	can afford more costly apartments?	1 - less than $15,000 2 - $15,001 to $30,000 3 - $30,001 to $50,000 4 - $50,001 to $70,000 5 - $70,001 to $100,000 6 - higher than $100,000
sports	do you do any sporting activity?	1 - Yes 2 - No 99 - no answer

Pandas is a Python library for reading and writing data, as well as for analyzing, manipulating, and structuring it.
Feature Extraction: The dataset contains 61 columns, and I do not need all of them to get a result. So, I extracted 11 columns and stored them in a separate data frame.
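
As a rough sketch of this step, the snippet below reads a CSV with pandas and keeps only the columns listed in the table above. The in-memory sample and its values are made up so that the example runs without the Kaggle file, and the helper name load_student_features is hypothetical; with the real data, the path to food-coded.csv would be passed instead.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for food-coded.csv (values are made up for illustration).
SAMPLE_CSV = """GPA,cook,eating_out,employment,exercise,income,sports
3.5,2,3,2,1,5,1
3.1,3,2,3,2,4,2
,1,2,3,,6,1
"""

# Columns described in the table above.
COLUMNS = ["cook", "eating_out", "employment", "exercise", "income", "sports"]


def load_student_features(source, columns=COLUMNS):
    """Read the survey CSV and keep only the requested columns, in the given order."""
    return pd.read_csv(source, usecols=columns)[columns]


students = load_student_features(io.StringIO(SAMPLE_CSV))
print(students.shape)  # (3, 6)
```

Reading only the needed columns with usecols keeps the data frame small from the start, which is one reasonable alternative to loading everything and slicing afterwards.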

2. Data Cleaning and Visualization

Visualization is a graphic way of representing data.
For larger datasets, visualization helps reveal insights in the data.
Common Python plotting libraries: matplotlib, seaborn, ggplot, plotly.

Data Cleaning:
Data cleaning is the process of removing incorrect, incomplete, missing, or duplicate values. It increases the accuracy of the data.

There are many ways to clean data, but I have chosen to drop the rows with missing values.

dropna() deletes rows or columns that contain missing (NA) values. Syntax: dataframe.dropna()
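
A minimal illustration of dropping rows with missing values is shown below. The data frame is made up for the example; only the dropna() call itself reflects the approach described above.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN marks a missing survey answer.
raw = pd.DataFrame(
    {
        "cook": [2, 3, np.nan, 1],
        "eating_out": [3, 2, 2, np.nan],
        "income": [5, 4, 6, 2],
    }
)

# Drop every row that has at least one missing value, then renumber the index.
clean = raw.dropna().reset_index(drop=True)

print(len(raw), "->", len(clean))  # 4 -> 2
```

Dropping rows is the simplest option, but it shrinks the dataset, so it is worth checking how many rows remain before clustering.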

3. K-Means Clustering

K-means clustering is a popular unsupervised machine learning technique. It finds data points that are similar or close to each other and separates them into different clusters.

I plotted the distortion against k, the number of clusters. Based on the diagram, I chose k = 4 and ran K-means clustering on all the columns.

As the diagram shows, I got 4 clusters of students based on their income status and preference for eating out. I then drew a boxplot for each of these four clusters.
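
The clustering step could look roughly like the sketch below, assuming scikit-learn's KMeans is used (the write-up does not name the library). The input matrix is synthetic: random answers in the range 1 to 5 standing in for the cleaned survey columns.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the cleaned survey data: 80 students x 6 columns,
# each answer an integer code between 1 and 5.
X = rng.integers(1, 6, size=(80, 6)).astype(float)

# k = 4 as chosen from the distortion plot; n_init and random_state are
# illustrative settings, not values taken from the project.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(model.cluster_centers_.shape)  # (4, 6)
print(np.bincount(labels))           # number of students per cluster
```

The labels array can then be attached to the data frame as a new column so that each cluster can be summarised or plotted separately.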

Students in Cluster-0 tend to be rich. They rarely cook or eat out. Their interest in exercise and sports is high.

Students in Cluster-1 sometimes cook or eat out. Their income is relatively low compared to the students in the other three clusters. Their interest in exercise and sports is average.

Students in Cluster-2 tend to be rich. They cook quite often and sometimes eat out. Their interest in exercise and sports is average.

Students in Cluster-3 tend to be rich. They rarely cook but eat out quite often. Their interest in exercise and sports is high.

4. Get Geolocational Data

First, I imported the dataset of neighbourhoods in the Waterloo region.

The Regional Municipality of Waterloo is a metropolitan area of Southern Ontario, Canada. It contains the cities of Cambridge, Kitchener and Waterloo, and the townships of North Dumfries, Wellesley, Wilmot, and Woolwich. Kitchener, the largest city, is the seat of government (Regional_Municipality_of_Waterloo, 2008).

Then, I cleaned the data to extract the neighbourhood names.

Next, I used Nominatim to retrieve the latitude and longitude of each neighbourhood and added them to the dataset.
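
One way to do the geocoding is sketched below, assuming the geopy package is used as the Nominatim client (the write-up only names Nominatim). The helper names, the user agent string, and the address suffix are hypothetical. The geocoder is passed in as a function so the lookup logic can be tried without network access.

```python
def nominatim_geocoder(user_agent="waterloo-neighbourhoods-demo"):
    """Return a rate-limited geocode function backed by Nominatim.

    Requires the third-party geopy package and network access.
    """
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Nominatim's public service asks for at most one request per second.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def add_coordinates(names, geocode, suffix=", Waterloo Region, Ontario, Canada"):
    """Look up each neighbourhood name and return rows with latitude and longitude.

    Names that cannot be resolved get None for both coordinates.
    """
    rows = []
    for name in names:
        location = geocode(name + suffix)
        rows.append(
            {
                "neighbourhood": name,
                "latitude": None if location is None else location.latitude,
                "longitude": None if location is None else location.longitude,
            }
        )
    return rows


# Example usage (hypothetical neighbourhood name; needs geopy and network):
# rows = add_coordinates(["Example Neighbourhood"], nominatim_geocoder())
```

Unresolved names are kept with empty coordinates rather than dropped, so they can be reviewed and cleaned in the same way as other missing values.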

To help students make better choices when looking for accommodation, more detailed geolocational data is needed.

I created a free Foursquare account and set up the API credentials.

Using the Foursquare API, I added three more columns to the dataset: restaurants, fitness centres, and transport.

Restaurant: Number of restaurants within a radius of 2 km
Fitness Centres: Number of fitness centres within a radius of 2 km
Transport and Travel: Number of transport and travel venues within a radius of 2 km
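
The counts above could be collected along the lines of the sketch below. It assumes the legacy Foursquare v2 venue search endpoint with client ID and secret credentials; the write-up does not say which API version was used, and Foursquare has changed its API over time, so the endpoint and parameters should be checked against the current documentation. The function names, the version date, and the category ID placeholder are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: legacy v2 venue search endpoint. Newer accounts may need a different API version.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(lat, lon, category_id, client_id, client_secret,
                     radius_m=2000, limit=50, version="20210101"):
    """Build the search URL for venues of one category around a point."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,             # API version date (illustrative value)
        "ll": f"{lat},{lon}",
        "radius": radius_m,       # 2 km, as in the column definitions above
        "categoryId": category_id,
        "limit": limit,
    }
    return SEARCH_URL + "?" + urlencode(params)


def fetch_json(url):
    """Download a URL and parse the JSON body (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


def count_venues(lat, lon, category_id, client_id, client_secret,
                 radius_m=2000, fetch=fetch_json):
    """Return how many venues the search reports for the given category."""
    url = build_search_url(lat, lon, category_id, client_id, client_secret, radius_m)
    payload = fetch(url)
    return len(payload.get("response", {}).get("venues", []))


# Example usage (placeholders, not real credentials or category IDs):
# n = count_venues(43.47, -80.52, "<category-id>", "<client-id>", "<client-secret>")
```

Because each request returns at most limit results, counts produced this way are capped, which matters when comparing very dense neighbourhoods.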

Run K-Means clustering on the dataset, with the optimal k value chosen using the Elbow Method
A fundamental step in K-means clustering is to determine the optimal number of clusters into which the data may be grouped. The Elbow Method is one of the most popular ways to determine this optimal value of k: the distortion is plotted for a range of k values, and the k at which the curve bends and further increases bring little improvement is chosen.
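
The Elbow Method can be sketched as follows, assuming scikit-learn's KMeans and taking its inertia_ attribute (the sum of squared distances from each point to its nearest centroid) as the distortion. The data is synthetic, with four well-separated groups, so the bend in the curve is expected around k = 4.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic data: four well-separated blobs in 2-D, 50 points each.
centres = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

# Fit K-means for k = 1..8 and record the distortion for each k.
ks = range(1, 9)
distortions = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks
]

for k, d in zip(ks, distortions):
    print(k, round(d, 1))
```

Plotting distortions against ks (for example with matplotlib) gives the elbow diagram; on real data the bend is usually less sharp than on this synthetic example, so the choice of k involves some judgment.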

Here is what I got after running K Means clustering on the geolocational dataset.

From the diagram, I chose k = 4 and found the actual centroids. Then, I plotted the neighbourhood clusters.

I drew a boxplot for each of these clusters of geolocation data.
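
A per-cluster boxplot can be drawn roughly as below, assuming matplotlib is used for plotting. The numbers and the column names restaurants, fitness_centres, and transport are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up venue counts for twelve neighbourhoods in four clusters.
df = pd.DataFrame(
    {
        "cluster": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
        "restaurants": [1, 2, 0, 20, 25, 22, 6, 8, 7, 45, 50, 48],
        "fitness_centres": [0, 0, 1, 5, 6, 4, 2, 1, 2, 10, 12, 11],
        "transport": [0, 1, 0, 8, 9, 7, 3, 2, 3, 15, 18, 16],
    }
)

features = ["restaurants", "fitness_centres", "transport"]
cluster_ids = sorted(df["cluster"].unique())

# One panel per feature, one box per cluster.
fig, axes = plt.subplots(1, len(features), figsize=(12, 4))
for ax, feature in zip(axes, features):
    groups = [df.loc[df["cluster"] == c, feature].to_numpy() for c in cluster_ids]
    ax.boxplot(groups)
    ax.set_xticklabels([str(c) for c in cluster_ids])
    ax.set_xlabel("cluster")
    ax.set_title(feature)
fig.tight_layout()
# fig.savefig("cluster_boxplots.png")  # uncomment to write the figure to disk
```

Placing the clusters side by side within each panel makes it easier to compare their medians and spreads for one amenity at a time.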

Neighbourhoods in Cluster-0 tend to have fewer restaurants and fitness centres nearby. The transportation convenience is low.

Neighbourhoods in Cluster-1 tend to have relatively more restaurants and fitness centres nearby. The transportation convenience is relatively high.

Neighbourhoods in Cluster-2 tend to have relatively fewer restaurants and fitness centres nearby. The transportation convenience is relatively low.

Neighbourhoods in Cluster-3 tend to have more restaurants and fitness centres nearby. The transportation convenience is high.

5. Plotting the clustered locations on a map

Using Folium, a Python library for generating maps from the locations it receives, I can plot the clustered locations on a map of the Waterloo region.
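
A minimal sketch of the mapping step is given below. The map centre is taken from the location parameter in the neighbourhood dataset URL listed under Data Sources; the colour choices, row layout, function names, and output file name are assumptions made for the example.

```python
CLUSTER_COLOURS = ["red", "blue", "green", "purple"]  # one colour per cluster (k = 4)

# Map centre taken from the location parameter in the neighbourhood dataset URL.
MAP_CENTRE = (43.476769, -80.527100)


def colour_for(cluster):
    """Pick a marker colour for a cluster label, cycling if there are more labels than colours."""
    return CLUSTER_COLOURS[int(cluster) % len(CLUSTER_COLOURS)]


def build_cluster_map(rows, centre=MAP_CENTRE, zoom_start=11):
    """Return a folium map with one coloured circle marker per neighbourhood.

    Each row is a dict with neighbourhood, latitude, longitude, and cluster keys.
    """
    import folium  # third-party; imported here so colour_for works without it

    cluster_map = folium.Map(location=list(centre), zoom_start=zoom_start)
    for row in rows:
        folium.CircleMarker(
            location=[row["latitude"], row["longitude"]],
            radius=6,
            popup=f'{row["neighbourhood"]} (cluster {row["cluster"]})',
            color=colour_for(row["cluster"]),
            fill=True,
            fill_opacity=0.7,
        ).add_to(cluster_map)
    return cluster_map


# Example usage (hypothetical row; requires folium):
# rows = [{"neighbourhood": "Example", "latitude": 43.47, "longitude": -80.52, "cluster": 1}]
# build_cluster_map(rows).save("waterloo_clusters.html")
```

In a Jupyter Notebook, returning the map object as the last expression of a cell displays it inline, so saving to HTML is only needed for sharing.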

Summary

In the first cluster diagram, I got 4 clusters of students.

Clusters of Students	Income	Cook	Eat out	Sports Interest
Students-0	High	Rarely	Rarely	High
Students-1	Low	Sometimes	Sometimes	Average
Students-2	High	Quite often	Sometimes	Average
Students-3	High	Rarely	Quite often	High

In the second cluster diagram, I got 4 clusters of neighbourhoods.

Clusters of Neighbourhoods	Restaurants	Fitness Centres	Transportation
Neighbourhoods-0	Extremely Low	Extremely Low	Extremely Low
Neighbourhoods-1	High	High	High
Neighbourhoods-2	Low	Low	Low
Neighbourhoods-3	Extremely High	Extremely High	Extremely High

By analysing those eight boxplots together, I can conclude that:

Clusters of Neighbourhoods	Clusters of Students
Neighbourhoods-0	Students-1
Neighbourhoods-1	Students-0 Students-2
Neighbourhoods-2	Students-1
Neighbourhoods-3	Students-3
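
The matching in the table above can also be expressed as a small lookup, which makes it easy to ask the question from the student's side. The sketch below simply encodes the table and inverts it; the variable and function names are hypothetical.

```python
from collections import defaultdict

# The matching table above, keyed by neighbourhood cluster.
NEIGHBOURHOOD_TO_STUDENTS = {
    "Neighbourhoods-0": ["Students-1"],
    "Neighbourhoods-1": ["Students-0", "Students-2"],
    "Neighbourhoods-2": ["Students-1"],
    "Neighbourhoods-3": ["Students-3"],
}


def invert(mapping):
    """Turn neighbourhood -> student clusters into student cluster -> neighbourhoods."""
    inverted = defaultdict(list)
    for neighbourhood, student_clusters in mapping.items():
        for cluster in student_clusters:
            inverted[cluster].append(neighbourhood)
    return dict(inverted)


STUDENT_TO_NEIGHBOURHOODS = invert(NEIGHBOURHOOD_TO_STUDENTS)

print(STUDENT_TO_NEIGHBOURHOODS["Students-1"])  # ['Neighbourhoods-0', 'Neighbourhoods-2']
```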

Students in Students-0 have high income and a high interest in sports. Although they rarely eat out, I recommend that they choose places in Neighbourhoods-1, which has a high level of amenities. If students in this cluster want the best living experience, they can also choose places in Neighbourhoods-3; otherwise, places in Neighbourhoods-1 should satisfy their needs.

Students in Students-1 have relatively low income and an average interest in sports, and they sometimes eat out. I consider Neighbourhoods-0 and Neighbourhoods-2 their best choices, because the cost of living is likely to be much lower in these areas, given the low transportation convenience and the small number of nearby restaurants and fitness centres. Students in this cluster who want to save money should choose Neighbourhoods-0. Those who want a better quality of life can choose Neighbourhoods-2.

Students in Students-2 have high income, cook quite often, sometimes eat out, and have an average interest in sports. So, I recommend Neighbourhoods-1 as their best choice of accommodation because it has relatively more amenities.

Students in Students-3 have high income, eat out quite often, and have a high interest in sports. So, I recommend Neighbourhoods-3 as their best choice of accommodation because it has the most restaurants and fitness centres among all the clusters.

References

Anant Shukla, K. P. (n.d.). Exploratory Analysis of Geolocational Data.
Regional_Municipality_of_Waterloo. (2008, August). Retrieved from https://en.wikipedia.org/wiki/Regional_Municipality_of_Waterloo