Exploratory Analysis of Geolocational Data

Language: Python
Skill: Data Preparation, Data Visualization, Jupyter Notebook, REST API, K-Means Clustering


Table of Contents

  1. Introduction
  2. Project Steps
    1. Collect Data
    2. Data Cleaning and Visualization
    3. K-Means Clustering
    4. Get Geolocational Data
    5. Plotting the clustered locations on a map
  3. Summary
  4. References

Introduction

This project walks through the day-to-day activities of a data science engineer: preparing real-world datasets, visualizing the data, running a machine learning algorithm, and presenting the results, all in Python within a Jupyter Notebook environment.

In the fast-paced, demanding environment most people live in, it is common to be too tired to prepare a home-cooked supper. Even those who eat home-cooked meals every day often want to go out for a decent dinner now and again, for social or recreational reasons.

Consider a student who has recently relocated to a new area. They already have particular tastes and interests. If the student lived close to the venues they prefer, it would save both the student and the food providers a lot of time and effort. Convenience translates into higher sales for the providers and less time spent by the customer.

Objective

This project uses K-Means Clustering to find the best neighbourhoods for incoming students in Waterloo, classifying neighbourhoods according to the students' preferences for amenities and their proximity to them.

Methods

Clustering is the process of grouping elements so that observations from the same group are more similar than those from different groups. Geolocational analysis is the process of applying geographic models to satellite photos, GPS locations, and street addresses.

Data Sources

The dataset of students: https://www.kaggle.com/borapajo/food-choices
The dataset of neighbourhoods: https://opendata-city-of-waterloo.opendata.arcgis.com/datasets/RMW::census-neighbourhoods-2016/explore?location=43.476769%2C-80.527100%2C10.86&showTable=true


Project Steps

1. Collect Data

Before moving to the data analysis phase, I need to fetch the data and set up the environment. I imported the data from the food-coded.csv file into the project for further visualization and interpretation.

The data contains many columns, but I extracted only the relevant ones.

| Column | Meaning | Content |
| --- | --- | --- |
| cook | How often do you cook? | 1 - Every day |
| | | 2 - A couple of times a week |
| | | 3 - Whenever I can, but that is not very often |
| | | 4 - I only help a little during holidays |
| | | 5 - Never |
| eating_out | Frequency of eating out in a typical week? | 1 - Never |
| | | 2 - 1-2 times |
| | | 3 - 2-3 times |
| | | 4 - 3-5 times |
| | | 5 - Every day |
| employment | Do you work? | 1 - Yes, full time |
| | | 2 - Yes, part time |
| | | 3 - No |
| | | 4 - Other |
| exercise | How often do you exercise in a regular week? | 1 - Every day |
| | | 2 - Twice or three times per week |
| | | 3 - Once a week |
| | | 4 - Sometimes |
| | | 5 - Never |
| income | Can afford more costly apartments? | 1 - Less than $15,000 |
| | | 2 - $15,001 to $30,000 |
| | | 3 - $30,001 to $50,000 |
| | | 4 - $50,001 to $70,000 |
| | | 5 - $70,001 to $100,000 |
| | | 6 - Higher than $100,000 |
| sports | Do you do any sporting activity? | 1 - Yes |
| | | 2 - No |
| | | 99 - No answer |

  • Pandas is a Python library for reading and writing data. It is also used for data analysis, manipulation, and structuring.
  • Feature Extraction: The dataset contains 61 columns, and I do not need all of them to get a result. So, I extracted 11 columns and stored them in a separate data frame.
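
The loading and column-selection step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the six column names come from the table above (the project keeps 11 in total), the helper `select_features` is a hypothetical name, and a tiny in-memory CSV with made-up rows stands in for food-coded.csv so the snippet runs on its own.

```python
import io

import pandas as pd

# Columns described in the table above. The project keeps 11 columns in
# total, so this shorter list is only illustrative.
FEATURES = ["cook", "eating_out", "employment", "exercise", "income", "sports"]


def select_features(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of `frame` restricted to `columns`, failing loudly on typos."""
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return frame[columns].copy()


# Stand-in for pd.read_csv("food-coded.csv"): made-up rows, same column names.
sample_csv = io.StringIO(
    "GPA,cook,eating_out,employment,exercise,income,sports\n"
    "3.5,2,3,2,1,5,1\n"
    "3.1,4,2,3,2,3,2\n"
)
raw = pd.read_csv(sample_csv)
students = select_features(raw, FEATURES)
print(students.shape)  # (2, 6)
```

Copying the selection into a separate data frame keeps later cleaning steps from modifying the raw data.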

2. Data Cleaning and Visualization

  • Visualization is a graphic way of representing data.
  • For larger datasets, data visualization helps to demonstrate the insights of data.
  • Common Python plotting libraries: matplotlib, seaborn, ggplot, plotly.

Data Cleaning:
Data cleaning is the process of removing data that is incorrect, incomplete, missing, or duplicated. It increases the accuracy of the data.

There are many ways to clean data, but I have chosen to drop the rows with missing values.

dropna() deletes rows or columns that contain missing (NA) values. Syntax: dataframe.dropna()
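
A small sketch of what dropping rows with missing values looks like in practice. The data frame below is made up for illustration; only the column names are borrowed from the project.

```python
import numpy as np
import pandas as pd

# Made-up answers with two missing values.
students = pd.DataFrame(
    {
        "cook": [1, 2, np.nan, 4],
        "eating_out": [3, np.nan, 2, 5],
        "income": [5, 4, 3, 6],
    }
)

# How many values are missing in each column before cleaning.
print(students.isna().sum())

# Drop every row that has at least one missing value (the default behaviour).
cleaned = students.dropna().reset_index(drop=True)
print(len(students), "->", len(cleaned))  # 4 -> 2
```

Dropping rows is the simplest option, but it shrinks the dataset, so it is worth checking how many rows remain afterwards.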

3. K-Means Clustering

K-means clustering is a popular unsupervised machine learning technique. It finds data points that are similar or close to each other and separates them into different groups, or clusters.
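
As a minimal sketch of the technique, the snippet below runs scikit-learn's `KMeans` on a handful of made-up (income, eating_out) answers. The data and the choice of two clusters are assumptions chosen to keep the example small; they are not the project's data or settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up survey answers: columns are (income, eating_out) on the coded scales.
X = np.array(
    [
        [1, 1], [1, 2], [2, 1],  # low income, rarely eats out
        [6, 5], [5, 5], [6, 4],  # high income, eats out often
    ],
    dtype=float,
)

# n_clusters=2 suits this toy data; it is not the project's value of k.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(labels)                  # one cluster index per row
print(model.cluster_centers_)  # mean of the points assigned to each cluster
```

The cluster indices themselves are arbitrary; only the grouping matters, so the same two groups may be numbered differently from run to run unless `random_state` is fixed.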

I plotted the distortion against k, the number of clusters. Based on the diagram, I chose k = 4 and ran K-means clustering on all the columns.
Picture1 Picture2

As the diagram shows, I got 4 clusters of students based on their income and their preference for eating out. I then drew a boxplot for each of these four clusters.
Picture3
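
The per-cluster boxplots summarise how each feature is distributed inside a cluster. A rough numerical equivalent, sketched here on made-up data with hypothetical cluster labels, is the per-cluster median, which is the centre line of each box.

```python
import pandas as pd

# Made-up rows; "cluster" holds labels of the kind K-means produces.
students = pd.DataFrame(
    {
        "income":     [5, 6, 2, 1, 5, 6],
        "eating_out": [1, 2, 3, 3, 5, 4],
        "cluster":    [0, 0, 1, 1, 2, 2],
    }
)

# Median of each feature within each cluster.
medians = students.groupby("cluster")[["income", "eating_out"]].median()
print(medians)

# With matplotlib installed, the boxplots themselves can be drawn with:
# students.boxplot(column=["income", "eating_out"], by="cluster")
```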

In Cluster-0, students tend to have a high income. They rarely cook or eat out. Their interest in exercise and sports is high.
Picture4
In Cluster-1, students sometimes cook or eat out. Their income is relatively low compared to the students in the other three clusters. Their interest in exercise and sports is average.
Picture5

In Cluster-2, students tend to have a high income. They cook quite often and sometimes eat out. Their interest in exercise and sports is average.
Picture6
In Cluster-3, students tend to have a high income. They rarely cook but eat out quite often. Their interest in exercise and sports is high.

4. Get Geolocational Data

First, I import the dataset of neighbourhoods in the Waterloo region.

The Regional Municipality of Waterloo is a metropolitan area of Southern Ontario, Canada. It contains the cities of Cambridge, Kitchener and Waterloo, and the townships of North Dumfries, Wellesley, Wilmot, and Woolwich. Kitchener, the largest city, is the seat of government (Regional_Municipality_of_Waterloo, 2008).

Then, I cleaned the data to extract the neighbourhood names from it.
Picture7

Then, I used Nominatim to retrieve the latitude and longitude of each neighbourhood and added them to this dataset.
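
One way to do this lookup is through the geopy package, which wraps the Nominatim service. The sketch below is an assumption about how the step could be written, not the notebook's code: the helper names, the user agent string, and the address suffix are all hypothetical. The geocoder is passed in as a function so the lookup logic can be tried without network access.

```python
from typing import Callable


def make_nominatim_geocoder(user_agent: str) -> Callable[[str], object]:
    """Build a rate-limited geopy geocoder (requires `pip install geopy`)."""
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # The public Nominatim service asks for at most one request per second.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def geocode_names(names, geocode, suffix=", Waterloo, Ontario, Canada"):
    """Map each neighbourhood name to (latitude, longitude), or None if not found."""
    coords = {}
    for name in names:
        location = geocode(name + suffix)
        coords[name] = (location.latitude, location.longitude) if location else None
    return coords


# Real use (makes network requests; the name below is a placeholder):
#   geocode = make_nominatim_geocoder("waterloo-neighbourhoods-demo")
#   coords = geocode_names(["Example Neighbourhood"], geocode)
```

Names that Nominatim cannot resolve come back as None, so they can be reviewed or dropped before clustering.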

To help students make a better choice when looking for accommodation, I also need more detailed geolocational data.

I created a free Foursquare account and set up the API credentials.

Using the Foursquare API, I added three more columns to the dataset: restaurants, fitness centres, and transport.

  • Restaurant: number of restaurants within a radius of 2 km
  • Fitness Centres: number of fitness centres within a radius of 2 km
  • Transport and Travel: number of transport and travel venues within a radius of 2 km
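
To make the "within a radius of 2 km" criterion concrete, here is a local sketch that counts venues by great-circle distance. It does not reproduce the Foursquare request, which applies the radius on the server side; the function names and venue coordinates are made up, and the centre point is the map location embedded in the neighbourhood dataset URL above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def count_within(centre, venues, radius_km=2.0):
    """Count the venues whose distance from `centre` is at most `radius_km`."""
    return sum(
        1 for lat, lon in venues if haversine_km(*centre, lat, lon) <= radius_km
    )


centre = (43.4768, -80.5271)
venues = [
    (43.4775, -80.5287),  # about 150 m away
    (43.4845, -80.5467),  # roughly 1.8 km away
    (43.5325, -80.5271),  # roughly 6 km north
]
print(count_within(centre, venues))  # 2
```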

Picture8

Run K-Means clustering on the dataset, using the Elbow Method to find the optimal k
A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.
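
The Elbow Method can be sketched as follows. The data is synthetic (three well-separated blobs), and distortion is taken to be scikit-learn's `inertia_`, the sum of squared distances to the nearest centroid; both are assumptions for illustration rather than the project's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the neighbourhood features: three separated blobs.
centres = np.array([[0, 0], [10, 10], [0, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

ks = range(1, 8)
distortions = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances from each point to its centroid.
    distortions.append(model.inertia_)

for k, d in zip(ks, distortions):
    print(k, round(d, 1))
# Plotting distortions against ks (for example with matplotlib) shows the bend;
# for this synthetic data it sits at k = 3.
```

Distortion keeps falling as k grows, so the useful signal is the point where the drop flattens out rather than the minimum.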

Here is what I got after running K Means clustering on the geolocational dataset.

From the diagram, I chose k = 4 and found the real centroids. Then, I plotted the neighbourhood clusters.
Picture9 Picture10
I drew a boxplot for each of these clusters of geolocation data.
Picture11

In Cluster-0, neighbourhoods tend to have fewer restaurants and fitness centres nearby. Transportation convenience is low.
Picture12

In Cluster-1, neighbourhoods tend to have relatively more restaurants and fitness centres nearby. Transportation convenience is relatively high.
Picture13

In Cluster-2, neighbourhoods tend to have relatively fewer restaurants and fitness centres nearby. Transportation convenience is relatively low.
Picture14
In Cluster-3, neighbourhoods tend to have more restaurants and fitness centres nearby. Transportation convenience is high.

5. Plotting the clustered locations on a map

Using Folium, a Python library for generating maps from location data, I plotted the clustered locations on a map of the Waterloo region.
Picture15
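
A minimal sketch of how such a map could be built with Folium. The palette, helper names, zoom level, and example point are hypothetical, and the default map centre is the location embedded in the neighbourhood dataset URL above; the notebook's own styling may differ.

```python
# Hypothetical palette: one colour per cluster label (k = 4 in this project).
PALETTE = ["red", "blue", "green", "purple"]


def cluster_color(label: int) -> str:
    """Pick a marker colour for a cluster label, wrapping around the palette."""
    return PALETTE[label % len(PALETTE)]


def build_cluster_map(points, centre=(43.4768, -80.5271), zoom_start=11):
    """Build a map from an iterable of (name, latitude, longitude, cluster_label)."""
    import folium  # imported here so cluster_color works without folium installed

    fmap = folium.Map(location=list(centre), zoom_start=zoom_start)
    for name, lat, lon, label in points:
        folium.CircleMarker(
            location=[lat, lon],
            radius=6,
            color=cluster_color(label),
            fill=True,
            popup=f"{name} (cluster {label})",
        ).add_to(fmap)
    return fmap


# Usage (writes an HTML file that can be opened in a browser):
#   fmap = build_cluster_map([("Example A", 43.47, -80.54, 0)])
#   fmap.save("waterloo_clusters.html")
```

Colouring markers by cluster label makes it easy to see whether neighbourhoods in the same cluster are also geographically close.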


Summary

In the first cluster diagram, I got 4 clusters of students.

| Clusters of Students | Income | Cook | Eat out | Sports Interest |
| --- | --- | --- | --- | --- |
| Students-0 | High | Rarely | Rarely | High |
| Students-1 | Low | Sometimes | Sometimes | Average |
| Students-2 | High | Quite often | Sometimes | Average |
| Students-3 | High | Rarely | Quite often | High |

In the second cluster diagram, I got 4 clusters of neighborhoods.

| Clusters of Neighbourhoods | Restaurants | Fitness Centres | Transportation |
| --- | --- | --- | --- |
| Neighbourhoods-0 | Extremely Low | Extremely Low | Extremely Low |
| Neighbourhoods-1 | High | High | High |
| Neighbourhoods-2 | Low | Low | Low |
| Neighbourhoods-3 | Extremely High | Extremely High | Extremely High |

By analysing the eight boxplots together, I matched the clusters as follows:

| Clusters of Neighbourhoods | Clusters of Students |
| --- | --- |
| Neighbourhoods-0 | Students-1 |
| Neighbourhoods-1 | Students-0, Students-2 |
| Neighbourhoods-2 | Students-1 |
| Neighbourhoods-3 | Students-3 |

Students in Students-0 have a high income and a high interest in sports. Though they rarely eat out, I recommend that they choose places in Neighbourhoods-1 because it has a high level of amenities. If students in this cluster would like the best living experience, they can also choose places in Neighbourhoods-3; otherwise, places in Neighbourhoods-1 should satisfy their needs.

Students in Students-1 have a relatively low income and an average interest in sports, and they sometimes eat out. I consider places in Neighbourhoods-0 and Neighbourhoods-2 their best choices, because the cost of living is likely to be much lower in these areas, given the low transportation convenience and the small number of nearby restaurants and fitness centres. Students in this cluster who would like to save money should choose Neighbourhoods-0; those who want a better quality of life can choose Neighbourhoods-2.

Students in Students-2 have a high income, cook quite often, sometimes eat out, and have an average interest in sports. So, I recommend Neighbourhoods-1 as their best choice of accommodation, because it has relatively more amenities.

Students in Students-3 have a high income, eat out quite often, and have a high interest in sports. So, I recommend Neighbourhoods-3 as their best choice of accommodation, because it has the most restaurants and fitness centres of all the clusters.


References