---
title: "GPA Correlations"
---

```{r setup, include=FALSE}
library(tidyverse)
library(colorspace)
knitr::opts_chunk$set(echo = TRUE)
```

```{r message = FALSE, warning = FALSE}
food <- readr::read_csv("https://wilkelab.org/DSC385/datasets/food_coded.csv")
food
```

A detailed data dictionary for this dataset is available [here](https://wilkelab.org/DSC385/datasets/food_codebook.pdf). The dataset was originally downloaded from Kaggle, where you can find [additional information](https://www.kaggle.com/borapajo/food-choices/version/5) about it.


**Question:** Is GPA related to student income, the father's educational level, or the student's perception of what an ideal diet is?

**Introduction:**

We will explore the `food` dataset (named "Food choices" on Kaggle) for this project. This dataset contains 126 college students' survey responses about food choices, income, GPA, and other personal details. To answer our project's question, *Is GPA related to student income, the father's educational level, or the student's perception of what an ideal diet is?*, we will consider the following variables:

- `income` (student's income as an integer coding several income ranges)
  - 1 - less than $15,000
  - 2 - $15,001 to $30,000
  - 3 - $30,001 to $50,000
  - 4 - $50,001 to $70,000
  - 5 - $70,001 to $100,000
  - 6 - higher than $100,000
- `father_education` (level of student's father's education as an integer coding several levels of education)
  - 1 - less than high school
  - 2 - high school degree
  - 3 - some college degree
  - 4 - college degree
  - 5 - graduate degree
- `ideal_diet_coded` (what a student thinks about an ideal diet as an integer coding several response categories)
  - 1 - portion control
  - 2 - adding veggies/eating healthier food/adding fruit
  - 3 - balance
  - 4 - less sugar
  - 5 - home cooked/organic
  - 6 - current diet
  - 7 - more protein
  - 8 - unclear
- `GPA` (student's GPA as a decimal number)

**Approach:** 

For our analysis, we will be following the steps below:

1. Wrangle our dataset to ready it for analysis as follows:

    - Select our four columns from the dataset
    - Drop every row that contains an `NA` value in any of these columns
    - Strip leading and trailing non-numeric characters from the `GPA` column, then remove the rows that are left empty (meaning they did not contain a GPA)
    - Convert `GPA` to a numeric
    - Map the integer codes for `income`, `father_education` and `ideal_diet_coded` to descriptive labels
    - Convert `income`, `father_education` and `ideal_diet_coded` to factors
    - Order the levels of `income`, `father_education` and `ideal_diet_coded` by ascending mean GPA
2. Create violin plots for `income`, `father_education` and `ideal_diet_coded` versus GPA

We use violin plots because each explanatory variable is categorical: a violin shows the full distribution of the dependent variable (GPA) within each category, which makes it easy to compare the categories against each other.
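
As a minimal sketch of the GPA-cleaning step listed above, the chunk below applies `trimws()` with `whitespace = "[^0-9]"` to a few made-up strings. The values in `gpa_raw` are hypothetical and are not taken from the survey. With this pattern, `trimws()` strips non-digit characters from both ends of each string only, so the decimal point inside a number is kept and entries without any digits become empty strings. The `whitespace` argument requires R 3.6.0 or later.

```{r}
# Hypothetical raw GPA entries (illustrative only, not from the dataset)
gpa_raw <- c("3.5", "3.79 approx", "Unknown", " 2.8")

# Strip leading and trailing non-digit characters
gpa_trimmed <- trimws(gpa_raw, whitespace = "[^0-9]")

# Drop entries that are now empty, then convert the rest to numeric
gpa_clean <- as.numeric(gpa_trimmed[nzchar(gpa_trimmed)])
gpa_clean
```

Because only the ends of each string are trimmed, an entry with non-numeric characters in the middle would pass the empty-string filter and be turned into `NA` (with a warning) by `as.numeric()`.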

**Analysis:**

```{r }

# Data Wrangling

food_wrangled <- food %>%
  select(GPA, income, father_education, ideal_diet_coded) %>%
  na.omit() %>%
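  # Strip leading/trailing non-digit characters from GPA; entries without
  # any digits become empty strings and are dropped by the filter
  mutate(GPA = trimws(GPA, whitespace = "[^0-9]")) %>%
  filter(nzchar(GPA)) %>%
  mutate(GPA = as.numeric(GPA)) %>%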
  mutate(income = case_when(
    income == 1 ~ "less than $15,000",
    income == 2 ~ "$15,001 to $30,000",
    income == 3 ~ "$30,001 to $50,000",
    income == 4 ~ "$50,001 to $70,000",
    income == 5 ~ "$70,001 to $100,000",
    income == 6 ~ "higher than $100,000"
  ), father_education = case_when(
    father_education == 1 ~ "less than high school",
    father_education == 2 ~ "high school degree",
    father_education == 3 ~ "some college degree",
    father_education == 4 ~ "college degree",
    father_education == 5 ~ "graduate degree"
  ), ideal_diet_coded = case_when(
    ideal_diet_coded == 1 ~ "portion control",
    ideal_diet_coded == 2 ~ "add fruit/veggies or eat healthier",
    ideal_diet_coded == 3 ~ "balance",
    ideal_diet_coded == 4 ~ "less sugar",
    ideal_diet_coded == 5 ~ "home cooked/organic",
    ideal_diet_coded == 6 ~ "current diet",
    ideal_diet_coded == 7 ~ "more protein",
    ideal_diet_coded == 8 ~ "unclear"
  )) %>%
  mutate(income = as.factor(income),
         father_education = as.factor(father_education),
         ideal_diet_coded = as.factor(ideal_diet_coded)) %>%
  mutate(income = fct_reorder(income, GPA, mean),
         father_education = fct_reorder(father_education, GPA, mean),
         ideal_diet_coded = fct_reorder(ideal_diet_coded, GPA, mean))

```
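
The last wrangling step relies on `fct_reorder()`, which by default sorts factor levels in ascending order of the summary statistic. The sketch below illustrates this on a small made-up data frame (`toy_levels` and its values are hypothetical); passing `.desc = TRUE` would give the descending order instead.

```{r}
library(forcats)

# Hypothetical toy data (not from the survey)
toy_levels <- data.frame(
  group = c("a", "a", "b", "b", "c", "c"),
  GPA   = c(3.5, 3.7, 2.9, 3.1, 3.2, 3.4)
)

# Group means: a = 3.6, b = 3.0, c = 3.3
asc  <- fct_reorder(toy_levels$group, toy_levels$GPA, mean)
desc <- fct_reorder(toy_levels$group, toy_levels$GPA, mean, .desc = TRUE)

levels(asc)   # lowest mean GPA first
levels(desc)  # highest mean GPA first
```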

```{r fig.height=5, fig.width=13}

# Violin Plots

# GPA distribution per income bracket
ggplot(food_wrangled, aes(income, GPA)) +
  geom_violin(fill = "#E69F00") +
  scale_x_discrete(name = "Income Bracket") +
  scale_y_continuous(name = "GPA") +
  ggtitle("GPA vs. Income Bracket") +
  theme_bw(12)

# GPA distribution per level of father's education
ggplot(food_wrangled, aes(father_education, GPA)) +
  geom_violin(fill = "#56B4E9") +
  scale_x_discrete(name = "Father's Education Category") +
  scale_y_continuous(name = "GPA") +
  ggtitle("GPA vs. Father's Education") +
  theme_bw(12)

# GPA distribution per ideal-diet category
ggplot(food_wrangled, aes(ideal_diet_coded, GPA)) +
  geom_violin(fill = "#009E73") +
  scale_x_discrete(name = "Student's Idea of Ideal Diet") +
  scale_y_continuous(name = "GPA") +
  ggtitle("GPA vs. Ideal Diet") +
  theme_bw(12)

```

**Discussion:** 

Our charts suggest the following:

- There is no clear relationship between income and GPA.
- There seems to be a slight association between father's education and GPA. GPA is highest when the father holds a college degree and is lower for both higher and lower education levels.
- The GPA distributions differ between the ideal-diet categories (which are ordered by ascending mean GPA), which suggests a relationship between a student's idea of an ideal diet and GPA.
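
The violins alone do not show group sizes or means, so a small summary table can help back up visual impressions like the ones above. The sketch below shows one possible way to compute it; `toy_summary` is a hypothetical stand-in for `food_wrangled` with made-up values, and the same pipeline could be applied to `income` or `ideal_diet_coded`.

```{r}
library(dplyr)

# Hypothetical stand-in for `food_wrangled` (made-up values)
toy_summary <- data.frame(
  father_education = c("high school degree", "high school degree",
                       "college degree", "college degree", "college degree",
                       "graduate degree"),
  GPA = c(3.0, 3.4, 3.6, 3.8, 3.7, 3.5)
)

# Number of students, mean GPA and median GPA per category
gpa_summary <- toy_summary %>%
  group_by(father_education) %>%
  summarise(
    n          = n(),
    mean_GPA   = mean(GPA),
    median_GPA = median(GPA),
    .groups    = "drop"
  ) %>%
  arrange(desc(mean_GPA))

gpa_summary
```

Categories with a very small `n` deserve caution, since a violin drawn from a handful of points says little about the underlying distribution.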

To obtain more reliable results, we could consider the following:

- Gathering more survey responses from other students, so that the distribution shown in each violin is estimated from more data
- Segmenting students when comparing them, e.g. male vs. female
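
As a rough sketch of the segmentation idea, the violin plots could be split into one panel per segment with `facet_wrap()`. The data below are simulated purely for illustration: the `gender` column name, the category labels and all values in `toy_segments` are assumptions, not taken from the survey.

```{r}
library(ggplot2)

# Hypothetical data: `gender` and all values are simulated for illustration
set.seed(1)
toy_segments <- data.frame(
  income = rep(c("lower", "higher"), each = 20),
  gender = rep(c("female", "male"), times = 20),
  GPA    = pmin(4, rnorm(40, mean = 3.4, sd = 0.3))
)

p_segments <- ggplot(toy_segments, aes(income, GPA)) +
  geom_violin(fill = "#E69F00") +
  facet_wrap(~ gender) +  # one panel per segment
  theme_bw(12)

p_segments
```

Splitting the data this way reduces the number of responses behind each violin, so this refinement would benefit from a larger sample.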