---
title: "Data Preparation and Transformation - 2"

author: "Chiranjit Dutta"
date: "7/25/21"
output: 
   html_document:
    df_print: paged
    toc: yes
    number_sections: yes
---


# Using apply() family of functions

In this tutorial, we use the Flags dataset from the UCI Machine Learning Repository, in which each observation is a country and each variable describes a characteristic of that country or its flag. More information is available at http://archive.ics.uci.edu/ml/datasets/Flags


```{r}
flags <- read.csv("C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/flags.csv")

head(flags)
```

```{r}
# Dimension of flags data frame:
dim(flags)
```

The apply() function applies a function over the rows or columns (the margins) of a matrix or data frame.
The syntax is apply(X, MARGIN, FUN), where:

- X is the matrix, data frame, or array.
- MARGIN is a vector giving the subscripts over which the function is applied. For a matrix, 1 indicates rows, 2 indicates columns, and c(1, 2) indicates both rows and columns (that is, every cell).
- FUN is the function to be applied.
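
As a minimal sketch of how MARGIN behaves, the chunk below applies sum() to a small matrix. The name toy_mat is illustrative and not part of the Flags data.

```{r}
# 2 x 3 matrix filled column by column: rows are (1, 3, 5) and (2, 4, 6)
toy_mat <- matrix(1:6, nrow = 2)

row_totals <- apply(toy_mat, 1, sum)  # MARGIN = 1: one result per row
col_totals <- apply(toy_mat, 2, sum)  # MARGIN = 2: one result per column

row_totals  # 9 12
col_totals  # 3 7 11
```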

```{r}
# Create a new data frame containing only the color data of the flags 
flag_colors <- flags[, 11:17]
head(flag_colors)
```

$\textbf{Find the total number of countries/flags that contain each color using apply()}.$

```{r}
# Find total number of flags/countries having each unique color
apply(flag_colors,2,sum)
```
## lapply()

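The lapply() function takes a list as input, applies a function to each element of the
list, and returns a list of the same length as the original. A data frame is a list of columns, so calling lapply() on a data frame applies the function column by column.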
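
A small sketch of this behaviour on a hand-made list may help. The name toy_list and its contents are made up for illustration.

```{r}
# A list with three elements of different lengths
toy_list <- list(a = 1:3, b = c(2.5, 4.5), c = 10)

# lapply() returns a list with one mean per element, keeping the names
toy_means <- lapply(toy_list, mean)

length(toy_means)  # 3, the same length as toy_list
str(toy_means)
```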

$\textbf{Find the class of each column in the flags data frame using lapply()}.$

```{r}
# Class of each column in flags
cls_list <- lapply(flags, class)
class(cls_list)
```
```{r}
as.character(cls_list)
```
## sapply()

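The sapply() function works like lapply(), but it tries to simplify the result to a vector or matrix instead of returning a list.

$\textbf{Find the proportion of flags (out of 194) containing each color using sapply()}.$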

```{r}
# The output is a vector
sapply(flag_colors, sum)/nrow(flag_colors)
```
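To make the difference from lapply() concrete, the sketch below runs both functions on the same toy data frame. The name toy_cols and its values are assumptions chosen for illustration.

```{r}
# Two 0/1 indicator columns, as in flag_colors
toy_cols <- data.frame(red = c(1, 0, 1, 1), green = c(0, 0, 1, 0))

as_list   <- lapply(toy_cols, mean)  # a list with one element per column
as_vector <- sapply(toy_cols, mean)  # simplified to a named numeric vector

class(as_list)  # "list"
as_vector       # red = 0.75, green = 0.25
```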
## tapply()

The 'landmass' variable in our dataset takes on integer values between 1 and 6, each of
which represents a different part of the world. Use table(flags$landmass) to see how many flags/countries fall into each group.

```{r}
# How many flags/countries fall into each group?
table(flags$landmass)
```

The 'animate' variable in our dataset takes the value 1 if a country's flag contains an
 animate image (e.g. an eagle, a tree, a human hand) and 0 otherwise.

```{r}
# How many flags do and do not contain an animate image?
table(flags$animate)
```

If you take the arithmetic mean of a bunch of 0s and 1s, you get the proportion of 1s. Use
tapply() to apply the mean function to the 'animate' variable separately for each of the six landmass groups, thus giving us the proportion of flags containing an animate image within each landmass group.

$\textbf{Find the proportion of flags containing an animate image within each landmass group using tapply().}$
```{r}
tapply(flags$animate, flags$landmass, mean)
```
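The same grouping logic can be checked by hand on a tiny example. The vectors toy_animate and toy_region below are invented and only stand in for the animate and landmass columns.

```{r}
# Six made-up flags in two regions; 1 means the flag has an animate image
toy_animate <- c(1, 0, 1, 1, 0, 0)
toy_region  <- c("A", "A", "A", "B", "B", "B")

# Mean of 0/1 values within each group = proportion of 1s in that group
toy_props <- tapply(toy_animate, toy_region, mean)
toy_props  # A = 2/3, B = 1/3
```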

# Some useful functions from the tidyr package

```{r}
library(tidyr)
```

## spread()

The syntax is: spread(data, key, value), where

- data is your data frame of interest.
- key is the column whose values will become variable names.
- value is the column where values will fill in under the new variables created from key.

```{r}
# Take a look at the pre-loaded data table2 in tidyr package
table2
```

We can use the spread() function to turn the values in the 'type' column into their own columns:

```{r}
spread(table2, key=type, value=count)
```
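In tidyr 1.0.0 and later, spread() is marked as superseded by pivot_wider(). Assuming such a version is installed, the sketch below should produce the same wide table; the name table2_wide is illustrative.

```{r}
library(tidyr)

# names_from plays the role of key, values_from the role of value
table2_wide <- pivot_wider(table2, names_from = type, values_from = count)
table2_wide
```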

## gather()

gather() does the reverse of spread(): it collects a set of columns, places their names into a single “key” column, and places their values into a single “value” column.

```{r}
# Take a look at the pre-loaded data table4a in tidyr package
table4a
```
```{r}
gather(table4a, "year", "cases", 2:3)
```
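Likewise, gather() is superseded by pivot_longer() in tidyr 1.0.0 and later. The following sketch, which assumes such a version, reshapes table4a in the same way; table4a_long is an illustrative name.

```{r}
library(tidyr)

# The year columns have non-syntactic names, so they are quoted with backticks
table4a_long <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)
table4a_long
```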

## separate()

separate() turns a single character column into multiple columns by splitting the values of the column wherever a separator character appears.

```{r}
# Take a look at the pre-loaded data table3 in tidyr package
table3
```

```{r}
separate(table3, rate, into = c("cases", "population"),sep = "/")
```
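By default, the new columns produced by separate() are character columns. As a hedged sketch, the convert = TRUE argument asks separate() to convert them to a more suitable type; table3_num is an illustrative name.

```{r}
library(tidyr)

# convert = TRUE runs type conversion on the new columns after splitting
table3_num <- separate(table3, rate, into = c("cases", "population"),
                       sep = "/", convert = TRUE)

# Inspect the resulting column types
sapply(table3_num, class)
```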

## unite()

unite() does the opposite of separate(). It combines multiple columns into a single column.

```{r}
# Take a look at the pre-loaded data table5 in tidyr package
table5
```


```{r}
unite(table5, "new", century, year, sep = "")
```

# Random Sampling

The sample() function in R allows you to take a random sample of elements from a dataset or a vector, either with or without replacement. The basic syntax for the sample() function is as follows:

- sample(x, size, replace = FALSE, prob = NULL)
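
The prob argument is not used in the examples that follow, so here is a minimal sketch of it: a biased coin where heads is four times as likely as tails. The name coin_flips and the seed value are arbitrary choices.

```{r}
set.seed(42)

# prob gives one weight per element of x; replace = TRUE allows repeated draws
coin_flips <- sample(c("H", "T"), size = 10, replace = TRUE, prob = c(0.8, 0.2))

table(coin_flips)
```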

## Sampling with replacement

```{r}
#define vector a with 10 elements in it
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
```

```{r}
# set.seed() fixes the random number generator state so that we get the same sample each time
set.seed(123)
# generate a random sample of 8 elements from vector a, with replacement
sample(x = a, size = 8, replace = TRUE)
```

## Sampling without replacement

```{r}
a <- 1:20
sample(x = a,size = 10,replace = FALSE)
```

## Split data into train and test set

In data science, we often need to split the original data into a training set and a test set.

```{r}
# Load Boston Housing data from the R package 'MASS'
library(MASS)
data("Boston")
```

```{r}
# Check the dimension of the data set:
dim(Boston)
```
We want to build a training set from a random sample of 70\% of the rows of the original data set. The remaining rows form the test set.

```{r}
set.seed(1234)
# floor() makes the sample size a whole number of rows
train_id <- sample(x = seq_len(nrow(Boston)), size = floor(0.7 * nrow(Boston)), replace = FALSE)
train_set <- Boston[train_id,]
head(train_set)
```


```{r}
all_rows <- 1:nrow(Boston)
test_id <- all_rows[-train_id]
test_set <- Boston[test_id,]
head(test_set)
```
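A quick sanity check on any split is that the two sets do not overlap and together cover every row. The sketch below shows such a check on the built-in mtcars data; all names with the toy_ prefix are illustrative, and negative indexing is used as a shorter alternative to building the test indices explicitly.

```{r}
set.seed(2021)
toy_n <- nrow(mtcars)

toy_train_id <- sample(seq_len(toy_n), size = floor(0.7 * toy_n))
toy_train <- mtcars[toy_train_id, ]
toy_test  <- mtcars[-toy_train_id, ]  # every row that is not in the training set

# No row appears in both sets, and no row is lost
length(intersect(rownames(toy_train), rownames(toy_test)))  # 0
nrow(toy_train) + nrow(toy_test) == toy_n                   # TRUE
```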

# Summarizing missing values: 

We use the sleep data from the 'VIM' package to illustrate missing data analysis. Some useful functions used in this section are described below.

- complete.cases() returns a logical vector that is TRUE for rows with no missing values.
- which() returns the positions (i.e., row number/column number/array index) of the elements of a logical vector that are TRUE.
- is.na() returns a logical object of the same shape as its input, with TRUE wherever a value is NA and FALSE elsewhere.
- colSums() returns the sum of each column.
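
Before turning to the sleep data, the sketch below shows what each of these functions returns on a three-row data frame. The data frame toy_na is made up for illustration.

```{r}
toy_na <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

is.na(toy_na)                    # logical matrix, TRUE where a value is missing
colSums(is.na(toy_na))           # number of NAs per column: x = 1, y = 1
complete.cases(toy_na)           # TRUE FALSE FALSE
which(!complete.cases(toy_na))   # rows with at least one NA: 2 3
```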

```{r}
# Load the package:
library(VIM)

# Load the data:
data("sleep")

head(sleep)
```

```{r}
# Check the dimensions
dim(sleep)
```

```{r}
# Which matrix indices have NAs?
which(is.na(sleep),arr.ind = TRUE)
```

```{r}
#  Count NA Values in All Data Frame Columns
colSums(is.na(sleep))
```

```{r}
# Omit the rows having NA:
data_without_NA <- sleep[complete.cases(sleep),]
head(data_without_NA)
```

```{r}
dim(data_without_NA)
```

# Imputing missing values using simple methods:

We continue our discussion using the sleep data from the VIM package.

```{r}
summary(sleep)
```

## Missing values imputation using mean:

```{r}
# Simple imputation using mean:
sleep_df_imputed_mean <- as.data.frame(lapply(sleep, function(x) { 
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}))

sleep_df_imputed_mean
```

## Missing values imputation using median:

```{r}
# Simple imputation using median:
sleep_df_imputed_median <- as.data.frame(lapply(sleep, function(x) { 
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}))

sleep_df_imputed_median
```
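The two chunks above differ only in the summary function. One possible way to avoid the repetition is a small helper, sketched below under the assumption that only numeric columns should be imputed; impute_with and toy_scores are hypothetical names, not part of any package.

```{r}
# Hypothetical helper: fill NAs in a numeric vector with a summary statistic
impute_with <- function(x, fun = mean) {
  if (is.numeric(x)) {
    x[is.na(x)] <- fun(x, na.rm = TRUE)
  }
  x  # non-numeric input is returned unchanged
}

toy_scores <- c(2, NA, 4, NA, 9)
impute_with(toy_scores, mean)    # 2 5 4 5 9
impute_with(toy_scores, median)  # 2 4 4 4 9
```

With this helper, a call such as as.data.frame(lapply(sleep, impute_with, fun = median)) would be one way to impute every column at once.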


# Working with strings:

## paste()

- paste() takes one or more R objects, converts them to “character”, and then it concatenates (pastes) them to form one or several character strings.

```{r}
paste("Everybody", "loves", "R Programming.")
paste("Everybody", "loves", "R Programming.", sep="*")
```

```{r}
a <- c("something", "to", "paste")
paste(a, collapse="_") 
```
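paste() is also vectorized: when given vectors, it pastes them element by element, recycling shorter ones. The sketch below illustrates this together with paste0(), which is paste() with sep = "". The variable names are illustrative.

```{r}
# The single string "flag_" is recycled across 1:3
flag_ids <- paste0("flag_", 1:3)
flag_ids  # "flag_1" "flag_2" "flag_3"

# Element-wise pasting of two vectors of equal length
pairs <- paste(c("a", "b"), c("x", "y"), sep = "-")
pairs     # "a-x" "b-y"
```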
## strsplit()

- strsplit() splits each element of a character vector at a given separator and returns a list with one character vector per input element.

```{r}
z <- "Alabama-Alaska-Arizona-Arkansas-California"
strsplit(z, split = "-")
```
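Two details are easy to miss: the result is a list even for a single input string, and split is treated as a regular expression unless fixed = TRUE is set. The sketch below, with illustrative variable names, shows both.

```{r}
version_string <- "v.4.2.0"

# [[1]] extracts the character vector for the first (and only) input string
parts_fixed <- strsplit(version_string, split = ".", fixed = TRUE)[[1]]
parts_fixed  # "v" "4" "2" "0"

# Without fixed = TRUE, "." is a regex matching any character,
# so every character is a separator and only empty strings remain
parts_regex <- strsplit(version_string, split = ".")[[1]]
parts_regex
```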
## Dealing with regular expressions (regex):

Metacharacters are non-alphanumeric symbols that have a special meaning in a regular expression, such as:

.    \\    |    (    )    [    {    \^    $    *    +    ?

To match a metacharacter literally in R, escape it with a double backslash (“\\\\”).

- The sub() function replaces only the first match of the pattern in each string and leaves any later matches unchanged.
- The gsub() function replaces every match of the pattern in each string.
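
The difference between the two functions is easiest to see on a string that contains the pattern more than once. The string below is an arbitrary example.

```{r}
fruit <- "banana"

first_only <- sub(pattern = "an", replacement = "AN", x = fruit)
all_hits   <- gsub(pattern = "an", replacement = "AN", x = fruit)

first_only  # "bANana": only the first "an" is replaced
all_hits    # "bANANa": both occurrences are replaced
```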

```{r}
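# substitute $ with ! ($ is a metacharacter, so it is escaped in the pattern;
# the replacement string needs no escaping)
sub(pattern = "\\$", replacement = "!", x = "I love R$")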
```

```{r}
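# substitute every backslash with a space
# (in an R string literal "\\" is one backslash, so the regex "\\\\" matches a single backslash)
gsub(pattern = "\\\\", replacement = " ", x = "I\\need\\space")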
```
```{r}
x <- c("RStudio", "v.4.2.0", "2022", "04-22-2022")

# find any strings that contain at least one digit (0-9)
grep(pattern = "[0-9]", x, value = TRUE)
```

```{r}
# find any strings with the character R or r
grep(pattern = "[Rr]", x, value = TRUE)
```

```{r}
# find any strings that have non-alphanumeric characters
grep(pattern = "[^0-9a-zA-Z]", x, value = TRUE)
```
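When a TRUE/FALSE answer per string is more convenient than the matching values, grepl() can be used with the same patterns. The sketch below repeats the example vector under the illustrative name x_demo and also shows why a literal dot must be escaped.

```{r}
x_demo <- c("RStudio", "v.4.2.0", "2022", "04-22-2022")

has_digit <- grepl(pattern = "[0-9]", x_demo)
has_digit  # FALSE TRUE TRUE TRUE

# "\\." matches a literal dot; an unescaped "." would match any character
has_dot <- grepl(pattern = "\\.", x_demo)
has_dot    # FALSE TRUE FALSE FALSE
```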

# Additional useful functions:

## match()

match() returns a vector of the positions of (first) matches of its first argument in its second.

```{r}
rv <- c(11, 12, 19, 22, 25)
s_rv <- c(19, 21, 11, 18, 46)

match(rv, s_rv)
```
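Elements with no match are reported as NA. When only membership matters, the %in% operator gives a logical vector instead of positions. The vectors below are shortened, illustrative versions of the ones above.

```{r}
wanted <- c(11, 12, 19)
pool   <- c(19, 21, 11)

positions <- match(wanted, pool)
positions  # 3 NA 1: 11 is at position 3, 12 is absent, 19 is at position 1

is_member <- wanted %in% pool
is_member  # TRUE FALSE TRUE
```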

## intersect()

intersect() returns the intersection of two vectors, that is, the unique values that appear in both.

```{r}
# Vector 1 (contains repeated values)
x1 <- c(1, 2, 3, 4, 5, 6, 5, 5)

# Vector 2
x2 <- 2:4

# Intersection of the two vectors
x3 <- intersect(x1, x2)

x3
```
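Base R provides two companion set functions, union() and setdiff(), that follow the same pattern and also drop duplicates. The brief sketch below uses made-up vectors set_a and set_b.

```{r}
set_a <- c(1, 2, 2, 3)
set_b <- c(3, 4)

in_either <- union(set_a, set_b)      # values in set_a or set_b
only_in_a <- setdiff(set_a, set_b)    # values in set_a that are not in set_b
in_both   <- intersect(set_a, set_b)  # values in both

in_either  # 1 2 3 4
only_in_a  # 1 2
in_both    # 3
```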