1 Using the apply() family of functions

In this tutorial, we’ll use the Flags dataset from the UCI Machine Learning Repository, where each observation is a country and each variable describes some characteristic of that country or its flag. More information may be found here: http://archive.ics.uci.edu/ml/datasets/Flags

flags <- read.csv("C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/flags.csv")

head(flags)

# Dimension of flags data frame:
dim(flags)

## [1] 194  30

The apply() function is most often used to apply a function to the rows or columns (margins) of a matrix or data frame. The syntax is apply(X, MARGIN, FUN, ...), where:

X is the matrix, data frame, or array.
MARGIN is a vector giving the subscripts over which the function will be applied. For a matrix, 1 indicates rows, 2 indicates columns, and c(1, 2) indicates rows and columns.
FUN is the function to be applied.

# Create a new data frame containing only the color data of the flags 
flag_colors <- flags[, 11:17]
head(flag_colors)

$\textbf{Find the total number of countries/flags containing each color using apply()}.$

# Find total number of flags/countries having each unique color
apply(flag_colors,2,sum)

##    red  green   blue   gold  white  black orange 
##    153     91     99     91    146     52     26
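To make the MARGIN argument concrete, here is a minimal sketch on a small made-up matrix. The matrix `m` and its row and column names are hypothetical and are not part of the Flags data.

```r
# Hypothetical 2 x 3 matrix used only for illustration
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

row_totals <- apply(m, 1, sum)  # MARGIN = 1: one result per row
col_totals <- apply(m, 2, sum)  # MARGIN = 2: one result per column

row_totals  # r1 = 9, r2 = 12
col_totals  # a = 3, b = 7, c = 11
```

For plain sums and means, the base R functions rowSums(), colSums(), rowMeans(), and colMeans() give the same results and are usually faster.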

1.1 lapply()

The lapply() function takes a list as input, applies a function to each element of the list, then returns a list of the same length as the original one.

$\textbf{We want to find the class/type of each column in the flags data frame using lapply()}.$

# Class of each column in flags
cls_list <- lapply(flags, class)
class(cls_list)

## [1] "list"

as.character(cls_list)

##  [1] "character" "integer"   "integer"   "integer"   "integer"   "integer"  
##  [7] "integer"   "integer"   "integer"   "integer"   "integer"   "integer"  
## [13] "integer"   "integer"   "integer"   "integer"   "integer"   "character"
## [19] "integer"   "integer"   "integer"   "integer"   "integer"   "integer"  
## [25] "integer"   "integer"   "integer"   "integer"   "character" "character"
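A data frame is itself a list of columns, which is why lapply(flags, class) visits one column at a time. The same behavior can be sketched on an ordinary list; the list `scores` below is hypothetical.

```r
# Hypothetical list of numeric vectors of different lengths
scores <- list(math = c(90, 80), art = c(70, 75, 95), music = 88)

# lapply() always returns a list with one element per input element
score_means <- lapply(scores, mean)

class(score_means)   # "list"
length(score_means)  # 3
score_means$art      # 80
```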

1.2 sapply()

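sapply() works like lapply(), but it tries to simplify the result to a vector or matrix instead of returning a list.

$\textbf{We want to find the proportion of flags (out of 194) containing each color using sapply()}.$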

# The output is a vector
sapply(flag_colors, sum)/nrow(flag_colors)

##       red     green      blue      gold     white     black    orange 
## 0.7886598 0.4690722 0.5103093 0.4690722 0.7525773 0.2680412 0.1340206
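The difference in return type between lapply() and sapply() can be seen in a small sketch. The data frame `toy_colors` is a hypothetical stand-in for flag_colors, and vapply() is shown only as a stricter alternative that is not used elsewhere in this tutorial.

```r
# Hypothetical 0/1 indicator data frame standing in for flag_colors
toy_colors <- data.frame(red = c(1, 0, 1, 1), blue = c(0, 0, 1, 0))

as_list   <- lapply(toy_colors, sum)  # list of length 2
as_vector <- sapply(toy_colors, sum)  # named numeric vector

# vapply() is a stricter variant: the expected result type is declared
checked <- vapply(toy_colors, sum, FUN.VALUE = numeric(1))

# Proportion of 1s per column: red = 0.75, blue = 0.25
props <- as_vector / nrow(toy_colors)
props
```

Because the columns hold only 0s and 1s, colMeans(toy_colors) returns the same proportions.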

1.3 tapply()

The ‘landmass’ variable in our dataset takes on integer values between 1 and 6, each of which represents a different part of the world. Use table(flags$landmass) to see how many flags/countries fall into each group.

# How many flags/countries fall into each group?
table(flags$landmass)

## 
##  1  2  3  4  5  6 
## 31 17 35 52 39 20

The ‘animate’ variable in our dataset takes the value 1 if a country’s flag contains an animate image (e.g. an eagle, a tree, a human hand) and 0 otherwise.

# How many flags/countries fall into each group?
table(flags$animate)

## 
##   0   1 
## 155  39

If you take the arithmetic mean of a bunch of 0s and 1s, you get the proportion of 1s. Use tapply() to apply the mean function to the ‘animate’ variable separately for each of the six landmass groups, thus giving us the proportion of flags containing an animate image within each landmass group.

$\textbf{Find the proportion of flags containing an animate image within each landmass group using tapply().}$

tapply(flags$animate, flags$landmass, mean)

##         1         2         3         4         5         6 
## 0.4193548 0.1764706 0.1142857 0.1346154 0.1538462 0.3000000
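The idea that a group-wise mean of 0s and 1s is a group-wise proportion can be checked by hand on a tiny example. The vectors `has_animal` and `region` below are hypothetical.

```r
# Hypothetical data: a 0/1 indicator and a grouping variable
has_animal <- c(1, 0, 0, 1, 1, 0)
region     <- c("A", "A", "B", "B", "B", "C")

# Mean of the 0/1 values within each group = proportion of 1s in that group
prop_by_region <- tapply(has_animal, region, mean)
prop_by_region  # A = 1/2, B = 2/3, C = 0
```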

2 Some useful functions from the tidyr package

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.2.1

2.1 spread()

The syntax is: spread(data, key, value), where

data is your dataframe of interest.
key is the column whose values will become variable names.
value is the column where values will fill in under the new variables created from key.

# Take a look at the pre-loaded data table2 in tidyr package
table2

We can use the spread() function to turn the values in the ‘type’ column into their own columns:

spread(table2, key=type, value=count)

2.2 gather()

gather() does the reverse of spread(). It collects a set of columns, placing their names into a single “key” column and their values into a single “value” column.

# Take a look at the pre-loaded data table4a in tidyr package
table4a

gather(table4a, "year", "cases", 2:3)
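As a sketch of how gather() and spread() undo each other, the example below converts a small table to long format and back. The data frame `wide` and its values are made up and are not taken from table4a. The tidyr documentation marks both functions as superseded by pivot_longer() and pivot_wider(); they still work, which is all this sketch assumes.

```r
library(tidyr)

# Hypothetical wide table: one column of counts per year
wide <- data.frame(country = c("A", "B", "C"),
                   `1999`  = c(10, 20, 30),
                   `2000`  = c(11, 21, 31),
                   check.names = FALSE)

# Wide to long: column names go to "year", cell values go to "cases"
long <- gather(wide, key = "year", value = "cases", 2:3)
dim(long)  # 6 rows, 3 columns

# Long to wide again: the values of "year" become column names
back <- spread(long, key = year, value = cases)
back
```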

2.3 separate()

separate() turns a single character column into multiple columns by splitting the values of the column wherever a separator character appears.

# Take a look at the pre-loaded data table3 in tidyr package
table3

separate(table3, rate, into = c("cases", "population"),sep = "/")

2.4 unite()

unite() does the opposite of separate(). It combines multiple columns into a single column.

# Take a look at the pre-loaded data table5 in tidyr package
table5

unite(table5, "new", century, year, sep = "")
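One practical detail is that separate() leaves the new columns as character by default, so arithmetic on them needs a conversion step. A minimal sketch, using a hypothetical data frame `packed` and the convert argument of separate():

```r
library(tidyr)

# Hypothetical table with two numbers packed into one character column
packed <- data.frame(id = 1:2, rate = c("745/19987071", "37737/172006362"))

# convert = TRUE turns the new columns into integers instead of character
parts <- separate(packed, rate, into = c("cases", "population"),
                  sep = "/", convert = TRUE)
class(parts$cases)              # "integer"
parts$cases / parts$population  # arithmetic now works

# unite() pastes the two columns back together into the original strings
repacked <- unite(parts, "rate", cases, population, sep = "/")
repacked
```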

3 Random Sampling

The sample() function in R allows you to take a random sample of elements from a dataset or a vector, either with or without replacement. The basic syntax for the sample() function is as follows:

sample(x, size, replace = FALSE, prob = NULL)
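Here x is the vector to sample from, size is the number of items to draw, replace controls whether an item can be drawn more than once, and prob is an optional vector of sampling weights. The sketch below illustrates the defaults and the prob argument; the seed value and the coin-flip example are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed, chosen only for reproducibility

# size defaults to length(x): this returns a random permutation of 1:5
perm <- sample(1:5)

# prob gives sampling weights: "H" is drawn with probability 0.9 each time
flips <- sample(c("H", "T"), size = 20, replace = TRUE, prob = c(0.9, 0.1))
table(flips)

# Without replacement, size cannot exceed length(x), so this call fails
too_many <- try(sample(1:5, size = 6), silent = TRUE)
inherits(too_many, "try-error")  # TRUE
```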

3.1 Sampling with replacement

#define vector a with 10 elements in it
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

#set.seed(some random number) to ensure that we get the same sample each time
set.seed(123)
#generate random sample of 8 elements from vector a, with replacement
sample(x=a, size = 8,replace = TRUE)

## [1]  3  3 10  2  6  5  4  6

3.2 Sampling without replacement

a <- 1:20
sample(x = a,size = 10,replace = FALSE)

##  [1]  5 19  9  3  8 10  7 15 18 17

3.3 Split data into train and test set

In data science, we often need to split the original data into a training set and a test set.

# Load Boston Housing data from the R package 'MASS'
library(MASS)
data("Boston")

# Check the dimension of the data set:
dim(Boston)

## [1] 506  14

We want to generate a training set containing a random sample of 70% of the rows of the original data set. The remaining rows form the test set.

set.seed(1234)
# 0.7 * 506 = 354.2 is not a whole number, so round down explicitly
train_id <- sample(x = 1:nrow(Boston), size = floor(0.7 * nrow(Boston)), replace = FALSE)
train_set <- Boston[train_id,]
head(train_set)

all_rows <- 1:nrow(Boston)
test_id <- all_rows[-train_id]
test_set <- Boston[test_id,]
head(test_set)
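After a split like this, it is worth checking that the two sets do not overlap and that together they cover every row. A minimal sketch of that check follows; it uses the built-in mtcars data as a stand-in for Boston so that it runs without extra packages, and setdiff() as one alternative way to obtain the test indices.

```r
# mtcars is a built-in data set, used here as a stand-in for Boston
set.seed(1234)
n        <- nrow(mtcars)
train_id <- sample(seq_len(n), size = floor(0.7 * n), replace = FALSE)
test_id  <- setdiff(seq_len(n), train_id)  # every row not in the training set

train_set <- mtcars[train_id, ]
test_set  <- mtcars[test_id, ]

# The two sets should not overlap and should together cover every row
length(intersect(train_id, test_id)) == 0
nrow(train_set) + nrow(test_set) == n
```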

4 Summarizing missing values:

We use the sleep data from the ‘VIM’ package to illustrate missing data analysis. Some useful functions used in this section are described below.

complete.cases() returns a logical vector identifying which rows are complete cases, i.e., rows with no missing values.
which() returns the positions (i.e., row number/column number/array index) of the elements of a logical vector that are TRUE.
is.na() returns TRUE for each value that is NA and FALSE for each value that is not.
colSums() returns the sum of each column.
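Before applying these functions to the sleep data, here is a minimal sketch of what each one returns on a small made-up data frame. The data frame `toy` is hypothetical.

```r
# Hypothetical data frame with a few missing values
toy <- data.frame(x = c(1, NA, 3, 4),
                  y = c("a", "b", NA, "d"),
                  z = c(10, 20, 30, 40))

na_flags    <- is.na(toy)            # logical matrix, TRUE where a value is missing
na_per_col  <- colSums(na_flags)     # NAs per column: x = 1, y = 1, z = 0
complete    <- complete.cases(toy)   # TRUE FALSE FALSE TRUE
rows_with_na <- which(!complete)     # rows with at least one NA: 2 3

na_per_col
rows_with_na
```

colSums() works on the logical matrix because TRUE counts as 1 and FALSE as 0.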

# Load the package:
library(VIM)

## Warning: package 'VIM' was built under R version 4.2.1

## Loading required package: colorspace

## Loading required package: grid

## VIM is ready to use.

## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

## 
## Attaching package: 'VIM'

## The following object is masked from 'package:datasets':
## 
##     sleep

# Load the data:
data("sleep")

head(sleep)

# Check the dimensions
dim(sleep)

## [1] 62 10

# Which matrix indices have NAs?
which(is.na(sleep),arr.ind = TRUE)

##       row col
##  [1,]   1   3
##  [2,]   3   3
##  [3,]   4   3
##  [4,]  14   3
##  [5,]  21   3
##  [6,]  24   3
##  [7,]  26   3
##  [8,]  30   3
##  [9,]  31   3
## [10,]  41   3
## [11,]  47   3
## [12,]  53   3
## [13,]  55   3
## [14,]  62   3
## [15,]   1   4
## [16,]   3   4
## [17,]   4   4
## [18,]  14   4
## [19,]  24   4
## [20,]  26   4
## [21,]  30   4
## [22,]  31   4
## [23,]  47   4
## [24,]  53   4
## [25,]  55   4
## [26,]  62   4
## [27,]  21   5
## [28,]  31   5
## [29,]  41   5
## [30,]  62   5
## [31,]   4   6
## [32,]  13   6
## [33,]  35   6
## [34,]  36   6
## [35,]  13   7
## [36,]  19   7
## [37,]  20   7
## [38,]  56   7

# Count NA values in each column of the data frame
colSums(is.na(sleep))

##  BodyWgt BrainWgt     NonD    Dream    Sleep     Span     Gest     Pred 
##        0        0       14       12        4        4        4        0 
##      Exp   Danger 
##        0        0

# Omit the rows having NA:
data_without_NA <- sleep[complete.cases(sleep),]
head(data_without_NA)

dim(data_without_NA)

## [1] 42 10

5 Imputing missing values using simple methods:

We continue our discussion using the sleep data from the VIM package.

summary(sleep)

##     BodyWgt            BrainWgt            NonD            Dream      
##  Min.   :   0.005   Min.   :   0.14   Min.   : 2.100   Min.   :0.000  
##  1st Qu.:   0.600   1st Qu.:   4.25   1st Qu.: 6.250   1st Qu.:0.900  
##  Median :   3.342   Median :  17.25   Median : 8.350   Median :1.800  
##  Mean   : 198.790   Mean   : 283.13   Mean   : 8.673   Mean   :1.972  
##  3rd Qu.:  48.202   3rd Qu.: 166.00   3rd Qu.:11.000   3rd Qu.:2.550  
##  Max.   :6654.000   Max.   :5712.00   Max.   :17.900   Max.   :6.600  
##                                       NA's   :14       NA's   :12     
##      Sleep            Span              Gest             Pred      
##  Min.   : 2.60   Min.   :  2.000   Min.   : 12.00   Min.   :1.000  
##  1st Qu.: 8.05   1st Qu.:  6.625   1st Qu.: 35.75   1st Qu.:2.000  
##  Median :10.45   Median : 15.100   Median : 79.00   Median :3.000  
##  Mean   :10.53   Mean   : 19.878   Mean   :142.35   Mean   :2.871  
##  3rd Qu.:13.20   3rd Qu.: 27.750   3rd Qu.:207.50   3rd Qu.:4.000  
##  Max.   :19.90   Max.   :100.000   Max.   :645.00   Max.   :5.000  
##  NA's   :4       NA's   :4         NA's   :4                       
##       Exp            Danger     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :2.000  
##  Mean   :2.419   Mean   :2.613  
##  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000  
##

5.1 Missing values imputation using mean:

# Simple imputation using mean:
sleep_df_imputed_mean <- as.data.frame(lapply(sleep, function(x) { 
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}))

sleep_df_imputed_mean

5.2 Missing values imputation using median:

# Simple imputation using median:
sleep_df_imputed_median <- as.data.frame(lapply(sleep, function(x) { 
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}))

sleep_df_imputed_median
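The choice between mean and median matters most for skewed variables, where the mean is pulled toward extreme values. A minimal sketch on a made-up vector follows; the vector `w` and the helper `impute()` are hypothetical and only mirror the anonymous functions used above.

```r
# Hypothetical right-skewed variable with two missing values
w <- c(1, 2, 3, 4, 100, NA, NA)

# Hypothetical helper: fill NAs with a summary of the observed values
impute <- function(x, fun) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

w_mean   <- impute(w, mean)    # NAs become 22, pulled up by the value 100
w_median <- impute(w, median)  # NAs become 3

w_mean
w_median
```

Both approaches assume numeric columns, which holds for every column of the sleep data.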

6 Working with strings:

6.1 paste()

paste() takes one or more R objects, converts them to “character”, and then it concatenates (pastes) them to form one or several character strings.

paste("Everybody", "loves", "R Programming.")

## [1] "Everybody loves R Programming."

paste("Everybody", "loves", "R Programming.", sep="*")

## [1] "Everybody*loves*R Programming."

a <- c("something", "to", "paste")
paste(a, collapse="_")

## [1] "something_to_paste"

6.2 strsplit()

strsplit() splits each element of a character vector into substrings wherever the split pattern matches, and returns a list with one element per input string.

z <- "Alabama-Alaska-Arizona-Arkansas-California"
strsplit(z, split = "-")

## [[1]]
## [1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
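Because the result is a list, a common follow-up step is to pull one piece out of each element. A minimal sketch, assuming a hypothetical vector `places`; it also shows that split is treated as a regular expression unless fixed = TRUE is set.

```r
# Hypothetical vector of "city, state" strings
places <- c("Austin, Texas", "Boston, Massachusetts")

parts <- strsplit(places, split = ", ")  # list with one element per string
parts[[2]]                               # "Boston" "Massachusetts"

# Extract the first piece of each element
cities <- sapply(parts, `[`, 1)          # "Austin" "Boston"

# split is a regular expression, so a literal "." needs fixed = TRUE
version_parts <- strsplit("v.4.2.0", split = ".", fixed = TRUE)[[1]]
version_parts                            # "v" "4" "2" "0"
```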

6.3 Dealing with regular expressions (regex):

Metacharacters consist of non-alphanumeric symbols such as:

. \ | ( ) [ { $ * + ?

To match a metacharacter literally in R, escape it with a double backslash “\\”.

sub() replaces only the first match of the pattern in each string and leaves any later matches unchanged.
gsub(), on the other hand, replaces every match of the pattern.

# substitute $ with ! ($ is a metacharacter, so it is escaped in the pattern)
sub(pattern = "\\$", replacement = "!", x = "I love R$")

## [1] "I love R!"

# substitute each backslash with a space (a single backslash is written \\ in an R string)
gsub(pattern = "\\\\", " ", "I\\need\\space")

## [1] "I need space"
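The difference between sub() and gsub() is easiest to see when both are applied to the same string. The sketch below uses made-up strings and also shows fixed = TRUE as an alternative to escaping.

```r
d <- "04-22-2022"

first_only <- sub("-", "/", d)   # "04/22-2022": only the first match is replaced
all_of_them <- gsub("-", "/", d) # "04/22/2022": every match is replaced

# fixed = TRUE treats the pattern as a literal string, so no escaping is needed
literal <- gsub("$", "USD", "cost: $5", fixed = TRUE)  # "cost: USD5"
escaped <- gsub("\\$", "USD", "cost: $5")              # same result via an escaped regex
```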

x <- c("RStudio", "v.4.2.0", "2022", "04-22-2022")

# find any strings that contain a digit (0-9)
grep(pattern = "[0-9]", x, value = TRUE)

## [1] "v.4.2.0"    "2022"       "04-22-2022"

# find any strings with the character R or r
grep(pattern = "[Rr]", x, value = TRUE)

## [1] "RStudio"

# find any strings that have non-alphanumeric characters
grep(pattern = "[^0-9a-zA-Z]", x, value = TRUE)

## [1] "v.4.2.0"    "04-22-2022"
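With value = TRUE, grep() returns the matching strings themselves. Two related forms are often useful: grep() without value returns the positions of the matches, and grepl() returns a logical vector that can be used for subsetting. A short sketch on the same vector:

```r
x <- c("RStudio", "v.4.2.0", "2022", "04-22-2022")  # same vector as above

pos     <- grep("[0-9]", x)   # positions of matches: 2 3 4
has_num <- grepl("[0-9]", x)  # FALSE TRUE TRUE TRUE

# ^ and $ anchor the pattern to the start and end of the string
digits_only <- x[grepl("^[0-9]+$", x)]  # "2022"

# ignore.case = TRUE is an alternative to the character class [Rr]
with_r <- grep("r", x, ignore.case = TRUE, value = TRUE)  # "RStudio"
```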

7 Additional useful functions:

7.1 match()

match() returns a vector of the positions of (first) matches of its first argument in its second.

rv <- c(11, 12, 19, 22, 25)
s_rv <- c(19, 21, 11, 18, 46)

match(rv, s_rv)

## [1]  3 NA  1 NA NA
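Two common uses of match() are membership tests and table lookups. The sketch below reuses rv and s_rv from above; the lookup table `lookup` and its contents are hypothetical.

```r
rv   <- c(11, 12, 19, 22, 25)
s_rv <- c(19, 21, 11, 18, 46)

# %in% is a membership test built on match()
is_member <- rv %in% s_rv             # TRUE FALSE TRUE FALSE FALSE
same_test <- !is.na(match(rv, s_rv))  # the same logical vector

# Hypothetical lookup table: match() finds the row for each code
lookup <- data.frame(code = c("a", "b", "c"),
                     name = c("Africa", "Asia", "Europe"))
codes  <- c("b", "a", "c", "b")
found  <- lookup$name[match(codes, lookup$code)]
found  # "Asia" "Africa" "Europe" "Asia"
```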

7.2 intersect()

intersect() returns the elements common to two vectors, with duplicates removed.

# Vector 1
x1 <- c(1, 2, 3, 4, 5, 6, 5, 5)

# Vector 2
x2 <- 2:4

# Intersection of the two vectors
x3 <- intersect(x1, x2)
x3

## [1] 2 3 4
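Since intersect() treats its arguments as sets, repeated values appear only once in the result. When every matching element should be kept, subsetting with %in% is one alternative. A minimal sketch, where the vector `y2` is made up for illustration:

```r
x1 <- c(1, 2, 3, 4, 5, 6, 5, 5)
y2 <- c(5, 5, 2)  # hypothetical second vector containing a repeated value

common   <- intersect(x1, y2)  # 2 5: each common value appears once
keep_all <- x1[x1 %in% y2]     # 2 5 5 5: every matching element of x1 is kept
```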

Data Preparation and Transformation - 2

Chiranjit Dutta

7/25/21