In this tutorial, we’ll use the Flags dataset from the UCI Machine Learning Repository, where each observation is a country and each variable describes some characteristic of that country or its flag. More information may be found here: http://archive.ics.uci.edu/ml/datasets/Flags
flags <- read.csv("C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/flags.csv")
head(flags)
# Dimension of flags data frame:
dim(flags)
## [1] 194 30
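The apply() function is most often used to apply a function to the rows or columns (margins) of matrices or data frames. The syntax is apply(X, MARGIN, FUN), where X is a matrix or data frame, MARGIN is 1 for rows or 2 for columns, and FUN is the function to apply.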
# Create a new data frame containing only the color data of the flags
flag_colors <- flags[, 11:17]
head(flag_colors)
\(\textbf{Find the total number of countries/flags containing each color using apply()}.\)
# Find total number of flags/countries having each unique color
apply(flag_colors,2,sum)
## red green blue gold white black orange
## 153 91 99 91 146 52 26
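To make the role of the margin argument concrete, here is a minimal sketch on a small made-up indicator matrix. The matrix `m`, its row names, and its values are illustrative assumptions and are not taken from the flags data.

```r
# Hypothetical 3 x 2 indicator matrix: rows are flags, columns are colors
m <- matrix(c(1, 0, 1,
              1, 1, 0),
            nrow = 3,
            dimnames = list(c("flag1", "flag2", "flag3"), c("red", "green")))

# MARGIN = 2 applies the function to each column
col_totals <- apply(m, 2, sum)

# MARGIN = 1 applies the function to each row
row_totals <- apply(m, 1, sum)

print(col_totals)
print(row_totals)
```

For plain sums, colSums(m) and rowSums(m) give the same totals; apply() is the more general tool because it accepts any function.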
The lapply() function takes a list as input, applies a function to each element of the list, then returns a list of the same length as the original one.
\(\textbf{We want to find the class/type of each column in the flags data frame using lapply()}.\)
# Class of each column in flags
cls_list <- lapply(flags, class)
class(cls_list)
## [1] "list"
as.character(cls_list)
## [1] "character" "integer" "integer" "integer" "integer" "integer"
## [7] "integer" "integer" "integer" "integer" "integer" "integer"
## [13] "integer" "integer" "integer" "integer" "integer" "character"
## [19] "integer" "integer" "integer" "integer" "integer" "integer"
## [25] "integer" "integer" "integer" "integer" "character" "character"
\(\textbf{We want to find the proportion of flags (out of 194) containing each color using sapply()}.\)
# The output is a vector
sapply(flag_colors, sum)/nrow(flag_colors)
## red green blue gold white black orange
## 0.7886598 0.4690722 0.5103093 0.4690722 0.7525773 0.2680412 0.1340206
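The difference between the two functions is easiest to see side by side: lapply() always returns a list, while sapply() tries to simplify the result to a vector or matrix. The sketch below uses a made-up list called `scores`, which is an assumption for illustration only.

```r
# Hypothetical list of numeric vectors of different lengths
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply() returns a list with one element per input element
as_list <- lapply(scores, mean)

# sapply() simplifies the same result to a named numeric vector
as_vector <- sapply(scores, mean)

print(class(as_list))
print(as_vector)
```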
The ‘landmass’ variable in our dataset takes on integer values between 1 and 6, each of which represents a different part of the world. Use table(flags$landmass) to see how many flags/countries fall into each group.
# How many flags/countries fall into each group?
table(flags$landmass)
##
## 1 2 3 4 5 6
## 31 17 35 52 39 20
The ‘animate’ variable in our dataset takes the value 1 if a country’s flag contains an animate image (e.g. an eagle, a tree, a human hand) and 0 otherwise.
# How many flags/countries fall into each group?
table(flags$animate)
##
## 0 1
## 155 39
If you take the arithmetic mean of a bunch of 0s and 1s, you get the proportion of 1s. Use tapply() to apply the mean function to the ‘animate’ variable separately for each of the six landmass groups, thus giving us the proportion of flags containing an animate image within each landmass group.
\(\textbf{Find the proportion of flags containing an animate image within each landmass group using tapply().}\)
tapply(flags$animate, flags$landmass, mean)
## 1 2 3 4 5 6
## 0.4193548 0.1764706 0.1142857 0.1346154 0.1538462 0.3000000
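The same idea can be checked by hand on a tiny example. This is a sketch with made-up vectors (`has_animal` and `region` are hypothetical names), small enough that the group proportions are easy to verify.

```r
# Hypothetical data: a 0/1 indicator and a grouping variable of equal length
has_animal <- c(1, 0, 0, 1, 1, 0)
region     <- c("north", "north", "south", "south", "south", "north")

# The mean of a 0/1 variable within each group is the proportion of 1s
prop_by_region <- tapply(has_animal, region, mean)
print(prop_by_region)
```

Here "north" contains the values 1, 0, 0 and "south" contains 0, 1, 1, so the proportions are 1/3 and 2/3.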
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.2.1
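The syntax is spread(data, key, value), where data is a data frame, key is the column whose values become the new column names, and value is the column whose values fill those new columns.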
# Take a look at the pre-loaded data table2 in tidyr package
table2
We can use the spread() function to turn the values in the ‘type’ column into their own columns:
spread(table2, key=type, value=count)
gather() does the reverse of spread(). gather() collects a set of column names and places them into a single “key” column.
# Take a look at the pre-loaded data table4a in tidyr package
table4a
gather(table4a, "year", "cases", 2:3)
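Because gather() and spread() are inverses, a round trip should recover the original layout. The sketch below assumes tidyr is installed and uses a made-up data frame `wide`. Recent tidyr versions also offer pivot_longer() and pivot_wider() for the same tasks.

```r
library(tidyr)

# Hypothetical wide table: one column of counts per year
wide <- data.frame(country = c("A", "B"),
                   `1999` = c(10, 20),
                   `2000` = c(15, 25),
                   check.names = FALSE)

# gather() stacks columns 2 and 3 into key/value pairs
long <- gather(wide, "year", "cases", 2:3)

# spread() turns the values of 'year' back into separate columns
wide_again <- spread(long, key = year, value = cases)

print(long)
print(wide_again)
```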
separate() turns a single character column into multiple columns by splitting the values of the column wherever a separator character appears.
# Take a look at the pre-loaded data table3 in tidyr package
table3
separate(table3, rate, into = c("cases", "population"),sep = "/")
unite() does the opposite of separate(). It combines multiple columns into a single column.
# Take a look at the pre-loaded data table5 in tidyr package
table5
unite(table5, "new", century, year, sep = "")
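A quick way to see that separate() and unite() undo each other is a round trip on a small example. This sketch assumes tidyr is installed; the data frame `df` and its values are made up for illustration.

```r
library(tidyr)

# Hypothetical data frame with two values packed into one character column
df <- data.frame(rate = c("745/19987071", "2666/20595360"))

# separate() splits 'rate' at the "/" into two character columns
parts <- separate(df, rate, into = c("cases", "population"), sep = "/")

# unite() pastes the two columns back together with the same separator
joined <- unite(parts, "rate", cases, population, sep = "/")

print(parts)
print(joined)
```

Note that the new columns are character by default, so they usually need as.numeric() or the convert = TRUE argument of separate() before any arithmetic.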
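The sample() function in R allows you to take a random sample of elements from a dataset or a vector, either with or without replacement. The basic syntax is sample(x, size, replace = FALSE), where x is the vector to sample from, size is the number of elements to draw, and replace controls whether sampling is done with replacement.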
#define vector a with 10 elements in it
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
#set.seed(some random number) to ensure that we get the same sample each time
set.seed(123)
#generate random sample of 8 elements from vector a, with replacement
sample(x=a, size = 8,replace = TRUE)
## [1] 3 3 10 2 6 5 4 6
a <- 1:20
sample(x = a,size = 10,replace = FALSE)
## [1] 5 19 9 3 8 10 7 15 18 17
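The practical difference between the two modes can be sketched as follows. The seed value and the vector `x` are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility

x <- 1:5

# Without replacement, every element can appear at most once
no_repl <- sample(x, size = 5, replace = FALSE)

# With replacement, the sample may be larger than the input vector
with_repl <- sample(x, size = 8, replace = TRUE)

# Asking for more elements than exist, without replacement, is an error
too_many <- tryCatch(sample(x, size = 8, replace = FALSE),
                     error = function(e) "error")

print(sort(no_repl))
print(length(with_repl))
print(too_many)
```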
In data science, we often need to split the original data into a training set and a test set.
# Load Boston Housing data from the R package 'MASS'
library(MASS)
data("Boston")
# Check the dimension of the data set:
dim(Boston)
## [1] 506 14
We want to generate a training set containing a random sample of 70% of the rows of the original data set; the remaining rows form the test set.
set.seed(1234)
# floor() makes the integer sample size explicit (0.7 * 506 = 354.2)
train_id <- sample(x = seq_len(nrow(Boston)), size = floor(0.7 * nrow(Boston)), replace = FALSE)
train_set <- Boston[train_id,]
head(train_set)
all_rows <- 1:nrow(Boston)
test_id <- all_rows[-train_id]
test_set <- Boston[test_id,]
head(test_set)
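After a split like this, it is worth confirming that the two sets are disjoint and together cover all rows. The sketch below repeats the procedure on the built-in mtcars data (chosen here only because it needs no extra package) and uses negative indexing as a shorter way to select the test rows.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

n <- nrow(mtcars)                        # mtcars has 32 rows
train_id  <- sample(seq_len(n), size = floor(0.7 * n))
train_set <- mtcars[train_id, ]
test_set  <- mtcars[-train_id, ]         # negative indices drop the training rows

# Sanity checks: sizes add up to n and no row appears in both sets
print(c(train = nrow(train_set), test = nrow(test_set)))
print(length(intersect(rownames(train_set), rownames(test_set))))
```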
We use the sleep data from the ‘VIM’ package to illustrate missing data analysis. Some useful functions for this purpose are demonstrated below.
# Load the package:
library(VIM)
## Warning: package 'VIM' was built under R version 4.2.1
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
# Load the data:
data("sleep")
head(sleep)
# Check the dimensions
dim(sleep)
## [1] 62 10
# Which matrix indices have NAs?
which(is.na(sleep),arr.ind = TRUE)
## row col
## [1,] 1 3
## [2,] 3 3
## [3,] 4 3
## [4,] 14 3
## [5,] 21 3
## [6,] 24 3
## [7,] 26 3
## [8,] 30 3
## [9,] 31 3
## [10,] 41 3
## [11,] 47 3
## [12,] 53 3
## [13,] 55 3
## [14,] 62 3
## [15,] 1 4
## [16,] 3 4
## [17,] 4 4
## [18,] 14 4
## [19,] 24 4
## [20,] 26 4
## [21,] 30 4
## [22,] 31 4
## [23,] 47 4
## [24,] 53 4
## [25,] 55 4
## [26,] 62 4
## [27,] 21 5
## [28,] 31 5
## [29,] 41 5
## [30,] 62 5
## [31,] 4 6
## [32,] 13 6
## [33,] 35 6
## [34,] 36 6
## [35,] 13 7
## [36,] 19 7
## [37,] 20 7
## [38,] 56 7
# Count NA Values in All Data Frame Columns
colSums(is.na(sleep))
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred
## 0 0 14 12 4 4 4 0
## Exp Danger
## 0 0
# Omit the rows having NA:
data_without_NA <- sleep[complete.cases(sleep),]
head(data_without_NA)
dim(data_without_NA)
## [1] 42 10
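The same checks can be tried on a data frame small enough to inspect by eye. This is a sketch on made-up data (the data frame `df` is hypothetical); it also shows na.omit(), a base R function that keeps the same rows as indexing with complete.cases().

```r
# Hypothetical data frame with a few missing values
df <- data.frame(x = c(1, NA, 3, 4),
                 y = c("a", "b", NA, "d"),
                 z = c(10, 20, 30, 40))

na_per_column <- colSums(is.na(df))   # number of NAs in each column
na_per_row    <- rowSums(is.na(df))   # number of NAs in each row
complete_df   <- na.omit(df)          # drops every row that contains an NA

print(na_per_column)
print(na_per_row)
print(nrow(complete_df))
```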
We continue our discussion using the sleep data from the VIM package.
summary(sleep)
## BodyWgt BrainWgt NonD Dream
## Min. : 0.005 Min. : 0.14 Min. : 2.100 Min. :0.000
## 1st Qu.: 0.600 1st Qu.: 4.25 1st Qu.: 6.250 1st Qu.:0.900
## Median : 3.342 Median : 17.25 Median : 8.350 Median :1.800
## Mean : 198.790 Mean : 283.13 Mean : 8.673 Mean :1.972
## 3rd Qu.: 48.202 3rd Qu.: 166.00 3rd Qu.:11.000 3rd Qu.:2.550
## Max. :6654.000 Max. :5712.00 Max. :17.900 Max. :6.600
## NA's :14 NA's :12
## Sleep Span Gest Pred
## Min. : 2.60 Min. : 2.000 Min. : 12.00 Min. :1.000
## 1st Qu.: 8.05 1st Qu.: 6.625 1st Qu.: 35.75 1st Qu.:2.000
## Median :10.45 Median : 15.100 Median : 79.00 Median :3.000
## Mean :10.53 Mean : 19.878 Mean :142.35 Mean :2.871
## 3rd Qu.:13.20 3rd Qu.: 27.750 3rd Qu.:207.50 3rd Qu.:4.000
## Max. :19.90 Max. :100.000 Max. :645.00 Max. :5.000
## NA's :4 NA's :4 NA's :4
## Exp Danger
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000
## Mean :2.419 Mean :2.613
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000
##
# Simple imputation using mean:
sleep_df_imputed_mean <- as.data.frame(lapply(sleep, function(x) {
x[is.na(x)] <- mean(x, na.rm = TRUE)
x
}))
sleep_df_imputed_mean
# Simple imputation using median:
sleep_df_imputed_median <- as.data.frame(lapply(sleep, function(x) {
x[is.na(x)] <- median(x, na.rm = TRUE)
x
}))
sleep_df_imputed_median
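The imputation code above works because every column of sleep is numeric. For a data frame that mixes column types, one possible variation is to guard the replacement with is.numeric(). The helper name `impute_mean` and the data frame `df` below are hypothetical, introduced only for this sketch.

```r
# Hypothetical helper: replace NAs in a numeric vector by the vector's mean
impute_mean <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  }
  x  # non-numeric columns are returned unchanged
}

# Hypothetical data frame with one numeric and one character column
df <- data.frame(height = c(150, NA, 170),
                 group  = c("a", "b", NA))

# Assigning into df[] keeps the data frame structure and column names
df[] <- lapply(df, impute_mean)
print(df)
```

Mean and median imputation are simple, but they shrink the variance of the imputed variable, so they are best treated as a quick baseline.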
paste("Everybody", "loves", "R Programming.")
## [1] "Everybody loves R Programming."
paste("Everybody", "loves", "R Programming.", sep="*")
## [1] "Everybody*loves*R Programming."
a <- c("something", "to", "paste")
paste(a, collapse="_")
## [1] "something_to_paste"
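The sep and collapse arguments do different jobs, which becomes clearer when paste() receives vectors with more than one element. A minimal sketch with made-up vectors:

```r
first  <- c("a", "b", "c")
second <- 1:3

# sep joins the arguments element by element: one string per element
pairs <- paste(first, second, sep = "-")

# collapse then joins those strings into a single string
one_string <- paste(first, second, sep = "-", collapse = "+")

# paste0() is paste() with sep = ""
no_gap <- paste0(first, second)

print(pairs)
print(one_string)
print(no_gap)
```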
z <- "Alabama-Alaska-Arizona-Arkansas-California"
strsplit(z, split = "-")
## [[1]]
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
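The `[[1]]` in the output shows that strsplit() returns a list, with one element per input string. The sketch below, using a made-up vector `states`, shows how to get back to ordinary character vectors.

```r
# Hypothetical input: two strings, so the result is a list of length 2
states <- c("Alabama-Alaska", "Arizona-Arkansas-California")
pieces <- strsplit(states, split = "-")

# Extract the pieces of the first string with [[ ]]
first_piece <- pieces[[1]]

# Flatten all pieces into a single character vector
all_states <- unlist(pieces)

print(length(pieces))
print(first_piece)
print(all_states)
```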
Metacharacters consist of non-alphanumeric symbols such as:
. \ | ( ) [ { ^ $ * + ?
To match a metacharacter literally in R, you need to escape it with a double backslash “\\”.
# substitute $ with !
sub(pattern = "\\$", "\\!", "I love R$")
## [1] "I love R!"
# substitute each backslash with a space (a literal backslash is "\\" in an R string, so the regex pattern is "\\\\")
gsub(pattern = "\\\\", " ", "I\\need\\space")
## [1] "I need space"
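To see why escaping matters, compare an unescaped dot, an escaped dot, and the fixed = TRUE alternative, which tells R to treat the pattern as a literal string rather than a regular expression. The string `price` is a made-up example.

```r
price <- "cost: $5.00"

# "." is a metacharacter, so the unescaped pattern matches every character
unescaped <- gsub(".", "-", price)

# Escaping with a double backslash matches only the literal dot
escaped <- gsub("\\.", "-", price)

# fixed = TRUE treats the whole pattern as a literal string
literal <- gsub(".", "-", price, fixed = TRUE)

print(unescaped)
print(escaped)
print(literal)
```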
x <- c("RStudio", "v.4.2.0", "2022", "04-22-2022")
# find any strings with numeric values between 0-9
grep(pattern = "[0-9]", x, value = TRUE)
## [1] "v.4.2.0" "2022" "04-22-2022"
# find any strings with the character R or r
grep(pattern = "[Rr]", x, value = TRUE)
## [1] "RStudio"
# find any strings that have non-alphanumeric characters
grep(pattern = "[^0-9a-zA-Z]", x, value = TRUE)
## [1] "v.4.2.0" "04-22-2022"
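With value = TRUE, grep() returns the matching strings themselves. Two related forms are often useful: grep() without value returns the positions of the matches, and grepl() returns a logical vector. The sketch below reuses the same vector and adds anchors (^ and $) to require that the whole string consists of digits.

```r
x <- c("RStudio", "v.4.2.0", "2022", "04-22-2022")

# Positions of the elements that contain a digit
idx <- grep("[0-9]", x)

# One TRUE/FALSE per element, convenient for subsetting
hit <- grepl("[0-9]", x)

# ^ and $ anchor the pattern to the start and end of the string
only_digits <- grepl("^[0-9]+$", x)

print(idx)
print(hit)
print(x[only_digits])
```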
match() returns a vector of the positions of (first) matches of its first argument in its second.
rv <- c(11, 12, 19, 22, 25)
s_rv <- c(19, 21, 11, 18, 46)
match(rv, s_rv)
## [1] 3 NA 1 NA NA
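When only membership matters and the positions do not, the %in% operator gives a logical vector instead of positions. A short sketch using the same two vectors:

```r
rv   <- c(11, 12, 19, 22, 25)
s_rv <- c(19, 21, 11, 18, 46)

# match() gives the position in s_rv, or NA when there is no match
pos <- match(rv, s_rv)

# %in% gives TRUE/FALSE for each element of rv
present <- rv %in% s_rv

print(pos)
print(present)
print(rv[present])  # elements of rv that also occur in s_rv
```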
intersect() returns the intersection of two vectors.
# intersection of two vectors
# Vector 1
x1 <- c(1, 2, 3, 4, 5, 6, 5, 5)
# Vector 2
x2 <- c(2:4)
# Intersection of two vectors
x3 <- intersect(x1, x2)
x3
## [1] 2 3 4
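Base R provides the companion set operations union() and setdiff() as well. Like intersect(), they discard duplicates, which is why the repeated 5s in x1 appear only once in the results below.

```r
x1 <- c(1, 2, 3, 4, 5, 6, 5, 5)
x2 <- 2:4

# Elements that appear in either vector, without duplicates
u <- union(x1, x2)

# Elements of x1 that do not appear in x2, without duplicates
d <- setdiff(x1, x2)

print(u)
print(d)
```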