1 Reading files in R
- 1.1 Reading csv file
- 1.2 Reading excel(xls/xlsx) file
2 RData files
- 2.1 Load RData file
- 2.2 Save RData file
3 Load required packages
4 Data as Tibbles
- 4.1 Viewing and understanding the structure of the data
5 Commonly used functions from dplyr package
6 Combining multiple sources of information using joins
7 Working with dates and time in R using lubridate package

1 Reading files in R

1.1 Reading csv file

read.csv() is used to read csv files and creates a data frame in R. The argument ‘header’ is the logical value i.e. TRUE/FALSE indicating whether the file contains the names of the variables as its first line.

nycflights <- read.csv(file="C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/flights.csv",header = TRUE)

head(nycflights)

1.2 Reading excel(xls/xlsx) file

Load the library readxl to use the function read_excel for reading excel files in R. The argument ‘sheet’ takes a string (file name) or integer (sheet number).

library(readxl)

## Warning: package 'readxl' was built under R version 4.2.1

read_excel("C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/planes.xlsx",sheet = 1)

2 RData files

2.1 Load RData file

# Load the weather.RData file into current R workspace:
load(file = "C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/weather.RData")

2.2 Save RData file

# Save the data frames nycflights and weather as .RData file
save(nycflights,weather,file = "C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/data_lecture2.RData")

3 Load required packages

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.1

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.8     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.2.1

## Warning: package 'tibble' was built under R version 4.2.1

## Warning: package 'tidyr' was built under R version 4.2.1

## Warning: package 'readr' was built under R version 4.2.1

## Warning: package 'forcats' was built under R version 4.2.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(nycflights13)

## Warning: package 'nycflights13' was built under R version 4.2.1

Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().

4 Data as Tibbles

Tibbles are data frames with some tweaks in behavior of data frames
It does not print entire data - easier to work with large data

flights

This data frame contains all 336,776 flights that departed from New York City in 2013.

4.1 Viewing and understanding the structure of the data

View(flights) # to see the whole dataset in the R studio viewer

str(flights)

## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

The types of variables are:

int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
dttm stands for date-times (a date + a time).
lgl stands for logical, vectors that contain only TRUE or FALSE.
fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
date stands for dates.

5 Commonly used functions from dplyr package

Some most used functions are:

Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.

5.1 Pipe function: %>%

Convenient for clearly expressing a sequence of multiple operations
Rstudio shortcut is Ctrl + Shift + M

Example: Find dimension of data “flights”

dim(flights)

## [1] 336776     19

Using Pipe function

flights %>% dim

## [1] 336776     19

“flights” is passed into function dim as an argument.

5.2 Filter rows with filter()

filter() allows you to subset observations based on their values.

jan1 <- filter(flights, month == 1, day == 1) #The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.
jan1

nov_dec <- filter(flights, month %in% c(11, 12))
nov_dec

5.3 Arrange rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by.

arrange(flights, year, month, day)

If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. Missing values are always sorted at the end. Use desc() to re-order by a column in descending order:

arrange(flights, desc(dep_delay))

5.4 Select columns with select()

select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

select(flights, year, month, day) # Select columns by name

# Select all columns between year and day (inclusive)
select(flights, year:day)

# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))

There are a number of helper functions you can use within select():

starts_with(“abc”): matches names that begin with “abc”.
ends_with(“xyz”): matches names that end with “xyz”.
contains(“ijk”): matches names that contain “ijk”.
matches(“(.)//1”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
num_range(“x”, 1:3): matches x1, x2 and x3.

select() can be used to rename variables, but it’s rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that aren’t explicitly mentioned:

rename(flights, tail_num = tailnum)

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

select(flights, time_hour, air_time, everything())

5.5 Add new variables with mutate()

mutate() always adds new columns at the end of your dataset.

flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)
mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60)

If you only want to keep the new variables, use transmute():

transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

5.6 Grouped summaries with summarise()

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

5.6.1 Combining multiple operations with the pipe

Imagine that we want to explore the relationship between the distance and average delay for each location. Using what you know about dplyr, you might write code like this:

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dest != "HNL")
# ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
#   geom_point(aes(size = count), alpha = 1/3) +
#   geom_smooth(se = FALSE)

delay

This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don’t care about it. Naming things is hard, so this slows down our analysis.

There’s another way to tackle the same problem with the pipe, %>%.

delays <- flights %>%
  group_by(dest) %>%
  summarise(count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20, dest != "HNL")

# ggplot(delays, mapping = aes(x = dist, y = delay)) +
#   geom_point(aes(size = count), alpha = 1/3)+
#   geom_smooth(se = TRUE)

delays

5.6.2 Missing values

If we do not set na.rm = TRUE in the above codes we could have fot the aggregated value as NA as some of the rows have missing values.

In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights.

not_cancelled <-flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay))

## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

5.6.3 Counts

sum(!is.na(flights$dep_delay)) # count of non missing values

## [1] 328521

delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay)
  )
# ggplot(data = delays, mapping = aes(x = delay)) + 
#   geom_freqpoly(binwidth = 10)

delays

The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay:

delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

# ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
#   geom_point(alpha = 1/10)

delays

When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups.

# delays %>% 
#   filter(n > 25) %>% 
#   ggplot(mapping = aes(x = n, y = delay)) + 
#     geom_point(alpha = 1/10)

5.6.4 Useful summary functions

Measures of location: mean(x), median(x)
Measures of spread: sd(x), IQR(x), mad(x)
Measures of rank: min(x), quantile(x, 0.25), max(x) Measures of position: first(x), nth(x, 2), last(x)

not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) )

## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

5.6.5 Grouping by multiple variables

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset:

daily <- group_by(flights, year, month, day)
per_day   <- summarise(daily, flights = n())

## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

per_day

5.6.6 Ungrouping

If you need to remove grouping, and return to operations on ungrouped data, use ungroup().

daily %>% 
  ungroup() %>%             # no longer grouped by date
  summarise(flights = n())  # all flights

6 Combining multiple sources of information using joins

Joins are defined in terms of two tables or data frames
A row in one data frame gets matched with a row in another data frame because they have the same “key”.
“key” is a designated variable(s), present in both tables.

We will use nycflights13 data from the package(“nycflights13”) to learn about joins. Let us consider the following tibbles:

\(\texttt{airlines}\): lets you look up the full carrier name from its abbreviated code:

head(airlines)

\(\texttt{airports}\): gives information about each airport, identified by the faa airport code:

head(airports)

\(\texttt{planes}\): gives information about each plane, identified by its tailnum

head(planes)

\(\texttt{weather}\): gives the weather at each NYC airport for each hour

head(weather)

6.1 Leftjoin

6.1.1 leftjoin() using same key in both tibbles

Let us create a data frame flights2 containing the columns year,month,day, hour, origin, dest, tailnum and carrier.

flights2 <- flights %>% 
  select(year,month,day, hour, origin, dest, tailnum, carrier)

We want to add the full airline name to the flights2 data using airlines and flights2 data frames with left_join().

flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier")

left_join(x, y): Return all rows from x, and all columns from x and y

6.1.2 leftjoin() using different key in each tibble

Suppose we want to draw a map, we need to combine the flights data with the airports data which contains the location (lat and lon) of each airport. Each flight has an origin and destination airport, so we need to specify which one we want to join to:

flights2 %>% 
  left_join(airports, c("dest" = "faa"))

A named character vector: by = c(“a” = “b”). This will match variable a in table x to variable b in table y.

6.2 right_join()

We want to right join the tibbles flights and weather by the user-defined key ‘merge_key’

merge_key <- c("time_hour", "origin")

flights %>% 
  right_join(weather, by = merge_key)

right_join(x, y): Return all rows from y, and all columns from y and x

6.3 inner_join()

Suppose we want to retain only those flights data which contains the location (lat and lon) of each airport.

flights2 %>% 
  inner_join(airports, c("dest" = "faa"))

inner_join(x, y): Return all rows from x where there are matching values in y

6.4 anti_join()

Count the number of flights without a match in planes:

flights %>%
  anti_join(planes,
            by = "tailnum") %>%
  count(tailnum, sort = TRUE)

anti_join(x, y) drops all observations in x that have a match in y.

6.5 semi_join()

Now imagine, we want to obtain the information of the top less visited destinations:

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest

we want to extract information from those destinations and create a new database

flights %>% 
  semi_join(top_dest)

## Joining, by = "dest"

semi_join(x, y) keeps all observations in x that have a match in y

7 Working with dates and time in R using lubridate package

There are three ways you’re likely to create a date/time:

From a string.
From individual date-time components.
From an existing date/time object.

# Load the R package lubridate (useful for working with dates and time)
library(lubridate)

## Warning: package 'lubridate' was built under R version 4.2.1

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

7.1 From a string

Date/time data often comes as strings. One approach is to use the helpers provided by lubridate. They automatically work out the format once you specify the order of the component. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:

ymd("2017-01-31")

## [1] "2017-01-31"

mdy("January 31st, 2017")

## [1] "2017-01-31"

dmy("31-Jan-2017")

## [1] "2017-01-31"

These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data. ymd() is short and unambiguous:

ymd(20170131)

## [1] "2017-01-31"

ymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:

ymd_hms("2017-01-31 20:11:59")

## [1] "2017-01-31 20:11:59 UTC"

mdy_hm("01/31/2017 08:01")

## [1] "2017-01-31 08:01:00 UTC"

You can also force the creation of a date-time from a date by supplying a timezone:

ymd(20170131, tz = "UTC")

## [1] "2017-01-31 UTC"

7.2 From individual date-time components

Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:

flights %>% 
  select(year, month, day, hour, minute)

To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))

7.3 From an existing date/time object

You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():

as_datetime(today())

## [1] "2022-08-04 UTC"

as_date(now())

## [1] "2022-08-04"

Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().

as_datetime(60 * 60 * 10)

## [1] "1970-01-01 10:00:00 UTC"

as_date(365 * 10 + 2)

## [1] "1980-01-01"

Data Preparation and Transformation - 1

Chiranjit Dutta

7/17/21