read.csv() is used to read csv files and creates a data frame in R. The argument ‘header’ is the logical value i.e. TRUE/FALSE indicating whether the file contains the names of the variables as its first line.
nycflights <- read.csv(file="C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/flights.csv",header = TRUE)
head(nycflights)
Load the library readxl to use the function read_excel for reading excel files in R. The argument ‘sheet’ takes a string (file name) or integer (sheet number).
library(readxl)
## Warning: package 'readxl' was built under R version 4.2.1
read_excel("C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/planes.xlsx",sheet = 1)
# Load the weather.RData file into current R workspace:
load(file = "C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/weather.RData")
# Save the data frames nycflights and weather as .RData file
save(nycflights,weather,file = "C:/Users/Chiranjit Dutta/Dropbox/Chiranjit Dutta/R Tutorial Summer 2022/R Tutorial 2022/Lecture_materials/Data/data_lecture2.RData")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.1
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.2.1
## Warning: package 'tibble' was built under R version 4.2.1
## Warning: package 'tidyr' was built under R version 4.2.1
## Warning: package 'readr' was built under R version 4.2.1
## Warning: package 'forcats' was built under R version 4.2.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.2.1
Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag().
flights
This data frame contains all 336,776 flights that departed from New York City in 2013.
View(flights) # to see the whole dataset in the R studio viewer
str(flights)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
The types of variables are:
int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
dttm stands for date-times (a date + a time).
lgl stands for logical, vectors that contain only TRUE or FALSE.
fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
date stands for dates.
Some most used functions are:
These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
Example: Find dimension of data “flights”
dim(flights)
## [1] 336776 19
Using Pipe function
flights %>% dim
## [1] 336776 19
“flights” is passed into function dim as an argument.
filter() allows you to subset observations based on their values.
jan1 <- filter(flights, month == 1, day == 1) #The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.
jan1
nov_dec <- filter(flights, month %in% c(11, 12))
nov_dec
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by.
arrange(flights, year, month, day)
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. Missing values are always sorted at the end. Use desc() to re-order by a column in descending order:
arrange(flights, desc(dep_delay))
select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
select(flights, year, month, day) # Select columns by name
# Select all columns between year and day (inclusive)
select(flights, year:day)
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
There are a number of helper functions you can use within select():
starts_with(“abc”): matches names that begin with “abc”.
ends_with(“xyz”): matches names that end with “xyz”.
contains(“ijk”): matches names that contain “ijk”.
matches(“(.)//1”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
num_range(“x”, 1:3): matches x1, x2 and x3.
select() can be used to rename variables, but it’s rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that aren’t explicitly mentioned:
rename(flights, tail_num = tailnum)
Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.
select(flights, time_hour, air_time, everything())
mutate() always adds new columns at the end of your dataset.
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)
mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60)
If you only want to keep the new variables, use transmute():
transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
Imagine that we want to explore the relationship between the distance and average delay for each location. Using what you know about dplyr, you might write code like this:
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dest != "HNL")
# ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
# geom_point(aes(size = count), alpha = 1/3) +
# geom_smooth(se = FALSE)
delay
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don’t care about it. Naming things is hard, so this slows down our analysis.
There’s another way to tackle the same problem with the pipe, %>%.
delays <- flights %>%
group_by(dest) %>%
summarise(count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) %>%
filter(count > 20, dest != "HNL")
# ggplot(delays, mapping = aes(x = dist, y = delay)) +
# geom_point(aes(size = count), alpha = 1/3)+
# geom_smooth(se = TRUE)
delays
If we do not set na.rm = TRUE in the above codes we could have fot the aggregated value as NA as some of the rows have missing values.
In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights.
not_cancelled <-flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
sum(!is.na(flights$dep_delay)) # count of non missing values
## [1] 328521
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay)
)
# ggplot(data = delays, mapping = aes(x = delay)) +
# geom_freqpoly(binwidth = 10)
delays
The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay:
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
# ggplot(data = delays, mapping = aes(x = n, y = delay)) +
# geom_point(alpha = 1/10)
delays
When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups.
# delays %>%
# filter(n > 25) %>%
# ggplot(mapping = aes(x = n, y = delay)) +
# geom_point(alpha = 1/10)
not_cancelled %>%
group_by(year, month, day) %>%
summarise(avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) )
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset:
daily <- group_by(flights, year, month, day)
per_day <- summarise(daily, flights = n())
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
per_day
If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n()) # all flights
We will use nycflights13 data from the package(“nycflights13”) to learn about joins. Let us consider the following tibbles:
\(\texttt{airlines}\): lets you look up the full carrier name from its abbreviated code:
head(airlines)
\(\texttt{airports}\): gives information about each airport, identified by the faa airport code:
head(airports)
\(\texttt{planes}\): gives information about each plane, identified by its tailnum
head(planes)
\(\texttt{weather}\): gives the weather at each NYC airport for each hour
head(weather)
Let us create a data frame flights2 containing the columns year,month,day, hour, origin, dest, tailnum and carrier.
flights2 <- flights %>%
select(year,month,day, hour, origin, dest, tailnum, carrier)
We want to add the full airline name to the flights2 data using airlines and flights2 data frames with left_join().
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = "carrier")
Suppose we want to draw a map, we need to combine the flights data with the airports data which contains the location (lat and lon) of each airport. Each flight has an origin and destination airport, so we need to specify which one we want to join to:
flights2 %>%
left_join(airports, c("dest" = "faa"))
We want to right join the tibbles flights and weather by the user-defined key ‘merge_key’
merge_key <- c("time_hour", "origin")
flights %>%
right_join(weather, by = merge_key)
Suppose we want to retain only those flights data which contains the location (lat and lon) of each airport.
flights2 %>%
inner_join(airports, c("dest" = "faa"))
Count the number of flights without a match in planes:
flights %>%
anti_join(planes,
by = "tailnum") %>%
count(tailnum, sort = TRUE)
Now imagine, we want to obtain the information of the top less visited destinations:
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
we want to extract information from those destinations and create a new database
flights %>%
semi_join(top_dest)
## Joining, by = "dest"
There are three ways you’re likely to create a date/time:
# Load the R package lubridate (useful for working with dates and time)
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.2.1
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Date/time data often comes as strings. One approach is to use the helpers provided by lubridate. They automatically work out the format once you specify the order of the component. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:
ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"
These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data. ymd() is short and unambiguous:
ymd(20170131)
## [1] "2017-01-31"
ymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:
ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"
You can also force the creation of a date-time from a date by supplying a timezone:
ymd(20170131, tz = "UTC")
## [1] "2017-01-31 UTC"
Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:
flights %>%
select(year, month, day, hour, minute)
To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:
flights %>%
select(year, month, day, hour, minute) %>%
mutate(departure = make_datetime(year, month, day, hour, minute))
You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():
as_datetime(today())
## [1] "2022-08-04 UTC"
as_date(now())
## [1] "2022-08-04"
Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().
as_datetime(60 * 60 * 10)
## [1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)
## [1] "1980-01-01"