source: 5 Data transformation | R for Data Science
dplyr
Data transformation is a pre-process of getting the data in the right form to present. Typically, in R, we use dplyr
package to do this task.
Main functions in dplyr
are called verbs and work similarly:
- The first argument is a data frame
- The subsequent arguments describe what to do with the data frame, using the variable names (without quotes)
- The result is a new data frame
filter()
filter()
allows you to subset observations based on their values.
The first argument is the name of the data frame.
The second and subsequent arguments are the expressions that filter the data frame (returned data frame should satisfy these expressions).
jan1 <- filter(flights, month == 1, day == 1)
- Multiple arguments to
filter()
are combined with "and": every expression must be true in order for a row to be included in the output. - Don't use short-circuiting logical operators here. Because the expressions in
filter()
filter all the data items. filter()
filters out bothFALSE
andNA
arrange()
arrange()
takes a data frame and a set of column names (or more complicated expressions) to order rows. If you provide more than one column name, each additional column will be used to break ties in the values of the preceding columns.
arrange(flights, year, desc(month), day)
desc()
is used to reverse the order- Missing values
NA
are always sorted at the end
select()
select()
is used to select columns (variables). Some ways to fill the arguments
A,B,C
alias = A
renames the columnA:B
selects all columns betweenA
andB
(inclusive)-(A:B)
selects all columns except those fromA
toB
starts_with("abc")
matches names that begin withabc
ends_with("xyz")
matches names that end withxyz
contains("ijk")
matches names that containijk
matches("(.)\\1")
selects variables that match a Regular Expressionnum_range("x", 1:3)
matchesx1
,x2
andx3
everything()
matches all columns, only useful when you want to move some columns to the start
mutate()
mutate()
adds new columns at the end of the data frame.
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
gain_per_hour = gain / hours # refers to a column just created
)
mutate()
keeps the old columns. To only keep the new variables, use transmute()
.
summarise()
summarise()
or summarize()
collapses a data frame to a single row, with values calculated by functions that take the whole data as the argument, like mean
and sum
.
summarise()
is more useful paired with function group_by()
. group_by()
returns a grouped data frame (you can think it adds a new column group
). When applied to a grouped data frame, summarise()
will summarize each group.
Grouped data frame has a groups
attribute, while summarise()
will remove it.
If the data frame is grouped by multiple variables, summarise()
will remove one (the last one) group variable.
You can also use ungroup()
to manually remove the groups.
When calculating the proportions, pay attention to the groups. If you want to calculate the absolute proportion, ungroup
first; if you want to calculate the relative proportion within the group, be careful when summarise
removes the group.
suppressMessages(library(tidyverse))
df <- as.data.frame(Titanic)
df2 <- df %>%
group_by(Class, Survived) %>%
summarize(Freq = sum(Freq)) %>%
ungroup() %>% # very important for absolute prop
mutate(prop = Freq/sum(Freq))
print(df2)
df3 <- df %>%
group_by(Class, Survived) %>%
summarize(Freq = sum(Freq)) %>% # summersize removes the Survived group
mutate(prop = Freq/sum(Freq)) # relative proportion in Classes
print(df3)
Useful summary functions:
-
mean()
,median()
,sum()
- use option
na.rm = TRUE
to remove the NAs before aggregation
- use option
-
n()
,n_distinct(var)
for countn()
includes NAs, usesum(!is.na(x))
to count non-NAs
-
min()
,max()
,quantile(x, 0.25)