source: 5 Data transformation | R for Data Science
dplyr
Data transformation is a pre-process of getting the data in the right form to present. Typically, in R, we use dplyr package to do this task.
Main functions in dplyr are called verbs and work similarly:
- The first argument is a data frame
- The subsequent arguments describe what to do with the data frame, using the variable names (without quotes)
- The result is a new data frame
filter()
filter() allows you to subset observations based on their values.
The first argument is the name of the data frame.
The second and subsequent arguments are the expressions that filter the data frame (returned data frame should satisfy these expressions).
jan1 <- filter(flights, month == 1, day == 1)
- Multiple arguments to
filter()are combined with "and": every expression must be true in order for a row to be included in the output. - Don't use short-circuiting logical operators here. Because the expressions in
filter()filter all the data items. filter()filters out bothFALSEandNA
arrange()
arrange() takes a data frame and a set of column names (or more complicated expressions) to order rows. If you provide more than one column name, each additional column will be used to break ties in the values of the preceding columns.
arrange(flights, year, desc(month), day)
desc()is used to reverse the order- Missing values
NAare always sorted at the end
select()
select() is used to select columns (variables). Some ways to fill the arguments
A,B,Calias = Arenames the columnA:Bselects all columns betweenAandB(inclusive)-(A:B)selects all columns except those fromAtoBstarts_with("abc")matches names that begin withabcends_with("xyz")matches names that end withxyzcontains("ijk")matches names that containijkmatches("(.)\\1")selects variables that match a Regular Expressionnum_range("x", 1:3)matchesx1,x2andx3everything()matches all columns, only useful when you want to move some columns to the start
mutate()
mutate() adds new columns at the end of the data frame.
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
gain_per_hour = gain / hours # refers to a column just created
)
mutate() keeps the old columns. To only keep the new variables, use transmute().
summarise()
summarise() or summarize() collapses a data frame to a single row, with values calculated by functions that take the whole data as the argument, like mean and sum.
summarise() is more useful paired with function group_by(). group_by() returns a grouped data frame (you can think it adds a new column group). When applied to a grouped data frame, summarise() will summarize each group.
Grouped data frame has a groups attribute, while summarise() will remove it.
If the data frame is grouped by multiple variables, summarise() will remove one (the last one) group variable.
You can also use ungroup() to manually remove the groups.
When calculating the proportions, pay attention to the groups. If you want to calculate the absolute proportion, ungroup first; if you want to calculate the relative proportion within the group, be careful when summarise removes the group.
suppressMessages(library(tidyverse))
df <- as.data.frame(Titanic)
df2 <- df %>%
group_by(Class, Survived) %>%
summarize(Freq = sum(Freq)) %>%
ungroup() %>% # very important for absolute prop
mutate(prop = Freq/sum(Freq))
print(df2)
df3 <- df %>%
group_by(Class, Survived) %>%
summarize(Freq = sum(Freq)) %>% # summersize removes the Survived group
mutate(prop = Freq/sum(Freq)) # relative proportion in Classes
print(df3)
Useful summary functions:
-
mean(),median(),sum()- use option
na.rm = TRUEto remove the NAs before aggregation
- use option
-
n(),n_distinct(var)for countn()includes NAs, usesum(!is.na(x))to count non-NAs
-
min(),max(),quantile(x, 0.25)
by zcysxy