source: 5 Data transformation | R for Data Science


Data transformation is the preprocessing step that gets data into the right form before it is presented. In R, this is typically done with the dplyr package.

The main dplyr functions are called verbs, and they all work similarly:

  1. The first argument is a data frame
  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes)
  3. The result is a new data frame


filter() allows you to subset observations based on their values.
The first argument is the name of the data frame.
The second and subsequent arguments are the expressions that filter the data frame: every row of the returned data frame satisfies all of them.

jan1 <- filter(flights, month == 1, day == 1)


arrange() takes a data frame and a set of column names (or more complicated expressions) to order rows. If you provide more than one column name, each additional column will be used to break ties in the values of the preceding columns.

arrange(flights, year, desc(month), day)


select() is used to select columns (variables). Columns can be given by name, as a range of names, or through helper functions such as starts_with().
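
A minimal sketch of a few common ways to pass columns to select(). The data frame df and its values are made up for illustration; only the column names echo the flights data used above.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
df <- data.frame(
  year = 2013,
  month = 1:3,
  day = c(1, 15, 28),
  dep_delay = c(2, -4, 10),
  arr_delay = c(11, -8, 3)
)

by_name   <- select(df, year, month, day)    # list columns by name
by_range  <- select(df, year:day)            # every column from year to day
dropped   <- select(df, -(year:day))         # everything except that range
by_helper <- select(df, ends_with("delay"))  # helper that matches column names
```

Here by_name and by_range hold the same three columns, while dropped and by_helper both hold dep_delay and arr_delay.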


mutate() adds new columns at the end of the data frame.

mutate(flights,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60,
  hours = air_time / 60, # defined here because gain_per_hour needs it
  gain_per_hour = gain / hours # refers to columns just created
)

mutate() keeps the old columns. To only keep the new variables, use transmute().
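
The difference can be sketched on a small made-up data frame (df and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(dep_delay = c(10, 0, 25), arr_delay = c(4, 6, 30))

kept     <- mutate(df, gain = dep_delay - arr_delay)     # old columns plus gain
only_new <- transmute(df, gain = dep_delay - arr_delay)  # gain only

names(kept)      # "dep_delay" "arr_delay" "gain"
names(only_new)  # "gain"
```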


summarise() (or summarize()) collapses a data frame to a single row. Each value in that row is computed by a function that reduces a whole column to a single value, such as mean() or sum().

summarise() is most useful when paired with group_by(). group_by() returns a grouped data frame (you can think of it as tagging each row with its group). When applied to a grouped data frame, summarise() produces one row per group.
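
As a rough sketch, the same summary applied without and with grouping looks like this. The data frame df is made up for illustration; its column names merely echo the flights data.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(5, 15, 0, 30, NA)
)

# Ungrouped: the whole data frame collapses to one row
overall <- summarise(df, delay = mean(dep_delay, na.rm = TRUE))

# Grouped: one row per carrier
by_carrier <- df %>%
  group_by(carrier) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE), n = n())
```

overall has a single row, while by_carrier has one row for AA and one for UA.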


A grouped data frame carries a groups attribute, and summarise() removes one level of grouping each time it is applied.
With a single grouping variable, the result is ungrouped. If the data frame is grouped by multiple variables, summarise() removes only the last grouping variable.

You can also use ungroup() to manually remove the groups.
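
One way to watch the grouping change is group_vars(), which lists the current grouping variables. This is a sketch on a made-up data frame and assumes the default behavior of summarise(), which may also print a message about the grouping of its output.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(
  year = c(2013, 2013, 2013, 2014),
  month = c(1, 1, 2, 1),
  flights = c(10, 20, 30, 40)
)

by_month <- group_by(df, year, month)
group_vars(by_month)   # "year" "month"

per_month <- summarise(by_month, flights = sum(flights))
group_vars(per_month)  # "year": the last grouping variable was dropped

per_year <- summarise(per_month, flights = sum(flights))
group_vars(per_year)   # character(0): no grouping left

group_vars(ungroup(per_month))  # character(0): grouping removed manually
```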

Grouped Percentages

When calculating proportions, pay attention to the grouping. For proportions of the overall total, ungroup first. For proportions within each group, remember that summarise() has already removed the last grouping variable.

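df <- as.data.frame(Titanic) # assumption: the source leaves df undefined; the built-in Titanic table has the Class, Survived and Freq columns used below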
df2 <- df %>%  
    group_by(Class, Survived) %>%
    summarize(Freq = sum(Freq)) %>%
    ungroup() %>% # very important for absolute prop
    mutate(prop = Freq/sum(Freq))

df3 <- df %>%
    group_by(Class, Survived) %>%
    summarize(Freq = sum(Freq)) %>% # summarize() removes the Survived grouping
    mutate(prop = Freq/sum(Freq)) # relative proportion within each Class

Useful summary functions:
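
A few summary functions that often appear inside summarise(), sketched on a made-up data frame. The selection is illustrative rather than exhaustive, and df and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(
  carrier = c("AA", "AA", "AA", "UA", "UA"),
  dest = c("LAX", "LAX", "SFO", "ORD", "ORD"),
  dep_delay = c(5, 15, NA, 0, 30)
)

stats <- df %>%
  group_by(carrier) %>%
  summarise(
    n = n(),                                         # number of rows
    n_obs = sum(!is.na(dep_delay)),                  # number of non-missing values
    n_dest = n_distinct(dest),                       # number of distinct values
    mean_delay = mean(dep_delay, na.rm = TRUE),      # location
    median_delay = median(dep_delay, na.rm = TRUE),  # location, robust to outliers
    max_delay = max(dep_delay, na.rm = TRUE)         # rank
  )
```

The na.rm = TRUE argument matters here: without it, any group containing a missing value would get NA for mean, median and max.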

Creative Commons License by zcysxy