In this section, we’ll discuss Data Wrangling/Transformation via the dplyr package. We’ll explore ways to

  1. choose subsets of data,
  2. aggregate data to create summaries,
  3. make new variables,
  4. and sort your data frames.

It is recommended you also explore the RStudio Cheatsheet on Data Transformation as we discuss this content.

Back to gapminder

Here is a look at the gapminder data frame in the gapminder package.

library(tidyverse)
library(gapminder)
gapminder

We recommend always checking out your data first before you hop into your analyses. Simply running the name of the object like we have with gapminder above is helpful. The View() command (note the capital “V”! R is case sensitive!) is also helpful. Try it out with View(gapminder).

Say we wanted mean life expectancy across all years for Asia

# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
## [1] 60.0649
library(dplyr)
gapminder %>% 
  filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))

The pipe %>%

   

The Five Main Verbs (5MV) of data wrangling

filter()
summarize()
group_by()
mutate()
arrange()


filter()

  • Select a subset of the rows of a data frame.
  • The arguments are the “filters” that you’d like to apply.
library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
gap_2007
  • Use == to compare a variable to a value.
  • Remember to include quotes around strings, but they are not needed around numbers.

Your turn

Exercise 1.1

Choose only Asian rows in the gapminder data and provide the name of asian_countries to that result.

# Enter and try your answer here.

Logical operators

  • Use | to check for any in multiple filters being true:
gapminder %>% 
  filter(year == 2002 | continent == "Europe")
  • Use & or , to check for all of multiple filters being true:
gapminder %>% 
  filter(year == 2002, continent == "Europe")
  • Use %in% to check for any being true (shortcut to using | repeatedly with ==)
gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))

Your turn

Exercise 1.2

Choose only rows corresponding to African or European nations.

# Enter and try your answer here.

summarize()

  • Any numerical summary that you want to apply to a column of a data frame is specified within summarize().
max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997

Combining summarize() with group_by()

When you’d like to determine a numerical summary for all levels of a different categorical variable

gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp)) 
max_exp_1997_by_cont

Without the %>%

It’s hard to appreciate the %>% without seeing what the code would look like without it:

max_exp_1997_by_cont <- 
  summarize(
    group_by(
      filter(
        gapminder, 
          year == 1997), 
      continent),
    max_exp = max(lifeExp))
max_exp_1997_by_cont

mutate()

  • Allows you to
    1. create a new variable based on other variables OR
    2. change the contents of an existing variable
  1. create a new variable based on other variables
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
gap_w_gdp

mutate()

  1. change the contents of an existing variable
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
gap_weird

arrange()

  • Reorders the rows in a data frame based on the values of one or more variables
gapminder %>%
  arrange(year, country)
  • Can also put into descending order
gapminder %>%
  filter(year > 2000) %>%
  arrange(desc(lifeExp))

Other useful dplyr verbs

  • select
  • top_n
  • sample_n
  • slice
  • glimpse
  • rename

Your turn

Exercise 1.3

Determine which African country had the highest GDP per capita in 1982 using the gapminder data in the gapminder package.

#Space for your answer here.
gapminder %>% 
  filter(year == 1982) %>% 
  filter(continent == "Africa") %>% 
  arrange(desc(gdpPercap))

Challenge - Exercise 1.3C

For both of these tasks below, use the bechdel data frame in the fivethirtyeight package:

  • Use the count function in the dplyr package to determine how many movies in 2013 fell into each of the different categories for clean_test.
  • Determine the percentage of movies that received the value of "ok" across all years.

Your turn

Exercise 1.4

Determine the top five movies in terms of domestic return on investment for 2013 scaled data using the bechdel data frame in the fivethirtyeight package.

#Space for your answer here.

What you learned in this section

Different ways in the dplyr package to

  • choose subsets of data using the filter() verb
  • aggregate data to create summaries using the summarize() and group_by verbs
  • make new variables using the mutate() verb
  • sort your data frames using the arrange() verb
LS0tCnRpdGxlOiAiUGFydCAxIC0gSW50cm8gdG8gYGRwbHlyYCIKYXV0aG9yOiAiQ2hlc3RlciBJc21heSIKb3V0cHV0OiAKICBodG1sX2RvY3VtZW50OgogICAgY29kZV9kb3dubG9hZDogdHJ1ZQogICAgY29kZV9mb2xkaW5nOiBoaWRlCiAgICBkZl9wcmludDogcGFnZWQKLS0tCgpJbiB0aGlzIHNlY3Rpb24sIHdlJ2xsIGRpc2N1c3MgRGF0YSBXcmFuZ2xpbmcvVHJhbnNmb3JtYXRpb24gdmlhIHRoZSBgZHBseXJgIHBhY2thZ2UuICBXZSdsbCBleHBsb3JlIHdheXMgdG8KCjEuIGNob29zZSBzdWJzZXRzIG9mIGRhdGEsIAoxLiBhZ2dyZWdhdGUgZGF0YSB0byBjcmVhdGUgc3VtbWFyaWVzLCAKMS4gbWFrZSBuZXcgdmFyaWFibGVzLCAKMS4gYW5kIHNvcnQgeW91ciBkYXRhIGZyYW1lcy4gIAoKSXQgaXMgcmVjb21tZW5kZWQgeW91IGFsc28gZXhwbG9yZSB0aGUgUlN0dWRpbyBDaGVhdHNoZWV0IG9uIFtEYXRhIFRyYW5zZm9ybWF0aW9uXShodHRwczovL2dpdGh1Yi5jb20vcnN0dWRpby9jaGVhdHNoZWV0cy9yYXcvbWFzdGVyL2RhdGEtdHJhbnNmb3JtYXRpb24ucGRmKSBhcyB3ZSBkaXNjdXNzIHRoaXMgY29udGVudC4KCiMjIyBCYWNrIHRvIGBnYXBtaW5kZXJgCgpIZXJlIGlzIGEgbG9vayBhdCB0aGUgYGdhcG1pbmRlcmAgZGF0YSBmcmFtZSBpbiB0aGUgYGdhcG1pbmRlcmAgcGFja2FnZS4KCmBgYHtyIG1lc3NhZ2U9RkFMU0V9CmxpYnJhcnkodGlkeXZlcnNlKQpsaWJyYXJ5KGdhcG1pbmRlcikKZ2FwbWluZGVyCmBgYAoKV2UgcmVjb21tZW5kIGFsd2F5cyBjaGVja2luZyBvdXQgeW91ciBkYXRhIGZpcnN0IGJlZm9yZSB5b3UgaG9wIGludG8geW91ciBhbmFseXNlcy4gU2ltcGx5IHJ1bm5pbmcgdGhlIG5hbWUgb2YgdGhlIG9iamVjdCBsaWtlIHdlIGhhdmUgd2l0aCBgZ2FwbWluZGVyYCBhYm92ZSBpcyBoZWxwZnVsLiBUaGUgYFZpZXcoKWAgY29tbWFuZCAobm90ZSB0aGUgY2FwaXRhbCAiViIhIFIgaXMgY2FzZSBzZW5zaXRpdmUhKSBpcyBhbHNvIGhlbHBmdWwuIFRyeSBpdCBvdXQgd2l0aCBgVmlldyhnYXBtaW5kZXIpYC4KClNheSB3ZSB3YW50ZWQgbWVhbiBsaWZlIGV4cGVjdGFuY3kgYWNyb3NzIGFsbCB5ZWFycyBmb3IgQXNpYQoKYGBge3J9CiMgQmFzZSBSCmFzaWEgPC0gZ2FwbWluZGVyW2dhcG1pbmRlciRjb250aW5lbnQgPT0gIkFzaWEiLCBdCm1lYW4oYXNpYSRsaWZlRXhwKQpgYGAKCmBgYHtyfQpsaWJyYXJ5KGRwbHlyKQpnYXBtaW5kZXIgJT4lIAogIGZpbHRlcihjb250aW5lbnQgPT0gIkFzaWEiKSAlPiUKICBzdW1tYXJpemUobWVhbl9leHAgPSBtZWFuKGxpZmVFeHApKQpgYGAKCiMjIFRoZSBwaXBlIGAlPiVgCgpgciBrbml0cjo6aW5jbHVkZV9ncmFwaGljcygiZmlncy9waXBlLnBuZyIpYCAmZW1zcDsgJmVtc3A7YHIga25pdHI6OmluY2x1ZGVfZ3JhcGhpY3MoImZpZ3MvTWFncml0dGVQaXBlLmpwZyIpYAoKLSBBIHdheSB0byBjaGFpbiB0b2dldGhlciBjb21tYW5kcwotIFRoaXMgaGVscHMgd2l0aCBtYWtpbmcgeW91ciBhbmFseXNlcyByZWFkYWJsZSBhbmQgZXhwbG9yYXRvcnkgYXQgdGhlIHNhbWUgdGltZQoKCiMgVGhlIEZpdmUgTWFpbiBWZXJicyAoNU1WKSBvZiBkYXRhIHdyYW5nbGluZwoKIyMjIGBmaWx0ZXIoKWAgPGJyPiBgc3VtbWFyaXplKClgIDxicj4gYGdyb3VwX2J5KClgIDxicj4gYG11dGF0ZSgpYCA8YnI+IGBhcnJhbmdlKClgCgotLS0KCiMjIGBmaWx0ZXIoKWAKCi0gU2VsZWN0IGEgc3Vic2V0IG9mIHRoZSByb3dzIG9mIGEgZGF0YSBmcmFtZS4gCi0gVGhlIGFyZ3VtZW50cyBhcmUgdGhlICJmaWx0ZXJzIiB0aGF0IHlvdSdkIGxpa2UgdG8gYXBwbHkuCgpgYGB7cn0KbGlicmFyeShnYXBtaW5kZXIpOyBsaWJyYXJ5KGRwbHlyKQpnYXBfMjAwNyA8LSBnYXBtaW5kZXIgJT4lIGZpbHRlcih5ZWFyID09IDIwMDcpCmdhcF8yMDA3CmBgYAoKLSBVc2UgYD09YCB0byBjb21wYXJlIGEgdmFyaWFibGUgdG8gYSB2YWx1ZS4KLSBSZW1lbWJlciB0byBpbmNsdWRlIHF1b3RlcyBhcm91bmQgc3RyaW5ncywgYnV0IHRoZXkgYXJlIG5vdCBuZWVkZWQgYXJvdW5kIG51bWJlcnMuCgojIyBZb3VyIHR1cm4KCiMjIyBFeGVyY2lzZSAxLjEKCkNob29zZSBvbmx5IEFzaWFuIHJvd3MgaW4gdGhlIGBnYXBtaW5kZXJgIGRhdGEgYW5kIHByb3ZpZGUgdGhlIG5hbWUgb2YgYGFzaWFuX2NvdW50cmllc2AgdG8gdGhhdCByZXN1bHQuCgpgYGB7cn0KIyBFbnRlciBhbmQgdHJ5IHlvdXIgYW5zd2VyIGhlcmUuCmBgYAoKCiMjIExvZ2ljYWwgb3BlcmF0b3JzCgotIFVzZSBgfGAgdG8gY2hlY2sgZm9yIGFueSBpbiBtdWx0aXBsZSBmaWx0ZXJzIGJlaW5nIHRydWU6CgpgYGB7cn0KZ2FwbWluZGVyICU+JSAKICBmaWx0ZXIoeWVhciA9PSAyMDAyIHwgY29udGluZW50ID09ICJFdXJvcGUiKQpgYGAKCi0gVXNlIGAmYCBvciBgLGAgdG8gY2hlY2sgZm9yIGFsbCBvZiBtdWx0aXBsZSBmaWx0ZXJzIGJlaW5nIHRydWU6CgpgYGB7cn0KZ2FwbWluZGVyICU+JSAKICBmaWx0ZXIoeWVhciA9PSAyMDAyLCBjb250aW5lbnQgPT0gIkV1cm9wZSIpCmBgYAoKLSBVc2UgYCVpbiVgIHRvIGNoZWNrIGZvciBhbnkgYmVpbmcgdHJ1ZSAoc2hvcnRjdXQgdG8gdXNpbmcgYHxgIHJlcGVhdGVkbHkgd2l0aCBgPT1gKQoKYGBge3J9CmdhcG1pbmRlciAlPiUgCiAgZmlsdGVyKGNvdW50cnkgJWluJSBjKCJBcmdlbnRpbmEiLCAiQmVsZ2l1bSIsICJNZXhpY28iKSwKICAgICAgICAgeWVhciAlaW4lIGMoMTk4NywgMTk5MikpCmBgYAoKIyMgWW91ciB0dXJuCgojIyMgRXhlcmNpc2UgMS4yCgpDaG9vc2Ugb25seSByb3dzIGNvcnJlc3BvbmRpbmcgdG8gQWZyaWNhbiBvciBFdXJvcGVhbiBuYXRpb25zLgoKYGBge3J9CiMgRW50ZXIgYW5kIHRyeSB5b3VyIGFuc3dlciBoZXJlLgpgYGAKCgojIyBgc3VtbWFyaXplKClgCgotIEFueSBudW1lcmljYWwgc3VtbWFyeSB0aGF0IHlvdSB3YW50IHRvIGFwcGx5IHRvIGEgY29sdW1uIG9mIGEgZGF0YSBmcmFtZSBpcyBzcGVjaWZpZWQgd2l0aGluIGBzdW1tYXJpemUoKWAuCgpgYGB7cn0KbWF4X2V4cF8xOTk3IDwtIGdhcG1pbmRlciAlPiUgCiAgZmlsdGVyKHllYXIgPT0gMTk5NykgJT4lIAogIHN1bW1hcml6ZShtYXhfZXhwID0gbWF4KGxpZmVFeHApKQptYXhfZXhwXzE5OTcKYGBgCgojIyMgQ29tYmluaW5nIGBzdW1tYXJpemUoKWAgd2l0aCBgZ3JvdXBfYnkoKWAKCldoZW4geW91J2QgbGlrZSB0byBkZXRlcm1pbmUgYSBudW1lcmljYWwgc3VtbWFyeSBmb3IgYWxsCmxldmVscyBvZiBhIGRpZmZlcmVudCBjYXRlZ29yaWNhbCB2YXJpYWJsZQoKYGBge3J9CmdhcG1pbmRlciAlPiUgCiAgZmlsdGVyKHllYXIgPT0gMTk5NykgJT4lIAogIGdyb3VwX2J5KGNvbnRpbmVudCkgJT4lCiAgc3VtbWFyaXplKG1heF9leHAgPSBtYXgobGlmZUV4cCkpIAptYXhfZXhwXzE5OTdfYnlfY29udApgYGAKCgojIyMgV2l0aG91dCB0aGUgYCU+JWAKCkl0J3MgaGFyZCB0byBhcHByZWNpYXRlIHRoZSBgJT4lYCB3aXRob3V0IHNlZWluZyB3aGF0IHRoZSBjb2RlIHdvdWxkIGxvb2sgbGlrZSB3aXRob3V0IGl0OgoKYGBge3J9Cm1heF9leHBfMTk5N19ieV9jb250IDwtIAogIHN1bW1hcml6ZSgKICAgIGdyb3VwX2J5KAogICAgICBmaWx0ZXIoCiAgICAgICAgZ2FwbWluZGVyLCAKICAgICAgICAgIHllYXIgPT0gMTk5NyksIAogICAgICBjb250aW5lbnQpLAogICAgbWF4X2V4cCA9IG1heChsaWZlRXhwKSkKbWF4X2V4cF8xOTk3X2J5X2NvbnQKYGBgCgojIyBgbXV0YXRlKClgCgotIEFsbG93cyB5b3UgdG8gCiAgICAxLiBjcmVhdGUgYSBuZXcgdmFyaWFibGUgYmFzZWQgb24gb3RoZXIgdmFyaWFibGVzIE9SCiAgICAyLiBjaGFuZ2UgdGhlIGNvbnRlbnRzIG9mIGFuIGV4aXN0aW5nIHZhcmlhYmxlCgoKMS4gY3JlYXRlIGEgbmV3IHZhcmlhYmxlIGJhc2VkIG9uIG90aGVyIHZhcmlhYmxlcwoKYGBge3J9CmdhcF93X2dkcCA8LSBnYXBtaW5kZXIgJT4lIG11dGF0ZShnZHAgPSBwb3AgKiBnZHBQZXJjYXApCmdhcF93X2dkcApgYGAKCiMjIGBtdXRhdGUoKWAKCjIuIGNoYW5nZSB0aGUgY29udGVudHMgb2YgYW4gZXhpc3RpbmcgdmFyaWFibGUKCmBgYHtyfQpnYXBfd2VpcmQgPC0gZ2FwbWluZGVyICU+JSBtdXRhdGUocG9wID0gcG9wICsgMTAwMCkKZ2FwX3dlaXJkCmBgYAoKIyMgYGFycmFuZ2UoKWAKCi0gUmVvcmRlcnMgdGhlIHJvd3MgaW4gYSBkYXRhIGZyYW1lIGJhc2VkIG9uIHRoZSB2YWx1ZXMgb2Ygb25lIG9yIG1vcmUgdmFyaWFibGVzCgpgYGB7cn0KZ2FwbWluZGVyICU+JQogIGFycmFuZ2UoeWVhciwgY291bnRyeSkKYGBgCgotIENhbiBhbHNvIHB1dCBpbnRvIGRlc2NlbmRpbmcgb3JkZXIKCmBgYHtyIGRlc2N9CmdhcG1pbmRlciAlPiUKICBmaWx0ZXIoeWVhciA+IDIwMDApICU+JQogIGFycmFuZ2UoZGVzYyhsaWZlRXhwKSkKYGBgCgojIyBPdGhlciB1c2VmdWwgYGRwbHlyYCB2ZXJicwoKLSBgc2VsZWN0YAotIGB0b3BfbmAKLSBgc2FtcGxlX25gCi0gYHNsaWNlYAotIGBnbGltcHNlYAotIGByZW5hbWVgCgojIyBZb3VyIHR1cm4KCiMjIyBFeGVyY2lzZSAxLjMKCkRldGVybWluZSB3aGljaCBBZnJpY2FuIGNvdW50cnkgaGFkIHRoZSBoaWdoZXN0IEdEUCBwZXIgY2FwaXRhIGluIDE5ODIgdXNpbmcgdGhlIGBnYXBtaW5kZXJgIGRhdGEgaW4gdGhlIGBnYXBtaW5kZXJgIHBhY2thZ2UuCgpgYGB7cn0KI1NwYWNlIGZvciB5b3VyIGFuc3dlciBoZXJlLgpnYXBtaW5kZXIgJT4lIAogIGZpbHRlcih5ZWFyID09IDE5ODIpICU+JSAKICBmaWx0ZXIoY29udGluZW50ID09ICJBZnJpY2EiKSAlPiUgCiAgYXJyYW5nZShkZXNjKGdkcFBlcmNhcCkpCmBgYAoKCiMjIyBDaGFsbGVuZ2UgLSBFeGVyY2lzZSAxLjNDCgpGb3IgYm90aCBvZiB0aGVzZSB0YXNrcyBiZWxvdywgdXNlIHRoZSBgYmVjaGRlbGAgZGF0YSBmcmFtZSBpbiB0aGUgYGZpdmV0aGlydHllaWdodGAgcGFja2FnZToKCi0gVXNlIHRoZSBgY291bnRgIGZ1bmN0aW9uIGluIHRoZSBgZHBseXJgIHBhY2thZ2UgdG8gZGV0ZXJtaW5lIGhvdyBtYW55IG1vdmllcwppbiAyMDEzIGZlbGwgaW50byBlYWNoIG9mIHRoZSBkaWZmZXJlbnQgY2F0ZWdvcmllcyBmb3IgYGNsZWFuX3Rlc3RgLgotIERldGVybWluZSB0aGUgcGVyY2VudGFnZSBvZiBtb3ZpZXMgdGhhdCByZWNlaXZlZCB0aGUgdmFsdWUgb2YgYCJvayJgIGFjcm9zcyBhbGwgeWVhcnMuCgoKLS0tCgoKIyMgWW91ciB0dXJuCgojIyMgRXhlcmNpc2UgMS40CgpEZXRlcm1pbmUgdGhlIHRvcCBmaXZlIG1vdmllcyBpbiB0ZXJtcyBvZiBkb21lc3RpYyByZXR1cm4gb24gaW52ZXN0bWVudCBmb3IgMjAxMyBzY2FsZWQgZGF0YSB1c2luZyB0aGUgYGJlY2hkZWxgIGRhdGEgZnJhbWUgaW4gdGhlIGBmaXZldGhpcnR5ZWlnaHRgIHBhY2thZ2UuCgoKYGBge3J9CiNTcGFjZSBmb3IgeW91ciBhbnN3ZXIgaGVyZS4KYGBgCgoKIyMgV2hhdCB5b3UgbGVhcm5lZCBpbiB0aGlzIHNlY3Rpb24KCkRpZmZlcmVudCB3YXlzIGluIHRoZSBgZHBseXJgIHBhY2thZ2UgdG8KCi0gY2hvb3NlIHN1YnNldHMgb2YgZGF0YSB1c2luZyB0aGUgYGZpbHRlcigpYCB2ZXJiCi0gYWdncmVnYXRlIGRhdGEgdG8gY3JlYXRlIHN1bW1hcmllcyB1c2luZyB0aGUgYHN1bW1hcml6ZSgpYCBhbmQgYGdyb3VwX2J5YCB2ZXJicwotIG1ha2UgbmV3IHZhcmlhYmxlcyB1c2luZyB0aGUgYG11dGF0ZSgpYCB2ZXJiCi0gc29ydCB5b3VyIGRhdGEgZnJhbWVzIHVzaW5nIHRoZSBgYXJyYW5nZSgpYCB2ZXJi