9 Selecting and renaming columns

9.1 Introduction

Today we will begin our exploration of the {dplyr} package! Our first verb on the list is select which allows to keep or drop variables from your dataframe. Choosing your variables is the first step in cleaning your data.

Let’s go !

9.2 Learning objectives

You can keep or drop columns from a dataframe using the dplyr::select() function from the {dplyr} package.
You can select a range or combination of columns using operators like the colon (:), the exclamation mark (!), and the c() function.
You can select columns based on patterns in their names with helper functions like starts_with(), ends_with(), contains(), and everything().
You can use rename() and select() to change column names.

9.3 The Yaounde COVID-19 dataset

In this lesson, we analyse results from a COVID-19 serological survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for IgG and IgM antibodies. The full dataset can be obtained from Zenodo, and the paper can be viewed here.

Spend some time browsing through this dataset. Each line corresponds to one patient surveyed. There are some demographic, socio-economic and COVID-related variables. The results of the IgG and IgM antibody tests are in the columns igg_result and igm_result.

yaounde <- read_csv(here::here("data/yaounde_data.csv"))
yaounde

# A tibble: 5 × 53
  id     date_surveyed   age age_category age_category_3 sex   highest_education
  <chr>  <date>        <dbl> <chr>        <chr>          <chr> <chr>            
1 BRIQU… 2020-10-22       45 45 - 64      Adult          Fema… Secondary        
2 BRIQU… 2020-10-24       55 45 - 64      Adult          Male  University       
3 BRIQU… 2020-10-24       23 15 - 29      Adult          Male  University       
4 BRIQU… 2020-10-22       20 15 - 29      Adult          Fema… Secondary        
5 BRIQU… 2020-10-22       55 45 - 64      Adult          Fema… Primary          
# ℹ 46 more variables: occupation <chr>, weight_kg <dbl>, height_cm <dbl>,
#   is_smoker <chr>, is_pregnant <chr>, is_medicated <chr>, neighborhood <chr>,
#   household_with_children <chr>, breadwinner <chr>, source_of_revenue <chr>,
#   has_contact_covid <chr>, igg_result <chr>, igm_result <chr>,
#   symptoms <chr>, symp_fever <chr>, symp_headache <chr>, symp_cough <chr>,
#   symp_rhinitis <chr>, symp_sneezing <chr>, symp_fatigue <chr>,
#   symp_muscle_pain <chr>, symp_nausea_or_vomiting <chr>, …

Left: the Yaounde survey team. Right: an antibody test being administered.

9.4 Introducing `select()`

Fig: the `select()` function. (Drawing adapted from Allison Horst).

dplyr::select() lets us pick which columns (variables) to keep or drop.

We can select a column by name:

yaounde %>% select(age)

# A tibble: 5 × 1
    age
  <dbl>
1    45
2    55
3    23
4    20
5    55

Or we can select a column by position:

yaounde %>% select(3) # `age` is the 3rd column

# A tibble: 5 × 1
    age
  <dbl>
1    45
2    55
3    23
4    20
5    55

To select multiple variables, we separate them with commas:

yaounde %>% select(age, sex, igg_result)

# A tibble: 971 × 3
     age sex    igg_result
   <dbl> <chr>  <chr>     
 1    45 Female Negative  
 2    55 Male   Positive  
 3    23 Male   Negative  
 4    20 Female Positive  
 5    55 Female Positive  
 6    17 Female Negative  
 7    13 Female Positive  
 8    28 Male   Negative  
 9    30 Male   Negative  
10    13 Female Positive  
# ℹ 961 more rows

Practice

Select the weight and height variables in the yaounde data frame.

Select the 16th and 22nd columns in the yaounde data frame.

For the next part of the tutorial, let’s create a smaller subset of the data, called yao.

yao <-
  yaounde %>% select(age,
                     sex,
                     highest_education,
                     occupation,
                     is_smoker,
                     is_pregnant,
                     igg_result,
                     igm_result)
yao

# A tibble: 5 × 8
    age sex    highest_education occupation     is_smoker is_pregnant igg_result
  <dbl> <chr>  <chr>             <chr>          <chr>     <chr>       <chr>     
1    45 Female Secondary         Informal work… Non-smok… No          Negative  
2    55 Male   University        Salaried work… Ex-smoker <NA>        Positive  
3    23 Male   University        Student        Smoker    <NA>        Negative  
4    20 Female Secondary         Student        Non-smok… No          Positive  
5    55 Female Primary           Trader--Farmer Non-smok… No          Positive  
# ℹ 1 more variable: igm_result <chr>

9.4.1 Selecting column ranges with `:`

The : operator selects a range of consecutive variables:

yao %>% select(age:occupation) # Select all columns from `age` to `occupation`

# A tibble: 5 × 4
    age sex    highest_education occupation     
  <dbl> <chr>  <chr>             <chr>          
1    45 Female Secondary         Informal worker
2    55 Male   University        Salaried worker
3    23 Male   University        Student        
4    20 Female Secondary         Student        
5    55 Female Primary           Trader--Farmer

We can also specify a range with column numbers:

yao %>% select(1:4) # Select columns 1 to 4

# A tibble: 5 × 4
    age sex    highest_education occupation     
  <dbl> <chr>  <chr>             <chr>          
1    45 Female Secondary         Informal worker
2    55 Male   University        Salaried worker
3    23 Male   University        Student        
4    20 Female Secondary         Student        
5    55 Female Primary           Trader--Farmer

Practice

With the yaounde data frame, select the columns between symptoms and sequelae, inclusive. (“Inclusive” means you should also include symptoms and sequelae in the selection.)

9.4.2 Excluding columns with `!`

The exclamation point negates a selection:

yao %>% select(!age) # Select all columns except `age`

# A tibble: 5 × 7
  sex   highest_education occupation is_smoker is_pregnant igg_result igm_result
  <chr> <chr>             <chr>      <chr>     <chr>       <chr>      <chr>     
1 Fema… Secondary         Informal … Non-smok… No          Negative   Negative  
2 Male  University        Salaried … Ex-smoker <NA>        Positive   Negative  
3 Male  University        Student    Smoker    <NA>        Negative   Negative  
4 Fema… Secondary         Student    Non-smok… No          Positive   Negative  
5 Fema… Primary           Trader--F… Non-smok… No          Positive   Negative

To drop a range of consecutive columns, we use, for example,!age:occupation:

yao %>% select(!age:occupation) # Drop columns from `age` to `occupation`

# A tibble: 5 × 4
  is_smoker  is_pregnant igg_result igm_result
  <chr>      <chr>       <chr>      <chr>     
1 Non-smoker No          Negative   Negative  
2 Ex-smoker  <NA>        Positive   Negative  
3 Smoker     <NA>        Negative   Negative  
4 Non-smoker No          Positive   Negative  
5 Non-smoker No          Positive   Negative

To drop several non-consecutive columns, place them inside !c():

yao %>% select(!c(age, sex, igg_result))

# A tibble: 5 × 5
  highest_education occupation      is_smoker  is_pregnant igm_result
  <chr>             <chr>           <chr>      <chr>       <chr>     
1 Secondary         Informal worker Non-smoker No          Negative  
2 University        Salaried worker Ex-smoker  <NA>        Negative  
3 University        Student         Smoker     <NA>        Negative  
4 Secondary         Student         Non-smoker No          Negative  
5 Primary           Trader--Farmer  Non-smoker No          Negative

Practice

From the yaounde data frame, remove all columns between highest_education and consultation, inclusive.

9.5 Helper functions for `select()`

dplyr has a number of helper functions to make selecting easier by using patterns from the column names. Let’s take a look at some of these.

9.5.1 `starts_with()` and `ends_with()`

These two helpers work exactly as their names suggest!

yao %>% select(starts_with("is_")) # Columns that start with "is"

# A tibble: 5 × 2
  is_smoker  is_pregnant
  <chr>      <chr>      
1 Non-smoker No         
2 Ex-smoker  <NA>       
3 Smoker     <NA>       
4 Non-smoker No         
5 Non-smoker No

yao %>% select(ends_with("_result")) # Columns that end with "result"

# A tibble: 5 × 2
  igg_result igm_result
  <chr>      <chr>     
1 Negative   Negative  
2 Positive   Negative  
3 Negative   Negative  
4 Positive   Negative  
5 Positive   Negative

9.5.2 `contains()`

contains() helps select columns that contain a certain string:

yaounde %>% select(contains("drug")) # Columns that contain the string "drug"

# A tibble: 5 × 12
  drugsource       is_drug_parac is_drug_antibio is_drug_hydrocortisone
  <chr>                    <dbl>           <dbl>                  <dbl>
1 Self or familial             1               0                      0
2 <NA>                        NA              NA                     NA
3 <NA>                        NA              NA                     NA
4 Self or familial             0               1                      0
5 <NA>                        NA              NA                     NA
# ℹ 8 more variables: is_drug_other_anti_inflam <dbl>, is_drug_antiviral <dbl>,
#   is_drug_chloro <dbl>, is_drug_tradn <dbl>, is_drug_oxygen <dbl>,
#   is_drug_other <dbl>, is_drug_no_resp <dbl>, is_drug_none <dbl>

9.5.3 `everything()`

Another helper function, everything(), matches all variables that have not yet been selected.

## First, `is_pregnant`, then every other column.
yao %>% select(is_pregnant, everything())

# A tibble: 5 × 8
  is_pregnant   age sex    highest_education occupation     is_smoker igg_result
  <chr>       <dbl> <chr>  <chr>             <chr>          <chr>     <chr>     
1 No             45 Female Secondary         Informal work… Non-smok… Negative  
2 <NA>           55 Male   University        Salaried work… Ex-smoker Positive  
3 <NA>           23 Male   University        Student        Smoker    Negative  
4 No             20 Female Secondary         Student        Non-smok… Positive  
5 No             55 Female Primary           Trader--Farmer Non-smok… Positive  
# ℹ 1 more variable: igm_result <chr>

It is often useful for establishing the order of columns.

Say we wanted to bring the is_pregnant column to the start of the yao data frame, we could type out all the column names manually:

yao %>% select(is_pregnant, 
               age, 
               sex, 
               highest_education, 
               occupation, 
               is_smoker, 
               igg_result, 
               igm_result)

# A tibble: 5 × 8
  is_pregnant   age sex    highest_education occupation     is_smoker igg_result
  <chr>       <dbl> <chr>  <chr>             <chr>          <chr>     <chr>     
1 No             45 Female Secondary         Informal work… Non-smok… Negative  
2 <NA>           55 Male   University        Salaried work… Ex-smoker Positive  
3 <NA>           23 Male   University        Student        Smoker    Negative  
4 No             20 Female Secondary         Student        Non-smok… Positive  
5 No             55 Female Primary           Trader--Farmer Non-smok… Positive  
# ℹ 1 more variable: igm_result <chr>

But this would be painful for larger data frames, such as our original yaounde data frame. In such a case, we can use everything():

## Bring `is_pregnant` to the front of the data frame
yaounde %>% select(is_pregnant, everything())

# A tibble: 5 × 53
  is_pregnant id           date_surveyed   age age_category age_category_3 sex  
  <chr>       <chr>        <date>        <dbl> <chr>        <chr>          <chr>
1 No          BRIQUETERIE… 2020-10-22       45 45 - 64      Adult          Fema…
2 <NA>        BRIQUETERIE… 2020-10-24       55 45 - 64      Adult          Male 
3 <NA>        BRIQUETERIE… 2020-10-24       23 15 - 29      Adult          Male 
4 No          BRIQUETERIE… 2020-10-22       20 15 - 29      Adult          Fema…
5 No          BRIQUETERIE… 2020-10-22       55 45 - 64      Adult          Fema…
# ℹ 46 more variables: highest_education <chr>, occupation <chr>,
#   weight_kg <dbl>, height_cm <dbl>, is_smoker <chr>, is_medicated <chr>,
#   neighborhood <chr>, household_with_children <chr>, breadwinner <chr>,
#   source_of_revenue <chr>, has_contact_covid <chr>, igg_result <chr>,
#   igm_result <chr>, symptoms <chr>, symp_fever <chr>, symp_headache <chr>,
#   symp_cough <chr>, symp_rhinitis <chr>, symp_sneezing <chr>,
#   symp_fatigue <chr>, symp_muscle_pain <chr>, …

This helper can be combined with many others.

## Bring columns that end with "result" to the front of the data frame
yaounde %>% select(ends_with("result"), everything())

# A tibble: 5 × 53
  igg_result igm_result id       date_surveyed   age age_category age_category_3
  <chr>      <chr>      <chr>    <date>        <dbl> <chr>        <chr>         
1 Negative   Negative   BRIQUET… 2020-10-22       45 45 - 64      Adult         
2 Positive   Negative   BRIQUET… 2020-10-24       55 45 - 64      Adult         
3 Negative   Negative   BRIQUET… 2020-10-24       23 15 - 29      Adult         
4 Positive   Negative   BRIQUET… 2020-10-22       20 15 - 29      Adult         
5 Positive   Negative   BRIQUET… 2020-10-22       55 45 - 64      Adult         
# ℹ 46 more variables: sex <chr>, highest_education <chr>, occupation <chr>,
#   weight_kg <dbl>, height_cm <dbl>, is_smoker <chr>, is_pregnant <chr>,
#   is_medicated <chr>, neighborhood <chr>, household_with_children <chr>,
#   breadwinner <chr>, source_of_revenue <chr>, has_contact_covid <chr>,
#   symptoms <chr>, symp_fever <chr>, symp_headache <chr>, symp_cough <chr>,
#   symp_rhinitis <chr>, symp_sneezing <chr>, symp_fatigue <chr>,
#   symp_muscle_pain <chr>, symp_nausea_or_vomiting <chr>, …

Practice

Select all columns in the yaounde data frame that start with “is_”.

Move the columns that start with “is_” to the beginning of the yaounde data frame.

9.6 Change column names with `rename()`

Fig: the `rename()` function. (Drawing adapted from Allison Horst)

dplyr::rename() is used to change column names:

## Rename `age` and `sex` to `patient_age` and `patient_sex`
yaounde %>% 
  rename(patient_age = age, 
         patient_sex = sex)

# A tibble: 5 × 53
  id           date_surveyed patient_age age_category age_category_3 patient_sex
  <chr>        <date>              <dbl> <chr>        <chr>          <chr>      
1 BRIQUETERIE… 2020-10-22             45 45 - 64      Adult          Female     
2 BRIQUETERIE… 2020-10-24             55 45 - 64      Adult          Male       
3 BRIQUETERIE… 2020-10-24             23 15 - 29      Adult          Male       
4 BRIQUETERIE… 2020-10-22             20 15 - 29      Adult          Female     
5 BRIQUETERIE… 2020-10-22             55 45 - 64      Adult          Female     
# ℹ 47 more variables: highest_education <chr>, occupation <chr>,
#   weight_kg <dbl>, height_cm <dbl>, is_smoker <chr>, is_pregnant <chr>,
#   is_medicated <chr>, neighborhood <chr>, household_with_children <chr>,
#   breadwinner <chr>, source_of_revenue <chr>, has_contact_covid <chr>,
#   igg_result <chr>, igm_result <chr>, symptoms <chr>, symp_fever <chr>,
#   symp_headache <chr>, symp_cough <chr>, symp_rhinitis <chr>,
#   symp_sneezing <chr>, symp_fatigue <chr>, symp_muscle_pain <chr>, …

Watch Out

The fact that the new name comes first in the function (rename(NEWNAME = OLDNAME)) is sometimes confusing. You should get used to this with time.

9.6.1 Rename within `select()`

You can also rename columns while selecting them:

## Select `age` and `sex`, and rename them to `patient_age` and `patient_sex`
yaounde %>% 
  select(patient_age = age, 
         patient_sex = sex)

# A tibble: 5 × 2
  patient_age patient_sex
        <dbl> <chr>      
1          45 Female     
2          55 Male       
3          23 Male       
4          20 Female     
5          55 Female

9.7 Wrap up

I hope this first lesson has allowed you to see how intuitive and useful the {dplyr} verbs are! This is the first of a series of basic data wrangling verbs: see you in the next lesson to learn more.

LAURE VANCAUWENBERGHE
Data analyst, the GRAPH Network
A firm believer in science for good, striving to ally programming, health and education

ANDREE VALLE CAMPOS
R Developer and Instructor, the GRAPH Network
Motivated by reproducible science and education

KENE DAVID NWOSU
Data analyst, the GRAPH Network
Passionate about world improvement

References

Some material in this lesson was adapted from the following sources:

Horst, A. (2021). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original work published 2020)
Subset columns using their names and types—Select. (n.d.). Retrieved 31 December 2021, from https://dplyr.tidyverse.org/reference/select.html

Artwork was adapted from:

Horst, A. (2021). R & stats illustrations by Allison Horst. https://github.com/allisonhorst/stats-illustrations (Original work published 2018)

9.8 Solutions

.SOLUTION_Q_weight_height()

 
  yaounde %>% select(weight_kg, height_cm)

.SOLUTION_Q_cols_16_22()

 
  yaounde %>% select(16, 22)

.SOLUTION_Q_symp_to_sequel()

 
  yaounde %>% select(symptoms:sequelae)

.SOLUTION_Q_educ_consult()

 
  yaounde %>% select(!c(highest_education:consultation))

.SOLUTION_Q_starts_with_is()

 
  yaounde %>% select(starts_with("is"))

.SOLUTION_Q_rearrange()

 
  yaounde %>% select(starts_with("is_"), everything())

9.1 Introduction

9.2 Learning objectives

9.3 The Yaounde COVID-19 dataset

9.4 Introducing select()

9.4.1 Selecting column ranges with :

9.4.2 Excluding columns with !

9.5 Helper functions for select()

9.5.1 starts_with() and ends_with()

9.5.2 contains()

9.5.3 everything()

9.6 Change column names with rename()

9.6.1 Rename within select()

9.7 Wrap up

References

9.8 Solutions

9.4 Introducing `select()`

9.4.1 Selecting column ranges with `:`

9.4.2 Excluding columns with `!`

9.5 Helper functions for `select()`

9.5.1 `starts_with()` and `ends_with()`

9.5.2 `contains()`

9.5.3 `everything()`

9.6 Change column names with `rename()`

9.6.1 Rename within `select()`