Adapted by Lorena Pantano. Original materials at hbctraning
Thanks them on Twitter for their work: @bioinfocore
Approximate time: 30 minutes
The Tidyverse suite of integrated packages are designed to work together to make common data science operations more user friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, R for Data Science, with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as, specific functions for data wrangling with the ‘dplyr’ package and data visualization with the ‘ggplot2’ package.
The Tidyverse suite of packages introduces users to a set of data structures, functions and operators to make working with data more intuitive, but is slightly different from the way we do things in base R. Two important new concepts we will focus on are pipes and tibbles.
Before we get started with pipes or tibbles, let’s load the library:
library(tidyverse)
Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing.
To make R code more human readable, the Tidyverse tools use the pipe, %>%
, which was acquired from the magrittr
package and is now part of the dplyr
package that is installed automatically with Tidyverse. The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.
NOTE: Shortcut to write the pipe is shift + command + M
An example of using the pipe to run multiple commands:
## A single command
sqrt(83)
## [1] 9.11
## Base R method of running more than one command
round(sqrt(83), digit = 2)
## [1] 9.11
## Running more than one command with piping
sqrt(83) %>% round(digit = 2)
## [1] 9.11
The pipe represents a much easier way of writing and deciphering R code, and so we will be taking advantage of it, when possible, as we work through the remaining lesson.
A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame
, with some internal improvements to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have numbers/symbols for column names, which is not normally allowed in base R.
Important: tidyverse is very opininated about row names. These packages insist that all column data (e.g. data.frame
) be treated equally, and that special designation of a column as rownames
should be deprecated. Tibble provides simple utility functions to handle rownames: rownames_to_column()
and column_to_rownames()
.
Tibbles can be created directly using the tibble()
function or data frames can be converted into tibbles using as_tibble(name_of_df)
.
NOTE: The function
as_tibble()
will ignore row names, so if a column representing the row names is needed, then the functionrownames_to_column(name_of_df)
should be run prior to turning the data.frame into a tibble. Also,as_tibble()
will not coerce character vectors to factors by default.
We’re going to explore the Tidyverse suite of tools to wrangle our data to prepare it for visualization.
The dataset:
The functional analysis that we will focus on involves gene ontology (GO) terms, which:
Goal: Visually compare the most significant biological processes (BP) for significance values and number of associated differentially expressed genes.
We are going to use the Tidyverse suite of tools to wrangle and visualize our data through several steps:
While all of the tools in the Tidyverse suite are deserving of being explored in more depth, we are going to investigate more deeply the reading (readr
), wrangling (dplyr
), and plotting (ggplot2
) tools.
While the base R packages have perfectly fine methods for reading in data, the readr
and readxl
Tidyverse packages offer additional methods for reading in data. Let’s read in our tab-delimited functional analysis results using read_delim()
:
dir.create("data", showWarnings = F)
# Read in the functional analysis results
download.file("https://github.com/pilm-bioinformatics/pilmbc104-best-of-r/raw/master/data/gprofiler_results_Mov10oe.tsv", "data/gprofiler_results_Mov10oe.tsv")
functional_GO_results <- read_tsv(file = "data/gprofiler_results_Mov10oe.tsv")
## Parsed with column specification:
## cols(
## query.number = col_double(),
## significant = col_logical(),
## p.value = col_double(),
## term.size = col_double(),
## query.size = col_double(),
## overlap.size = col_double(),
## recall = col_double(),
## precision = col_double(),
## term.id = col_character(),
## domain = col_character(),
## subgraph.number = col_double(),
## term.name = col_character(),
## relative.depth = col_double(),
## intersection = col_character()
## )
download.file("https://github.com/pilm-bioinformatics/pilmbc104-best-of-r/raw/master/data/Mov10oe_DE.csv", "data/Mov10oe_DE.csv")
de_results <- read_csv(file = "data/Mov10oe_DE.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## baseMean = col_double(),
## log2FoldChange = col_double(),
## lfcSE = col_double(),
## stat = col_double(),
## pvalue = col_double(),
## padj = col_double()
## )
# Take a look at the results
functional_GO_results
## # A tibble: 3,644 x 14
## query.number significant p.value term.size query.size overlap.size
## <dbl> <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 1 TRUE 0.00434 111 5850 52
## 2 1 TRUE 0.0033 110 5850 52
## 3 1 TRUE 0.0297 39 5850 21
## 4 1 TRUE 0.0193 70 5850 34
## 5 1 TRUE 0.0148 26 5850 16
## 6 1 TRUE 0.0187 22 5850 14
## 7 1 TRUE 0.0226 48 5850 25
## 8 1 TRUE 0.0491 17 5850 11
## 9 1 TRUE 0.00798 164 5850 71
## 10 1 TRUE 0.0439 19 5850 12
## # … with 3,634 more rows, and 8 more variables: recall <dbl>,
## # precision <dbl>, term.id <chr>, domain <chr>, subgraph.number <dbl>,
## # term.name <chr>, relative.depth <dbl>, intersection <chr>
de_results
## # A tibble: 23,368 x 7
## X1 baseMean log2FoldChange lfcSE stat pvalue padj
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1/2-SBSRNA4 45.7 0.268 0.188 1.42 0.154 0.264
## 2 A1BG 61.1 0.209 0.174 1.20 0.229 0.357
## 3 A1BG-AS1 176. -0.0519 0.124 -0.418 0.676 0.781
## 4 A1CF 0.238 0.0130 0.0491 0.265 0.791 NA
## 5 A2LD1 89.6 0.345 0.160 2.16 0.0310 0.0722
## 6 A2M 5.86 -0.274 0.179 -1.53 0.126 0.226
## 7 A2ML1 2.42 0.240 0.137 1.75 0.0809 NA
## 8 A2MP1 1.32 0.0811 0.101 0.804 0.421 NA
## 9 A4GALT 64.5 0.798 0.172 4.65 0.00000336 0.0000240
## 10 A4GNT 0.191 0.00952 0.0421 0.226 0.821 NA
## # … with 23,358 more rows
Notice that the results were automatically read in as a tibble and the output gives the number of rows, columns and the data type for each of the columns.
NOTE: A large number of tidyverse functions will work with both tibbles and dataframes, and the data structure of the output will be identical to the input. However, there are some functions that will return a tibble (without row names), whether or not a tibble or dataframe is provided.
Now that we have our data, we will need to wrangle it into a format ready for plotting. For all of our data wrangling steps we will be using tools from the dplyr package, which is a swiss-army knife for data wrangling of data frames.
To extract the biological processes of interest, we only want those rows where the domain
is equal to BP
, which we can do using the filter()
function.
To filter rows of a data frame/tibble based on values in different columns, we give a logical expression as input to the filter()
function to return those rows for which the expression is TRUE.
Now let’s return only those rows that have a domain
of BP
:
# Return only GO biological processes
bp_oe <- functional_GO_results %>%
filter(domain == "BP")
# View(bp_oe)
head(bp_oe)
## # A tibble: 6 x 14
## query.number significant p.value term.size query.size overlap.size recall
## <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 TRUE 0.00434 111 5850 52 0.009
## 2 1 TRUE 0.0033 110 5850 52 0.009
## 3 1 TRUE 0.0297 39 5850 21 0.004
## 4 1 TRUE 0.0193 70 5850 34 0.006
## 5 1 TRUE 0.0148 26 5850 16 0.003
## 6 1 TRUE 0.0187 22 5850 14 0.002
## # … with 7 more variables: precision <dbl>, term.id <chr>, domain <chr>,
## # subgraph.number <dbl>, term.name <chr>, relative.depth <dbl>,
## # intersection <chr>
Now we have returned only those rows with a domain
of BP
.
For visualization purposes, we are only interested in the columns related to the GO terms, the significance of the terms, and information about the number of genes associated with the terms.
To extract columns from a data frame/tibble we can use the select()
function. In contrast to base R, we do not need to put the column names in quotes for selection.
# Selecting columns to keep
bp_oe %>%
select(query.number, term.id, term.name, p.value, significant,
query.size, term.size, overlap.size, intersection)
## # A tibble: 1,102 x 9
## query.number term.id term.name p.value significant query.size term.size
## <dbl> <chr> <chr> <dbl> <lgl> <dbl> <dbl>
## 1 1 GO:003… type I i… 0.00434 TRUE 5850 111
## 2 1 GO:003… regulati… 0.0033 TRUE 5850 110
## 3 1 GO:003… negative… 0.0297 TRUE 5850 39
## 4 1 GO:003… positive… 0.0193 TRUE 5850 70
## 5 1 GO:001… cell com… 0.0148 TRUE 5850 26
## 6 1 GO:008… cell com… 0.0187 TRUE 5850 22
## 7 1 GO:003… aorta de… 0.0226 TRUE 5850 48
## 8 1 GO:003… activati… 0.0491 TRUE 5850 17
## 9 1 GO:005… T cell r… 0.00798 TRUE 5850 164
## 10 1 GO:004… tetrahyd… 0.0439 TRUE 5850 19
## # … with 1,092 more rows, and 2 more variables: overlap.size <dbl>,
## # intersection <chr>
The select()
function also allows for negative selection. So we could have alternately removed columns with negative selection. Note that we need to put the column names inside of the combine (c()
) function with a -
preceding it for this functionality.
# DO NOT RUN
# Selecting columns to remove
bp_oe %>%
select(-c(query.number, significant, recall,
precision, subgraph.number, relative.depth))
## # A tibble: 1,102 x 8
## p.value term.size query.size overlap.size term.id domain term.name
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 0.00434 111 5850 52 GO:003… BP type I i…
## 2 0.0033 110 5850 52 GO:003… BP regulati…
## 3 0.0297 39 5850 21 GO:003… BP negative…
## 4 0.0193 70 5850 34 GO:003… BP positive…
## 5 0.0148 26 5850 16 GO:001… BP cell com…
## 6 0.0187 22 5850 14 GO:008… BP cell com…
## 7 0.0226 48 5850 25 GO:003… BP aorta de…
## 8 0.0491 17 5850 11 GO:003… BP activati…
## 9 0.00798 164 5850 71 GO:005… BP T cell r…
## 10 0.0439 19 5850 12 GO:004… BP tetrahyd…
## # … with 1,092 more rows, and 1 more variable: intersection <chr>
Now that we have only the rows and columns of interest, let’s arrange these by significance, which is denoted by the adjusted p-value.
Let’s sort the rows by adjusted p-value with the arrange()
function.
# Order by adjusted p-value ascending
bp_oe %>%
arrange(p.value)
## # A tibble: 1,102 x 14
## query.number significant p.value term.size query.size overlap.size
## <dbl> <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 1 TRUE 1.56e-75 8276 5850 3171
## 2 1 TRUE 3.51e-66 6428 5850 2535
## 3 1 TRUE 6.71e-66 5257 5850 2142
## 4 1 TRUE 2.85e-64 6866 5850 2671
## 5 1 TRUE 3.27e-64 10105 5850 3695
## 6 1 TRUE 1.18e-61 5103 5850 2073
## 7 1 TRUE 5.87e-61 9000 5850 3339
## 8 1 TRUE 2.49e-58 5731 5850 2271
## 9 1 TRUE 4.57e-58 6104 5850 2394
## 10 1 TRUE 7.28e-57 4597 5850 1881
## # … with 1,092 more rows, and 8 more variables: recall <dbl>,
## # precision <dbl>, term.id <chr>, domain <chr>, subgraph.number <dbl>,
## # term.name <chr>, relative.depth <dbl>, intersection <chr>
NOTE: If you wanted to arrange in descending order, then you could have run the following instead:
# Order by adjusted p-value descending
bp_oe <- bp_oe %>%
arrange(desc(p.value))
While not necessary for our visualization, renaming columns more intuitively can help with our understanding of the data using the rename()
function. The syntax is new_name
= old_name
.
Let’s rename the term.id
and term.name
columns.
# Provide better names for columns
bp_oe %>%
dplyr::rename(GO_id = term.id,
GO_term = term.name) %>%
select(GO_id, GO_term)
## # A tibble: 1,102 x 2
## GO_id GO_term
## <chr> <chr>
## 1 GO:0034199 activation of protein kinase A activity
## 2 GO:0043558 regulation of translational initiation in response to stress
## 3 GO:0051031 tRNA transport
## 4 GO:0006370 7-methylguanosine mRNA capping
## 5 GO:0072666 establishment of protein localization to vacuole
## 6 GO:0001961 positive regulation of cytokine-mediated signaling pathway
## 7 GO:1901654 response to ketone
## 8 GO:0033044 regulation of chromosome organization
## 9 GO:0001934 positive regulation of protein phosphorylation
## 10 GO:1903008 organelle disassembly
## # … with 1,092 more rows
NOTE: In the case of two packages with identical function names, you can use
::
with the package name before and the function name after (e.gstats::filter()
) to ensure that the correct function is implemented. The::
can also be used to bring in a function from a library without loading it first.In the example above, we wanted to use the
rename()
function specifically from thedplyr
package, and not any of the other packages (or base R) which may have therename()
function.
Finally, before we plot our data, we need to create a couple of additional metrics. The mutate()
function enables you to create a new column from an existing column.
Let’s generate gene ratios to reflect the number of DE genes associated with each GO process relative to the total number of DE genes.
# Create gene ratio column based on other columns in dataset
bp_oe %>%
mutate(gene_ratio = overlap.size / query.size) %>%
select(gene_ratio, overlap.size, query.size)
## # A tibble: 1,102 x 3
## gene_ratio overlap.size query.size
## <dbl> <dbl> <dbl>
## 1 0.00188 11 5850
## 2 0.00188 11 5850
## 3 0.00308 18 5850
## 4 0.00308 18 5850
## 5 0.00308 18 5850
## 6 0.00308 18 5850
## 7 0.0120 70 5850
## 8 0.00974 57 5850
## 9 0.0528 309 5850
## 10 0.00786 46 5850
## # … with 1,092 more rows
If you have a column that has multiple values, like intersection
that all genes are together separatd by the character ,
.
bp_oe %>%
select(intersection, term.name) %>%
separate_rows(intersection, sep=",")
## # A tibble: 361,143 x 2
## intersection term.name
## <chr> <chr>
## 1 prkar2b activation of protein kinase A activity
## 2 prkaca activation of protein kinase A activity
## 3 prkar1a activation of protein kinase A activity
## 4 prkar2a activation of protein kinase A activity
## 5 adcy7 activation of protein kinase A activity
## 6 prkacb activation of protein kinase A activity
## 7 adcy9 activation of protein kinase A activity
## 8 prkacg activation of protein kinase A activity
## 9 adcy5 activation of protein kinase A activity
## 10 adcy6 activation of protein kinase A activity
## # … with 361,133 more rows
Look at separete
function to split one column in multiples.
left_join
function allows two join to tables by one or multiple columns.
If the name of the column is the same, the parameter looks like this:
by = "gene"
If the name of the column is different in both tables:
by = c("gene" = "gene_name")
If there are multiple columns to use:
by = c("chr", "position", "strand")
de_results %>%
mutate(gene = tolower(X1))
## # A tibble: 23,368 x 8
## X1 baseMean log2FoldChange lfcSE stat pvalue padj gene
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1/2-SBS… 45.7 0.268 0.188 1.42 1.54e-1 2.64e-1 1/2-sb…
## 2 A1BG 61.1 0.209 0.174 1.20 2.29e-1 3.57e-1 a1bg
## 3 A1BG-AS1 176. -0.0519 0.124 -0.418 6.76e-1 7.81e-1 a1bg-a…
## 4 A1CF 0.238 0.0130 0.0491 0.265 7.91e-1 NA a1cf
## 5 A2LD1 89.6 0.345 0.160 2.16 3.10e-2 7.22e-2 a2ld1
## 6 A2M 5.86 -0.274 0.179 -1.53 1.26e-1 2.26e-1 a2m
## 7 A2ML1 2.42 0.240 0.137 1.75 8.09e-2 NA a2ml1
## 8 A2MP1 1.32 0.0811 0.101 0.804 4.21e-1 NA a2mp1
## 9 A4GALT 64.5 0.798 0.172 4.65 3.36e-6 2.40e-5 a4galt
## 10 A4GNT 0.191 0.00952 0.0421 0.226 8.21e-1 NA a4gnt
## # … with 23,358 more rows
left_join(
bp_oe %>%
select(intersection, term.name) %>%
separate_rows(intersection, sep=","),
de_results %>%
mutate(gene = tolower(X1)),
by = c("intersection" = "gene")
)
## # A tibble: 361,143 x 9
## intersection term.name X1 baseMean log2FoldChange lfcSE stat
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 prkar2b activati… PRKA… 978. -0.255 0.0667 -3.82
## 2 prkaca activati… PRKA… 2758. -0.415 0.0655 -6.34
## 3 prkar1a activati… PRKA… 6793. -0.169 0.0422 -4.00
## 4 prkar2a activati… PRKA… 1469. -0.457 0.0600 -7.61
## 5 adcy7 activati… ADCY7 311. 0.534 0.111 4.80
## 6 prkacb activati… PRKA… 1157. -0.528 0.0669 -7.90
## 7 adcy9 activati… ADCY9 923. 0.179 0.0667 2.68
## 8 prkacg activati… PRKA… 105. -0.679 0.152 -4.47
## 9 adcy5 activati… ADCY5 138. 0.501 0.135 3.71
## 10 adcy6 activati… ADCY6 1351. 0.210 0.0620 3.39
## # … with 361,133 more rows, and 2 more variables: pvalue <dbl>, padj <dbl>
Now that we have our results ready for plotting, we can use the ggplot2 package to plot our results.
The ggplot2 package uses a syntax for plotting based on The Grammar for Graphics, and the next lesson will dive into how to use the special graphics syntax to create our detailed custom plots.
If you would like to explore additional Tidyverse packages and functions, we have additional materials available.
Click here to go to next lesson
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.