Load the main library to manipulate tables.

R> library(tidyverse)

Set up files

R> # download.file("https://github.com/pilm-bioinformatics/pilmbc211-rnaseq-ii/raw/master/salmon.tar.gz", "salmon.tar.gz")
R> system("tar -xvf salmon.tar.gz")

Load files

Normally we need two types of files:

metadata: information for each sample
expression: abundance for each gene

We will use an extra file that will help to have more information for each gene:

row data: table with gene id and gene name

Samples metadata

R> metadata = read.csv("salmon/metadata.csv", row.names = 1)
R> metadata

Gene abundance

Typically, there are two type of pipelines:

Quantification from BAM files with genomic coordinates: featureCounts
Quantification from FASTQ files and transcriptome FASTA: salmon

We end up with a table named counts that contains the gene abundances.

FeatureCounts pipeline

Usually you have one file, where columns are samples and rows are genes.

R> counts = read.csv("salmon/merged_gene_reads.csv", row.names = 1) %>% 
+     as.matrix() %>% 
+     round()
R> head(counts)

##                    SRR9166077_E12.5 SRR9166078_E13.5 SRR9166085_D3
## ENSMUSG00000000001               49               21            32
## ENSMUSG00000000003                0                0             0
## ENSMUSG00000000028               33               30            13
## ENSMUSG00000000031             2090             2945          7812
## ENSMUSG00000000037                1                0             0
## ENSMUSG00000000049               64               78          2662
##                    SRR9166086_D5
## ENSMUSG00000000001            41
## ENSMUSG00000000003             0
## ENSMUSG00000000028             8
## ENSMUSG00000000031          9530
## ENSMUSG00000000037             1
## ENSMUSG00000000049          2947

Salmon pipeline

There is a folder for each sample with information of the quantification at transcript level. We need to load all the samples and trasnform to gene quantification. LINK

R> library(tximport)
R> tx2gene = read.csv("salmon/tx2gene.csv", col.names = c("tx", "gene_id", "gene_name"))
R> # files = list.files("salmon", pattern = "quant.sf",
R> #                    recursive = T, full.names = T)
R> # names(files) = basename(dirname(files))
R> files = as.character(metadata$files)
R> names(files) = rownames(metadata)
R> txi = tximport(files, type = "salmon", tx2gene = tx2gene)
R> counts = txi$abundance
R> head(counts)

##                    SRR9166077_E12.5 SRR9166078_E13.5 SRR9166085_D3
## ENSMUSG00000000001          49.2751          21.1011       31.6676
## ENSMUSG00000000003           0.0000           0.0000        0.0000
## ENSMUSG00000000028          33.4757          30.0024       13.4473
## ENSMUSG00000000031        2090.0798        2945.1847     7812.2230
## ENSMUSG00000000037           0.7831           0.3421        0.3153
## ENSMUSG00000000049          64.3562          77.8887     2662.3920
##                    SRR9166086_D5
## ENSMUSG00000000001        41.253
## ENSMUSG00000000003         0.000
## ENSMUSG00000000028         7.801
## ENSMUSG00000000031      9529.720
## ENSMUSG00000000037         0.661
## ENSMUSG00000000049      2947.200

Prepare gene data

Create a table that has gene_id and gene_name and it is in the same order than the counts table.

R> rows = data.frame(tx2gene[,2:3]) %>% 
+     distinct()
R> rownames(rows) = rows$gene_id
R> rows = rows[rownames(counts),]
R> head(rows)

Create Summarized Object

Full documentation is at Bioc.

SummarizedExperiment

Load into object

R> library(SummarizedExperiment)
R> gse = SummarizedExperiment(assays = list(counts = counts),
+                           colData = metadata,
+                           rowData = rows)
R> dir.create("data", showWarnings = FALSE)
R> write_rds(gse, "data/gse.rds")
R> gse

## class: SummarizedExperiment 
## dim: 54459 4 
## metadata(0):
## assays(1): counts
## rownames(54459): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000118392 ENSMUSG00000118393
## rowData names(2): gene_id gene_name
## colnames(4): SRR9166077_E12.5 SRR9166078_E13.5 SRR9166085_D3
##   SRR9166086_D5
## colData names(3): files names group

RDS files are R data files with the object that can be loaded in other R session.

Next lesson

These materials have been developed by members of the teaching team at the PILM - MIT Bioinformatics Core (PILMBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Load files from RNAseq pipelines

Lorena Pantano

2019-07-08