Load the main library to manipulate tables.
Normally we need two types of files:
We will use an extra file that will help to have more information for each gene:
Typically, there are two type of pipelines:
featureCounts
salmon
We end up with a table named counts
that contains the gene abundances.
Usually you have one file, where columns are samples and rows are genes.
R> counts = read.csv("salmon/merged_gene_reads.csv", row.names = 1) %>%
+ as.matrix() %>%
+ round()
R> head(counts)
## SRR9166077_E12.5 SRR9166078_E13.5 SRR9166085_D3
## ENSMUSG00000000001 49 21 32
## ENSMUSG00000000003 0 0 0
## ENSMUSG00000000028 33 30 13
## ENSMUSG00000000031 2090 2945 7812
## ENSMUSG00000000037 1 0 0
## ENSMUSG00000000049 64 78 2662
## SRR9166086_D5
## ENSMUSG00000000001 41
## ENSMUSG00000000003 0
## ENSMUSG00000000028 8
## ENSMUSG00000000031 9530
## ENSMUSG00000000037 1
## ENSMUSG00000000049 2947
There is a folder for each sample with information of the quantification at transcript level. We need to load all the samples and trasnform to gene quantification. LINK
R> library(tximport)
R> tx2gene = read.csv("salmon/tx2gene.csv", col.names = c("tx", "gene_id", "gene_name"))
R> # files = list.files("salmon", pattern = "quant.sf",
R> # recursive = T, full.names = T)
R> # names(files) = basename(dirname(files))
R> files = as.character(metadata$files)
R> names(files) = rownames(metadata)
R> txi = tximport(files, type = "salmon", tx2gene = tx2gene)
R> counts = txi$abundance
R> head(counts)
## SRR9166077_E12.5 SRR9166078_E13.5 SRR9166085_D3
## ENSMUSG00000000001 49.2751 21.1011 31.6676
## ENSMUSG00000000003 0.0000 0.0000 0.0000
## ENSMUSG00000000028 33.4757 30.0024 13.4473
## ENSMUSG00000000031 2090.0798 2945.1847 7812.2230
## ENSMUSG00000000037 0.7831 0.3421 0.3153
## ENSMUSG00000000049 64.3562 77.8887 2662.3920
## SRR9166086_D5
## ENSMUSG00000000001 41.253
## ENSMUSG00000000003 0.000
## ENSMUSG00000000028 7.801
## ENSMUSG00000000031 9529.720
## ENSMUSG00000000037 0.661
## ENSMUSG00000000049 2947.200
Create a table that has gene_id
and gene_name
and it is in the same order than the counts
table.
R> rows = data.frame(tx2gene[,2:3]) %>%
+ distinct()
R> rownames(rows) = rows$gene_id
R> rows = rows[rownames(counts),]
R> head(rows)
Full documentation is at Bioc.
R> library(SummarizedExperiment)
R> gse = SummarizedExperiment(assays = list(counts = counts),
+ colData = metadata,
+ rowData = rows)
R> dir.create("data", showWarnings = FALSE)
R> write_rds(gse, "data/gse.rds")
R> gse
## class: SummarizedExperiment
## dim: 54459 4
## metadata(0):
## assays(1): counts
## rownames(54459): ENSMUSG00000000001 ENSMUSG00000000003 ...
## ENSMUSG00000118392 ENSMUSG00000118393
## rowData names(2): gene_id gene_name
## colnames(4): SRR9166077_E12.5 SRR9166078_E13.5 SRR9166085_D3
## SRR9166086_D5
## colData names(3): files names group
RDS
files are R data files with the object that can be loaded in other R session.
These materials have been developed by members of the teaching team at the PILM - MIT Bioinformatics Core (PILMBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.