MetaSRA: Downloading sequence data for selected samples

The MetaSRA website allows you to download the raw and standardized metadata for a set of SRA samples, selected via the MetaSRA’s query interface. In addition, it provides SRA accessions for sequencing studies, experiments, and runs associated with the selected samples. These accessions may be used to retrieve raw or processed sequencing data from the SRA or other databases. Below, we provide examples of how to access such data.

Raw sequence data via the SRA toolkit

The SRA Toolkit is the primary low-level interface for downloading raw sequence data from the SRA. The SRA Tookit tools are run from the command line. These tools operate on SRA run accessions. After querying the MetaSRA, click on the “Download” button and then select “Runs: ID List” to download a file containing a list of run accessions associated with the selected samples. To retrieve FASTQ-formatted sequence data for these runs, you will use the fastq-dump tool from the command-line. This tool accepts multiple run accessions as arguments. To give all of the run accessions in the run list file to the fastq-dump command, you can use the xargs command:

cat run_list.txt | xargs fastq-dump

## Read 10198022 spots for SRR091670
## Written 10198022 spots for SRR091670
## Read 10676183 spots for SRR091671
## Written 10676183 spots for SRR091671
## Read 14433257 spots for SRR091672
## Written 14433257 spots for SRR091672
## Read 13355599 spots for SRR091673
## Written 13355599 spots for SRR091673
## Read 48663061 spots total
## Written 48663061 spots total

The fastq-dump tool will download the sequence data from the SRA and convert it to FASTQ format. After running the tool, you will find a number of FASTQ files in your current directory:

ls *.fastq

## SRR091670.fastq
## SRR091671.fastq
## SRR091672.fastq
## SRR091673.fastq

Raw sequence data via the SRAdb R package

The SRAdb R package within Bioconductor provides a convenient interface within R to download raw sequence data from the SRA. To use this package, you must first download the SRAdb database (if you have not done so already) and create a connection to this database:

library(SRAdb)
sqlfile <- 'SRAmetadb.sqlite'
if(!file.exists('SRAmetadb.sqlite')) sqlfile <<- getSRAdbFile()
sra_con <- dbConnect(SQLite(),sqlfile)

To download sequence data files, you will need a list of SRA run accessions for your selected samples. You can either obtain this list by downloading the “Run list” file as described in the previous section, or we can access this list programmatically by downloading the information directly into an R data frame:

api_url <- "http://metasra.biostat.wisc.edu/api/v01/"
query <- "?and=UBERON:0002371,DOID:9952,CL:0000084&sampletype=primary%20cells&species=human&assay=RNA-seq"
experiments_url <- paste0(api_url, "runs.csv", query)
samples_url <- paste0(api_url, "samples.csv", query)

experiments <- read.csv(experiments_url, stringsAsFactors=FALSE)
samples <- read.csv(samples_url, stringsAsFactors=FALSE)

run_accessions <- experiments$sra_run_id

To download FASTQ files from the EBI for the runs, you can use the getSRAfile command:

getSRAfile(run_accessions, sra_con, fileType='fastq')

Processed RNA-seq data via recount

The recount project provides uniformly processed RNA-seq data in the form of gene, exon, and junction counts for many of the human samples assayed by RNA-seq in the SRA. These processed data can be accessed manually via the recount website or programatically via the recount Bioconductor package. The recount data are grouped by study, and therefore to obtain processed data for a set of runs, one must first retrieve the processed data for the studies that include those runs.

library(recount)

## Our experiments table contains runs from a single study
study_accession <- experiments$sra_study_id[1]

## Download the gene-level RangedSummarizedExperiment data for this study
download_study(study_accession)

## Load the gene-level RangedSummarizedExperiment data,
## referenced by variable rse_gene
load(file.path(study_accession, 'rse_gene.Rdata'))
colnames(rse_gene)

##  [1] "SRR091652" "SRR091653" "SRR091654" "SRR091655" "SRR091656"
##  [6] "SRR091657" "SRR091658" "SRR091659" "SRR091660" "SRR091661"
## [11] "SRR091662" "SRR091663" "SRR091664" "SRR091665" "SRR091666"
## [16] "SRR091667" "SRR091668" "SRR091669" "SRR091670" "SRR091671"
## [21] "SRR091672" "SRR091673" "SRR627491" "SRR627493" "SRR627494"
## [26] "SRR627495" "SRR627496" "SRR627497" "SRR627498" "SRR627499"
## [31] "SRR627500" "SRR627501" "SRR627502" "SRR627503" "SRR627504"

To construct a dataset with only data from our selected runs, we can subset the study-level data using our list of run accessions.

my_runs_rse_gene <- rse_gene[, run_accessions]
colnames(my_runs_rse_gene)

## [1] "SRR091670" "SRR091671" "SRR091672" "SRR091673"

To access the raw counts, we pull out the “counts” assay of the RangedSummarizedExperiment object.

head(assay(my_runs_rse_gene))

##                    SRR091670 SRR091671 SRR091672 SRR091673
## ENSG00000000003.14       180        18         0        36
## ENSG00000000005.5          0        18         0         0
## ENSG00000000419.12       846      2412       901       702
## ENSG00000000457.13       348      1182       374        18
## ENSG00000000460.16       396      1242      1224       360
## ENSG00000000938.12      1026       990     37451       504

Processed RNA-seq data via refine.bio

The refine.bio project provides uniformly processed RNA-seq and microarray data from publicly-available sources, including the SRA. The MetaSRA provides the ability to create a dataset of processed RNA-seq data from refine.bio for a selected set of studies. After defining a set of studies through a MetaSRA query, click on the “Download” button and then the “Download from refine.bio” button in the popup to access the processed dataset from refine.bio. To download the data at refine.bio, click the "Move to Dataset" button. Once pressed, a "Download" button will appear that will enable you to download the dataset.

Once the dataset (as a zip file) has been downloaded and unzipped, the processed data and metadata will be available as individual files for each study (referred to as an “experiment” by refine.bio). A full description of the structure and content of a refine.bio download can be found in the refine.bio documentation. In many applications, a single count matrix, aggregated across all studies, is desired. Below is a function that will construct such a matrix given a path to a refine.bio download.

count_matrix_from_refinebio_download <- function(path) {
  studies <- list.dirs(path, full.names = FALSE, recursive = FALSE)
  count_tsv_files <- setNames(file.path(path, studies, paste0(studies, ".tsv")), studies)
  count_tables <- lapply(count_tsv_files, read.delim, row.names="Gene")
  gene_sets <- lapply(count_tables, rownames)
  common_genes <- Reduce(intersect, gene_sets)
  common_gene_count_tables <- lapply(count_tables, `[`, common_genes, , drop=FALSE)
  count_matrix <- data.frame(unname(common_gene_count_tables))
}

As an example, the sample query for human, RNA-seq samples with the terms “glioblastoma multiforme” and “brain” yields this refine.bio dataset. After downloading this dataset and unzipping it to the path “40d888ff-e58e-403f-b6c7-83d747f3c202” within the current working directory, we can extract the count matrix with the following function call:

refinebio_dataset_path <- "40d888ff-e58e-403f-b6c7-83d747f3c202"
count_matrix <- count_matrix_from_refinebio_download(refinebio_dataset_path)
head(count_matrix[,1:5])

##                 ERR2534072 ERR2534073 ERR2534074 ERR2534075 ERR2534076
## ENSG00000000003  5.5648550  5.7440350  7.7696204  7.6729731  8.1854119
## ENSG00000000005  0.1454237  0.1513254  0.1503293  0.1567658  0.1390681
## ENSG00000000419  5.5054798  5.3964631  5.9045027  5.7387685  5.3954875
## ENSG00000000457  2.6892826  2.6733949  2.5772988  2.6284545  2.7869851
## ENSG00000000460  4.6851534  5.0158943  4.1982188  4.8529580  3.5995593
## ENSG00000000938  0.8295156  0.1513254  0.1503293  0.1567658  0.1390681

There are few aspects of the refine.bio processed data that are worth noting:

Not all studies that are in MetaSRA have processed data in refine.bio. After clicking the “Download” button and waiting for the refine.bio links to be generated, the popup will specify how many of the selected studies will appear in the refine.bio dataset.
Different studies (experiments in refine.bio) may have been processed with slightly different pipelines by refine.bio, resulting in possibly different gene sets across studies. The provided count_matrix_from_refinebio_download function creates a single count matrix using the intersection of the gene sets from all studies in the refine.bio dataset.

Downloading sequence data for MetaSRA selected samples

Raw sequence data via the SRA toolkit

Raw sequence data via the SRAdb R package

Processed RNA-seq data via recount

Processed RNA-seq data via refine.bio