The MetaSRA website allows you to download the raw and standardized metadata for a set of SRA samples, selected via the MetaSRA’s query interface. In addition, it provides SRA accessions for sequencing studies, experiments, and runs associated with the selected samples. These accessions may be used to retrieve raw or processed sequencing data from the SRA or other databases. Below, we provide examples of how to access such data.

Raw sequence data via the SRA toolkit

The SRA Toolkit is the primary low-level interface for downloading raw sequence data from the SRA. Its tools are run from the command line and operate on SRA run accessions. After querying the MetaSRA, click the “Download” button and select “Run list” to download a file containing the run accessions associated with the selected samples. To retrieve FASTQ-formatted sequence data for these runs, use the fastq-dump tool. It accepts multiple run accessions as arguments, so you can pass every accession in the run list file to fastq-dump with the xargs command:

cat run_list.txt | xargs fastq-dump
## Read 10198022 spots for SRR091670
## Written 10198022 spots for SRR091670
## Read 10676183 spots for SRR091671
## Written 10676183 spots for SRR091671
## Read 14433257 spots for SRR091672
## Written 14433257 spots for SRR091672
## Read 13355599 spots for SRR091673
## Written 13355599 spots for SRR091673
## Read 48663061 spots total
## Written 48663061 spots total
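Run list files sometimes pick up blank lines, Windows line endings, or duplicate entries along the way. The following is a minimal sketch of one way to tidy the list before handing it to xargs. The helper name clean_run_list is hypothetical, and the sketch assumes one accession per line with the usual SRR, ERR, or DRR prefix. The demonstration file is made up, reusing accessions from the example output above.

```shell
# Hypothetical helper: print the unique, well-formed run accessions in a file.
# Assumes one accession per line. Strips carriage returns and surrounding
# whitespace, then keeps only lines that look like SRR/ERR/DRR accessions.
clean_run_list() {
    tr -d '\r' < "$1" \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -E '^[SED]RR[0-9]+$' \
        | sort -u
}

# Small made-up run list containing a CRLF line ending, a blank line,
# leading whitespace, a duplicate, and a malformed entry.
demo_list=$(mktemp)
printf 'SRR091670\r\n\nSRR091671\n  SRR091672\nSRR091670\nnot_an_accession\n' > "$demo_list"

# Dry run: "echo" prints each command instead of executing it.
# Remove "echo" to download, one run per fastq-dump invocation.
clean_run_list "$demo_list" | xargs -n 1 echo fastq-dump

rm -f "$demo_list"
```

Passing one accession per invocation with `xargs -n 1` is a design choice rather than a requirement: it keeps a failure on one run from hiding which accession caused it.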

The fastq-dump tool will download the sequence data from the SRA and convert it to FASTQ format. After running the tool, you will find a number of FASTQ files in your current directory:

ls *.fastq
## SRR091670.fastq
## SRR091671.fastq
## SRR091672.fastq
## SRR091673.fastq
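As a quick sanity check, you can count the records in each FASTQ file and compare the result with the “Written … spots” lines printed by fastq-dump. The sketch below uses a hypothetical helper, count_fastq_reads, and assumes uncompressed files in which every record occupies exactly four lines. If fastq-dump was run with its default options, as above, the count should correspond to the number of spots written. Options that split spots into separate reads would change that relationship.

```shell
# Hypothetical helper: count the records in an uncompressed FASTQ file,
# assuming each record occupies exactly four lines.
count_fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Demonstration on a tiny made-up FASTQ file with two records.
demo_fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$demo_fastq"
count_fastq_reads "$demo_fastq"   # prints 2
rm -f "$demo_fastq"
```

To apply the check to real data, call count_fastq_reads on each of the downloaded .fastq files in turn.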

Raw sequence data via the SRAdb R package

The SRAdb Bioconductor package provides a convenient interface for downloading raw sequence data from the SRA from within R. To use this package, you must first download the SRAdb database (if you have not done so already) and create a connection to this database:

library(SRAdb)
sqlfile <- 'SRAmetadb.sqlite'
## Download the SRAdb SQLite file only if it is not already present
if (!file.exists(sqlfile)) sqlfile <- getSRAdbFile()
sra_con <- dbConnect(SQLite(), sqlfile)

To download sequence data files, you will need a list of SRA run accessions for your selected samples. You can obtain this list either by downloading the “Run list” file as described in the previous section, or programmatically by reading the information directly into an R data frame:

api_url <- "http://metasra.biostat.wisc.edu/api/v01/"
## URLencode() escapes the space in "primary cells" so the URL is valid
query <- URLencode("?and=DOID:9952,CL:0000084&sampletype=primary cells")
experiments_url <- paste0(api_url, "experiments.csv", query)
samples_url <- paste0(api_url, "samples.csv", query)

experiments <- read.csv(experiments_url, stringsAsFactors=FALSE)
samples <- read.csv(samples_url, stringsAsFactors=FALSE)

run_accessions <- experiments$sra_run_id
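If you would rather work on the command line, the same column can be pulled out of a saved copy of experiments.csv and written to a run list for fastq-dump. The sketch below is only illustrative: the helper extract_column is hypothetical, the demonstration file and its study value are made up, and the parsing assumes a header row and no commas inside quoted fields. For real metadata with embedded commas, a proper CSV parser is the safer choice.

```shell
# Hypothetical helper: print the values of one named column of a simple CSV.
# $1: CSV file with a header row; $2: column name (e.g. sra_run_id).
# Assumes no field contains an embedded comma.
extract_column() {
    awk -F',' -v col="$2" '
        NR == 1 {
            # Locate the requested column in the header row.
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/"/, "", name)
                if (name == col) idx = i
            }
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        {
            # Print the unquoted value, skipping empty cells.
            value = $idx
            gsub(/"/, "", value)
            if (value != "") print value
        }
    ' "$1"
}

# Made-up two-column file; only the column names are taken from the R code above.
demo_csv=$(mktemp)
printf '"sra_study_id","sra_run_id"\n"SRP_EXAMPLE","SRR091670"\n"SRP_EXAMPLE","SRR091671"\n' > "$demo_csv"
extract_column "$demo_csv" sra_run_id
rm -f "$demo_csv"
```

Redirecting the output to a file such as run_list.txt gives a list in the same one-accession-per-line form used in the previous section.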

To download FASTQ files for these runs from the EBI, you can use the getSRAfile function:

getSRAfile(run_accessions, sra_con, fileType='fastq')

Alternatively, to download SRA-formatted files from the NCBI, change the fileType option to “sra”:

getSRAfile(run_accessions, sra_con, fileType='sra')

You will then need to use the fastq-dump command-line tool, described in the previous section, to convert the SRA files into your desired format.
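The conversion step can be sketched as a loop over the downloaded files. This is a minimal illustration, not a prescribed workflow: the helper convert_sra_files and the DRY_RUN switch are hypothetical, and the sketch assumes the .sra files sit together in one directory. The demonstration uses empty placeholder files and only prints the commands it would run.

```shell
# Hypothetical helper: run fastq-dump on every .sra file in a directory.
# $1: directory containing .sra files; $2: directory for the FASTQ output.
# When DRY_RUN is set to a non-empty value, commands are printed, not executed.
convert_sra_files() {
    for f in "$1"/*.sra; do
        [ -e "$f" ] || continue   # nothing to do if the glob matched no files
        ${DRY_RUN:+echo} fastq-dump --outdir "$2" "$f"
    done
}

# Demonstration with empty placeholder files named after two example runs.
demo_dir=$(mktemp -d)
: > "$demo_dir/SRR091670.sra"
: > "$demo_dir/SRR091671.sra"

DRY_RUN=1
convert_sra_files "$demo_dir" fastq   # prints one fastq-dump command per file

rm -rf "$demo_dir"
```

Leaving DRY_RUN unset (or empty) makes the same function perform the conversion for real, provided fastq-dump is on your PATH.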

Processed RNA-seq data via recount

The recount project provides uniformly processed RNA-seq data in the form of gene, exon, and junction counts for many of the human samples assayed by RNA-seq in the SRA. These processed data can be accessed manually via the recount website or programmatically via the recount Bioconductor package. The recount data are grouped by study, so to obtain processed data for a set of runs, you must first retrieve the processed data for the studies that include those runs.

library(recount)

## Our experiments table contains runs from a single study
study_accession <- experiments$sra_study_id[1]

## Download the gene-level RangedSummarizedExperiment data for this study
download_study(study_accession)

## Load the gene-level RangedSummarizedExperiment data,
## referenced by variable rse_gene
load(file.path(study_accession, 'rse_gene.Rdata'))
colnames(rse_gene)
##  [1] "SRR091652" "SRR091653" "SRR091654" "SRR091655" "SRR091656"
##  [6] "SRR091657" "SRR091658" "SRR091659" "SRR091660" "SRR091661"
## [11] "SRR091662" "SRR091663" "SRR091664" "SRR091665" "SRR091666"
## [16] "SRR091667" "SRR091668" "SRR091669" "SRR091670" "SRR091671"
## [21] "SRR091672" "SRR091673" "SRR627491" "SRR627493" "SRR627494"
## [26] "SRR627495" "SRR627496" "SRR627497" "SRR627498" "SRR627499"
## [31] "SRR627500" "SRR627501" "SRR627502" "SRR627503" "SRR627504"

To construct a dataset with only data from our selected runs, we can subset the study-level data using our list of run accessions.

my_runs_rse_gene <- rse_gene[, run_accessions]
colnames(my_runs_rse_gene)
## [1] "SRR091670" "SRR091671" "SRR091672" "SRR091673"