Publication

The publication describing the MetaSRA can be found in Bioinformatics.

Manually annotated data used in the study

To evaluate the methods described in the paper, we manually annotated raw metadata from the Sequence Read Archive. All raw metadata was retrieved from the SRAdb.

422 samples were manually annotated with mapped ontology terms and real-value properties. This dataset was used as a test set to evaluate the ontology mapping pipeline. More specifically, this data was used as ground truth for the results shown in Figure 4.

701 samples were manually labelled with their sample types by consulting their publications and other external descriptions of the experiments in which they were used. This dataset was used to train the sample-type classifier.

367 samples were used to evaluate the sample-type classifier's recall on each category and to generate the row-normalized confusion matrix and reliability diagram in Figures 5.A and 5.B. These samples are a subset of the 422 samples used to evaluate the ontology term mapping pipeline.
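For readers unfamiliar with reliability diagrams, the following is a minimal sketch of how the points of such a diagram can be computed. It is not the code used in the study. The function name `reliability_bins`, the equal-width binning, and the toy confidences are all assumptions made for illustration.

```python
def reliability_bins(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins.

    Returns one (mean confidence, accuracy, count) tuple per non-empty bin.
    A well-calibrated classifier has accuracy close to mean confidence.
    """
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sums[idx] += conf
        hits[idx] += int(is_correct)
        counts[idx] += 1
    return [
        (conf_sums[i] / counts[i], hits[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


# Toy data (hypothetical): the classifier's confidence in its predicted
# sample type, and whether that prediction matched the manual label.
toy_confidences = [0.95, 0.9, 0.55, 0.5, 0.15]
toy_correct = [True, True, True, False, False]

bins = reliability_bins(toy_confidences, toy_correct)
for mean_conf, accuracy, count in bins:
    print(f"mean confidence {mean_conf:.3f}  accuracy {accuracy:.2f}  n={count}")
```

Plotting accuracy against mean confidence for each bin gives the diagram; points near the diagonal indicate that the reported confidences can be taken at face value.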

552 samples were used to evaluate the sample-type classifier's precision on each category and to generate the transpose of the column-normalized confusion matrix in Figure 5.C.
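The relationship between the two normalizations and the two metrics can be sketched as follows. This is an illustrative example, not the evaluation code from the study. The helper names, the three category labels, and the toy predictions are assumptions. Rows index the true sample type and columns the predicted one, so the diagonal of the row-normalized matrix holds per-category recall and the diagonal of the column-normalized matrix holds per-category precision.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Count matrix with rows = true category, columns = predicted category."""
    index = {category: i for i, category in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


def row_normalize(matrix):
    """Divide each row by its sum (rows that sum to zero stay all zero)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def column_normalize(matrix):
    """Divide each column by its sum, via two transposes."""
    return transpose(row_normalize(transpose(matrix)))


# Toy labels (hypothetical), not taken from the annotated datasets.
categories = ["cell line", "tissue", "primary cells"]
true_labels = ["cell line", "cell line", "cell line",
               "tissue", "tissue", "primary cells", "primary cells"]
predicted_labels = ["cell line", "cell line", "tissue",
                    "tissue", "tissue", "primary cells", "tissue"]

counts = confusion_matrix(true_labels, predicted_labels, categories)
by_truth = row_normalize(counts)
by_prediction = transpose(column_normalize(counts))

recall = [by_truth[i][i] for i in range(len(categories))]
precision = [by_prediction[i][i] for i in range(len(categories))]
```

Transposing the column-normalized matrix, as in `by_prediction`, puts the predicted category on the rows, so each row sums to one and shows how the samples assigned to that category are distributed over the true categories.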

Cell line metadata

MetaSRA uses cell line information scraped from the ATCC website. We mapped cell lines found in the Cellosaurus to ontology terms: