In RefGene, PIK3CA has only one transcript, NM_006218. Reference [28] systematically compared the human annotations present in RefSeq, Ensembl, and AceView on diverse transcriptomic and genetic analyses. The UCSC Known Genes dataset is based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank, and serves as a foundation for the UCSC Genome Browser. Metagenomic clustering of ocean and human metagenomes using Mash. Additional file 1: Figure S6 includes this Mash tree with five additional mammals of increasing divergence. Figure S6 shows the gene definition difference for PIGY in Ensembl and RefGene, and accordingly explains why the gene quantification results differ dramatically between the two. Acquiring transcriptome expression profiles requires researchers to choose a genome annotation for RNA-Seq data analysis. Increasing this threshold enables the two-stage MinHash filter strategy, which is based on tracking both the k-mer hashes in the current sketch and a secondary set of candidate hashes. Current RNA-Seq approaches use shotgun sequencing technologies such as Illumina, in which millions or even billions of short reads are generated from a randomly fragmented cDNA library. Mash is written in C++ and has been tested on Linux and Mac OS X. AMP conceived the project, designed the methods, and wrote the paper with input from BDO, TJT, SK, and PM. At this point, the sketch max has changed and the candidate set can be pruned to contain only values less than the new sketch maximum.
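The two-stage filter described above (a sketch plus a secondary set of candidate hashes, pruned whenever the sketch maximum drops) can be illustrated with a minimal Python sketch. This is not the Mash implementation: the function names, the use of BLAKE2b as the hash, and the omission of canonical (strand-independent) k-mers are all assumptions made for illustration, and the parameters k, s, and c are toy values.

```python
import hashlib
import heapq


def kmer_hash(kmer: str) -> int:
    """64-bit hash of a k-mer (BLAKE2b is an illustrative stand-in hash)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def filtered_sketch(reads, k=5, s=4, c=2):
    """Bottom-s sketch that only admits hashes observed at least c times.

    sketch     : hashes currently accepted (at most s of them)
    heap       : the same hashes, negated, so heap[0] is the sketch maximum
    candidates : hash -> count for hashes small enough to enter the sketch
                 but not yet seen c times
    """
    sketch, heap, candidates = set(), [], {}
    for read in reads:
        for i in range(len(read) - k + 1):
            h = kmer_hash(read[i:i + k])
            if h in sketch:
                continue
            if len(sketch) == s and h >= -heap[0]:
                continue  # not smaller than the sketch max: can never enter
            count = candidates.get(h, 0) + 1
            if count < c:
                candidates[h] = count  # still below the coverage threshold
                continue
            # Seen c times: promote from the candidate set into the sketch.
            candidates.pop(h, None)
            sketch.add(h)
            heapq.heappush(heap, -h)
            if len(sketch) > s:
                sketch.remove(-heapq.heappop(heap))  # evict the old maximum
            if len(sketch) == s:
                # The sketch max has changed: prune candidates that are now too large.
                new_max = -heap[0]
                candidates = {x: n for x, n in candidates.items() if x < new_max}
    return sorted(sketch)
```

With c = 1 every hash is promoted on first sight, so the function reduces to an ordinary bottom-s sketch; larger c trades memory for the candidate set against robustness to single-copy, likely erroneous k-mers.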
Eukaryotic and plasmid components are shown in Additional file 1: Figures S4 and S5, but would require alternate parameters for species-specific clustering due to their varying characteristics. Figure S2. Thus, Mash avoids the quadratic barrier usually associated with all-pairs comparisons and scales well to many samples. The resulting combined sketch file totaled just 3.4 MB in size, compared to the 20 GB FASTA input. The read mapping summary for 16 tissue samples in the [...] (b) Human metagenomic samples combined from the HMP and MetaHIT projects clustered by Mash from 888 sequencing runs (bottom left) and 879 assemblies (bottom right). In this paper, we systematically characterized the impact of genome annotation choice on read mapping and transcriptome quantification by analyzing an RNA-Seq dataset generated by the Human Body Map 2.0 Project. For human samples, Affymetrix GeneChip HT HG-U133+ PM arrays are one of the most popular microarray platforms for transcriptome profiling, and the genes covered by this chip overlap with RefGene very well, according to Zhao et al. The read length is 75 bp. However, in Ensembl, LUZP6 is only 177 bp long, and is completely within MTPN. The RefSeq genome with the smallest significant distance, with ties broken by P value, was also reported. Lastly, for sketching raw sequencing reads, Mash provides both a two-stage MinHash and a Bloom filter strategy to remove erroneous k-mers. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition.
Mash: fast genome and metagenome distance estimation using MinHash. For assembled genomes, the correct strain was identified as the best hit in a few seconds. As shown in Figure 5, there were many genes for which the number of mapped reads was 0 in one gene model, but large in others. Mash was run in parallel with the same parameters used for the GOS datasets and the resulting sketches were merged with mash paste. Each graph node represents a genome. For small k and large n, there can be a high probability of a random k-mer appearing by chance. The Ensembl Variant Effect Predictor (VEP) determines the effect of variants (SNPs, insertions, deletions, CNVs, or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. In rare cases this strategy resulted in over-separation due to database mislabeling. The corresponding Mash distances were taken from the all-vs-all distance table as described above. (a) Comparison of Global Ocean Survey (GOS) clustering using Mash (top left) and COMMET (top right) using raw Sanger sequencing data. The graph should show a perfect diagonal line if the choice of a gene model has no effect on differential analysis. However, if a new hash is smaller than the current sketch max, it is checked against the candidate set. The two files are sorted by chromosome and gene start, although in the [...]
[...] the fraction of shared k-mers), a P value, and the Mash distance, which estimates the rate of sequence mutation under a simple evolutionary model [22] (see Methods). In the transcriptome only mode, more reads are mapped in Ensembl than in RefGene and UCSC (left panel), and more reads become multiple-mapped in Ensembl than in RefGene and UCSC (right panel). As ANI drops further, the Jaccard index rapidly becomes very small and larger sketches are required for accurate estimates. RefGene clearly has the fewest unique genes, while more than 50% of genes in Ensembl are unique. The percentage of multiple-mapped reads in Ensembl is higher than in RefGene or UCSC. DHS does not endorse any products or commercial services mentioned in this publication. To illustrate the utility of Mash, we sketched and clustered all of NCBI RefSeq Release 70 [25], totaling 54,118 organisms and 618 Gbp of genomic sequence. To further mitigate the problem of erroneous k-mers, Mash can filter low-abundance k-mers from raw sequencing data to improve accuracy. SZ carried out the experimental design, performed the data analysis, and wrote the manuscript. Therefore, to quantify the effect of a gene model on mapping of RNA-Seq reads, we only compared the results from transcriptome only mode with those from the None mode in Stage #2. We build upon past applications of MinHash by deriving a new significance test to differentiate chance matches when searching a database, and derive a new distance metric, the Mash distance, which estimates the mutation rate between two sequences directly from their MinHash sketches. When tested on the Ebola virus MinION dataset, the Zaire ebolavirus reference genome was matched with a Mash P value of 10^-10 after processing the first 227,445 bases of sequencing data, which were collected by the MinION after just 770 s of sequencing.
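The text above says the Mash distance is derived from the fraction of shared k-mers (the Jaccard index j) but does not reproduce the formula. The sketch below assumes the form given in the published Mash Methods, D = -(1/k) ln(2j / (1 + j)); the function name and the clamping to the range [0, 1] are illustrative choices rather than a description of the Mash code.

```python
import math


def mash_distance(j: float, k: int) -> float:
    """Estimate a per-base mutation rate from a Jaccard estimate j and k-mer size k.

    Assumes the published form D = -(1/k) * ln(2j / (1 + j)).
    """
    if not 0.0 <= j <= 1.0:
        raise ValueError("Jaccard index must lie in [0, 1]")
    if j == 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, d))  # clamp away -0.0 and extreme values
```

For example, with k = 21 a Jaccard estimate of 0.5 gives a distance of about 0.019, which shows how quickly the Jaccard index falls as sequences diverge.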
To account for this uncertainty, we applied lowest common ancestor (LCA) classification (see Methods), which was correct in all cases, albeit with reduced resolution. Among 21,958 common genes, the expression of 2038 genes (i.e., 9.3%) differed by 50% or more when choosing one annotation over the other. COMMET was tested on three read sets (SAMN00038294, SAMN00146305, and SAMN00037421), which were smaller than the average HMP sample size and required an average of 655 CPU s per pairwise comparison. This equation does not account for compositional characteristics like GC bias, but it is useful in practice for ruling out clearly insignificant results (especially for small values of k and j). The impact of a reference transcriptome on the mapping of RNA-Seq reads is attenuated in the transcriptome+genome mapping mode because every unmapped read has a second chance to be mapped to a genome. Shanrong Zhao. [...] by monitoring the stability of a sketch as additional data are processed). Each dataset listed in Table 3 was compared against the full RefSeq Mash database using the following command for assemblies: [...] which enabled the Bloom filter to remove erroneous, single-copy k-mers. [...] to triage and cluster sequence data, assign species labels, build large guide trees, identify mis-tracked samples, and search genomic databases. A reference genome is a high-quality sequence published in a database that provides a representative example of a species; these sequences are reviewed and validated extensively. To demonstrate the effect of gene models on differential analysis, the fold changes between heart and liver samples were calculated using RefGene and Ensembl annotations. However, using the remote BLAST service can be slow.
Each chunked sketch file was then compared against the combined sketch file, again in parallel, using: [...] This required 6.9 CPU h to create pairwise distance tables for all chunks. The corresponding read lengths are 75 bp and 50 bp, respectively. Figure S5 highlights the gene definition difference for SLC30A1 in Ensembl and RefGene. Thus, sketches comprising just a few hundred values can be used to approximate the similarity of arbitrarily large datasets. Among the 21,958 common genes, about 20% of genes had no expression at all in both annotations. [...] ANI considers only the core genome). [...] the correspondence between two files is correct. Without loss of generality, here we will assume a nucleotide alphabet Σ = {A, C, G, T}. With a pre-computed sketch database, Mash is able to rapidly identify isolated genomes from both assemblies and raw sequencing reads. When a gene model is used in conjunction with a reference genome, by default, OSA maps RNA-Seq reads in three consecutive steps: (1) all reads are mapped to the reference transcriptome; (2) for mapped reads with mismatches, OSA aligns them with the reference genome and chooses the best hits; and (3) for unmapped reads, OSA maps them to the reference genome. We have found the parameters k=21 and s=1000 give accurate estimates in most cases (including metagenomes), so this is set as the default and still requires just 8 kB per sketch. Due to Cap'n Proto requirements, a C++11 compatible compiler is required to build from source, but precompiled binaries are distributed for convenience. In this paper, we performed a comprehensive evaluation of different annotations on RNA-Seq data analysis, including RefGene, UCSC, and Ensembl.
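To make the sketch parameters k and s concrete, the following minimal Python sketch builds a bottom-s MinHash sketch and estimates the Jaccard index from two sketches. It is an illustration under stated assumptions, not the Mash code: BLAKE2b stands in for the hash function, canonical k-mers are ignored, and the function names are hypothetical. The 8 kB figure quoted above is consistent with s = 1000 hashes stored as 64-bit (8-byte) values, which is assumed here.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """64-bit hash of a k-mer (illustrative stand-in hash)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def bottom_sketch(seq: str, k: int = 21, s: int = 1000) -> list:
    """Return the s smallest distinct k-mer hashes of seq, in sorted order."""
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a: list, sketch_b: list, s: int) -> float:
    """Estimate the Jaccard index from two bottom-s sketches.

    The s smallest hashes of the merged sketches approximate a sketch of the
    union; the estimate is the fraction of those hashes present in both inputs.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    union_sketch = sorted(set_a | set_b)[:s]
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in set_a and h in set_b)
    return shared / len(union_sketch)


# Storage for the default sketch size, assuming 8-byte hashes.
DEFAULT_SKETCH_BYTES = 1000 * 8
```

Because only the s smallest hashes are compared, the cost of one comparison depends on s rather than on the lengths of the original sequences.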
The ratio was calculated as Max(#C1,#C2)/Min(#C1,#C2). However, in the worst case, if all k-mers in the input occur fewer times than the coverage threshold c, no hashes would escape the candidate set and memory use would increase with each new k-mer processed. Two genomes are connected by an edge if their Mash distance D ≤ 0.05 and P value ≤ 10^-10. Mash can also replicate the function of k-mer based metagenomic comparison tools, but in a fraction of the time previously required. In the liver sample, the expression levels of these exemplary genes in both Ensembl and RefGene are summarized in Table 2 (read length = 75 bp). However, there are tradeoffs to consider when filtering or correcting low-coverage datasets (e.g. [...]). To avoid this phenomenon, it is sufficient to choose a value of k that minimizes the probability of observing a random k-mer. Thus, the mapping fidelity for a sequence read increases with its length, and this is especially true for junction reads. Figure 7 shows another example of a remarkably different gene model defined in Ensembl versus that in RefGene. The percentage of genes with expression levels differing by 5% or more was only 11.3%, which was much less than the corresponding 28% between Ensembl and RefGene. As genome databases increase in size and whole-genome sequencing becomes routine, it will become impractical to manually assign taxonomic labels for all genomes.
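The choice of k discussed above can be worked through numerically. The sketch below assumes a uniform random-sequence model, under which a given k-mer appears in a random genome of n bases with probability 1 - (1 - |Σ|^-k)^n, and assumes the published rule of thumb k = ceil(log_|Σ|(n(1 - q)/q)) for a target probability q. The function names are hypothetical, and the model ignores compositional bias.

```python
import math


def random_kmer_probability(k: int, n: float, alphabet_size: int = 4) -> float:
    """P(a fixed k-mer occurs in a random sequence of n bases), uniform model.

    Computed as 1 - (1 - alphabet_size**-k)**n using log1p/expm1 so the result
    stays accurate when alphabet_size**-k is tiny.
    """
    return -math.expm1(n * math.log1p(-float(alphabet_size) ** -k))


def min_kmer_size(n: float, q: float, alphabet_size: int = 4) -> int:
    """Smallest k keeping the random-match probability at or below roughly q."""
    return math.ceil(math.log(n * (1.0 - q) / q) / math.log(alphabet_size))


# Example: a 5 Mbp genome with a 1% tolerated chance of a random k-mer match.
k_needed = min_kmer_size(5_000_000, 0.01)
```

Under these assumptions a 5 Mbp genome needs k = 15 to keep the chance of a random match below 1%; k = 14 leaves it near 1.8%.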
For both sequencing reads and assemblies, Mash successfully clusters samples by body site and appropriately clusters MetaHIT and HMP stool samples together, even though these samples are from different projects with different protocols. After sketching, computing pairwise distances is near instantaneous. For a large-scale test, samples from the Human Microbiome Project [36] (HMP) and Metagenomics of the Human Intestinal Tract [37] (MetaHIT) were combined to create a ~10 TB 888-sample dataset. Without using a gene model, an average of 53% of junction reads remained mapped to the same genomic regions, 30% failed to map to any genomic region, and 10-15% mapped alternatively. For example, Mash does not cluster MetaHIT samples by health status, as previously reported [37], and MetaHIT samples appear to preferentially cluster with one another. This required 0.6 CPU h (37 CPU min) and 19.6 GB of RAM with Bloom filtering or 8 MB without. HMP samples that did not pass HMP QC requirements [36] were removed from Fig. [...]. Mash is freely released under a BSD license (https://github.com/marbl/mash). The expression levels of approximately 28.1% of genes differed by 5% or more, including 9.3% of genes (2038) that differed by 50% or more. Although multiple genome annotations are available, researchers need to choose one genome annotation (or gene model) when performing RNA-Seq data analysis. Therefore, to fairly assess the impact of a gene model on RNA-Seq read mapping, only those reads covered by a gene model were used.
The definition of the PIK3CA gene in Ensembl seems more accurate than the one in RefGene, based upon the mapping profile of the sequence reads. The resulting sketches total only 93 MB (Additional file 1: Supplementary Note 1), yielding a compression factor of more than 7000-fold versus the uncompressed FASTA (674 GB). In the transcriptome+genome mapping mode, reads were first mapped to a reference transcriptome, and the unmapped ones were then mapped to the reference genome. Thus, Mash combines the high specificity of matching-based approaches with the dimensionality reduction of statistical approaches, enabling accurate all-pairs comparisons between many large genomes and metagenomes. In contrast, Mash sets t to the average genome size n, thereby penalizing for genome size differences and measuring resemblance (e.g. [...]). For the RNA-Seq dataset with a read length of 75 bp, on average, 95% of non-junction reads were mapped to exactly the same genomic location regardless of which gene model was used. In Figure 3A, we divided uniquely mapped reads into two classes, i.e., non-junction reads and junction reads, and investigated the impact of a gene model on their mapping. The impact of a gene model on mapping of non-junction reads is different from that on junction reads. The Human Body Map 2.0 Project generated RNA-Seq data for 16 different human tissues (adipose, adrenal, brain, breast, colon, heart, kidney, leukocyte, liver, lung, lymph node, ovary, prostate, skeletal muscle, testis, and thyroid). When conducting more exploratory research, a more complex genome annotation, such as Ensembl, should be chosen. For example, clustering the treeshrew, mouse, rat, guinea pig, and rabbit genomes alongside the primate genomes causes the tarsier to become misplaced (Additional file 1: Figure S6).
Ideally, we should get identical numbers of mapped reads for all common genes, regardless of the choice of a gene model; however, this was clearly not the case. Improvements in database construction are also expected. Similar alignment-free methods have a long history in bioinformatics [13, 14]. Sketching all genomes and computing all ~1.5 billion pairwise distances required just 26.1 and 6.9 CPU h, respectively. This correlation begins to degrade for more divergent genomes because the variance of the Mash estimate grows with distance. With decreasing sequencing cost, RNA-Seq is becoming an attractive approach to profile gene expression or transcript abundance, and to evaluate differential expression among biological conditions. This large difference in the mapping rates between the two modes suggests the incompleteness of gene models: many reads were mapped to genomic regions without annotations. This required 26.1 CPU h on a heterogeneous cluster of AMD processors. The PIK3CA gene definition in both Ensembl and RefGene, and the mapping profile of RNA-Seq reads, are shown in Figure 6. Using the dataset with the read length of 75 bp, we compared the gene quantification results in RefGene and Ensembl annotations, and obtained identical counts for an average of 16.3% (about one sixth) of genes. Multiple human genome annotation databases exist, including RefGene (RefSeq Gene), Ensembl, and the UCSC annotation database. For k=16, this corresponds to a Mash distance between 0.12 and 0.09. Plasmids and organelles were grouped with their corresponding nuclear genomes when available; otherwise they were kept as separate entries.
There are 17,057 entries representing various types of RNAs, including rRNA (566), snoRNA (1549), snRNA (2067), miRNA (3361), misc_RNA (2174), and lincRNA (7340). RefGene and UCSC consistently had the highest percentage of uniquely mapped reads, while the percentage of non-uniquely mapped reads was much higher in Ensembl (samples colored in blue in Figure 2). These approaches assume that redundancy in the data (e.g. [...]). RefSeq Complete release 70 was downloaded from NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov). This illustrates the incremental scalability of Mash, where the primary overhead is sketching, which occurs only once per sample. This annotation was created by using the Liftoff program [1] (v1.6.3, with options -copies -sc 0.95 -polish -exclude_partial -chroms) to map across all human genes in RefSeq [2] annotation release 110 from the GRCh38.p14 genome to the CHM13v2.0 genome, a complete, gap-free human genome published by the Telomere-to-Telomere (T2T [...]). Sample groups are identified and colored using the same key as in Rusch et al. Since Mash relies only on comparing length k substrings, or k-mers, the inputs can be whole genomes, metagenomes, nucleotide sequences, amino acid sequences, or raw sequencing reads. The impact of a gene model on RNA-Seq read mapping (read length [...]). In this paper, we demonstrated that the choice of a gene model has an effect on the quantification results. Importantly, the error of this computation depends only on the size of the sketch and is independent of the genome size. Thus, Mash can also act as an alternate QC method to identify mis-tracked or low-quality samples. However, accurate alignment of high-throughput short RNA-Seq reads remains challenging, mainly because of junction (i.e., exon-exon spanning) reads and the ambiguity of multiple-mapping reads.
In RefGene, a bi-cistronic transcript encodes the products of both the MTPN (myotrophin) and LUZP6 (leucine zipper protein 6) genes, which are located on chromosome 7. For most RNA-Seq sequencing projects, only mRNAs are presumably enriched and sequenced, and there is no point in mapping sequence reads to RNAs such as miRNAs or lincRNAs. As a result, all sequence reads originating from LUZP6 are assigned to MTPN instead. Further, the probability that the i-th hash of the genome will enter the sketch is s/i, so the expected runtime of the algorithm is O(n + s log s log n) [4], which becomes nearly linear when n >> s. As demonstrated by Fig. [...] The mapping summaries for the data in Additional file 1: Tables S1 and S2 are shown in Figure 1 and Additional file 1: Figure S1, respectively. All mapped reads are equally distributed to these two genes. These ambiguous mappings directly translate to an increase in the percentage of non-uniquely mapped reads. Mash enables scalable whole-genome clustering, which is an important application for the future of genomic data management, but currently infeasible with alignment-based approaches. For example, the accession number NC_001477 is for the DEN-1 Dengue virus genome sequence. This functionality is designed to support the analysis of real-time data streams, as is expected from nanopore-based sequencing sensors [24]. This transcript is 3909 bp long with a very short exon #21 (only 616 bp, located at chr 3:178,951,882-178,952,497).
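The s/i insertion probability quoted above explains why sketching is nearly linear in n: summing it over all n hashes gives the expected number of (comparatively expensive) sketch updates. The short calculation below assumes hashes arrive in effectively random order and caps the probability at 1 for the first s hashes; the function name is hypothetical.

```python
def expected_sketch_insertions(n: int, s: int) -> float:
    """Expected number of hashes that enter a bottom-s sketch over n hashes.

    The i-th hash enters with probability min(1, s / i): the first s always
    enter, and later ones must beat the current s-th smallest value.
    """
    return sum(min(1.0, s / i) for i in range(1, n + 1))


# One million k-mers with the default sketch size of 1000.
updates = expected_sketch_insertions(10**6, 1000)
```

For n = 10^6 and s = 1000 this gives roughly 7900 insertions, about s(1 + ln(n/s)), so fewer than 1% of the hashes ever touch the sketch and the remaining work is a single comparison per k-mer.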
A junction read could be either mapped as a non-junction read, or remain mapped as a junction read but with different start, end, and splicing positions; (3) Multiple, a uniquely mapped read became a multiple-mapped one. COMMET v24/07/2014 was run with default parameters (t=2, m=all, k=33) as: [...] where read_sets.txt points to the gzipped FASTQ files. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the DHS or S&T. Figure S6. Comparison and de novo clustering of all RefSeq genomes using Mash. Additional file 1: Figure S8 shows the original heatmap generated by COMMET on this dataset. As more genes are annotated in a gene model, a higher percentage of reads will be mapped in the transcriptome only mapping mode. The percentage of junction reads dropped to 16% when the read length was 50 bp (see Additional file 1: Figure S3A and Additional file 1: Table S6). It supports flexible integration of all the common types of genomic data and metadata, investigator-generated or publicly available, loaded from local or cloud sources. The authors thank Konstantin Berlin, Ben Langmead, Michael Schatz, and Nicolas Maillet for their helpful suggestions; Brian Walenz and Torsten Seemann for reviewing the draft; Jiarong Guo, Sherine Awad, C. Titus Brown, and an anonymous referee for their constructive reviews; and Philip Ashton, Aleksey Jironkin, and Nicholas Loman for providing early feedback on the software. These samples are the only ones that fail to group by body site.
On this reduced dataset COMMET required 10 CPU h (598 CPU min). Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Our research focused on: (1) comparing the coverage and incompleteness of different gene models; (2) quantifying the impact of gene models on the mapping of both junction and non-junction reads; and (3) evaluating the effect of genome annotation choice on gene quantification and differential analysis. If a new hash would have otherwise been inserted in the sketch but was not found in the Bloom filter, it is inserted into the Bloom filter so that subsequent appearances of the hash will pass. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Consequently, more junction reads will be generated by shotgun sequencing technologies. This test included the K12 MG1655 reference genome as well as assembled and unassembled sequencing runs from the ABI 3730, Roche 454, Ion PGM, Illumina MiSeq, PacBio RSII, and Oxford Nanopore MinION instruments. If present with a count less than c − 1, its counter is incremented. In bioinformatics, and indeed in other data intensive research fields, databases are often categorised as primary or secondary (Table 2). To quantify the concordance between RefGene and Ensembl annotations, we first calculated the ratio of mapped reads for each gene.
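The per-gene concordance ratio, Max(#C1,#C2)/Min(#C1,#C2), can be written out as below. The text does not say how genes with zero reads in one or both annotations were handled, so the conventions used here (1.0 when both counts are zero, infinity when only one is) are assumptions, as are the function name and the example counts.

```python
import math


def count_ratio(c1: int, c2: int) -> float:
    """Ratio of the larger to the smaller read count for one gene.

    Assumed conventions for zero counts: both zero -> 1.0 (the annotations
    agree); exactly one zero -> infinity (the ratio is unbounded).
    """
    hi, lo = max(c1, c2), min(c1, c2)
    if hi == 0:
        return 1.0
    if lo == 0:
        return math.inf
    return hi / lo


# Hypothetical read counts for one gene under two annotations.
example = count_ratio(150, 100)
```

A ratio of 1.0 means both annotations assign the gene the same number of reads; because the larger count is always in the numerator, the ratio is symmetric in the two annotations.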