Thus, a gene is considered correctly predicted if at least one of its isoforms, major or not, is correctly predicted (Figure 5, Supplementary Table S3). This is a rather liberal way of computing Sn at the gene level. The main role of ProtHint is to generate a list of coordinates, along with confidence scores, of potential borders between coding and non-coding regions in a novel genome.
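The liberal gene-level criterion described above can be illustrated with a minimal sketch. It assumes that each isoform is represented as a tuple of coding exon coordinates and that a prediction counts as correct only on an exact match; the function name `gene_level_sn` and the toy data are hypothetical and are not taken from the pipeline.

```python
from typing import Dict, List, Set, Tuple

# Assumption: a coding structure is a tuple of (start, end) exon coordinates.
Structure = Tuple[Tuple[int, int], ...]


def gene_level_sn(annotation: Dict[str, List[Structure]],
                  predictions: Set[Structure]) -> float:
    """Fraction of annotated genes with at least one isoform predicted exactly."""
    if not annotation:
        return 0.0
    # A gene counts as found as soon as any one of its isoforms is matched.
    found = sum(
        1 for isoforms in annotation.values()
        if any(isoform in predictions for isoform in isoforms)
    )
    return found / len(annotation)


# Toy data (hypothetical): geneA has two isoforms, geneB has one.
annotation = {
    "geneA": [((100, 200), (300, 400)), ((100, 200), (350, 400))],
    "geneB": [((1000, 1500),)],
}
# Only the second (possibly minor) isoform of geneA is predicted.
predictions = {((100, 200), (350, 400))}

print(gene_level_sn(annotation, predictions))
```

In this toy case geneA is counted as correct even though only one of its two isoforms is matched, which is why the measure is liberal.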
Specific thresholds on confidence scores are defined to select subsets of hints e. The GeneMark-EP training procedure can tolerate a high number of false positive intron hints since only a subset, the anchored introns, is used in training. It is important that the set of all mapped hints has high Sn with respect to true gene elements, while the Sp level can be lower.
Sn dropped steadily as the evolutionary distance to the reference proteins increased. Accuracy of ProtHint for the D. The results are shown for all reported hints or just high-confidence hints. The Sn and Sp values are computed based on genome annotation of a full complement of introns, gene starts and stops, including alternative isoforms.
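The trade-off between all mapped hints and high-confidence hints can be sketched as follows. This is an illustration only: it assumes the convention common in gene prediction, where Sn is the fraction of annotated elements recovered and Sp is the fraction of reported hints that are correct, and the `hint_accuracy` function, the score scale and the threshold value are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumption: an intron is identified by (sequence id, start, end, strand).
Intron = Tuple[str, int, int, str]


def hint_accuracy(hints: Dict[Intron, float],
                  annotated: Set[Intron],
                  min_score: float = 0.0) -> Tuple[float, float]:
    """Return (Sn, Sp) for the hints whose score is at least min_score."""
    selected = {intron for intron, score in hints.items() if score >= min_score}
    true_positives = len(selected & annotated)
    sn = true_positives / len(annotated) if annotated else 0.0
    sp = true_positives / len(selected) if selected else 0.0
    return sn, sp


# Toy data (hypothetical scores): three annotated introns, four hints.
annotated = {("chr1", 201, 299, "+"), ("chr1", 401, 499, "+"), ("chr1", 601, 699, "+")}
hints = {
    ("chr1", 201, 299, "+"): 0.9,
    ("chr1", 401, 499, "+"): 0.8,
    ("chr1", 601, 699, "+"): 0.2,
    ("chr1", 801, 899, "+"): 0.1,  # false positive hint
}

all_sn, all_sp = hint_accuracy(hints, annotated)           # all mapped hints
hc_sn, hc_sp = hint_accuracy(hints, annotated, 0.5)        # high-confidence subset
print(all_sn, all_sp, hc_sn, hc_sp)
```

Raising the threshold in this toy set lowers Sn and raises Sp, which is the behaviour the text describes for the transition from all mapped to high-confidence hints.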
Results for all tested species are shown in Supplementary Table S7. Here, the largest Sn value (the fraction of correct intron hints) was observed for N. This level remained high even for the smallest sets of reference proteins, proteins from the species outside the phylum of interest Table 5, Supplementary Table S In case of C.
For all other species, the decrease in Sn upon transition from all mapped hints to high-confidence hints was small in comparison with the simultaneous increase in Sp. The distribution of the score vectors (Figure 4a) as well as the behavior of the Sp-Sn curves (Figure 4b) depend on the selection of the set of reference proteins (genus, order or phylum excluded; Supplementary Figure S2, left and middle panels).
We assessed the extent of this effect for A. It was shown that the best average prediction accuracy was achieved with the IBA threshold set to 0. A similar analysis produced the necessary thresholds for high-confidence hints to gene starts and stops. This fraction increased significantly as more proteins were excluded from the reference set e. This fraction reached Similar trends were observed for C.
In the set of all reported intron hints, the fraction of introns mapped to regions coding for conserved domains was lower than that in the set of high-confidence intron hints (Supplementary Table S12); however, the proportion of introns mapped into conserved domain regions also increased upon removing proteins from closely and moderately closely related species.
Fractions of D. The hints were generated from sets of reference proteins having different evolutionary distance to D. Out of 41 D. Still, for C. It was expected that iterative ab initio parameterization of statistical models, as done in GeneMark-ES, would become more precise, especially for large genomes, if an efficient method were found to add data on protein footprints to the training and prediction steps.
In this respect, the new pipeline features a new method, ProtHint, developed to find multiple proteins homologous to a gene initially predicted in a genomic locus and then to derive reliable hints to the true exon-intron structure of the gene by constructing and processing multiple protein footprints. Another method developed earlier, GeneMark-ET (2), extended GeneMark-ES to use external evidence generated from transcriptome sequence data, when such data are available along with a newly assembled genome.
Existing methods, such as GenomeThreader (7), rely on mapping proteins from closely related species, as well as mapping gene elements from the aligned genomic sequence of a close species, to produce predicted exon-intron structures.
However, its prediction accuracy drops quickly as the evolutionary distance between species increases (6). Use of multiple homologous proteins proved important for maintaining decent prediction accuracy as the evolutionary distance between the species with known genomes and the species of interest increases. In particular, owing to corroboration of footprints originating from multiple homologous proteins, we observed enrichment of high-confidence introns in regions coding for conserved domains (Table 6).
Use of anchored elements of gene structure was important for integrating signals originating from different sources (sites predicted from genomic sequence alone and sites identified by protein footprints). Use of partial protein footprints, where a target protein mapping contributes less than a full exon-intron structure, was another important feature of the new method.
Partial footprints were useful for improving training sets; they also added confident corrections at the gene prediction steps (Supplementary Figure S8). Use of anchored elements was most beneficial for large genomes S. Mapping of the N- and C-termini of target proteins allowed better discrimination between introns and intergenic regions than could be achieved by an ab initio algorithm. This improvement led to a significant reduction of gene merging errors (when intergenic regions were predicted as introns), though the reduction in the rate of gene splitting errors (when introns were predicted as intergenic regions) was smaller (Table 4).
The most significant improvement in comparison with GeneMark-ES, observed in all species but fungi, N. For N. Gene-level accuracy for D. Notably, the genes in the fish genome have a rather large, 8. Under the assumption of independent errors, a gene with a large number of introns would be an improbable target for accurate prediction. Even though the independence assumption does not hold in the presence of external evidence, the gene error rate increases with the number of introns (data not shown).
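The effect of the independence assumption mentioned above can be made concrete with a small calculation. The sketch assumes, purely for illustration, that every intron is predicted correctly with the same probability and that errors in different introns are independent; the value 0.95 and the function name `prob_all_introns_correct` are hypothetical.

```python
def prob_all_introns_correct(per_intron_accuracy: float, n_introns: int) -> float:
    """Probability that every intron of a gene is predicted correctly,
    assuming independent errors with equal per-intron accuracy."""
    if not 0.0 <= per_intron_accuracy <= 1.0:
        raise ValueError("per_intron_accuracy must be between 0 and 1")
    if n_introns < 0:
        raise ValueError("n_introns must be non-negative")
    return per_intron_accuracy ** n_introns


# Hypothetical per-intron accuracy of 0.95 for genes of increasing complexity.
for n in (1, 4, 8, 16):
    print(n, round(prob_all_introns_correct(0.95, n), 3))
```

Under these assumptions the probability of a fully correct gene falls geometrically with the number of introns, which is why intron-rich genomes are harder targets at the gene level.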
Annotation of genes encoding principal protein isoforms is available for D. This question is difficult to address in a general setting. Young pseudogenes with one or two mutations that make them dysfunctional still have all the sequence patterns that could be used in training.
Old pseudogenes that accumulated many mutations would harm statistical models if included in training. We argue that old pseudogenes will not be predicted by GeneMark-ES in the course of self-training and therefore they have little or no chance to be included in a training set of anchored elements.
On the other hand, elements of young pseudogenes could be identified by GeneMark-ES while the frameshifted exons from spliced alignments will be detected and scored unfavorably by ProtHint. This additional run is recommended if an increase in run-time is not a concern.
We used OrthoDB as the database of reference proteins, DIAMOND (25) for the database search for target proteins homologous to the seed proteins, and Spaln (9) for spliced alignment of target proteins to the genome. To accelerate the pipeline run, we limited the DIAMOND output to 25 target proteins per seed protein (Supplementary Figure S9); the choice of Spaln was also practical from the standpoint of run-time reduction.
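The cap of 25 target proteins per seed protein can be illustrated with a short post-processing sketch. It assumes the search hits are already available as (seed, target, score) records with higher scores being better; this is not how the pipeline itself applies the limit, and the name `top_targets_per_seed` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumption: one hit is (seed protein id, target protein id, alignment score).
Hit = Tuple[str, str, float]


def top_targets_per_seed(hits: Iterable[Hit],
                         max_targets: int = 25) -> Dict[str, List[str]]:
    """Keep at most max_targets best-scoring target proteins for each seed."""
    by_seed: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for seed, target, score in hits:
        by_seed[seed].append((score, target))
    selected: Dict[str, List[str]] = {}
    for seed, scored in by_seed.items():
        # Sort by descending score; ties keep their input order (stable sort).
        ranked = sorted(scored, key=lambda pair: -pair[0])
        selected[seed] = [target for _, target in ranked[:max_targets]]
    return selected


# Toy data (hypothetical): 30 hits for one seed, 3 for another.
hits = [("seed1", f"target{i}", float(i)) for i in range(30)]
hits += [("seed2", name, score) for name, score in (("a", 5.0), ("b", 9.0), ("c", 1.0))]

kept = top_targets_per_seed(hits)
print(len(kept["seed1"]), kept["seed2"])
```

Bounding the number of targets per seed bounds the number of spliced alignments that follow, which is where the run-time saving comes from.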
Additionally, we verified that using GeneMark-ES to generate seeds was faster and more efficient than six-frame translation with the Procompart and ProSplign tools (8).
This discussion section would be incomplete if we did not mention the limitations of the new method. GeneMark-EP does not support the multiple models needed for genomes with heterogeneous nucleotide composition, like genomes of mammals and some plants grasses, e. We realize that using taxonomic divisions to select or exclude reference proteins is just the first step toward accurate modeling of real-life distributions of evolutionary distances to database orthologs for genes and proteins existing in a novel species.
Similarly, one would expect such modeling to improve the selection of thresholds for intron and site mapping. Another limitation of the current method is the search for a single optimal parse of the genomic sequence, which leads to prediction of a single gene and a single protein isoform in each locus. The importance of genes with alternative splicing has been debated recently, as evidence has accumulated that alternative splicing mainly affects UTRs rather than translated regions of pre-mRNA.
Moreover, claims have been made that when a translated region can be alternatively spliced, only one of the protein isoforms, the major one, is expressed in the largest number of tissues Such comparison, done for C. Nonetheless, general tools able to predict all alternative isoforms are of significant interest to the community.
A new tool, GeneMark-ETP, will combine protein and transcript data in gene prediction (paper in preparation). The software is compiled for Linux and Mac OS operating systems. In our experiments, the run-time grew linearly with respect to both genome length and number of genes.
Hoff K.
Lomsadze A. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res.
Foissac S. Genome annotation in plants and fungi: EuGene as a model platform. Sallet E. EuGene: an automated integrative gene finder for eukaryotes and prokaryotes. Methods Mol. Behr J. Next generation genome annotation with mGene.
BMC Bioinformatics. Birney E. GeneWise and Genomewise. Genome Res. Gremme G. Engineering a software tool for gene structure prediction in higher organisms. Software Technol. Kiryutin B. Gotoh O. Direct mapping and alignment of protein sequences onto genomic sequence. Keller O. A novel hybrid gene prediction method employing protein multiple sequence alignments. Keilwagen J. Using intron position conservation for homology-based gene prediction.
Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment. Burge C. Prediction of complete gene structures in human genomic DNA. Lukashin A. Stanke M. Gene prediction with a hidden Markov model and a new intron submodel.
Parra G. GeneID in Drosophila. Souvorov A. Gnomon:NCBI eukaryotic gene prediction tool. National Center for Biotechnology Information.
Haas B. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. Aken B. The Ensembl gene annotation system. Gene identification in novel eukaryotic genomes by self-training algorithm. Ter-Hovhannisyan V. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Kriventseva E.
OrthoDB v sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs.
Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Next-generation genome annotation: we still struggle to get it right. Genome Biol. Mudge JM, Harrow J. The state of play in higher eukaryote gene annotation. Nat Rev Genet. No wisdom in the crowd: genome annotation in the era of big data - current status and future prospects. Microb Biotechnol. Direct RNA sequencing. Nanopore native RNA sequencing of a human poly a transcriptome.
Nat Methods. Computational inference of homologous gene structures in the human genome. Genome Res. Birney E. GeneWise and Genomewise. Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biology. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinform. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct. CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts.
BMC Genomics. Long-read annotation: automated eukaryotic genome annotation based on long-read cDNA sequencing. Plant Physiol. Well-characterized sequence features of eukaryote genomes and implications for ab initio gene prediction. Comput Struct Biotechnol J. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. Interpolated Markov models for eukaryotic gene finding. Prediction of gene structure.
Korf I. Gene finding in novel genomes. Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Lomsadze A. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.
GeneValidator: identify problems with protein-coding gene predictions. Evaluating genome assemblies and gene models using gVolante. In: Kollmar M, editor.
Gene prediction. New York: Springer New York; DOGMA: a web server for proteome and transcriptome quality assessment. Computational discovery and annotation of conserved small open reading frames in fungal genomes. RefSeq curation and annotation of stop codon recoding in vertebrates. Evaluation of gene structure prediction programs. Evaluation of gene-finding programs on mammalian sequences. Guigo R. An assessment of gene prediction accuracy in large DNA sequences.
Evaluating high-throughput Ab initio gene finders to discover proteins encoded in eukaryotic pathogen genomes missed by laboratory techniques. PLoS One. The UniProt Consortium. UniProt: the universal protein knowledgebase. The Ensembl genome database project.
Repertoire-wide gene structure analyses: a case study comparing automatically predicted and manually annotated gene models. Annotation error in public databases: Misannotation of molecular function in enzyme Superfamilies.
PLoS Comput Biol. Yandell M, Ence D. Large-scale analyses of human microbiomes reveal thousands of small, novel genes. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. SnowyOwl: accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models.
BMC Bioinformatics. Matera AG, Wang Z. A day in the life of the spliceosome. Nat Rev Mol Cell Biol. Discerning novel splice junctions derived from RNA-seq alignment: a deep learning approach.
Predicting splicing from primary sequence with deep learning. OrthoInspector 3. Kozak M. Possible role of flanking nucleotides in recognition of the AUG initiator codon by eukaryotic ribosomes. Human branch point consensus sequence is yUnAy. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.
Mol Biol Evol.
NS developed the benchmark, performed the program benchmarking, and produced all graphical presentations. AJG and PC advised on the feature content of the test sets and supervised the comparative analyses. OP and JDT supervised the production and exploitation of the benchmark. All authors participated in the definition of the original study concept. All authors read and approved the final manuscript.
Correspondence to Julie D. Not applicable. All data presented in this article was extracted from publicly available sources. Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Scalzitti, N. A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms. BMC Genomics 21, Received: 29 November Accepted: 30 March Published: 09 April
Abstract
Background: The draft genome assemblies produced by new sequencing technologies present important challenges for automatic gene prediction pipelines, leading to less accurate gene models.
Results: We describe the construction of a new benchmark, called G3PO (benchmark for Gene and Protein Prediction PrOgrams), designed to represent many of the typical challenges faced by current genome annotation projects.
Conclusions: The experiments showed that ab initio gene structure prediction is a very challenging task, which should be further investigated.
Background
The plunging costs of DNA sequencing [1] have made de novo genome sequencing widely accessible for an increasingly broad range of study systems with important applications in agriculture, ecology, and biotechnologies amongst others [2].
Results
The presentation of the results is divided into three sections, describing (i) the data sets included in the G3PO benchmark, (ii) the overall prediction quality of the five gene prediction programs tested and (iii) the effects of various factors on gene prediction quality.
Benchmark data sets
The G3PO benchmark contains proteins from a diverse set of organisms (Additional file 1: Table S1), which can be used for the evaluation of gene prediction programs.
Phylogenetic distribution of benchmark sequences
The protein sequences used in the construction of the G3PO benchmark were identified in phylogenetically diverse eukaryotic organisms, ranging from human to protists Fig.
Table 1 Main characteristics of the gene prediction programs evaluated in this study.
Table 2 Effect of protein sequence quality measured at the protein level.
Discussion
Several recent reviews [3, 22, 23] have highlighted the fact that automated genome annotation strategies still have difficulty correctly identifying protein-coding genes.
Conclusions
The complexity of the genome annotation process and the recent activity in the field mean that it is timely to perform an extensive benchmark study of the main computational methods employed, in order to obtain a more detailed knowledge of their advantages and disadvantages in different situations.
Schematic view of the pipeline used to construct the benchmark.