...
Arrhythmia Channelopathy Variants

Examines variants associated with arrhythmia diseases such as Brugada Syndrome and Long QT Syndrome.

...
<p>Explores variants associated with arrhythmia diseases such as Brugada Syndrome and Long QT Syndrome, found in the SCN5A and KCNH2 genes.</p> <p><strong>SCN5A Background</strong></p> <p>SCN5A encodes NaV1.5, a 2016-amino-acid protein that is the main voltage-gated sodium channel in the heart. Coding-altering variants in SCN5A have been linked to many arrhythmia and cardiac conditions, including Brugada Syndrome Type 1 (BrS1; https://www.omim.org/entry/601144), Long QT Syndrome Type 3 (LQT3; https://www.omim.org/entry/603830), dilated cardiomyopathy (https://www.omim.org/entry/601154), cardiac conduction disease (https://www.omim.org/entry/113900), and Sick Sinus Syndrome (https://www.omim.org/entry/608567). Loss-of-function variants in SCN5A are associated with Brugada Syndrome and other cardiac conduction defects, and gain-of-function variants are associated with Long QT Syndrome. The risk of sudden cardiac death from these conditions can often be prevented with drug therapy or implantation of a defibrillator. SCN5A variants are often studied in vitro in heterologous expression systems using patch clamp electrophysiology. One challenge with SCN5A-related diseases is incomplete penetrance: only a fraction of variant carriers have disease phenotypes. Therefore, we believe that curating published patient data and in vitro functional data can contribute to a better understanding of each variant&rsquo;s disease risk.</p> <p><strong>KCNH2 Background</strong></p> <p>KCNH2 (also known as the human ether a-go-go related gene, hERG) encodes a 1159-amino-acid protein, KV11.1, a voltage-gated potassium channel in the heart. Coding-altering variants in KCNH2 have mostly been linked to the heart arrhythmias Long QT Syndrome Type 2 (LQT2; https://www.omim.org/entry/613688) and Short QT Syndrome (SQT1; https://www.omim.org/entry/609620). Loss-of-function variants in KCNH2 are associated with LQT2, and gain-of-function variants are associated with Short QT Syndrome. 
The risk of sudden cardiac death from these conditions can often be prevented with drug therapy or implantation of a defibrillator. KCNH2 variants are often studied in vitro in heterologous expression systems using patch clamp electrophysiology. One challenge with KCNH2-related diseases is the issue of incomplete penetrance&mdash;only a fraction of variant carriers have disease phenotypes. Therefore, we believe that curating published patient data and in vitro functional data can contribute to a better understanding of each variant&rsquo;s disease risk.</p> <p><strong>SCN5A Dataset</strong></p> <p>The dataset described on this website is a dataset of patient data and in vitro patch clamp data. This dataset was first described in Kroncke and Glazer et al. 2018, Circulation: Genomic and Precision Medicine (https://pubmed.ncbi.nlm.nih.gov/29728395/). The data were curated from a comprehensive literature review from papers written about SCN5A (or Nav1.5, the protein product of SCN5A). We quantified the number of carriers presenting with and without disease for 1,712 reported SCN5A variants. For 356 variants, data were also available for five NaV1.5 electrophysiologic parameters: peak current, late/persistent current, steady state V1/2 of activation and inactivation, and recovery from inactivation. We found that peak and late current significantly associated with BrS1 (p &lt; 0.001, rho = -0.44, Spearman&rsquo;s rank test) and LQT3 disease penetrance (p &lt; 0.001, rho = 0.37). Steady state V1/2 activation and recovery from inactivation also associated significantly with BrS1 and LQT3 penetrance, respectively.</p> <p><strong>KCNH2 Dataset</strong></p> <p>The dataset described on this website is a dataset of patient data and in vitro patch clamp data. This dataset was first described in Kozek et al. (to be published soon). The data were curated from a comprehensive literature review from papers written about KCNH2 (or Kv11.1, the protein product of KCNH2). 
In addition, five centers that run cardiology clinics and conduct research gathered clinical phenotypes and genotypes for individuals heterozygous for KCNH2 variants: Unit&eacute; de Rythmologie, Centre de R&eacute;f&eacute;rence Maladies Cardiaques H&eacute;r&eacute;ditaires, Service de Cardiologie, H&ocirc;pital Bichat, Paris, France; the Center for Cardiac Arrhythmias of Genetic Origin, Istituto Auxologico Italiano IRCCS, Milan, Italy; Shiga University of Medical Science Department of Cardiovascular and Respiratory Medicine, Shiga, Japan; National Cerebral and Cardiovascular Center, Osaka, Japan; and Nagasaki University, Nagasaki, Japan. We quantified the number of carriers presenting with and without disease for 871 reported KCNH2 variants (with an additional 266 KCNH2 in-frame/missense variants coming from the international cohort). For ### variants, data were also available for six KV11.1 electrophysiologic parameters: steady state maximum current, peak tail current, steady state V1/2 of activation and inactivation, recovery from inactivation, and deactivation time. All six parameters appear in the literature, measured in both homozygous and heterozygous expression. We found that heterozygously collected peak tail current significantly associated with LQT2 (p &lt; 0.001, rho = -0.62, Spearman&rsquo;s rank test). This relationship persisted across the literature and cohort datasets.</p> <p><strong>Updates to the Datasets</strong></p> <p>This dataset was updated with papers published through January 2020. The description of the revised dataset was published in Kroncke et al., 2020, PLOS Genetics. 
This paper also includes an updated Bayesian method for estimating the penetrance of each variant.</p> <p><strong>Calculating Penetrance</strong></p> <p>In this work, penetrance is estimated as the probability of a long QT diagnosis for each variant, using a Bayesian method that integrates patient data and variant features (changes in variant function, protein structure, and in silico predictions). More information is available at https://oates.app.vumc.org/vancart/SCN5A/SCN5A-report.html and https://oates.app.vumc.org/vancart/KCNH2/KCNH2-report.html.</p> <p>Information from https://oates.app.vumc.org/vancart/SCN5A/about.php and</p>
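<p>The published Bayesian method is not reproduced here. As a loose illustration of how a prior (for example, one informed by variant features) can be combined with carrier counts, the sketch below uses a generic beta-binomial update. The function name, the prior parameters, and the counts are all hypothetical and are not taken from the dataset or the papers.</p>

```python
def posterior_mean_penetrance(affected: int, carriers: int,
                              prior_alpha: float = 1.0,
                              prior_beta: float = 1.0) -> float:
    """Posterior mean of a Beta(prior_alpha, prior_beta) prior updated with
    `affected` diagnosed carriers out of `carriers` total carriers.

    Generic beta-binomial illustration only; NOT the published method.
    """
    if carriers < 0 or not 0 <= affected <= carriers:
        raise ValueError("need 0 <= affected <= carriers")
    if prior_alpha <= 0 or prior_beta <= 0:
        raise ValueError("prior parameters must be positive")
    return (prior_alpha + affected) / (prior_alpha + prior_beta + carriers)


# Hypothetical variant: 3 affected among 8 carriers, uniform prior.
estimate = posterior_mean_penetrance(3, 8)
```

<p>With few carriers the estimate stays close to the prior mean; as carrier counts grow, it approaches the observed fraction of affected carriers.</p>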
<p>Examines variants associated with arrhythmia diseases such as Brugada Syndrome and Long QT Syndrome.</p>
...
ALFA: Allele Frequency Aggregator

The goal of the ALFA project is to make frequency data from over 1M dbGaP subjects open-access in future releases to facil...

<p><strong>ALFA at a glance:</strong></p> <p>* The aim is to provide allele frequencies from more than 1 million subjects by adding 100-200K new subjects available in dbGaP with each quarterly ALFA release.</p> <p>* The initial release of ~100 thousand subjects included allele counts and frequencies for 447 million rs sites, including 4 million novel ones, aggregated from 551 billion genotypes.</p> <p>* The dbGaP studies include chip array, exome, and genomic sequencing data with subjects from 12 diverse populations including European, African, Asian, Latin American, and others.</p> <p>* The data will be integrated with the regular dbSNP build release, with RS accessions assigned to variants, and will be available by web, FTP, and API.</p> <p><strong>Background</strong></p> <p>The NCBI Database of Genotypes and Phenotypes (dbGaP) contains the results of over 1,200 studies that have investigated the interaction of genotype and phenotype. The database has over two million subjects and hundreds of millions of variants, along with thousands of phenotypes and molecular assay data. This unprecedented volume and variety of data promise huge opportunities to identify genetic factors that influence health and disease. NIH has recently lifted the restriction on Genomic Summary Results (GSR) access for responsible sharing and use of the data. To fulfill this updated GSR policy and to promote research toward identifying genetic variants that contribute to health and disease, NCBI developed the Allele Frequency Aggregator (ALFA) pipeline to compute allele frequencies for variants in dbGaP across approved unrestricted studies and to provide the data as open-access to the public through dbSNP. The goal of the ALFA project is to make frequency data from over 1M dbGaP subjects open-access in future releases to facilitate discoveries and interpretations of common and rare variants with biological impacts or causing diseases. 
Toward that goal, over 925K dbGaP subjects with genotype data have been analyzed using GRAF-pop as candidates for the ALFA project, pending study approval and processing.</p> <p><strong>Populations</strong></p> <p>[Sample](https://www.ncbi.nlm.nih.gov/snp/docs/gsr/data_inclusion/#Sample) ancestries are validated using [GRAF-pop](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi) and assigned to [12 major populations](https://www.ncbi.nlm.nih.gov/snp/docs/gsr/data_inclusion/#population) including European, Hispanic, African, Asian, and others ([Jin et al., 2019](https://www.g3journal.org/content/9/8/2447)).</p> <p>Information from https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/#citing-this-project</p>
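<p>As a rough illustration of how allele counts relate to allele frequencies when aggregating across populations, here is a minimal sketch. It is not the ALFA pipeline; the function names, population labels, and counts are hypothetical.</p>

```python
from collections import Counter


def aggregate_counts(per_population: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum allele counts for one variant site across populations."""
    combined: Counter = Counter()
    for counts in per_population.values():
        combined.update(counts)
    return dict(combined)


def allele_frequencies(allele_counts: dict[str, int]) -> dict[str, float]:
    """Convert allele counts at one site into frequencies that sum to 1."""
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no genotyped alleles at this site")
    return {allele: count / total for allele, count in allele_counts.items()}


# Hypothetical counts for a single rs site in two populations.
site_counts = {
    "European": {"A": 90, "G": 10},
    "African": {"A": 60, "G": 40},
}
overall = allele_frequencies(aggregate_counts(site_counts))
```

<p>Frequencies can also be computed per population by calling <code>allele_frequencies</code> on each population's counts separately.</p>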
<p>The goal of the ALFA project is to make frequency data from over 1M dbGaP subjects open-access in future releases to facilitate discoveries and interpretations of common and rare variants with biological impacts or causing diseases.</p>
...
CADD

CADD is a tool for scoring the deleteriousness of single nucleotide variants in the human genome.

...
<p>CADD is a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletion variants in the human genome.</p> <p>While many variant annotation and scoring tools exist, most annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Thus, a broadly applicable metric that objectively weights and integrates diverse information is needed. Combined Annotation Dependent Depletion (CADD) is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations.</p> <p>C-scores strongly correlate with allelic diversity, pathogenicity of both coding and non-coding variants, and experimentally measured regulatory effects, and also highly rank causal variants within individual genome sequences. Finally, C-scores of complex trait-associated variants from genome-wide association studies (GWAS) are significantly higher than matched controls and correlate with study sample size, likely reflecting the increased accuracy of larger GWAS.</p> <p>CADD can quantitatively prioritize functional, deleterious, and disease causal variants across a wide range of functional categories, effect sizes and genetic architectures and can be used to prioritize causal variation in both research and clinical settings.</p> <p><strong>Short method summary</strong></p> <p>Fixed or nearly fixed recent evolutionary changes were identified as differences between 1000 Genomes and the Ensembl Compara inferred human-chimpanzee ancestral genome (derived allele frequency (DAF) of at least 95%, 14.9 million SNVs and 1.7 million indels). To simulate an equivalent number of mutations, we used an empirical model of sequence evolution with CpG dinucleotide-specific rates and mutation rates locally scaled in megabase windows. 
For annotation, we used the Ensembl Variant Effect Predictor (VEP), data from the ENCODE project and information from UCSC genome browser tracks. These annotations span a wide range of data types including conservation metrics like GERP, phastCons, and phyloP; functional genomic data like DNase hypersensitivity and transcription factor binding; transcript information like distance to exon-intron boundaries or expression levels in commonly studied cell lines; and protein-level scores like Grantham, SIFT, and PolyPhen.&nbsp;</p> <p><strong>Notes on using scaled vs. unscaled C-scores</strong></p> <p>We believe that CADD scores are useful in two distinct forms, namely &quot;raw&quot; and &quot;scaled&quot;, and we provide both in our output files. &quot;Raw&quot; CADD scores come straight from the model, and are interpretable as the extent to which the annotation profile for a given variant suggests that the variant is likely to be &quot;observed&quot; (negative values) vs &quot;simulated&quot; (positive values). These values have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values do have relative meaning, with higher values indicating that a variant is more likely to be simulated (or &quot;not observed&quot;) and therefore more likely to have deleterious effects.</p> <p>Since the raw scores do have relative meaning, one can take a specific group of variants, define the rank for each variant within that group, and then use that value as a &quot;normalized&quot; and now externally comparable unit of analysis. In our case, we scored and ranked all ~8.6 billion SNVs of the GRCh37/hg19 reference and then &quot;PHRED-scaled&quot; those values by expressing the rank in order of magnitude terms rather than the precise rank itself. 
For example, reference genome single nucleotide variants in the top 10% of CADD scores are assigned to CADD-10, the top 1% to CADD-20, the top 0.1% to CADD-30, and so on. The results of this transformation are the &quot;scaled&quot; CADD scores.</p> <p>Information from https://cadd.gs.washington.edu/info</p>
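<p>The order-of-magnitude scaling described above can be sketched as follows. This is a minimal illustration that assumes the scaled score is -10 * log10(rank / total), which reproduces the CADD-10/20/30 examples; the function name and arguments are hypothetical and this is not CADD's own code.</p>

```python
import math


def phred_scale(rank: int, total: int) -> float:
    """PHRED-like scaled score for a variant ranked `rank` among `total`
    variants, where rank 1 is the highest (most deleterious) raw score.

    Assumed formula: -10 * log10(rank / total).
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10.0 * math.log10(rank / total)


# With 1,000 ranked variants: top 10% -> 10, top 1% -> 20, top 0.1% -> 30.
scores = [phred_scale(rank, 1000) for rank in (100, 10, 1)]
```

<p>Because the scaled score depends only on rank, it stays comparable across releases even when the raw model output changes.</p>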
<p>CADD is a tool for scoring the deleteriousness of single nucleotide variants in the human genome.</p>
<p>CADD scores are freely available for all non-commercial applications. If you are planning on using them in a commercial application, please obtain a license.</p>
...
CADD Exome

CADD is a tool for scoring the deleteriousness of single nucleotide variants in the human genome.

...
<p>CADD is a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletion variants in the human genome.</p> <p>While many variant annotation and scoring tools exist, most annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Thus, a broadly applicable metric that objectively weights and integrates diverse information is needed. Combined Annotation Dependent Depletion (CADD) is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations.</p> <p>C-scores strongly correlate with allelic diversity, pathogenicity of both coding and non-coding variants, and experimentally measured regulatory effects, and also highly rank causal variants within individual genome sequences. Finally, C-scores of complex trait-associated variants from genome-wide association studies (GWAS) are significantly higher than matched controls and correlate with study sample size, likely reflecting the increased accuracy of larger GWAS.</p> <p>CADD can quantitatively prioritize functional, deleterious, and disease causal variants across a wide range of functional categories, effect sizes and genetic architectures and can be used to prioritize causal variation in both research and clinical settings.</p> <p><strong>Short method summary</strong></p> <p>Fixed or nearly fixed recent evolutionary changes were identified as differences between 1000 Genomes and the Ensembl Compara inferred human-chimpanzee ancestral genome (derived allele frequency (DAF) of at least 95%, 14.9 million SNVs and 1.7 million indels). To simulate an equivalent number of mutations, we used an empirical model of sequence evolution with CpG dinucleotide-specific rates and mutation rates locally scaled in megabase windows. 
For annotation, we used the Ensembl Variant Effect Predictor (VEP), data from the ENCODE project and information from UCSC genome browser tracks. These annotations span a wide range of data types including conservation metrics like GERP, phastCons, and phyloP; functional genomic data like DNase hypersensitivity and transcription factor binding; transcript information like distance to exon-intron boundaries or expression levels in commonly studied cell lines; and protein-level scores like Grantham, SIFT, and PolyPhen.</p> <p><strong>Notes on using scaled vs. unscaled C-scores</strong></p> <p>We believe that CADD scores are useful in two distinct forms, namely &quot;raw&quot; and &quot;scaled&quot;, and we provide both in our output files. &quot;Raw&quot; CADD scores come straight from the model, and are interpretable as the extent to which the annotation profile for a given variant suggests that the variant is likely to be &quot;observed&quot; (negative values) vs &quot;simulated&quot; (positive values). These values have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values do have relative meaning, with higher values indicating that a variant is more likely to be simulated (or &quot;not observed&quot;) and therefore more likely to have deleterious effects.</p> <p>Since the raw scores do have relative meaning, one can take a specific group of variants, define the rank for each variant within that group, and then use that value as a &quot;normalized&quot; and now externally comparable unit of analysis. In our case, we scored and ranked all ~8.6 billion SNVs of the GRCh37/hg19 reference and then &quot;PHRED-scaled&quot; those values by expressing the rank in order of magnitude terms rather than the precise rank itself. For example, reference genome single nucleotide variants in the top 10% of CADD scores are assigned to CADD-10, the top 1% to CADD-20, the top 0.1% to CADD-30, and so on. 
The results of this transformation are the &quot;scaled&quot; CADD scores.&nbsp;<br /> &nbsp;</p>
<p>CADD is a tool for scoring the deleteriousness of single nucleotide variants&nbsp;in the human genome.</p>
<p>CADD scores are freely available for all non-commercial applications.&nbsp;If you are planning on using them in a commercial application, please obtain a license.</p>
...
Cancer Genome Interpreter

Flags validated oncogenic alterations and genomic biomarkers of drug response, while predicting cancer drivers among mutat...

<p>Cancer Genome Interpreter (CGI) is designed to support the identification of tumor alterations that drive the disease and to detect those that may be therapeutically actionable. CGI relies on existing knowledge collected from several resources and on computational methods that annotate the alterations in a tumor according to distinct levels of evidence.</p> <p>CGI flags validated oncogenic alterations and predicts cancer drivers among mutations of unknown significance. It also flags genomic biomarkers of drug response with different levels of clinical relevance.</p> <p><strong>CGI Framework</strong></p> <p>With a list of genomic alterations in a tumor of a given cancer type as input, the CGI automatically recognizes the format, remaps the variants as needed and standardizes the annotation for downstream compatibility. Next, it identifies known driver alterations and annotates and classifies the remaining variants of unknown significance. Finally, alterations that are biomarkers of drug effect are identified according to current evidence.</p> <p><strong>Identification of Driver Events</strong></p> <p>Alterations that are clinically or experimentally validated to drive tumor phenotypes (previously culled from public sources) are identified by the CGI, whereas the effect of the remaining alterations of uncertain significance is predicted using in silico approaches, such as OncodriveMUT (for mutations).</p> <p><strong>OncodriveMUT</strong></p> <p>OncodriveMUT is a bioinformatics method to identify the most likely driver mutations of a tumor. Its main innovation with respect to other existing tools with a similar purpose is the incorporation of features characterizing the genes (or regions within genes) where the mutations occur, derived from the analysis of cohorts of tumors (6,792 samples across 28 cancer types) and samples from healthy donors (60,706 unrelated individuals). 
This knowledge is combined with features that describe the impact of the mutation on the function of the protein it affects via a set of heuristic rules to predict the effect of the mutations of uncertain significance.</p> <p><strong>Cancer Biomarkers Database</strong></p> <p>The cancer biomarkers db integrates manually collected genomic biomarkers of drug sensitivity, resistance and severe toxicity. These biomarkers are classified by the cancer type in which they have been described according to different levels of clinical evidence supporting the association. The database is available for access and feedback by the community at www.cancergenomeinterpreter.org/biomarkers. The aggregation, curation and interpretation of the biomarkers follow the standard operating procedures developed under the umbrella of the H2020 MedBioinformatics project, thus ensuring the mid-term maintenance of these resources. The feedback from the community is also facilitated through the CGI web interface. Nevertheless, access to this type of data is both crucial for the advance of cancer precision medicine and highly complex to be comprehensively covered and updated by a single institution. This is why the [Variant Interpretation for Cancer Consortium](https://www.ga4gh.org/#/vicc), under the Global Alliance for Genomics &amp; Health framework (Global Alliance for Genomics and Health et al. 2016)⁠, has recently been launched with the aim to unify the curation efforts in several institutes, including ours.</p> <p>Information from https://www.cancergenomeinterpreter.org/faq<br /> &nbsp;</p>
<p>Flags validated oncogenic alterations and genomic biomarkers of drug response, while predicting cancer drivers among mutations of unknown significance.</p>
<p>Freely available for non-commercial use.</p>
...
Cancer Hotspots

A resource for statistically significant mutations in cancer.

...
<p>Our systematic computational, experimental, and clinical analysis of hotspot mutations in ~25,000 human cancers demonstrates that the long right tail of biologically and therapeutically significant mutant alleles is still incompletely characterized. Sharing prospective genomic data will accelerate hotspot identification, thereby expanding the reach of precision oncology in cancer patients</p> <p><strong>Methods</strong></p> <p>To identify novel hotspot mutations of biological and potentially therapeutic importance, we analyzed somatic mutational data from 24,592 patients. This cohort consisted of 14,256 retrospectively sequenced predominantly primary untreated human cancers and 10,336 prospectively sequenced active, advanced cancer patients with recurrent and metastatic disease (43% of specimens were obtained from metastatic sites) (8). These samples represent 322 cancer types spanning 32 organ sites, the annotation of which was standardized to conform to an open-source structured disease classification (http://oncotree.mskcc.org/). We analyzed each of the 32 organ sites independently as well as the full cohort (pan-cancer) to enhance the probability of identifying hotspots that occur rarely in multiple organ types of different mutational burdens and processes(9). To do this, organ type-specific, gene-specific, and context-specific background mutation rates were computed. We also developed a first-of-its-kind computational approach to identify hotspots of candidate oncogenic small in-frame insertions or deletions, which are more challenging to identify than substitutions due to the variability of mutant allele length and position from tumor to tumor.</p> <p><strong>About</strong></p> <p>This resource is maintained by the Kravis Center for Molecular Oncology at Memorial Sloan Kettering Cancer Center. 
It provides information about statistically significantly recurrent mutations identified in large scale cancer genomics data.</p> <p>Information from https://cancerdiscovery.aacrjournals.org/content/candisc/early/2017/12/15/2159-8290.CD-17-0321.full.pdf<br /> &nbsp;</p>
<p>A resource for statistically significant mutations in cancer.</p>
...
CardioBoost

Predicts pathogenicity of missense variants for inherited cardiac conditions

...
<p>CardioBoost is a disease-specific machine learning classifier that predicts the pathogenicity of rare (gnomAD allele frequency &lt;=0.1%) missense variants in genes associated with cardiomyopathies and arrhythmias, and that outperforms existing genome-wide prediction tools.</p> <p><strong>Inherited Cardiac Conditions</strong></p> <p>We consider two types of conditions:<br /> * __Cardiomyopathies__: dilated cardiomyopathy and hypertrophic cardiomyopathy<br /> * __Inherited Arrhythmia Syndromes__: Long QT syndrome and Brugada syndrome</p> <p><strong>Inherited Cardiac Conditions Related Genes</strong></p> <p>The following tables list the genes related to these conditions; only genes with known pathogenic variants in our curated data sets are included.</p> <p><strong>Cardiomyopathies</strong></p> <p>|Gene Symbol|Ensembl Gene ID|Ensembl Transcript ID|Ensembl Protein ID|<br />
|:----------|:-------------------|:-----------------------|:--------------------|<br />
|ACTC1|ENSG00000159251|ENST00000290378|ENSP00000290378|<br />
|DES|ENSG00000175084|ENST00000373960|ENSP00000363071|<br />
|GLA|ENSG00000102393|ENST00000218516|ENSP00000218516|<br />
|LAMP2|ENSG00000005893|ENST00000200639|ENSP00000200639|<br />
|LMNA|ENSG00000160789|ENST00000368300|ENSP00000357283|<br />
|MYBPC3|ENSG00000134571|ENST00000545968|ENSP00000442795|<br />
|MYH7|ENSG00000092054|ENST00000355349|ENSP00000347507|<br />
|MYL2|ENSG00000111245|ENST00000228841|ENSP00000228841|<br />
|MYL3|ENSG00000160808|ENST00000395869|ENSP00000379210|<br />
|PLN|ENSG00000198523|ENST00000357525|ENSP00000350132|<br />
|PRKAG2|ENSG00000106617|ENST00000287878|ENSP00000287878|<br />
|PTPN11|ENSG00000179295|ENST00000351677|ENSP00000340944|<br />
|SCN5A|ENSG00000183873|ENST00000333535|ENSP00000328968|<br />
|TNNI3|ENSG00000129991|ENST00000344887|ENSP00000341838|<br />
|TNNT2|ENSG00000118194|ENST00000367318|ENSP00000356287|<br />
|TPM1|ENSG00000140416|ENST00000403994|ENSP00000385107|</p> <p><strong>Inherited Arrhythmia Syndromes</strong></p> <p>|Gene Symbol|Ensembl Gene ID|Ensembl Transcript ID|Ensembl Protein ID|<br />
|:----------|:-------------------|:-----------------------|:--------------------|<br />
|CACNA1C|ENSG00000151067|ENST00000399655|ENSP00000382563|<br />
|CALM1|ENSG00000198668|ENST00000356978|ENSP00000349467|<br />
|CALM2|ENSG00000143933|ENST00000272298|ENSP00000272298|<br />
|CALM3|ENSG00000160014|ENST00000291295|ENSP00000291295|<br />
|KCNH2|ENSG00000055118|ENST00000262186|ENSP00000262186|<br />
|KCNQ1|ENSG00000053918|ENST00000155840|ENSP00000155840|<br />
|SCN5A|ENSG00000183873|ENST00000333535|ENSP00000328968|</p> 
<p><strong>Classification Criteria</strong></p> <p>Variant classification is based on the probability of pathogenicity predicted by CardioBoost. Following the ACMG guidelines, we use Pr&gt;=0.9 as the high-certainty threshold for classifying variants. A variant with a classification probability below 90% is considered indeterminate, with a low classification confidence level. In short, a variant is classified according to its predicted pathogenicity:</p> <p>* pathogenicity&gt;=0.9: Pathogenic/Likely pathogenic<br /> * pathogenicity&lt;=0.1: Benign/Likely benign<br /> * pathogenicity&gt;0.1 and &lt;0.9: Variant of Uncertain Significance (VUS)</p> <p><strong>Why does CardioBoost not output predictions on my input list of variants?</strong></p> <p>There are three main reasons why CardioBoost would not return a prediction:<br /> * The gene is not among the disease-related genes listed above. Please check the gene lists above.<br /> * The mutation is not a valid missense change on the gene&#39;s canonical transcript (shown in the gene lists above).<br /> * The variant&#39;s gnomAD allele frequency is greater than 0.1%, so it is considered a common variant that is highly likely to be benign for cardiomyopathies and arrhythmias.</p> <p>Information from https://www.cardiodb.org/cardioboost/</p>
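<p>The three classification thresholds listed above can be expressed as a small function. This is only a sketch of the stated cutoffs, not CardioBoost code; the function name is hypothetical.</p>

```python
def classify_variant(pathogenicity: float) -> str:
    """Map a predicted pathogenicity probability to a class label using the
    cutoffs stated in the text (>= 0.9 and <= 0.1)."""
    if not 0.0 <= pathogenicity <= 1.0:
        raise ValueError("pathogenicity must be a probability in [0, 1]")
    if pathogenicity >= 0.9:
        return "Pathogenic/Likely pathogenic"
    if pathogenicity <= 0.1:
        return "Benign/Likely benign"
    return "Variant of Uncertain Significance (VUS)"


# Hypothetical predicted probabilities for three variants.
labels = [classify_variant(p) for p in (0.95, 0.50, 0.03)]
```

<p>Both boundary values are inclusive, so a prediction of exactly 0.9 or 0.1 falls into a confident class and not into VUS.</p>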
<p>Predicts pathogenicity of missense variants for inherited cardiac conditions</p>
...
CCR: Constrained Coding regions Model

The constrained coding regions model (CCR) uses the Genome Aggregation Database to reveal regions of protein coding genes ...

<p>The constrained coding regions model (CCR) uses the Genome Aggregation Database (gnomAD, version 2.0.1 in the paper) to reveal regions of protein-coding genes that are likely to be under purifying selection. We used protein-altering variation from across 123,136 ostensibly healthy individuals&#39; exomes to reveal coding regions that are completely devoid of any protein-altering variation. We infer such coding regions to be constrained; the higher the constraint percentile, the more constrained we predict the region to be.</p> <p>The most constrained regions (&ge;90th percentile, and especially &ge;99th percentile) have been shown to be extremely enriched for pathogenic variation in ClinVar, de novo dominant mutations in patients with severe developmental disorders, and critical Pfam domains exome-wide. Even more exciting, 72% of genes harboring a CCR in the 99th percentile or higher have no known pathogenic variants. There is great opportunity for discovery of function in these understudied genes as well as their role in disease phenotypes or potentially in embryonic lethality when altered.</p> <p>Information from https://github.com/quinlan-lab/ccr</p>
<p>The constrained coding regions model (CCR) uses the Genome Aggregation Database to reveal regions of protein-coding genes that are likely to be under purifying selection.</p>
...
Candidate cis-Regulatory Elements by ENCODE (SCREEN)

SCREEN allows the user to explore Candidate cis-Regulatory Elements (cCREs) and investigate how these elements relate to o...

<p>The ENCODE Encyclopedia comprises two levels of epigenomic and transcriptomic annotations. The ground level includes annotations such as peaks and quantifications for individual data types produced by the ENCODE uniform processing pipelines. The integrative level contains annotations generated by integrating multiple ground-level annotations. The core of the integrative level is the Registry of candidate cis-Regulatory Elements (cCREs), which is displayed in SCREEN, a web-based visualization engine designed specifically for the Registry. SCREEN allows the user to explore cCREs and investigate how these elements relate to other Encyclopedia annotations and raw ENCODE data.</p> <p><strong>The Registry of cCREs</strong></p> <p>cCREs are the subset of representative DNase hypersensitivity sites (rDHSs) supported by either histone modifications (H3K4me3 and H3K27ac) or CTCF-binding data. We start with 93 million individual DHSs across 706 DNase-seq profiles in human and 20 million individual DHSs from 173 DNase-seq profiles in mouse. For each species, we iteratively cluster the DHSs across all profiles and select the DHS with the highest signal (read depth normalized signal) as the rDHS for each cluster. This iterative clustering and selection process continues until it results in a list of non-overlapping rDHSs (2.2 million rDHSs in human and 1.2 million rDHSs in mouse) representing all DHSs (Figure 3). We then further selected rDHSs with high DNase signal in at least one biosample (defined as a Z-score &gt; 1.64, see details on defining high signal below). Finally, from this subset of high signal rDHSs, we selected all elements that were also supported by high H3K4me3, H3K27ac, and/or CTCF ChIP-seq signals in concerted biosamples (i.e., samples with complementary assay coverage). 
This resulted in a total of 926,535 human cCREs.</p> <p><strong>Classification of cCREs</strong></p> <p>Many uses of cCREs are based on the regulatory role associated with their biochemical signatures. Thus, we putatively defined cCREs in one of the following annotation groups based on each element&rsquo;s dominant biochemical signals across all available biosamples. Analogous to GENCODE&#39;s catalog of genes, which are defined irrespective of their varying expression levels and alternative transcripts across different cell types, we provide a general, cell type-agnostic classification of cCREs based on the max-Zs as well as its proximity to the nearest annotated TSS:</p> <p>* cCREs with promoter-like signatures (cCRE-PLS) fall within 200 bp (center to center) of an annotated GENCODE TSS and have high DNase and H3K4me3 signals (evaluated as DNase and H3K4me3 max-Z scores, defined as the maximal DNase or H3K4me3 Z scores across all biosamples with data; see Methods).</p> <p>* cCREs with enhancer-like signatures (cCRE-ELS) have high DNase and H3K27ac max-Z scores and must additionally have a low H3K4me3 max-Z score if they are within 200 bp of an annotated TSS. The subset of cCREs-ELS within 2 kb of a TSS is denoted proximal (cCRE-pELS), while the remaining subset is denoted distal (cCRE-dELS).</p> <p>* DNase-H3K4me3 cCREs have high H3K4me3 max-Z scores but low H3K27ac max-Z scores and do not fall within 200 bp of a TSS.</p> <p>* CTCF-only cCREs have high DNase and CTCF max-Z scores and low H3K4me3 and H3K27ac max-Z scores.</p> <p>In addition to the cell type-agnostic classification described above, we evaluated the biochemical activity of each cCRE in each individual cell type using the corresponding DNase, H3K4me3, H3K27ac, and CTCF data. 
All cCREs with low DNase Z-scores in a particular cell type are bundled into one &ldquo;inactive&rdquo; state for that cell type; the remaining &ldquo;active&rdquo; cCREs are divided into eight states according to their epigenetic signal Z-scores, producing nine possible states in total. The three groups described above&mdash;cCRE-PLS, cCRE-ELS, and CTCF-only cCRE&mdash;apply to the active cCREs within a particular cell type. Two additional groups are defined with respect to individual cell types: an inactive group, containing all cCREs in the inactive state, and a DNase-only group, containing cCREs with high DNase Z-scores but low H3K4me3, H3K27ac, and CTCF Z-scores within the cell type. Importantly, while the classification schemes in Figures 2 and 3 place each cCRE into only one activity group, the signal strengths for all recorded epigenetic features are retained for each cCRE in the Registry, and these can be used for customized searches by users.</p> <p>We also attempt to make group assignments for cCREs in a particular biosample not fully covered by the four core assays, making some approximations. For samples with DNase data, we classify elements using the available marks. For example, if a sample lacks H3K27ac its cCREs will be assigned to the PLS and DNase-H3K4me3 groups but not the pELS or dELS groups. For biosamples lacking DNase data, we do not have the resolution to identify specific elements. Therefore, for these biosamples, we simply label the cCRE as having a high or low signal for every available assay. In these biosamples, cCREs with low H3K4me3, H3K27ac, or CTCF signals were labelled &ldquo;unclassified&rdquo; because we were unable to classify them as low-DNase without DNase data. In both SCREEN and in downloadable files biosamples lacking data are clearly labeled as such.</p> <p>Information from https://screen.wenglab.org/</p>
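<p>The cell type-agnostic classification rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not SCREEN's implementation: the function name <code>classify_ccre</code> is hypothetical, the 1.64 cutoff is stated in the text only for DNase and is reused here for the other marks as an assumption, and the order in which the rules are tested is also an assumption.</p>

```python
HIGH_Z = 1.64  # DNase cutoff from the text; applied to every mark here as an assumption


def classify_ccre(dnase, h3k4me3, h3k27ac, ctcf, tss_distance_bp):
    """Assign one cell type-agnostic group from max-Z scores (hypothetical helper)."""
    d, k4, k27, c = (z > HIGH_Z for z in (dnase, h3k4me3, h3k27ac, ctcf))
    near_tss = tss_distance_bp <= 200  # "within 200 bp (center to center)"

    if not d:
        # Every cCRE needs high DNase signal in at least one biosample.
        return "unclassified"
    if k4 and near_tss:
        return "PLS"
    if k27 and (not near_tss or not k4):
        # Enhancer-like: proximal within 2 kb of a TSS, distal otherwise.
        return "pELS" if tss_distance_bp <= 2000 else "dELS"
    if k4 and not k27 and not near_tss:
        return "DNase-H3K4me3"
    if c and not k4 and not k27:
        return "CTCF-only"
    return "unclassified"


examples = {
    "promoter": classify_ccre(3.0, 2.5, 0.5, 0.1, 50),
    "proximal enhancer": classify_ccre(3.0, 0.2, 2.5, 0.1, 1500),
    "distal enhancer": classify_ccre(3.0, 0.2, 2.5, 0.1, 50000),
    "insulator-like": classify_ccre(3.0, 0.2, 0.5, 2.5, 5000),
}
```

<p>Because the Registry retains the signal strengths for every mark, a different cutoff or rule order can be tried without changing the underlying data.</p>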
<p>SCREEN allows the user to explore Candidate cis-Regulatory Elements (cCREs) and investigate how these elements relate to other Encyclopedia annotations and raw ENCODE data.</p>
...
CGD: Clinical Genomic Database

A manually curated database of conditions with known genetic causes, focusing on medically significant genetic data with a...

<p>A key barrier to translating the power of genomic sequencing to clinically-oriented research analyses involves the time and resources required for clinically-relevant analysis. To help address this barrier, we constructed the Clinical Genomic Database (CGD), a manually curated database of conditions with known genetic causes, focusing on medically significant genetic data with available interventions.</p> <p>All conditions with identified genetic causes are included in the CGD. For each entry, the database includes the gene symbol, condition(s), allelic conditions, inheritance, age (pediatric or adult) in which interventions are indicated, clinical categorization, and a general description of interventions/rationale. The contents are not intended to serve as nor substitute for comprehensive clinical guidelines or to provide clinical direction, but are rather intended to briefly describe the types of interventions that might be considered so that this information can be used for further research purposes.</p> <p>The database includes only single gene alterations (it does not include contiguous gene syndromes, although some conditions with, for example, digenic inheritance are included), and does not include genetic associations or susceptibility factors related to more complex diseases, such as identified through association-based studies. Somatic alterations, such as commonly occur in cancerous processes, are not included unless a germline change in the same gene results in disease.</p> <p><strong>Clinical Categorization of the CGD</strong></p> <p>The CGD has been constructed to reflect the multisystemic nature of many genetic conditions in order to allow more comprehensive browsing by clinical categories. In the CGD, genes were first categorized into Manifestation categories, or the organ system(s) primarily affected by mutations in the corresponding gene. 
For many of these organ systems, recognition of the condition&#39;s effects and related supportive care may be clinically beneficial. Conditions not grouped within a specific organ system under the Manifestation categories were included in the General category.</p> <p>Next, genes were separately categorized under Intervention categories by the organ system(s) for which specific medical interventions were available. In determining the Intervention categories, the following points were considered: 1) the condition must be clinically significant (i.e., at least some manifestations must result in morbidity and mortality); 2) there must be a currently available, potentially beneficial intervention (this intervention may include preventive measures, surveillance, or medical and/or surgical treatments, though experimental/research-based interventions were not included); 3) there should be advantage to early (&ldquo;genomic&rdquo;) diagnosis as opposed to discovery of the condition on purely clinical grounds (i.e., without genetic/genomic testing). Regarding this last point, precise diagnosis is challenging for many conditions, and correct recognition based on genetic/genomic diagnosis may allow interventions related to specific manifestations. The efficacy of these interventions would be diminished or lost with later diagnosis, such as might occur based primarily upon clinical presentation.</p> <p>For the Intervention categories, all genes not meeting the above criteria were included in the General category. As described, for many such conditions, while a more specific intervention may not be currently available, genetic knowledge may be beneficial related to a number of issues, including the selection of optimal supportive care, prognostic considerations related to medical-decision making, informing reproductive decisions, and avoidance of unnecessary testing as part of the diagnostic process.</p> <p>Information from https://research.nhgri.nih.gov/CGD/</p>
<p>A manually curated database of conditions with known genetic causes, focusing on medically significant genetic data with available interventions.</p>
...
CHASMplus

CHASMplus is a machine learning algorithm that discriminates somatic missense mutations as either cancer drivers or p...

<p>CHASMplus is a computational tool to classify missense mutations as drivers or passengers in human cancers. Driver mutations provide a selective advantage to cancer cells, while passenger mutations do not. CHASMplus is based on a set of 95 features, characterizing mutational hotspots, evolutionary conservation/human germline variation, molecular function annotations (e.g., protein-protein interface annotations, sequence biased regions), and relevant covariates (e.g., replication timing). It was trained using somatic mutations from whole-exome sequencing of a large number of tumors in The Cancer Genome Atlas (TCGA). CHASMplus can score mutations in either a cancer type-specific manner or &quot;pan-cancer&quot;, which is a useful default for many cancer types.&nbsp;</p> <p>Please check out the [CHASMplus website](https://chasmplus.readthedocs.io) for more information.</p> <p><strong>How do I interpret the results?</strong></p> <p>The CHASMplus output contains two main components: a score and a p-value. High scores reflect a greater likelihood that a mutation is a driver (scores range from 0.0 to 1.0). The p-value reflects the statistical significance of obtaining the achieved or higher CHASMplus score. We recommend that driver mutations are called based on the False Discovery Rate, preferably by using the Benjamini-Hochberg method through a function like the `p.adjust` function in the R programming language. Possible thresholds include a false discovery rate of 0.1 or 0.01, depending on the need to constrain false positives. NOTE: P-values are calibrated for whole-exome sequencing studies. If you are using a targeted gene panel or focusing on only a specific subset of genes, then please use the gene panel version of CHASMplus.</p> <p><strong>How are CHASMplus scores generated?</strong></p> <p>CHASMplus scores all possible missense mutations on all transcripts. 
Therefore keep in mind that the same genomic mutation may have slightly varying scores depending on the transcript. OpenCRAVAT decides which among many transcripts will be chosen as the default. All scores provided through OpenCRAVAT are weighted by their respective gene based on 20/20+ (gene weighted CHASMplus scores).&nbsp;</p> <p><strong>Cancer Type Abbreviations Guide</strong></p> <p>| Abbr. | Full | Abbr. | Full | Abbr. | Full | Abbr. | Full | Abbr. | Full |<br /> |--------------|------------------------------------------------------------------|--------------|---------------------------------------|--------------|------------------------------------|--------------|--------------------------------------|--------------|----------------|<br /> | ACC | Adrenocortical carcinoma | HNSC | Head and Neck squamous cell carcinoma | LUSC | Lung squamous cell carcinoma | SARC | Sarcoma | UVM | Uveal Melanoma |<br /> | BLCA | Bladder Urothelial Carcinoma | KICH | Kidney Chromophobe | MESO | Mesothelioma | SKCM | Skin Cutaneous Melanoma | &nbsp;| &nbsp;|<br /> | CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma | KIRC | Kidney renal clear cell carcinoma | OV | Ovarian serous cystadenocarcinoma | STAD | Stomach adenocarcinoma | &nbsp;| &nbsp;|<br /> | CHOL | Cholangiocarcinoma | KIRP | Kidney renal papillary cell carcinoma | PAAD | Pancreatic adenocarcinoma | TGCT | Testicular Germ Cell Tumors | &nbsp;| &nbsp;|<br /> | COAD | Colon adenocarcinoma | LAML | Acute Myeloid Leukemia | PANCAN | PAN Cancer | THCA | Thyroid carcinoma | &nbsp;| &nbsp;|<br /> | DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | LGG | Brain Lower Grade Glioma | PCPG | Pheochromocytoma and Paraganglioma | THYM | Thymoma | &nbsp;| &nbsp;|<br /> | ESCA | Esophageal carcinoma | LIHC | Liver hepatocellular carcinoma | PRAD | Prostate adenocarcinoma | UCEC | Uterine Corpus Endometrial Carcinoma | &nbsp;| &nbsp;|<br /> | GBM | Glioblastoma multiforme | LUAD | Lung adenocarcinoma | 
READ | Rectum adenocarcinoma | UCS | Uterine Carcinosarcoma | &nbsp;| &nbsp;|</p> <p><strong>Support</strong></p> <p>This work was supported by:</p> <p>* F31CA200266 (to Collin Tokheim)&nbsp;<br /> * U24CA204817 (to Rachel Karchin)</p> <p><strong>Citation</strong></p> <p>Tokheim C, Karchin R. CHASMplus reveals the scope of somatic missense mutations driving human cancers. bioRxiv. 2018:313296.</p> <p><strong>Contact Us</strong></p> <p>Collin Tokheim ctokhei1@alumni.jhu.edu<br /> Rachel Karchin karchin@jhu.edu</p>
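<p>The False Discovery Rate recommendation in the interpretation notes above can be sketched without R. The following is a minimal pure-Python version of the Benjamini-Hochberg adjustment, intended to mirror what `p.adjust(p, method = "BH")` computes; the function names <code>benjamini_hochberg</code> and <code>call_drivers</code> and the example p-values are hypothetical and are not part of CHASMplus or OpenCRAVAT.</p>

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_drivers(p_values, fdr=0.1):
    """Flag mutations whose adjusted p-value is at or below the chosen FDR."""
    return [q <= fdr for q in benjamini_hochberg(p_values)]


# Made-up p-values for four mutations, for illustration only.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
example_calls = call_drivers(example_p, fdr=0.03)
```

<p>The adjustment should be applied to the full set of scored mutations in a study, since the correction depends on how many tests were performed.</p>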
<p>CHASMplus is a machine learning algorithm that discriminates somatic&nbsp;missense mutations as either cancer drivers or passengers. Predictions can be done in either a cancer type-specific manner or by a model considering multiple cancer&nbsp;types together (a useful default). Along with scoring each mutation, CHASMplus has&nbsp;a rigorous statistical model to evaluate the statistical significance of predictions.</p> <p>This OpenCRAVAT module represents the v1.0 precompute of CHASMplus (source code&nbsp;v1.0).</p>
<p>Freely available for non-commercial use.</p>
...
ClinGen Gene

ClinGen is a National Institutes of Health (NIH)-funded resource dedicated to building a central resource that ...

<p><strong>Get Started With ClinGen</strong></p> <p>Funded in 2013 by the National Human Genome Research Institute, ClinGen is a growing collaborative effort, involving three grants, nine principal investigators and over 970 contributors from more than 29 countries. Explore our website to get to know our working groups, learn more about how we are meeting our goals and search our knowledge base. If you have any questions, please contact us at clingen@clinicalgenome.org.</p> <p><strong>What is ClinGen</strong></p> <p>While knowledge in the field of human genetics has greatly increased since the time of the Human Genome Project, we are still learning all of the ways in which changes in our DNA contribute to human health and disease. The Clinical Genome Resource, or ClinGen, is a National Institutes of Health funded initiative to increase the community&rsquo;s knowledge about the relationship between genes and health. We are dedicated to building a knowledge base that defines the clinical relevance of genes and variants for use in precision medicine and research. 
We do this by first encouraging the sharing of genetic and health data by our key stakeholder groups: patients, clinicians, laboratories, and researchers.</p> <p>We then use this data to answer a number of key curation questions:</p> <p>Is this gene associated with a disease, and by which mechanisms does variation cause this disease?</p> <p>Is this variant causative?</p> <p>Will this information affect medical management?</p> <p>Once we answer these questions via our various curation efforts, we make this information publicly available, building a genomic knowledge base with the goal of improving patient care through genomic medicine.</p> <p><strong>Gene-Disease Validity</strong></p> <p>Our curators review genetic and experimental data in the scientific literature to identify genes in which pathogenic variants cause disease.</p> <p>Laboratories may use this type of information when deciding which genes to include in clinical testing panels.</p> <p>Clinicians may use this type of information when interpreting test results for their patients.</p>
<p>ClinGen is a National Institutes of Health (NIH)-funded resource dedicated to building a central resource that defines the clinical relevance of genes and&nbsp;variants for use in precision medicine and research.</p>
...
ClinVar Postprocess

ClinVar is an archive of reports of the relationships among human variations and phenotypes, with supporting evidence...

<p><strong>ClinVar:</strong> public archive of interpretations of clinically relevant variants<br /> ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.&nbsp;<br /> &nbsp;<br /> Clinical Significance includes: &nbsp;<br /> Benign, Likely benign, Uncertain significance, Likely pathogenic, Pathogenic, Drug response, Association, Risk factor, Protective, Affects, Conflicting data from submitters, Other and Not provided<br /> &nbsp;<br /> Review Status is the level of review supporting the assertion of clinical significance for the variation. There are 8 different statuses to consider, ranked from highest to lowest levels of review:</p> <p>1. Practice Guideline<br /> 2. Reviewed by expert panel&nbsp;<br /> 3. Criteria provided, multiple submitters, no conflicts&nbsp;<br /> 4. Criteria provided, conflicting interpretations<br /> 5. Criteria provided, single submitter<br /> 6. No assertion for the individual variant&nbsp;<br /> 7. No assertion criteria provided<br /> 8. No assertion provided</p> <p>See [ClinVar Review Status](https://www.ncbi.nlm.nih.gov/clinvar/docs/review_status/) for more information.</p> <p>![Screenshot](clinvar_screenshot_1.png)</p>
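<p>The review status ranking listed above lends itself to a simple lookup when filtering annotated variants. The sketch below is one possible way to do that; the names <code>REVIEW_STATUSES</code> and <code>meets_review_level</code> are hypothetical, and the exact status strings in a given ClinVar release or OpenCRAVAT output may be spelled differently from the list above.</p>

```python
# Statuses in the order given above, from highest to lowest level of review.
REVIEW_STATUSES = [
    "Practice Guideline",
    "Reviewed by expert panel",
    "Criteria provided, multiple submitters, no conflicts",
    "Criteria provided, conflicting interpretations",
    "Criteria provided, single submitter",
    "No assertion for the individual variant",
    "No assertion criteria provided",
    "No assertion provided",
]

# Rank 1 is the highest level of review; matching ignores case.
REVIEW_RANK = {s.lower(): r for r, s in enumerate(REVIEW_STATUSES, start=1)}


def meets_review_level(status, minimum):
    """True if `status` is reviewed at least as thoroughly as `minimum`."""
    return REVIEW_RANK[status.strip().lower()] <= REVIEW_RANK[minimum.strip().lower()]


keep = meets_review_level(
    "Reviewed by expert panel",
    minimum="Criteria provided, single submitter",
)
```

<p>An unrecognized status raises a <code>KeyError</code>, which is deliberate here so that unexpected strings are noticed instead of silently filtered out.</p>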
<p>ClinVar is an archive of reports of the relationships among human variations&nbsp;and phenotypes, with supporting evidence.</p>
...
Cardiovascular Disease Knowledge Portal

Produces an effect weight for the variant for the disease in which it is found. Weights are assigned according to the strength of ...

<p><strong>Cardiovascular Disease Knowledge Portal&nbsp;</strong><br /> The Cardiovascular Disease Knowledge Portal enables browsing, searching, and analysis of human genetic information linked to myocardial infarction, atrial fibrillation, and related traits, while protecting the integrity and confidentiality of the underlying data.</p> <p><strong>Data in the Cardiovascular Disease Knowledge Portal</strong></p> <p>Data in the CVDKP were generated by these consortia:</p> <p>[The Atrial Fibrillation Consortium (AFGen)](https://www.afgen.org/) seeks to identify the genetic basis of atrial fibrillation. Collaborators from more than 50 studies have contributed to ongoing projects, and a full list of AFGen investigators is available [here](https://www.afgen.org/members-c1zr5).</p> <p>[The Global Lipids Genetics Consortium (GLGC)](http://lipidgenetics.org/) studies the genetics of plasma lipids. The current collaborative research network involves over 200 investigators from more than 80 institutions.</p> <p>[The Heart Failure Molecular Epidemiology for Therapeutic Targets (HERMES)](https://www.hermesconsortium.org/) consortium is an international collaboration that aims to generate insights into the causal pathways leading to heart failure to inform new therapeutic approaches.</p> <p>[The Myocardial Infarction Genetics Consortium (MIGen)](http://www.kathiresanlab.org/collaborators/myocardial-infarction-genetics-exome-sequencing-consortium/) performs both genotyping and sequencing studies to understand the genetics of early-onset heart attack.</p> <p>[The CARDIoGRAMplusC4D Consortium](http://www.cardiogramplusc4d.org/) is a collaborative effort to combine data from multiple large scale genetic studies to identify risk loci for coronary artery disease and myocardial infarction.</p> <p><strong>The Knowledge Portal Framework</strong></p> <p>The Knowledge Portal framework is being developed as part of the Accelerating Medicines Partnership, a public-private partnership 
between the National Institutes of Health (NIH), the U.S. Food and Drug Administration (FDA), 10 biopharmaceutical companies, and multiple non-profit organizations that is managed through the Foundation for the NIH (FNIH). AMP seeks to harness collective capabilities, scale, and resources toward improving current efforts to develop new therapies for complex, heterogeneous diseases. The ultimate goal is to increase the number of new diagnostics and therapies for patients while reducing the time and cost of developing them, by jointly identifying and validating promising biological targets for several diseases, including type 2 diabetes.</p> <p>Knowledge Portals are intended to serve three key functions:</p> <p>* To be central repositories for large datasets of human genetic information linked to complex diseases and related traits.<br /> * To function as scientific discovery engines that can be harnessed by the community at large, and assist in the selection of new targets for drug design.<br /> * Eventually, to facilitate the conduct of customized analyses by any interested user around the world, doing so in a secure manner that provides high quality results while protecting the integrity of the data.</p> <p>Information from http://www.broadcvdi.org/</p>
<p>Produces an effect weight for the variant for the disease in which it is found. Weights are assigned according to the strength of their association with disease risk.</p>
...
DANN Coding

DANN is a functional prediction score retrained based on the training data of CADD using a deep neural network.

...
<p><strong>DANN:</strong> a deep learning approach for annotating the pathogenicity of genetic variants</p> <p>Annotating genetic variants for the purpose of identifying pathogenic variants remains a challenge. Combined annotation-dependent depletion (CADD) is an algorithm designed to annotate coding variants, and has been shown to outperform other annotation algorithms. CADD trains a linear kernel support vector machine (SVM) to differentiate evolutionarily derived, likely benign, alleles from simulated, likely deleterious, variants. However, SVMs cannot capture non-linear relationships among the features, which can limit performance. To address this issue, we have developed DANN. DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture non-linear relationships among features and are better suited than SVMs for problems with a large number of samples and features. &nbsp;DANN achieves about a 19% relative reduction in the error rate and about a 14% relative increase in the area under the curve (AUC) metric over CADD&rsquo;s SVM methodology.</p> <p>Information from https://academic.oup.com/bioinformatics/article/31/5/761/2748191</p> <p><br /> &nbsp;</p>
<p>DANN is a functional prediction score retrained based on the training data of CADD using a deep neural network.</p>
<p>Freely available for non-commercial use.</p>
...
dbCID: Database of Cancer Driver InDels

The database of Cancer driver InDels (dbCID) is a highly curated database of driver indels (insertions and deletions) that...

<p><strong>&nbsp;dbCID: Database of Cancer Driver InDels</strong></p> <p>While recent advances in next generation sequencing technologies have enabled the creation of a multitude of databases in cancer genomics, there is not yet a comprehensive database focusing on the annotation of driver indels. Therefore, we created dbCID, a collection of known indels that are likely to be engaged in cancer development, progression or therapy. It currently contains experimentally supported and putative driver indels derived from manual curation of literature. For each indel, we have curated the position information (genomic, coding DNA, and protein levels), specific diseases, drug sensitivity information (partial) as well as evidence sentences. Evidence information is classified using the levels and rules of Evidence System. The database can be used to improve the training of prediction algorithms and evaluate the methods for predicting the effects of variations.</p> <p><strong>&nbsp;Rules for InDel Entry</strong></p> <p>1. Induced development, recurrence or metastasis of cancer.<br /> 2. Associated with increased sensitivity or resistance to a drug.<br /> 3. Significantly changed the function of the gene product.<br /> 4. Had a higher recurrence frequency in cancer patients compared to healthy controls.<br /> 5. Located in an important region in gene or protein, such as a binding or catalytic site.</p> <p>Information from http://bioinfo.ahu.edu.cn:8080/dbCID/About.jsp</p>
<p>The database of Cancer driver InDels (dbCID) is a highly curated database of driver indels (insertions and deletions) that are likely to engage in cancer development, progression, or therapy.</p>
...
dbSNP Common

 Selection of SNPs with a minor allele frequency of 1% or greater is an attempt to identify variants that appear to b...

<p><strong>&nbsp;dbSNP Common</strong></p> <p>Contains information about a subset of the single nucleotide polymorphisms and small insertions and deletions (indels) &mdash; collectively Simple Nucleotide Polymorphisms &mdash; from [dbSNP](https://www.ncbi.nlm.nih.gov/snp/) build 151, available from ftp.ncbi.nlm.nih.gov/snp. Only SNPs that have a minor allele frequency (MAF) of at least 1% and are mapped to a single location in the reference genome assembly are included in this subset. Frequency data are not available for all SNPs, so this subset is incomplete. Allele counts from all submissions that include frequency data are combined when determining MAF, so for example the allele counts from the 1000 Genomes Project and an independent submitter may be combined for the same variant.</p> <p>dbSNP provides download files in the Variant Call Format (VCF) that include a &quot;COMMON&quot; flag in the INFO column. That flag is determined by a different method, and is generally a superset of the UCSC Common set. dbSNP uses frequency data from the 1000 Genomes Project only, and considers a variant COMMON if it has a MAF of at least 0.01 in any of the five super-populations:</p> <p>* African (AFR)<br /> * Admixed American (AMR)<br /> * East Asian (EAS)<br /> * European (EUR)<br /> * South Asian (SAS)</p> <p>In build 151, dbSNP marks approximately 38M variants as COMMON; 23M of those have a global MAF &lt; 0.01. The remainder should be in agreement with UCSC&#39;s Common subset.</p> <p>The selection of SNPs with a minor allele frequency of 1% or greater is an attempt to identify variants that appear to be reasonably common in the general population. Taken as a set, common variants should be less likely to be associated with severe genetic diseases due to the effects of natural selection, following the view that deleterious variants are not likely to become common in the population. 
However, the significance of any particular variant should be interpreted only by a trained medical geneticist using all available information.</p> <p>Information from https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=784649537_AtjYqLFz0CTNkRh8qWf8vOHQpNXp&amp;c=chr1&amp;g=snp151Common</p>
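<p>The difference between the two definitions of "common" described above can be made concrete with a small sketch. The function names, argument names, and the example frequencies below are hypothetical and chosen only for illustration; they do not correspond to fields in dbSNP, the UCSC track, or OpenCRAVAT.</p>

```python
SUPER_POPULATIONS = ("AFR", "AMR", "EAS", "EUR", "SAS")


def dbsnp_common_flag(maf_by_population, threshold=0.01):
    """dbSNP-style rule: common if MAF >= threshold in ANY super-population."""
    return any(
        maf_by_population.get(pop, 0.0) >= threshold for pop in SUPER_POPULATIONS
    )


def ucsc_common(global_maf, n_mappings, threshold=0.01):
    """UCSC-style rule: global MAF >= threshold and a single genomic mapping."""
    if global_maf is None:  # frequency data are not available for all SNPs
        return False
    return global_maf >= threshold and n_mappings == 1


# A variant that is frequent in one super-population but rare overall.
variant_mafs = {"AFR": 0.03, "AMR": 0.002, "EAS": 0.0, "EUR": 0.001, "SAS": 0.0}
flagged_by_dbsnp = dbsnp_common_flag(variant_mafs)
in_ucsc_subset = ucsc_common(global_maf=0.006, n_mappings=1)
```

<p>A variant like this one is flagged by the per-population rule but left out by the global rule, which is consistent with the COMMON flag generally being a superset of the UCSC Common set.</p>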
<p>&nbsp;Selection of SNPs with a minor allele frequency of 1% or greater is an attempt to identify variants that appear to be reasonably common in the general population.</p>
...
DGIdb: The Drug Interaction Database

The goal of DGIdb is to help you annotate your genes of interest with respect to known drug-gene interactions and potentia...

<p><strong>&#39;DGIdb: The Drug Interaction Database&#39;</strong><br /> The goal of DGIdb is to help you annotate your genes of interest with respect to known drug-gene interactions and potential druggability.<br /> <strong>About</strong><br /> The druggable genome can be defined as the genes or gene products that are known or predicted to interact with drugs, ideally with a therapeutic benefit to the patient. Such genes are of particular interest to large-scale cancer profiling efforts such as TCGA, ICGC and others that identify lists of potential cancer driver genes from high-throughput sequence and other genome-wide data. In cancer therapy, the increasing number of targeted drugs--those designed to inactivate proteins carrying activating amino acid changes as determined by mutational analyses--make more compelling the need for a searchable database of drug-gene interactions. A similar paradigm exists in the research of other human diseases. Thus, a commonly asked question in such projects is whether potential driver genes are targeted by any known drugs or belong to any putatively druggable gene categories. Along these lines, recent high profile cancer marker papers have presented &ldquo;druggable gene&rdquo; analyses. These analyses attempt to prioritize genes for further study, functional experiments, and ultimately to help guide the design of clinical trials. Unfortunately, there remains a large knowledge gap between clinical domain experts and genomic researchers. The former are intimately familiar with the disease-specific pathways and targeted therapies being used in the field. However, the latter possess the technical expertise to detect the known and potentially novel driver events hidden in the molecular data of disease samples under study. 
There is a critical need for tools that bridge this gap to help both basic and clinical researchers to prioritize and interpret the results of genome-wide studies in the context of gene function, clinical phenotypes, treatment decisions and patient outcomes.</p> <p><strong>Gene Categories</strong><br /> Gene categories in DGIdb refer to a set of genes belonging to a group that is deemed to be potentially druggable. For example, kinases are generally deemed to have high potential value for development of targeted drugs. For more details on the sources of druggable gene category definitions, refer to the sources and background reading.</p> <p><strong>&nbsp;Interaction Type</strong><br /> An interaction type describes the nature of the association between a particular gene and drug. For example, TTD reports the drug-gene interaction, SUNITINIB-FLT3. The interaction type is reported as &#39;inhibitor&#39;. Interaction type, as used in DGIdb, is very broad. Dozens of interaction types are currently defined. Many interaction types describe the mechanism of action between a small molecule and a protein. However, other broader types of &#39;interaction&#39; might also be used, e.g., Gene X mediates &#39;resistance&#39; or &#39;sensitivity&#39; to drug Y. A full list of interaction types and their definitions can be found on the [Interaction Types &amp; Directionalities page](https://dgidb.org/interaction_types).<br /> Interaction types can be loosely grouped into Activating interaction types and Inhibiting interaction types. Activating interactions are those where the drug increases the biological activity or expression of a gene target while Inhibiting interactions are those where the drug decreases the biological activity or expression of a gene target. 
To see which interaction types fall under which directionality check out the list of interaction types on the Interaction Types &amp; Directionalities page.</p> <p><strong>&nbsp;Interaction Score</strong><br /> Please see the [Interaction Score &amp; Query Score](https://dgidb.org/score) page for a detailed explanation of these terms.</p> <p><strong>How is DGIdb Different</strong><br /> There are many differences. DGIdb is limited to human genes only. DrugBank and TTD are databases that catalogue drugs and store detailed information about those drugs and the genes they target. DGIdb aggregates many such databases into a common framework. DGIdb adds considerable functionality for efficiently searching a list of input genes against these sources. DGIdb integrates both known drug-gene interactions and potentially druggable gene data. DGIdb allows the user to refine their query to certain gene families, types of interactions, etc. DGIdb is open source and available in a format that would allow you to create your own instance. DGIdb incorporates sources of drug-gene interaction data that were previously only available in inaccessible formats (e.g. tables in a PDF document). DGIdb is meant to be used in combination with the original sources of raw data. Wherever possible we link out to those original sources.</p> <p>Information from https://dgidb.org/f</p>
<p>The goal of DGIdb is to help you annotate your genes of interest with respect to known drug-gene interactions and potential druggability.</p>
...
DIDA: Digenic Diseases Database

DIDA is a novel database that provides for the first time detailed information on genes and associated genetic varian...

<p>&nbsp;<strong>DIDA</strong>: Digenic Diseases Database<br /> DIDA is a novel database that provides for the first time detailed information on genes and associated genetic variants involved in digenic diseases, the simplest form of oligogenic inheritance.</p> <p>The basis of DIDA is a digenic combination, found in a patient who is affected by a digenic disease. Each digenic combination has a unique ID (&ldquo;dd000&rdquo;) and is composed of two, three or four variants present in two genes that are both linked to a digenic disease.</p> <p><strong>&nbsp;Disease Name (ORPHANET)</strong><br /> [Orphanet](https://www.orpha.net/consor/cgi-bin/index.php) is the reference portal for information on rare diseases and orphan drugs, for all audiences. Orphanet&rsquo;s aim is to help improve the diagnosis, care and treatment of patients with rare diseases. This column represents the name of the disease as present in Orphanet.</p> <p><strong>Oligogenic effect&nbsp;</strong><br /> The majority of instances in DIDA are categorised into one of two simplified classes: either the digenic combination provides data on two variants in two genes that are both mandatory for the appearance of the disease, or a variant in one gene is enough to develop the disease but carrying a second one on another gene impacts the disease phenotype or affects the severity or age of onset. These two classes are a coarse-grained simplification of the original definition provided by Schaffer. The first class represents true digenic instances (labelled as &ldquo;on/off&rdquo; in the previous version of DIDA): mutations at both loci are required for disease, and mutations at only one of the two loci result in no phenotype. 
The second class we will refer to as the composite class as it includes different possibilities (labelled as &ldquo;severity&rdquo; in the previous version of DIDA): a composite instance in DIDA could refer to a dual molecular diagnosis, wherein mutations at each locus may segregate independently and result in expression of part/all of the phenotype, or an oligogenic mutational burden, when a driver mutation is necessary for the phenotype but rare variants in other genes, usually related to the same pathway/organ system, may modify the phenotype. Throughout this paper the true digenic class will be annotated by TD and the composite class by CO. Further fine-tuning of these classes will become possible when more digenic disease data become available. Yet for now we can limit ourselves to the current constraint, exploring the reason why a certain digenic combination belongs to the TD or CO class.</p> <p><strong>&nbsp;Gene Relationship</strong><br /> As already described in the literature, digenic diseases are caused by mutations in two genes which often have a physical or functional relationship (1,2). For each digenic combination in DIDA we determined the relationship between the two genes carrying the mutations. There are 5 different types of relationship:<br /> 1. **Direct interaction:** there is a direct protein-protein interaction between the protein products of the two genes. This information was retrieved from protein-protein-interaction databases (BioGrid, IntAct and ConsensusPathDb).<br /> 2. **Indirect interaction:** there is an indirect or &ldquo;two step&rdquo; interaction between the protein products of the two genes. If protein &ldquo;A&rdquo; and protein &ldquo;B&rdquo; interact with protein &ldquo;C&rdquo;, protein A and protein B are indirectly interacting. In other words, they share a common interactor. This information was retrieved from protein-protein-interaction databases (BioGrid, IntAct and ConsensusPathDb).<br /> 3. 
**Pathway membership:** the protein products from both genes belong to the same pathway. This information was retrieved from pathway databases (KEGG and REACTOME).<br /> 4. **Co-expression:** the protein products from both genes are expressed in at least one common tissue or organ. This information was retrieved from GNF/Atlas.<br /> 5. **Similar function:** the protein products from both genes contain the same functional conserved motifs or conserved domains. This information was retrieved from protein domain databases (InterPro and Pfam).</p> <p><strong>Familial Evidence</strong><br /> When reading the original publication, in which the digenic combination was reported, we checked if the digenicity was supported by a familial study. In this study family members are genetically tested to determine their variant carrier status. Two values are possible: 1) YES when a family study provided evidence for digenicity or 2) NO when there was no family study conducted or the study was inconclusive.</p> <p><strong>&nbsp;Functional Evidence</strong><br /> When reading the original publication, in which the digenic combination was reported, we checked if the digenicity was supported by a functional study. In this study the combined functional effect of the two variants was tested. Two values are possible: 1) YES when a functional study provided evidence for digenicity or 2) NO when there was no functional study conducted or the study was inconclusive.</p> <p><strong>&nbsp;Biological Distance</strong><br /> The Human Gene Connectome (HGC) is the set of all biologically plausible routes, distances, and degrees of separation between all pairs of human genes. A gene-specific connectome contains the set of all available human genes sorted on the basis of their predicted biological proximity to the specific gene of interest.</p> <p>Information from http://dida.ibsquare.be/documentation/<br /> &nbsp;</p>
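<p>The distinction between direct and indirect interaction in the Gene Relationship list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list, the gene names and the function names <code>build_neighbors</code> and <code>classify_interaction</code> are hypothetical and are not part of DIDA or of the interaction databases it draws on.</p>

```python
from collections import defaultdict


def build_neighbors(edges):
    """Turn a list of (protein_a, protein_b) pairs into an undirected adjacency map."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def classify_interaction(gene_a, gene_b, neighbors):
    """Return 'direct', 'indirect' or 'none' for a pair of gene products.

    'direct'   : the two products interact with each other.
    'indirect' : they do not interact directly but share a common interactor.
    """
    partners_a = neighbors.get(gene_a, set())
    partners_b = neighbors.get(gene_b, set())
    if gene_b in partners_a:
        return "direct"
    shared = (partners_a & partners_b) - {gene_a, gene_b}
    if shared:
        return "indirect"
    return "none"


# Hypothetical toy network: A and B both interact with C; A also interacts with D.
toy_edges = [("A", "C"), ("B", "C"), ("A", "D")]
toy_neighbors = build_neighbors(toy_edges)
```

<p>With this toy network, A and B are classified as indirectly interacting because they share the common interactor C, which mirrors the &ldquo;two step&rdquo; definition given above.</p>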
<p>DIDA is a novel database that provides for the first time detailed information on genes&nbsp;and associated genetic variants involved in digenic diseases, the simplest form of oligogenic inheritance.</p>
<p>Freely available for non-commercial use</p>
...
ENCODE TFBS

Human transcription factor binding sites based on ChIP-seq experiments generated by production groups in the ENCODE Consor...

<p><strong>Encode Transcription Factor Binding Sites</strong></p> <p><strong>&nbsp;Description</strong></p> <p>This annotator represents a comprehensive set of human transcription factor binding sites based on ChIP-seq experiments generated by production groups in the ENCODE Consortium from the inception of the project in September 2007, through the March 2012 internal data freeze. The annotator represents peak calls (regions of enrichment) that were generated by the ENCODE Analysis Working Group (AWG) based on a uniform processing pipeline developed for the ENCODE Integrative Analysis effort and published in a set of coordinated papers in September 2012. Peak calls from that effort (based on datasets from the January 2011 ENCODE data freeze) are available at the ENCODE Analysis Data Hub. This annotator is an update that includes newer data, and slightly modified methods for the peak calling.</p> <p>This annotator contains 690 ChIP-seq datasets representing 161 unique regulatory factors (generic and sequence-specific factors). The datasets span 91 human cell types and some are in various treatment conditions. These datasets were generated by the five ENCODE TFBS ChIP-seq production groups: Broad, Stanford/Yale/UC-Davis/Harvard, HudsonAlpha Institute, University of Texas-Austin and University of Washington, and University of Chicago. The University of Chicago ChIP-seq experiments were performed with an alternative epitope-tagged ChIP-seq methodology.</p> <p><strong>Methods</strong></p> <p>All ChIP-seq experiments were performed at least in duplicate, and were scored against an appropriate control designated by the production groups (either input DNA or DNA obtained from a control immunoprecipitation).</p> <p><strong>Short Read Mapping</strong></p> <p>For each dataset, mapped reads in the form of BAM files were downloaded from the ENCODE UCSC DCC. 
These BAM files were generated by the ENCODE data production labs (using different mappers and mapping parameters), but all used a standardized version of the GRCh37 (hg19) reference human genome sequence with the following modifications:</p> <p>* Mitochondrial sequence was included.<br /> * Alternate sequences were excluded.<br /> * Random contigs were excluded.<br /> * The female version of the genome was represented by the autosomes and chrX, whereas the male genome was represented by the autosomes, chrX, and chrY with the PAR regions masked.</p> <p>In order to standardize the mapping protocol, custom unique-mappability tracks were used to only retain unique mapping reads, i.e. reads that map to exactly one location in the genome. Positional and PCR duplicates were also filtered out.</p> <p><strong>Quality Control</strong></p> <p>A number of quality metrics for individual replicates listed on the ENCODE portal Quality Metrics page, including measures of library complexity and signal enrichment, were calculated and are available for review (Landt et al., 2012; Kundaje et al., 2013a). The Integrated Quality Flag from this quality assessment was used to assign the quality metadata term for each dataset (e.g., Good vs. Caution). Datasets that did not pass the minimum quality control thresholds are not included in this track.</p> <p><strong>Peak Calling</strong></p> <p>Since every ENCODE dataset is represented by at least two biological replicate experiments, a novel measure of consistency and reproducibility of peak calling results between replicates, known as the Irreproducible Discovery Rate (IDR), was used to determine an optimal number of reproducible peaks (Li et al., 2011; Kundaje et al., 2013b). Code and detailed step-by-step instructions to call peaks using the IDR method are available. 
In brief, the SPP peak caller (Kharchenko et al., 2008) was used with a relaxed peak calling threshold (FDR = 0.9) to obtain a large number of peaks (maximum of 300K) that span true signal as well as noise (false identifications). The IDR method analyzes a pair of replicates, and considers peaks that are present in both replicates to belong to one of two populations : a reproducible signal group or an irreproducible noise group. Peaks from the reproducible group are expected to show relatively higher ranks (ranked based on signal scores) and stronger rank-consistency across the replicates, relative to peaks in the irreproducible groups. Based on these assumptions, a two-component probabilistic copula-mixture model is used to fit the bivariate peak rank distributions from the pairs of replicates. The method adaptively learns the degree of peak-rank consistency in the signal component and the proportion of peaks belonging to each component. The model can then be used to infer an IDR score for every peak that is found in both replicates. The IDR score of a peak represents the expected probability that the peak belongs to the noise component, and is based on its ranks in the two replicates. Hence, low IDR scores represent high-confidence peaks. An IDR score threshold of 0.02 (2%) was used to obtain an optimal peak rank threshold on the replicate peak sets (cross-replicate threshold). If a dataset had more than two replicates, all pairs of replicates were analyzed using the IDR method. The maximum peak rank threshold across all pairwise analyses was used as the final cross-replicate peak rank threshold. Reads from replicate datasets were then pooled and SPP was once again used to call peaks on the pooled data with a relaxed FDR of 0.9. Pooled-data peaks were once again ranked by signal-score. 
The cross-replicate rank threshold learned from the replicates was used to threshold the ranked set of pooled-data peaks.</p> <p>Any thresholds based on reproducibility of peak calling between biological replicates are bounded by the quality and enrichment of the worst replicate. Valuable signal is lost in cases for which a dataset has one replicate that is significantly worse in data quality than another replicate. A rescue pipeline was used for such cases in order to balance data quality between a set of replicates. Mapped reads were pooled across all replicates of a dataset, and then randomly sampled (without replacement) to generate two pseudo-replicates with equal numbers of reads. This sampling strategy tends to transfer signal from stronger replicates to the weaker replicates, thereby balancing cross-replicate data quality and sequencing depth. These pseudo-replicates were then processed using the IDR method in order to learn a rescue threshold. For datasets with comparable replicates (based on independent measures of data quality), the rescue threshold and cross-replicate thresholds were found to be very similar. However, for datasets with replicates of differing data quality, the rescue thresholds were often higher than the cross-replicate thresholds, and were able to capture true peaks that showed statistically significant and visually compelling ChIP-seq signal in one replicate but not in the other. Ultimately, for each dataset, the best of the cross-replicate and rescue thresholds were used to obtain a final consolidated optimal set of peaks.</p> <p>All peak sets were then screened against a specially curated empirical blacklist of regions in the human genome (wgEncodeDacMapabilityConsensusExcludable.bed.gz) and peaks overlapping the blacklisted regions were discarded (Kundaje et al., 2013b). 
Briefly, these artifact regions typically show the following characteristics:</p> <p>* Unstructured and extreme artifactual high signal in sequenced input-DNA and control datasets, as well as open chromatin datasets irrespective of cell type identity.<br /> * An extreme ratio of multi-mapping to unique mapping reads from sequencing experiments.<br /> * Overlap with pathological repeat regions such as centromeric, telomeric and satellite repeats that often have few unique mappable locations interspersed in repeats.</p> <p><strong>&nbsp;Credits</strong></p> <p>The processed data for this track were generated by Anshul Kundaje on behalf of the ENCODE Analysis Working Group. Credits for the primary data underlying this track are included in track description pages listed in the Description section above.&nbsp;</p> <p><strong>&nbsp;References</strong></p> <p>ENCODE Project Consortium. A user&#39;s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 2011 Apr;9(4):e1001046. PMID: 21526222; PMCID: PMC3079585</p> <p>ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012 Sep 6;489(7414):57-74. PMID: 22955616; PMCID: PMC3439153</p> <p>Kharchenko PV, Tolstorukov MY, Park PJ. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008 Dec;26(12):1351-9. PMID: 19029915; PMCID: PMC2597701</p> <p>Kundaje A, Jung L, Kharchenko PV, Sidow A, Batzoglou S, Park PJ. Assessment of ChIP-seq data quality using strand cross-correlation analysis. (submitted), 2012a.</p> <p>Kundaje A, Li Q, Brown JB, Rozowsky J, Harmanci A, Wilder SP, Batzoglou S, Dunham I, Gerstein M, Birney E, et al. Reproducibility measures for automatic threshold selection and quality control in ChIP-seq datasets. (submitted), 2012b.</p> <p>Li QH, Brown JB, Huang HY, Bickel PJ. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 2011; 5(3):1752-1779.</p>
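<p>The pseudo-replicate step of the rescue pipeline described under Peak Calling (pool the mapped reads of all replicates, then sample without replacement into two equally sized sets) can be illustrated with a minimal Python sketch. It is not the ENCODE pipeline code: reads are represented as plain list items, and the function name <code>make_pseudo_replicates</code> and the <code>seed</code> parameter are assumptions made for this example.</p>

```python
import random


def make_pseudo_replicates(replicates, seed=0):
    """Pool reads from all replicates and split them into two pseudo-replicates.

    Shuffling the pooled reads and cutting the list in half is equivalent to
    random sampling without replacement. If the pooled count is odd, one read
    is left out so that both pseudo-replicates have the same size.
    """
    pooled = [read for replicate in replicates for read in replicate]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    rng.shuffle(pooled)
    half = len(pooled) // 2
    return pooled[:half], pooled[half:2 * half]


# Hypothetical example: a deep replicate and a shallow one.
strong_replicate = [f"strong_read_{i}" for i in range(6)]
weak_replicate = [f"weak_read_{i}" for i in range(4)]
pseudo_1, pseudo_2 = make_pseudo_replicates([strong_replicate, weak_replicate])
```

<p>Because both pseudo-replicates are drawn from the same pool, each is expected to contain reads from the stronger and the weaker replicate, which is how the sampling balances data quality and sequencing depth before the IDR analysis is repeated.</p>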
<p>Human transcription factor binding sites based on ChIP-seq experiments generated by production groups in the ENCODE Consortium.</p>
...
Ensembl Regulatory Build

An up-to-date and comprehensive summary of regulatory features across the genome, as well as popular curated external reso...

<p><strong>Defining the Regulatory Build</strong></p> <p>We first determine a cell type independent functional annotation of the genome, referred to as the Regulatory Build, which summarises the function of genomic regions, known as regulatory features.</p> <p>To determine whether a state is useful in practice, it is compared to the overall density of transcription factor binding, as measured by the TF ChIP-seq datasets included in the Ensembl Regulation resources. Applying increasing integer cutoffs to this signal, we define progressively smaller regions. If these regions reach a two-fold enrichment in transcription factor binding signal, then the state is retained for the build. This means that although all states are annotated, not all are used to build the Regulatory Build.</p> <p>For any given segmentation, we define initial regions. For every functional label, all the state summaries that were assigned that label and judged informative are summed into a single function. Using the overall TF binding signal as true signal, we select the threshold which produces the highest F-score.</p> <p>We then merge the regulatory features across segmentations by annotation.</p> <p>Some simplifications are applied a posteriori:<br /> * Distal enhancers which overlap promoter flanking regions are merged into the latter.<br /> * Promoter flanking regions which overlap transcription start sites are incorporated into the flanking regions of the latter features.</p> <p><strong>Regulatory Features</strong></p> <p>Regions that are predicted to regulate gene expression are called Regulatory features in Ensembl. 
The different types of regulatory features annotated include:</p> <p>* Promoters (regions at the 5&#39; end of genes where transcription factors and RNA polymerase bind to initiate transcription)<br /> * Promoter flanking regions (transcription factor binding regions that flank the above)<br /> * Enhancers (regions that bind transcription factors and interact with promoters to stimulate transcription of distant genes)<br /> * CTCF binding sites (regions that bind CTCF, the insulator protein that demarcates open and closed chromatin)<br /> * Transcription factor binding sites (sites which bind transcription factors, for which no other role can be determined as yet)<br /> * Open chromatin regions (regions of spaced out histones, making them accessible to protein interactions)</p> <p><strong>Ensembl ID</strong></p> <p>An Ensembl stable ID consists of five parts: ENS(species)(object type)(identifier).(version).</p> <p>* The first part, &#39;ENS&#39;, tells you that it&#39;s an Ensembl ID<br /> * The second part is a three-letter species code. For human, there is no species code so IDs are in the form ENS(object type)(identifier).(version). A list of the other species codes can be found here.<br /> * The third part is a one- or two-letter object type. For example E for exon, FM for protein family, G for gene, GT for gene tree, P for protein, R for regulatory feature and T for transcript.<br /> * The identifier is the number to that object. Combinations of prefixes and identifiers are unique.<br /> * Versions indicate how many times that model has changed during its time in Ensembl. This document explains how we determine that a model has changed sufficiently to update version number. History pages for features show you when these changes took place.<br /> * Using this information we can make assertions about an Ensembl ID. For example ENSMUSG00000017167.6. 
From this we can see that it&#39;s an Ensembl ID (ENS), from mouse (MUS), it&#39;s a gene (G) and it&#39;s on its sixth version (.6).</p> <p>Information from http://www.ensembl.org/info/genome/funcgen/regulatory_build.html</p>
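<p>The stable ID layout described above, ENS(species)(object type)(identifier).(version), can be taken apart with a regular expression. The following Python sketch is an illustration only: it assumes exactly three-letter species codes and the object types listed in this section, and the names <code>ENSEMBL_ID</code>, <code>OBJECT_TYPES</code> and <code>parse_ensembl_id</code> are hypothetical, not part of any Ensembl API.</p>

```python
import re

# ENS, optional three-letter species code, object type, numeric identifier,
# optional ".version". Two-letter object types are listed first so that
# "GT" and "FM" are not mistaken for "G" followed by other characters.
ENSEMBL_ID = re.compile(r"^ENS([A-Z]{3})?(FM|GT|E|G|P|R|T)(\d+)(?:\.(\d+))?$")

# Object types as listed in the description above.
OBJECT_TYPES = {
    "E": "exon",
    "FM": "protein family",
    "G": "gene",
    "GT": "gene tree",
    "P": "protein",
    "R": "regulatory feature",
    "T": "transcript",
}


def parse_ensembl_id(stable_id):
    """Split an Ensembl stable ID into its parts.

    species_code is None for human IDs, which carry no species code.
    version is None when the ID is given without a ".version" suffix.
    """
    match = ENSEMBL_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not a recognised Ensembl stable ID: {stable_id!r}")
    species, object_type, identifier, version = match.groups()
    return {
        "species_code": species,
        "object_type": OBJECT_TYPES[object_type],
        "identifier": identifier,
        "version": int(version) if version is not None else None,
    }


mouse_gene = parse_ensembl_id("ENSMUSG00000017167.6")
```

<p>For the example ID from the text, <code>mouse_gene</code> holds the species code MUS, the object type gene, the identifier 00000017167 and version 6.</p>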
<p>An up-to-date and comprehensive summary of regulatory features across the genome, as well as popular curated external resources.</p>
...
ExAC Gene and CNV

ExAC Functional Gene Constraint & CNV Scores provides probability of LoF tolerance/intolerance

...
<p><strong>ExAC Functional Gene Constraint &amp; CNV Scores</strong>: probability of LoF tolerance/intolerance<br /> The ExAC database provided in this module contains the probability of a gene being loss-of-function (LoF) intolerant of both heterozygous &amp; homozygous LoF variants as well as intolerant of only homozygous variants. Also provided is the probability of being tolerant of both heterozygous &amp; homozygous LoF variants. Tolerance/Intolerance probabilities are also separated into additional subsets of nonTCGA and nonpsych. Z scores for the deviation of observation from expectation are also included. Higher Z scores indicate the transcript is more intolerant of variation (more constrained).</p> <p>NOTE: Data provided by [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) version 3.5a</p> <p>1. pLI: The probability of being LoF intolerant of both homozygous &amp; heterozygous LoF variants.<br /> 2. pLI (Hom): The probability of being LoF intolerant of homozygous, but not heterozygous LoF variants.<br /> 3. pLT: The probability of being tolerant of both homozygous &amp; heterozygous LoF variants.<br /> 4. pLI NonTCGA: The probability of being LoF intolerant of both homozygous &amp; heterozygous LoF variants on the nonTCGA subset.<br /> 5. pLI (Hom) NonTCGA: The probability of being LoF intolerant of homozygous, but not heterozygous LoF variants on the nonTCGA subset.<br /> 6. pLT NonTCGA: The probability of being tolerant of both homozygous &amp; heterozygous LoF variants on the nonTCGA subset.<br /> 7. pLI Nonpsych: The probability of being LoF intolerant of both homozygous &amp; heterozygous LoF variants on the nonpsych subset.<br /> 8. pLI (Hom) Nonpsych: The probability of being LoF intolerant of homozygous, but not heterozygous LoF variants on the nonpsych subset.<br /> 9. pLT Nonpsych: The probability of being tolerant of both homozygous &amp; heterozygous LoF variants on the nonpsych subset.<br /> 10. 
Del Intol Z-Score: Winsorized deletion intolerance z-score based on CNV data.<br /> 11. Dup Intol Z-Score: Winsorized duplication intolerance z-score based on CNV data.<br /> 12. CNV Intol Z-Score: Winsorized CNV intolerance z-score based on CNV data.<br /> 13. CNV Bias/Noise: (Y)es or (N)o depending on if the gene is in a known region of recurrent CNVs mediated by tandem segmental duplications and intolerance scores are more likely to be biased or noisy.<br /> &nbsp;</p>
<p>ExAC Functional Gene Constraint &amp; CNV Scores provides probability of LoF tolerance/intolerance</p>
...
FATHMM

Functional analysis through hidden markov models.

...
<p><strong>FATHMM</strong>: functional analysis through hidden Markov models<br /> FATHMM is capable of predicting the functional effects of protein missense mutations by combining sequence conservation within hidden Markov models (HMMs), representing the alignment of homologous sequences and conserved protein domains, with &quot;pathogenicity weights&quot;, representing the overall tolerance of the protein/domain to mutations.&nbsp;</p> <p>NOTE: Data provided by [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) version 3.5a</p> <p>1. Transcript ID: Ensembl transcript ID, multiple entries separated by &quot;;&quot;<br /> 2. Protein ID: Ensembl protein ID, multiple entries separated by &quot;;&quot; corresponding to Transcript IDs<br /> 3. Score: FATHMM default score (weighted for human inherited-disease mutations with Disease Ontology) (FATHMMori). Scores range from -16.13 to 10.64. The smaller the score, the more likely the SNP has a damaging effect. Multiple scores separated by &quot;;&quot;, corresponding to Protein ID.<br /> 4. Converted Rank Score: FATHMMori scores were first converted to FATHMMnew=1-(FATHMMori+16.13)/26.77, then ranked among all FATHMMnew scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of FATHMMnew scores in dbNSFP. If there are multiple scores, only the most damaging (largest) rankscore is presented. The scores range from 0 to 1.<br /> 5. Prediction: If a FATHMMori score is &lt;=-1.5 (or rankscore &gt;=0.81332) the corresponding nsSNV is predicted as &quot;D(AMAGING)&quot;; otherwise it is predicted as &quot;T(OLERATED)&quot;. Multiple predictions separated by &quot;;&quot;, corresponding to Protein ID.<br /> &nbsp;</p>
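<p>The score conversion and prediction cutoff listed above can be written out as a short Python sketch. It covers only the FATHMMnew rescaling and the D/T call; the final rank score additionally depends on the full distribution of scores in dbNSFP, which is not reproduced here. The function names, and the treatment of &quot;.&quot; as a missing value when splitting a multi-score field, are assumptions made for this example.</p>

```python
FATHMM_MIN = -16.13       # smallest FATHMMori score, as stated above
FATHMM_RANGE = 26.77      # 10.64 - (-16.13)
DAMAGING_CUTOFF = -1.5    # FATHMMori <= -1.5 is predicted damaging


def to_fathmm_new(fathmm_ori):
    """Rescale FATHMMori so that more damaging scores map closer to 1."""
    return 1 - (fathmm_ori - FATHMM_MIN) / FATHMM_RANGE


def predict(fathmm_ori):
    """Return 'D' (damaging) or 'T' (tolerated) from a FATHMMori score."""
    return "D" if fathmm_ori <= DAMAGING_CUTOFF else "T"


def parse_scores(field):
    """Split a ';'-separated score field into floats.

    Assumption: empty entries and '.' mark missing values and are skipped.
    """
    return [float(item) for item in field.split(";") if item not in ("", ".")]


# Hypothetical field holding scores for two proteins.
scores = parse_scores("-2.1;0.3")
calls = [predict(score) for score in scores]
```

<p>Because the rescaling subtracts from 1, the most damaging FATHMMori value (-16.13) maps to 1 and the most tolerated value (10.64) maps to approximately 0, which is why the largest rank score is reported as the most damaging.</p>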
<p>Functional analysis through hidden markov models.</p>
...
FATHMM MKL Rank Score

A database capable of predicting the effects of coding variants using nucleotide-based HMMs.

...
<p><strong>FATHMM MKL</strong></p> <p>FATHMM MKL is a database capable of predicting the effects of coding variants using nucleotide-based HMMs. Our method utilizes various genomic annotations, which have recently become available, and learns to weight the significance of each component annotation source. &nbsp;We used 10 feature groups, denoted [A&ndash;J], which could be predictive of disease association and are therefore used to annotate our datasets using a customized pipeline. These feature groups can all indicate whether an SNV is functional or not, and hence we use a classifier based on multiple kernel learning (MKL). In MKL, different types of input data are encoded into kernel matrices, which quantify the similarity of data objects.</p> <p><strong>Prediction Interpretation</strong></p> <p>Predictions are given as p-values in the range [0, 1]: values above 0.5 are predicted to be deleterious, while those below 0.5 are predicted to be neutral or benign. P-values close to the extremes (0 or 1) are the highest-confidence predictions that yield the highest accuracy.</p> <p>Feature groups (letters A-J) are described in the Supplementary detail of the [main paper](https://academic.oup.com/bioinformatics/article/31/10/1536/177080), and summarised in section 2.1 of the paper.</p> <p>We use distinct predictors for positions in coding regions (positions within coding-sequence exons) and in non-coding regions (positions in intergenic regions, introns or non-coding genes). The coding predictor is based on 10 groups of features, labeled A-J; the non-coding predictor uses a subset of 4 of these feature groups, A-D (see our related publication for details on the groups and their sources).</p> <p>Annotations are not yet available in all feature groups for all genomic positions. To produce a p-value for these positions, we adjust our weights relative to the features that are available. 
For example, if our weights for A-D were 0.5, 0.1, 0.1 and 0.3, respectively, and there were no annotations for group A, then the missing weight would be distributed proportionally across the remaining weights, which would become 0.2, 0.2 and 0.6. This allows us to make predictions for any combination of feature groups while yielding p-values in the [0,1] range.</p> <p>Note that predictions based only on a subset of features may not be as accurate as those based on complete feature sets. In particular, predictions that are missing the conservation score features (groups A and E) will tend to be less accurate than other predictions. To aid in interpreting these predictions, we provide a list of the feature groups that contributed to each prediction.</p> <p><strong>Feature Groups</strong></p> <p>We used 10 feature groups, denoted [A&ndash;J], which could be predictive of disease association and are therefore used to annotate our datasets using a customized pipeline. They are described as follows:</p> <p>**A. 
46-Way Sequence Conservation:** based on multiple sequence alignment scores, at the nucleotide level, of 46 vertebrate genomes compared with the human genome.</p> <p>**B. Histone Modifications (ChIP-Seq):** based on ChIP-Seq peak calls for histone modifications.</p> <p>**C. Transcription Factor Binding Sites (TFBS PeakSeq):** based on PeakSeq peak calls for various transcription factors.</p> <p>**D. Open Chromatin (DNase-Seq):** based on DNase-Seq peak calls.</p> <p>**E. 100-Way Sequence Conservation:** based on multiple sequence alignment scores, at the nucleotide level, of 100 vertebrate genomes compared with the human genome.</p> <p>**F. GC Content:** based on a single measure for GC content calculated using a span of five nucleotide bases from the UCSC Genome Browser.</p> <p>**G. Open Chromatin (FAIRE):** based on formaldehyde-assisted isolation of regulatory elements (FAIRE) peak calls.</p> <p>**H. Transcription Factor Binding Sites (TFBS SPP):** based on SPP peak calls for various transcription factors.</p> <p>**I. Genome Segmentation:** based on genome-segmentation states using a consensus merge of segmentations produced by the ChromHMM and Segway software.</p> <p>**J. Footprints:** based on annotations describing DNA footprints across cell types from ENCODE.</p> <p>Information from http://fathmm.biocompute.org.uk/fathmmMKL.htm#interpretation and https://academic.oup.com/bioinformatics/article/31/10/1536/177080</p>
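<p>The weight adjustment for missing feature groups described under Prediction Interpretation amounts to rescaling the available weights so that they sum to 1 again. The Python sketch below reproduces the worked example from the text (weights 0.5, 0.1, 0.1 and 0.3 for groups A-D, with group A missing). The function name <code>renormalize_weights</code> is hypothetical, and the code illustrates the arithmetic only, not the FATHMM-MKL implementation.</p>

```python
def renormalize_weights(weights, available):
    """Redistribute the weight of missing feature groups proportionally.

    weights   : dict mapping feature-group letter to its weight
    available : iterable of group letters that have annotations at a position
    Returns a dict containing only the available groups, rescaled to sum to 1.
    """
    kept = {group: weights[group] for group in available if group in weights}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no annotated feature group with positive weight")
    return {group: weight / total for group, weight in kept.items()}


# Worked example from the text: group A has no annotations at this position.
example_weights = {"A": 0.5, "B": 0.1, "C": 0.1, "D": 0.3}
adjusted = renormalize_weights(example_weights, available=["B", "C", "D"])
```

<p>Dividing each remaining weight by their sum (0.5) gives 0.2, 0.2 and 0.6 for groups B, C and D, matching the values quoted above, and keeps the combined prediction within the [0,1] range.</p>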
<p>A database capable of predicting the effects of coding variants using nucleotide-based HMMs.</p>
...
FATHMM XF

Enhanced Accuracy in Predicting the Functional Consequences of Coding Single Nucleotide Variants (SNVs).

...
<p><strong>FATHMM XF&nbsp;</strong></p> <p>FATHMM-XF (FATHMM with eXtended Features) represents a substantial improvement over our earlier predictor, FATHMM-MKL. By using an extended set of feature groups and by exploring an expanded set of possible models, the new method yields even greater accuracy than its predecessor on independent test sets. As with FATHMM-MKL, FATHMM-XF predicts whether single nucleotide variants (SNVs) in the human genome are likely to be functional or non-functional in inherited diseases. Also like its predecessor, it uses distinct models for coding and non-coding regions, to improve overall accuracy. Unlike FATHMM-MKL, FATHMM-XF models are built upon single-kernel datasets. The models may then learn interactions between data sources that help to boost their accuracy in all regions of the genome.</p> <p><strong>&nbsp;Prediction Interpretation</strong></p> <p>Predictions are given as p-values in the range [0, 1]: values above 0.5 are predicted to be deleterious, while those below 0.5 are predicted to be neutral or benign. P-values close to the extremes (0 or 1) are the highest-confidence predictions that yield the highest accuracy.</p> <p>A distinct predictor is used for positions in coding regions (positions within coding-sequence exons). The coding predictor is based on six groups of features representing sequence conservation, nucleotide sequence characteristics, genomic features (codons, splice sites, etc.), amino acid features and expression levels in different tissues.&nbsp;</p> <p>Information from http://fathmm.biocompute.org.uk/fathmm-xf/about.html</p>
<p>Enhanced Accuracy in Predicting the Functional Consequences of Coding Single Nucleotide Variants (SNVs).</p>
...
FATHMM XF Coding

Enhanced Accuracy in Predicting the Functional Consequences of Coding Single Nucleotide Variants (SNVs).

...
<p><strong>FATHMM XF&nbsp;</strong></p> <p>FATHMM-XF (FATHMM with eXtended Features) represents a substantial improvement over our earlier predictor, FATHMM-MKL. By using an extended set of feature groups and by exploring an expanded set of possible models, the new method yields even greater accuracy than its predecessor on independent test sets. As with FATHMM-MKL, FATHMM-XF predicts whether single nucleotide variants (SNVs) in the human genome are likely to be functional or non-functional in inherited diseases. Also like its predecessor, it uses distinct models for coding and non-coding regions, to improve overall accuracy. Unlike FATHMM-MKL, FATHMM-XF models are built upon single-kernel datasets. The models may then learn interactions between data sources that help to boost their accuracy in all regions of the genome.</p> <p><strong>Prediction Interpretation</strong></p> <p>Predictions are given as p-values in the range [0, 1]: values above 0.5 are predicted to be deleterious, while those below 0.5 are predicted to be neutral or benign. P-values close to the extremes (0 or 1) are the highest-confidence predictions that yield the highest accuracy.</p> <p>A distinct predictor is used for positions in coding regions (positions within coding-sequence exons). The coding predictor is based on six groups of features representing sequence conservation, nucleotide sequence characteristics, genomic features (codons, splice sites, etc.), amino acid features and expression levels in different tissues.&nbsp;</p> <p>Information from http://fathmm.biocompute.org.uk/fathmm-xf/about.html</p>
<p>Enhanced Accuracy in Predicting the Functional Consequences of Coding Single Nucleotide Variants (SNVs).</p>
...
fitCons

fitCons predicts the fraction of genomic positions belonging to a specific function class that are under selective pressur...

<p><strong>fitCons</strong></p> <p>fitCons (the fitness consequences of functional annotation) integrates functional assays (such as ChIP-Seq) with selective pressure inferred using the INSIGHT method. The result is a score &rho; in the range [0.0-1.0] that indicates the fraction of genomic positions evincing a particular pattern (or &quot;fingerprint&quot;) of functional assay results that are under selective pressure. As these scores show the selective pressure consequences of patterns of functional genomic assays, they can vary per cell type just as functional assays do. Scores combine conservative and adaptive selective pressures and may be used as a relative indicator of the potential for interesting genomic function, with higher scores indicating more potential.</p> <p><strong>Calculating The fitCons Score</strong></p> <p><strong>Covariate Selection</strong></p> <p>Four covariates (A) are obtained for each cell type: DNase I peaks, Normalized RNA-Seq Read Depth (RPM), Chromatin State (ChromHMM class) and GENCODE annotation as protein coding CDS. The last of these is common to all cell types. Each data set is then quantized into a small number of classes (B), assigning each position in the genome one class from each data set.</p> <p>* DNase I HS Peaks: 2-Narrow Peak, 1-Broad Peak (and not Narrow Peak), and 0-No Peak<br /> * RNA-Seq Read Depth: 3-High Depth, 2-Medium Depth, 1-Low Depth, and 0-No Signal<br /> * CDS Annotation: 1-Annotated as CDS, or 0-Not Annotated<br /> * Chromatin State: [1-25] ChromHMM class assigned to that position, or 26-No Class available.</p> <p>This generates a total of 3x4x2x26=624 unique &quot;fingerprints&quot;, with each genomic position evincing exactly one fingerprint in each cell type (B). 
All positions associated with a particular fingerprint are grouped together into a common functional class.</p> <p><strong>Evolutionary Genomic Data</strong></p> <p>Selective pressure for each functional class is inferred (C) from the distribution of human polymorphism and primate divergence, relative to nearby neutrally evolving loci. Human polymorphism data is drawn from position-wise variation among 54 unrelated human individuals from the [69 sequences released by Complete Genomics](https://www.completegenomics.com/public-data/69-Genomes/). Divergence data is derived from the most recent common ancestor of Human and Chimpanzee, which is inferred using Chimpanzee (panTro2), Orangutan (ponAbe2), and rhesus Macaque (rheMac2) reference genomes. Putative neutral loci are identified by removing a window around known conserved and protein coding genomic positions, as well as a number of technically undesirable genomic positions (such as unmappable regions).</p> <p><strong>Applying INSIGHT</strong></p> <p>INSIGHT was applied (C) to infer a maximum likelihood estimate of the fraction of genomic positions under selective pressure (&rho;) for each of the 624 functional classes. This calculation was performed separately for each cell type. The fitCons score associated with a functional class was then assigned to all genomic positions in that class (D). A curvature-based method was employed to identify the standard error for each estimate, and only classes with a standard error of less than 0.4 &times; &rho; were considered in the published analysis.</p> <p>Information from http://compgen.cshl.edu/fitCons/</p>
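<p>The fingerprint arithmetic (3 x 4 x 2 x 26 = 624) can be checked with a short Python sketch. The class codes follow the list above; the variable names and the mixed-radix index are illustrative assumptions, not the encoding used by the fitCons authors.</p>

```python
from itertools import product

# Class codes for each covariate, as listed in the text.
DNASE_CLASSES = (0, 1, 2)                 # no peak, broad peak, narrow peak
RNASEQ_CLASSES = (0, 1, 2, 3)             # no signal, low, medium, high depth
CDS_CLASSES = (0, 1)                      # not annotated, annotated as CDS
CHROMATIN_CLASSES = tuple(range(1, 27))   # ChromHMM 1-25, 26 = no class

# Every combination of one class per covariate is one fingerprint.
FINGERPRINTS = list(
    product(DNASE_CLASSES, RNASEQ_CLASSES, CDS_CLASSES, CHROMATIN_CLASSES)
)


def fingerprint_index(dnase: int, rnaseq: int, cds: int, chromatin: int) -> int:
    """Hypothetical mixed-radix index (0..623) of a fingerprint."""
    return ((dnase * 4 + rnaseq) * 2 + cds) * 26 + (chromatin - 1)


if __name__ == "__main__":
    print(len(FINGERPRINTS))               # number of functional classes
    print(fingerprint_index(2, 3, 1, 26))  # index of the last fingerprint
```

<p>Each genomic position maps to exactly one such tuple per cell type, and every position sharing a tuple receives the same &rho; estimate.</p>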
<p>fitCons predicts the fraction of genomic positions belonging to a specific function class that are under selective pressure.</p>
...
FunSeq2

 A flexible framework to prioritize regulatory mutations from cancer genome sequencing

...
<p><strong>FunSeq2</strong></p> <p><strong>Overview</strong></p> <p>This tool is specialized to prioritize somatic variants from cancer whole genome sequencing. It contains two components: 1) building data context from various resources; 2) variant prioritization.</p> <p>The framework combines an adjustable data context integrating large-scale genomics and cancer resources with a streamlined variant-prioritization pipeline. The pipeline has a weighted scoring system combining: inter- and intra-species conservation; loss- and gain-of-function events for transcription-factor binding; enhancer-gene linkages and network centrality; and per-element recurrence across samples. We further highlight putative drivers with information specific to a particular sample, such as differential expression.</p> <p><strong>HOT Region (Transcription factor highly occupied region)</strong><br /> If a variant occurs in HOT regions, the corresponding cell lines (5 in total) are shown.</p> <p><strong>Motif-Breaking Analysis</strong></p> <p>Motif-breaking events are defined as variants decreasing the PWM (Position Weight Matrix) scores, whereas motif-conserving events are those that do not change or increase the PWM scores [29] (we calculated the difference between mutated and germline alleles in the PWMs). Variants causing motif-breaking events are reported in the output together with the corresponding PWM changes. Transcription factor PWMs are obtained from the ENCODE project [15], including TRANSFAC and JASPAR motifs.</p> <p><strong>Motif-Gaining Analysis</strong></p> <p>Whole-genome motif scanning generally discovers millions of motifs, of which a large fraction are false positives. We focused on variants occurring in promoters (defined as -2.5 kb from transcription start sites) or regulatory elements significantly associated with genes. For each variant, &plusmn;29 bp are concatenated from the human reference genome (motif length is generally &lt;30 bp). 
For each PWM, we scanned the 59 bp sequence. For each candidate motif encompassing the variant, we evaluated the sequence scores using TFM-Pvalue [30] (with respect to the PWM). Given a particular PWM (frequencies are transformed to log likelihoods), the sequence score is computed by summing up the relevant values at each position in the PWM. If the P value with the mutated allele is &le; 4e-8 and the P value with the germline allele is &gt; 4e-8, we define the variant as creating a novel motif. The process is repeated for all PWMs and all variants. The sequence score changes are reported in the output.</p> <p><strong>Alternate and Reference Scores</strong></p> <p>The alternate and reference scores depend on the type of motif analysis:</p> <p><strong>Motif-Breaking:</strong> Alternate allele frequency in PWM, Reference allele frequency in PWM.</p> <p><strong>Motif-Gaining:</strong> Sequence Score with alternate allele, Sequence Score with reference allele.</p> <p><strong>Coding Scoring Scheme</strong></p> <p>Variants in coding regions (GENCODE 16 for the current version; users can replace this with other GENCODE versions) are analyzed with VAT (variant annotation tool) [57]. Variants are ranked based on the following scheme (each criterion gets score 1): (1) non-synonymous; (2) premature stop; (3) is the gene under strong selection; (4) is the gene a network hub; (5) recurrent; (6) GERP score &gt;2.</p> <p><strong>Non-coding Scoring Scheme (Weighted Scoring Scheme)</strong></p> <p>In general, features can be classified into two classes: discrete and continuous. Discrete features are binary, such as whether or not a variant lies in an ultra-conserved element. Continuous features are: (1) GERP score; (2) motif-breaking score, which is the difference between germline and mutated alleles in PWMs; (3) motif-gaining score, which is the sequence score difference between mutated and germline alleles; (4) network centrality score (the cumulative probability; see &lsquo;Network analysis of variants associated with genes&rsquo;). 
If one variant has multiple values of a particular feature (for example, breaking multiple motifs), the largest value is used.</p> <p>We weighted each feature based on the mutation patterns observed in the 1000 Genomes polymorphisms. We randomly selected 10% of the 1000 Genomes Phase 1 SNPs (approximately 3.7 M) and ran them through our pipeline. For each discrete feature d, we calculated the probability p<sub>d</sub> that it overlaps a natural polymorphism. Then we computed 1 - Shannon entropy (1) as its weighted value. The value ranges from 0 to 1 and is monotonically decreasing when the probability is between 0 and 0.5.</p> <p>Finally, for each cancer variant, we scored it by summing the weighted values of all its features (3). If a particular feature is not observed, it is not used in the scoring. Because some features are subsets of other features, we took feature dependencies into account when calculating the summed scores, to avoid overweighting similar features. As shown in Additional file 1: Table S3, when a variant has leaf features, the weighted values of the corresponding root features are ignored. For example, when a variant occurs in sensitive regions, the score of &lsquo;in functional annotations&rsquo; is not used in the sum. Leaf features are assumed independent. Variants ranked at the top of the output are those with higher scores and are most likely to be deleterious.</p> <p>Information from http://info.gersteinlab.org/Funseq2 and https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0480-5#citeas</p>
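<p>As a rough illustration of the weighting step, the Python sketch below computes &quot;1 - Shannon entropy&quot; for a discrete feature and sums the weights of the observed features. It assumes the entropy meant is the binary entropy of the overlap probability; the feature names and probabilities are invented for the example, and the root/leaf dependency rule is not modelled.</p>

```python
import math
from typing import Iterable


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a binary event with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # limit of p*log2(p) as p -> 0 is 0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def feature_weight(p: float) -> float:
    """Weight of a discrete feature: 1 - entropy of its overlap probability."""
    return 1.0 - binary_entropy(p)


def variant_score(observed_probabilities: Iterable[float]) -> float:
    """Sum the weights of the features observed for one variant."""
    return sum(feature_weight(p) for p in observed_probabilities)


if __name__ == "__main__":
    # Hypothetical overlap probabilities, for illustration only.
    observed = {"in_ultra_conserved_element": 0.01, "in_enhancer": 0.2}
    for name, p in observed.items():
        print(name, round(feature_weight(p), 3))
    print("score", round(variant_score(observed.values()), 3))
```

<p>Features that rarely overlap natural polymorphisms (small probability) receive weights close to 1, while a feature overlapping polymorphisms half of the time receives a weight of 0, matching the monotonic behaviour described above.</p>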
<p>&nbsp;A flexible framework to prioritize regulatory mutations from cancer genome sequencing</p>
...
GeneHancer

GeneHancer is a database of genome-wide enhancer-to-gene and promoter-to-gene associations.

...
<p><strong>GeneHancer</strong><br /> GeneHancer is a database of genome-wide enhancer-to-gene and promoter-to-gene associations, embedded in GeneCards. Regulatory elements were mined from the following sources:</p> <p>* The [ENCODE project](https://www.encodeproject.org/) (see [paper](https://www.nature.com/articles/nature11247)) [Z-Lab](http://zlab-annotations.umassmed.edu/enhancers/) Enhancer-like regions<br /> * [Ensembl regulatory build](http://useast.ensembl.org/info/genome/funcgen/index.html) (see [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0621-5))<br /> * FANTOM5 [atlas of active enhancers](http://pressto.binf.ku.dk/) (see [paper](https://www.nature.com/articles/nature12787))<br /> * [VISTA Enhancer Browser](https://enhancer.lbl.gov/) enhancers validated by transgenic mouse assays (see [paper](https://academic.oup.com/nar/article/35/suppl_1/D88/1096925)).<br /> * [dbSUPER](http://asntech.org/dbsuper/) super-enhancers (see [paper](https://academic.oup.com/nar/article/44/D1/D164/2502575)).<br /> * [EPDnew](https://epd.epfl.ch//EPDnew_database.php) promoters (see [paper](https://academic.oup.com/nar/article/41/D1/D157/1070274)).<br /> * [UCNEbase](https://ccg.epfl.ch//UCNEbase/) ultra-conserved noncoding elements (see [paper](https://academic.oup.com/nar/article/41/D1/D101/1057253)).<br /> * [CraniofacialAtlas](https://cotney.research.uchc.edu/data/) (see [paper](https://www.sciencedirect.com/science/article/pii/S2211124718305175?via%3Dihub)).</p> <p>The GeneHancer table lists a set of enhancers and promoters associated with the gene. 
Gene-GeneHancer associations and likelihood-based scores were generated using information that helps link regulatory elements to genes:</p> <p>* eQTLs (expression quantitative trait loci) from [GTEx](https://www.gtexportal.org/home/) (see [paper](https://www.nature.com/articles/nrg3969))<br /> * Capture Hi-C promoter-enhancer long range interactions (see [paper](https://www.nature.com/articles/ng.3286))<br /> * Expression correlations between eRNAs and candidate target genes from FANTOM5 (see [paper](https://www.nature.com/articles/nature12787))<br /> * Cross-tissue expression correlations between a transcription factor interacting with an enhancer and a candidate target gene<br /> * GeneHancer-gene distance-based associations, scored utilizing inferred distance distributions. Associations include several approaches: (a) Nearest neighbors, where each GeneHancer is associated with its two proximal genes (from all gene categories). In cases where a proximal gene is not protein coding, the nearest protein coding gene is also included; (b) Overlaps with the gene territory (Intragenic); (c) Proximity (&lt;2kb) to the gene TSS (transcription start site). TSS proximity scores are boosted to elevate Gene-GeneHancer associations in the vicinity of the gene TSS.</p> <p><strong>GeneHancer Identifier</strong>&nbsp;<br /> GeneHancer elements have unique, informative and persistent GeneHancer identifiers (GHids). The id begins with GH, which is followed by the chromosome number, a single letter related to the GeneHancer version (constant since version 4.8, &lsquo;J&rsquo;), and approximate kilobase start coordinate. Example: GH0XJ101383 is located on chromosome X, with starting position (in kb) of 101383.</p> <p>Each GeneHancer has a confidence score which is computed based on a combination of evidence annotations: (1) Number of sources; (2) Source scores; (3) TFBSs (from ENCODE). 
GeneHancers supported by two or more evidence sources were defined as elite and annotated accordingly with an asterisk. For every GeneHancer, the following annotations are included: GH id, GH type (promoter, enhancer or both), the sources with evidence for the GeneHancer, genomic size, GeneHancer confidence score, and a list of TFs (Transcription Factors) having TFBSs (Transcription Factor Binding Sites) within the GeneHancer (based on ChIP-Seq evidence).</p> <p>Disease-GeneHancer associations: GeneHancer-gene pairs were associated with diseases by integrating manually curated disease-associated variants within regulatory elements from (1) DiseaseEnhancer, PMID:29059320; (2) PMID:27569544.</p> <p>GWAS phenotypes: GeneHancer elements were associated with phenotypes by mapping GWAS SNPs from the GWAS Catalog (PMID: 27899670).</p> <p>Information from https://www.genecards.org/Guide/GeneCard#enhancers</p>
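<p>The GHid layout described above can be unpacked with a small parser. The sketch below is an assumption-laden illustration: it infers from the example GH0XJ101383 that the chromosome field is two characters with a leading zero for single-character chromosomes, and the function and field names are hypothetical.</p>

```python
import re
from typing import NamedTuple


class GeneHancerId(NamedTuple):
    chromosome: str      # e.g. "X" or "17"
    version_letter: str  # e.g. "J"
    start_kb: int        # approximate start coordinate in kilobases


# Assumed layout: "GH" + 2-character chromosome + 1 letter + digits.
_GHID_PATTERN = re.compile(r"^GH([0-9A-Z]{2})([A-Z])(\d+)$")


def parse_ghid(ghid: str) -> GeneHancerId:
    """Split a GeneHancer identifier into its documented components."""
    match = _GHID_PATTERN.match(ghid)
    if match is None:
        raise ValueError(f"not a recognised GeneHancer id: {ghid!r}")
    chrom_field, letter, start = match.groups()
    # Drop the zero padding assumed for single-character chromosomes.
    return GeneHancerId(chrom_field.lstrip("0"), letter, int(start))


if __name__ == "__main__":
    print(parse_ghid("GH0XJ101383"))
```

<p>For the example in the text, this returns chromosome X, version letter J, and a start coordinate of 101383 kb.</p>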
<p>GeneHancer is a database of genome-wide enhancer-to-gene and promoter-to-gene associations.</p>
<p>Freely available for non-commercial use.</p>
...
gnomAD3

Genome Aggregation Database (gnomAD) is a resource developed by an international  coalition of investigators, with th...

<p><strong>gnomAD - genome Aggregation Database</strong></p> <p>The Genome Aggregation Database is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.</p> <p>The v3 short variant data set provided on this website spans 71,702 genomes from unrelated individuals sequenced as part of various disease-specific and population genetic studies, and is aligned against the GRCh38 reference.</p> <p>We have removed individuals known to be affected by severe pediatric disease, as well as their first-degree relatives, so these data sets should serve as useful reference sets of allele frequencies for severe pediatric disease studies. Note, however, that some individuals with severe disease may still be included in the data sets, albeit likely at a frequency equivalent to or lower than that seen in the general population.</p> <p><strong>OpenCRAVAT specifics</strong></p> <p>OpenCRAVAT does not include variants that fail the gnomAD3 quality control process.</p> <p>Additionally, in some cases a variant will have an allele frequency (AF) of 0 for some populations and an empty or null AF for other populations. An AF of 0 means that the population included individuals with reads of high enough quality to make a call at this position, and none of those individuals had the alternate allele. An empty/null AF means that no variant calls were made for individuals in this population because of quality control filters.</p>
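<p>When consuming these values programmatically, the difference between 0 and empty/null matters, because a plain truthiness check treats both the same way. The Python sketch below shows one way to keep them apart; the function name and the population keys are hypothetical and do not reflect actual OpenCRAVAT column names.</p>

```python
from typing import Optional


def describe_allele_frequency(af: Optional[float]) -> str:
    """Interpret a per-population allele frequency as described in the text."""
    if af is None:
        # Empty/null: no calls passed quality control in this population.
        return "no call"
    if af == 0:
        # Called with adequate quality, but nobody carried the alternate allele.
        return "called, alternate allele absent"
    return "alternate allele observed"


if __name__ == "__main__":
    # Hypothetical per-population values for a single variant.
    frequencies = {"pop_a": 0.0, "pop_b": None, "pop_c": 0.0012}
    for population, af in frequencies.items():
        print(population, describe_allele_frequency(af))
```

<p>Testing <code>af is None</code> before any numeric comparison avoids collapsing &quot;not called&quot; into &quot;absent&quot;.</p>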
<p>Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects.</p>
...
gnomAD3 Counts

Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the...

<p><strong>gnomAD - genome Aggregation Database</strong></p> <p>The Genome Aggregation Database is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.</p> <p>The v3.1 short variant data set provided on this website spans 76,156 genomes from unrelated individuals sequenced as part of various disease-specific and population genetic studies, and is aligned against the GRCh38 reference. In this release, we have included more than 3,000 new samples specifically chosen to increase the ancestral diversity of the resource. As a result, this is the first release for which we have a designated population label for samples of Middle Eastern ancestry. See the [gnomAD v3.1 blog post](https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/) for details of the latest release.&nbsp;</p> <p>Information from https://gnomad.broadinstitute.org/about</p>
<p>Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects.</p>
...
Human Phenotype Ontology

The Human Phenotype Ontology (HPO) provides a standardized vocabulary of phenotypic abnormalities encountered in human dis...

<p><strong>Human Phenotype Ontology</strong></p> <p>The Human Phenotype Ontology (HPO) project provides an ontology of medically relevant phenotypes, disease-phenotype annotations, and the algorithms that operate on these. The HPO can be used to support differential diagnostics, translational research, and a number of applications in computational biology by providing the means to compute over the clinical phenotype. The HPO is being used for computational [deep phenotyping](https://pubmed.ncbi.nlm.nih.gov/22504886/) and precision medicine as well as integration of clinical data into translational research. Deep phenotyping can be defined as the precise and comprehensive analysis of phenotypic abnormalities in which the individual components of the phenotype are observed and described. The HPO is being increasingly adopted as a standard for phenotypic abnormalities by diverse groups such as international rare disease organizations, registries, clinical labs, biomedical resources, and clinical software tools, and will thereby contribute toward nascent efforts at global data exchange for identifying disease etiologies [(K&ouml;hler et al, 2017)](https://pubmed.ncbi.nlm.nih.gov/27899602/).</p> <p>The HPO currently contains over 13,000 terms, which are arranged in a directed acyclic graph and connected by is-a (subclass-of) edges, such that a term represents a more specific or limited instance of its parent term(s). All relationships in the HPO are is-a relationships, i.e. simple class-subclass relationships. For instance, Abnormal lens morphology is-a Abnormal eye morphology. The relationships are transitive, meaning that they are inherited up all paths to the root. Phenotypic abnormality is the main subontology of the HPO and contains descriptions of clinical abnormalities. 
Additional subontologies are provided to describe inheritance patterns, onset/clinical course and modifiers of abnormalities.</p> <p><strong>What is An Ontology and What Use is An Ontology for Medical Genetics Research?</strong></p> <p>The word ontology is derived from Greek words meaning the study of existence and being. More recently, the word ontology has been used in computer science to describe systems that describe concepts within some domain and relationships between those concepts. For instance, the [Gene Ontology consortium](http://geneontology.org/) has developed an extensive ontology describing molecular functions, biological processes, and cellular locations over the last decade and a number of groups have supplied annotations using the GO terms to gene products of many organisms. The study of human phenotypes in the context of hereditary and common disease has the potential to lead to great insight on the function of genes and genetic networks. The HPO provides computational resources that allow large-scale computational analysis of the human phenome.</p> <p><strong>What is the Medical Focus of the Human Phenotype Ontology?</strong></p> <p>The medical focus of the HPO in its initial decade (2007-2017) was on rare, mainly Mendelian diseases. The construction of the initial version of the HPO in 2007/2008 was performed by generating an ontology based on descriptions in the Clinical Synopsis of the [Online Mendelian Inheritance in Man (OMIM) database](https://omim.org/). Since the [initial publication of the HPO in 2008](https://pubmed.ncbi.nlm.nih.gov/18950739/), the HPO team has held regular workshops with clinicians to refine and extend the clinical terminology of the HPO in specific areas such as cardiology or immunology. We have added textual definitions, computational logical definitions, and roughly 6,000 new terms since then.</p> <p>The focus of the HPO will be extended to other areas of medicine in coming years. 
A pilot project on common-disease annotations was published in 2015.</p> <p><strong>Terms in the Human Phenotype Ontology</strong></p> <p>Each term in the HPO describes a clinical abnormality. These may be general terms, such as Abnormal ear morphology, or very specific ones, such as Chorioretinal atrophy. Each term is also assigned to one of the five subontologies (see Table below). The terms have a unique ID such as HP:0001140 and a label such as Epibulbar dermoid. Most terms have textual definitions such as &quot;An epibulbar dermoid is a benign tumor typically found at the junction of the cornea and sclera (limbal epibulbar dermoid).&quot; The source of the definition must be indicated. Many terms have synonyms. For instance, Epibulbar dermoids is taken to be a synonym of Epibulbar dermoid.</p> <p>Information from https://hpo.jax.org/app/help/introduction</p>
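<p>The transitive is-a structure can be illustrated with a toy graph in Python. The first edge below comes from the example in the text; the direct link from Abnormal eye morphology to Phenotypic abnormality is a simplification made for this sketch, and the function name is hypothetical.</p>

```python
from typing import Dict, List, Set

# Toy is-a graph: each term maps to its parent term(s).
IS_A: Dict[str, List[str]] = {
    "Abnormal lens morphology": ["Abnormal eye morphology"],
    # Simplified for illustration; the real HPO has intermediate terms.
    "Abnormal eye morphology": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}


def ancestors(term: str, is_a: Dict[str, List[str]]) -> Set[str]:
    """Collect every term reachable by following is-a edges upward."""
    seen: Set[str] = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("Abnormal lens morphology", IS_A)))
```

<p>Because the relationships are transitive, a patient annotated with a specific term is implicitly annotated with all of its ancestors, which is what makes phenotype-based matching across different levels of detail possible.</p>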
<p>The Human Phenotype Ontology (HPO) provides a standardized vocabulary of phenotypic abnormalities encountered in human disease. Each term in the HPO describes a phenotypic abnormality.</p>
...
Iranome

 Iranome is a variation database based on WES data from 800 individuals from eight major ethnic groups in Iran. ...

<p><strong>Iranome</strong></p> <p>Completion of the human genome project in 2003 and availability of the human genome sequence to the scientific community are important landmarks in the field of human genetics. Since then, by introducing new technologies such as microarray genotyping and next generation sequencing to the field, several human genome variation databases such as International HapMap, 1000 Genomes, NHLBI Exome, UK10K, ExAC and finally Genome Aggregation databases were made available to researchers worldwide. Human genome variation databases like these have been playing a crucial role in interpreting genetic variations in the human genome and understanding the genetics of human disorders. However, many ethnic groups are not represented in current human genome variation databases. It is well known that many human genome variations are ethnicity-specific and we cannot build a complete picture of genetic variations in the human genome without having representatives from all different ethnic groups in those databases. In addition, lack of representatives from specific populations and ethnic groups in human genome databases may lead to marginalization of members of those populations, which might put them in danger of discrimination by depriving them of the benefits of new advances in genetic technologies and their associated medical advances. So from an ethical point of view, this project improves health equity at a national and global level.</p> <p>![image](humu23880-fig-0003-m.jpg)</p> <p>Access to clinical genetic testing has been growing continuously worldwide since the introduction of next generation sequencing technology to the field of genetics about a decade ago. Widespread access to genetic testing will have a remarkable impact on realizing the vision of precision medicine to improve the prevention, diagnosis and treatment of human disorders, many of which have a genetic etiology. 
The benefits of precision medicine may not be realized for those groups who are not represented in current human genomic variation databases. With this in mind, the co-principal investigators on this project, from the Genetics Research Center (GRC) at the University of Social Welfare &amp; Rehabilitation Sciences, Tehran, Iran and Dalla Lana School of Public Health at University of Toronto, Toronto, Ontario, Canada decided to establish the Iranome database ([www.iranome.com](www.iranome.com)) by performing whole exome sequencing on 800 individuals from eight major ethnic groups in Iran. The groups included 100 healthy individuals from each of the following ethnic groups: Arabs, Azeris, Balochs, Kurds, Lurs, Persians, Persian Gulf Islanders and Turkmen. They represent over 80 million Iranians and to some degree half a billion individuals who live in the Middle East, a region with rapid population growth expectations for the future (MENA Policy Brief, 2001). These ethnic groups are among the most underrepresented populations in currently available human genomic variation databases.</p> <p><strong>Terms of Use</strong></p> <p>The data presented here can be used freely and we just request that you cite the Iranome database as the data source. 
Please use the relevant primary Iranome publication for the citation as follows:</p> <p>&#39;Zohreh Fattahi, Maryam Beheshtian, Marzieh Mohseni, Hossein Poustchi, Erin Sellars, Sayyed Hossein Nezhadi, Amir Amini, Sanaz Arzhangi, Khadijeh Jalalvand, Peyman Jamali, Zahra Mohammadi, Behzad Davarnia, Pooneh Nikuei, Morteza Oladnabi, Akbar Mohammadzadeh, Elham Zohrehvand, Azim Nejatizadeh, Mohammad Shekari, Maryam Bagherzadeh, Ehsan Shamsi‐Gooshki, Stefan B&ouml;rno, Bernd Timmermann, Aliakbar Haghdoost, Reza Najafipour, Hamid Reza Khorram Khorshid, Kimia Kahrizi, Reza Malekzadeh, Mohammad R. Akbari, Hossein Najmabadi. Iranome: A catalog of genomic variations in the Iranian population. Hum Mutat. 2019 Nov;40(11):1968-1984. doi: ([10.1002/humu.23880](https://pubmed.ncbi.nlm.nih.gov/31343797/)). Epub 2019 Aug 17. PMID: 31343797.&#39;</p>
<p>Iranome is a variation database based on WES data from 800 individuals from eight major ethnic groups in Iran. They represent over 80 million Iranians and, to some degree, half a billion individuals who live in the Middle East.</p>
...
LitVar

LitVar allows the search and retrieval of variant relevant information from biomedical literature.

...
<p><strong>LitVar</strong></p> <p>LitVar provides access to genomic variant information from the biomedical literature by mining 27 million PubMed abstracts and 1+ million PMC full-text articles. LitVar shows the results of research conducted in the Computational Biology Branch, NCBI.</p> <p>LitVar does the following:</p> <p>* Uses tmVar, a high-performance variant name disambiguation engine, to normalize different forms of the same variant into a unique and standardized name so that all matching articles can be returned regardless of which specific name is used in the query.</p> <p>* Processes the entire set of full-text articles from the open access PMC subset in addition to PubMed in order to provide relevant variant information beyond titles and abstracts.</p> <p>* Leverages the state-of-the-art literature annotation tool, PubTator, to provide key biological relations among variations, drugs, genes, and diseases.</p> <p><strong>Understanding Results</strong></p> <p>LitVar displays 15 publications per page with the most recent publications shown first.</p> <p>When examining returned publications, you can find pre-annotated biological concepts (displayed in color) as well as concepts matching the query (displayed with a colored background). Each color corresponds to a different biological concept: variant, gene, disease, or chemical.</p> <p>![Screenshot](litvar_screenshot_1.png)</p> <p><strong>Using Concept Filters</strong></p> <p><strong>Ranked biological concepts</strong><br /> Concept filters show the genes, diseases, chemicals, and variants that most commonly co-occur with the query variant in the same sentence.</p> <p><strong>Add biological concepts to search</strong><br /> You can click on these concepts to narrow down search results. 
In that case, only publications containing all the selected concepts (including user submitted variant and selected concepts) will be returned.</p> <p><strong>Using Publication Filters</strong></p> <p>Publication-level filters show the top journals in which matching publications were published, as well as the types of these publications, and the number of publications in the last year, last 2 years and last 5 years. You can filter your results by selecting a journal, publication date or publication types.</p> <p><strong>Acknowledgements</strong></p> <p>LitVar is developed by the NCBI Text Mining Research Group at the Computational Biology Branch, with help from the dbSNP group at the Information Engineering Branch.&nbsp;</p> <p>This research is supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine.</p> <p>Information from https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/LitVar/index.html#!?query=</p>
<p>LitVar allows the search and retrieval of variant relevant information from biomedical literature.</p>
...
Likelihood Ratio Test

The likelihood ratio test (LRT) can accurately identify a subset of deleterious mutations that disrupt highly conserved am...

<p><strong>Likelihood Ratio Test</strong></p> <p>Using a comparative genomics data set of 32 vertebrate species we show that a likelihood ratio test (LRT) can accurately identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein-coding sequences, which are likely to be unconditionally deleterious. The LRT is also able to identify known human disease alleles and performs as well as two commonly used heuristic methods, SIFT and PolyPhen. Application of the LRT to three human genomes reveals 796&ndash;837 deleterious mutations per individual, &sim;40% of which are estimated to be at &lt;5% allele frequency. Our results indicate that only a small subset of deleterious mutations can be reliably identified, but that this subset provides the raw material for personalized medicine.</p> <p><strong>&nbsp;LRT Advantages</strong></p> <p>The LRT is conceptually distinct from other comparative genomic methods. To our knowledge, all previous methods designed to identify deleterious mutations rely on heuristic procedures to distinguish sites within a protein that are conserved from those that are not conserved. This is achieved by selecting sequences that are not too closely or too distantly related to the sequence of interest and comparing the degree of conservation at the site of interest to other sites in the protein. The advantage of this approach is that the phylogenetic relationship and evolutionary distance among the sequences is not required.&nbsp;</p> <p>&nbsp;The LRT also differs from other comparative genomic methods in that all amino acid changes are treated the same rather than weighting radical and conservative amino acid changes differently. 
While this is expected to reduce the power of the LRT, empirically, both the false-positive and the false-negative rates of the LRT are lower for radical relative to conservative amino acid changes.</p> <p><strong>Converted Rank Score</strong></p> <p>LRTori scores (The original LRT two-sided p-value) were first converted as LRTnew=1-LRTori*0.5 if Omega&lt;1, or LRTnew=LRTori*0.5 if Omega&gt;=1. Then LRTnew scores were ranked among all LRTnew scores in dbNSFP. The rankscore is the ratio of the rank over the total number of the scores in dbNSFP. The scores range from 0.00162 to 0.8433.</p> <p>Information from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752137/</p>
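<p>The conversion and ranking described under Converted Rank Score can be sketched in a few lines of Python. This is a minimal illustration, not the dbNSFP implementation: the function names are hypothetical, and the handling of tied scores (ties share the highest rank in their group) is an assumption, since the text above does not say how ties are ranked.</p>

```python
from bisect import bisect_right


def convert_lrt(lrt_ori: float, omega: float) -> float:
    """Convert the original two-sided p-value using the rule quoted above."""
    if omega < 1:
        return 1 - lrt_ori * 0.5
    return lrt_ori * 0.5


def rank_scores(values: list[float]) -> list[float]:
    """Return rank / total for every value.

    Assumption: tied values share the highest rank of their tie group.
    """
    ordered = sorted(values)
    total = len(ordered)
    return [bisect_right(ordered, v) / total for v in values]
```

<p>For example, <code>rank_scores([convert_lrt(p, w) for p, w in pairs])</code> would rank a list of hypothetical <code>(LRTori, Omega)</code> pairs against each other rather than against all of dbNSFP.</p>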
<p>The likelihood ratio test (LRT) can accurately identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein-coding sequences.</p>
...
MaveDB

MaveDB is a public repository for datasets from Multiplexed Assays of Variant Effect (MAVEs).

...
<p><strong>MaveDB</strong></p> <p>MaveDB is a public repository for datasets from Multiplexed Assays of Variant Effect (MAVEs), such as those generated by deep mutational scanning (DMS) or massively parallel reporter assay (MPRA) experiments. Despite the importance of MAVE data for basic and clinical research, there is no standard resource for their discovery and distribution. Here, we present MaveDB (https://www.mavedb.org), a public repository for large-scale measurements of sequence variant impact, designed for interoperability with applications to interpret these datasets.</p> <p>MaveDB is open-source, released under the AGPLv3 license.</p> <p>MaveDB is hosted by the Fowler Lab in the Department of Genome Sciences at the University of Washington. It is supported and developed by the University of Washington, the Walter and Eliza Hall Institute of Medical Research, and the Brotman Baty Institute.</p> <p><strong>Data Organization</strong></p> <p>MaveDB has three dataset types: Score Set, Experiment, and Experiment Set. All score, count, and target information is stored in score sets. Experiment sets and experiments are used for organization and metadata.</p> <p>The dataset types are organized hierarchically. Each experiment set can contain multiple experiments and each experiment can contain multiple score sets.</p> <p><strong>Score Sets</strong></p> <p>To capture the structure of real-world study designs, MaveDB is organized hierarchically into score sets, experiments, and experiment sets. Score sets, the most basic unit of organization, contain the variant effect scores and additional metadata such as target sequence information and detailed methods. Each variant effect score is a numeric value.&nbsp;</p> <p><strong>&nbsp;Accession Number Formats</strong></p> <p>MaveDB accession numbers use the URN (Uniform Resource Name) format. 
These accession numbers have a hierarchical structure that reflects the relationship between experiment sets, experiments, score sets, and individual variants in MaveDB.</p> <p><strong>Studies</strong></p> <p>MaveDB&#39;s variant effect scores are captured from two separate studies, involving specific genes and accession numbers. The following describes which genes are included in each study and the methodology used to calculate the scores.</p> <p><strong>Calmodulin yeast complementation</strong><br /> * Includes genes: CALM1, SUMO1, UBE2I, and TPK1<br /> * Method: A deep mutational scan of calmodulin (CALM1) using functional complementation in yeast was performed with DMS-TileSeq, and a machine-learning method was used to impute the effects of missing variants and refine lower-confidence measurements.</p> <p><strong>VAMP-seq</strong><br /> * Includes genes: PTEN and TPMT<br /> * Method: Barcoded variant libraries were created using inverse PCR. Barcodes were associated with full-length variants using Pacific Biosciences SMRT sequencing to generate long reads. Each protein variant is fused to EGFP so that its abundance can be tracked using fluorescence. Cells were sorted into bins using FACS based on the ratio of EGFP to mCherry.</p> <p><strong>References</strong></p> <p>Matreyek, Kenneth A., et al. &quot;Multiplex assessment of protein variant abundance by massively parallel sequencing.&quot; Nature genetics 50.6 (2018): 874-882.</p> <p>Weile, Jochen, et al. &quot;A framework for exhaustively mapping functional missense variants.&quot; Molecular systems biology 13.12 (2017): 957.</p> <p>Information from https://www.mavedb.org/</p>
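<p>The hierarchical accession numbers can be illustrated with a small parser. The text above does not spell out the URN layout, so the pattern below (<code>urn:mavedb:</code> followed by an eight-digit experiment set number, a lowercase experiment letter, a score set number, and a <code>#</code>-separated variant number) is an assumption based on commonly seen MaveDB identifiers, and all names in the sketch are hypothetical.</p>

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: urn:mavedb:<experiment set>[-<experiment>[-<score set>[#<variant>]]]
URN_PATTERN = re.compile(r"^urn:mavedb:(\d{8})(?:-([a-z]+)(?:-(\d+)(?:#(\d+))?)?)?$")


class MaveUrn(NamedTuple):
    experiment_set: str
    experiment: Optional[str]
    score_set: Optional[str]
    variant: Optional[str]

    @property
    def level(self) -> str:
        """Name the deepest level of the hierarchy present in the URN."""
        if self.variant is not None:
            return "variant"
        if self.score_set is not None:
            return "score set"
        if self.experiment is not None:
            return "experiment"
        return "experiment set"


def parse_urn(urn: str) -> MaveUrn:
    """Split a URN into its hierarchical parts, or raise ValueError."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a recognised MaveDB URN: {urn!r}")
    return MaveUrn(*match.groups())
```

<p>Because each level is a prefix of the one below it, the parent of a score set or variant can be recovered from the parsed fields without a lookup.</p>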
<p>MaveDB is a public repository for datasets from Multiplexed Assays of Variant Effect (MAVEs).</p>
...
MetaLR

MetaLR creates an ensemble-based prediction score by using machine learning and logistic regression.

...
<p><strong>MetaLR</strong></p> <p>MetaLR is an ensemble-based prediction algorithm developed by integrating 10 component scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 Genomes populations, using a logistic regression model.</p> <p><strong>Scores</strong></p> <p>**MetaLR_score:** Our logistic regression (LR) based ensemble prediction score. A larger value means the SNV is more likely to be damaging.<br /> Scores range from 0 to 1.</p> <p>**MetaLR_rankscore:** MetaLR scores were ranked among all MetaLR scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of MetaLR scores in dbNSFP.<br /> The scores range from 0 to 1.</p> <p>**MetaLR_pred:** Prediction of our MetaLR based ensemble prediction score, &quot;T(olerated)&quot; or &quot;D(amaging)&quot;. The score cutoff between &quot;D&quot; and &quot;T&quot; is 0.5. The rankscore cutoff between &quot;D&quot; and &quot;T&quot; is 0.81101.</p> <p><strong>Logistic Regression and Support Vector Machine Models</strong></p> <p>MetaLR has a sister database, MetaSVM, that explores the same concepts using a support vector machine model.</p> <p>The quality of machine learning models, such as Support Vector Machine (SVM) and Logistic Regression (LR), can be influenced by the selection of component scores as well as the selection of parameters. 
To optimize the selection of component scores and parameters for our SVM and LR models, we collected a training dataset, on which we performed feature selection and parameter tuning.</p> <p>Output scores were harvested from all of the prediction methods for all mutations in all of our datasets, combined with MMAF from various populations, and integrated into input files for constructing LR and SVM models, with linear, radial and polynomial kernels, using the R package e1071 (44). Performance for each model under each specific setting was tested on testing datasets I and II and was evaluated using the R package ROCR (45). Because testing dataset III contains only true negative (TN) observations, we used a manually calculated true negative rate (TNR) to evaluate performance on it.</p> <p>Moreover, to assess the relative contribution of each prediction score to the performance of LR and SVM, we tested several modified SVM and LR models with one prediction score deleted from the original models and plotted the average ROC curve and AUC value. In addition, to test whether our model could be further improved by using different combinations of prediction scores, we applied step-wise model selection using the Akaike Information Criterion (AIC) as the selection criterion.</p> <p>Information from https://academic.oup.com/hmg/article/24/8/2125/651446#81269181</p>
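<p>As a rough illustration of how a logistic regression ensemble turns component scores into a single value and a &quot;D&quot;/&quot;T&quot; call, consider the sketch below. The weights and bias are placeholders supplied by the caller, not the fitted MetaLR coefficients, the function names are hypothetical, and treating a score of exactly 0.5 as &quot;T&quot; is an assumption because the text only gives the cutoff value.</p>

```python
import math
from typing import Sequence

METALR_CUTOFF = 0.5  # score cutoff between "D" and "T" quoted above


def logistic_ensemble(features: Sequence[float],
                      weights: Sequence[float],
                      bias: float) -> float:
    """Combine component scores with a logistic model; the result lies in (0, 1)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def metalr_pred(score: float) -> str:
    """Return "D" above the cutoff, otherwise "T" (boundary handling assumed)."""
    return "D" if score > METALR_CUTOFF else "T"
```

<p>Larger outputs correspond to variants that are more likely to be damaging, matching the direction of MetaLR_score.</p>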
<p>MetaLR creates an ensemble-based prediction score by using machine learning and logistic regression.</p>
...
MetaSVM

MetaSVM creates an ensemble-based prediction score by using a support vector machine model.

...
<p><strong>MetaSVM</strong></p> <p>MetaSVM is an ensemble-based prediction algorithm developed by integrating 10 component scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 Genomes populations, using a support vector machine model.</p> <p><strong>Scores</strong></p> <p>**MetaSVM_score:** Our support vector machine (SVM) based ensemble prediction score. A larger value means the SNV is more likely to be damaging.<br /> Scores range from -2 to 3 in dbNSFP.</p> <p>**MetaSVM_rankscore:** MetaSVM scores were ranked among all MetaSVM scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of MetaSVM scores in dbNSFP. The scores range from 0 to 1.</p> <p>**MetaSVM_pred:** Prediction of our SVM based ensemble prediction score, &quot;T(olerated)&quot; or &quot;D(amaging)&quot;. The score cutoff between &quot;D&quot; and &quot;T&quot; is 0.5. The rankscore cutoff between &quot;D&quot; and &quot;T&quot; is 0.82257.</p> <p><strong>Logistic Regression and Support Vector Machine Models</strong></p> <p>MetaSVM has a sister database, MetaLR, that explores the same concepts using a logistic regression model.</p> <p>The quality of machine learning models, such as Support Vector Machine (SVM) and Logistic Regression (LR), can be influenced by the selection of component scores as well as the selection of parameters. 
To optimize the selection of component scores and parameters for our SVM and LR models, we collected a training dataset, on which we performed feature selection and parameter tuning.</p> <p>Output scores were harvested from all of the prediction methods for all mutations in all of our datasets, combined with MMAF from various populations, and integrated into input files for constructing LR and SVM models, with linear, radial and polynomial kernels, using the R package e1071 (44). Performance for each model under each specific setting was tested on testing datasets I and II and was evaluated using the R package ROCR (45). Because testing dataset III contains only true negative (TN) observations, we used a manually calculated true negative rate (TNR) to evaluate performance on it.</p> <p>Moreover, to assess the relative contribution of each prediction score to the performance of LR and SVM, we tested several modified SVM and LR models with one prediction score deleted from the original models and plotted the average ROC curve and AUC value. In addition, to test whether our model could be further improved by using different combinations of prediction scores, we applied step-wise model selection using the Akaike Information Criterion (AIC) as the selection criterion.</p> <p>Information from https://academic.oup.com/hmg/article/24/8/2125/651446#81269181</p>
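<p>The &quot;one prediction score deleted&quot; experiment described above amounts to enumerating reduced feature sets. The sketch below only builds those feature lists. Retraining and scoring each reduced model is left out, and keeping MMAF in every reduced set is an assumption, since the text speaks of deleting prediction scores only. The function name is hypothetical.</p>

```python
from typing import List, Sequence, Tuple

# The ten component scores listed for MetaSVM and MetaLR.
COMPONENT_SCORES = [
    "SIFT", "PolyPhen-2 HDIV", "PolyPhen-2 HVAR", "GERP++", "MutationTaster",
    "Mutation Assessor", "FATHMM", "LRT", "SiPhy", "PhyloP",
]


def leave_one_out(scores: Sequence[str],
                  always_keep: Sequence[str] = ("MMAF",)) -> List[Tuple[str, List[str]]]:
    """Return (dropped score, remaining features) for every component score."""
    return [
        (dropped, [s for s in scores if s != dropped] + list(always_keep))
        for dropped in scores
    ]
```

<p>Each returned pair names the score that was removed and the features a modified model would be trained on, so the resulting AUC values can be compared against the full model.</p>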
<p>MetaSVM creates an ensemble-based prediction score by using a support vector machine model.</p>
...
miRBase

A microRNA database is a searchable database of published miRNA sequences and annotation.

...
<p><strong>miRBase: The MicroRNA Database</strong></p> <p>miRBase provides the following services:<br /> * The miRBase database is a searchable database of published miRNA sequences and annotation. Each entry in the miRBase Sequence database represents a predicted hairpin portion of a miRNA transcript (termed mir in the database), with information on the location and sequence of the mature miRNA sequence (termed miR). Both hairpin and mature sequences are available for searching and browsing, and entries can also be retrieved by name, keyword, references and annotation. All sequence and annotation data are also available for download.</p> <p>* The miRBase Registry provides miRNA gene hunters with unique names for novel miRNA genes prior to publication of results. Visit the help pages for more information about the naming service.</p> <p><strong>Name</strong></p> <p>The names in the database are of the form hsa-mir-121b. The first three letters signify the organism. The mir and miR parts of the name denote the precursor and the mature sequence, respectively. Identical mature sequences from distinct precursors will get names of the form hsa-miR-121-1 and hsa-miR-121-2, while highly similar sequences will have the form hsa-miR-121a and hsa-miR-121b. Finally, different organisms have slightly different naming conventions. For example, in plants published names are of the form MIR121. We try wherever possible to stick to these conventions.</p> <p><strong>Accession Number</strong></p> <p>In addition to a name or ID, each miRBase Sequence entry has a unique accession number. The accession number is the only truly stable identifier for an entry, since miRNA names may change from those published as relationships between sequences become clear. The advantage of the accessioned system is that such changes can be tracked in the database, allowing names to evolve to remain consistent, whilst providing the user with full access to the data and history. 
However, accessions convey little biological meaning, and it is expected that miRNAs are referred to by name in publications.</p> <p><strong>Information</strong></p> <p>This page contains information about the miRNA gene of interest and depicts the precursor sequence with the mature miRNA sequence highlighted. The folded precursor structure is computed using the RNAfold program from the ViennaRNA suite. Limitations of the ASCII depiction mean that for complex structures, such as some of the plant precursors, only base pairing interactions of the first helix are shown. The entry table also contains links to publications describing the identification of the miRNA.</p> <p>To receive email notification of data updates and feature changes, please subscribe to the miRBase announcements mailing list. Any queries about the website or naming service should be directed to mirbase@manchester.ac.uk.</p> <p>miRBase is managed by the Griffiths-Jones lab at the Faculty of Biology, Medicine and Health, University of Manchester with funding from the BBSRC. miRBase was previously hosted and supported by the Wellcome Trust Sanger Institute.</p> <p>Information from http://www.mirbase.org/</p>
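<p>The naming scheme described under Name can be sketched as a small parser. It covers only the forms quoted above (a three-letter organism code, <code>mir</code> or <code>miR</code>, a number, an optional letter suffix, and an optional numeric suffix). Plant-style names such as MIR121 and any other suffixes used in miRBase are deliberately not handled, and the class and function names are hypothetical.</p>

```python
import re
from dataclasses import dataclass
from typing import Optional

NAME_PATTERN = re.compile(r"^([a-z]{3})-(mir|miR)-(\d+)([a-z]?)(?:-(\d+))?$")


@dataclass(frozen=True)
class MirnaName:
    organism: str          # first three letters, e.g. "hsa"
    kind: str              # "precursor" for mir, "mature" for miR
    number: int            # e.g. 121
    letter: Optional[str]  # letter suffix marking highly similar sequences
    copy: Optional[int]    # numeric suffix, e.g. 1 or 2


def parse_mirna_name(name: str) -> MirnaName:
    """Split a name such as hsa-mir-121b into its parts, or raise ValueError."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unsupported miRNA name: {name!r}")
    organism, tag, number, letter, copy = match.groups()
    kind = "precursor" if tag == "mir" else "mature"
    return MirnaName(organism, kind, int(number), letter or None,
                     int(copy) if copy else None)
```

<p>Since names may change between releases, a parsed name is best stored alongside the entry&#39;s accession number rather than in place of it.</p>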
<p>A microRNA database is a searchable database of published miRNA sequences and annotation.</p>
...
MITOMAP

The MITOMAP database of human mitochondrial DNA (mtDNA) information has been an important compilation of mtDNA variation f...

<p><strong>MITOMAP:</strong> A human mitochondrial genome database</p> <p>The MITOMAP database of human mitochondrial DNA (mtDNA) information has been an important resource for information about the human mitochondrial DNA (mtDNA) for researchers, clinicians and genetic counselors for the past twenty-five years. Essential information about the mitochondrial reference sequence is provided along with an extensive compilation of mtDNA variants. The MITOMAP curators search research literature for published reports of mitochondrial DNA variants and index those variants in the database. Those variants which are reported as having possible association with disease are noted. A new addition to MITOMAP is the inclusion of data from full-length human mtDNA sequences in GenBank.</p> <p><strong>MitoTIP Scores</strong></p> <p>The pathogenicity scoring from MitoTIP is intended as a starting point for analysis of newly observed tRNA variants. It is based only on database frequencies, the nature of the nucleotide change (transversion/transition/deletion), and conservation scoring. No clinical histories, heteroplasmy data, or functional studies are included. Variants must be further evaluated by the end user in the context of the actual patient heteroplasmy, and the heteroplasmy seen in affected and unaffected family members. Other additional lines of evidence are suggested below.<br /> MitoTIP integrates multiple sources of information to provide a prediction for the likelihood that novel single nucleotide variants or deletions in tRNA-encoding sequences would cause disease. The sources of information used include:<br /> * GenBank deposited mitochondrial sequence<br /> * Annotations of pathogenicity from MITOMAP<br /> * Conservation across species<br /> * The position of the variant within the tRNA<br /> * Whether the nucleotide change is a transition or transversion</p> <p>Each possible nucleotide change was scored and the scores have been interpreted within quartiles. 
Current MitoTIP scores range from -5.9 to 21.8, with a median score of 12.7. MitoTIP scores will be recalculated on a regular basis.<br /> MitoTIP&#39;s scoring of tRNA variants is embedded into Mitomaster and may be viewed by using Mitomaster&#39;s SNV query or Sequence query tools.</p> <p>Information from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4257604/ and https://mitomap.org/MITOMAP/MitoTipInfo</p>
<p>The MITOMAP database of human mitochondrial DNA (mtDNA) information has been an important compilation of mtDNA variation for researchers, clinicians and genetic counselors.</p>
...
Mutation Assessor

Mutation Assessor is a database providing prediction of the functional impact of amino-acid substitutions in proteins...

<p><strong>Mutation Assessor:</strong> database providing prediction of the functional impact of amino-acid substitutions in proteins<br /> Functional impact is calculated based on evolutionary conservation of the affected amino acid in protein homologs. The method has been validated on a large set (60k) of disease associated (OMIM) and polymorphic variants.</p> <p>1. Variant: Amino acid variant<br /> 2. Score: Mutation Assessor functional impact combined score (MAori). The score ranges from -5.135 to 6.49<br /> 3. Ranked Score: MAori scores were ranked among all MAori scores in the database. The rankscore is the ratio of the rank of the score over the total number of MAori scores in the database. The scores range from 0 to 1.<br /> 4. Functional Impact: Predicted functional, i.e. high (&quot;H&quot;) or medium (&quot;M&quot;), or predicted non-functional, i.e. low (&quot;L&quot;) or neutral (&quot;N&quot;). The MAori score cutoffs between &quot;H&quot; and &quot;M&quot;, &quot;M&quot; and &quot;L&quot;, and &quot;L&quot; and &quot;N&quot;, are 3.5, 1.935 and 0.8, respectively. The rankscore cutoffs between &quot;H&quot; and &quot;M&quot;, &quot;M&quot; and &quot;L&quot;, and &quot;L&quot; and &quot;N&quot;, are 0.92922, 0.51944 and 0.19719, respectively.<br /> &nbsp;</p>
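<p>The score cutoffs listed under Functional Impact translate directly into a lookup. The sketch below is a minimal illustration with hypothetical function names. Whether a score exactly equal to a cutoff falls into the higher or lower class is not stated above, so assigning it to the lower class is an assumption.</p>

```python
# MAori score cutoffs quoted above: H/M = 3.5, M/L = 1.935, L/N = 0.8
CUTOFFS = [(3.5, "H"), (1.935, "M"), (0.8, "L")]


def functional_impact(score: float) -> str:
    """Map a Mutation Assessor score to "H", "M", "L" or "N"."""
    for cutoff, label in CUTOFFS:
        if score > cutoff:  # assumption: a score equal to a cutoff takes the lower class
            return label
    return "N"


def is_predicted_functional(score: float) -> bool:
    """High and medium impact count as predicted functional."""
    return functional_impact(score) in ("H", "M")
```

<p>The same structure would work for the rankscore cutoffs (0.92922, 0.51944 and 0.19719) by swapping the values in <code>CUTOFFS</code>.</p>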
<p>Mutation Assessor is a database providing prediction of the functional&nbsp;impact of amino-acid substitutions in proteins</p>
...
MutationTaster

Evaluates disease-causing potential of sequence alterations

...
<p><strong>MutationTaster</strong></p> <p><strong>Bayes Classifier</strong><br /> MutationTaster employs a Bayes classifier to predict the disease potential of an alteration. The Bayes classifier is fed with the outcome of all tests and the features of the alterations and calculates probabilities for the alteration to be either a disease mutation or a harmless polymorphism. For this prediction, the frequencies of all single features for known disease mutations/polymorphisms were studied in a large training set composed of &gt;390,000 known disease mutations from HGMD Professional and &gt;6,800,000 harmless SNPs and Indel polymorphisms from the 1000 Genomes Project (TGP).</p> <p><strong>Models</strong><br /> We provide three different models aimed at different types of alterations: &#39;silent&#39; (synonymous or intronic) alterations (without_aae model), alterations leading to the substitution/insertion/deletion of a single amino acid (simple_aae model), and more complex changes of the amino acid sequence (e.g. mutations introducing a premature stop codon, etc. - complex_aae model). All models were trained with all available and suitable common polymorphisms and disease mutations. MutationTaster automatically determines the correct model for each alteration.</p> <p><strong>Prediction</strong><br /> MutationTaster predicts an alteration as one of four possible types:</p> <p>* disease causing - i.e. probably deleterious<br /> * disease causing automatic - i.e. known to be deleterious<br /> * polymorphism - i.e. probably harmless<br /> * polymorphism automatic - i.e. known to be harmless</p> <p><strong>Automatic Predictions</strong></p> <p>MutationTaster reports any known polymorphisms or known disease variants that have been found at the position in question. Our database contains all single nucleotide polymorphisms (SNPs) from the NCBI SNP database (dbSNP). 
Moreover, we have stored all HapMap genotype frequencies as well as variants from the 1000 Genomes Project [4] (abbreviated here as TGP). If an alteration is located at the same position as a known dbSNP, MutationTaster provides the SNP ID (or rs ID) and a link together with the HapMap genotype frequencies, if available. If each of the three possible genotypes is observed in at least one HapMap population, the alteration is automatically regarded as a polymorphism and predicted as polymorphism automatic (the naive Bayes classifier is run nevertheless and the p value for the prediction is shown). Please note that there may be differences between your alteration and the alleles in dbSNP. For the 1000 Genomes Project, MutationTaster provides information in one of the following formats:<br /> * more than 4 cases homozygous in TGP: TGP: allele_alt/allele_alt found more than 4 times in TGP data: #homozygous_hits<br /> * more than 4 cases heterozygous in TGP: TGP: allele_ref/allele_alt found more than 4 times in TGP data: #heterozygous_hits (#homozygous_hits for allele_alt/allele_alt)<br /> * less than 4 cases homo-/heterozygous in TGP: TGP: allele_ref/allele_alt found #heterozygous_hits times in TGP data, allele_alt/allele_alt #homozygous_hits times.</p> <p>If an alteration was found more than 4 times homozygously in TGP, it is automatically regarded as a polymorphism.<br /> We also display known disease variants from dbSNP ClinVar. If a variant is marked as probable-pathogenic or pathogenic in ClinVar, it is automatically predicted to be disease-causing, i.e. disease causing automatic (the naive Bayes classifier is run nevertheless and the p value for the prediction is shown).</p> <p>Information from http://www.mutationtaster.org/info/documentation.html</p>
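<p>The automatic prediction rules above can be summarised as a small decision function. This is a sketch with hypothetical names, not MutationTaster&#39;s code. The documentation does not say which rule wins when a variant is both pathogenic in ClinVar and frequent in TGP, so checking ClinVar first is an assumption.</p>

```python
from typing import Optional

PATHOGENIC_LABELS = {"probable-pathogenic", "pathogenic"}


def automatic_prediction(tgp_homozygous_hits: int,
                         clinvar_label: Optional[str] = None,
                         all_hapmap_genotypes_seen: bool = False) -> Optional[str]:
    """Return an automatic prediction, or None if only the Bayes classifier applies."""
    # Assumption: ClinVar evidence takes precedence over frequency evidence.
    if clinvar_label in PATHOGENIC_LABELS:
        return "disease causing automatic"
    # All three genotypes observed in at least one HapMap population.
    if all_hapmap_genotypes_seen:
        return "polymorphism automatic"
    # Found more than 4 times homozygously in TGP.
    if tgp_homozygous_hits > 4:
        return "polymorphism automatic"
    return None
```

<p>A return value of <code>None</code> means no automatic rule fired and the classifier&#39;s own &quot;disease causing&quot; or &quot;polymorphism&quot; call stands.</p>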
<p>Evaluates disease-causing potential of sequence alterations</p>
...
Mutpanning

Discovers new tumor genes in aggregated sequencing data.

...
<p><strong>MutPanning</strong></p> <p>MutPanning is designed to detect rare cancer driver genes from aggregated whole-exome sequencing data. Most approaches detect cancer genes based on their mutational excess, i.e. they search for genes with an increased number of nonsynonymous mutations above the background mutation rate. MutPanning further accounts for the nucleotide context around mutations and searches for genes with an excess of mutations in unusual sequence contexts that deviate from the characteristic sequence context around passenger mutations.</p> <p><strong>Introduction</strong></p> <p>MutPanning analyzes aggregated DNA sequencing data of tumor patients to identify genes that are likely to be functionally relevant, based on their abundance of nonsynonymous mutations or their increased number of mutations in unusual nucleotide contexts that deviate from the background mutational process.<br /> &nbsp;<br /> The name MutPanning is inspired by the words &quot;mutation&quot; and &quot;panning&quot;. The goal of the MutPanning algorithm is to discover new tumor genes in aggregated sequencing data, i.e. to &quot;pan&quot; the few tumor-relevant driver mutations from the abundance of functionally neutral passenger mutations in the background. Previous approaches for cancer gene discovery were mostly based on mutational recurrence, i.e. they detected cancer genes based on their excess of nonsynonymous mutations above the local background mutation rate. Further, they search for mutations that occur in functionally important genomic positions (as predicted by bioinformatical scores). These approaches are highly effective in tumor types for which the average background mutation rate (i.e., the total mutational burden) is low or moderate.<br /> &nbsp;<br /> The ability to detect driver genes can be increased by considering the nucleotide context around mutations in the statistical model. 
MutPanning utilizes the observation that most passenger mutations are surrounded by characteristic nucleotide sequence contexts, reflecting the background mutational process active in a given tumor. In contrast, driver mutations are localized towards functionally important positions, which are not necessarily surrounded by the same nucleotide contexts as passenger mutations. Hence, in addition to mutational excess, MutPanning searches for genes with an excess of mutations in unusual sequence contexts that deviate from the characteristic sequence context around passenger mutations. That way, MutPanning actively suppresses mutations in its test statistics that are likely to be passenger mutations based on their surrounding nucleotide contexts. Considering the nucleotide context is particularly useful in tumor types with high background mutation rates and high nucleotide context specificity (e.g., melanoma, bladder, endometrial, or colorectal cancer).</p> <p><strong>Algorithm&nbsp;</strong></p> <p>Most passenger mutations occur in characteristic nucleotide contexts that reflect the mutational process active in a given tumor. MutPanning searches for mutations in &quot;unusual&quot; nucleotide contexts that deviate from this background mutational process. 
In these positions, passenger mutations are rare, so mutations there are a strong indicator of the shift of driver mutations towards functionally important positions.<br /> &nbsp;<br /> The main steps of MutPanning are as follows (adapted from Dietlein et al.):<br /> (i) Model the mutation probability of each genomic position in the human exome depending on its surrounding nucleotide context and the regional background mutation rate.<br /> (ii) Given a gene with n nonsynonymous mutations, use a Monte Carlo simulation approach to simulate a large number of random &quot;scenarios&quot; in which n or more nonsynonymous mutations are randomly distributed along the same gene.<br /> (iii) Compare the number and positions of mutations in each random scenario with the observed mutations in the gene. Based on these comparisons, derive a p-value for the gene.<br /> (iv) Combine this p-value with additional statistical components that account for insertions and deletions, the abundance of deleterious mutations, and mutational clustering.<br /> &nbsp;<br /> <strong>License</strong></p> <p>Distributed under the BSD-3-Clause open source license. A copy of the license text is available at https://github.com/genepattern/docker-mutpanning/blob/develop/LICENSE.txt</p> <p>Information from http://www.cancer-genes.org/</p>
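<p>Steps (i) to (iii) of the algorithm can be illustrated with a toy Monte Carlo test. This is a heavily simplified sketch, not the MutPanning statistic: per-position mutation probabilities are passed in directly instead of being modelled from nucleotide context, each scenario places exactly n mutations rather than &quot;n or more&quot;, and a scenario counts as at least as unusual as the observation when its summed log-probability is no higher. All names are hypothetical.</p>

```python
import math
import random
from typing import Sequence


def monte_carlo_p_value(position_probs: Sequence[float],
                        observed_positions: Sequence[int],
                        n_sim: int = 2000,
                        seed: int = 0) -> float:
    """Empirical p-value for seeing mutations in positions this unlikely."""
    rng = random.Random(seed)
    log_p = [math.log(p) for p in position_probs]
    positions = range(len(position_probs))
    n = len(observed_positions)
    observed = sum(log_p[i] for i in observed_positions)

    hits = 0
    for _ in range(n_sim):
        # Distribute n mutations along the gene according to the background model.
        scenario = rng.choices(positions, weights=position_probs, k=n)
        if sum(log_p[i] for i in scenario) <= observed + 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sim + 1)
```

<p>Mutations that fall in positions where the background process rarely mutates yield a small p-value, which is the intuition behind down-weighting likely passenger mutations.</p>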
<p>Discovers new tumor genes in aggregated sequencing data.</p>
...
MutPred

MutPred is a random forest model for the prediction of pathogenic missense  variants and automated inference of molec...

<p><strong>MutPred-Indel</strong></p> <p>MutPred-Indel is a machine learning-based method and software package that integrates genetic and molecular data to reason probabilistically about the pathogenicity of nonframeshifting indel variants. The model provides both pathogenicity prediction and a ranked list of molecular alterations potentially affecting phenotype. It is trained on a set of pathogenic and unlabeled (putatively neutral) variants obtained from the Human Gene Mutation Database (HGMD) [1] and ExAC [2]. MutPred-Indel is a bagged ensemble of 100 feed-forward neural networks, each trained on a balanced subset of pathogenic and putatively neutral variants.</p> <p>MutPred-Indel was developed by Kymberleigh Pagel at Indiana University Bloomington, and was a joint project of the Mooney group at the University of Washington and the Radivojac group at Indiana University.</p> <p><strong>Interpreting The Results</strong></p> <p>The output of MutPred-Indel consists of a general score (g), i.e., the probability that the nonframeshifting indel variant is pathogenic. This score is the average of the scores from all neural networks in MutPred-Indel. If interpreted as a probability, a score threshold of 0.50 would suggest pathogenicity. However, in our evaluations, we have estimated that a threshold of 0.50 yields a false positive rate (fpr) of 10% and that of 0.70 yields an fpr of 5%.</p> <p>MutPred-Indel also outputs property scores that reflect the impact of a variant on different properties. An empirical P-value (P) is calculated as the fraction of putatively neutral variants in MutPred-Indel&#39;s training set with a number of impacted residues greater than or equal to that for the given variant. A P-value threshold of 0.05 means that, under the null hypothesis, we expect 5% of putatively neutral variants to impact the particular property to the extent that the given variant does. 
These P-values are specific to each property.</p> <p>Information from http://mutpredindel.cs.indiana.edu/index.html</p>
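<p>The two quantities described under Interpreting The Results are simple to compute once the underlying values are available. The sketch below assumes the per-network scores and the impacted-residue counts of the neutral training variants are supplied by the caller. The function names are hypothetical, and treating a score equal to the threshold as pathogenic is an assumption.</p>

```python
from statistics import mean
from typing import Sequence


def general_score(network_scores: Sequence[float]) -> float:
    """General score g: the average over all networks in the ensemble."""
    return mean(network_scores)


def suggests_pathogenicity(g: float, threshold: float = 0.50) -> bool:
    """Apply a score threshold (0.50 or 0.70 in the text above)."""
    return g >= threshold  # assumption: the threshold itself counts as pathogenic


def empirical_p_value(neutral_impacts: Sequence[float],
                      variant_impact: float) -> float:
    """Fraction of neutral variants with impact >= the given variant's impact."""
    at_least = sum(1 for x in neutral_impacts if x >= variant_impact)
    return at_least / len(neutral_impacts)
```

<p>Raising the threshold from 0.50 to 0.70 trades sensitivity for the lower false positive rate quoted above.</p>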
<p>MutPred is a random forest model for the prediction of pathogenic missense&nbsp; variants and automated inference of molecular mechanisms of disease.</p>
...
ncER: non-coding essential regulation

 ncER has a good performance for the identification of deleterious variants in the non-coding genome. ncER can also i...

<p><strong>ncER: non-coding essential regulation</strong></p> <p>ncER performs well in identifying deleterious variants in the non-coding genome. ncER can also identify non-coding regions associated with cell viability, an in vitro surrogate of essentiality, and with regulation of an essential gene. Thus, we speculated that ncER may help map critical regulatory and structural elements of the non-coding genome in the setting of human disease. Of note, the model is trained to assess variants and regions of the non-coding genome, and therefore is not relevant for the scoring of protein-coding regions.</p> <p><strong>ncER Percentile</strong></p> <p>Two sets of ncER percentile scores were computed: the first at nucleotide resolution, and the second with raw ncER scores averaged over 10&thinsp;bp bins and then expressed as percentiles genome-wide. To explore the ncER percentile distribution of highly likely pathogenic variants, we used a manually curated set of pathogenic non-coding variants associated with Mendelian traits (24), and selected those falling at least 500&thinsp;bp from any pathogenic variants from the training/test sets, yielding a set of 85 dominant and 52 recessive non-coding pathogenic variants. The control genomic variants (N&thinsp;=&thinsp;13,659) consisted of singleton variants from the gnomAD whole-genome sequencing datasets matched to the pathogenic sets for the distance to the nearest splice sites and genomic elements.</p> <p>Information from https://www.nature.com/articles/s41467-019-13212-3</p>
<p>ncER performs well in identifying deleterious variants in the non-coding genome. ncER can also identify non-coding regions associated with cell viability, an in vitro surrogate of essentiality (9), and with regulation of an essential gene.</p>
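<p>The second percentile set (raw scores averaged over 10&thinsp;bp bins, then expressed as percentiles) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the percentile here is defined as the share of values less than or equal to the given value, which is an assumption because the exact definition is not given above.</p>

```python
from bisect import bisect_right
from typing import List, Sequence


def bin_means(scores: Sequence[float], bin_size: int = 10) -> List[float]:
    """Average raw scores over consecutive bins (the last bin may be shorter)."""
    means = []
    for start in range(0, len(scores), bin_size):
        chunk = scores[start:start + bin_size]
        means.append(sum(chunk) / len(chunk))
    return means


def percentiles(values: Sequence[float]) -> List[float]:
    """Express each value as a percentile among all supplied values."""
    ordered = sorted(values)
    total = len(ordered)
    return [100.0 * bisect_right(ordered, v) / total for v in values]
```

<p>Calling <code>percentiles(bin_means(raw_scores))</code> on genome-wide raw scores would give the binned percentile track, while <code>percentiles(raw_scores)</code> corresponds to the nucleotide-resolution set.</p>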
...
OncoKB

OncoKB is a precision oncology knowledge base that annotates the biological consequences and clinical implications (therap...

<p><strong>OncoKB</strong>: MSK&#39;s Precision Oncology Knowledge Base</p> <p>OncoKB is a precision oncology knowledge base developed at Memorial Sloan Kettering Cancer Center that contains biological and clinical information about genomic alterations in cancer.</p> <p>Alteration- and tumor type-specific therapeutic implications are classified using the [OncoKB Levels of Evidence system](https://www.oncokb.org/levels), which assigns clinical actionability to individual mutational events.</p> <p>For additional details about the OncoKB curation process, please refer to the version-controlled [OncoKB Curation Standard Operating Procedure](https://www.oncokb.org/sop). When using OncoKB, please cite: [Chakravarty et al., JCO PO 2017](https://ascopubs.org/doi/full/10.1200/PO.17.00011).</p> <p>Information from https://www.oncokb.org/</p> <p>**Pre-requisite**</p> <p>An OncoKB API token is required to run this module. See [here](https://api.oncokb.org/oncokb-website/api) for how to get an OncoKB API token. Create a file named `token.txt` with the token string as its only content and place it under the `data` sub-folder (create it if needed) of this module&#39;s folder.</p> <p>![Screenshot](oncokb_screenshot_1.png)</p>
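<p>The token setup described under Pre-requisite can also be scripted. The sketch below is a convenience helper with a hypothetical name. The module folder path must be supplied by the caller, because its location depends on the local installation.</p>

```python
from pathlib import Path


def write_token(module_dir: str, token: str) -> Path:
    """Create <module_dir>/data/token.txt containing only the token string."""
    data_dir = Path(module_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)  # create the sub-folder if needed
    token_file = data_dir / "token.txt"
    token_file.write_text(token.strip())
    return token_file
```

<p>The token is stripped of surrounding whitespace so that the file holds the token string as its only content.</p>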
<p>OncoKB is a precision oncology knowledge base that annotates the biological consequences and clinical implications (therapeutic, diagnostic, and prognostic) of genetic variants in cancer.</p>
<p>For commercial use, visit https://www.oncokb.org/account/register.</p>
...
PolyPhen-2

.

...
<p>PolyPhen-2 is a new development of the PolyPhen tool for annotating coding nonsynonymous SNPs. Some of the highlights of the new version are:</p> <p>* High quality multiple sequence alignment pipeline<br /> * Probabilistic classifier based on machine-learning method</p> <p><strong>Overview</strong></p> <p>Most of human genetic variation is represented by SNPs (Single-Nucleotide Polymorphisms) and many of them are believed to cause phenotypic differences between human individuals.</p> <p><br /> We specifically focus on nonsynonymous SNPs (nsSNPs), i.e., SNPs located in coding regions and resulting in amino acid variation in protein products of genes. It was shown in several studies that impact of amino acid allelic variants on protein structure/function can be reliably predicted via analysis of multiple sequence alignments and protein 3D-structures. As we demonstrated in an earlier work, these predictions correlate with the effect of natural selection seen as an excess of rare alleles. Therefore, predictions at the molecular level reveal SNPs affecting actual phenotypes.</p> <p><br /> PolyPhen-2 is an automatic tool for prediction of possible impact of an amino acid substitution on the structure and function of a human protein. This prediction is based on a number of features comprising the sequence, phylogenetic and structural information characterizing the substitution.</p> <p><br /> For a given amino acid substitution in a protein, PolyPhen-2 extracts various sequence and structure-based features of the substitution site and feeds them to a probabilistic classifier.</p> <p>__OpenCRAVAT PolyPhen2 scores are sourced from [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP)__</p>
<p>.</p>
...
REVEL

Rare Exome Variant Ensemble Learner

...
<p>![screenshot](screenshot_1.png)<br /> REVEL is an ensemble method for predicting the pathogenicity of missense variants based on a combination of scores from 13 individual tools: MutPred, FATHMM v2.3, VEST 3.0, PolyPhen-2, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP++, SiPhy, phyloP, and phastCons. REVEL was trained using recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. The REVEL score for an individual missense variant can range from 0 to 1, with higher scores reflecting greater likelihood that the variant is disease-causing. When applied to two independent test sets, REVEL had the best overall performance (p &lt; 10<sup>-12</sup>) compared with any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies &lt;0.5%. Compared with other ensemble methods, the area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants, and 0.027-0.143 higher in an independent test set of 1953 pathogenic and 2406 benign variants recently reported in ClinVar. We provide precomputed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale.</p>
<p>Rare Exome Variant Ensemble Learner</p>
...
RVIS

Residual variation intolerance scoring

...
<p><strong>RVIS:</strong> Residual variation intolerance scoring<br /> RVIS is a database providing variation intolerance scoring that assesses whether genes have relatively more or less functional genetic variation than expected based on the apparently neutral variation found in the gene. Scores were developed using sequence data from 6503 whole exome sequences made available by the NHLBI Exome Sequencing Project (ESP).</p> <p>NOTE: Data provided by [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) version 3.5a</p> <p>1. Score: A score measuring intolerance of mutational burden, the higher the score the more tolerant to mutational burden the gene is. Based on EVS (ESP6500) data.<br /> 2. Percentile Rank: The percentile rank of the gene based on RVIS, the higher the percentile the more tolerant to mutational burden the gene is. Based on EVS (ESP6500) data.<br /> 3. FDR p-value: &quot;A gene&#39;s corresponding FDR p-value for preferential LoF depletion among the ExAC population. Lower FDR corresponds with genes that are increasingly depleted of LoF variants.&quot;<br /> 4. ExAC-Based RVIS: &quot;Setting &#39;common&#39; MAF filter at 0.05% in at least one of the six individual ethnic strata from ExAC.&quot;<br /> 5. ExAC-Based RVIS Percentile: &quot;Genome-Wide percentile setting &#39;common&#39; MAF filter at 0.05% in at least one of the six individual ethnic strata from ExAC.&quot;<br /> &nbsp;</p>
<p>Residual variation intolerance scoring</p>
...
Segway

Chromatin state activity scores

...
<p><strong>Segway encyclopedia of human regulatory elements</strong></p> <p><strong>All Tissues</strong></p> <p>Chromatin state annotations of 164 human cell types using 1,615 genomics data sets were used to develop a measure of the importance of each genomic position called the &quot;conservation-associated activity score&quot;. The conservation-associated activity score, aggregated across multiple cell types, provides a measure of importance directly attributable to a specific activity in a specific set of cell types. In contrast to evolutionary conservation, this measure is not biased to detect only elements shared with related species. The conservation-associated activity scores for all annotations were combined to create a single, cell type-agnostic encyclopedia that catalogs all human transcriptional and regulatory elements.</p> <p><strong>Interpreting scores</strong></p> <p>Each base pair has an aggregated conservation-associated activity score (renamed from &quot;functionality score&quot;). Mean Score and Sum Score are the mean and sum within the segment containing the query variant. The conservation-associated activity score is defined based on the enrichment of each annotation state for evolutionary conservation, and therefore aims to separate functional activity (such as genes, promoters and enhancers) from non-functional activity (repressed regions).</p> <p>Higher scores are associated with increased functional activity. The figure below may help interpret score values.</p> <p>![alt text](F3.large.jpg)</p>
<p>Chromatin state activity scores</p>
...
SpliceAI

A deep neural network that accurately predicts splice junctions from an arbitrary pre-mRNA transcript sequence, enabling p...

<p><strong>SpliceAI:</strong> A deep learning tool to identify splice variants&nbsp;</p> <p>The splicing of pre-mRNAs into mature transcripts is remarkable for its precision, but the mechanisms by which the cellular machinery achieves such specificity are incompletely understood. Here, we describe a deep neural network that accurately predicts splice junctions from an arbitrary pre-mRNA transcript sequence, enabling precise prediction of noncoding genetic variants that cause cryptic splicing. Synonymous and intronic mutations with predicted splice-altering consequence validate at a high rate on RNA-seq and are strongly deleterious in the human population. De novo mutations with predicted splice-altering consequence are significantly enriched in patients with autism and intellectual disability compared to healthy controls and validate against RNA-seq in 21 out of 28 of these patients. We estimate that 9%&ndash;11% of pathogenic mutations in patients with rare genetic disorders are caused by this previously underappreciated class of disease variation.</p> <p><strong>Scoring</strong></p> <p>Delta score of a variant, defined as the maximum of (Acceptor Gain, Acceptor Loss, Donor Gain, Donor Loss), ranges from 0 to 1 and can be interpreted as the probability of the variant being splice-altering. 
Delta position conveys information about the location where splicing changes relative to the variant position (positive values are downstream of the variant, negative values are upstream).</p> <p><strong>Examples</strong></p> <p>The output for the following variant 19:38958362 C&gt;T can be interpreted as follows:</p> <p>|Acceptor Gain Score| Acceptor Loss Score| Donor Gain Score| Donor Loss Score|Acceptor Gain Position| Acceptor Loss Position| Donor Gain Position| Donor Loss Position|<br /> |-------------------|:-------------------|:----------------|:----------------|:---------------------|:----------------------|:-------------------|:-------------------|<br /> |0.00|0.00|0.91|0.08|-28|-46|-2|-31|</p> <p>* The probability that the position 19:38958360 (=38958362-2) is used as a splice donor increases by 0.91.<br /> * The probability that the position 19:38958331 (=38958362-31) is used as a splice donor decreases by 0.08.</p> <p>Similarly, the output for the variant 2:179415988 C&gt;CA has the following interpretation:</p> <p>|Acceptor Gain Score| Acceptor Loss Score| Donor Gain Score| Donor Loss Score|Acceptor Gain Position| Acceptor Loss Position| Donor Gain Position| Donor Loss Position|<br /> |-------------------|:-------------------|:----------------|:----------------|:---------------------|:----------------------|:-------------------|:-------------------|<br /> |0.07|1.00|0.00|0.00|-7|-1|35|-29|</p> <p>* The probability that the position 2:179415981 (=179415988-7) is used as a splice acceptor increases by 0.07.<br /> * The probability that the position 2:179415987 (=179415988-1) is used as a splice acceptor decreases by 1.00.</p> <p><br /> Information from https://github.com/Illumina/SpliceAI/blob/master/README.md</p> <p><br /> &nbsp;</p>
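<p>The arithmetic in the examples above can be reproduced with a short sketch. The function name <code>interpret</code> and the dictionary keys are hypothetical and chosen for illustration only; they are not part of SpliceAI or its output format.</p>

```python
def interpret(variant_pos, scores, offsets):
    """Return the overall delta score and the genomic position affected per event.

    scores  -- delta scores keyed by event name
    offsets -- delta positions relative to the variant (negative = upstream)
    """
    delta_score = max(scores.values())  # maximum over the four events
    affected = {event: variant_pos + offsets[event] for event in scores}
    return delta_score, affected


# First example above: 19:38958362 C>T
scores = {"acceptor_gain": 0.00, "acceptor_loss": 0.00,
          "donor_gain": 0.91, "donor_loss": 0.08}
offsets = {"acceptor_gain": -28, "acceptor_loss": -46,
           "donor_gain": -2, "donor_loss": -31}

delta_score, affected = interpret(38958362, scores, offsets)
print(delta_score)             # 0.91
print(affected["donor_gain"])  # 38958360
print(affected["donor_loss"])  # 38958331
```

<p>The same function applied to the second example (2:179415988 C&gt;CA) gives the acceptor positions 179415981 and 179415987 listed above.</p>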
<p>A deep neural network that accurately predicts splice junctions from an arbitrary pre-mRNA transcript sequence, enabling precise prediction of noncoding genetic variants that cause cryptic splicing.</p>
<p>Freely available for non-commercial use.</p>
...
Swiss-Prot Domains

Provides information on location, topology, and domain(s) of a protein.

...
<p><strong>Swiss-Prot Domains</strong><br /> Information on sequence similarities with other proteins and the domain(s) present in a protein, together with information on the location and topology of the mature protein in the cell.</p> <p>The information is filed in different subsections. The current subsections and their content are listed below:</p> <p>|Subsection|Content|<br /> |----------|:------|<br /> |[Domain](https://www.uniprot.org/help/domain)| Denotes the position and type of each modular protein domain|<br /> |[Repeat](https://www.uniprot.org/help/repeat)| Denotes the positions of repeated sequence motifs or repeated domains|<br /> |[Motif](https://www.uniprot.org/help/motif)| Short (up to 20 amino acids) sequence motif of biological interest|<br /> |[Peptide](https://www.uniprot.org/help/peptide)| Position and length of an active peptide in the mature protein|<br /> |[Transmembrane](https://www.uniprot.org/help/transmem)| Extent of a membrane-spanning region|<br /> |[Topological Domain](https://www.uniprot.org/help/topo_dom)| Location of non-membrane regions of membrane-spanning proteins|<br /> |[Intramembrane](https://www.uniprot.org/help/intramem)| Extent of a region located in a membrane without crossing it|</p> <p>Information from https://www.uniprot.org/help/subcellular_location_section and https://www.uniprot.org/help/family_and_domains_section</p>
<p>Provides information on location, topology, and domain(s) of a protein.</p>
...
Swiss-Prot PTM

A high quality manually annotated protein sequence database, specializing in post-translational modifications (PTMs).<...

<p><strong>Swiss-Prot PTMs</strong></p> <p>UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB).<br /> It is a high quality annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions.</p> <p>Swiss-Prot has a subsection &#39;PTM / Processing&#39; that describes post-translational modifications (PTMs). This subsection complements the information provided at the sequence level or describes modifications for which position-specific data is not yet available.</p> <p>|Subsection|Content|<br /> |----------|:------|<br /> |[Cross-link](https://www.uniprot.org/help/crosslnk)|Residues participating in covalent linkage(s) between proteins|<br /> |[Disulfide Bond](https://www.uniprot.org/help/disulfid)|Cysteine residues participating in disulfide bonds|<br /> |[Glycosylation](https://www.uniprot.org/help/carbohyd)|Covalently attached glycan group(s)|<br /> |[Initiator methionine](https://www.uniprot.org/help/init_met)|Cleaved initiator methionine|<br /> |[Lipidation](https://www.uniprot.org/help/lipid)|Covalently attached lipid group(s)|<br /> |[Modified residue](https://www.uniprot.org/help/mod_res)|Modified residues excluding lipids, glycans and protein crosslinks|<br /> |[Propeptide](https://www.uniprot.org/help/propep)|Part of a protein that is cleaved during maturation or activation|<br /> |[Signal](https://www.uniprot.org/help/signal)|Sequence targeting proteins to the secretory pathway or periplasmic space|<br /> |[Transit Peptide](https://www.uniprot.org/help/transit)|Extent of a transit peptide for organelle targeting|</p> <p>Information from https://www.uniprot.org/help/post-translational_modification</p>
<p>A high quality manually annotated protein sequence database, specializing in post-translational modifications (PTMs).</p>
...
Trinity CTAT

Trinity assembles transcript sequences from Illumina RNA-Seq data.

...
<p><strong>Trinity</strong></p> <p>Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.</p> <p><strong>Trinity Cancer Transcriptome Analysis Toolkit (CTAT)</strong></p> <p>The Trinity Cancer Transcriptome Analysis Toolkit (CTAT) aims to provide tools for leveraging RNA-Seq to gain insights into the biology of cancer transcriptomes. Bioinformatics tool support is provided for mutation detection, fusion transcript identification, de novo transcript assembly of cancer-specific transcripts, lncRNA classification, and foreign transcript detection (viruses, microbes). CTAT is funded by the National Cancer Institute Informatics Technology for Cancer Research (NCI ITCR) program. Software tools and pipelines developed as components of Trinity CTAT are described below with links to the corresponding open source software, documentation, and tutorials.</p> <p><strong>&nbsp;CTAT-Mutations Pipeline Overview</strong></p> <p>CTAT-Mutations Pipeline is a variant calling pipeline focussed on detecting mutations from RNA sequencing (RNA-seq) data. It integrates GATK Best Practices along with downstream steps to annotate, filter, and prioritize cancer mutations. This includes leveraging the RADAR and RediPortal databases for identifying likely RNA-editing events, dbSNP for excluding common variants, and COSMIC to highlight known cancer mutations. 
Finally, CRAVAT is leveraged to annotate and prioritize variants according to likely biological impact and relevance to cancer.</p> <p>The CTAT-Mutations pipeline is one of the components of the Trinity Cancer Transcriptome Analysis Toolkit (CTAT), complementing other functionality that leverages RNA-seq data for characterizing cancer transcriptomes, including identification of fusion transcripts and of copy number variations from tumor single-cell transcriptomes, among other analyses.</p> <p>Our CTAT-Mutations pipeline aims to make mutation discovery from RNA-seq data as easy as possible, requiring only the RNA-seq reads as input, and generating summary reports and visualizations to help guide you to the most meaningful findings.</p> <p>Information from https://github.com/NCIP/Trinity_CTAT/wiki</p>
<p>Trinity assembles transcript sequences from Illumina RNA-Seq data.</p>
...
Turkish Variome

Turkish Variome defines the extent of variation observed in Turkey, based on sequencing data of 3,362 unrelated Turk...

<p><strong>Turkish Variome</strong></p> <p>We delineated the fine-scale genetic structure of the Turkish population by using sequencing data of 3,362 unrelated Turkish individuals from different geographical origins and demonstrated the position of Turkey in terms of human migration and genetic drift. The results show that the genetic structure of present-day Anatolia was shaped by historical and modern-day migrations, high levels of admixture, and inbreeding. We observed that modern-day Turkey has close genetic relationships with the neighboring Balkan and Caucasus populations. We generated a Turkish Variome which defines the extent of variation observed in Turkey, listed homozygous loss-of-function variants and clinically relevant variants in the cohort, and generated an imputation panel for future genome-wide association studies.</p> <p>![image](pnas.2026076118fig01_red.jpg)</p> <p>Kars, M Ece et al. &ldquo;The genetic structure of the Turkish population reveals high levels of variation and admixture.&rdquo; Proceedings of the National Academy of Sciences of the United States of America vol. 118,36 (2021): e2026076118. doi:[10.1073/pnas.2026076118](https://doi.org/10.1073/pnas.2026076118)</p>
<p>Turkish Variome defines the extent of variation observed in Turkey, based on sequencing data of 3,362 unrelated Turkish individuals from different geographical origins.</p>