Box 4.4 Ensembl Stable IDs
Ensembl assigns accession numbers to many data types in its database. Each identifier begins with the organism prefix; for human, the prefix is ENS; for mouse, it is ENSMUS; and for anole lizard, it is ENSACA. Next comes an abbreviation for the feature type: G for gene, T for transcript, P for protein, R for regulatory, and so forth. This is followed by a series of digits and an optional version. The version number increments when there is a change in the underlying data. The gene version changes when the underlying transcripts are updated, and the transcript and protein versions increment when the sequence changes.
For example, the human PAH gene has the following identifiers:
ENSG00000171759.9: the identifier of the human PAH gene
ENST00000553106.5: the identifier of one transcript of the human PAH gene, transcript PAH-215
ENSP00000448059.1: the identifier of the protein translation of transcript PAH-215, ENST00000553106.5
ENSR00000056420: the identifier of a promoter of several PAH transcripts
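The structure described in this box can be sketched as a small parser. This is an illustration only, not an Ensembl tool: the names parse_stable_id and StableId are hypothetical, the pattern assumes that a non-human organism prefix is ENS plus exactly three letters (as in ENSMUS and ENSACA), and it recognizes only the four feature-type letters listed above.

```python
import re
from typing import NamedTuple, Optional

# Feature-type letters named in the box; Ensembl uses others as well ("and so forth").
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "R": "regulatory"}

# Assumed layout: organism prefix ("ENS" alone for human, or "ENS" plus a
# three-letter species code), one feature-type letter, a run of digits,
# and an optional ".version" suffix.
STABLE_ID = re.compile(r"(ENS[A-Z]{3}|ENS)([GTPR])(\d+)(?:\.(\d+))?")


class StableId(NamedTuple):
    prefix: str             # organism prefix, e.g. "ENS" or "ENSMUS"
    feature: str            # feature type spelled out, e.g. "gene"
    number: str             # digit string, kept as text to preserve leading zeros
    version: Optional[int]  # None when the identifier carries no version


def parse_stable_id(text: str) -> StableId:
    """Split an Ensembl-style stable ID into its parts, or raise ValueError."""
    match = STABLE_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognized stable ID: {text!r}")
    prefix, letter, number, version = match.groups()
    return StableId(prefix, FEATURE_TYPES[letter], number,
                    int(version) if version is not None else None)


if __name__ == "__main__":
    for identifier in ("ENSG00000171759.9", "ENST00000553106.5",
                       "ENSP00000448059.1", "ENSR00000056420"):
        print(parse_stable_id(identifier))
```

Keeping the digit string as text rather than converting it to an integer preserves the leading zeros, so the original identifier can be reassembled from its parts.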
Navigation controls between the second and third panels of the Location tab allow the display to be zoomed or moved to the left or right. The blue bar at the top of the Region in detail panel allows users to toggle between Drag and Select. When the Drag option is highlighted, click on the graphical view window and drag it to the left or right to change the location. When the Select option is highlighted, click on a region of interest in the graphical view and, holding the mouse button down, drag to the left or right to highlight the region (Figure 4.17a). The highlight can be left on for visualization purposes; alternatively, select Jump to region to zoom in to the selected region. Figure 4.17b shows the results of zooming in to the last exon of transcript PAH-203; since the gene is transcribed from right to left, the last exon is on the left.
Note the track called All phenotype-associated short variants (SNPs and indels), which contains variants that have been associated with a phenotype or disease. SNPs are color coded by function, with dark green indicating coding sequence variants. Select the dark green SNP, highlighted with a red box near the left end of the window, and follow the link for additional information. The resulting Variant tab provides links to SNP-related resources. For example, the Phenotype Data for this SNP (rs76296470; Figure 4.18a) shows that this variant is pathogenic and is associated with the disease phenylketonuria. The most severe consequence for this SNP is stop gained. Further details about the consequences are available under the Genes and regulation link (Figure 4.18b) on the left sidebar. This variant is found in 10 transcripts of the PAH gene. In five of those transcripts, it alters one nucleotide in a codon, changing an arginine to a stop codon, thus truncating the PAH protein. In the other five transcripts, either the variant is downstream of the gene or the transcript is non-coding.
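The per-transcript consequences shown in the Variant tab can also be retrieved programmatically. The following is a minimal sketch that assumes the public Ensembl REST service at rest.ensembl.org and its vep/:species/id/:id endpoint; the function names are hypothetical, and the JSON field names used here (transcript_consequences, consequence_terms) should be checked against the current REST documentation before relying on them.

```python
import json
import urllib.request
from collections import Counter

REST_SERVER = "https://rest.ensembl.org"  # assumed public REST server


def vep_url(rsid: str, species: str = "human") -> str:
    """Build the URL for a Variant Effect Predictor lookup by variant ID."""
    return f"{REST_SERVER}/vep/{species}/id/{rsid}"


def fetch_vep(rsid: str) -> list:
    """Request VEP results as JSON (needs network access; not called on import)."""
    request = urllib.request.Request(
        vep_url(rsid), headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def tally_consequences(vep_records: list) -> Counter:
    """Count consequence terms across all transcript-level results.

    A transcript may carry more than one term, so the totals count terms,
    not transcripts.
    """
    counts = Counter()
    for record in vep_records:
        for transcript in record.get("transcript_consequences", []):
            counts.update(transcript.get("consequence_terms", []))
    return counts


if __name__ == "__main__":
    records = fetch_vep("rs76296470")
    for term, n in tally_consequences(records).most_common():
        print(term, n)
```

Because gene annotation changes between Ensembl releases, the number of transcripts returned may differ from the 10 described above.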
Ensembl makes many annotation tracks available through the Configure this page link on the left sidebar. There are over 500 tracks available for display on GRCh38, with the majority falling in the categories of Variation, Regulation, and Comparative Genomics. The Ensembl Regulatory Build includes regions that are likely to be involved in gene regulation, including promoters, promoter flanking regions, enhancers, CCCTC-binding factor (CTCF) binding sites, transcription factor binding sites (TFBS), and open chromatin regions (Zerbino et al. 2016). A summary Regulatory Build track is turned on by default in the Location tab, and the display of individual features can be adjusted in the Configure this page menu. In the UCSC Genome Browser, the GTEx track shows that the PAH gene is highly expressed in liver and kidney (Figure 4.10); the epigenetic factors that may be controlling this activity can be viewed in the Ensembl Regulatory Build. To view these factors, navigate to Regulation → Histones & polymerases on the Configure this page menu, mouse over the HepG2 human liver carcinoma line, and select All features for HepG2 (Figure 4.19a). In addition, navigate to Regulation → Open chromatin & TFBS and confirm that the DNase1 track is in its default state for HepG2; the dark blue indicates that the track is shown. Close the Configure this page menu by clicking on the check mark in the upper right corner of the pop-up window. Notice that the Regulatory Build track has now expanded to include the selected gene regulatory marks in the HepG2 cell line. Zoom in on the first exon of transcript PAH-215 to see the promoter region of this gene, being mindful of the orientation of the gene (Figure 4.19b). The solid red rectangle in the Regulatory Build track shows the location of the PAH promoter.
The presence of a DNase I hypersensitive site along with the activating histone marks of H3K27Ac, H3K4me1, H3K4me2, H3K4me3, H3K79me2, and H3K9Ac may help to explain why this gene is highly expressed in liver cells (Box 4.3). Detailed information about features in the Regulatory Build track, such as the source of the data, is available under the Regulation tab. Click on the feature and select its identifier (the letters ENSR, followed by numbers) to open this tab.
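Regulatory Build features that overlap a gene can also be listed outside the browser. The sketch below assumes the Ensembl REST overlap/id/:id endpoint with feature=regulatory and assumes that each returned record carries its stable ID in an "id" field; the function names are hypothetical, and the identifiers returned depend on the Ensembl release.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # assumed public REST server


def regulatory_overlap_url(stable_id: str) -> str:
    """Build the URL that requests regulatory features overlapping a feature."""
    query = urllib.parse.urlencode({"feature": "regulatory"})
    return f"{REST_SERVER}/overlap/id/{stable_id}?{query}"


def fetch_json(url: str):
    """Fetch a URL as JSON (needs network access; not called on import)."""
    request = urllib.request.Request(
        url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def regulatory_ids(features: list) -> list:
    """Keep the IDs that look like regulatory stable IDs (ENSR plus digits)."""
    return [f["id"] for f in features
            if str(f.get("id", "")).startswith("ENSR")]


if __name__ == "__main__":
    # ENSG00000171759 is the human PAH gene identifier given in Box 4.4.
    features = fetch_json(regulatory_overlap_url("ENSG00000171759"))
    print(regulatory_ids(features))
```

Each ENSR identifier printed this way corresponds to a feature that can be opened in the Regulation tab.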
Figure 4.17 Zooming in on the bottom section of the Location tab from Figure 4.16. (a) Highlight a region of interest, the final exon of PAH transcript PAH-203, by clicking the mouse and then dragging to the left or right. In order to highlight the region, the Drag/Select toggle in the blue bar at the top of the section must first be set to Select. (b) To zoom in to the highlighted region, select Jump to region. It may take a few iterations to create the view in this figure. At the bottom of the window is a track labeled All phenotype-associated – short variants (SNPs and indels). In this track, the SNP rs76296470 has been manually highlighted in red.