Box 4.3 Histone Marks
Histone proteins package DNA into chromosomes. Post-translational modifications of these histones can affect gene expression, as well as DNA replication and repair, by changing chromatin structure or recruiting histone modifiers (Lawrence et al. 2016). The post-translational modifications include methylation, phosphorylation, acetylation, ubiquitylation, and sumoylation. Histone H3 is primarily acetylated on lysine residues, methylated at arginine or lysine, or phosphorylated on serine or threonine. Histone H4 is primarily acetylated on lysine, methylated at arginine or lysine, or phosphorylated on serine.
Histone modification (or “marking”) is identified by the name of the histone, the residue on which it is marked, and the type of mark. Thus, H3K27Ac is histone H3 that is acetylated on lysine 27, while H3K79me2 is histone H3 that is dimethylated on lysine 79. Different histone marks are associated with different types of chromatin structure. Some are more likely found near enhancers and others near promoters and, while some increase expression of nearby genes, others decrease it. For example, H3K4me3 is associated with active promoters, and H3K27me3 is associated with developmentally controlled repressive chromatin states.
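The naming convention described above can be made concrete with a small parser. The following is a minimal sketch in Python, not part of any browser tool: the function name `parse_histone_mark`, the `HistoneMark` tuple, and the abbreviation tables are all hypothetical and cover only the residues and modifications mentioned in this box.

```python
import re
from typing import NamedTuple

# One-letter codes for the residues named in the text (illustrative, not exhaustive).
RESIDUES = {"K": "lysine", "R": "arginine", "S": "serine", "T": "threonine"}

# Two-letter abbreviations assumed here for the modification types listed in the text.
MODIFICATIONS = {
    "ac": "acetylation",
    "me": "methylation",
    "ph": "phosphorylation",
    "ub": "ubiquitylation",
    "su": "sumoylation",
}

# A trailing digit after "me" gives the number of methyl groups.
DEGREES = {"1": "mono", "2": "di", "3": "tri"}


class HistoneMark(NamedTuple):
    histone: str       # e.g. "H3"
    residue: str       # e.g. "lysine"
    position: int      # e.g. 27
    modification: str  # e.g. "acetylation"


# histone name, residue letter, residue number, mark abbreviation, optional degree
_PATTERN = re.compile(r"^(H\d[A-Z]?)([KRST])(\d+)([A-Za-z]{2})([123])?$")


def parse_histone_mark(name: str) -> HistoneMark:
    """Split a mark name such as 'H3K27Ac' into its three components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised histone mark name: {name!r}")
    histone, residue, position, mod, degree = match.groups()
    mod_key = mod.lower()  # the text writes both 'Ac' and 'me', so ignore case
    if mod_key not in MODIFICATIONS:
        raise ValueError(f"unknown modification abbreviation: {mod!r}")
    if degree is not None and mod_key != "me":
        raise ValueError("a degree suffix is only handled for methylation here")
    modification = MODIFICATIONS[mod_key]
    if degree is not None:
        modification = DEGREES[degree] + modification
    return HistoneMark(histone, RESIDUES[residue], int(position), modification)
```

For example, `parse_histone_mark("H3K79me2")` yields histone `H3`, residue `lysine`, position `79`, and modification `dimethylation`, matching the reading given in the text.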
The DNase Clusters track depicts regions where chromatin is hypersensitive to cutting by the DNaseI enzyme. In these hypersensitive regions, the nucleosome structure is less compacted, meaning that the DNA is available to bind transcription factors. Thus, regulatory regions, especially promoters, tend to be DNase sensitive. The track settings for the ENCODE Regulation super-track allow other ENCODE tracks to be added to the browser window, including additional histone modification and DNaseI hypersensitivity data. Changing the display of the H3K4Me3 peaks from hide to full highlights the peaks in the H3K4Me3 track near the 5′ ends of the HIF1A and SNAPC1 transcripts that overlap with DNase hypersensitive sites (Figure 4.7, blue highlights). These peaks may represent promoter elements that regulate the start of transcription.
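The visual comparison of H3K4Me3 peaks with DNase hypersensitive sites amounts to an interval-overlap test. The sketch below illustrates that idea only; it is not how the Genome Browser itself works. The function names are hypothetical, intervals are assumed to be half-open (start, end) pairs on the same chromosome, and any coordinates passed in would be the reader's own.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, same chromosome assumed


def overlaps(a: Interval, b: Interval) -> bool:
    """Two half-open intervals overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def peaks_in_open_chromatin(
    peaks: List[Interval], dnase_sites: List[Interval]
) -> List[Interval]:
    """Return the histone-mark peaks that overlap at least one DNase HS site."""
    return [p for p in peaks if any(overlaps(p, s) for s in dnase_sites)]
```

Peaks returned by `peaks_in_open_chromatin` are the candidates that, as in Figure 4.7, carry a promoter-associated mark and also lie in accessible chromatin. A simple pairwise scan is adequate for the handful of features in one browser window; genome-wide comparisons would normally use sorted intervals or an interval tree.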
The UCSC Genome Browser displays data from NCBI's Single Nucleotide Polymorphism Database (dbSNP) in four SNP tracks. Common SNPs contains SNPs and small insertions and deletions (indels) from NCBI's dbSNP that have a minor allele frequency of at least 1% and are mapped to a single location in the genome. Researchers looking for disease-causing SNPs can use this track to filter their data, hypothesizing that their variant of interest will be rare and therefore not displayed in this track. Flagged SNPs are those that are deemed by NCBI to be clinically associated, while Mult. SNPs have been mapped to more than one region in the genome. NCBI filters out most multiple-mapping SNPs as they may not be true SNPs, so there are not many variants in this track. All SNPs includes all SNPs from the three subcategories. dbSNP is in a continuous state of growth, and new data are incorporated a few times each year as a new release, or new build, of dbSNP. These four SNP tracks are available for a few of the most recent builds of dbSNP, indicated by the number in the track name. Thus, for example, Common SNPs (150) are SNPs found in ≥1% of samples from dbSNP build 150.
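The filtering strategy described above, in which variants present in the Common SNPs track are set aside, can be sketched as a threshold on minor allele frequency. This is an illustration only: the function names, the dictionary layout, and the variant identifiers are hypothetical, and real analyses would read frequencies from dbSNP rather than from a hand-written table.

```python
from typing import Dict, List

COMMON_MAF_THRESHOLD = 0.01  # "at least 1%", as stated for the Common SNPs track


def is_common(minor_allele_frequency: float) -> bool:
    """A variant counts as common when its minor allele frequency is >= 1%."""
    return minor_allele_frequency >= COMMON_MAF_THRESHOLD


def rare_candidates(variant_mafs: Dict[str, float]) -> List[str]:
    """Keep variants that would NOT appear in the Common SNPs track."""
    return [name for name, maf in variant_mafs.items() if not is_common(maf)]


# Hypothetical variants and frequencies, for illustration only.
example_variants = {"variant_A": 0.25, "variant_B": 0.01, "variant_C": 0.002}
candidates = rare_candidates(example_variants)
```

Here only `variant_C` remains as a candidate, because a frequency of exactly 1% still meets the "at least 1%" criterion.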
By default, the Common SNPs (150) track is displayed in dense mode, with all variants in the region compressed onto a single line. Variants in the Common SNPs track are color coded by function. Open the Track Settings for this track in order to modify the display (Figure 4.8). Set the Display mode to pack in order to show each variant separately. At the same time, modify the Coloring Options so that SNPs in UTRs of transcripts are set to blue and SNPs in coding regions of transcripts are set to green if they are synonymous (no change to the protein sequence) or red if they are non-synonymous (altering the protein sequence), with all remaining classes of SNPs set to display in black. Note the changes in the resulting browser window, with the green synonymous and blue untranslated SNPs clearly visible (Figure 4.9).
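The coloring scheme chosen in this step is effectively a lookup table from functional class to color, with black as the fallback. A minimal sketch follows; the class labels are taken from the wording of this section rather than from the browser's internal identifiers, and the function name is hypothetical.

```python
# Functional class -> display color, as configured in the text.
SNP_COLORS = {
    "untranslated": "blue",
    "coding-synonymous": "green",
    "coding-non-synonymous": "red",
}

DEFAULT_COLOR = "black"  # all remaining classes of SNPs


def snp_color(functional_class: str) -> str:
    """Return the display color for a SNP, defaulting to black."""
    return SNP_COLORS.get(functional_class, DEFAULT_COLOR)
```

Any class not listed, such as an intronic variant, falls through to black, which is why the blue and green SNPs stand out in the resulting display.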
Figure 4.7 The genomic context of the human HIF1A gene, after changing the display of the H3K4Me3 peaks from hide to full. The H3K4Me3 track is part of the ENCODE Regulation super-track. Below the graphic display window in Figure 4.5, open up the ENCODE Regulation Super-track, in the Regulation menu. Change the track display from hide to full to reproduce the page shown here. Note that the H3K4Me3 peaks, which can indicate promoter regions (Box 4.3), overlap with the transcription starts of the SNAPC1 and HIF1A genes (light blue highlight). These regions also overlap with the DNase HS track, indicating that the chromatin should be available to bind transcription factors in this region. The highlights were added within the Genome Browser using the Drag-and-select tool. This tool is accessed by clicking anywhere in the Scale track at the top of the Genome Browser display and dragging the selection window across a region of interest. The Drag-and-select tool provides options to Highlight the selected region or Zoom directly to it.
Figure 4.8 Configuring the track settings for the Common SNPs (150) track. Set the Coloring Options so that all SNPs are black, except for untranslated SNPs (blue), coding-synonymous SNPs (green), and coding-non-synonymous SNPs (red). In addition, change the Display mode of the track from dense to pack so that the individual SNPs can be seen. By default, the function of each variant is defined by its position within transcripts in the GENCODE track. However, the track used for annotation can be changed in the settings called Use Gene Tracks for Functional Annotation.
Figure 4.9 The genomic context of the human HIF1A gene, after changing the colors and display mode of the Common SNPs (150) track as shown in Figure 4.8. The SNPs in the 5′ and 3′ untranslated regions of the HIF1A GENCODE transcripts are now colored blue, while the coding-synonymous SNP is colored green.
Two types of Expression tracks display data from the NIH Genotype-Tissue Expression (GTEx) project (GTEx Consortium 2015). The GTEx Gene track displays gene expression levels in 51 tissues and two cell lines, based on RNA-seq data from 8555 samples. The GTEx Transcript track provides additional analysis of the same data and displays median transcript expression levels. By default, the GTEx Gene track is shown in pack mode, while the GTEx Transcript track is hidden. Figure 4.10 shows the Gene track in pack display mode, in the region of the phenylalanine hydroxylase (PAH) gene. The height of each bar in the bar graph represents the median expression level of the gene across all samples for a tissue, and the bar color indicates the tissue. The PAH gene is highly expressed in kidney and liver (the two brown bars). The expression is more clearly visible in the details page for the GTEx track (Figure 4.10, inset, purple box). The GTEx Transcript track is similar, but depicts expression for individual transcripts rather than an average for the gene.
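Each bar in the GTEx Gene track summarizes many samples as a single median per tissue. The sketch below shows that summary step under stated assumptions: the function names are hypothetical, the expression values are invented and in arbitrary units, and the real track is built from the GTEx project's own processed data.

```python
from statistics import median
from typing import Dict, List


def median_by_tissue(samples: Dict[str, List[float]]) -> Dict[str, float]:
    """Collapse per-sample expression values to one median per tissue."""
    return {tissue: median(values) for tissue, values in samples.items()}


def top_tissues(medians: Dict[str, float], n: int = 2) -> List[str]:
    """Return the n tissues with the highest median expression."""
    return sorted(medians, key=lambda tissue: medians[tissue], reverse=True)[:n]


# Invented expression values (arbitrary units), for illustration only.
example_samples = {
    "liver": [40.0, 55.0, 47.0],
    "kidney": [30.0, 36.0, 33.0],
    "lung": [0.1, 0.0, 0.3],
}
example_medians = median_by_tissue(example_samples)
highest = top_tissues(example_medians)
```

With these invented numbers, the two highest medians belong to liver and kidney, mirroring the pattern the text describes for PAH. The median is used rather than the mean so that a few outlying samples do not dominate the bar height.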
An alternate entry point to the UCSC Genome Browser is via a BLAT search (see Chapter 3), where a user can input a nucleotide or protein sequence to find an aligned region in a selected genome. BLAT excels at quickly identifying a matching sequence in the same or a highly similar organism. We will attempt to use BLAT to find a lizard homolog of the human gene disintegrin and metalloproteinase domain-containing protein 18 (ADAM18). The ADAM18 protein sequence is copied in FASTA format from the NCBI view of accession number NP_001307242.1 and pasted into the BLAT Search box that can be accessed from the Tools pull-down menu; the method for retrieving this sequence in the correct format is described in Chapter 2. Select the lizard genome and assembly AnoCar2.0/anoCar2. BLAT will automatically determine that the query sequence is a protein and will compare it with the lizard genome translated in all six reading frames. A single result is returned (Figure 4.11a). The alignment between the ADAM18 protein sequence and lizard chromosome Un_GL343418 runs from amino acid 368 to amino acid 383, with 81.3% identity. The browser link depicts the genomic context of this 48 nt hit (Figure 4.11b). Although the ADAM18 protein sequence aligns to a region in which other human ADAM genes have also been aligned, the other human genes are represented by a thin line, indicating a gap in their alignment. The details link shown in Figure 4.11a produces the alignment between the ADAM18 protein and lizard chromosome Un_GL343418 (Figure 4.11c). The top section of the results shows the protein query sequence, with the blue letters indicating the short region of alignment with the genome. The bottom section shows the pairwise alignment between the protein and genomic sequence translated in six frames. Vertical black lines indicate identical sequences. Taken together, the BLAT results show that only 16 amino acids of the 715 amino acid ADAM18 protein align to the lizard genome (Figure 4.11c).
This alignment is short and likely does not represent a homologous region between the ADAM18 protein and the lizard genome. Thus, the BLAT algorithm, although fast, is not always sensitive enough to detect cross-species orthologs. The BLAST algorithm, described in the Ensembl Genome Browser section, is more sensitive, and is a better choice for identifying such homologs.
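The numbers quoted for this BLAT hit are related by simple arithmetic, which also makes clear why the hit is considered weak. The sketch below recomputes them from the values given in the text (query positions 368 to 383, a 715 amino acid query, 81.3% identity). The function names are hypothetical, and the count of identical residues is inferred from the reported percentage rather than stated in the text.

```python
def aligned_length(query_start: int, query_end: int) -> int:
    """Number of residues in an alignment with inclusive start and end positions."""
    return query_end - query_start + 1


def query_coverage(aligned: int, query_size: int) -> float:
    """Fraction of the query sequence that takes part in the alignment."""
    return aligned / query_size


residues = aligned_length(368, 383)        # 16 amino acids
nucleotides = residues * 3                 # 48 nt, three bases per codon
coverage = query_coverage(residues, 715)   # roughly 0.022, i.e. about 2.2%

# Inferred, not stated in the text: 81.3% of 16 residues is about 13 matches.
approx_identical = round(0.813 * residues)
```

An alignment that covers only about 2% of the query, with roughly 13 matching residues, is consistent with the conclusion that the hit is unlikely to represent a true homolog.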
Figure 4.10 The GTEx Gene track, which depicts median gene expression levels in 51 tissues and two cell lines, based on RNA-seq data from the GTEx project from 8555 tissue samples. The main browser window depicts the GTEx Gene track for the human PAH gene, showing high expression in the two tissues colored brown (liver and kidney) but low or no expression in others. Clicking on the GTEx track opens it in a larger window, shown in the inset.
Figure 4.11 BLAT search at the UCSC Genome Browser. (a) This page shows the results of running a BLAT search against the lizard genome, using as a query the human protein sequence of the gene ADAM18, accession NP_001307242.1. The ADAM18 protein sequence is available from NCBI at www.ncbi.nlm.nih.gov/protein/NP_001307242.1?report=fasta. At the UCSC Genome Browser, the web interface to the BLAT search is in the Tools menu at the top of each page. The BLAT search was run against the lizard genome assembly from May 2010, also called anoCar2. The columns on the results page are as follows: ACTIONS, links to the browser (Figure 4.11b) and details (Figure 4.11c); QUERY, the name of the query sequence; SCORE, the BLAT score, determined by the number of matches vs. mismatches in the final alignment of the query to the genome; START, the start coordinate of the alignment, on the query sequence; END, the end coordinate of the alignment, on the query sequence; QSIZE, the length of the query; IDENTITY, the percent identity between the query and the genomic sequences; CHRO, the chromosome to which the query sequence aligns; STRAND, the chromosome strand to which the query sequence aligns; START, the start coordinate of the alignment, on the genomic sequence; END, the end coordinate of the alignment, on the genomic sequence; and SPAN, the length of the alignment, on the genomic sequence. Note that, in this example, there is a single alignment; searches with other sequences may result in many alignments, each shown on a separate line. It is possible to search with up to 25 sequences at a time, but each sequence must be in FASTA format. (b) This page shows the browser link from the BLAT summary page. The alignment between the query and genome is shown as a new track called Your Sequence from BLAT Search. (c) The details link from the BLAT summary page, showing the alignment between the query (human ADAM18 protein) and the lizard genome, translated in six frames.
The protein query sequence is shown at the top, with the blue letters indicating the amino acids that align to the genome. The bottom section shows the pairwise alignment between the protein and genomic sequence translated in six frames. Black lines indicate identical sequences; red and green letters indicate where the genomic sequence encodes a different amino acid. Although the ADAM18 protein sequence has a length of 715 amino acids, only 16 amino acids align as a single block to the lizard genome.