Ensembl BioMart
The BioMart tool at Ensembl is akin to the Table Browser at UCSC, in that it provides a web-based interface through which to access the data underlying the Ensembl Genome Browser. Results are returned as text or HTML-formatted tables. Ensembl hosts several mart databases that are described in the online documentation. The Ensembl Genes database contains the Ensembl gene set and integrates Ensembl genes, transcripts, and proteins with a number of resources, including external references, protein domains, sequences, variants, and homology data. After choosing a Database (e.g. Genes) and Dataset (genome assembly, e.g. Homo sapiens), the user specifies the Filters (basically, the input data) and the Attributes (the output data). Users can choose from among seven types of filters, including Region and Gene. A Region could be a chromosomal position, while a Gene could be an accession number, gene name, or even microarray probeset. The list of possible Attributes is long, and includes Ensembl data such as gene and transcript identifiers and positions, links to external data sources including RefSeq, UCSC, Pfam (protein families), and Gene Ontology (GO) terms, as well as mapping to orthologs in the Ensembl genome databases.
In this example, we will identify the mouse orthologs of the human mRNA reference sequences that are associated with common diseases or traits. To do this, we will start with the output of the UCSC Table Browser, the mRNA reference sequences that overlap with a variant from the GWAS Catalog, pull out the corresponding Ensembl gene and transcript identifiers, and then link to the mouse orthologs. The initial step is to retrieve the RefSeq accession numbers that overlap with a variant from the GWAS Catalog by reproducing the search shown in Figure 4.12d, this time changing the output format to sequence. Copy and paste the output from the Table Browser into your favorite text editor to create a list that contains only the accession numbers. Note that BioMart does not accept the accession.version format used by NCBI, so an accession number like NM_001042682.1 would need to be rewritten as NM_001042682.
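The clean-up of the accession list can also be scripted rather than done by hand. The following is a minimal Python sketch, assuming the Table Browser output has already been reduced to one accession per line; the function name strip_versions is hypothetical, and the second accession in the usage example is included only for illustration.

```python
import re

# Matches a trailing ".<digits>" version suffix, e.g. the ".1" in NM_001042682.1
VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_versions(lines):
    """Return unversioned accessions, without duplicates, in first-seen order."""
    seen = set()
    result = []
    for line in lines:
        accession = VERSION_SUFFIX.sub("", line.strip())
        # Skip blank lines and accessions that have already been collected
        if accession and accession not in seen:
            seen.add(accession)
            result.append(accession)
    return result


# Illustrative input: a repeated accession, a second accession, and a blank line
accessions = strip_versions(["NM_001042682.1", "NM_001042682.1", "NM_000546.6", ""])
print("\n".join(accessions))
```

Removing duplicates at this stage is optional, but it keeps the list pasted into BioMart shorter.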
At BioMart, the first step is to enter these accession numbers as Filters into the Human Genes (GRCh38.p10) Dataset. RefSeq mRNA accession numbers are entered in the filter called Gene → Input external references ID list (Figure 4.22a). The Attributes could be the Ensembl Gene and Transcript identifiers, as well as the Gene name, in the Features → Gene → Ensembl section (Figure 4.22b). To correlate the output with the RefSeq accession numbers entered as Filters, it is necessary to also select the RefSeq accession as an attribute, in the Features → Gene → External References section (Figure 4.22c). After the Filters and Attributes have been set, click on the Results button in the upper left to return the BioMart output (Figure 4.22d). Data can be returned as a text file or as a formatted page in the web browser, with hyperlinks to Ensembl resources. Because of the differences in gene annotation strategies, the mapping of NCBI RefSeq accession numbers to Ensembl gene and transcript identifiers is not one to one; some RefSeq accessions map to more than one Ensembl gene and/or transcript, and some Ensembl genes map to more than one RefSeq identifier.
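The same Filters and Attributes can be expressed as a BioMart XML query, which the web interface displays through its XML button. The Python sketch below only builds the query text and does not contact any server. The internal names used here (hsapiens_gene_ensembl, refseq_mrna, ensembl_gene_id, ensembl_transcript_id, external_gene_name) are assumptions about how the current Ensembl release labels the dataset, filter, and attributes described above, and should be checked against the XML shown by the web interface; the function name build_biomart_query is hypothetical.

```python
from xml.sax.saxutils import quoteattr


def build_biomart_query(accessions, attributes,
                        dataset="hsapiens_gene_ensembl",  # assumed dataset name
                        filter_name="refseq_mrna"):       # assumed filter name
    """Build a BioMart XML query string; nothing is sent over the network."""
    # BioMart ID-list filters take a comma-separated list of values
    value = ",".join(accessions)
    attribute_lines = "\n".join(
        f"    <Attribute name={quoteattr(name)} />" for name in attributes
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<!DOCTYPE Query>\n"
        # uniqueRows="1" corresponds to the "Unique results only" check box
        '<Query virtualSchemaName="default" formatter="TSV" header="1" '
        'uniqueRows="1" datasetConfigVersion="0.6">\n'
        f'  <Dataset name={quoteattr(dataset)} interface="default">\n'
        f"    <Filter name={quoteattr(filter_name)} value={quoteattr(value)} />\n"
        f"{attribute_lines}\n"
        "  </Dataset>\n"
        "</Query>\n"
    )


# Assumed internal names for: Gene stable ID, Transcript stable ID,
# Gene name, and RefSeq mRNA ID
query = build_biomart_query(
    ["NM_001042682"],
    ["ensembl_gene_id", "ensembl_transcript_id", "external_gene_name", "refseq_mrna"],
)
print(query)
```

A query of this form is typically submitted as the query parameter of the Ensembl martservice URL, although for a one-off search the web interface is usually simpler.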
Figure 4.22 Using BioMart to retrieve the mouse orthologs of the human RefSeqs from the GWAS Catalog. (a) Enter the input RefSeq accession numbers into BioMart. First, create a list of RefSeq accession numbers from the UCSC Table Browser output in Figure 4.12d. BioMart does not accept the accession.version format, so all of the text after the accession number itself will need to be removed. This step can be implemented using a text editor that can perform a wildcard search and replace. For example, to remove the period and all following text from each line, replace the regular expression \..* (a literal period followed by any characters) with an empty string. Although the resulting list of accession numbers will contain duplicates, as some RefSeqs have been mapped to alternate loci, any redundancy will be removed from the final BioMart results. At BioMart, click on Filters in the left sidebar, open the Gene menu, and click on Input external references ID list. In the pull-down menu, select RefSeq mRNA IDs as the type of identifier. Paste in the list of accession numbers, which should be of the form NM_001042682. Although BioMart instructions recommend limiting the number of accession numbers to 500, the interface will process the 3000+ RefSeq accession numbers from the UCSC Table Browser output. (b) Set the BioMart Attributes (fields to be included in the output). Click on the Attributes in the left sidebar, select Features at the top of the page, then open the Gene menu. Gene stable ID and Transcript stable ID should be selected by default, and will return the Ensembl gene (ENSG) and transcript (ENST) identifiers. Also select Gene name to return the gene symbols (e.g. ADAM18). (c) Set additional Attributes. Close the Gene menu and open the External menu. Navigate to External References and select RefSeq mRNA ID. This step is needed to return the input RefSeq accession numbers so that they can be correlated later with the Ensembl identifiers. (d) BioMart output, including the identifiers requested above. Click on the Results button at the top of the page to retrieve the output. Check the box Unique results only to ensure that duplicated RefSeqs are returned only once. The order of the columns in the results file depends on the order in which the items were added to the list of Attributes. The net result is that each human RefSeq accession from the Table Browser is correlated with its Ensembl Gene and Transcript ID, as well as a gene symbol. (e) BioMart output, with human Ensembl Gene ID and gene symbol, as well as the orthologous mouse Ensembl Gene ID and gene symbol. Start a new query by clicking the New box at the top of the BioMart window. Select the same Database, Dataset, and Filters as before. Under Attributes, select the Homologues radio button. The human Ensembl Gene ID and gene symbol are in the Gene → Ensembl menu, called Gene stable ID and Gene name. The mouse Ensembl Gene ID and gene symbol are in the Orthologues → Mouse Orthologues menu, called Mouse gene stable ID and Mouse gene name. This step outputs the orthologous mouse Ensembl Gene ID and symbol for each human Ensembl Gene ID and symbol. The BioMart output from (d) and (e) can be merged to list the mouse ortholog of each human RefSeq from the GWAS Catalog (Figure 4.12d).
Retrieving the mouse orthologs of the NCBI reference sequences must be done as a separate step, as it is not possible to return an external identifier (i.e. the starting RefSeq accession number) and an ortholog in the same BioMart query. Starting with the same Filter and human RefSeq accession numbers as before, choose the Homologues section of the Attributes and select the human Ensembl gene identifier and gene name under Gene → Ensembl, as well as the mouse Ensembl gene identifier and gene name under Orthologues → Mouse Orthologues. The results are shown in Figure 4.22e. Note that not all of the human gene identifiers have been mapped to a corresponding mouse ortholog. The goal of this exercise was to identify the mouse orthologs of the human RefSeq accession numbers from the GWAS Catalog. Using the human Ensembl gene identifiers as a key, the human RefSeq accession numbers can be added to the list of mouse orthologs. This can be carried out by using the VLOOKUP function in Microsoft Excel, or by writing a script in your favorite programming language, and is left as an exercise for the reader.
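One possible scripted approach to this merge is sketched below in Python, using the human Ensembl gene identifier as the key, much as VLOOKUP would. It assumes that both BioMart results were exported as tab-separated text with the column headers named in Figure 4.22 (Gene stable ID, Gene name, RefSeq mRNA ID, Mouse gene stable ID, Mouse gene name). The function name merge_orthologs is hypothetical, and the identifiers in the usage example are made-up placeholders, not real records.

```python
import csv
import io
from collections import defaultdict


def merge_orthologs(refseq_tsv, ortholog_tsv):
    """Join two BioMart TSV exports on the human Ensembl gene ID."""
    # Map each human gene ID to its mouse orthologs (there may be several)
    orthologs = defaultdict(set)
    for row in csv.DictReader(io.StringIO(ortholog_tsv), delimiter="\t"):
        if row["Mouse gene stable ID"]:  # skip genes with no mapped ortholog
            orthologs[row["Gene stable ID"]].add(
                (row["Mouse gene stable ID"], row["Mouse gene name"])
            )

    # Attach the mouse orthologs to every RefSeq accession of the same gene;
    # a set removes rows that differ only in the transcript column
    merged = set()
    for row in csv.DictReader(io.StringIO(refseq_tsv), delimiter="\t"):
        for mouse_id, mouse_name in orthologs.get(row["Gene stable ID"], ()):
            merged.add((row["RefSeq mRNA ID"], row["Gene stable ID"],
                        row["Gene name"], mouse_id, mouse_name))
    return sorted(merged)


# Placeholder data only; real exports would be read from the saved files
refseq_table = (
    "Gene stable ID\tTranscript stable ID\tGene name\tRefSeq mRNA ID\n"
    "ENSG_A\tENST_A1\tGENE_A\tNM_A\n"
    "ENSG_B\tENST_B1\tGENE_B\tNM_B\n"
)
ortholog_table = (
    "Gene stable ID\tGene name\tMouse gene stable ID\tMouse gene name\n"
    "ENSG_A\tGENE_A\tENSMUSG_A\tGene_a\n"
    "ENSG_B\tGENE_B\t\t\n"
)
for record in merge_orthologs(refseq_table, ortholog_table):
    print("\t".join(record))
```

Because the RefSeq-to-Ensembl mapping is not one to one, the sketch keeps every pairing rather than choosing a single ortholog per accession; human genes without a mouse ortholog are omitted from the merged list.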
