Additional information can be included on the definition line to make this simple format a bit more informative, as follows.
>ENA|U54469|U54469.1 Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene, complete cds, alternatively spliced.
This modified FASTA definition line now has information on the source database (ENA), its accession.version number (U54469.1), and a short description of what biological entity is represented by the sequence.
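The structure of this definition line can be illustrated with a short Python sketch. It assumes the identifier always has exactly three fields separated by vertical bars and that the description begins after the first space. The function and class names (`parse_defline`, `FastaDefline`) are hypothetical, chosen only for illustration, and other databases may lay out their definition lines differently.

```python
from typing import NamedTuple


class FastaDefline(NamedTuple):
    """Fields of a 'db|accession|accession.version description' line."""
    database: str
    accession: str
    accession_version: str
    description: str


def parse_defline(line: str) -> FastaDefline:
    """Split a FASTA definition line of the form shown above into its parts."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    # Everything up to the first space is the identifier; the rest is free text.
    identifier, _, description = line[1:].strip().partition(" ")
    parts = identifier.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected three '|'-separated fields, got {len(parts)}")
    database, accession, accession_version = parts
    return FastaDefline(database, accession, accession_version, description)


defline = parse_defline(
    ">ENA|U54469|U54469.1 Drosophila melanogaster eukaryotic initiation "
    "factor 4E (eIF4E) gene, complete cds, alternatively spliced."
)
print(defline.database, defline.accession_version)
```

Returning a named tuple keeps each field addressable by name, so downstream code does not need to remember the position of each field.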
Nucleotide Sequence Flatfiles: A Dissection
As flatfiles represent the elementary unit of information within sequence databases and facilitate the interchange of information between these databases, it is important to understand what each individual field within the flatfile represents and what kinds of information can be found in varying parts of the record. While there are minor differences in flatfile formats, they can all be separated into three major parts: the header, containing information and descriptors pertaining to the entire record; the feature table, which provides relevant annotations to the sequence; and the sequence itself.
The Header
The header is the most database-specific part of the record. Here, we will use the ENA version of the record for discussion (shown in its entirety in Appendix 1.1), with the corresponding DDBJ and GenBank versions of the header appearing in Appendix 1.2. The first line of the record provides basic identifying information about the sequence contained in the record, appropriately named the ID line; this corresponds to the LOCUS line in DDBJ/GenBank.
ID U54469; SV 1; linear; genomic DNA; STD; INV; 2881 BP.
The accession number is shown on the ID line, followed by its sequence version (here, the first version, or SV 1). As this is SV 1, this is equivalent to writing U54469.1, as described above. This is then followed by the topology of the DNA molecule (linear) and the molecule type (genomic DNA). The next element represents the ENA data class for this sequence (STD, denoting a “standard” annotated and assembled sequence). Data classes are used to group sequence records within functional divisions, enabling users to query specific subsets of the database. A description of these functional divisions can be found in Box 1.1. Finally, the ID line presents the taxonomic division for the sequence of interest (INV, for invertebrate; see Internet Resources) and its length (2881 base pairs). The accession number is also shown separately on the AC line that immediately follows the ID line.
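The fields of the ID line can be pulled apart programmatically, as in the following Python sketch. It assumes the seven semicolon-separated fields appear in exactly the order described above. The function name `parse_id_line` and the dictionary keys are hypothetical, and older or non-ENA records may not follow this layout.

```python
def parse_id_line(line: str) -> dict:
    """Split an ENA ID line into the seven fields described in the text."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Drop the 'ID' line code and the trailing period, then split on ';'.
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, found {len(fields)}")
    accession, sv, topology, molecule, data_class, division, length = fields
    version = int(sv.split()[1])        # "SV 1" -> 1
    length_bp = int(length.split()[0])  # "2881 BP" -> 2881
    return {
        "accession": accession,
        "sequence_version": version,
        "accession_version": f"{accession}.{version}",
        "topology": topology,
        "molecule_type": molecule,
        "data_class": data_class,
        "taxonomic_division": division,
        "length_bp": length_bp,
    }


record = parse_id_line("ID U54469; SV 1; linear; genomic DNA; STD; INV; 2881 BP.")
print(record["accession_version"], record["length_bp"])
```

Combining the accession number and sequence version reproduces the accession.version identifier used on the FASTA definition line.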
Box 1.1 Functional Divisions in Nucleotide Databases
The organization of nucleotide sequence records into discrete functional types provides a way for users to query specific subsets of the records within these databases. In addition, knowledge that a particular sequence is from a given technique-oriented database allows users to interpret the data from the proper biological point of view. Several of these divisions are described below, and examples of each of these functional divisions (called “data classes” by ENA) can be found by following the example links on the ENA Data Formats page listed in the Internet Resources section of this chapter.
CON | Constructed (or “contigged”) records of chromosomes, genomes, and other long DNA sequences resulting from whole-genome sequencing efforts. The records in this division do not contain sequence data; rather, they contain instructions for the assembly of sequence data found within multiple database records. |
EST | Expressed Sequence Tags. These records contain short (300–500 bp) single reads from mRNA (cDNA) that are usually produced in large numbers. ESTs represent a snapshot of what is expressed in a given tissue or at a given developmental stage. They represent tags – some coding, some not – of expression for a given cDNA library. |
GSS | Genome Survey Sequences. Similar to the EST division, except that the sequences are genomic in origin. The GSS division contains (but is not limited to) single-pass read genome survey sequences, bacterial artificial chromosome (BAC) or yeast artificial chromosome (YAC) ends, exon-trapped genomic sequences, and Alu polymerase chain reaction (PCR) sequences. |
HTG | High-Throughput Genome sequences. Unfinished DNA sequences generated by high-throughput sequencing centers, made available in an expedited fashion to the scientific community for homology and similarity searches. Entries in this division contain keywords indicating their phase within the sequencing process. Once finished, HTG sequences are moved into the appropriate database taxonomic division. |
STD | A record containing a standard, annotated, and assembled sequence. |
STS | Sequence-Tagged Sites. Short (200–500 bp) operationally unique sequences that identify a combination of primer pairs used in a PCR assay, generating a reagent that maps to a single position within the genome. The STS division is intended to facilitate cross-comparison of STSs with sequences in other divisions for the purpose of correlating map positions of anonymous sequences with known genes. |
WGS | Whole-Genome Shotgun sequences. Sequence data from projects using shotgun approaches that generate large numbers of short sequence reads that can then be assembled by computer algorithms into sequence contigs, higher-order scaffolds, and sometimes into near-chromosome- or chromosome-length sequences. |
Following the ID line are one or more date lines (denoted by DT), indicating when the entry was first created or last updated. For our sequence of interest, the entry was originally created on May 19, 1996 and was last updated in ENA on June 23, 2017:
DT 19-MAY-1996 (Rel. 47, Created)
DT 23-JUN-2017 (Rel. 133, Last updated, Version 5)
The release number in each line indicates the first quarterly release made after the entry was created or last updated. The version number for the entry appears on the second line and allows the user to determine easily whether they are looking at the most up-to-date record for a particular sequence. Please note that this is different from the accession.version format described above – while some element of the record may have changed, the sequence may have remained the same, so these two different types of version numbers may not always correspond to one another.
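To make the distinction between the release number and the entry version concrete, the following Python sketch extracts both from the two date lines. The regular expression is an assumption based only on the two DT lines shown here, and the names `DT_PATTERN` and `parse_dt_line` are hypothetical. Month names are mapped explicitly so that the result does not depend on the system locale.

```python
import re
from datetime import date

# Pattern inferred from the two DT lines shown in the text (an assumption).
DT_PATTERN = re.compile(
    r"^DT\s+(?P<date>\d{2}-[A-Z]{3}-\d{4})\s+"
    r"\(Rel\.\s+(?P<release>\d+),\s+(?P<event>Created|Last updated)"
    r"(?:,\s+Version\s+(?P<version>\d+))?\)$"
)

MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def parse_dt_line(line: str) -> dict:
    """Extract the date, release number, event, and entry version from a DT line."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized DT line: {line!r}")
    day, month, year = match["date"].split("-")
    version = match["version"]
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "release": int(match["release"]),
        "event": match["event"],
        # Only the 'Last updated' line carries an entry version.
        "entry_version": int(version) if version is not None else None,
    }


created = parse_dt_line("DT 19-MAY-1996 (Rel. 47, Created)")
updated = parse_dt_line("DT 23-JUN-2017 (Rel. 133, Last updated, Version 5)")
print(created["date"], updated["release"], updated["entry_version"])
```

Here the entry version (5) differs from the sequence version (SV 1) on the ID line, which illustrates why the two numbers should not be confused.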
The next part of the header contains the definition lines, providing a succinct description of the kinds of biological information contained within the record. The definition line (DE in ENA, DEFINITION in DDBJ/GenBank) takes the following form.
DE Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene,
DE complete cds, alternatively spliced.
Much care is taken in the generation of these definition lines and, although many of them can be generated automatically from other parts of the record, they are reviewed to ensure that consistency and richness of information are maintained. Obviously, it is quite impossible to capture all of the biology underlying a sequence in a single line of text, but that wealth of information will follow soon enough in downstream parts of the same record.
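Because a definition can wrap across several DE lines, software that reads flatfiles usually has to reassemble it. The Python sketch below shows one way to do this, assuming each continuation line repeats the two-letter line code and that the pieces can be rejoined with single spaces. The function name `join_lines_with_code` is hypothetical.

```python
def join_lines_with_code(lines: list[str], code: str) -> str:
    """Concatenate the text of all lines that begin with the given line code."""
    parts = []
    for line in lines:
        # Separate the leading line code from the rest of the line.
        fields = line.split(maxsplit=1)
        if len(fields) == 2 and fields[0] == code:
            parts.append(fields[1].strip())
    return " ".join(parts)


header_lines = [
    "ID U54469; SV 1; linear; genomic DNA; STD; INV; 2881 BP.",
    "DE Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene,",
    "DE complete cds, alternatively spliced.",
]
definition = join_lines_with_code(header_lines, "DE")
print(definition)
```

The reassembled text matches the description that follows the identifier on the FASTA definition line for the same record.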
Continuing down the flatfile record, one finds the full taxonomic information on the sequence of interest. The OS line (or SOURCE line in DDBJ/GenBank) provides the preferred scientific name of the organism from which the sequence was derived, followed by the common name of the organism in parentheses. The OC lines (or ORGANISM lines in DDBJ/GenBank) contain the complete taxonomic classification of the source organism. The classification is listed top-down,