| Content | |
|---|---|
| Description | Annotated collection of publicly available nucleotide sequences and their protein translations. |
| Data types captured | Nucleotide sequence, protein sequence |
| Organisms | All |
| Contact | |
| Research center | National Center for Biotechnology Information (NCBI), United States National Library of Medicine (NLM), National Institutes of Health (NIH) |
| Primary citation | PMID 39558184 |
| Release date | 1982 |
| Access | |
| Data format | GenBank format |
| Website | https://www.ncbi.nlm.nih.gov/genbank/ |
| Download URL | https://ftp.ncbi.nih.gov/genbank/ |
| Web service URL | E-utilities |
| Tools | |
| Web | Search, BLAST |
| Miscellaneous | |
| License | Public domain (with submitter caveats)1 |
The GenBank sequence database is an open access, annotated collection of publicly available nucleotide sequences and their protein translations. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC) and is produced and maintained by the National Center for Biotechnology Information (NCBI), a division of the United States National Library of Medicine (NLM), part of the National Institutes of Health (NIH).
As of GenBank release 271.0 (April 2026), the database contained 53.90 trillion bases and 6.27 billion sequence records, including 261,460,182 GenBank entries containing 7,289,942,983,522 base pairs of sequence data.2 The database includes sequences from more than 581,000 formally described species.3 The database was established in 1982 by Walter Goad and the Los Alamos National Laboratory and has become a central resource for biological research. GenBank is built from direct submissions by individual laboratories as well as bulk submissions from large-scale sequencing projects.3
Submissions
GenBank accepts nucleotide sequence submissions from individual laboratories and large-scale sequencing projects. Direct submissions are made using the NCBI Submission Portal4, which provides guided workflows for data submission, or programmatically using tools such as table2asn. The legacy BankIt submission system is being phased out in favor of the Submission Portal.5
Upon receipt of a submission, GenBank staff review the data for completeness, biological context, and consistency, assign an accession number, and perform quality assurance checks before release to the public database. Submitted sequences are accessible through Entrez and are available for download via FTP.
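As a minimal sketch of programmatic access, the following Python builds an E-utilities `efetch` request that returns a record in GenBank flat-file format. The helper names `genbank_efetch_url` and `fetch_genbank_record` are hypothetical, and the accession `U49845` is used only as an example. The sketch does not handle NCBI's usage guidelines (request rate limits, API keys), which a real client would need to respect.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the E-utilities efetch endpoint.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_efetch_url(accession: str) -> str:
    """Build an efetch URL for one nucleotide record in GenBank flat-file format."""
    params = {
        "db": "nuccore",     # Entrez nucleotide database
        "id": accession,     # accession number (or accession.version)
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


def fetch_genbank_record(accession: str, timeout: float = 30.0) -> str:
    """Download the record as text (hypothetical helper; performs a network request)."""
    with urlopen(genbank_efetch_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Print the request URL only; call fetch_genbank_record() to download.
    print(genbank_efetch_url("U49845"))
```

Building the URL separately from the download keeps the request easy to inspect or log before any network call is made.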
GenBank supports a variety of submission types, including whole genome shotgun (WGS) assemblies, transcriptome shotgun assemblies (TSA), targeted locus studies (TLS), and high-throughput genomic (HTGS) sequences. Third Party Annotation (TPA) records allow the publication of annotations based on sequences already present in GenBank. Raw sequence reads generated by next-generation sequencing technologies are deposited in the Sequence Read Archive (SRA), rather than in GenBank itself.6
History
Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory (LANL) and colleagues established the Los Alamos Sequence Database in 1979, which culminated in the creation of the public GenBank database in 1982.7 Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences had been collected. An early description of the database was published in 1985, when GenBank contained over five million bases across approximately 6,000 sequence entries.8
During the late 1980s and early 1990s, responsibility for GenBank transitioned from Los Alamos National Laboratory to the newly established National Center for Biotechnology Information (NCBI) at the United States National Library of Medicine (NLM), part of the National Institutes of Health (NIH). Contemporary GenBank release documentation from this period reflects a shift from joint contributions by LANL-based database staff and NLM-based indexing teams toward full management by NCBI, indicating a phased transfer of data curation and operational responsibilities.9


Growth

GenBank has grown substantially since its inception. Early analyses and release notes have described this growth as approximating a doubling in the number of bases every 18 months, although growth rates have varied over time with changes in sequencing technologies and data submission practices.2
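The doubling-time description can be checked with a short calculation. The sketch below assumes steady exponential growth between two data points and uses figures quoted in this article: about five million bases in 1985 and about 53.9 trillion bases in 2026. The function name `doubling_time_months` is hypothetical, and the 2026 figure includes WGS, TSA, and TLS records, so the result is only a rough long-run average.

```python
import math


def doubling_time_months(bases_start: float, bases_end: float, years_elapsed: float) -> float:
    """Average doubling time in months, assuming steady exponential growth."""
    # Number of times the database doubled between the two data points.
    doublings = math.log2(bases_end / bases_start)
    return 12.0 * years_elapsed / doublings


if __name__ == "__main__":
    # Figures taken from this article: ~5 million bases (1985), ~53.9 trillion (2026).
    months = doubling_time_months(5e6, 53.9e12, 2026 - 1985)
    print(f"Average doubling time: {months:.1f} months")
```

Under these assumptions the long-run average comes out at roughly 21 months, close to but somewhat slower than the 18-month figure, which is consistent with the statement that growth rates have varied over time.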
As of GenBank release 271.0 (April 2026), the database contained over 53.9 trillion bases and 6.27 billion sequence records, reflecting continued large-scale contributions from high-throughput sequencing projects.2
Although GenBank release statistics summarize the traditional GenBank divisions, several large bulk-oriented sequence collections are tracked separately. Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS) records are processed by GenBank but are not distributed as part of regular GenBank release files; instead, they are made available continuously through separate per-project FTP areas.2
For release 271.0, most sequence data were in WGS records rather than traditional GenBank entries.
| Collection | Entries | Bases |
|---|---|---|
| Traditional GenBank entries | 261,460,182 | 7.289942983522×10¹² |
| Whole Genome Shotgun (WGS) | 4,756,526,485 | 4.5628511953497×10¹³ |
| Transcriptome Shotgun Assembly (TSA) | 1,058,643,373 | 8.98651493967×10¹¹ |
| Targeted Locus Study (TLS) | 191,365,090 | 7.9162820303×10¹⁰ |
| Total | 6,267,995,130 | 5.3896269141289×10¹³ |
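The statement that most sequence data sit in WGS records can be illustrated by computing each collection's share of the total. The sketch below writes the base counts from the table as full integers, assuming the powers of ten are those consistent with the 53.9 trillion total quoted in the text. The names `BASES`, `TOTAL_BASES`, and `share_of_total` are hypothetical.

```python
# Base counts per collection for release 271.0, read from the table above.
# Assumption: exponents are chosen to agree with the 53.9 trillion total in the text.
BASES = {
    "Traditional GenBank entries": 7_289_942_983_522,
    "Whole Genome Shotgun (WGS)": 45_628_511_953_497,
    "Transcriptome Shotgun Assembly (TSA)": 898_651_493_967,
    "Targeted Locus Study (TLS)": 79_162_820_303,
}

# Value of the "Total" row in the table.
TOTAL_BASES = 53_896_269_141_289


def share_of_total(collection: str) -> float:
    """Fraction of all bases held in the named collection."""
    return BASES[collection] / TOTAL_BASES


if __name__ == "__main__":
    for name in BASES:
        print(f"{name}: {share_of_total(name):.1%}")
```

On these figures WGS accounts for roughly 85% of all bases, while traditional GenBank entries account for under 14%.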
The following table lists the twenty organisms with the largest number of bases in the traditional GenBank divisions for release 271.0; it excludes WGS, TSA, TLS, metagenomic, mitochondrial, chloroplast, synthetic construct, and several other categories defined in the release notes.2
| Organism | Entries | Bases | NCBI Taxonomic ID[a] |
|---|---|---|---|
| Common wheat (Triticum aestivum) | 1,948,752 | 4.05961950617×10¹¹ | 4565 |
| Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) | 9,204,178 | 2.73571729605×10¹¹ | 2697049 |
| Oat (Avena sativa) | 30,555 | 2.73192060156×10¹¹ | 4497 |
| Barley (Hordeum vulgare) | 113,643 | 2.59319130177×10¹¹ | 4513 |
| Hordeum bulbosum | 753 | 2.12675690135×10¹¹ | 4514 |
| Cultivated barley (Hordeum vulgare subsp. vulgare) | 1,347,104 | 1.25991654054×10¹¹ | 112509 |
| Mistletoe (Viscum album) | 164 | 9.3011095388×10¹⁰ | 3973 |
| Wild barley (Hordeum vulgare subsp. spontaneum) | 29,978 | 9.2980378923×10¹⁰ | 4512 |
| House mouse (Mus musculus) | 10,118,677 | 4.6333339697×10¹⁰ | 10090 |
| Human (Homo sapiens) | 28,773,499 | 3.8432937493×10¹⁰ | 9606 |
| Escherichia coli | 195,553 | 3.771187484×10¹⁰ | 562 |
| Eggplant (Solanum melongena) | 4,711 | 3.7408102376×10¹⁰ | 4111 |
| Thale cress (Arabidopsis thaliana) | 2,643,074 | 3.3455227111×10¹⁰ | 3702 |
| Cattle (Bos taurus) | 2,246,081 | 3.3006519943×10¹⁰ | 9913 |
| Klebsiella pneumoniae | 43,595 | 2.6206281917×10¹⁰ | 573 |
| Maize (Zea mays) | 4,224,726 | 2.4918520894×10¹⁰ | 4577 |
| Broad bean (Vicia faba) | 28,783 | 2.3037489636×10¹⁰ | 3906 |
| Palmate newt (Lissotriton helveticus) | 507 | 2.2852582653×10¹⁰ | 8293 |
| Northern crested newt (Triturus cristatus) | 1,634 | 2.2052877731×10¹⁰ | 8297 |
| Wild oat (Avena sterilis) | 248 | 2.1843250305×10¹⁰ | 4499 |
Limitations
An analysis of GenBank and other services for the molecular identification of clinical blood culture isolates using 16S rRNA sequences10 showed that such analyses were more discriminative when GenBank was combined with other services such as EzTaxon-e11 and the BIBI12 databases.
GenBank may contain sequences assigned to the wrong species because the initial identification of the organism was wrong. A 2021 study showed that 75% of the mitochondrial cytochrome c oxidase subunit I sequences attributed to the fish Nemipterus mesoprion were wrongly assigned, a result of the continued use of sequences from initially misidentified individuals.13 The authors provide recommendations on how to avoid further distribution of publicly available sequences with incorrect scientific names.
Numerous published studies have identified erroneous sequences in GenBank.14,15,16 These include not only incorrect species assignments (which can have different causes) but also chimeras and records with sequencing errors. A 2022 study of the quality of all cytochrome b records of birds further showed that 45% of the identified erroneous records lack a voucher specimen, which prevents reassessment of the species identification.17
Another problem is that sequence records are often submitted as anonymous sequences without species names (e.g. as "Pelomedusa sp. A CK-2014") because the species are either unknown or withheld for publication purposes. However, even after the species have been identified or published, these sequence records are not updated and thus may cause ongoing confusion.18
See also
- International Nucleotide Sequence Database Collaboration (INSDC)
- European Nucleotide Archive (ENA)
- DNA Data Bank of Japan (DDBJ)
- RefSeq — curated reference sequences derived from GenBank
- UniProt — protein sequence and annotation database
- Sequence analysis
- Open science data
Notes
- Also known as NCBI TaxID
References
- "GenBank Overview". National Center for Biotechnology Information. Retrieved 2 May 2026. "Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted."
- "GenBank Release 271.0". National Center for Biotechnology Information. Retrieved 2 May 2026.
- Sayers, Eric W.; Cavanaugh, Mark; Frisse, Linda; Pruitt, Kim D.; Schneider, Valerie A.; Underwood, Beverly A.; Yankie, Linda; Karsch-Mizrachi, Ilene (2025). "GenBank 2025 update". Nucleic Acids Research. 53 (D1): D56–D61. doi:10.1093/nar/gkae1114. PMC 11701615. PMID 39558184.
- "Submission Portal GenBank". National Center for Biotechnology Information. Retrieved 2 May 2026.
- "How to submit data to GenBank". National Center for Biotechnology Information. Retrieved 2 May 2026.
- "GenBank Submission Types". National Center for Biotechnology Information. Retrieved 2 May 2026.
- Hanson, Todd (2000-11-21). "Walter Goad, GenBank founder, dies". Los Alamos National Laboratory. Archived from the original on 2008-11-07. Retrieved 2 May 2026.
- Burks, Charles; Fickett, James W.; Goad, Walter B.; Kanehisa, Minoru; Lewitter, Fran I.; Rindone, William P.; Swindell, Charles D.; Tung, Ching-Shien; Bilofsky, Howard S. (1985). "The GenBank nucleic acid sequence database". Computer Applications in the Biosciences. 1 (4): 225–233. PMID 3880345.
- Benton, David (1990). "Recent changes in the GenBank On-line Service". Nucleic Acids Research. 18 (6): 1517–1520. doi:10.1093/nar/18.6.1517. PMC 330520. PMID 2326192.
- Kyung Sun Park; Chang-Seok Ki; Cheol-In Kang; Yae-Jean Kim; Doo Ryeon Chung; Kyong Ran Peck; Jae-Hoon Song; Nam Yong Lee (May 2012). "Evaluation of the GenBank, EzTaxon, and BIBI Services for Molecular Identification of Clinical Blood Culture Isolates That Were Unidentifiable or Misidentified by Conventional Methods". J. Clin. Microbiol. 50 (5): 1792–1795. doi:10.1128/JCM.00081-12. PMC 3347139. PMID 22403421.
- EzTaxon-e Database eztaxon-e.ezbiocloud.net (archive accessed 25 March 2021)
- leBIBI V5 pbil.univ-lyon1.fr (archive accessed 25 March 2021)
- Ogwang, Joel; Bariche, Michel; Bos, Arthur R. (2021). "Genetic diversity and phylogenetic relationships of threadfin breams (Nemipterus spp.) from the Red Sea and eastern Mediterranean Sea". Genome. 64 (3): 207–216. doi:10.1139/gen-2019-0163. PMID 32678985.
- van den Burg, Matthijs P.; Herrando-Pérez, Salvador; Vieites, David R. (13 August 2020). "ACDC, a global database of amphibian cytochrome-b sequences using reproducible curation for GenBank records". Scientific Data. 7 (1): 268. Bibcode:2020NatSD...7..268V. doi:10.1038/s41597-020-00598-9. eISSN 2052-4463. PMC 7426930. PMID 32792559.
- Li, Xiaobing; Shen, Xuejuan; Chen, Xiao; Xiang, Dan; Murphy, Robert W.; Shen, Yongyi (6 February 2018). "Detection of Potential Problematic Cytb Gene Sequences of Fishes in GenBank". Frontiers in Genetics. 9: 30. doi:10.3389/fgene.2018.00030. eISSN 1664-8021. PMC 5808227. PMID 29467794.
- Heller, Philip; Casaletto, James; Ruiz, Gregory; Geller, Jonathan (7 August 2018). "A database of metazoan cytochrome c oxidase subunit I gene sequences derived from GenBank with CO-ARBitrator". Scientific Data. 5 (1). Bibcode:2018NatSD...580156H. doi:10.1038/sdata.2018.156. eISSN 2052-4463. PMC 6080493. PMID 30084847.
- Van Den Burg, Matthijs P.; Vieites, David R. (22 September 2022). "Bird genetic databases need improved curation and error reporting to NCBI". Ibis. doi:10.1111/ibi.13143. eISSN 1474-919X. hdl:10261/282622. ISSN 0019-1019.
- Garg, Akhil; Leipe, Detlef; Uetz, Peter (2019-12-10). "The disconnect between DNA and species names: lessons from reptile species in the NCBI taxonomy database". Zootaxa. 4706 (3): 401–407. doi:10.11646/zootaxa.4706.3.1. ISSN 1175-5334.
- This article incorporates public domain material from NCBI Handbook. National Center for Biotechnology Information.