Article · Wikipedia archive · Last revised May 26, 2026

GenBank

The GenBank sequence database is an open access, annotated collection of publicly available nucleotide sequences and their protein translations. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC) and is produced and maintained by the National Center for Biotechnology Information (NCBI), a division of the United States National Library of Medicine (NLM), part of the National Institutes of Health (NIH).

GenBank

Content
  Description: Annotated collection of publicly available nucleotide sequences and their protein translations.
  Data types captured: Nucleotide sequence, protein sequence
  Organisms: All
Contact
  Research center: National Center for Biotechnology Information (NCBI), United States National Library of Medicine (NLM), National Institutes of Health (NIH)
  Primary citation: PMID 39558184
  Release date: 1982
Access
  Data format: GenBank format
  Website: https://www.ncbi.nlm.nih.gov/genbank/
  Download URL: https://ftp.ncbi.nih.gov/genbank/
  Web service URL: E-utilities
Tools
  Web: Search, BLAST
Miscellaneous
  License: Public domain (with submitter caveats)1

As of GenBank release 271.0 (April 2026), the database contained 53.90 trillion bases in 6.27 billion sequence records; of these, the traditional GenBank divisions held 261,460,182 entries totaling 7,289,942,983,522 base pairs.2 It includes sequences from more than 581,000 formally described species.3 Established in 1982 by Walter Goad and the Los Alamos National Laboratory, GenBank has become a central resource for biological research. It is built from direct submissions by individual laboratories as well as bulk submissions from large-scale sequencing projects.3

Submissions

GenBank accepts nucleotide sequence submissions from individual laboratories and large-scale sequencing projects. Direct submissions are made using the NCBI Submission Portal4, which provides guided workflows for data submission, or programmatically using tools such as table2asn. The legacy BankIt submission system is being phased out in favor of the Submission Portal.5

Upon receipt of a submission, GenBank staff review the data for completeness, biological context, and consistency, assign an accession number, and perform quality assurance checks before release to the public database. Submitted sequences are accessible through Entrez and are available for download via FTP.
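As an illustration of programmatic access, the following minimal sketch builds a request URL for the E-utilities `efetch` service and, optionally, downloads one record in GenBank flat-file format. The endpoint and the parameter names (`db`, `id`, `rettype`, `retmode`) are taken from NCBI's public E-utilities documentation rather than from this article, and the accession `U49845` is used purely as an example value; the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the E-utilities efetch service (assumption: taken from NCBI's
# public documentation, not from this article).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, rettype: str = "gb") -> str:
    """Return an efetch URL for one nucleotide record.

    rettype="gb" requests the GenBank flat-file format; "fasta" requests
    the sequence only.
    """
    params = {
        "db": "nuccore",      # the Entrez nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_record(accession: str, rettype: str = "gb") -> str:
    """Download one record as text. Requires network access."""
    with urlopen(build_efetch_url(accession, rettype), timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `build_efetch_url("U49845")` only constructs the URL; `fetch_record` performs the actual request and should be used sparingly, in line with NCBI's usage guidelines.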

GenBank supports a variety of submission types, including whole genome shotgun (WGS) assemblies, transcriptome shotgun assemblies (TSA), targeted locus studies (TLS), and high-throughput genomic (HTGS) sequences. Third Party Annotation (TPA) records allow the publication of annotations based on sequences already present in GenBank. Raw sequence reads generated by next-generation sequencing technologies are deposited in the Sequence Read Archive (SRA), rather than in GenBank itself.6

History

Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory (LANL) and colleagues established the Los Alamos Sequence Database in 1979, which culminated in the creation of the public GenBank database in 1982.7 Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences had been collected. An early description of the database was published in 1985, when GenBank contained over five million bases across approximately 6,000 sequence entries.8

During the late 1980s and early 1990s, responsibility for GenBank passed from Los Alamos National Laboratory to the newly established National Center for Biotechnology Information (NCBI) at the United States National Library of Medicine (NLM), part of the National Institutes of Health (NIH). GenBank release documentation from this period reflects a shift from joint contributions by LANL-based database staff and NLM-based indexing teams toward full management by NCBI, indicating a phased transfer of data curation and operational responsibilities.9

Figure: GenBank and EMBL: Nucleotide Sequences 1986/1987, Volumes I to VII.
Figure: CD-ROM of GenBank v100.

Growth

Figure: Growth in GenBank base pairs, 1982 to 2018, on a semi-log scale.

GenBank has grown substantially since its inception. Early analyses and release notes have described this growth as approximating a doubling in the number of bases every 18 months, although growth rates have varied over time with changes in sequencing technologies and data submission practices.2
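The doubling-time description corresponds to simple exponential growth, N(t) = N0 · 2^(t/T), where T is the doubling time. The sketch below shows that relationship and its inverse (estimating T from two observed sizes). It is an idealized model under the assumption of a constant doubling time, which the text above notes does not hold exactly; the function names are illustrative.

```python
import math


def projected_bases(initial_bases: float, years: float,
                    doubling_months: float = 18.0) -> float:
    """Project database size assuming a constant doubling time."""
    doublings = years * 12 / doubling_months
    return initial_bases * 2 ** doublings


def implied_doubling_months(start_bases: float, end_bases: float,
                            years: float) -> float:
    """Estimate the average doubling time (in months) between two sizes."""
    doublings = math.log2(end_bases / start_bases)
    return years * 12 / doublings
```

For example, `implied_doubling_months` can be applied to any two release sizes to check how closely a given period followed the 18-month approximation.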

As of GenBank release 271.0 (April 2026), the database contained over 53.9 trillion bases and 6.27 billion sequence records, reflecting continued large-scale contributions from high-throughput sequencing projects.2

Although GenBank release statistics summarize the traditional GenBank divisions, several large bulk-oriented sequence collections are tracked separately. Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA), and Targeted Locus Study (TLS) records are processed by GenBank but are not distributed as part of regular GenBank release files; instead, they are made available continuously through separate per-project FTP areas.2

For release 271.0, most sequence data were in WGS records rather than traditional GenBank entries: WGS accounted for roughly 85% of all bases.

Major GenBank-associated sequence collections (release 271.0)2

| Collection | Entries | Bases |
|---|---|---|
| Traditional GenBank entries | 261,460,182 | 7.289942983522×10^12 |
| Whole Genome Shotgun (WGS) | 4,756,526,485 | 4.5628511953497×10^13 |
| Transcriptome Shotgun Assembly (TSA) | 1,058,643,373 | 8.98651493967×10^11 |
| Targeted Locus Study (TLS) | 191,365,090 | 7.9162820303×10^10 |
| Total | 6,267,995,130 | 5.3896269141289×10^13 |
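The relative size of each collection can be worked out directly from the base counts in the table. The following sketch copies the four per-collection values from the table and computes each collection's share of the bases; shares are taken against the sum of the four rows, and the variable and function names are illustrative.

```python
# Base counts per collection, copied from the release 271.0 table above.
COLLECTION_BASES = {
    "Traditional GenBank entries": 7_289_942_983_522,
    "Whole Genome Shotgun (WGS)": 45_628_511_953_497,
    "Transcriptome Shotgun Assembly (TSA)": 898_651_493_967,
    "Targeted Locus Study (TLS)": 79_162_820_303,
}


def base_shares(bases_by_collection: dict[str, int]) -> dict[str, float]:
    """Return each collection's fraction of the summed base count."""
    total = sum(bases_by_collection.values())
    return {name: bases / total for name, bases in bases_by_collection.items()}


SHARES = base_shares(COLLECTION_BASES)

for name, share in SHARES.items():
    # Format each share as a percentage with one decimal place.
    print(f"{name}: {share:.1%}")
```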

The following table lists the twenty organisms with the largest number of bases in the traditional GenBank divisions for release 271.0; it excludes WGS, TSA, TLS, metagenomic, mitochondrial, chloroplast, synthetic construct, and several other categories defined in the release notes.2

Top 20 organisms in GenBank (Release 271.0)2

| Organism | Entries | Bases | NCBI Taxonomic ID (a) |
|---|---|---|---|
| Common wheat (Triticum aestivum) | 1,948,752 | 4.05961950617×10^11 | 4565 |
| Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) | 9,204,178 | 2.73571729605×10^11 | 2697049 |
| Oat (Avena sativa) | 30,555 | 2.73192060156×10^11 | 4497 |
| Barley (Hordeum vulgare) | 113,643 | 2.59319130177×10^11 | 4513 |
| Hordeum bulbosum | 753 | 2.12675690135×10^11 | 4514 |
| Cultivated barley (Hordeum vulgare subsp. vulgare) | 1,347,104 | 1.25991654054×10^11 | 112509 |
| Mistletoe (Viscum album) | 164 | 9.3011095388×10^10 | 3973 |
| Wild barley (Hordeum vulgare subsp. spontaneum) | 29,978 | 9.2980378923×10^10 | 4512 |
| House mouse (Mus musculus) | 10,118,677 | 4.6333339697×10^10 | 10090 |
| Human (Homo sapiens) | 28,773,499 | 3.8432937493×10^10 | 9606 |
| Escherichia coli | 195,553 | 3.771187484×10^10 | 562 |
| Eggplant (Solanum melongena) | 4,711 | 3.7408102376×10^10 | 4111 |
| Thale cress (Arabidopsis thaliana) | 2,643,074 | 3.3455227111×10^10 | 3702 |
| Cattle (Bos taurus) | 2,246,081 | 3.3006519943×10^10 | 9913 |
| Klebsiella pneumoniae | 43,595 | 2.6206281917×10^10 | 573 |
| Maize (Zea mays) | 4,224,726 | 2.4918520894×10^10 | 4577 |
| Broad bean (Vicia faba) | 28,783 | 2.3037489636×10^10 | 3906 |
| Palmate newt (Lissotriton helveticus) | 507 | 2.2852582653×10^10 | 8293 |
| Northern crested newt (Triturus cristatus) | 1,634 | 2.2052877731×10^10 | 8297 |
| Wild oat (Avena sterilis) | 248 | 2.1843250305×10^10 | 4499 |

Limitations

An analysis of GenBank and other services for the molecular identification of clinical blood culture isolates using 16S rRNA sequences10 showed that such analyses were more discriminative when GenBank was combined with other services, such as the EzTaxon-e11 and BIBI12 databases.

GenBank may contain sequences assigned to the wrong species because the organism was initially misidentified. A 2021 study found that 75% of the mitochondrial cytochrome c oxidase subunit I sequences attributed to the fish Nemipterus mesoprion were wrongly assigned, a result of continued reuse of sequences from initially misidentified individuals.13 The authors give recommendations on how to avoid further distribution of publicly available sequences with incorrect scientific names.

Numerous published studies have identified erroneous sequences in GenBank.14,15,16 These include not only incorrect species assignments (which can have different causes) but also chimeras and records with sequencing errors. A 2022 study of the quality of all cytochrome b records of birds further showed that 45% of the identified erroneous records lack a voucher specimen, which prevents a reassessment of the species identification.17

Another problem is that sequence records are often submitted as anonymous sequences without species names (e.g. as "Pelomedusa sp. A CK-2014") because the species are either unknown or withheld for publication purposes. Even after the species have been identified or published, these sequence records are not updated and thus may cause ongoing confusion.18
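Records of this kind can be flagged with a simple text check on the organism name. The sketch below is a hypothetical heuristic, not an NCBI tool or a method from the cited study: it treats any name containing the standalone token "sp." as a placeholder, which matches the example above but will not catch every naming pattern.

```python
import re

# Matches "sp." as a standalone token (e.g. "Pelomedusa sp. A CK-2014").
# The word boundary keeps rank abbreviations such as "subsp." from matching.
PLACEHOLDER_PATTERN = re.compile(r"\bsp\.(?:\s|$)")


def looks_unidentified(organism_name: str) -> bool:
    """Heuristically flag organism names that lack a species epithet."""
    return bool(PLACEHOLDER_PATTERN.search(organism_name))


def filter_unidentified(organism_names: list[str]) -> list[str]:
    """Return only the names that look like placeholders."""
    return [name for name in organism_names if looks_unidentified(name)]
```

A check like this only identifies candidates for review; deciding whether a record's species has since been described still requires consulting the literature or the NCBI taxonomy database.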

Notes

  a. Also known as NCBI TaxID

References

  1. "GenBank Overview". National Center for Biotechnology Information. Retrieved 2 May 2026. Quote: "Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted."
  2. "GenBank Release 271.0". National Center for Biotechnology Information. Retrieved 2 May 2026.
  3. Sayers, Eric W.; Cavanaugh, Mark; Frisse, Linda; Pruitt, Kim D.; Schneider, Valerie A.; Underwood, Beverly A.; Yankie, Linda; Karsch-Mizrachi, Ilene (2025). "GenBank 2025 update". Nucleic Acids Research. 53 (D1): D56–D61. doi:10.1093/nar/gkae1114. PMC 11701615. PMID 39558184.
  4. "Submission Portal GenBank". National Center for Biotechnology Information. Retrieved 2 May 2026.
  5. "How to submit data to GenBank". National Center for Biotechnology Information. Retrieved 2 May 2026.
  6. "GenBank Submission Types". National Center for Biotechnology Information. Retrieved 2 May 2026.
  7. Hanson, Todd (2000-11-21). "Walter Goad, GenBank founder, dies". Los Alamos National Laboratory. Archived from the original on 2008-11-07. Retrieved 2 May 2026.
  8. Burks, Charles; Fickett, James W.; Goad, Walter B.; Kanehisa, Minoru; Lewitter, Fran I.; Rindone, William P.; Swindell, Charles D.; Tung, Ching-Shien; Bilofsky, Howard S. (1985). "The GenBank nucleic acid sequence database". Computer Applications in the Biosciences. 1 (4): 225–233. PMID 3880345.
  9. Benton, David (1990). "Recent changes in the GenBank On-line Service". Nucleic Acids Research. 18 (6): 1517–1520. doi:10.1093/nar/18.6.1517. PMC 330520. PMID 2326192.
  10. Kyung Sun Park; Chang-Seok Ki; Cheol-In Kang; Yae-Jean Kim; Doo Ryeon Chung; Kyong Ran Peck; Jae-Hoon Song; Nam Yong Lee (May 2012). "Evaluation of the GenBank, EzTaxon, and BIBI Services for Molecular Identification of Clinical Blood Culture Isolates That Were Unidentifiable or Misidentified by Conventional Methods". J. Clin. Microbiol. 50 (5): 1792–1795. doi:10.1128/JCM.00081-12. PMC 3347139. PMID 22403421.
  11. EzTaxon-e Database eztaxon-e.ezbiocloud.net (archive accessed 25 March 2021)
  12. leBIBI V5 pbil.univ-lyon1.fr (archive accessed 25 March 2021)
  13. Ogwang, Joel; Bariche, Michel; Bos, Arthur R. (2021). "Genetic diversity and phylogenetic relationships of threadfin breams (Nemipterus spp.) from the Red Sea and eastern Mediterranean Sea". Genome. 64 (3): 207–216. doi:10.1139/gen-2019-0163. PMID 32678985.
  14. van den Burg, Matthijs P.; Herrando-Pérez, Salvador; Vieites, David R. (13 August 2020). "ACDC, a global database of amphibian cytochrome-b sequences using reproducible curation for GenBank records". Scientific Data. 7 (1): 268. Bibcode:2020NatSD...7..268V. doi:10.1038/s41597-020-00598-9. eISSN 2052-4463. PMC 7426930. PMID 32792559.
  15. Li, Xiaobing; Shen, Xuejuan; Chen, Xiao; Xiang, Dan; Murphy, Robert W.; Shen, Yongyi (6 February 2018). "Detection of Potential Problematic Cytb Gene Sequences of Fishes in GenBank". Frontiers in Genetics. 9: 30. doi:10.3389/fgene.2018.00030. eISSN 1664-8021. PMC 5808227. PMID 29467794.
  16. Heller, Philip; Casaletto, James; Ruiz, Gregory; Geller, Jonathan (7 August 2018). "A database of metazoan cytochrome c oxidase subunit I gene sequences derived from GenBank with CO-ARBitrator". Scientific Data. 5 (1). Bibcode:2018NatSD...580156H. doi:10.1038/sdata.2018.156. eISSN 2052-4463. PMC 6080493. PMID 30084847.
  17. van den Burg, Matthijs P.; Vieites, David R. (22 September 2022). "Bird genetic databases need improved curation and error reporting to NCBI". Ibis. doi:10.1111/ibi.13143. eISSN 1474-919X. hdl:10261/282622. ISSN 0019-1019.
  18. Garg, Akhil; Leipe, Detlef; Uetz, Peter (2019-12-10). "The disconnect between DNA and species names: lessons from reptile species in the NCBI taxonomy database". Zootaxa. 4706 (3): 401–407. doi:10.11646/zootaxa.4706.3.1. ISSN 1175-5334.


External links