Where do the UniProtKB protein sequences come from?

Last modified December 6, 2019

More than 95% of the protein sequences provided by UniProtKB come from the translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources of the International Nucleotide Sequence Database Collaboration (INSDC). These CDS are either generated by gene prediction programs or experimentally proven. A protein identifier ("protein_id") is assigned to each translated CDS and can be found both in the original EMBL-Bank/GenBank/DDBJ record and in the relevant UniProtKB entry.
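As a rough illustration of what "translation of a CDS" means, the sketch below translates a nucleotide coding sequence with the standard genetic code. It is not the pipeline used by UniProtKB or the INSDC databases. The function name `translate_cds` and the example sequence are made up for illustration, and real records may specify a different genetic code (for example through the INSDC `/transl_table` qualifier), which this sketch ignores.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a CDS in frame 1, stopping at the first stop codon.

    Codons containing ambiguous bases are reported as 'X'.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # stop codon: end of the protein
            break
        protein.append(aa)
    return "".join(protein)


# Invented toy sequence: start codon, one alanine codon, stop codon.
print(translate_cds("ATGGCCTGA"))  # prints "MA"
```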

The translated CDS sequences are automatically transferred to the TrEMBL section of UniProtKB. TrEMBL records can then be selected for further manual annotation and integrated into the UniProtKB/Swiss-Prot section. The "protein_id" identifiers are listed in the cross-reference part of the 'Sequence' section of UniProtKB entries (see for example P13744 'Translation').
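For readers who work with downloaded entries rather than the web view, the following minimal sketch pulls the "protein_id" values out of an entry. It assumes the UniProtKB flat-file (text) format, where INSDC cross-references appear on lines starting with `DR   EMBL;` followed by the nucleotide accession, the protein_id, a status field and the molecule type. The function name `embl_protein_ids` is hypothetical, and the accessions in the example are invented placeholders, not data from a real entry.

```python
def embl_protein_ids(flat_text):
    """Return (nucleotide_accession, protein_id) pairs from DR EMBL lines."""
    pairs = []
    for line in flat_text.splitlines():
        if not line.startswith("DR   EMBL;"):
            continue
        # Drop the "DR   " line code and the trailing full stop, then split.
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        # Expected fields: EMBL, accession, protein_id, status, molecule type.
        # A "-" in the protein_id position means no protein_id is attached.
        if len(fields) >= 3 and fields[2] != "-":
            pairs.append((fields[1], fields[2]))
    return pairs


# Invented example lines that only mimic the flat-file layout.
EXAMPLE = """\
ID   EXAMPLE_ENTRY           Unreviewed;       100 AA.
DR   EMBL; AB000001; BAA00001.1; -; mRNA.
DR   EMBL; AB000002; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   PDB; 1ABC; X-ray; 2.00 A; A=1-100.
"""

print(embl_protein_ids(EXAMPLE))  # prints [('AB000001', 'BAA00001.1')]
```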

In addition to translated CDS, UniProtKB protein sequences may come from:

The FAQ "Does UniProtKB contain all protein sequences?" gives information on our UniProtKB protein sequence exclusion policies, e.g. for redundant proteomes.

(1) Complementary pipelines for import of protein sequences have been developed in collaboration with Ensembl for vertebrate species, Ensembl Genomes for non-vertebrate species, WormBase ParaSite for parasitic nematodes and VectorBase for pathogen vector genomes. In addition, a new pipeline imports selected non-redundant genomes annotated by NCBI RefSeq. These sources provide proteome sequences for a number of key genomes of special interest where the INSDC submission is lacking gene model annotation.

To date, these pipelines have been used to populate UniProtKB with additional predicted sequences for the human and mouse proteomes as well as a number of other important vertebrate and non-vertebrate species. See: What are proteomes?

Related terms: imported, source, origin
