
Protein Embeddings
=====================

UniProt provides raw embeddings for UniProtKB/Swiss-Prot and for selected reference proteomes of model organisms.
The embeddings are generated with the ProtT5 protein language model and stored in the standard HDF5 file format.
Two embedding files are generated for each dataset: per-protein embeddings, where a single fixed-length embedding vector is computed for the whole protein sequence, and per-residue embeddings, where a fixed-length embedding vector is computed for each individual residue.

Note: Protein sequences longer than 12k residues are excluded because of GPU memory limitations (this affects only a handful of proteins).

Per-protein embeddings:
----------------------------------

This directory contains the following subdirectories, one per dataset, each holding a per-protein.h5 embeddings file:

1) uniprot_sprot
Per-protein embeddings for UniProtKB/Swiss-Prot.

2) UP000006548_3702
Per-protein embeddings for Arabidopsis thaliana reference proteome.

3) UP000001940_6239
Per-protein embeddings for Caenorhabditis elegans reference proteome.

4) UP000000625_83333
Per-protein embeddings for Escherichia coli reference proteome.

5) UP000005640_9606
Per-protein embeddings for Homo sapiens reference proteome.

6) UP000000589_10090
Per-protein embeddings for Mus musculus reference proteome.

7) UP000002494_10116
Per-protein embeddings for Rattus norvegicus reference proteome.

8) UP000464024_2697049
Per-protein embeddings for SARS-CoV-2 reference proteome.
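
As an illustration, a per-protein.h5 file could be read along the following lines. This is a minimal sketch, not an official loader. It assumes the h5py and numpy Python packages are installed and that each top-level entry in the file is a dataset holding one embedding vector, keyed by a protein identifier; this layout is not described in this README and should be checked against the actual file. The function name load_per_protein is hypothetical.

```python
import h5py
import numpy as np


def load_per_protein(path):
    """Return a dict mapping each top-level key to a 1-D float32 vector."""
    embeddings = {}
    with h5py.File(path, "r") as f:
        for key, dataset in f.items():
            # Assumption: every top-level entry is a dataset (not a group)
            # that stores a single fixed-length embedding vector.
            embeddings[key] = np.array(dataset[...], dtype=np.float32)
    return embeddings
```

For example, load_per_protein("uniprot_sprot/per-protein.h5") would return one vector per entry of the UniProtKB/Swiss-Prot file, provided the file has been downloaded to that path and the layout assumption holds.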


Per-residue embeddings:
----------------------------------

Since per-residue embeddings can become very large for larger datasets and longer sequences, they are provided at a different FTP location (and are made available only where there is user interest).

Per-residue embeddings can be accessed from the following location:
https://ftp.ebi.ac.uk/pub/contrib/UniProt/embeddings/current_release

Similar to the per-protein directory, there is one subdirectory for each dataset, where the per-residue.h5 embeddings file resides.
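
A per-residue.h5 file could be read one protein at a time, which avoids loading a very large file into memory. The sketch below is illustrative only. It assumes the h5py and numpy Python packages, and that each top-level entry is a dataset of shape (number of residues, embedding dimension); this layout is an assumption and is not stated in this README. The function names load_residue_matrix and mean_pool are hypothetical. Averaging over residues is shown as one common way to reduce a per-residue matrix to a single fixed-length vector; it is not claimed to be the procedure used to produce the per-protein files.

```python
import h5py
import numpy as np


def load_residue_matrix(path, key):
    """Read a single entry from a per-residue file as a 2-D float32 array."""
    with h5py.File(path, "r") as f:
        # Only the requested dataset is read from disk.
        matrix = np.array(f[key][...], dtype=np.float32)
    if matrix.ndim != 2:
        # Assumption check: one row per residue, one column per dimension.
        raise ValueError(f"expected a 2-D array for {key!r}, got shape {matrix.shape}")
    return matrix


def mean_pool(matrix):
    """Average over the residue axis to obtain one fixed-length vector."""
    return matrix.mean(axis=0)
```

With this layout, the number of rows of the returned matrix equals the length of the protein sequence, and mean_pool returns a vector whose length equals the embedding dimension.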


--------------------------------------------------------------------------------
  LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution 4.0 International 
(CC BY 4.0) License (https://creativecommons.org/licenses/by/4.0/) to all 
copyrightable parts of our databases.

(c) 2002-2024 UniProt Consortium

--------------------------------------------------------------------------------
  DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by patents
or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.
