Protein Signal Peptide Prediction (ProtSig)

Signal peptides are short peptide chains, typically 5-30 amino acids long, that guide newly synthesized proteins into secretion pathways. Most are located at the N-terminus of the protein, though in rare cases they appear elsewhere. They usually contain a characteristic hydrophobic region, and their core function is to mediate secretion of the protein to the extracellular space. In recombinant protein expression design, signal peptide prediction is often the first step; signal peptides are then deleted, replaced, or added artificially according to the intended expression strategy (intracellular or secretory expression).

1. Enter protein sequences in the textbox below (supports up to 100 FASTA entries; each sequence is automatically truncated to the first 70 aa)

2. Select the organism type
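
The input limits in step 1 can be illustrated with a minimal Python sketch. This is not the tool's actual code: the function names `parse_fasta` and `prepare_input` are hypothetical, and rejecting inputs with more than 100 entries (rather than silently dropping the extras) is an assumption.

```python
from typing import Iterator, List, Tuple

MAX_ENTRIES = 100  # entry limit stated in step 1
MAX_LENGTH = 70    # residues kept per sequence, as stated in step 1


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def prepare_input(text: str) -> List[Tuple[str, str]]:
    """Apply the entry limit and keep only the first 70 residues."""
    records = list(parse_fasta(text))
    if len(records) > MAX_ENTRIES:
        # Assumption: oversized inputs are rejected outright.
        raise ValueError(f"at most {MAX_ENTRIES} FASTA entries are supported")
    return [(header, seq[:MAX_LENGTH]) for header, seq in records]
```

Truncating to the N-terminal 70 residues fits the fact that most signal peptides sit at the N-terminus, so the rest of the sequence is not needed for prediction.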

Signal Peptide Overview

A signal peptide is a hydrophobic amino acid sequence, typically 5-30 residues long, located at the protein N-terminus and used to direct newly synthesized proteins into transport pathways. Signal peptides are found in secreted proteins, transmembrane proteins, and proteins targeted to eukaryotic organelles.
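
To make the "hydrophobic region" concrete, the sketch below scans the N-terminal part of a sequence for its most hydrophobic window. It is only an illustration and not the method used by this tool: the Kyte-Doolittle hydropathy scale, the 7-residue window, the 30-residue search region, and the function name `most_hydrophobic_window` are all assumptions introduced here.

```python
from typing import Tuple

# Kyte-Doolittle hydropathy values (higher means more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def most_hydrophobic_window(seq: str, window: int = 7,
                            region: int = 30) -> Tuple[int, float]:
    """Return (start index, mean hydropathy) of the most hydrophobic
    window within the first `region` residues of `seq`."""
    head = seq[:region].upper()
    if len(head) < window:
        raise ValueError("sequence is shorter than the window")
    best_start, best_score = 0, float("-inf")
    for start in range(len(head) - window + 1):
        # Unknown residues (e.g. X) are scored as neutral (0.0).
        total = sum(KYTE_DOOLITTLE.get(aa, 0.0)
                    for aa in head[start:start + window])
        score = total / window
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score
```

A strongly positive window score near the N-terminus is consistent with a hydrophobic core, but it is not sufficient evidence of a signal peptide; transmembrane helices produce the same pattern.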

There are two major pathways guided by signal peptides: (1) the general secretory (Sec) pathway and (2) the twin-arginine translocation (Tat) pathway. The Sec pathway translocates proteins across the plasma membrane in prokaryotes and across the endoplasmic reticulum membrane in eukaryotes. The Tat pathway exists in bacteria, archaea, chloroplasts, and mitochondria; its signal peptides are generally longer, less hydrophobic, and contain two consecutive arginines in the N-terminal region (n-region). Unlike the Sec pathway, which transports unfolded proteins, the Tat pathway can transport folded proteins across lipid bilayers.

After directing protein transport, signal peptides are cleaved off by signal peptidases. There are three major types: (1) SPaseI, (2) SPaseII, and (3) SPaseIII. Most signal peptides are removed by SPaseI, which is present in archaea, bacteria, and eukaryotes; on the eukaryotic endoplasmic reticulum membrane, only SPaseI is present. In bacterial and archaeal lipoproteins, the C-terminal part of the signal peptide contains a conserved region called the lipobox, and cleavage is performed by SPaseII. The residue immediately after the cleavage site (CS), at position +1 of the lipobox, is a cysteine, which is involved in anchoring the protein to the membrane. Type IV pilin signal peptides in bacteria are cleaved by SPaseIII. Sec-related signal peptides can be cleaved by SPaseI, SPaseII, or SPaseIII, whereas Tat-related signal peptides are cleaved only by SPaseI or SPaseII.
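
The pathway and peptidase pairings described above, and the cysteine that follows a lipoprotein cleavage site, can be written down as a small Python sketch. The names `ALLOWED_PEPTIDASES`, `is_known_combination`, and `has_plus_one_cysteine` are hypothetical, and the example sequence is a made-up toy, not a real protein.

```python
# Pathway -> signal peptidases that can cleave its signal peptides,
# as listed in the paragraph above.
ALLOWED_PEPTIDASES = {
    "Sec": {"SPaseI", "SPaseII", "SPaseIII"},
    "Tat": {"SPaseI", "SPaseII"},
}


def is_known_combination(pathway: str, peptidase: str) -> bool:
    """True if the pathway/peptidase pairing is one described above."""
    return peptidase in ALLOWED_PEPTIDASES.get(pathway, set())


def has_plus_one_cysteine(sequence: str, sp_length: int) -> bool:
    """True if the first residue after the cleavage site is cysteine.

    `sp_length` is the signal peptide length, so with 0-based indexing
    the residue just after the cleavage site is sequence[sp_length].
    """
    if not 0 < sp_length < len(sequence):
        raise ValueError("cleavage site must lie inside the sequence")
    return sequence[sp_length].upper() == "C"


# Toy example (hypothetical sequence): a 7-residue signal peptide
# followed by a cysteine.
toy_sequence = "MKKLLAGCSS"
lipoprotein_like = has_plus_one_cysteine(toy_sequence, 7)
```

A check like `has_plus_one_cysteine` only verifies one necessary feature of an SPaseII site; it does not by itself identify a lipoprotein.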

This tool uses a deep learning model to predict prokaryotic Sec/SPI, Sec/SPII, and Tat/SPI signal peptides as well as eukaryotic Sec/SPI signal peptides, and it currently achieves state-of-the-art (SOTA) performance.

Model Performance Metrics (SOTA Level)

Our model performs well on both signal peptide detection and cleavage-site prediction. Detailed test-set results are shown below; in the report, POSITIVE and NEGATIVE refer to Gram-positive and Gram-negative bacteria.

============================================================
TEST DETAILED PERFORMANCE REPORT
============================================================

[Detection MCC (One-vs-All)]
EUKARYA - SP             : MCC = 0.9828
POSITIVE - SP            : MCC = 0.9545
POSITIVE - LIPO          : MCC = 1.0000
POSITIVE - TAT           : MCC = 1.0000
POSITIVE - TATLIPO       : MCC = 1.0000
POSITIVE - PILIN         : MCC = 1.0000
NEGATIVE - SP            : MCC = 0.9193
NEGATIVE - LIPO          : MCC = 0.9598
NEGATIVE - TAT           : MCC = 0.9517
NEGATIVE - TATLIPO       : MCC = 0.9547
NEGATIVE - PILIN         : MCC = 1.0000
ARCHAEA - SP             : MCC = 1.0000
ARCHAEA - LIPO           : MCC = 1.0000
ARCHAEA - TAT            : MCC = 1.0000
ARCHAEA - TATLIPO        : MCC = 1.0000
ARCHAEA - PILIN          : MCC = 1.0000

[Cleavage Site (CS) Prediction]
Category                       | Precision  | Recall     | F1        
----------------------------------------------------------------------
EUKARYA SP                     | 0.8296     | 0.8421     | 0.8358
POSITIVE SP                    | 0.7692     | 0.7143     | 0.7407
POSITIVE LIPO                  | 1.0000     | 1.0000     | 1.0000
POSITIVE TAT                   | 0.5000     | 0.5000     | 0.5000
POSITIVE TATLIPO               | 1.0000     | 1.0000     | 1.0000
POSITIVE PILIN                 | 1.0000     | 1.0000     | 1.0000
NEGATIVE SP                    | 0.9000     | 0.9643     | 0.9310
NEGATIVE LIPO                  | 1.0000     | 0.9412     | 0.9697
NEGATIVE TAT                   | 0.7727     | 0.7083     | 0.7391
NEGATIVE TATLIPO               | 0.9167     | 1.0000     | 0.9565
NEGATIVE PILIN                 | 1.0000     | 1.0000     | 1.0000
ARCHAEA SP                     | 1.0000     | 1.0000     | 1.0000
ARCHAEA LIPO                   | 1.0000     | 1.0000     | 1.0000
ARCHAEA TAT                    | 1.0000     | 1.0000     | 1.0000
ARCHAEA TATLIPO                | 1.0000     | 1.0000     | 1.0000
ARCHAEA PILIN                  | 1.0000     | 1.0000     | 1.0000

============================================================
TEST Overall Global Accuracy: 0.9914
TEST Overall Global MCC:      0.9804
TEST Overall CS Macro F1:     0.9171
============================================================
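
For readers who want to check the summary figures, the sketch below gives the standard definitions of F1 and the Matthews correlation coefficient (MCC) and recomputes the macro F1 from the sixteen per-category F1 values in the report. The helper names `f1_score` and `mcc` are hypothetical, and this is not the evaluation script that produced the report.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts.

    In a one-vs-all setting, the class of interest is treated as
    positive and every other class as negative.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Per-category cleavage-site F1 values, copied from the report above.
REPORTED_F1 = [
    0.8358,                                  # EUKARYA SP
    0.7407, 1.0000, 0.5000, 1.0000, 1.0000,  # POSITIVE
    0.9310, 0.9697, 0.7391, 0.9565, 1.0000,  # NEGATIVE
    1.0000, 1.0000, 1.0000, 1.0000, 1.0000,  # ARCHAEA
]

# Macro F1: unweighted mean over categories.
macro_f1 = sum(REPORTED_F1) / len(REPORTED_F1)
```

The recomputed macro F1 agrees with the reported 0.9171 to within rounding of the per-category values. Because the macro average weights every category equally, the perfect scores in the Archaea and pilin categories count as much as the larger categories and raise the average relative to lower-scoring ones such as POSITIVE TAT.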
        

References

  • Almagro Armenteros JJ, Tsirigos KD, Sønderby CK, et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat Biotechnol. 2019;37(4):420‐423. doi:10.1038/s41587-019-0036-z