PSORTdb
A database of protein subcellular localizations for bacteria
PSORTdb documentation
1. Introduction
Identification of a bacterial protein's subcellular localization (SCL)
provides valuable clues regarding its biological function. For example,
surface-exposed or secreted proteins are of primary interest because of
their potential as vaccine candidates or diagnostic agents (environmental
or medical) and because they are readily accessible to drugs. Computational
SCL analysis of the growing number of completed bacterial genomes, or of
individual proteins, allows researchers to screen for vaccine/drug candidates,
automatically annotate gene products or select proteins for further study.
PSORTdb is a database of SCL for bacteria that contains
both information determined through laboratory experimentation (ePSORTdb
dataset) and computational predictions (cPSORTdb
dataset). The dataset of experimentally verified information (~2000 proteins)
was manually curated by us and represents the largest dataset of its kind.
The second component of this database contains computational analyses
of proteins deduced from the most recent NCBI dataset of completely sequenced
genomes. Analyses are currently calculated using PSORTb, the most precise
automated SCL predictor for bacterial proteins.
PSORTdb belongs to the PSORT family, which provides resources and links
for subcellular localization prediction.
If you use PSORTdb in your research, we would greatly appreciate it if
you cited the following publication:
- PSORT-DB:
Rey, S., M. Acab, J.L. Gardy, M.R. Laird, K. deFays, C. Lambert, and F.S.L. Brinkman (2005).
PSORT-DB: A Database of Subcellular Localizations for Bacteria.
Nucleic Acids Research 33:D164-168 (Database issue).
2. ePSORTdb
ePSORTdb is a dataset of proteins whose subcellular localization (SCL)
has been verified by laboratory experimentation. This dataset was used
to train PSORTb, our automated SCL predictor for bacterial proteins, and
is currently the largest dataset of its kind available for bacteria.
Localizations were manually assigned to each protein in the dataset by
searching the literature (through the PubMed database or other literature
sources, such as microbiology textbooks) for experimentally derived
information.
The following table describes the subcellular
localization sites used to annotate proteins in ePSORTdb.
| | Gram-negative | Gram-positive |
| Single SCL | | |
| | Cytoplasmic (C) | Cytoplasmic (C) |
| | Cytoplasmic membrane (CM) | Cytoplasmic membrane (CM) |
| | Periplasmic (P) | - |
| | - | Cell wall (CW) |
| | Outer membrane (OM) | - |
| | Extracellular (E) | Extracellular (E) |
| Multiple SCL | | |
| | C/CM | C/CM |
| | CM/P | CM/CW |
| | P/OM | - |
| | OM/E | - |
As of August 2004, ePSORTdb
version 2.0 includes 2165 bacterial proteins (1591 from Gram-negative
and 574 from Gram-positive bacteria).
ePSORTdb can be searched using a variety
of fields. The specific fields available in ePSORTdb
are described below.
3. cPSORTdb
cPSORTdb
is a dataset of protein localizations that were predicted by computational
methods. To date, 140 bacterial genomes available through the NCBI have
been analyzed by both PSORTb version 1.1.2 and version 2.0. In the future,
we would like to incorporate the results of other SCL prediction methods
into cPSORTdb.
The long format predictions generated by PSORTb versions 1.1.2 and 2.0
are stored in cPSORTdb and are fully browsable and searchable. Descriptions
of the two versions of PSORTb are given below.
3.1. PSORTb v.1.1.2
PSORT-B v.1.1.2 is designed for Gram-negative
bacterial proteins and consists of six analytical modules:
- SCL-BLAST, or SubCellular Localization BLAST
- Motif Analysis
- Outer Membrane Motif Analysis
- HMMTOP
- SubLocC
- Signal Peptide
Each module analyzes one biological feature
known to influence or be characteristic of subcellular localization.
The modules may act as a binary predictor, classifying a protein as
either belonging or not belonging to a particular localization site,
or they may be multi-category, able to assign a protein to one of several
localization sites.
In order to generate a final prediction,
the results of each module are combined and assessed. A probabilistic
method and 5-fold cross validation were used to assess the likelihood
of a protein being at a specific localization given the prediction of
a certain module. These likelihoods are used to generate a probability
value for each of the five localization sites for a user's query protein.
When analyzing a Gram-negative organism,
the 5 possible localization sites are: cytoplasm, cytoplasmic membrane,
periplasm, outer membrane and extracellular space. PSORT-B v.1.1.2 returns
a list of these five localization sites and the associated probability
value for each, ranked in descending order. A site with a value of 7.5
or above is returned as the final prediction; otherwise a result of
"Unknown" is returned.
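The ranking and cutoff step described above can be sketched as follows. This is a minimal illustration, not the PSORTb implementation: the function names and the example score values are hypothetical, and only the 7.5 cutoff and the "Unknown" fallback come from the text. How PSORTb handles ties or proteins with multiple localizations is not described here, so the sketch simply takes the top-ranked site.

```python
from typing import Dict, List, Tuple

CUTOFF = 7.5  # cutoff value stated in the documentation


def rank_sites(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (site, score) pairs ranked in descending order of score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def final_prediction(scores: Dict[str, float], cutoff: float = CUTOFF) -> str:
    """Return the top-ranked site if its score reaches the cutoff, else "Unknown"."""
    ranked = rank_sites(scores)
    if not ranked:
        return "Unknown"
    top_site, top_score = ranked[0]
    return top_site if top_score >= cutoff else "Unknown"


# Hypothetical score values for a Gram-negative protein (illustration only).
example_scores = {
    "Cytoplasmic": 0.3,
    "CytoplasmicMembrane": 0.5,
    "Periplasmic": 0.9,
    "OuterMembrane": 8.1,
    "Extracellular": 0.2,
}
print(final_prediction(example_scores))
```

With the hypothetical values above, the outer membrane score is the only one at or above the cutoff, so that site is returned; if no site reached 7.5 the function would return "Unknown".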
See below for database
fields associated with PSORTb v.1.1.2, or read more information about
its modules.
Submit a protein to PSORTb
v.1.1.2.
3.2. PSORTb v.2.0
PSORTb v.2.0 is designed for both Gram-negative
and Gram-positive bacterial proteins and again consists of
multiple analytical modules:
- SCL-BLAST & SCL-BLASTe, or SubCellular Localization
BLAST
- Support Vector Machines (SVMs)
- Motif & Profile Analysis
- Outer Membrane Motif Analysis
- HMMTOP
- Signal Peptide
The same integration and weighting process
as PSORTb v.1.1.2 (see above) is used to generate a final prediction.
For Gram-negative organisms, the 5 possible localization
sites are: cytoplasm, cytoplasmic membrane, periplasm, outer membrane
and extracellular space. For Gram-positive bacteria,
4 localization sites are possible: cytoplasm, cytoplasmic membrane,
cell wall and extracellular space. The 7.5+ cutoff is again used.
See below for database
fields associated with PSORTb v.2.0, or read more information about
PSORTb
v.2.0 modules.
Submit a protein to PSORTb
v.2.0.
4. Accessing ePSORTdb and cPSORTdb
4.1. Text search

A. Simple search input text
field - allows user to enter keywords.
B. PSORTb version check box (only available in
cPSORTdb) - selects version
of PSORTb predictions.
C. Drop down menu - selects a field to search
against.
D. Advanced search input text field - allows user
to enter keywords.
E. Help message box - displays description and/or
example of what type of text can be entered for the selected search
field
F. Boolean operator - allows user to make complex
queries.
Simple search
Simple search (A)
allows the user to perform keyword searches against all text fields
of the database. The wildcard % (representing zero or more characters)
will be added between each keyword.
Simple searches can be carried out against
ePSORTdb and cPSORTdb. In the latter case, the search is carried out
by default against the predictions from PSORTb v.2.0 (the most recent
version of PSORTb).
Advanced search
Advanced searches can be done by specifying
keywords in one or more text fields. Available fields vary according
to the database in question (follow the links for a list and description
of the fields available in ePSORTdb
and cPSORTdb).
When you choose a particular field from the drop down menu list (C),
an example of possible values appropriate to this field is automatically
displayed in the help message box (E) on the far
right, letting you know the correct keywords or the correct syntax to
be used. You can then either highlight keywords from the help message
box and use the double arrow button to send them to the input text search
box (D), or you may type your own keywords into
box D. Each input text box (D) can hold more than one keyword; multiple
keywords are combined with OR when searching within the selected field.
If you want to search for keywords in different fields, you can use
Boolean operators (F) to combine up to 4 fields.
In cPSORTdb
advanced searches, a check box (B) allows the user
to choose among predictions from the different versions of the program
available (currently PSORTb versions 1.1.2 and 2.0).
Localization score fields (e.g. "PSORTb
predicted SCL score" or "cytoplasmic score") are grouped
under the category field called >All scores. If
you select this particular field, the query will be run against all
localization score fields grouped under this category.
In the same manner, reference fields (PubMed
ID, Title, ISBN number, WWW and comments) are also grouped under the
category field called >Reference summary.
Furthermore:
- Keywords are case insensitive.
- Spaces and quotation marks will be included in the search.
- % represents zero or more characters.
- _ (underscore character) represents exactly one character.
NB: wildcards are
not automatically included in advanced queries. In other words,
every keyword is searched as an exact match. If you wish to perform
a substring search within a text field, you have to add wildcards (%
or _) to your query. Entering the sole keyword _% will only return
entries that have some data in the selected field. This feature only
works for text fields; the help message box does not always indicate
the field type, but numerical fields are explicitly mentioned. Queries
within numerical fields are made using mathematical symbols, such
as <, >, >=, <= or =. Some examples are shown in the following
table.
| Protein name | synthetase | no results |
| | synthetase% | all proteins whose name starts with synthetase (2 hits) |
| | %synthetase% | all proteins whose name contains the keyword synthetase (5,724 hits) |
| GI | <=19052904 | all proteins whose GI is less than or equal to 19052904 |
| HMMTOP Helices Count | 8<=range<=10 | all proteins with between 8 and 10 helices predicted by the HMMTOP module |
| Gene Name | yxb_ | all proteins whose gene name has 4 characters and starts with yxb |
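The matching rules for advanced queries can be emulated offline, for instance to filter a downloaded results file in the same way. The sketch below is an assumption-laden illustration rather than the database's own query code: the function names are hypothetical, the % and _ wildcards are translated into a regular expression, and the word "range" in the range form is assumed to be typed literally, as in the table.

```python
import operator
import re

# Comparison symbols listed in the documentation for numerical fields.
_OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "=": operator.eq,
}
_NUMBER = r"\d+(?:\.\d+)?"


def text_matches(pattern: str, value: str) -> bool:
    """Case-insensitive whole-field match where % is any run of characters
    (including none) and _ is exactly one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    regex = re.compile("".join(parts), re.IGNORECASE | re.DOTALL)
    return regex.fullmatch(value) is not None


def number_matches(query: str, value: float) -> bool:
    """Evaluate queries such as "<=19052904" or "8<=range<=10" against a number."""
    query = query.replace(" ", "")
    between = re.fullmatch(rf"({_NUMBER})<=range<=({_NUMBER})", query)
    if between:
        return float(between.group(1)) <= value <= float(between.group(2))
    single = re.fullmatch(rf"(<=|>=|<|>|=)({_NUMBER})", query)
    if single is None:
        raise ValueError(f"unsupported numerical query: {query!r}")
    return _OPS[single.group(1)](value, float(single.group(2)))


print(text_matches("synthetase", "Glutamine synthetase"))
print(text_matches("%synthetase%", "Glutamine synthetase"))
print(text_matches("yxb_", "yxbA"))
print(number_matches("8<=range<=10", 9))
```

The first call shows why a bare keyword can return no results: without wildcards the whole field must equal the keyword.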
Results of a text search are returned
as a table. For an explanation of this table, see 4.4
Result Display.
4.2. BLAST search
Both ePSORTdb
and cPSORTdb datasets
can be searched using the BLASTP program. One or more proteins can be
submitted at the same time; these must be in FASTA format. A sequence
in a FASTA file consists of three parts:
- A title line, which must begin with a '>' symbol
and may be followed by any type of text
- A new line character at the end of the title line
- The sequence itself, which continues until the
end of the file or the next '>' is reached
An example of FASTA format is shown below:
>sp|O52956|A85A_MYCAV Antigen 85-A precursor (85A)
MTLVDRLRGAVAGMPRRLVVGAAGAALLSGLIGAVGGSATAGAFSRPGLPVEYLQVPSAAMGRD
IKVQFQSGGANSPALYLLDGMRAQDDFNGWDINTPAFEWYNQSGISVAMPVGGQSSFYSDWY
KPACGKAGCTTYKWETFLTSELPQYLSAQKQVKPTGSGVVGLSMAGSSALILAAYHPDQFVYAG
SLSALLDPSQGMGPSLIGLAMGDAGGYKAADMWGPKEDPAWARNDPSLQVGKLVANNTRIWV
YCGNGKPSDLGGDNLPAKFLEGFVRTSNLKFQDAYNGAGGHNAVWNFDANGTHDWPYWGA
QLQAMKPDLQSVLGATPGAGPATAAATNAGNGQGT
For more information, see the description
at NCBI.
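Before submitting sequences it can be useful to check that a file follows the three-part FASTA layout described above. The following is a minimal sketch with a hypothetical function name and made-up example records; it does not validate the amino acid alphabet.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs from FASTA-formatted text.

    A record starts at a line beginning with '>' and its sequence runs
    until the next '>' line or the end of the input.
    """
    title: Optional[str] = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            # Drop any whitespace inside a sequence line.
            chunks.append("".join(line.split()))
        else:
            raise ValueError("sequence data found before the first '>' title line")
    if title is not None:
        yield title, "".join(chunks)


# Made-up records for illustration only.
example = """>protein_one hypothetical example
MTLVDRLRGA
VAGMPRR
>protein_two
MKLVDRF
"""
records = list(parse_fasta(example))
for record_title, sequence in records:
    print(record_title, len(sequence))
```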
Results of a BLAST search
are presented through the standard BLASTP layout, displaying the retrieved
proteins with their associated parameters. The different sections of
the result page are briefly described here:
| Query= sp|O52956|A85A_MYCAV Antigen 85-A precursor (85A) (347 letters) |
Header of the submitted protein with
its sequence length in brackets.
| Database: ePSORT.faa 2165 sequences; 1,017,765 total letters |
Name and size of the database which
was searched (ePSORT.faa for ePSORTdb and cPSORT.faa for cPSORTdb).
| Sequences producing significant alignments: | Score (bits) | E Value |
| gi|13431272 Extracellular|Antigen 85-A precursor|Mycobacter... | 445 | e-126 |
| ... | | |
| gi|13475694 OuterMembrane|general secretion proteinD|Mesorh... | 24 | 8.8 |
Summary of the retrieved proteins with
their Score and E-value. Two links are available for each protein:
one (e.g. gi|13431272)
points to the complete PSORTdb entry of the protein and the other
(e.g. 445) jumps further
down the page to the detailed results of the BLAST search. The
definition line of each retrieved protein contains its GI number,
its subcellular localization (experimentally confirmed for ePSORTdb
and predicted for cPSORTdb),
its name and its source organism.
| >>[ePSORTdb] [NCBI] gi|13431271 Extracellular|Antigen 85-A precursor|Mycobacterium gordonae |
| Length = 339 |
| Score = 403 bits (1036), Expect = e-114 |
| Identities = 197/240 (82%), Positives = 204/240 (85%), Gaps = 1/240 (0%) |
| Query: 1 | MTLVDRLRGAVAGMPRRXXXXXXXXXXXXXXXXXXXXXXTAGAFSRPGLPVEYLQVPSAA 60 |
|          | M LVDR RGAV GMPRR                      TA AFSRPGLPVEYLQVPSAA    |
| Sbjct: 1 | MKLVDRFRGAVTGMPRRLMVGAVGAALLSGLVGFVGGSATASAFSRPGLPVEYLQVPSAA 60 |
| ... |
Detailed results of the BLAST search
for a retrieved protein are shown above. Two links are available for
each protein: the first one (called [ePSORTdb]
or [cPSORTdb])
allows the user to view the protein's complete entry in our PSORTdb
database and the second one (called [NCBI])
points to the protein's entry in the NCBI database.
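The definition line layout described above (GI number, localization, protein name, source organism) lends itself to simple parsing when BLAST results are post-processed. The sketch below is an illustration only: the class and function names are hypothetical, and it assumes that a full definition line has the form "gi|NUMBER SCL|name|organism" with no extra "|" inside the protein name. Lines truncated with "..." in the summary table would not carry a complete organism name.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    gi: int
    localization: str
    protein_name: str
    organism: str


def parse_definition_line(line: str) -> Hit:
    """Split a definition line such as
    'gi|13431271 Extracellular|Antigen 85-A precursor|Mycobacterium gordonae'
    into its four parts."""
    line = " ".join(line.split())  # collapse wrapped lines and repeated spaces
    if not line.startswith("gi|"):
        raise ValueError("definition line does not start with 'gi|'")
    gi_text, _, rest = line[3:].partition(" ")
    fields = rest.split("|")
    if not gi_text.isdigit() or len(fields) != 3:
        raise ValueError(f"unexpected definition line layout: {line!r}")
    localization, protein_name, organism = fields
    return Hit(int(gi_text), localization, protein_name, organism)


hit = parse_definition_line(
    "gi|13431271 Extracellular|Antigen 85-A precursor|Mycobacterium gordonae"
)
print(hit.gi, hit.localization, hit.organism)
```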
4.3. Browse datasets

A. Current level of the browsing
- allows user to return to a higher level of the database.
B. Next available levels - allows user to proceed
to a lower level of the database.
Both datasets can be browsed by SCL, organism,
phylum, class, Gram stain and genome in every possible combination.
When "List all proteins at Current Level" is selected, the
results are returned as a table. For an explanation of this table, see
4.4 Result Display.
As an example, the browse function might
be used in the cPSORTdb
dataset to retrieve all predicted localizations for a specific genome
by selecting "Organism"
at the first level, then selecting your favorite microbe, and
then choosing "List all proteins at Current Level".
From the results page, all the information associated with this organism
can be downloaded in either tab delimited or FASTA format.
4.4. Result display

A. Sort columns
- sorts by ascending or descending order (up to 3 fields can be used
in sorting).
B. Displayed columns - adds or removes selected
columns from the results view.
C. Results per page - changes the number of records
per page.
D. Download options - downloads the results.
E. Total number of results and pages - navigates
from one result page to another.
F. Results view - displays search results.
Results of browsing and keyword searches
can be viewed page by page. Initially a default set of fields is displayed
as columns, but numerous options are available that allow
users to customize their result listing:
- Up to 3 user-selected fields can be sorted simultaneously
in ascending or descending order (A).
- The user can select which columns to view (B).
- The number of records per page can be changed
from the default of 10 (C).
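The same kind of multi-field ordering can be reproduced on downloaded results. The sketch below is only an illustration of sorting on up to 3 fields with a per-field direction; the function name and the example rows are hypothetical and it says nothing about how the web interface implements sorting.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def sort_rows(rows: Sequence[Row], sort_keys: Sequence[Tuple[str, bool]]) -> List[Row]:
    """Sort rows by up to 3 (field, descending) pairs, most significant first."""
    if len(sort_keys) > 3:
        raise ValueError("at most 3 sort fields are supported")
    ordered = list(rows)
    # Python's sort is stable, so sorting by the least significant field first
    # and the most significant field last gives a combined ordering.
    for field, descending in reversed(sort_keys):
        ordered.sort(key=lambda row: row[field], reverse=descending)
    return ordered


# Hypothetical rows for illustration only.
rows = [
    {"Organism": "B", "Sequence length": 300, "GI": 3},
    {"Organism": "A", "Sequence length": 300, "GI": 1},
    {"Organism": "A", "Sequence length": 120, "GI": 2},
]
by_organism_then_length = sort_rows(
    rows, [("Organism", False), ("Sequence length", True)]
)
print([row["GI"] for row in by_organism_then_length])
```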
Furthermore, from the results list the user can:
- Download the results list (D).
See also 4.5 Result download
- See the number of pages in the Results list, and jump
to another page (E)
- Click on the protein's GI accession number in the Results
view (F) to get the detailed annotation for that entry (information
in all available fields as well as the amino acid sequence of the
protein).
4.5. Result download
From the Results view page, the user can
download their search results (D).
Results may be downloaded in two different formats:
- A Tab delimited text file containing those fields
currently displayed in Results view.
- A FASTA file containing the amino acid sequences
of the proteins in the results list.
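A downloaded tab delimited file can be loaded with a few lines of code. The sketch below rests on assumptions that are not confirmed by this documentation: that the first row of the file holds the column names, and that the columns shown in the made-up sample (which reuse field names from the field tables) are among those displayed. The function names are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, TextIO


def read_results(handle: TextIO) -> List[Dict[str, str]]:
    """Read tab delimited results, assuming the first row holds column names."""
    # QUOTE_NONE keeps quotation marks in protein names as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def count_by_field(rows: List[Dict[str, str]], field: str) -> Counter:
    """Count how many rows share each value of the given field."""
    return Counter(row[field] for row in rows)


# Made-up sample standing in for a downloaded file.
sample = io.StringIO(
    "GI\tProtein name\tOrganism\tExperimental SCL (terse)\n"
    "13431271\tAntigen 85-A precursor\tMycobacterium gordonae\tExtracellular\n"
    "1\thypothetical protein\tExample organism\tCytoplasmic\n"
)
results = read_results(sample)
print(count_by_field(results, "Experimental SCL (terse)"))
```

For a real download, the `io.StringIO` sample would be replaced by a file opened with `open(path, newline="")`.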
5. ePSORTdb fields
The following table lists all available
ePSORTdb fields. Default
fields displayed in results pages are in bold. Numerical fields are followed
by a #.
| Field name | Description |
| GI # | GI number of NCBI, primary identification key, unique to each protein |
| Swiss-Prot ID | Swiss-Prot primary accession number of the protein |
| Protein name | The name of the protein |
| Alternate protein name | Alternative names of the protein |
| Gene name | The name of the gene |
| Alternate gene name | Alternative names of the gene |
| Organism | Genus and species of the protein source organism |
| Taxonomy ID # | NCBI taxonomy identifier of the source organism |
| Gram stain | Gram classification of the source organism |
| Amino acid sequence | Amino acid sequence of the protein |
| Sequence length | Number of amino acids in the protein sequence |
| Experimental SCL (terse) | Experimentally verified SCL (terse format)* |
| Experimental SCL (verbose) | Experimentally verified SCL (verbose format)** |
| GO Accession ID | Gene Ontology (GO) accession identifier |
| GO Accession Definition | Gene Ontology (GO) accession definition |
| PubMed ID reference | PubMed identifier of literature references |
| Reference title | Title of literature reference books |
| ISBN number reference | ISBN identifier of literature reference books |
| WWW reference | Internet address of WWW references |
| Reference comments | Comments on literature references |
| References summary | Concatenation of all reference fields (PubMed ID, title, ISBN number, WWW and comments) |
*: strict SCL terminology as returned by PSORTb:
Gram-negative: Cytoplasmic, CytoplasmicMembrane, Periplasmic, OuterMembrane and Extracellular
Gram-positive: Cytoplasmic, CytoplasmicMembrane, Cellwall and Extracellular
**: e.g. Cell wall surface exposed (LPxTG motif) protein
#: numerical field
6. cPSORTdb fields
6.1. cPSORTdb general fields
The following table lists the general
cPSORTdb fields which
identify the protein and its source genome/organism. Default fields
displayed in results pages are in bold. Numerical fields are followed
by a #.
| Field name | Description |
| Chromosome Acc ID | NCBI chromosome accession identifier associated with the protein (NC_00XXXX) |
| GI # | NCBI GI number of the protein (primary identification key, unique to each protein) |
| RefSeq Accession ID | RefSeq accession identifier of the protein |
| Protein name | The name of the protein |
| Gene name | The name of the gene |
| Alternate gene name | The alternative name of the gene |
| Taxonomy ID # | NCBI taxonomy identifier of the source organism |
| Organism | Genus and species of the source organism |
| Phylum | Phylum of the source organism |
| Class | Class of the source organism |
| Gram stain | Gram classification of the source organism |
| Amino acid sequence | Amino acid sequence of the protein |
| Sequence length # | Number of amino acids in the protein sequence |
| ePSORTdb GI Link # | GI of the protein in the current ePSORTdb dataset* |
*: If the protein in cPSORTdb is identical
at the sequence level and the species level to a protein in ePSORTdb,
a link to the ePSORTdb record will appear here.
#: numerical field
6.2. PSORTb v.1.1.2 & v.2.0 prediction fields
The following table lists the cPSORTdb
fields which contain information regarding computationally predicted
SCLs. Default fields displayed in results pages are in bold.
| Field name | Description |
| SCL-BLAST localization | Predicted localization site (all possible SCL*) |
| SCL-BLAST details | Protein GI of ePSORTdb dataset |
| Motif localization | Predicted localization site (all possible SCL*) |
| Motif details | Motif accession number from PROSITE (list Gram neg. & pos.) |
| OMPmotif localization (N) | Predicted localization site (either outer membrane or unknown) |
| OMPmotif details (N) | OMPmotif accession number (list) |
| HMMTOP localization (#) | Predicted localization site (all possible SCL*) |
| HMMTOP helices count | Number of predicted helices |
| Signal localization | Predicted localization site (all possible SCL*) |
| Signal details | Presence or absence of a signal peptide |
| Profile localization (§) | Predicted localization site (all possible SCL*) |
| Profile details (§) | Profile accession number from PROSITE (list Gram neg. & pos.) |
| SCL-BLASTe localization (§) | Predicted localization site (all possible SCL*) |
| SCL-BLASTe details (§) | Protein GI of ePSORTdb dataset |
| CytoSVM localization (§) | Predicted localization site (either cytoplasmic or unknown) |
| CMSVM localization (§) | Predicted localization site (either cytoplasmic membrane or unknown) |
| PPSVM localization (N,§) | Predicted localization site (either periplasmic or unknown) |
| CWSVM localization (P,§) | Predicted localization site (either cell wall or unknown) |
| OMSVM localization (N,§) | Predicted localization site (either outer membrane or unknown) |
| ECSVM localization (§) | Predicted localization site (either extracellular or unknown) |
| SubLocC localization (N,§§) | Predicted localization site (either cytoplasmic or unknown) |
| Cytoplasmic score (#) | Probability of cytoplasmic localization returned by PSORTb |
| Cytoplasmic membrane score (#) | Probability of cytoplasmic membrane localization returned by PSORTb |
| Periplasmic score (#,N) | Probability of periplasmic localization returned by PSORTb |
| Cell wall score (#,P) | Probability of cell wall localization returned by PSORTb |
| Outer membrane score (#,N) | Probability of outer membrane localization returned by PSORTb |
| Extracellular score (#) | Probability of extracellular localization returned by PSORTb |
| Predicted Localization ** | Localization site returned by PSORTb |
| GO Accession ID | Gene Ontology (GO) accession identifier |
| GO Accession Definition | Gene Ontology (GO) accession definition |
| Predicted Localization Score (#) | Probability of PSORTb predicted localization |
*: for Gram-negative: cytoplasmic, cytoplasmic membrane, periplasmic, outer membrane, extracellular;
for Gram-positive: cytoplasmic, cytoplasmic membrane, cell wall, extracellular
**: Localization site returned by PSORTb, specific to Gram-negative or Gram-positive
N: available only for Gram-negative
P: available only for Gram-positive
#: numerical field
§: only available in PSORTb version 2.0
§§: only available in PSORTb version 1.1.2