org.strbio.mol
Class Profile

java.lang.Object
  extended by org.strbio.mol.Molecule
      extended by org.strbio.mol.Polymer
          extended by org.strbio.mol.Protein
              extended by org.strbio.mol.Profile
Direct Known Subclasses:
ProfilePSI

public class Profile
extends Protein

Class representing a protein with profile information. The main protein described by this class (the one with coordinates, sequence, predSS, etc.) is called the "key protein". Other aligned proteins are stored only as sequences. This class stores the names of those sequences and their "E values", which estimate how likely it is that each sequence was aligned by chance rather than being a true homologue. The aligned sequences themselves are stored in the residues of the key protein, in the seq[] array.

 Version 1.71, 3/2/05 - changed readBLAST to skip occasional
   error in blast output formatting:  Query with no Subject
 Version 1.7, 3/1/04 - changed readBLAST to track first and last
   residues of subject (hit) sequence, add it to seqName
 Version 1.61, 11/12/02 - fixed readMSF to read bogus MSF format
   from 3DPSSM files
 Version 1.6, 11/02/01 - don't remove redundant sequences by
   default in blast... need to explicitly do that.
 Version 1.52, 10/19/01 - recognizes TBLASTN output as BLAST.
 Version 1.51, 8/2/01 - made seqPctID public, wrote better docs
 Version 1.5, 7/18/01 - supports SAF format
 Version 1.41, 10/26/00 - changed removeRedundantSequences to
   handle subsequences (ALL)
 Version 1.4, 2/11/00 - added Clustal format
 Version 1.36, 2/10/00 - recognizes more bogus MSF formats
 Version 1.35, 11/4/99 - added setSequence
 Version 1.34, 10/14/99 - added support for precalculated cons weights
   (PRECALC_CW)
 Version 1.33, 10/5/99 - speed up findCW by ignoring >500 seqs, fixed
   compatibility bugs in MSF format
 Version 1.32, 10/4/99 - fixed kludge in readBLAST for faster psiblast reading
 Version 1.31, 9/22/99 - fixed MSF routines for Clustal compatibility
 Version 1.3, 6/22/99 - added blastLog10E, blastBits
 Version 1.22, 3/30/99 - made printfs consistent, limited names to 4096 chars
 Version 1.21, 2/17/99 - moved BLAST code from ProfileSet to here.
 Version 1.2, 2/9/99 - added YAPF format
 Version 1.18, 2/2/99 - fixed bug in readBLAST - 'X' being read as gap
 Version 1.17, 1/19/99 - added removeSeq, removeRedundantSequences
 Version 1.16, 1/8/99 - reads BLAST files.
 Version 1.15, 10/26/98 - reads lower case MSF files.
 Version 1.14, 9/24/98 - detects some unexpectedly truncated HSSP files.
 Version 1.13, 8/12/98 - made readMSF not reset() if at the end of a file;
   this triggers a bug in BufferedReader not resetting to the right place.
 Version 1.12, 8/7/98 - changed PrintfStream to Printf
 Version 1.11, 7/17/98 - added writeTDP, doVarTom
 Version 1.1, 7/10/98 - added choose, writeFasta, Profile(ProteinSet)
 Version 1.01, 5/19/98 - made readMSF more flexible about format
 Version 1.0, 5/1/98 - original version
 

Version:
1.7, 3/1/04
Author:
JMC
See Also:
ProfRes

Field Summary
 int[] blastBits
          Blast score for each sequence, if known... this should measure how close the sequence is to the key sequence.
 double[] blastLog10E
          E values for each sequence, if known... this should measure how close the sequence is to the key sequence, and how likely the match was in the database searched.
 java.lang.String[] seqName
          Name of each of the sequences in the profile, if sequence information is available; otherwise, null.
 
Fields inherited from class org.strbio.mol.Polymer
chainID, includeFile, includingFile, monDistance, monomers, properties
 
Fields inherited from class org.strbio.mol.Molecule
atoms, data, MAX_NAME_LENGTH, name
 
Constructor Summary
Profile()
          Make an empty Profile.
Profile(Profile q)
          Copy a Profile, including all data in it.
Profile(Protein q)
          Make a Profile from a Protein, copying all data.
Profile(ProteinSet q)
          Make a Profile from a ProteinSet, copying all data.
 
Method Summary
 void addKeySeq()
          Tell the profile to add the "key sequence"... another sequence with the sequence of this protein.
 void addSeq(java.lang.String newSeqName)
          Tell the profile to add a sequence with a given name.
 void addSeqsDirectlyFrom(Profile q)
          Add additional sequences from another profile of the same length.
 void addSeqsFrom(Profile q)
          Add sequence info from another profile; aligns and copies.
 void allocSeqs(int n)
          Tell the Profile to allocate space for at least N known sequences.
 void allocSeqsRes()
          Tell all residues in the profile to allocate space for at least as many sequences as we know about.
 void blast(Printf outfile)
          Run BLAST on this profile, using NCBI's blast server.
 void blast(Printf outfile, Blast blastServer)
          Run BLAST on this profile, using a specified server.
 Profile choose(int seqNum)
          Make a profile based on this one, featuring one of the subsequences.
 void clearSeqs()
          Tell the Profile to forget any sequence information it knows.
 Polymer copy()
          Return a copy of yourself.
 void copySeqsDirectlyFrom(Profile q)
          Copy additional sequences from another profile of the same length, eliminating current sequences.
 void copySeqsFrom(Profile q)
          Copy sequence info from another profile; aligns and copies, replacing current sequence info.
 Protein doVarTom(Printf outfile)
          Do var-tom on this profile.
 void ensureUniqueNames()
          This makes sure that every seqName is unique and non-null.
 void ensureUniqueNames(int maxNameLen)
          This makes sure that every seqName is unique and non-null.
 void findConsensus()
          Find the consensus residue for each residue in the profile.
 void findConservationWeights()
          Calculate conservation weight for each residue in the profile.
 void findFrequencies()
          Calculate the frequencies for each residue in the profile, but leave the consensus residue as it is.
 void findNonZeroFrequencies()
          Find the non-zero frequencies for each residue in the profile.
 int findSeqByName(java.lang.String name)
          Find a sequence by name, if it exists.
 int firstNonGap(int n)
          What's the position of the first non-gap residue in sequence N?
 java.lang.String getSeqName(int i)
          Get the name of sequence i.
 int lastNonGap(int n)
          What's the position of the last non-gap residue in sequence N?
 Monomer newMonomer()
          Profiles are made of ProfRes-type monomers.
 Monomer newMonomer(char t)
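          This should return a new monomer of whatever type this polymer is made of (e.g. Residue, Base), and initialize the type from a character.
 Monomer newMonomer(java.lang.String s)
          This should return a new monomer of whatever type this polymer is made of (e.g. Residue, Base), and initialize the type from a string.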
 void printProfile(Printf outfile)
          Print profile to output in a pretty format.
 void printProfile(Printf outfile, int pad)
          Print profile to output in a pretty format.
 void processYAPF(java.lang.String buffer)
          Process a line from a YAPF file... this should ignore the line if it doesn't know what it is.
 void readBLAST(java.io.BufferedReader infile, Printf outfile)
          This reads a profile from BLASTP 2.0.11 output.
 void readClustal(java.io.BufferedReader infile, Printf outfile)
          Read a protein out of a CLUSTAL format file.
 void readHSSP(java.io.BufferedReader infile, Printf outfile)
          Read a protein out of an HSSP (Sander & Schneider format) file.
 void readMSF(java.io.BufferedReader infile, Printf outfile)
          Read a protein out of an MSF (GCG's format) file.
protected  boolean recognizeAndRead(java.lang.String buffer, java.io.BufferedReader infile, Printf outfile)
          This looks for a profile and reads it in.
 void removeRedundantSequences(boolean removeSubSequences)
          Eliminate redundant sequences (or subsequences) from the set.
 void removeSeq(int n)
          Remove one of the sequences.
 void removeSpacesInNames()
          This changes all spaces in sequence names to underscores ('_').
 double seqPctCoverage(int r, int s)
          Find percent coverage of the second sequence by the first.
 double seqPctID(int r, int s)
          Find percent identity between two sequences, without gap penalties.
 java.lang.String sequence(int n)
          Return a string containing one of the sequences in the profile.
 int sequences()
          Number of sequences in the profile, if known; otherwise, zero.
 void setSeqName(int i, java.lang.String s)
          Set the name of sequence i.
 void setSequence(int n, java.lang.String newseq)
          Set the sequence for one of the sequences in the profile.
protected  Polymer splitCopy()
          When a Profile is split, sequence names should go to each child.
protected  void splitCopy(Profile q)
          When a Profile is split, sequence names should go to each child.
 void truncateNames(int maxChars)
          This truncates all names to a given length (or less).
 void writeClustal(Printf outfile)
          Write in CLUSTAL format.
 void writeFasta(Printf outfile)
          Write each sequence in the profile in Fasta format.
 void writeMSF(Printf outfile)
          Write in MSF format.
 void writeProf(Printf outfile)
          Write in Prof format.
 void writeSAF(Printf outfile)
          Write in Burkhard Rost's SAF format.
 void writeTDP(Printf outfile)
          A simple way of printing the profile, which var-tom uses as input.
protected  void writeYAPFInfo(Printf outfile)
          Write applicable sections of YAPF info.
 
Methods inherited from class org.strbio.mol.Protein
AA, actualAccuracy, copyPredSSFrom, expectedAccuracy, filterPred, filterPred, filterReal, filterReal, findAccess, findDSSP, findDSSPResult, findPDB, firstRes, fixDistanceGaps, getDSSPResults, getInfo, hasGaps, isCATHMatch, makeMonDistance, makeMonDistanceCB, makeVirtualCB, molecularWeight, preCalculateAlphaTau, preCalculateAngles, preCalculatePhiPsi, predictSS, predictSS, readAccess, readCASP, readConv, readDSSP, readEA, readPDB, readPDBAtom, readProf, readSwissProt, readVar, readVar2, residues, reverse, runDSSP, showGaps, smoothHE, stripAllButCA, thread, thread, translateEA, writeCASP, writeConv, writeEA, writePDB, writePDB, writeVar2
 
Methods inherited from class org.strbio.mol.Polymer
alignToArray, alignToInverseArray, autoSplit, centerOfMass, centerOfMass, clear, clearProperties, clearProperty, copyAtoms, correctAlignFold, correctAlignSeq, deleteMonDistance, firstMon, getGlobalAlignment, getMonomers, getNonGapMonomers, getProperty, getValidNonGapMonomers, globalAlign, globalAlign, globalCompare, globalCompare, keepMonomers, kludgeChainID, lastMon, length, makeMonDistanceAllAtomMin, minareaSuperimpose, monomer, nMonomers, nMonomers, nonGapMonomers, pad, printSequence, quickAlign, quickAlign, quickCompare, quickCompare, read, read, read, readFasta, readList, readSequence, readYAPF, renumberMonomers, renumberValidNonGapMonomers, reverseCopy, rotate, sequence, sequenceNonGap, sequenceValidNonGap, setProperty, stripAllAtoms, stripAllBut, stripAllButFirstAtom, stripAtomsByName, stripCommonGaps, stripGaps, stripInvalidAndGaps, stripNoAtoms, stripType, transform, translate, trimEnds, validNonGapMonomers, writePDBAtom, writePDBSeqres, writePTS, writeYAPF, writeYAPFAtom, YAPFGetNextMonomer
 
Methods inherited from class org.strbio.mol.Molecule
atomSearch, copyAtoms
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

seqName

public java.lang.String[] seqName
Name of each of the sequences in the profile, if sequence information is available; otherwise, null. If sequence information is available, this array must have one entry per sequence, even if every seqName[i] is null (a null entry indicates that the sequence exists but its name is unknown).


blastLog10E

public double[] blastLog10E
E values for each sequence, if known... this should measure how close the sequence is to the key sequence, and how likely the match was in the database searched. This is NaN if unknown.


blastBits

public int[] blastBits
Blast score for each sequence, if known... this should measure how close the sequence is to the key sequence. Meaning: a hit with a score of N bits is expected to occur at random when searching 2^N sequences. This is zero if unknown.
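
As a rough illustration of how the two BLAST fields above might be interpreted, the sketch below converts a stored log10(E) back into an E value and turns a bit score into the 2^N search size described above. The class and method names are hypothetical and are not part of org.strbio.mol; the handling of the "unknown" markers (NaN and zero) follows the field descriptions only.

```java
// Hypothetical helper for illustration; not part of org.strbio.mol.
final class BlastScoreSketch {
    /** Converts a stored log10(E) back to an E value; NaN (unknown) stays NaN. */
    static double eValue(double log10E) {
        return Math.pow(10.0, log10E);
    }

    /**
     * Number of sequences (2^bits) at which a random hit with this bit score
     * is expected. A score of zero means "unknown", so NaN is returned.
     */
    static double randomHitSearchSize(int bits) {
        if (bits == 0) {
            return Double.NaN;
        }
        return Math.pow(2.0, bits);
    }
}
```

For example, a hit stored with blastLog10E = -3 corresponds to E = 0.001, and a 10-bit hit is expected at random once per 1024 sequences searched.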

Constructor Detail

Profile

public Profile()
Make an empty Profile.


Profile

public Profile(Protein q)
Make a Profile from a Protein, copying all data.


Profile

public Profile(Profile q)
Copy a Profile, including all data in it.


Profile

public Profile(ProteinSet q)
Make a Profile from a ProteinSet, copying all data. Most information is taken from the first protein in the set [protein(0)]. Other proteins in the set contribute only their sequence. Conservation weights and the consensus sequence are then calculated. All proteins in the set should be the same length, and pre-aligned; proteins longer than the first one in the set will be truncated.

Method Detail

sequences

public int sequences()
Number of sequences in the profile, if known; otherwise, zero.


findSeqByName

public int findSeqByName(java.lang.String name)
Find a sequence by name, if it exists. Returns the number of the sequence. If not found, returns -1.
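
The lookup contract above (index on success, -1 otherwise) can be sketched as a linear scan over a name array. This is a standalone illustration with a hypothetical class name; exact, case-sensitive matching is an assumption, since the documentation does not say how names are compared.

```java
// Hypothetical standalone sketch; not the library's implementation.
final class SeqNameLookupSketch {
    /** Returns the index of the first entry equal to name, or -1 if absent. */
    static int findSeqByName(String[] seqName, String name) {
        if (seqName == null || name == null) {
            return -1; // no sequence information, or nothing to look for
        }
        for (int i = 0; i < seqName.length; i++) {
            // Assumption: exact, case-sensitive comparison; null entries never match.
            if (name.equals(seqName[i])) {
                return i;
            }
        }
        return -1;
    }
}
```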


getSeqName

public java.lang.String getSeqName(int i)
Get the name of sequence i.


setSeqName

public void setSeqName(int i,
                       java.lang.String s)
Set the name of sequence i.


copy

public Polymer copy()
Return a copy of yourself.

Overrides:
copy in class Protein

newMonomer

public Monomer newMonomer()
Profiles are made of ProfRes-type monomers.

Overrides:
newMonomer in class Protein

newMonomer

public Monomer newMonomer(char t)
Description copied from class: Protein
This should return a new monomer of whatever type this polymer is made of (e.g. Residue, Base), and initialize the type from a character.

Overrides:
newMonomer in class Protein

newMonomer

public Monomer newMonomer(java.lang.String s)
Description copied from class: Protein
This should return a new monomer of whatever type this polymer is made of (e.g. Residue, Base), and initialize the type from a string.

Overrides:
newMonomer in class Protein

splitCopy

protected void splitCopy(Profile q)
When a Profile is split, sequence names should go to each child. This copies these from another profile.


splitCopy

protected Polymer splitCopy()
When a Profile is split, sequence names should go to each child.

Overrides:
splitCopy in class Protein

clearSeqs

public void clearSeqs()
Tell the Profile to forget any sequence information it knows.


allocSeqsRes

public void allocSeqsRes()
Tell all residues in the profile to allocate space for at least as many sequences as we know about.


allocSeqs

public void allocSeqs(int n)
Tell the Profile to allocate space for at least N known sequences. All old information is copied. The default sequence character is '-' (same as the default Monomer type).
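
The grow-and-preserve behaviour described above can be illustrated on a single residue's sequence array. This is a minimal sketch under the assumption that each residue holds one character per sequence; the class name and the use of a plain char[] are hypothetical and may not match how ProfRes stores its data.

```java
import java.util.Arrays;

// Hypothetical sketch of "allocate space for at least n sequences".
final class AllocSeqsSketch {
    /** Returns an array of at least n slots, keeping old data and padding with '-'. */
    static char[] grow(char[] old, int n) {
        int oldLen = (old == null) ? 0 : old.length;
        if (oldLen >= n) {
            return old; // already large enough; nothing to do
        }
        char[] bigger = new char[n];
        Arrays.fill(bigger, '-'); // default sequence character is the gap
        if (old != null) {
            System.arraycopy(old, 0, bigger, 0, oldLen); // old information is copied
        }
        return bigger;
    }
}
```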


addSeq

public final void addSeq(java.lang.String newSeqName)
Tell the profile to add a sequence with a given name.


addKeySeq

public final void addKeySeq()
Tell the profile to add the "key sequence"... another sequence with the sequence of this protein.


addSeqsDirectlyFrom

public final void addSeqsDirectlyFrom(Profile q)
Add additional sequences from another profile of the same length.


copySeqsDirectlyFrom

public final void copySeqsDirectlyFrom(Profile q)
Copy additional sequences from another profile of the same length, eliminating current sequences.


addSeqsFrom

public final void addSeqsFrom(Profile q)
Add sequence info from another profile; aligns and copies.


copySeqsFrom

public final void copySeqsFrom(Profile q)
Copy sequence info from another profile; aligns and copies, replacing current sequence info.


sequence

public final java.lang.String sequence(int n)
Return a string containing one of the sequences in the profile.


setSequence

public final void setSequence(int n,
                              java.lang.String newseq)
Set the sequence for one of the sequences in the profile. Make sure the string has the same number of characters as the length of the profile.


removeSeq

public final void removeSeq(int n)
Remove one of the sequences.


removeRedundantSequences

public final void removeRedundantSequences(boolean removeSubSequences)
Eliminate redundant sequences (or subsequences) from the set.
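
One plausible reading of "redundant" is sketched below: a row is dropped if its ungapped residues are identical to an earlier row, and, when removeSubSequences is set, also if they are contained within a longer row. Both definitions are assumptions for illustration; the documentation does not spell out the criteria, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real redundancy criteria are not documented here.
final class RedundancySketch {
    /** Returns the indices of aligned rows that would be kept. */
    static List<Integer> keptRows(String[] rows, boolean removeSubSequences) {
        String[] bare = new String[rows.length];
        for (int i = 0; i < rows.length; i++) {
            bare[i] = rows[i].replace("-", ""); // compare residues only, ignoring gaps
        }
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            boolean redundant = false;
            for (int j = 0; j < rows.length && !redundant; j++) {
                if (i == j) {
                    continue;
                }
                if (bare[i].equals(bare[j])) {
                    redundant = j < i; // keep only the first of identical rows
                } else if (removeSubSequences && bare[j].contains(bare[i])) {
                    redundant = true;  // strictly contained in a longer row
                }
            }
            if (!redundant) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

With rows "ACDE", "ACDE", "-CD-" and "WXYZ", this sketch keeps rows 0, 2 and 3 when removeSubSequences is false, and rows 0 and 3 when it is true.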


findFrequencies

public final void findFrequencies()
Calculate the frequencies for each residue in the profile, but leave the consensus residue as it is.


findNonZeroFrequencies

public final void findNonZeroFrequencies()
Find the non-zero frequencies for each residue in the profile.


findConsensus

public final void findConsensus()
Find the consensus residue for each residue in the profile. This also finds the frequencies as a side effect.
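
To make the relationship between frequencies and the consensus concrete, the sketch below computes per-column letter frequencies over a set of equal-length aligned rows and picks the most frequent letter as the consensus. Ignoring gaps when counting, breaking ties alphabetically, and emitting '-' for all-gap columns are assumptions; the class name is hypothetical.

```java
// Hypothetical sketch; tie-breaking and gap handling are assumptions.
final class ConsensusSketch {
    /** Fraction of each letter A..Z among the non-gap characters of one column. */
    static double[] columnFrequencies(String[] rows, int col) {
        int[] counts = new int[26];
        int total = 0;
        for (String row : rows) {
            char c = Character.toUpperCase(row.charAt(col));
            if (c >= 'A' && c <= 'Z') { // gaps and other symbols are skipped
                counts[c - 'A']++;
                total++;
            }
        }
        double[] freq = new double[26];
        if (total > 0) {
            for (int k = 0; k < 26; k++) {
                freq[k] = (double) counts[k] / total;
            }
        }
        return freq;
    }

    /** Most frequent letter per column; '-' where a column holds no letters. */
    static String consensus(String[] rows) {
        int len = rows[0].length();
        StringBuilder sb = new StringBuilder(len);
        for (int col = 0; col < len; col++) {
            double[] freq = columnFrequencies(rows, col);
            int best = -1;
            for (int k = 0; k < 26; k++) {
                // Strict '>' means the alphabetically first letter wins ties.
                if (freq[k] > 0 && (best < 0 || freq[k] > freq[best])) {
                    best = k;
                }
            }
            sb.append(best < 0 ? '-' : (char) ('A' + best));
        }
        return sb.toString();
    }
}
```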


seqPctID

public double seqPctID(int r,
                       int s)
Find percent identity between two sequences, without gap penalties. Regions where one sequence extends beyond the other (at the beginning or end) are not counted. Positions where both sequences have gaps are not counted. A gap in only one of the two sequences counts as a mismatch.
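
The counting rules above can be sketched on two aligned strings of equal length. This is an illustration, not the library's code: the class name is hypothetical, '-' is assumed to be the gap character, and the result is assumed to be on a 0 to 100 scale.

```java
// Hypothetical sketch of the counting rules; scale (0..100) is an assumption.
final class PctIdSketch {
    private static int firstNonGap(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != '-') return i;
        }
        return -1;
    }

    private static int lastNonGap(String s) {
        for (int i = s.length() - 1; i >= 0; i--) {
            if (s.charAt(i) != '-') return i;
        }
        return -1;
    }

    /** Percent identity over the region where both aligned sequences overlap. */
    static double pctID(String a, String b) {
        int fa = firstNonGap(a);
        int fb = firstNonGap(b);
        if (fa < 0 || fb < 0) {
            return 0.0; // at least one sequence is all gaps
        }
        int start = Math.max(fa, fb);                       // skip leading overhang
        int end = Math.min(lastNonGap(a), lastNonGap(b));   // skip trailing overhang
        int compared = 0;
        int same = 0;
        for (int i = start; i <= end; i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x == '-' && y == '-') {
                continue; // both gaps: not counted
            }
            compared++;   // a gap on one side only counts as a mismatch
            if (x == y) {
                same++;
            }
        }
        return compared == 0 ? 0.0 : 100.0 * same / compared;
    }
}
```

For the rows "--ACDEF-G" and "WXAC-EFHG", the first two columns are overhang and are skipped; of the remaining seven columns five match, giving roughly 71.4 percent.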


seqPctCoverage

public double seqPctCoverage(int r,
                             int s)
Find percent coverage of the second sequence by the first. Positions where both sequences have gaps are not counted.


findConservationWeights

public final void findConservationWeights()
Calculate conservation weight for each residue in the profile. This black magic brought to you courtesy of: Rost & Sander, JMB, 1993, v. 232, p. 590. Note: changed to use only the first 500 sequences.


firstNonGap

public final int firstNonGap(int n)
What's the position of the first non-gap residue in sequence N? If sequence information is unknown or invalid, returns -1. If sequence is all gaps, returns -1.


lastNonGap

public final int lastNonGap(int n)
What's the position of the last non-gap residue in sequence N? If sequence information is unknown or invalid, returns -1. If sequence is all gaps, returns -1.
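
A minimal sketch of the two boundary queries, applied to a single aligned row held as a string: one scans from the start, the other from the end, and both return -1 for a missing or all-gap row. Zero-based positions and '-' as the gap character are assumptions, and the class name is hypothetical.

```java
// Hypothetical sketch; positions are assumed to be zero-based.
final class GapBoundsSketch {
    /** Index of the first non-gap character, or -1 if none (or seq is null). */
    static int firstNonGap(String seq) {
        if (seq == null) {
            return -1;
        }
        for (int i = 0; i < seq.length(); i++) {
            if (seq.charAt(i) != '-') {
                return i;
            }
        }
        return -1;
    }

    /** Index of the last non-gap character, or -1 if none (or seq is null). */
    static int lastNonGap(String seq) {
        if (seq == null) {
            return -1;
        }
        for (int i = seq.length() - 1; i >= 0; i--) {
            if (seq.charAt(i) != '-') {
                return i;
            }
        }
        return -1;
    }
}
```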


printProfile

public final void printProfile(Printf outfile,
                               int pad)
                        throws java.io.IOException
Print profile to output in a pretty format.

Parameters:
outfile - where to print to
pad - number of leading spaces to pad output with
Throws:
java.io.IOException

writeTDP

public final void writeTDP(Printf outfile)
                    throws java.io.IOException
A simple way of printing the profile, which var-tom uses as input. TDP = Tom Defay Profile.

Throws:
java.io.IOException

processYAPF

public void processYAPF(java.lang.String buffer)
                 throws java.io.IOException
Process a line from a YAPF file... this should ignore the line if it doesn't know what it is.

Overrides:
processYAPF in class Protein
Throws:
java.io.IOException

writeYAPFInfo

protected void writeYAPFInfo(Printf outfile)
                      throws java.io.IOException
Write applicable sections of YAPF info.

Overrides:
writeYAPFInfo in class Protein
Throws:
java.io.IOException

printProfile

public final void printProfile(Printf outfile)
                        throws java.io.IOException
Print profile to output in a pretty format.

Parameters:
outfile - where to print to
Throws:
java.io.IOException

truncateNames

public final void truncateNames(int maxChars)
This truncates all names to a given length (or less).


removeSpacesInNames

public final void removeSpacesInNames()
This changes all spaces in sequence names to underscores ('_').
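
The two name clean-up operations above (truncating names and replacing spaces) can be sketched on a plain name array. The class name is hypothetical, and leaving null entries untouched is an assumption based on the seqName field description, where null means "name unknown".

```java
// Hypothetical sketch of the two name clean-up steps.
final class NameCleanupSketch {
    /** Replaces every space in every non-null name with an underscore. */
    static void removeSpaces(String[] names) {
        for (int i = 0; i < names.length; i++) {
            if (names[i] != null) {
                names[i] = names[i].replace(' ', '_');
            }
        }
    }

    /** Cuts every non-null name down to at most maxChars characters. */
    static void truncate(String[] names, int maxChars) {
        for (int i = 0; i < names.length; i++) {
            if (names[i] != null && names[i].length() > maxChars) {
                names[i] = names[i].substring(0, maxChars);
            }
        }
    }
}
```

Formats with fixed-width name columns are a typical reason to apply both steps before writing an alignment.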


ensureUniqueNames

public final void ensureUniqueNames(int maxNameLen)
This makes sure that every seqName is unique and non-null. It uses capital letters to distinguish names. Bug: this can only generate 26^2 unique names, or 26 if maxNameLen is 1. It will fail silently in those cases.
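
The 26^2 limit noted above follows naturally if duplicates are disambiguated with a two-letter capital suffix (or a single letter when maxNameLen is 1). The sketch below shows one such scheme; the suffixing strategy, the class name, and the boolean return value are all assumptions for illustration, and unlike the documented method this version reports failure instead of failing silently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the real method's renaming scheme is not documented here.
final class UniqueNamesSketch {
    /** Makes names unique and non-null in place; returns false if suffixes ran out. */
    static boolean ensureUnique(String[] names, int maxNameLen) {
        if (maxNameLen < 1) {
            throw new IllegalArgumentException("maxNameLen must be at least 1");
        }
        int suffixLen = Math.min(2, maxNameLen);
        int limit = (suffixLen == 2) ? 26 * 26 : 26; // AA..ZZ, or A..Z
        Set<String> seen = new HashSet<>();
        boolean ok = true;
        for (int i = 0; i < names.length; i++) {
            String base = (names[i] == null) ? "" : names[i];
            if (base.length() > maxNameLen) {
                base = base.substring(0, maxNameLen);
            }
            String stem = base.substring(0, Math.min(base.length(), maxNameLen - suffixLen));
            String candidate = base;
            int k = 0;
            // Try suffixes until the name is non-empty and unused, or we run out.
            while ((candidate.isEmpty() || seen.contains(candidate)) && k < limit) {
                String suffix = (suffixLen == 2)
                        ? "" + (char) ('A' + k / 26) + (char) ('A' + k % 26)
                        : "" + (char) ('A' + k);
                candidate = stem + suffix;
                k++;
            }
            if (candidate.isEmpty() || seen.contains(candidate)) {
                ok = false; // suffix space exhausted
            } else {
                seen.add(candidate);
            }
            names[i] = candidate;
        }
        return ok;
    }
}
```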


ensureUniqueNames

public final void ensureUniqueNames()
This makes sure that every seqName is unique and non-null.


readHSSP

public final void readHSSP(java.io.BufferedReader infile,
                           Printf outfile)
                    throws java.io.IOException
Read a protein out of an HSSP (Sander & Schneider format) file.

Parameters:
infile - an open HSSP file
outfile - if non-null, will print info on what's going on
Throws:
java.io.IOException

readMSF

public final void readMSF(java.io.BufferedReader infile,
                          Printf outfile)
                   throws java.io.IOException
Read a protein out of an MSF (GCG's format) file.

Design note: this code belongs in a MultipleAlignment class. A Profile should be able to turn itself into a MultipleAlignment (and vice versa), and this method should be called from here but implemented there.

Parameters:
infile - an open MSF file
outfile - if non-null, will print info on what's going on
Throws:
java.io.IOException

readClustal

public final void readClustal(java.io.BufferedReader infile,
                              Printf outfile)
                       throws java.io.IOException
Read a protein out of a CLUSTAL format file.

Parameters:
infile - an open CLUSTAL file
outfile - if non-null, will print info on what's going on
Throws:
java.io.IOException

blast

public void blast(Printf outfile,
                  Blast blastServer)
Run BLAST on this profile, using a specified server. Add answers from blast to the current profile.


blast

public void blast(Printf outfile)
Run BLAST on this profile, using NCBI's blast server.


readBLAST

public final void readBLAST(java.io.BufferedReader infile,
                            Printf outfile)
                     throws java.io.IOException
This reads a profile from BLASTP 2.0.11 output. This is not a standard format, so it may change in the future without notice. It also works with at least some older versions (2.0.x, and 1.4.11 tested).

Throws:
java.io.IOException

recognizeAndRead

protected boolean recognizeAndRead(java.lang.String buffer,
                                   java.io.BufferedReader infile,
                                   Printf outfile)
                            throws java.io.IOException
This looks for a profile and reads it in.

Overrides:
recognizeAndRead in class Protein
Throws:
java.io.IOException
See Also:
Protein.recognizeAndRead(java.lang.String, java.io.BufferedReader, org.strbio.io.Printf)

writeMSF

public final void writeMSF(Printf outfile)
                    throws java.io.IOException
Write in MSF format.

Design note: this code belongs in a MultipleAlignment class. A Profile should be able to turn itself into a MultipleAlignment (and vice versa), and this method should be called from here but implemented there.

Parameters:
outfile - where to write to
Throws:
java.io.IOException

writeSAF

public final void writeSAF(Printf outfile)
                    throws java.io.IOException
Write in Burkhard Rost's SAF format.

Parameters:
outfile - where to write to
Throws:
java.io.IOException

writeClustal

public final void writeClustal(Printf outfile)
                        throws java.io.IOException
Write in CLUSTAL format. Note that . and : are not written, since I don't know exactly how these are calculated.

Parameters:
outfile - where to write to
Throws:
java.io.IOException

writeProf

public final void writeProf(Printf outfile)
                     throws java.io.IOException
Write in Prof format.

Parameters:
outfile - where to write to
Throws:
java.io.IOException

choose

public final Profile choose(int seqNum)
Make a profile based on this one, featuring one of the subsequences. The profile will have all gaps in that sequence removed, and feature its sequence instead of the consensus sequence.
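
The gap-removal step described above amounts to dropping every alignment column in which the chosen sequence has a gap. The sketch below shows that column filter on plain strings; it assumes equal-length rows and '-' as the gap character, uses a hypothetical class name, and does not attempt the rest of what choose does (such as building a new Profile).

```java
// Hypothetical sketch of the column filter only; not a full choose().
final class ChooseSketch {
    /** Keeps only the columns in which rows[seqNum] is not a gap. */
    static String[] dropGapColumns(String[] rows, int seqNum) {
        String chosen = rows[seqNum];
        StringBuilder[] out = new StringBuilder[rows.length];
        for (int r = 0; r < rows.length; r++) {
            out[r] = new StringBuilder();
        }
        for (int col = 0; col < chosen.length(); col++) {
            if (chosen.charAt(col) == '-') {
                continue; // the chosen sequence has a gap here: drop the column
            }
            for (int r = 0; r < rows.length; r++) {
                out[r].append(rows[r].charAt(col));
            }
        }
        String[] result = new String[rows.length];
        for (int r = 0; r < rows.length; r++) {
            result[r] = out[r].toString();
        }
        return result;
    }
}
```

Choosing row 1 of "AC-DE", "A-GD-", "-CGDE" keeps columns 0, 2 and 3, giving "A-D", "AGD" and "-GD".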


writeFasta

public void writeFasta(Printf outfile)
                throws java.io.IOException
Write each sequence in the profile in Fasta format.

Overrides:
writeFasta in class Polymer
Parameters:
outfile - where to write to
Throws:
java.io.IOException
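
For readers unfamiliar with the output, a multi-sequence FASTA dump looks like the result of the sketch below: one '>' header line per sequence followed by the residues wrapped at a fixed width. The class name, the fallback name for null entries, the line width parameter, and keeping gap characters in the output are all assumptions; the actual method writes to a Printf and may differ in each of these details.

```java
// Hypothetical sketch that builds FASTA text in memory.
final class FastaSketch {
    /** Formats each (name, row) pair as a FASTA record wrapped at lineWidth. */
    static String toFasta(String[] names, String[] rows, int lineWidth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows.length; i++) {
            // Assumption: unnamed sequences get a placeholder based on their index.
            String name = (names[i] == null) ? "seq" + i : names[i];
            sb.append('>').append(name).append('\n');
            String s = rows[i];
            for (int p = 0; p < s.length(); p += lineWidth) {
                sb.append(s, p, Math.min(s.length(), p + lineWidth)).append('\n');
            }
        }
        return sb.toString();
    }
}
```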

doVarTom

public Protein doVarTom(Printf outfile)
                 throws java.io.IOException
Do var-tom on this profile.

Throws:
java.io.IOException