Composition Profiler - version 1.1 (April 2007), version 1.2 (August 2019)

Copyright (c) 2007 Vladimir Vacic, Vladimir N. Uversky, A. Keith Dunker,
Stefano Lonardi.

Composition Profiler incorporates portions of code from the Cephes Math
Library; detailed licensing information can be found in the LICENSE.txt
file.


CONTENTS:

cgi-bin  - This directory contains Ruby scripts for generating
           composition profiles. The main script for the web (CGI)
           application is profiler.cgi, and the main scripts for the
           command line application are cdiscover.rb and cprofile.rb.

datasets - Datasets used to build the examples on the web page and to
           compute standard protein database statistics.

html     - Help, examples and credits HTML pages. 

cc       - C code used for bootstrapping and to compute statistical 
           significance of the difference and relative entropy between 
           two samples.



SYSTEM REQUIREMENTS:

Composition Profiler requires the Ruby interpreter, GhostScript (a
PostScript interpreter) and ImageMagick. All three programs are
installed by default on many Linux distributions; if they are not
installed, they can be downloaded free of charge from:

Ruby        - http://www.ruby-lang.org 
GhostScript - http://www.cs.wisc.edu/~ghost
ImageMagick - http://www.imagemagick.org

The Ruby interpreter is normally in the system path (type ruby -v at
the system prompt to verify this). GhostScript and ImageMagick are
usually in the system path as well; if they are not, the locations of
the binaries can be specified in the cprof.conf file.
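
The check described above can be sketched in Ruby as follows. This is
only an illustration: the executable names "gs" (GhostScript) and
"convert" (ImageMagick) are assumptions and may differ on your system,
and find_executable is a hypothetical helper, not part of Composition
Profiler.

```ruby
# Return the full path of an executable found on the given search path,
# or nil when no matching executable exists.
def find_executable(name, search_path = ENV.fetch("PATH", ""))
  search_path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# "gs" and "convert" are assumed executable names; adjust as needed.
%w[ruby gs convert].each do |tool|
  location = find_executable(tool)
  puts format("%-8s %s", tool, location || "not found in PATH")
end
```

Any program reported as "not found in PATH" is a candidate for an
explicit entry in cprof.conf.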

In addition to these three, the web version requires a running web
server. Composition Profiler has been tested on Apache with the
mod_ruby module, which can be downloaded from:

mod_ruby    - http://www.modruby.net

Composition Profiler was tested with Ruby 1.8, GhostScript 8.56,
ImageMagick 6.3.4 and mod_ruby 1.2.5 on the Fedora Core and Ubuntu
Linux distributions and on Mac OS X.


C PROGRAMS:

Composition Profiler uses three programs written in C ("pvalue",
"frequency" and "rentropy") for computationally intensive
calculations. The source code of these three programs can be found in
the /cc directory. Before they can be used, they need to be compiled on
the platform on which they will be run. A Makefile is provided; it
suffices to type "make" at the command prompt in the /cc directory
and copy the executables into the directory with the Ruby scripts.
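
The two steps above (build, then copy) could be automated with a short
Ruby script such as the sketch below. It assumes that "make" leaves
executables named pvalue, frequency and rentropy in the cc directory;
the function names and the directory arguments are hypothetical.

```ruby
require "fileutils"

# Names of the three C programs listed in this README.
PROGRAMS = %w[pvalue frequency rentropy]

# Run make in the C source directory and stop if the build fails.
def build_programs(cc_dir)
  raise "make failed in #{cc_dir}" unless system("make", "-C", cc_dir)
end

# Copy the compiled programs next to the Ruby scripts and make sure
# they are executable. Assumes make left them in cc_dir.
def copy_programs(cc_dir, script_dir, programs = PROGRAMS)
  programs.each do |name|
    source = File.join(cc_dir, name)
    raise "#{source} not found; did make succeed?" unless File.file?(source)
    target = File.join(script_dir, name)
    FileUtils.cp(source, target)
    FileUtils.chmod(0755, target)
  end
end

# Example, run from the top-level directory (paths are assumptions):
#   build_programs("cc")
#   copy_programs("cc", "cgi-bin")
```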



COMMAND LINE ARGUMENTS:

Usage: cdiscover.rb -Q <query file> [options]
Looks for statistically significant composition differences between two sets.

Mandatory arguments:
  -Q <query file>

Optional arguments:
  -B <background file>
     or
  -D <known distribution>    One of the following:
                             disprot       Disordered regions from DisProt 3.4
                             pdbs25        PDB Select 25
                             sprot         Proteins from SwissProt
                             surface       Surface residues of monomers from PDB
                             Defaults to sprot.

  -A <alpha value>           Significance level for the statistical test.
                             Defaults to 0.05.

  -b                         Bonferroni correction.
                             Off by default.
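
As an illustration of how -A and -b interact, the sketch below applies
the standard Bonferroni correction, which divides alpha by the number
of tests. It assumes one test per amino acid (20 tests); the number of
tests cdiscover.rb actually performs is not stated here, and the
function names are hypothetical.

```ruby
# Assumed number of simultaneous tests: one per standard amino acid.
NUM_RESIDUES = 20

# Bonferroni-corrected significance threshold: alpha / number of tests.
def bonferroni_alpha(alpha, num_tests = NUM_RESIDUES)
  raise ArgumentError, "num_tests must be positive" unless num_tests > 0
  alpha / num_tests.to_f
end

# Decide whether a p-value is significant, with or without correction.
def significant?(p_value, alpha, bonferroni = false, num_tests = NUM_RESIDUES)
  threshold = bonferroni ? bonferroni_alpha(alpha, num_tests) : alpha
  p_value < threshold
end

puts bonferroni_alpha(0.05)          # threshold used when -b is given
puts significant?(0.01, 0.05)        # uncorrected test
puts significant?(0.01, 0.05, true)  # corrected test
```

Under these assumptions, a residue with p = 0.01 is reported at the
default alpha of 0.05 but not once the correction is switched on.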

------------

Usage: cprofile.rb -Q <query file> -O <output file> [options]
Creates a composition profile for the input FastA file.

Mandatory arguments:
  -Q <query file>

  -O <output file>           Output file name.

Optional arguments:
  -B <background file>
     or
  -D <known distribution>    One of the following:
                             disprot       Disordered regions from DisProt 3.4
                             pdbs25        PDB Select 25
                             sprot         Proteins from SwissProt
                             surface       Surface residues of monomers from PDB
                             Defaults to sprot.

  -C <color scheme>          One of the following:        
                             alpha_n       Alpha helix frequency (N)
                             amino         Amino color scheme
                             aromatics     Aromatics
                             beta_n        Beta structure frequency (N)
                             bw            Black and white
                             bulkiness_z   Bulkiness (Z)
                             charge        Charge
                             coil_n        Coil propensity (N)
                             disorder_d    Disorder propensity (D)
                             flex_v        Flexibility
                             hydro_e       Hydrophobicity (E)
                             hydro_kd      Hydrophobicity (K-D)
                             hydro_fp      Hydrophobicity (F-P)
                             interface_jt  Interface propensity (J-T)
                             linker_gh     Linker propensity (G-H)
                             polarity_z    Polarity (Z)
                             shapley       Shapley color scheme
                             size_d        Size (D)
                             surface_j     Surface exposure (J)
                             solvation_jt  Solvation potential (J-T)
                             weblogo       Weblogo color scheme
                             Defaults to bw.

  -F <format>                Format of output (EPS, GIF, PDF, PNG, TXT). 
                             Defaults to PNG.

  -H <image height>          Height of output image.
                             Defaults to 3.5".

  -I <iterations>            Number of bootstrap iterations.
                             Defaults to 10000.

  -R <resolution>            Bitmap resolution. 
                             Defaults to 96.

  -S <order>                 Sorts residues in the increasing order of
                             one of the physico-chemical or structural properties:
                             alpha        Alphabetical order
                             alpha_n      Alpha helix frequency (Nagano)
                             diff         By observed differences
                             beta_n       Beta structure frequency (Nagano)
                             bulkiness_z  Bulkiness (Zimmerman)
                             coil_n       Coil propensity (Nagano)
                             flex_v       Flexibility (Vihinen)
                             hydro_e      Hydrophobicity (Eisenberg)
                             hydro_kd     Hydrophobicity (Kyte-Doolittle)
                             hydro_fp     Hydrophobicity (Fauchere-Pliska)
                             interface_jt Interface propensity (Jones-Thornton)
                             linker_gh    Linker propensity (George-Heringa)
                             polarity_z   Polarity (Zimmerman)
                             size_d       Size (Dawson)
                             surface_j    Surface exposure (Janin)
                             solvation_jt Solvation potential (Jones-Thornton)
                             Defaults to alphabetical order.

  -U <units>                 Chart dimensions units (cm, inch, pixel, point).
                             Defaults to cm. 

  -W <image width>           Width of output image. Defaults to 5".

  -X <res units>             Resolution units when bitmap resolution is 
                             specified (ppi, ppc, ppp). Defaults to ppi.

  -Y <label>                 Y-axis label.


Optional toggles (no values associated):
  -a                         Toggle antialiasing.
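
The -H, -W, -U, -R and -X options together determine the size of a
bitmap in pixels. The sketch below shows the arithmetic under stated
assumptions: ppi, ppc and ppp are read as pixels per inch, per
centimeter and per point, with 1 inch = 2.54 cm = 72 points. How
cprofile.rb itself performs the conversion is not documented here, and
the names used are hypothetical.

```ruby
# Number of each length unit in one inch.
UNITS_PER_INCH = { "inch" => 1.0, "cm" => 2.54, "point" => 72.0 }

# Factor that converts a resolution in the given unit to pixels per inch.
TO_PIXELS_PER_INCH = { "ppi" => 1.0, "ppc" => 2.54, "ppp" => 72.0 }

# Convert a chart dimension to whole pixels.
def pixels(length, units, resolution, res_units)
  return length.round if units == "pixel"
  inches = length / UNITS_PER_INCH.fetch(units)
  ppi = resolution * TO_PIXELS_PER_INCH.fetch(res_units)
  (inches * ppi).round
end

puts pixels(5, "inch", 96, "ppi")    # width of 5 inches at 96 ppi
puts pixels(3.5, "inch", 96, "ppi")  # height of 3.5 inches at 96 ppi
```

With these assumptions, a 5 x 3.5 inch chart at 96 ppi is 480 x 336
pixels.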



COMMAND LINE EXAMPLES:

Simple command line examples for discovery and plotting of composition 
anomalies for alpha-MoRF residues:

./cdiscover.rb -Q ../datasets/alpha_morf.fa -D pdbs25

------------

./cprofile.rb -Q ../datasets/alpha_morf.fa -O alpha.png -F PNG -D pdbs25 \
-S flex_v -C disorder_d -Y "(Alpha MoRF - PDBS25) / PDBS25" -a
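
The y-axis label in the example above suggests that the plotted value
for each residue is the fractional difference between the query and
background compositions, (query - background) / background. The sketch
below computes that quantity from FastA text. It is inferred from the
label only and is not the code used by cprofile.rb; all names are
hypothetical.

```ruby
# The 20 standard amino acids, in alphabetical order of one-letter code.
RESIDUES = "ACDEFGHIKLMNPQRSTVWY".chars

# Concatenate all sequence lines of a FastA text, skipping ">" headers.
def read_fasta_residues(text)
  text.each_line.reject { |line| line.start_with?(">") }
      .map(&:strip).join.upcase
end

# Fraction of each standard residue in a sequence string.
def composition(sequence)
  counts = Hash.new(0)
  sequence.each_char { |c| counts[c] += 1 if RESIDUES.include?(c) }
  total = counts.values.sum.to_f
  RESIDUES.to_h { |r| [r, total.zero? ? 0.0 : counts[r] / total] }
end

# (query - background) / background per residue; nil when the residue
# is absent from the background and the ratio is undefined.
def fractional_difference(query, background)
  RESIDUES.to_h do |r|
    b = background[r]
    [r, b.zero? ? nil : (query[r] - b) / b]
  end
end

query = composition(read_fasta_residues(">q\nAAAC\n"))
background = composition(read_fasta_residues(">b\nAC\n"))
puts fractional_difference(query, background)["A"]
```

In this toy input, alanine makes up 75% of the query and 50% of the
background, so its fractional difference is 0.5 (enriched by 50%).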

------------

./rentropy ../datasets/heterodimers.fa ../datasets/homodimers.fa 10000

The first line of the output is the relative entropy, the second line
is the p-value (details of estimating the p-value are given in the
paper).
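
When rentropy is called from another script, its two-line output can
be split into named values. The sketch below assumes that each line
holds a single number and nothing else; the sample text is made up for
illustration and is not real program output.

```ruby
# Parse the two-line output of rentropy: relative entropy, then p-value.
# Assumes each line contains just one number.
def parse_rentropy_output(text)
  lines = text.lines.map(&:strip).reject(&:empty?)
  raise ArgumentError, "expected two lines of output" if lines.size < 2
  { relative_entropy: Float(lines[0]), p_value: Float(lines[1]) }
end

# Made-up sample text, for illustration only.
result = parse_rentropy_output("0.0123\n0.0004\n")
puts "relative entropy = #{result[:relative_entropy]}"
puts "p-value          = #{result[:p_value]}"

# With the real program the text could be captured with backticks, e.g.
#   text = `./rentropy ../datasets/heterodimers.fa ../datasets/homodimers.fa 10000`
```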



WEB APPLICATION SETTINGS:

For security reasons, the CGI scripts are kept separate from the HTML
documents and images. The CGI scripts are in a subdirectory of cgi-bin,
and the HTML documents are in a subdirectory of the Apache web document
root. For an Apache web server running on a Linux system with the
default Apache settings, those are "/var/www/cgi-bin" and
"/var/www/html", respectively.

For the Composition Profiler CGI script to be able to link to the HTML
documents (such as help files, images, etc.), the path of the HTML
documents relative to the CGI scripts directory has to be specified in
the "path" variable of profiler.cgi. For example:

path = "../../profiler/"

In addition to this, the profiler.cgi script needs to be configured to write
the output images into an Apache-writable directory under the Apache web
document root, so they can be displayed to the user. This is done using
the "temp" variable. For example:

temp = "/var/www/html/temp/"
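
Before starting the web application, the "temp" setting can be checked
with a few lines of Ruby. The sketch below is a hypothetical helper,
not part of profiler.cgi. Note that it tests write access for the user
running the check, which may differ from the user the Apache server
runs as, and that the trailing-slash test simply mirrors the examples
above.

```ruby
# Return a list of problems found with the output image directory.
# An empty list means the checks passed for the current user.
def temp_dir_problems(temp)
  problems = []
  if !File.directory?(temp)
    problems << "#{temp} is not a directory"
  elsif !File.writable?(temp)
    problems << "#{temp} is not writable by the current user"
  end
  unless temp.end_with?("/")
    problems << "#{temp} does not end with a slash, unlike the example"
  end
  problems
end

# Example, using the value from the README:
#   puts temp_dir_problems("/var/www/html/temp/")
```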

