README updated 12/27/2011
GERP consists of two main components: gerpcol, which analyzes multiple alignments and computes RS scores for all positions, and gerpelem, which finds constrained elements given the RS scores produced by gerpcol. See individual descriptions for more details.
Description of GERP score data
--------------
These files contain one line for each position of the alignment, and each line consists of the neutral rate N (from step 2) and RS score S (from step 4), separated by tab, for that alignment position. Sample output may look like this:
1.05 -1.1
1.05 1.05
1.23 1.23
1.23 -1.98
1.23 -1.23
1.23 0.288
...
The first line of chr1.rates has the neutral rate and RS score for the 1st position of chr1, the second line for the 2nd position, etc.
IMPORTANT NOTE: The number of lines in each .rates file is less than the length of each chromosome because GERP only reports scores for sites that are
aligned in the .maf file, which consists of non-contiguous blocks of
aligned sequences. GERP assigns a neutral rate and RS score of 0 for sites between alignment blocks & between the start of the alignment & the first block, but reports nothing for sites that come after the last alignment block.
Downloading GERP score data from the UCSC genome browser
---------------
It is now possible to view and download RS scores for hg19 and mm9 through the UCSC genome browser. The GERP score track can be found in the 'Comparative Genomics' section of the browser.
To download scores for selected regions or sites, click the 'Tables' link at the top of the page. For 'group:' select 'Comparative Genomics' and for 'track:' select 'GERP'. In the 'region' field, select the position(s) you want to get a GERP score for. Under 'output format' select 'custom track'. Now click 'Get ouput'. On the next page, it will ask you to 'Select type of data output:'. Choose 'DATA VALUE format' and click 'get custom track in file'. This output will include chromosome position & GERP score.
Description of GERP element data
---------------
Gerpelem will create an output file in the same directory as the input rates file by adding a suffix to the that filename. This suffix defaults to ".elems" and can be changed using the -x parameter similar to gerpcol.
The output file will contain one line for each constrained element found, listed in order of increasing p-value. We report the following values for each element, separated by spaces:
start end length RS-score p-value
Generation of GERP score files
---------------
Given an evolutionary tree and one or more multiple alignments, gerpcol will compute RS scores for every position of each alignment by doing the following:
1) Project the evolutionary tree to the set of species that are present at that position. If there are fewer than 3 species remaining, gerpcol will not perform the main computation and will instead output a neutral rate and RS score of 0 for this position.
2) Compute the total neutral rate N of the projected tree from (1) by summing the branch lengths of the tree.
3) Compute the scaling factor r that will maximize the probability of the leaf nucleotides given the scaled tree. This probability is computed using the HKY85 model of evolution with nucleotide frequencies estimated from the multiple alignment data and the Tr/Tv parameter settable by the user (defaults to 2.0). The optimal scaling r is computed using numerical optimization.
4) Compute the rejected substitution score for the position by computing the quantity S = N - N*r. This represents the number of substitutions "rejected" by evolutionary constraint, i.e. the difference between the expected (i.e. neutral rate) and the observed (maximum likelihood estimate). This quantity will be capped from below at -2*N; since r is by definition nonnegative, the final RS score will be between -2*N and N, inclusive.
Generation of GERP element files
----------------
Given set of GERP scores for a region or chromsome, the gerpelem program will find and report a list of elements that appear constrained beyond what is likely to occur by chance. Each candidate element is assigned a p-value based on the likelihood of an element with the same score (RS scores generated by gerpcol are used additively) and length appearing at random. However, many repetitive elements that are present only in a shallow subtree of the alignment will have a low probability of occuring due to random chance (because shallow columns will score close to 0 regardless of conservation, and longer stretches of such columns will have a much higher score than stretches of unconstrained but non-shallow columns, which will have a substantially negative score). Therefore, gerpelem incorporates certain measures to remove these shallow elements from consideration.
First, gerpelem will identify all columns below a given depth as shallow; this depth can be specified using the -d flag and defaults to 0.5 substitutions per site. Rather than the rejected substitution score computed by gerpcol, these columns will be penalized as a multiple of the total neutral rate of the tree: they will get the score -k * neutral_rate, where the parameter k can be set using the -p flag (defaults to 0.5). All columns where gerpcol failed to perform the computation due to an insufficient number of species present will automatically be penalized in this way.
Furthermore, longer stretches of shallow columns, i.e. shallow regions, are penalized further. For any stretch of n > 1 consecutive shallow columns, we will exempt a certain number of border positions on each end of the shallow region (this parameter is controlled by the -b flag and defaults to 2). The remaining middle positions are considered shallow region positions (SRPs), and we will discard any candidate elements that include too many such positions (this parameter is controlled by the -a flag and defaults to 10).
Once the shallow region columns are identified, candidate elements are generated. We impose a specified minimum and maximum element length; these are controlled using the -l and -L parameters and default to 3 and 2000, respectively. Increasing the maximum element length will increase the running time of the program. We initially consider all candidate elements of legal length that start and end on a positive-RS-scoring position, and are maximal under this condition (cannot be extended in either direction without adding positions with negative RS score). From this pool of candidate elements we remove any that exceed the allowed number of SRPs, as well as elements that have a very low score to length ratio, which also indicates a shallow element. This is controlled by the -r parameter and defaults to a lower bound of 0.1 rejected substitution per position. Setting this lower bound to 0 will effectively toggle this filter off.
All candidate elements that pass the shallow filtering procedure are now assigned a p-value score based on their RS score and length. A candidate element with score S and length L is assigned a p-value that is equal to the probability of attaining a score of S or higher if L positions are sampled independently from the background distribution. This background distribution is derived from the distribution of RS scores of all positions in the alignment except SRPs, with a small added uniform prior for smoothing purposes. Also, for the purpose of this computation, all scores need to be rounded to discrete values. It is possible to specify the rounding tolerance using the -t parameter along with the inverse of the tolerance, i.e. the default inverse tolerance of 10 will result in rounding each score to the nearest 0.1, using '-t 100' will round to the nearest 0.01, etc. Increasing the inverse tolerance will increase the running time of the program.
After being assigned p-values, the candidate elements are sorted and reported in order of increasing p-value as long as they do not overlap any previously reported elements. This continues until one of two criteria is reached - either a predefined p-value cutoff is exceeded, or a certain false positive rate is exceeded. By default a false positive rate of 0.03 (3%) is used without any set p-value cutoff. The user may specify a p-value cutoff using the -c parameter, and/or the desired false positive rate using the -e parameter. If both parameters are used, gerpelem will stop reporting elements as soon as either condition is violated. We estimate the number of false positives (and therefore the positive rate) by shuffling all non-SRP positions of the alignment and calculating how many elements our method would find in the shuffled alignment.
Contacts
--------
For additional information contact:
David Goode dgoode.stanford@gmail.com