cpgreport

Function

Description

cpgreport scans a nucleotide sequence for regions with higher than expected frequencies of the dinucleotide CG.

CpG refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases.

Detection of regions of genomic sequences that are rich in the CpG pattern is important because such regions are resistant to methylation and tend to be associated with genes which are frequently switched on. Regions rich in the CpG pattern are known as CpG islands.

This program does not find CpG islands as normally defined: "a region of greater than 200 bp with a %GC of greater than 50% and observed/expected CpG > 0.6". Instead of a window, cpgreport uses a running sum to compute the score: if position i is not the start of a CpG, the running-sum counter is decremented; if it is, the counter is incremented by CPGSCORE. Spans whose score exceeds the threshold are searched for recursively.
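The running-sum idea described above can be illustrated with a short Python sketch. This is not the cpgreport source code. Several details are assumptions made only for illustration: the decrement is 1 per non-CpG position, a span ends at the position where the sum peaks, the scan restarts just after that peak in a loop instead of recursing, and the `cpg_score` default is a placeholder. The function name `running_sum_spans` and its parameters are hypothetical.

```python
def running_sum_spans(seq, cpg_score=17, threshold=17, penalty=1):
    """Return (start, end, peak_score) tuples, 0-based inclusive.

    Simplified sketch of a running-sum CpG scan. The penalty of 1, the
    placeholder cpg_score and the iterative restart are assumptions.
    """
    seq = seq.upper()
    spans = []
    pos = 0
    while pos < len(seq) - 1:
        total, peak = 0, 0
        start, peak_end = None, None
        i = pos
        while i < len(seq) - 1:
            if seq[i:i + 2] == "CG":
                if start is None:
                    start = i              # first CpG opens a candidate span
                total += cpg_score         # reward a CpG
                if total > peak:
                    peak, peak_end = total, i + 1
            elif start is not None:
                total -= penalty           # penalise a non-CpG position
                if total <= 0:
                    break                  # candidate span has decayed away
            i += 1
        if start is None:
            break                          # no CpG left in the sequence
        if peak > threshold:
            spans.append((start, peak_end, peak))
        pos = peak_end + 1                 # rescan the remainder after the peak
    return spans


example = "AAAA" + "CG" * 3 + "A" * 100 + "CGCG" + "TT"
print(running_sum_spans(example))
```

Because every CpG adds a large score while each other position subtracts only a little, short clusters of CpG are enough to cross the threshold, which is consistent with the overprediction noted in this section.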

This method overpredicts islands but finds the smaller ones around primary exons.

Usage

Command line arguments


Input file format

Any DNA sequence specified as a USA (Uniform Sequence Address).

Output file format

The first non-blank line of the output file 'rnu68037.cpgreport' is the title line, giving the program name, the name of the sequence being analysed, and the start and end positions of the sequence.

The second non-blank line contains the headings of the columns.

Subsequent lines contain columns with the following information:

If the count of GpC in the region is zero, then the ratio of CG/GC is reported as '-'.
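The rule for the CG/GC column can be sketched in Python as follows. The function name `cg_gc_ratio` and the two-decimal formatting are assumptions for illustration; the actual output format of cpgreport may differ.

```python
def cg_gc_ratio(region):
    """Return the CG/GC dinucleotide ratio as text, or '-' if GC count is zero."""
    region = region.upper()
    # Neither "CG" nor "GC" can overlap itself, so str.count counts every occurrence.
    cg = region.count("CG")
    gc = region.count("GC")
    if gc == 0:
        return "-"          # ratio is undefined, reported as '-'
    return f"{cg / gc:.2f}"  # two decimals is an assumed format


print(cg_gc_ratio("CGCGAA"))
print(cg_gc_ratio("ACGT"))
```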

Data files

None.

Notes

This program does not find CpG islands as normally defined (see cpgplot).

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

0 if successful.

Known bugs

None. As there is no official definition of what a CpG island is, and, worse, of where islands begin and end, we have to live with two definitions and thus two methods. These are:

1. newcpgseek and cpgreport - both declare a putative island if the score is higher than a threshold (17 at the moment). They now also display the actual CpG count, the % CG and the observed/expected ratio in the region where the score is above the threshold. This scoring method, based on sum/frequencies, overpredicts islands but finds the smaller ones around primary exons. newcpgseek uses the same method as cpgreport but the output is different and more readable.

2. newcpgreport and cpgplot use a sliding window within which the Obs/Exp ratio of CpG is calculated. The important thing to note in this method is that an island, in order to be reported, is defined as a region that satisfies the following constraints:

   Obs/Exp ratio > 0.6
   % C + % G > 50%
   Length > 200.
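As an illustration of the three constraints above, the following Python sketch tests a single candidate region. It is not the newcpgreport or cpgplot implementation: the window sliding and region merging are not shown, and the expected CpG count (number of C times number of G, divided by the region length) is the commonly used definition, assumed here because this document does not state it. The function names are hypothetical.

```python
def island_stats(region):
    """Return (length, percent C+G, observed/expected CpG) for a region."""
    region = region.upper()
    n = len(region)
    c = region.count("C")
    g = region.count("G")
    cpg = region.count("CG")
    pct_gc = 100.0 * (c + g) / n if n else 0.0
    # Assumed definition of the expected CpG count: (#C * #G) / length.
    expected = c * g / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return n, pct_gc, obs_exp


def meets_island_criteria(region):
    """Apply the three constraints: Obs/Exp > 0.6, %C + %G > 50, length > 200."""
    n, pct_gc, obs_exp = island_stats(region)
    return obs_exp > 0.6 and pct_gc > 50.0 and n > 200


print(meets_island_criteria("CG" * 101))  # 202 bp, all C and G
print(meets_island_criteria("CG" * 100))  # exactly 200 bp, fails the length test
```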

For all practical purposes you should probably use newcpgreport. It is actually used to produce the human cpgisland database you can find on the EBI's ftp server as well as on the EBI's SRS server.

geecee measures CG content in the entire input sequence and is not to be used to detect CpG islands. It can be useful for detecting sequences that MIGHT contain an island.

Author(s)

This program was originally written by

The algorithm was modified for inclusion in EGCG under the name 'CPGSPANS' by

This application was modified for inclusion in EMBOSS by

History

Target users

Comments