CpG refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases.
Detection of regions of genomic sequences that are rich in the CpG pattern is important because such regions are resistant to methylation and tend to be associated with genes which are frequently switched on. Regions rich in the CpG pattern are known as CpG islands.
This program does not find CpG islands as normally defined: "a region of greater than 200 bp with a %GC of greater than 50% and observed/expected CpG > 0.6". cpgreport instead builds its score from a running sum rather than a window: if there is no CpG at position i, the running sum is decremented; if there is a CpG, the running sum is incremented by CPGSCORE. Spans that score above the threshold are then searched for recursively.
This method overpredicts islands but finds the smaller ones around primary exons.
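The running-sum scoring described above might be sketched as follows. This is a simplified, non-recursive illustration and not the cpgreport source: the function name, the decrement of 1 per non-CpG position, the CPGSCORE value of 17, clamping the sum at zero, and ending each span at its peak score are all assumptions made for the example.

```python
def running_sum_spans(seq, cpg_score=17, threshold=17):
    """Return (start, end, peak_score) tuples, 0-based inclusive.

    Hypothetical helper: a span opens at the first CpG seen while the
    running sum is zero and closes when the sum decays back to zero.
    It is kept only if its peak score exceeds the threshold.
    """
    seq = seq.upper()
    spans = []
    total = 0          # running sum
    start = None       # start of the span currently open, if any
    peak = 0           # highest running sum seen in the open span
    peak_pos = None    # position of the G of the CpG that set the peak

    for i in range(len(seq) - 1):
        if seq[i] == "C" and seq[i + 1] == "G":
            if total == 0:
                start = i
            total += cpg_score              # CpG: increment by CPGSCORE
            if total > peak:
                peak, peak_pos = total, i + 1
        else:
            total = max(total - 1, 0)       # no CpG: decrement (assumed step of 1)
            if total == 0 and start is not None:
                if peak > threshold:
                    spans.append((start, peak_pos, peak))
                start, peak, peak_pos = None, 0, None

    # Close a span that is still open at the end of the sequence.
    if start is not None and peak > threshold:
        spans.append((start, peak_pos, peak))
    return spans


if __name__ == "__main__":
    example = "AT" * 20 + "CG" * 5 + "AT" * 20
    print(running_sum_spans(example))
```

Because a single CpG is enough to open a span and closely spaced CpGs quickly push the sum over the threshold, a scheme like this reports short CpG clusters that a 200 bp window criterion would reject.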
The first non-blank line of the output file 'rnu68037.cpgreport' is the title line, giving the program name, the name of the sequence being analysed, and the start and end positions of the sequence.
The second non-blank line contains the headings of the columns.
Subsequent lines contain columns with the following information:
If the count of GpC in the region is zero, then the ratio of CG/GC is reported as '-'.
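As an illustration of the rule for the CG/GC column, a minimal sketch is given here. The function name and the two-decimal formatting are assumptions and are not taken from the cpgreport source.

```python
def cg_gc_ratio(region):
    """Return the CG/GC ratio as text, or '-' when the region has no GpC."""
    region = region.upper()
    cg = region.count("CG")   # CpG dinucleotides (cannot overlap themselves)
    gc = region.count("GC")   # GpC dinucleotides
    if gc == 0:
        return "-"            # avoid dividing by zero
    return f"{cg / gc:.2f}"


if __name__ == "__main__":
    print(cg_gc_ratio("CGCG"))
    print(cg_gc_ratio("CGAT"))
```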
1. newcpgseek and cpgreport both declare a putative island if the score is higher than a threshold (17 at the moment). They now also display the actual CpG count, the % CG and the observed/expected ratio in the region where the score is above the threshold. This scoring method, based on sum/frequencies, overpredicts islands but finds the smaller ones around primary exons. newcpgseek uses the same method as cpgreport, but its output is different and more readable.
2. newcpgreport and cpgplot use a sliding window within which the Obs/Exp ratio of CpG is calculated. The important thing to note about this method is that an island, in order to be reported, is defined as a region that satisfies the following constraints:
   Obs/Exp ratio > 0.6
   % C + % G > 50%
   Length > 200
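The three constraints for a reported island can be checked as in the following sketch. It assumes the commonly used definition of the expected CpG count, (number of C x number of G) / length, which this document does not spell out; the function names are hypothetical, and the sliding-window search itself is not shown.

```python
def cpg_stats(region):
    """Return (cpg_count, percent_gc, obs_exp) for a sequence region.

    obs_exp is None when the expected CpG count is zero.
    """
    region = region.upper()
    n = len(region)
    c = region.count("C")
    g = region.count("G")
    cpg = region.count("CG")                  # observed CpG dinucleotides
    pct_gc = 100.0 * (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0        # assumed definition of "expected"
    obs_exp = cpg / expected if expected else None
    return cpg, pct_gc, obs_exp


def meets_island_criteria(region):
    """Apply the constraints: Obs/Exp > 0.6, %C + %G > 50, length > 200."""
    _, pct_gc, obs_exp = cpg_stats(region)
    return (
        len(region) > 200
        and pct_gc > 50.0
        and obs_exp is not None
        and obs_exp > 0.6
    )


if __name__ == "__main__":
    print(cpg_stats("CG" * 150))
    print(meets_island_criteria("CG" * 150))
    print(meets_island_criteria("CG" * 50))   # too short to qualify
```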
For all practical purposes you should probably use newcpgreport. It is actually used to produce the human cpgisland database you can find on the EBI's ftp server as well as on the EBI's SRS server.
geecee measures CG content in the entire input sequence and is not to be used to detect CpG islands. It can be useful for detecting sequences that MIGHT contain an island.
The algorithm was modified for inclusion in EGCG under the name 'CPGSPANS' by
This application was modified for inclusion in EMBOSS by