December 16, 2014

Extending methylKit: Christmas Offer for Methylome Researchers

Next week is going to be a vacation and most of us are going on vacation. Me too. However, I thought that this Christmas I would OFFER fellow researchers a GOOD SERVICE. 

Many of the biologists are learning to deal with their high-throughput sequencing data. I also started just like you and spent hours and days learning R/Bioconductor related packages. Now, I thought I would SAVE your time and offer the data-analysis skills I have developed at most affordable price for the researchers (actually, my service is FREE, I collect charges only to PREVENT SPAM requests and some of it will go into developing further resources to help users like you)

Here are the details of my christmas offer: I am now offering methylation analysis for two samples (1 control and 1 test). As part of my offering you would get the following done:
  • QC report of the FASTQ reads
  • Mapping (hg19/mm10) using BISMARK
  • Bisulfite efficiency analysis (if you have used lambda spike in)
  • methylKit workflow to suit your needs
  • Additional workflow in bioconductor to extract the list of genic elements/or non-genic elements with differentially methylated CpGs
Not only this, I also offer 30min - 1 hr consultation on planning methylome experiments (RRBS/WGBS). Yes, you read it right, I do data analysis and I build NGS libraries my self. So, I can help you planning experiments. I have built and sequenced over 100 RRBS libraries and 10 WGBS libraries from 10ng - 30ng DNA (I have NOT heard of any commercial service provider offering methylomes from such smaller amounts of DNA). So, with my tips you can build NGS libraries from very precious samples. I know the ins and outs of NGS library prep and good understanding of methylation analysis. I can see the QC and tell you what has gone wrong in the library prep (if there is any).

Price details:
  1. RRBS/WGBS consultation (Sample processing, library prep, data analysis) for 30min to 1 hr : $15 for 30 min (total 6 slots for consultation)
  2. RRBS data analysis as mentioned above : $ 50 (only 10 slots)
    • I will also explain you the pipeline and the results will be delivered in neatly explained document/ppt. If necessary over skype/hangout
  3. WGBS data analysis : $50 (only 1 slot)
    • WGBS analysis will NOT cover the above data analysis due to huge cost of time and computational resources. I will do QC report, mapping and bisulfite efficiency analysis only.
Risk free:
  1. After experiencing this service if you feel that my expertise is not useful for you, I WILL RETURN 80% of the price you PAID with NO QUESTIONS ASKED. All you have to do is just ask for your money. I am sure people reading this are true to their conscience. I do not want anybody to comment on this blog that they have lost money. So, my reputation is at stake if you are not satisfied.
  2. If you happen to use the results, you need not give me any contribution. Its YOUR data. 
You can contact me using my personal email id at kalyankpy[at]gmail[dot]com for further details.

Disclaimer: Pls note that this is not marketing or business. I want to share my expertise and skill to the needed (I have been active in the methylkit_discussion forum on google groups for the topics I am familiar with). However, I don't want to invite SPAMMERS or JUNK. Price for this service is meant to invite genuine researchers only. Some of the money raised will be used in developing additional online resources for methylation anlaysis.

November 5, 2014

Create a REFSEQ transcript database

Transcript databases (Txdb) in bioconductor are very good annotation packages. These packages help the researchers annotate the genomic regions of interest to multiple genic elements such as exons, introns, UTRs, CDS, genes etc.,. For the human genome bioconductor offers Txdb files only for the UCSC knowngenes. Here, I share the code needed for generating human Txdb using bioconductor package "GenomicFeatures"

The following code/function could be used for generating any Txdb of choice for any organism of interest. This is a very simple function. However, due to the naming of the function and the default parameters hide the full potential of this function in utilizing it for creating a variety of databases. In other words, this function could be used to generate a Txdb from every table existing at UCSC.

October 23, 2014

Extending methylKit : Extract promoters with differentially methylated CpGs

In my previous post, I wrote about the features of methylKit. Here, I will discuss how to extend bisulfite sequencing data analysis beyond methylKit.

Annotation is an important feature of genomic analyses. Coming to bisulfite sequencing analyses such as RRBS or WGBS, methylKit does a pretty good job by identifying the differentially methylated individual CpGs or any specific regions/tiles. It also performs basic annotation and plots pie charts indicating where all the differentially methylated CpGs overlap the genic annotations.

The adjacent picture is an example of the kind of annotation performed by methylKit.Using the native functions, methylKit users will be able to annotate the differentially methylated CpGs to the genic regions. Adjacent picture indicates that among all the diffmeth
CpGs 46% overlap with promoters, 27% overlap with intergenic region. Another way to look at the annotations is to identify the list of promoters, exons, introns, intergenic regions that overlap differentially methylated CpGs. There are no methylKit functions to perform the annotations from the point of genic regions.However, methylKit facilitates this by enabling the coercion of the methylKit objects into "GRanges" (GenomicRanges). The following script will help methylKit users in extracting the list of promoter/exons/introns that overlap with differentially methylated CpGs.

October 21, 2014

methylKit for bisulfite sequencing data analysis

I have been relying upon methylKit, an R package for my RRBS data analysis. It is one the most highly cited R package for analysing bisulifite sequencing data. It is straight forward to install and it's vignette details all the major steps in the bisulfite analysis with clarity. Altuna Akalin, the author of the methylKit has been actively supporting (via google groups) the issues faced by the users in implementing methylKit. Overall, methylKit could also be used with little knowledge of R. Interestingly, working with methylKit also helps laboratory researchers learn R.

As with other bisulfite sequencing data analysis packages, methylkit takes charge once the bisulfite reads are aligned to the genome. Here are the tasks one can implement using methylKit:

  • Extract methylation information from aligned data from mappers like Bismark
  • Alternatively, one can read the methylation information of mapped cytosines easily from other mappers like BSMAP  or any other bisulfite mapper in a specified format 
  • Normalize the CpGs covered by removing the CpGs that have excess coverage due to over amplification/PCR duplication
  • Calculate methylation status of each CpG covered (or specified regions or tiles across genome) and export them into bed or bedcoverage formats for visualization in a genome browser. methylKit also enables merging of strand coverage.
  • Enables the consideration of replicates among control and test samples
  • Calculate differential methylation either at single CpG levels or regions/tiles covered across the control and test samples
  • Facilitates PCA and cluster analysis to identify the overall relation among the samples from methylation point of view
  • Enables annotation of differential methylation across CpG islands/shores and multiple genic regions of interest such as promoters, exons, introns......
Any genomic analysis is highly customized after a certain number of basic steps. One has to build the customization by utilizing the options among several packages and bridging the gaps by fine tuning the input and output file formats. methylKit does a fairly good job by facilitating the coercion of methylKit objects into GenomicRanges objects such as GRanges. This feature enables seamless integration into multiple packages from bioconductor. In the future posts, I will detail some examples and R scripts that facilitate extension to methylKit analysis.

October 8, 2014

Are Post-bisulfite DNA library preparation kits suitable for RRBS?

Recently, a colleague asked me if post-bisulfite DNA library preparation kits are suitable for reduced representation bisulfite sequencing (RRBS). In this blog post I share my views on this concept and its applicability.

The concept of DNA library preparation after bisulfite conversion of DNA was originally introduced by Zymo research for whole genome Methyl-seq. This is really exciting because bisulfite reaction is so harsh that 90% of library gets fragmented in the traditional protocols and is not amplifiable. The amplification we see is actually from the remaning 10% library. Other advantages are listed below:
  • Generally sonication of DNA is performed in a buffer of atleast 130 ul (Volume of a Covaris micro-tube). After sonication DNA needs to be purified/concentrated. So, sonication always accompanies additional purification steps that lead to loss of DNA (purification by columns will lead to a minimum loss of 10% of the DNA). Additionally, one has to check the concentration of DNA and the fragments size before proceeding.
  • Bisulfite conversion of the whole genome is a harsh reaction that leads to random nicks in the DNA. Thus DNA is broken down into fragments. Subjecting the whole genome to bisulfite conversion is thus doing two steps: fragmentation and bisulfite conversion. Thus, it is advantageous as it avoids sonication and loss of DNA during the purifications steps.
  • Another advantage is that there is no fragmentation induced loss of DNA after ligation (as in normal Methyl-seq where bisulfite reaction is performed post ligation. This generally leads to fragmentation of ligated library that could not be amplified).
Before commenting on the applicability of post-bisulfite library preparation to RRBS, it is wise to understand the associated library preparation steps post-bisulfite conversion of whole genome. The following picture explains the methodology (adapted from zymo research webpage)

An important consideration in the above work-flow is to convert bisulfite converted ssDNA into dsDNA. This is achieved in a process akin to cDNA preparation using random oligos. Since, the fragmentation induced by bisulfite reaction is random, random oligo seems a right choice. This protocol should work fine for preparing the library for whole genome bisulfite sequencing (Methyl-seq).

To find out the suitability of this workflow for RRBS, let us revisit the basic concepts. RRBS is inteded to enrich the CpG rich regions by digesting the DNA with MspI restriction enzyme. This will result in DNA fragments ranging from as low as 40 bp to multiple kilobases in length with identical termini. However, we choose fragments in the size range of 40-480 bp that seem to be a better representation of the CpG rich regions and promoters. In the traditional RRBS protocol, we perform bisulfite conversion post adapter ligation. So, we amplify the fragments we 'choose' (with some loss during the bisulfite conversion due to random fragmentation of DNA).

Let us see what happens if we subject the MspI digested DNA to bisulfite reaction prior to library preparation.

  • DNA is further fragmented into smaller fragments.
  • This fragmentation will skew the composition of the MspI digested fragments and the desired size range of 40-480bp does not just represent CpG rich regions. This size range now contains any region of the genome.
  • Termini will not be MspI recognition motifs but random nucleotides due to chemical induced fragmentation.
  • Because of the random fragmentation, even sequencing data from replicates is likely to represent CpGs from different genomic loci reducing the overlap among replicates.
  • The QC of the RRBS reads is assessed by the 5' termini (CGG/TGG). Now, because of random fragmentation, terminal nucleotides are altered!
  • Even when random fragmentation doesn't happen, another issue exists during ssDNA to dsDNA conversion. Usage of random oligos is good for randomly fragmented termini. For MspI digested termini, the major chunk are identical termini which means, random oligos may not convert the DNA at the same efficiency!
Since there are multiple issues involved, I conclude (well I have not done any experiment yet) that post-bisulfite library preparation kits are not suitable for RRBS. In my view any company that claims the suitability of this kit should do the RRBS experiments with replicates before documenting and selling these kits. I am eager to know if this has worked for RRBS. I would be willing to retract this post then!

August 28, 2014

Survey on Indian Research Scholars

Recently, I have initiated a survey to gather the issues faced by research scholars across India. I have posted the link to this survey in the facebook page titled "Research Scholars of India". This survey lasted for 10 days (16-26 August 2014).

This survey received a good response as much as from 85 research scholars across India spanning a number of universities and institutions. Participants also ranged from all the possible major funding sources in India for research. In summary this report is indicative of the grim situation faced by the research scholars in India. 

Following are the highlights of the survey:

  • This survey has 73% male participation against 23% female. The rest chose not to mention their sex.
  •  90% of the respondents have clarity about their registration for PhD
  • Nearly 50% of the scholars say that they do not get monthly stipends and are facing administrative issues either at the host institutions or at funding agencies.
  • More than 60% scholars indicate that the stipend is barely sufficient to meet their individual needs.
  • About 80% of the researchers are in distress and not satisfied.
  • 80% researchers say that they can not meet the financial needs in case they are married and most of them decided not to get married. Few researchers, in their personal communication also mentioned that their marriage prospects are next to nothing!
  • A whopping 98% researchers are facing tough life either due to work environment or insufficient stipends.
For those who want to have a look at the survey, they can do so at this webpage. Further suggestions are welcome to improve this report!

Here is the survey report.

Survey Results

April 28, 2014

Restriction digestion of eukaryotic genomes in R

There are multiple desktop tools (Bioedit, Emboss, various basic bioinformatic tools) and browser based tools (NEBcutter, biotools, In silico restriction digestion and various other tools) available for performing restriction digestion of smaller prokaryotic genomes or smaller eukaryotic chromosomes individually. However, no desktop tool or browser based tool is available for free use to perform restriction digestion on the whole genome of eukaryotes with larger genomes. Tools like Emboss handles this sort of task but in a primitive way that is not helpful for the downstream analysis of the results right away.

I have been working on methylation analysis using RRBS method. This method is based on the restriction digestion pattern of the enzyme MspI on the whole genome. I wanted to perform in silico digestion of MspI on mouse genome(mm10) to virtually see the pattern of digestion. After scanning the web finally I narrowed down on a bioconductor package "Biostrings" that helped me achieve this task. Here, I give the code to perform this task. Since the package I used is an R package, it also helped me perform a variety of downstream analysis pretty fast.

This method is based on the ability of the "Biostrings" package to recognize the MspI restriction site (CCGG) on the mouse genome (BSgenome.Mmusculus.UCSC.mm10 bioconductor package loaded into R). Following tasks are peformed by the script below:
  • Load the needed bioconductor and R packages
  • Identify the MspI restriction sites (genomic co-ordinates) per chromosome in the genome.
  • Extract the start and end co-ordinates of the dna fragments resulting from the genomic digestion (using gaps)
  • Create a dataframe of the genomic co-ordinates of the digested fragments fro each chromosome for easier downstream analysis
  • Plot the frequency of the length of digested fragments using ggplot2
  • A reproducible document is also available at R pubs

Update on 19th August 2014:
I came to know that some people tried to replicate this code and noticed some errors on their side. So, I would say that some issues may crop up when the version of the libraries change. Users may note that the above code is working with the following session information in R.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C             

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
 [1] scales_0.2.4                          plyr_1.8.1                          
 [3] ggplot2_1.0.0                         BSgenome.Mmusculus.UCSC.mm10_1.3.1000
 [5] BSgenome_1.32.0                       GenomicRanges_1.16.3                
 [7] GenomeInfoDb_1.0.2                    Biostrings_2.32.1                   
 [9] XVector_0.4.0                         IRanges_1.22.9                      
[11] BiocGenerics_0.10.0                 

loaded via a namespace (and not attached):
 [1] bitops_1.0-6     colorspace_1.2-4 digest_0.6.4     grid_3.1.1       gtable_0.1.2   
 [6] labeling_0.2     MASS_7.3-33      munsell_0.4.2    proto_0.3-10     Rcpp_0.11.2    
[11] reshape2_1.4     Rsamtools_1.16.1 stats4_3.1.1     stringr_0.6.2    tools_3.1.1    
[16] zlibbioc_1.10.0

April 8, 2014

ENCODE Transcription factor tracks

I was trying to overlap the differentially methylated CpGs at the transcription factor binding sites for my data. But I am puzzled to see that the ENCODE consortium which has spent huge amount of money on performing experiments and publishing papers, did not pay much attention to giving details of the tracks and their naming structure as a quick reference that can be easily found. After browsing a lot on the net, I got the clue from this webpage about the details of the transcription factor chip experiments.

In this post I summarize the details about how to download all the tracks and understanding the naming of the tracks.

Transcription factor related ChIP-seq tracks are available for individual download from this link. However, for those wanting to perform analysis from all the tracks, it is painful to download each track and merge the tracks ensuring the identity of each track. Fortunately, I found that Sartor lab has created a bed file merging all the ENCODE TF tracks. This file also ensured the identity of each row of the track by labelling its source. This could be easily converted into 'GRanges' object either by custom functions in base R or using import function from rtracklayer package from Bioconductor.

Another issue with these tracks is their naming structure. While the consortium ensured that every track name includes all the necessary information, its structure was not documented anywhere (that could be easily obtained). After few hours of browsing and collecting the information, I understood the structure of the TF track name as follows:

  • List of tracks and the related metadata for the tracks is available for download from this webpage (click the files.txt link)
  • The downloaded file is not uniformly tabulated. So, I had to fiddle with it to make it look uniform
  • Further I extracted only the details that matter to understand the file/track name. You may download this file from here.

Here I will explain the name of one track as an example:
Track name = wgEncodeAwgTfbsSydhK562CjunIfna6hUniPk
Every track name includes the following elements:
  1. Text common to all: wgEncodeAwgTfbs
  2. Laboratories involved in generating the track: Sydh (Labs from Stanford,yale, USC, Harvard)
    • Other possible values appearing in the track names are (Haib, UChicago, Uta, UW)
  3. Cell line used=K562
    • There are over 92 types of cell lines used. So, this part is highly variable. Details of the cell types used is available from this page.
  4. Antibody used = c-Jun
    • There are over 190 antibodies used (=TFs probed). This is also a vairable part of the track name. In some cases, they have provided the catalog number of the antibody purchased.
  5. Treatment=ifna6h (Means cells are treated with IFNA for 6 hrs)
    • This part gives details of the cell treatment. When cells are not subjected to any treatement, there is no mention of this part. Overall, there are 29 variables at this place. When the treatment is for 36h, it is taken as standard and not mentioned in the name of the track.
  6. Algorithm used=UniPk (Uniform peak calling). Common for every track!
I hope, this information is useful for others doing a similar analysis.

Update (26th June 2014)

Here are few more links that give additional information about the ENCODE transcription factor ChIP experiments.
  1.  List of antibodies registered at Data coordination center (DCC), ENCODE is available at This link provides the list of antibodies and their targets probed in ENCODE.
  2. For  those interested in the comprehensive list of transcription factors across various genomes this link may be useful. However these are predicted transcription factors and not human curated.
  3. For those who want know about the type of experiments performed in ENCODE, this page may be helpful ( This page also lists the number of experiments performed. Also lists the various ChIP-seq experiments.
  4. Finally FAQs are available at 
  5. Human curated list of transcription factors is available at . Reference article for the same is at

March 26, 2014

CpG Island shelves

Nowadays, regions with relatively lower CpG density are gaining importance in DNA methylation studies. This is based on the fact that a majority of CpG rich regions (CpG islands) are non-dynamic and less variant in terms of methylation status probed across a variety of tissues and cell populations (Irizarry 2009, Ziller 2013). It is now proven that methylation is more dynamic along the CpG shores (< 2kb flanking CpG Islands) and CpG shelves (<2kb flanking outwards from a CpG shore). While it is easy to retreive the genomic co-ordinates of CpG Islands from UCSC browser, public resources for retrieving the genomic co-ordinates for shores and shelfs were missing or scarce. Here, I am displaying the code (in R using Bioconductor package GenomicRanges) I use for generating objects for CpG islands, CpG island shores and CpG island shelves. The resulting objects are GRanges objects which could be used in multiple downstream applications in Bioconductor related packages.

February 25, 2014

Create a GENCODE transcript database in R

The following gist will help the researchers in creating the gencode transcript database using the bioconductor packages. I am assuming that the user's computer has preinstalled packages "GenomicRanges" and "GenomicFeatures". Following script has the following information:
  • loads the needs bioconductor packages
  • gives information about creating the intermediate files needed for generating the database
  • brief explanation about each step in the procedure
  • create the transcript database, saving and loading when needed
  • extract information for each feature (gene, cds,transcript,exon,intron,intergenic regions) as 'GRanges' object, 'sort' when needed.
  • saves all the extracted features into combined object to be loaded in future