April 28, 2014

Restriction digestion of eukaryotic genomes in R

There are multiple desktop tools (Bioedit, Emboss, various basic bioinformatic tools) and browser based tools (NEBcutter, biotools, In silico restriction digestion and various other tools) available for performing restriction digestion of smaller prokaryotic genomes or smaller eukaryotic chromosomes individually. However, no desktop tool or browser based tool is available for free use to perform restriction digestion on the whole genome of eukaryotes with larger genomes. Tools like Emboss handles this sort of task but in a primitive way that is not helpful for the downstream analysis of the results right away.

I have been working on methylation analysis using RRBS method. This method is based on the restriction digestion pattern of the enzyme MspI on the whole genome. I wanted to perform in silico digestion of MspI on mouse genome(mm10) to virtually see the pattern of digestion. After scanning the web finally I narrowed down on a bioconductor package "Biostrings" that helped me achieve this task. Here, I give the code to perform this task. Since the package I used is an R package, it also helped me perform a variety of downstream analysis pretty fast.

This method is based on the ability of the "Biostrings" package to recognize the MspI restriction site (CCGG) on the mouse genome (BSgenome.Mmusculus.UCSC.mm10 bioconductor package loaded into R). Following tasks are peformed by the script below:
  • Load the needed bioconductor and R packages
  • Identify the MspI restriction sites (genomic co-ordinates) per chromosome in the genome.
  • Extract the start and end co-ordinates of the dna fragments resulting from the genomic digestion (using gaps)
  • Create a dataframe of the genomic co-ordinates of the digested fragments fro each chromosome for easier downstream analysis
  • Plot the frequency of the length of digested fragments using ggplot2
  • A reproducible document is also available at R pubs

Update on 19th August 2014:
I came to know that some people tried to replicate this code and noticed some errors on their side. So, I would say that some issues may crop up when the version of the libraries change. Users may note that the above code is working with the following session information in R.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8      
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C             
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
 [1] scales_0.2.4                          plyr_1.8.1                          
 [3] ggplot2_1.0.0                         BSgenome.Mmusculus.UCSC.mm10_1.3.1000
 [5] BSgenome_1.32.0                       GenomicRanges_1.16.3                
 [7] GenomeInfoDb_1.0.2                    Biostrings_2.32.1                   
 [9] XVector_0.4.0                         IRanges_1.22.9                      
[11] BiocGenerics_0.10.0                 

loaded via a namespace (and not attached):
 [1] bitops_1.0-6     colorspace_1.2-4 digest_0.6.4     grid_3.1.1       gtable_0.1.2   
 [6] labeling_0.2     MASS_7.3-33      munsell_0.4.2    proto_0.3-10     Rcpp_0.11.2    
[11] reshape2_1.4     Rsamtools_1.16.1 stats4_3.1.1     stringr_0.6.2    tools_3.1.1    
[16] zlibbioc_1.10.0

No comments: