Organized by the Translational Bioinformatics Group (TBG), International Centre for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India.

CCMB, Hyderabad

Unraveling the Genomic Landscape: Advancements in Bioinformatics at CCMB Hyderabad

Analysis of single-cell RNA sequencing data from TSCs derived from murine ESCs. Data are from four samples: ESC, embryonic stem cells; TSC, trophectoderm stem cells derived from the blastocyst; ESTSP5, early-passage (P5) TSCs derived from ESCs; ESTSP24, late-passage (P24) TSCs derived from ESCs. A) UMAP clustering of cell lineages from the four samples, colored by sample of origin (top) and by independent clustering (bottom). B) Heatmap depicting expression of cluster-specific markers in each independent cluster.

The Bioinformatics Center at CSIR-CCMB has a primary goal: to unravel how genetic variation in the Indian population contributes to disparities in health and disease. The work proceeds on two fronts. First, the Center examines Indian genomes in relation to their phenotypic traits. Second, it investigates the molecular and epigenetic consequences of genetic variants. The Center's mission is not only to build a cutting-edge bioinformatics facility but also to train skilled professionals capable of analyzing extensive whole-genome data, extracting biologically meaningful insights, and drawing conclusions.

The Center has developed a computational pipeline for efficient discovery of structural variants from short-read and long-read sequencing data. The pipeline combines the accuracy of short reads with the versatility of long reads and was validated on a trio dataset sequenced in-house on both Illumina and ONT platforms.

For base modifications, datasets created in-house using mutant E. coli strains and enzymatic treatments were used, together with public data, to systematically benchmark existing tools for detecting base modifications from Nanopore data, including their accuracy in identifying CpG methylation. This work also led to new models aimed at improving accuracy, including models for studying 5mC in non-CpG contexts and 6mA. In addition, a novel approach involving methylation-specific restriction digestion followed by single-molecule sequencing on Nanopore was developed to identify 6-methyladenosine at single-base resolution.

To understand the sub-cellular lineages of murine trophectoderm stem cells, a custom pipeline built on top of the widely used Seurat pipeline was employed to analyze single-cell transcriptome data.

Computational facility at DBT-Bioinformatics Center at CSIR-CCMB

Performance of various tools in accurately identifying 5-methylcytosine in the CpG context using human NA12878 data. The X axis shows % methylation as reported by bisulfite sequencing, and the Y axis shows % methylation as reported by the tool in question. Color indicates concordance, according to the scale on the right.
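The comparison in this figure, tool-reported versus bisulfite-reported % methylation at the same sites, can be summarized numerically. The following is a minimal sketch, not the Center's benchmarking code: the function name `concordance`, the choice of Pearson correlation and root-mean-square error as summary metrics, and the example values are all assumptions made for illustration.

```python
import math


def concordance(reference, predicted):
    """Return (pearson_r, rmse) for paired per-site methylation percentages.

    `reference` holds % methylation from bisulfite sequencing and `predicted`
    holds % methylation from the tool under test, in the same site order.
    The metric choice is an assumption; the source does not name one.
    """
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("need two equal-length lists with at least two sites")
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_pred = sum(predicted) / n
    cov = sum((r - mean_ref) * (p - mean_pred) for r, p in zip(reference, predicted))
    var_ref = sum((r - mean_ref) ** 2 for r in reference)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_ref == 0 or var_pred == 0:
        raise ValueError("correlation is undefined for constant input")
    pearson_r = cov / math.sqrt(var_ref * var_pred)
    rmse = math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)
    return pearson_r, rmse


if __name__ == "__main__":
    # Made-up values for illustration only; not data from the study.
    bisulfite = [0.0, 25.0, 50.0, 75.0, 100.0]
    tool = [5.0, 20.0, 55.0, 70.0, 95.0]
    r, rmse = concordance(bisulfite, tool)
    print(f"Pearson r = {r:.3f}, RMSE = {rmse:.1f} percentage points")
```

A high correlation with a low RMSE would correspond to points lying close to the diagonal of the plot; a tool can correlate well yet still be biased, which is why both numbers are reported.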

On the training front, the Center organized a workshop on the analysis of whole-transcriptome data. Attended by 20 individuals, the workshop covered Linux commands, QC using FastQC, data cleanup using CutAdapt, alignment to a reference using HISAT2, post-alignment QC using QualiMap, transcript quantification using featureCounts, and differential gene expression analysis using DESeq2. Participants were also guided through downstream analyses and visualization techniques such as PCA, heatmap generation, and GO analysis. The workshop additionally showed how to use the High-Performance Cluster job system to run pipelines on individual datasets. The Center also offers workshops for students on transcriptome data analysis and on wastewater data analysis for antimicrobial resistance trends.

The Center has collaborated with Dr. P. Chandra Sekhar's lab to successfully derive mouse trophectoderm stem cells from mouse embryonic stem cells. Their sub-cellular lineages were characterized using single-cell transcriptomics, with the data processed using the Cell Ranger toolkit.
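The processing steps covered in the workshop can be chained per sample. The sketch below only assembles the command lines in order and prints them; it does not run anything. It is an illustration under several assumptions, not the workshop's actual material: the function name `build_rnaseq_commands`, the file-naming scheme, single-end reads, the commonly used Illumina adapter prefix `AGATCGGAAGAGC`, and the extra `samtools sort` step (added here because HISAT2 writes SAM output and the later steps read a sorted BAM file) are all introduced for this example. DESeq2 runs in R and is therefore left out.

```python
import shlex
from pathlib import Path


def build_rnaseq_commands(sample, fastq, index, gtf, outdir, adapter="AGATCGGAAGAGC"):
    """Return the per-sample command lines, in order, as lists of arguments.

    All paths and the adapter sequence are placeholders for illustration.
    """
    out = Path(outdir)
    trimmed = out / f"{sample}.trimmed.fastq.gz"
    sam = out / f"{sample}.sam"
    bam = out / f"{sample}.sorted.bam"
    counts = out / f"{sample}.counts.txt"
    return [
        # Raw-read quality control
        ["fastqc", "-o", str(out), str(fastq)],
        # Adapter trimming (3' adapter, single-end reads assumed)
        ["cutadapt", "-a", adapter, "-o", str(trimmed), str(fastq)],
        # Alignment to the reference index
        ["hisat2", "-x", str(index), "-U", str(trimmed), "-S", str(sam)],
        # Coordinate-sort the alignments (assumed helper step, not named in the text)
        ["samtools", "sort", "-o", str(bam), str(sam)],
        # Post-alignment quality control
        ["qualimap", "bamqc", "-bam", str(bam), "-outdir", str(out / "qualimap")],
        # Read counting per annotated feature
        ["featureCounts", "-a", str(gtf), "-o", str(counts), str(bam)],
    ]


if __name__ == "__main__":
    commands = build_rnaseq_commands(
        sample="sample1",
        fastq="sample1.fastq.gz",
        index="index/genome",
        gtf="annotation.gtf",
        outdir="results",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Printing the commands instead of executing them makes it easy to review them or to paste them into a cluster job script, whose exact format depends on the scheduler in use.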

Looking ahead, in collaboration with the NNP partner, IIT Hyderabad, the project's overarching objective is to examine human genome information and identify genomic and epigenomic variants associated with common polygenic disorders. Deep learning-based models will be developed to connect phenotypes, lifestyle, variants, and epigenotypes with the onset, progression, prognosis, and severity of diseases and with response to treatment. The project will use AI and ML models to predict variant effects and correlate them with disease phenotypes. It is expected to yield efficient tools and pipelines for processing genomic data, along with a skilled workforce specialized in large-scale data analysis. Other projects include the development of a pipeline for discovering structural variants, deep learning models for methylation detection, single-cell RNA-seq analysis of early mouse embryonic development, and a workshop designed to train 20 individuals in transcriptome data analysis.
