Bioinformatics and Bio-imaging, ICDM 2005 Tutorial by Chris Ding and Hanchuan Peng

Chris Ding and Hanchuan Peng, Lawrence Berkeley National Laboratory.

Bioinformatics and Bio-image Analysis

A fast-evolving trend in bioinformatics and computational genomics is the use of machine learning methods to computationally determine biologically significant functions, structures, and interactions of DNA and proteins. Using classification methods, one can predict protein 3D structures, RNA coding regions, binding/non-binding active sites, etc. Related data mining techniques such as feature extraction and selection become particularly important. Another new aspect is simultaneous feature selection and data clustering, such as biclustering and iterative feature selection and data clustering. A rapidly developing area is protein interactions and protein functional module discovery. All these areas involve traditional data mining methods, but with particular characteristics relevant to bioinformatics, and have seen much progress over the last several years. This tutorial will cover the areas where machine learning methods are most widely and fruitfully adopted. The biology problems, data mining methods, and computational issues will be discussed in detail.

One significant new development in bioinformatics is bioimaging, since sequence data alone are not enough to answer many fundamental questions. With the development of advanced imaging techniques, the number of biological images (e.g. cellular and molecular images, as well as medical images) acquired in digital form is growing rapidly. Large-scale bioimage databases are becoming available. Analyzing these images helps biologists seek answers to many biological problems. For example, analysis of the spatial distribution of proteins in molecular images can differentiate cancer cell phenotypes. Comparison of in situ gene expression pattern images during embryogenesis helps to predict co-expressed and co-regulated genes, and to delineate the underlying gene networks. Image analysis and signal processing techniques have also been found useful in more conventional bioinformatics applications such as sequence analysis.

The emphasis of this tutorial is to cover some more recent developments in the area.

Content.

Part 1. (50 mins)

Genomic basics, sequence alignment, position-specific scoring matrices, hidden Markov models

Function and Structure predictions using classification methods.

Protein classification systems: SCOP, Pfam

Protein fold recognition.

RNA coding region detection.

Binding site detection

Cancer type detection.

Positive-samples-only classification

Feature extraction.
- Amino acid composition
- High-order composition
- Physical-chemical based features
- K-mer based string kernels
- Fisher/HMM derived kernels
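
As a rough illustration of the first and fourth feature types listed above, the following Python sketch computes an amino acid composition vector and a k-mer count vector for a protein sequence, then takes the inner product of two k-mer vectors as a simple spectrum-style string kernel. The function names and the choice to ignore non-standard residues are assumptions made for this example; they are not taken from the tutorial material.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in seq (20 values)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def kmer_spectrum(seq, k=2):
    """Count every overlapping k-mer over the standard alphabet.

    k-mers containing a non-standard character are skipped (an assumption
    of this sketch, not a rule from the tutorial).
    """
    seq = seq.upper()
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            vec[j] += 1
    return vec


def spectrum_kernel(a, b, k=2):
    """Inner product of the k-mer count vectors of two sequences."""
    return sum(x * y for x, y in zip(kmer_spectrum(a, k), kmer_spectrum(b, k)))
```

The composition vector has a fixed length of 20 regardless of sequence length, which is what makes it usable as input to a standard classifier; the k-mer vector grows as 20^k, so small k is typical in a sketch like this.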

Part 2. (40 mins)

Feature selection.
- t-statistic, F-statistic
- Entropy and mutual information criteria
- Filter methods
- Minimum redundancy feature selection

Unsupervised class, feature, and phenotype discovery.
- Gene expression clustering
- Subspace clustering
- Biclustering
- Iterative feature selection and data clustering
- Gene shaving
- 2-way feature ordering and selection
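
To make the idea of a t-statistic filter method from the feature selection topics concrete, here is a minimal Python sketch that scores each feature (e.g. a gene in a microarray data set) by the absolute two-sample t-statistic between two classes and ranks features by that score. The use of Welch's unequal-variance form, the function names, and the toy data layout are assumptions of this example; the tutorial does not specify a particular variant.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def t_statistic(x, y):
    """Welch's two-sample t-statistic for one feature.

    Each group needs at least two values so the sample variance is defined.
    """
    diff = mean(x) - mean(y)
    denom = sqrt(variance(x) / len(x) + variance(y) / len(y))
    if denom == 0:
        # Both groups are constant: no separation, or perfect separation.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / denom


def rank_features(samples, labels, top=None):
    """Rank feature indices by |t| between the two classes, best first."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch expects exactly two classes")
    group_a = [s for s, l in zip(samples, labels) if l == classes[0]]
    group_b = [s for s, l in zip(samples, labels) if l == classes[1]]
    scores = []
    for j in range(len(samples[0])):
        t = t_statistic([s[j] for s in group_a], [s[j] for s in group_b])
        scores.append((abs(t), j))
    # Sort by descending score; ties are broken by feature index.
    scores.sort(key=lambda p: (-p[0], p[1]))
    ranked = [j for _, j in scores]
    return ranked if top is None else ranked[:top]
```

A filter like this scores each feature independently, so it can select several highly correlated features; redundancy-aware criteria address that limitation by also penalizing similarity among the selected features.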

Part 3. (60 mins)

Protein interaction networks and gene regulation networks.
Network structure deduction. Bayesian networks. Linear models.
Protein functional module detection - Graph-based methods

Part 4. (60 mins)

Bioimage data mining and informatics
Acquisition of gene expression images of several model systems (fruit fly, etc.) at cellular resolution.
Bioimage feature measurement, description, extraction, and selection
Bioimage registration and comparison
Object segmentation and tracking in bioimages
Clustering/classification of bioimages or patterns derived from bioimages
Object/pattern recognition and understanding in bioimages
Bioimage ontology and related data mining
Bioimage data visualization
Bioimaging databases.
Tools/software for bioimage data processing and data mining
Bioimage related biology, bioinformatics, and biomedicine applications, e.g. gene regulatory network/pathway modeling, etc.
Joint data mining using both bioimages and other data (e.g. sequences, microarray, protein interaction, etc.)

We will also explain the common data used in this research: microarray data, protein hybridization data, protein structure data, protein interaction data, sequence data, and other useful bioinformatics databases.

This tutorial is an up-to-date survey of bioinformatics problems solved or being solved by data mining methods. The emphasis is on matching the two fields, and on the more recent developments.

This is a much updated version of the similar tutorial that we presented at ICDM 2003, with substantial new material on feature selection and clustering, protein interactions, and bio-image analysis. These updates reflect the rapid advances in the field.

This tutorial is self-contained and requires no prior knowledge of biology or bioinformatics. However, much of the material is at an intermediate to advanced level, so a person with some prior knowledge will appreciate it more.

Targeted audience.

People with a data mining and machine learning background who wish to apply these methods in bioinformatics. After this tutorial, they will be able to get started quickly on appropriate bioinformatics problems and related data, and will have a sense of where the research emphasis and trends are moving.