ICDM 2005 Tutorial by Chris Ding and Hanchuan Peng, Lawrence Berkeley National Laboratory.

Bioinformatics and Bio-image Analysis

A fast-evolving trend in bioinformatics and computational genomics is the use of machine learning methods to computationally determine biologically significant functions, structures, and interactions of DNA and proteins. Using classification methods, one can predict protein 3D structures, RNA coding regions, binding/non-binding active sites, etc. Related data mining techniques such as feature extraction and selection become particularly important. Another new aspect is simultaneous feature selection and data clustering, such as biclustering and iterative feature selection and data clustering. A rapidly developing area is protein interactions and protein functional module discovery. All these areas involve traditional data mining methods, but with particular characteristics relevant to bioinformatics, and all have seen much progress over the last several years. This tutorial will cover the areas where machine learning methods are most widely and fruitfully adopted. The biology problems, data mining methods, and computational issues will be discussed in detail.

One of the significant new developments in bioinformatics is bioimaging, since sequence data alone are not enough to answer many fundamental questions. With the development of advanced imaging techniques, the number of biological images (e.g. cellular and molecular images, as well as medical images) acquired in digital form is growing rapidly. Large-scale bioimage databases are becoming available. Analyzing these images gives biologists new ways to seek answers to many biological problems. For example, analysis of the spatial distribution of proteins in molecular images can differentiate cancer cell phenotypes. Comparison of in situ gene expression pattern images during embryogenesis helps to predict co-expressed and co-regulated genes, and to delineate the underlying gene networks. Image analysis and signal processing techniques have also been found useful in more conventional bioinformatics applications such as sequence analysis.

The emphasis of this tutorial is to cover some more recent developments in the area.

Content.

Part 1. (50 mins)

Part 2. (40 mins)

Part 3. (60 mins)

Part 4. (60 mins)

We will also explain the common data used in this research: microarray data, protein hybridization data, protein structure data, protein interaction data, sequence data, and other useful bioinformatics databases.

This tutorial is an up-to-date survey of bioinformatics problems solved or being solved by data mining methods. The emphasis is on matching the two fields, and on the more recent developments.

This is a much updated version of the similar tutorial that we presented at ICDM 2003, with substantial new material on feature selection and clustering, protein interactions, and bio-image analysis. These updates reflect the rapid advances in the field.

This tutorial is self-contained and requires no prior knowledge of biology or bioinformatics. However, much of the material is at an intermediate to advanced level, so attendees with some prior knowledge will appreciate it more.

Target audience.

People with a data mining and machine learning background who wish to apply these methods in bioinformatics. After this tutorial, they will be able to quickly get started on appropriate bioinformatics problems and related data, and will have a sense of where the research emphasis and trends are heading.