CSE 5334: Data Mining

Outline and Schedule:

There are 4 projects.
Together they form a coherent mini-package for data mining.
They include the classifiers kNN, Centroid, and Linear Regression, as well as K-means clustering.
This code will run on the data directly, or on
- PCA/LDA (dimension-reduced) representation,
- feature-selected data,
- graph-embedded data.
-----------------------------------------------------------------------------

Week 1.
We start with three concrete examples:
1. Data Mining example: Market basket data analysis
2. Pattern Recognition example: Handwritten letter recognition
3. Cancer prediction using DNA expressions recorded on microarrays

From these examples, key ideas, concepts and methods will be introduced.
Data mining uses many techniques from Machine Learning and Pattern Recognition.

Week 2.
- Brief Introduction to Information Retrieval, word processing, vector space model
- Naive Bayes Classification

Week 3.
Classification, decision boundary, Bayes classifier
- k Nearest Neighbors (kNN)
- Centroid Method
- Linear Regression

Project 1. Classification using kNN, Centroid method, Linear Regression
Due Feb 24
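
As a rough illustration of two of the Project 1 classifiers, here is a minimal sketch of kNN and the Centroid method in plain Python. The function names (knn_predict, centroid_fit, centroid_predict), the Euclidean distance, and the default k=3 are assumptions made for this example, not the required project interface. Linear Regression is left out to keep the sketch short.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def centroid_fit(train_X, train_y):
    """Return a dict mapping each class label to the mean vector of its points."""
    groups = {}
    for x, label in zip(train_X, train_y):
        groups.setdefault(label, []).append(x)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def centroid_predict(centroids, query):
    """Label `query` with the class whose centroid is closest."""
    return min(centroids, key=lambda label: euclidean(centroids[label], query))

# Toy data, made up for illustration: two well-separated classes.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y = ["a", "a", "a", "b", "b", "b"]

centroids = centroid_fit(X, y)
print(knn_predict(X, y, [0.5, 0.5], k=3))
print(centroid_predict(centroids, [4.0, 4.0]))
```

kNN stores the whole training set and defers all work to prediction time, while the Centroid method summarizes each class by a single mean vector.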

Week 4.
- Support Vector Machine
- Multi-class classification using binary classifiers
- Kernels (Gaussian, polynomial)
- Evaluation of classifiers: Precision, Recall, cross-validation

Project 2. Split data into training and testing sets. Run kNN, Centroid method, Linear Regression. Run SVM using linear and Gaussian kernels. Do 5-fold cross-validation.
Due March 9
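
The 5-fold cross-validation and the precision/recall evaluation in Project 2 can be sketched as follows. This is an illustration under stated assumptions: the helper names (k_fold_indices, cross_validate, precision_recall) are hypothetical, a 1-nearest-neighbor rule stands in for whichever classifier is being evaluated, and the toy data is made up.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_neighbor_predict(train_X, train_y, query):
    """Stand-in classifier: 1-NN with squared Euclidean distance."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)))
    return train_y[best]

def cross_validate(X, y, predict, k=5, seed=0):
    """Return the accuracy on each held-out fold, training on the other k-1 folds."""
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct = sum(predict(train_X, train_y, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return scores

def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class treated as `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data, made up for illustration: two well-separated classes of 10 points.
X = [[i * 0.1, i * 0.1] for i in range(10)] + \
    [[10 + i * 0.1, 10 + i * 0.1] for i in range(10)]
y = ["a"] * 10 + ["b"] * 10

scores = cross_validate(X, y, nearest_neighbor_predict, k=5)
print("mean accuracy:", sum(scores) / len(scores))
```

Every point is held out exactly once, so the mean of the fold scores uses all the data for testing without ever testing on a point that was used for training in the same fold.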

Week 5 - 6.
Clustering
- K-means clustering
- Gaussian Mixture Model and EM Algorithm

Project 3. Clustering using K-means, on the data from Project 1.
Due March 30
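
For orientation, K-means can be sketched as the usual alternation between assigning points to the nearest center and recomputing each center as the mean of its points (Lloyd's algorithm). The function name kmeans, the random initialization from the data points, and the toy data are assumptions for this example.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: returns (centers, labels) for a list of feature vectors."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct data points as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Toy data, made up for illustration: two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

K-means only finds a local optimum, so on harder data the result depends on the initialization; running it from several seeds and keeping the best objective value is a common precaution.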

Week 7.
Data types, preprocessing, normalization, etc.

Week 8 - 9.
Feature Selection
- t-statistic, f-statistic
- mutual information
- minimum redundancy, maximum relevance
- filters, wrappers, feature set selection

Project 4a. Use the f-statistic to select features. Run kNN, centroids, linRegression, SVM. Run K-means on selected data.
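
One way to read "use the f-statistic to select features" is as a filter: score each feature with the one-way ANOVA F statistic (between-class variance over within-class variance) and keep the top-scoring ones. The sketch below assumes that reading; the names f_statistic and select_features and the toy data are made up for illustration.

```python
def f_statistic(values, labels):
    """One-way ANOVA F statistic of a single feature against the class labels."""
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(label, []).append(v)
    n, c = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class sum of squares: spread of the class means around the grand mean.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    # Within-class sum of squares: spread of each point around its own class mean.
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (c - 1)) / (within / (n - c))

def select_features(X, y, top):
    """Return (indices of the `top` highest-scoring features, all scores)."""
    scores = [f_statistic(list(col), y) for col in zip(*X)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:top], scores

# Toy data, made up for illustration: feature 0 separates the classes,
# feature 1 does not, and feature 2 is constant.
X = [[1.0, 5.0, 2.0], [2.0, 3.0, 2.0], [3.0, 4.0, 2.0],
     [11.0, 4.0, 2.0], [12.0, 5.0, 2.0], [13.0, 3.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]

chosen, scores = select_features(X, y, top=1)
X_selected = [[row[j] for j in chosen] for row in X]
print(chosen, scores)
```

The reduced matrix X_selected can then be passed to the classifiers and to K-means in place of the full data.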

Week 10 - 11.
Dimension Reduction
- principal component analysis
- linear discriminant analysis

Project 4b. Run PCA and LDA on data to obtain low-dimensional representation. Run kNN, centroids, linRegression, SVM. Run K-means on the reduced data.
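
As a small illustration of the PCA half of Project 4b, the sketch below centers the data, forms the sample covariance matrix, and finds the leading principal component by power iteration. Power iteration is a choice made here to avoid external libraries; in practice an eigendecomposition or SVD routine would normally be used. The function names and toy data are assumptions, and LDA is not covered.

```python
import math

def center(X):
    """Subtract the per-feature mean from every row."""
    means = [sum(col) / len(X) for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(C, iters=500):
    """Leading eigenvector and eigenvalue of C by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(C)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for a unit-length v.
    eigenvalue = sum(v[i] * sum(C[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue

def project(Xc, v):
    """One-dimensional representation: coordinate of each row along v."""
    return [sum(a * b for a, b in zip(row, v)) for row in Xc]

# Toy data, made up for illustration: points lying on the line y = 2x.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
Xc = center(X)
v, eigenvalue = top_component(covariance(Xc))
reduced = project(Xc, v)
print(v, eigenvalue, reduced)
```

Further components could be obtained by removing the found direction from the covariance matrix (deflation) and repeating, which is omitted here.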

Week 12 - 13.
Graph Embedding
- Embedding a graph (distance matrix) in a metric space: multi-dimensional scaling
- Embedding a graph (similarity matrix) in a metric space
- Laplacian embedding

Project 4c. Compute a kernel. Embed it in low-dimensional space. Run kNN, centroids, linRegression, SVM. Run K-means.
Due May 6. Project 4 presentation at 4-6pm.

Week 14 - 15.
Semi-supervised Learning
- A large number of data points are unlabeled; only a small number have class labels

Final Exam