NSF-III: Medium: Collaborative Research: Robust Large-Scale Electronic Medical Record Data Mining Framework to
Conduct Risk Stratification for Personalized Intervention
The increasingly large amounts of Electronic Medical Record (EMR) data provide unprecedented opportunities for EMR data mining to enhance health care through personalized intervention, improve risk stratification for different diseases, and facilitate understanding of disease and appropriate treatment. To solve the key and challenging problems in mining such large-scale heterogeneous EMRs, we develop a novel robust machine learning framework that addresses the following research tasks. First, we develop new computational tools to automate EMR processing, including missing-value imputation by a new robust rank-k matrix completion method and annotation of unstructured free-text EMRs via a multi-label multi-instance learning model. Second, we investigate a new sparse multi-view learning model to integrate heterogeneous EMRs for predicting the readmission risk of Heart Failure (HF) patients and supporting personalized intervention. Third, to identify longitudinal patterns, we design novel high-order multi-task feature learning and classification methods. Fourth, we build a nonparametric Bayesian model for predicting the event-time outcomes of HF patient readmission. The developed sparse multi-view feature learning and robust multi-task longitudinal pattern finding frameworks enable new computational applications in a large number of research areas.
The proposed research is innovative and crucial not only to addressing emerging EMR applications, but also to advancing machine learning and data mining techniques. We make the developed computational methods and tools available online to the public. These methods and tools are expected to impact other EMR and public health research and to enable investigators working on EMRs to effectively test risk prediction hypotheses. The developed algorithms and tools are expected to support knowledge extraction in broader scientific and biomedical domains with massive high-dimensional and heterogeneous data sets. This project will also facilitate the development of novel educational tools to enhance several current courses at UT Arlington, UTSW, and SMU. The PIs are engaging minority students and under-served populations in research activities to give them better exposure to cutting-edge science research.
Nugget 1. Missing data imputation method for longitudinal data completion:
In longitudinal electronic medical records, many entries (e.g., cholesterol measures and blood pressure) are measured and updated every 24 hours, so it is natural to impute missing data using the time-series structure. We developed a novel spatially-temporally consistent tensor completion method that utilizes both the spatial and the temporal information of the longitudinal data. We introduce a new smoothness regularization to exploit the content continuity within the data. Besides using this smoothness regularization to maintain temporal consistency, we also minimize the trace norm of the data at each individual time stage to exploit the spatial correlations among data elements.
This work was published in AAAI 2014.
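The two regularizers above can be pictured as alternating proximal-style steps: a singular value thresholding step per time slice for the trace norm, and a neighbor-averaging step for temporal consistency, with observed entries restored after each pass. This is an illustrative sketch, not the published algorithm; the function names, step parameters, and the simple averaging smoother are our assumptions.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: proximal operator of the trace (nuclear) norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def st_tensor_complete(X, mask, tau=1.0, beta=0.5, n_iter=100):
    """Sketch of spatio-temporally consistent completion of a
    (patients x features x time) tensor.  Each pass
    (1) shrinks the singular values of every time slice (spatial low rank),
    (2) pulls each slice toward the mean of its temporal neighbors
        (temporal smoothness), and (3) restores the observed entries.
    tau, beta, and the averaging smoother are illustrative choices."""
    Z = np.where(mask, X, 0.0)
    T = X.shape[2]
    for _ in range(n_iter):
        # spatial step: low-rank shrinkage of each time slice
        for t in range(T):
            Z[:, :, t] = svt(Z[:, :, t], tau)
        # temporal step: smooth each slice toward its neighbors
        S = Z.copy()
        for t in range(T):
            nbrs = [Z[:, :, u] for u in (t - 1, t + 1) if 0 <= u < T]
            S[:, :, t] = (1 - beta) * Z[:, :, t] + beta * np.mean(nbrs, axis=0)
        Z = S
        # data-fit step: keep observed entries fixed
        Z[mask] = X[mask]
    return Z
```

On synthetic low-rank data with a smooth temporal factor, this heuristic recovers the missing entries far better than zero-filling, which is the behavior the two regularizers are designed to produce.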
Nugget 2. Natural language processing based document abstractive summarization framework:
Electronic medical record data contain unstructured data, such as physician notes, and it is challenging to extract useful information from them. While much work has been done on extractive summarization, abstractive summarization has received limited study because it is much harder to achieve. We propose a new weakly supervised abstractive summarization framework using pattern-based approaches. Our system first generates meaningful patterns from sentences. Then, to cluster patterns precisely, we propose a novel semi-supervised pattern learning algorithm that leverages a hand-crafted list of topic-relevant keywords, which is the only weakly supervised information our framework uses to generate aspect-oriented summaries. After that, our system generates new patterns by fusing existing patterns and selecting top-ranked new patterns via a recurrent neural network language model. Finally, we introduce a new pattern-based surface realization algorithm to generate abstractive summaries. This abstractive summarization technique helps users extract useful knowledge from unstructured electronic medical record data such as physician notes, and the resulting summaries can also be converted into word vectors for risk prediction.
This work was published in CIKM 2015.
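The weakly supervised pattern-clustering step can be illustrated with a toy sketch in which sliding word n-grams stand in for the system's richer syntactic patterns and hand-crafted keyword lists supply the only supervision. The aspect names, keywords, and helper functions below are all hypothetical, chosen only to show the shape of the idea.

```python
from collections import defaultdict

# Hand-crafted, topic-relevant keyword lists: the only weak supervision
# used.  These aspects and keywords are illustrative, not from the paper.
ASPECT_KEYWORDS = {
    "symptoms":  {"pain", "dyspnea", "fatigue", "edema"},
    "treatment": {"prescribed", "dose", "therapy", "diuretic"},
}

def extract_patterns(sentence, window=3):
    """Toy pattern generator: word n-grams stand in for the richer
    sentence patterns used by the actual system."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + window]) for i in range(len(words) - window + 1)]

def aspect_summary(sentences):
    """Group patterns by aspect via keyword overlap, then keep the most
    frequent pattern per aspect as a crude aspect-oriented summary."""
    buckets = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        for p in extract_patterns(s):
            toks = set(p.split())
            for aspect, kws in ASPECT_KEYWORDS.items():
                if toks & kws:                  # weak supervision signal
                    buckets[aspect][p] += 1
    return {a: max(pats, key=pats.get) for a, pats in buckets.items()}
```

The real framework goes further: it fuses clustered patterns into new ones, ranks them with a recurrent neural network language model, and runs surface realization to produce fluent abstractive sentences rather than selecting an existing pattern.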
Nugget 3. A new structured sparse learning model to integrate multi-dimensional features for risk prediction:
We propose a novel feature learning model based on structured sparsity-inducing norms to integrate multi-dimensional features for risk prediction. The new mixed norms are designed to learn the importance of different features from both local and global points of view. We successfully integrate multi-dimensional data to enhance classification results. The empirical results show that the proposed method effectively integrates different types of features and consistently outperforms related methods that use concatenated feature vectors.
This work was published in KDD 2015.
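One way to picture view-level structured sparsity is a group-lasso-style proximal gradient method on the concatenated views. The minimal sketch below (function names, loss, and penalty choice are our assumptions) captures only the "global", per-view part of the mixed norms described above; the published model also weights individual features locally.

```python
import numpy as np

def prox_group(w, groups, lam):
    """Proximal operator of a group-sparsity (L2,1-style) penalty:
    shrinks each view's weight block toward zero as a unit."""
    out = w.copy()
    for g in groups:
        nrm = np.linalg.norm(w[g])
        out[g] = 0.0 if nrm <= lam else (1 - lam / nrm) * w[g]
    return out

def multiview_group_lasso(X_views, y, lam=0.1, lr=0.01, n_iter=500):
    """Sketch of structured sparse multi-view learning: least-squares
    loss on the concatenated views plus a group penalty that scores
    each view as a whole, optimized by proximal gradient descent."""
    X = np.hstack(X_views)
    groups, start = [], 0
    for V in X_views:                      # one index group per view
        groups.append(np.arange(start, start + V.shape[1]))
        start += V.shape[1]
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w = prox_group(w - lr * grad, groups, lr * lam)
    return w, groups
```

When the target depends on only one view, the penalty drives the other view's entire weight block toward zero, which is the view-selection behavior that makes the model outperform naive feature concatenation.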
Nugget 4. Discrete missing value imputation software:
Electronic medical records contain many missing values, and many of these values are discrete rather than continuous. Existing missing-value imputation methods mainly focus on continuous value prediction; in such cases, an additional step is needed to process the continuous results with either heuristic threshold parameters or complicated mappings, which is inefficient and may diverge from the optimal solution. To address this issue, we proposed a novel optimal discrete matrix completion model that learns the optimal thresholds automatically and also guarantees an exact low-rank structure of the target matrix. We derive a stochastic gradient descent algorithm to optimize the new objective, with strategies to speed up the optimization. Experiments show that our method predicts discrete values with high accuracy, comparable to or even better than values obtained with carefully tuned thresholds. Meanwhile, our model can handle online data and is easy to parallelize for imputation of large-scale missing data.
This work was published in AAAI 2016.
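A minimal sketch of the idea, assuming a logistic link and a single shared threshold learned jointly by SGD (the published model learns optimal thresholds inside a low-rank completion objective with additional acceleration strategies; names and hyperparameters here are illustrative):

```python
import numpy as np

def discrete_mc_sgd(X, mask, rank=5, lr=0.05, lam=0.01, epochs=200, seed=0):
    """Sketch of discrete matrix completion by SGD: fit a rank-k
    factorization U V^T together with a scalar threshold b using a
    logistic loss on the observed binary entries, then read off
    discrete predictions by comparing U V^T against the learned b."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    b = 0.0
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for idx in rng.permutation(len(rows)):   # one observed entry per step
            i, j = rows[idx], cols[idx]
            z = U[i] @ V[j] - b
            g = 1.0 / (1.0 + np.exp(-z)) - X[i, j]   # d(logistic loss)/dz
            ui = U[i].copy()
            U[i] -= lr * (g * V[j] + lam * U[i])
            V[j] -= lr * (g * ui + lam * V[j])
            b += lr * g                              # dz/db = -1
    # discrete prediction via the learned threshold, no hand tuning
    return (U @ V.T > b).astype(int)
```

Because each update touches only one row of U and one row of V, the entry-wise SGD updates stream naturally over online data and can be parallelized across disjoint entries, matching the scalability claim above.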
Some Related Publications:
Song Zhang, Jing Cao, Chul Ahn, "Statistical Inference and Sample Size Calculation for Paired Binary Outcomes with Missing Data," Statistics in Medicine, 2016, in press.
Software 1: Structured sparse learning model to integrate multi-dimensional features for risk prediction:
Click to Download
Software 2: Discrete missing value imputation software:
Click to Download
Software 3: Robust metric learning using capped trace norm:
Click to Download
Software 4: Natural language processing based document abstractive summarization tool:
We are cleaning the code and will provide it soon.