Enhancing Predictive Power of Cluster-Boosted Regression With Text-Based Indexing

IEEE Project Abstract

Clustering prior to regression analysis improves the accuracy of prediction in clinical decision making. However, most previously described methods focused on numerical data only. This paper investigated how well textual features can improve the accuracy of regression predictions. Preliminary diagnosis, diagnosis summary, and drug names used in prescriptions, as provided in the MIMIC II dataset, were used to derive textual features. We proposed the bag-of-entities indexing method, which relies on named entity recognition, a machine learning technique used for locating words and classifying them into predefined classes. The proposed technique captured meaningful phrases from texts in health records and represented them in numerical vector format. The dimensionality of the data space was reduced using principal component analysis. The additional well-tuned textual features were then combined with existing numerical features, using cluster-boosted regression, to predict patient mortality in the ICU. The experimental results showed that textual features improved prediction over the use of numerical features only. We found that the proposed indexing method outperformed traditional word-vector representation approaches (bag-of-words and bag-of-bigrams) as well as a state-of-the-art approach (Doc2vec) in terms of resulting accuracy in predicting death status. Moreover, instead of being interpreted directly, the identifiable individual features were grouped into types and summarized. The summarized, de-identified data of textual features handled by the proposed framework can support predictive classification while also reducing privacy concerns. Grouping similar patients based on their electronic health records also benefits physicians through improved differential diagnosis and more effective treatment planning.
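
The bag-of-entities idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract relies on a trained named entity recognition model, whereas this sketch substitutes a tiny hand-written lexicon (TOY_LEXICON, hypothetical) so that it runs with the Python standard library alone. The notes, entity types, and function names are likewise assumptions made for illustration, and the later steps (principal component analysis and cluster-boosted regression) are not shown.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a trained NER model.
# Keys are entity phrases, values are entity types.
TOY_LEXICON = {
    "congestive heart failure": "DISEASE",
    "pneumonia": "DISEASE",
    "sepsis": "DISEASE",
    "furosemide": "DRUG",
    "vancomycin": "DRUG",
}


def extract_entities(text, lexicon=TOY_LEXICON):
    """Return (phrase, type) pairs found in text by substring lookup.

    Longer phrases are matched first and removed from the text so that
    a shorter phrase nested inside them is not counted a second time.
    """
    remaining = text.lower()
    found = []
    for phrase in sorted(lexicon, key=len, reverse=True):
        occurrences = remaining.count(phrase)
        if occurrences:
            found.extend([(phrase, lexicon[phrase])] * occurrences)
            remaining = remaining.replace(phrase, " ")
    return found


def build_vocabulary(docs_entities):
    """Map every distinct entity phrase to a column index (sorted order)."""
    phrases = sorted({phrase for ents in docs_entities for phrase, _ in ents})
    return {phrase: index for index, phrase in enumerate(phrases)}


def bag_of_entities(entities, vocab):
    """Count vector over entity phrases: one column per vocabulary entry."""
    vector = [0] * len(vocab)
    for phrase, count in Counter(p for p, _ in entities).items():
        if phrase in vocab:
            vector[vocab[phrase]] = count
    return vector


def type_summary(entities):
    """Counts per entity type, dropping the individual phrases."""
    return dict(Counter(entity_type for _, entity_type in entities))


# Invented example notes; these are not taken from MIMIC II.
notes = [
    "Congestive heart failure with pneumonia; started furosemide.",
    "Sepsis. Vancomycin given; pneumonia suspected.",
]

docs = [extract_entities(note) for note in notes]
vocab = build_vocabulary(docs)
vectors = [bag_of_entities(doc, vocab) for doc in docs]
summaries = [type_summary(doc) for doc in docs]

for vector, summary in zip(vectors, summaries):
    print(vector, summary)
```

Each note becomes a count vector with one column per entity phrase, which is the kind of numerical representation that could then be reduced and joined with numerical features. The type-level summary keeps only counts per entity class, which mirrors, in a simplified way, the grouping of identifiable features into types mentioned in the abstract.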

