Clinical Dataset Analysis and Patient Outcome Prediction via Machine Learning
Author: Buettner, Alex Balin
Advisor: Vasquez, Victor R
Chemical and Materials Engineering
We analyze and evaluate relevant machine learning methods for extracting and understanding the information in clinical datasets, in the context of optimizing clinical processes. Three datasets were considered to demonstrate the types and style of data found in the healthcare field: (a) the Pima Indians diabetes dataset (PIDD), a non-time-dependent diabetes onset study; (b) an alcoholism EEG dataset (AED), which records the responses of alcoholic and control subjects exposed to image stimuli; and (c) the diabetes readmission dataset (DRD), which focuses on factors related to diabetic patient readmission times. Each dataset is modeled using a variety of machine learning methods, including Bayesian, neural network, and decision tree methods, to better understand their advantages and disadvantages for rapid dependency extraction and understanding of the information contained therein. The goal of this work is to analyze the potential of machine learning for use in the management of clinical processes and operations. Neural network models are used to assess all three datasets: two using dense neural networks and one using convolutional neural networks. The dense neural network model used on the PIDD resulted in a maximum prediction accuracy of 81.77%. In contrast, the use of neural network (NN) models on the much larger DRD demonstrated some drawbacks that were not expected upon initial analysis of the data. We found that the NN model performs poorly on this dataset, with classification accuracy no higher than 61.17%, due to the complexity of the dataset and a potential need for more data. The use of convolutional neural networks for analysis of time series data was demonstrated on the alcoholism EEG dataset, resulting in subject classification accuracies between 91.41% and 98.82%, depending on the training and testing sets used to evaluate the model. Bayesian methods are used to analyze all three datasets, in both supervised and unsupervised manners.
Supervised learning analysis on the PIDD showed improvement over published results, but is generally in agreement with the literature. Classification accuracies reached a maximum of 84.49% on a preprocessed dataset and 79.75% on an unmodified dataset. Similarly, supervised learning was applied to the DRD, resulting in maximum classification accuracies of 58% and 62% on a three-class and a two-class model, respectively. Unsupervised Bayesian methods are applied to the alcoholism EEG dataset in order to extract the true number of classes present in the data; all trials correctly identified two subject classes without the aid of labeling. Hidden Markov models are also applied to the alcoholism EEG dataset in an unsupervised fashion, allowing us to extract characteristic states in each EEG sample for each subject class. The PIDD and DRD are also processed using decision tree models: gradient boosting classifiers (GBC) are applied to the PIDD, and extreme gradient boosting classifiers (XGBC) are applied to the DRD. The GBC model used to analyze the PIDD resulted in a maximum classification accuracy of 82.48% on preprocessed data and 80.60% on an unmodified dataset. The DRD proved difficult to model, with maximum classification accuracy reaching approximately 55% and with insensitivity to two of the three data labels. The model seems unable to capture the components of the third class, suggesting that the classes may not be sufficiently distinguishable.