Biomedical Data Mining for Information Retrieval
1.2 Review of Literature
Many researchers have applied different models to the PhysioNet Challenge 2012 dataset and reported different accuracy results.
Silva et al. [7] developed a method for predicting in-hospital death (0 for a survivor, 1 for a patient who died in hospital). They collected the data from the PhysioNet website and carried out the challenge. The dataset consists of three sets, A, B and C, each with 4,000 records. The challenge comprises two events: event I measures the performance of a binary classifier and event II measures the performance of a risk estimator. Event I is scored using sensitivity and positive predictive value, and event II using the Hosmer–Lemeshow statistic [8]. A baseline algorithm (SAPS-I) obtained scores of 0.3125 and 68.58 for events I and II respectively, and the final scores they obtained for events I and II are 0.5353 and 17.58. In Ref. [9], Johnson et al. describe a novel Bayesian ensemble algorithm for mortality prediction. Artifacts and erroneous recordings are removed during data pre-processing. The model is trained using the 4,000 records of training set A and also with the two datasets B and C. Jack-knifing is used to estimate the performance of the model. The model obtained values of 0.5310 and 0.5353 as score 1 on the hidden datasets, and the Hosmer–Lemeshow statistic gave 26.44 and 29.86 as score 2. The re-developed model obtained 0.5374 and 18.20 for scores 1 and 2 on dataset C. Overall, the proposed model performs better than the traditional SAPS model and has advantages such as the handling of missing data. An improved model for estimating in-hospital mortality in the ICU using 37 time series variables is presented in Ref. [10]. The authors estimated the performance of various models using 10-fold cross-validation. Missing values are common in clinical data; here they are imputed using the mean value for the patient's age and gender. A logistic regression model is then trained on the dataset.
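The event I score used throughout these studies can be illustrated with a minimal Python sketch. It assumes that sensitivity and positive predictive value are combined by taking their minimum, as stated later in this section for Ref. [14]; the function name `event1_score` and the toy labels are hypothetical and are not taken from any of the cited papers.

```python
def event1_score(y_true, y_pred):
    """Return min(sensitivity, positive predictive value) for 0/1 labels.

    Assumption: the two measures are combined with min(), following the
    description of event 1 given for Ref. [14] in this section.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Guard against empty denominators (no deaths, or no predicted deaths).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return min(sensitivity, ppv)


# Toy example: 3 deaths among 8 patients, 3 predicted deaths (2 correct).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = event1_score(y_true, y_pred)  # sensitivity = PPV = 2/3
```

Taking the minimum means a classifier cannot score well by trading one measure for the other, for example by predicting death for every patient.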
The performance of the model is evaluated on the two events: event 1 for accuracy, using the lower of sensitivity and positive predictive value, and event 2 for calibration, using the Hosmer–Lemeshow H statistic. Their model scored 0.516 and 14.4 for events 1 and 2 on test set B, and 0.482 and 51.7 on test set C. The model performs better than the existing SAPS model. Ref. [11] presents an algorithm that predicts the in-hospital death of ICU patients for event 1 and estimates the probability of death for event 2. Missing values are imputed with zero and the data are normalized. Six support vector machine (SVM) classifiers are trained; for each SVM, the training set contains the positive examples and one sixth of the negative examples. The scores obtained for events 1 and 2 are 0.5345 and 17.88 respectively. An artificial neural network model has been developed to predict the in-hospital death of ICU patients from observations in the first 48 h after admission [12]. Missing values are replaced with an artificial value based on an assumption. Of all the features, 26 are selected for further processing. For classification, a two-layer neural network with 15 neurons in the hidden layers is used. The model uses 100 voting classifiers, and its output is the average of the 100 outputs. The model is trained and tested using 5-fold cross-validation. A fuzzy threshold is used to determine the output of the neural network. The model scored 0.5088 for event 1 and 82.211 for event 2 on the test data set. Ref. [13] presents an approach that identifies time series motifs to predict the in-hospital death of ICU patients, segmenting the variables into low, medium and high measurements. The method outperformed the existing scoring systems SAPS-II, APACHE-II and SOFA, and obtained a score of 0.46 for event 1 and 56.45 for event 2.
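The training-set construction described for Ref. [11] can be sketched as follows. This is only an illustration under two assumptions that the text does not confirm: that every SVM sees all positive examples, and that the six portions of negative examples are disjoint. The helper name `balanced_subsets` is hypothetical, and the SVM training step itself is omitted.

```python
import random


def balanced_subsets(labels, n_subsets=6, seed=0):
    """Build index sets, each with all positives and 1/n_subsets of negatives.

    Assumptions (not stated in the source): the negative portions are
    disjoint and every subset contains all positive examples.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    random.Random(seed).shuffle(negatives)
    # Stride slicing yields n_subsets disjoint parts of near-equal size.
    parts = [negatives[k::n_subsets] for k in range(n_subsets)]
    return [sorted(positives + part) for part in parts]


# Toy data: 4 deaths (label 1) and 24 survivors (label 0).
labels = [1] * 4 + [0] * 24
subsets = balanced_subsets(labels)
# Each subset holds the 4 positives plus 4 of the 24 negatives.
```

Each index set would then be used to train one classifier, which reduces the class imbalance seen by any single SVM.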
An improved mortality prediction method using logistic regression and a hidden Markov model has been developed for in-hospital death in Ref. [14]. The model is trained on the 4,000 patient records of set A and validated on the other sets of unseen data of 4,000 records. Two events are used: event 1, scored by the minimum of sensitivity and positive predictive value, and event 2, scored by the Hosmer–Lemeshow H statistic. The model gave 0.50 and 0.50 for event 1 and 15.18 and 78.9 for event 2, compared with SAPS-I, whose scores are 0.3170 and 0.312 for event 1 and 66.03 and 68.58 for event 2. An effective framework for predicting in-hospital mortality during the ICU stay has been suggested in Ref. [15]. Features are extracted by data interpolation and histogram analysis. To reduce the complexity of feature extraction, the feature vector is reduced by evaluating the measurement values of each variable. Finally, a cascaded AdaBoost learning model is applied as the mortality classifier; it obtained a score of 0.806 for event 1 and 24.00 for event 2 on dataset A. On dataset B the model obtained scores of 0.379 and 5331.15 for events 1 and 2. A decision support application for mortality risk prediction has been reported in Ref. [16]. The authors use fuzzy rule-based systems for the clinical rules. An optimizer based on a genetic algorithm generates the coefficients of the final solution. The resulting model, FIS, achieves a score of 0.39 for event 1 and 94 for event 2. A new method for predicting mortality in the ICU is proposed in Ref. [17]. The method, Simple Correspondence Analysis (SCA), is based on both clinical and laboratory data together with the two earlier models APACHE-II and SAPS-II. It uses the PhysioNet Challenge 2012 data, a total of 12,000 records in sets A, B and C, with 37 recorded time series variables. SCA is applied to select variables and combines them using the traditional APACHE and SAPS methods.
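Several of the studies above report a Hosmer–Lemeshow statistic as their event 2 score. A minimal sketch of the textbook form of the statistic is given here, assuming equal-sized groups of patients sorted by predicted risk. The exact variant and any normalisation used by the challenge are not specified in this section, so the function `hosmer_lemeshow` and its defaults are illustrative only.

```python
def hosmer_lemeshow(y_true, risk, n_groups=10):
    """Textbook Hosmer-Lemeshow statistic over equal-sized risk groups.

    Sums (observed - expected)^2 / (n_g * p_g * (1 - p_g)) over groups,
    where p_g is the mean predicted risk in group g. Lower values mean
    better calibration. Assumption: the grouping scheme is illustrative.
    """
    pairs = sorted(zip(risk, y_true))  # sort patients by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        n_g = len(group)
        observed = sum(y for _, y in group)   # observed deaths in the group
        expected = sum(p for p, _ in group)   # expected deaths in the group
        mean_risk = expected / n_g
        variance = n_g * mean_risk * (1.0 - mean_risk)
        if variance > 0:
            stat += (observed - expected) ** 2 / variance
    return stat


# Two risk levels; observed death rates match the predictions exactly.
risk = [0.25] * 4 + [0.75] * 4
calibrated = hosmer_lemeshow([1, 0, 0, 0, 1, 1, 1, 0], risk, n_groups=2)
```

Unlike the event 1 score, this measure rewards predicted probabilities that match observed death rates, which is why a model can rank well on one event and poorly on the other.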
This method predicts whether the patient will survive or not. Finally, the model obtained a score 1 of 43.50% for set A, 42.25% for set B and 42.73% for set C. The naive Bayes classifier is used in [18] to predict mortality in the ICU, aiming for a high S1 and a small S2, where S1 is defined from sensitivity and positive predictive value and S2 is the Hosmer–Lemeshow H statistic. Missing values are replaced by NaN (Not-a-Number) if a variable is not measured. On set B the model achieves 0.475 for S1, the eighth best solution, and 12.820 for S2, the best solution. On set C the model achieved a score of 0.4928 for event 1 (fourth best solution) and 0.247 for event 2 (third best solution). Di Marco et al. [19] proposed a new algorithm for mortality prediction with better accuracy, using data collected during the first 48 h after ICU admission. A binary classifier is applied to obtain the result for event 1. Set A, which contains 41 variables for 4,000 patients, is selected. Forward sequential feature selection with a logistic cost function is used. For classification, a logistic regression model is used, which obtained a score of 54.9% on set A and 44.0% on test set B. Ref. [20] developed a model based on the support vector machine (SVM) to predict the mortality rate. The SVM is a machine learning algorithm that tries to minimize error and find the maximum-margin hyperplane. The two classes are 0 for a survivor and 1 for a patient who died in hospital. The authors used 3,000 records for training and 1,000 for testing. They observed over-fitting of the SVM on set A, where it obtained a score of 0.8158 for event 1 and 0.3045 for event 2. In phase 2 they improved the SVM training strategy to reduce the over-fitting. The final score obtained for event 1 is 0.530, for set B 0.350 and for set C 0.333. An algorithm based on an artificial neural network has been employed to predict in-hospital patient mortality in Ref. [21].
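The forward sequential feature selection mentioned for Ref. [19] can be sketched generically. In the sketch, the cost function is passed in as a parameter; the toy cost used in the example is a stand-in and not the logistic cost function of the paper, and the feature names (`age`, `gcs`, `bun`, `noise`) are hypothetical.

```python
def forward_selection(features, cost, max_features=None):
    """Greedily add the feature that lowers cost(selected) the most.

    `cost` maps a list of feature names to a number (lower is better).
    Selection stops when no remaining feature improves the cost.
    """
    selected, remaining = [], list(features)
    best_cost = float("inf")
    while remaining and (max_features is None or len(selected) < max_features):
        trial_costs = {f: cost(selected + [f]) for f in remaining}
        candidate = min(trial_costs, key=trial_costs.get)
        if trial_costs[candidate] >= best_cost:
            break  # no remaining feature improves the cost
        best_cost = trial_costs[candidate]
        selected.append(candidate)
        remaining.remove(candidate)
    return selected


# Toy cost (assumption): each useful feature lowers the cost by a fixed
# amount, and every added feature carries a small complexity penalty.
useful = {"age": 0.30, "gcs": 0.25, "bun": 0.10}


def toy_cost(subset):
    return 1.0 - sum(useful.get(f, 0.0) for f in subset) + 0.01 * len(subset)


chosen = forward_selection(["age", "gcs", "bun", "noise"], toy_cost)
# chosen == ["age", "gcs", "bun"]; "noise" never improves the cost.
```

In practice the cost would be evaluated by fitting a model on the candidate subset, for instance a cross-validated logistic loss, which makes each step considerably more expensive than in this toy example.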
Features are extracted from the PhysioNet data using a method originally used to detect solar 'nanoflares', chosen because of the similarity between solar data and the time series data. Data preprocessing removes outliers, and missing values are replaced by the mean value for each patient. The trained model yields a score of 22.83 for event 2 on set B and 38.23 on set C. A logistic regression model is suggested for the same purpose in Ref. [22]. It follows three phases. In phase 1, derived variables are selected on set A by computing each variable's first value, average, minimum, maximum, total time, first difference and last value. Phase 2 applies a logistic regression model to predict in-hospital death (0 for survivor, 1 for died) on set A. Phase 3 applies the logistic regression model to obtain the scores for events 1 and 2. The results obtained are 0.4116 for score 1 and 8.843 for score 2. Ref. [23] also reports a logistic regression model for mortality prediction. The experiment uses the 4,000 ICU patients of set A for training and the 4,000 patients of set B for testing. The filtering process identifies 30 variables for building the model. The scores obtained are 0.451 for event 1 and 45.010 for event 2. A novel cluster analysis technique is used in Ref. [24] to test the similarity between time series for mortality prediction. For data preprocessing it uses a segmentation-based approach that divides each variable into several segments, and the maximal and minimal values are used to preserve the statistical features. Weighted Euclidean distance-based clustering and rule-based classification are used. The average result obtained is 22.77 to 33.08% for death prediction and 75 to 86% for survival prediction.
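The derived variables listed for phase 1 of Ref. [22] can be illustrated with a short sketch that summarizes one time series variable for one patient. Two of the definitions are assumptions, since the text does not spell them out: "total time" is taken as the span between the first and last measurement, and "first difference" as the second value minus the first. The function name `summarize` and the heart-rate values are hypothetical.

```python
def summarize(times, values):
    """Summary features for one variable of one patient.

    `times` are measurement times (e.g. minutes since admission) and
    `values` the corresponding measurements, both in time order.
    """
    if not values:
        return None  # variable never measured for this patient
    return {
        "first": values[0],
        "last": values[-1],
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        # Assumption: span between the first and last measurement.
        "total_time": times[-1] - times[0],
        # Assumption: change between the first two measurements.
        "first_diff": values[1] - values[0] if len(values) > 1 else 0.0,
    }


# Hypothetical heart-rate readings taken hourly over three hours.
hr = summarize([0, 60, 120, 180], [80.0, 90.0, 100.0, 70.0])
```

Summaries of this kind turn irregularly sampled series of different lengths into a fixed-length feature vector, which is what a logistic regression model requires.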
In Ref. [25], the main goal is to improve mortality prediction for ICU patients using the PhysioNet Challenge 2012 dataset. Three objectives are accomplished: (i) reduction of dimensions, (ii) reduction of uncontrolled variance and (iii) less dependency on the training set. Feature reduction techniques such as Principal Component Analysis, Spectral Clustering, Factor Analysis and Tukey’s HSD test are used. Classification is done using an SVM, which achieved an accuracy of 0.73, better than previous work. The authors of Ref. [26] extracted 61,533 records from MIMIC-III v1.4 and excluded patients younger than 16, patients who stayed less than 4 h and patients whose data are not present in the flow sheet. The final cohort of 50,488 ICU stays is used for the experiments. Features are extracted using a fixed-length window. The machine learning models used are logistic regression (LR), LR with an L1 regularization penalty using the Least Absolute Shrinkage and Selection Operator (LASSO), LR with an L2 regularization penalty, and gradient boosting decision trees. Severity of illness is calculated using different scores such as APS III, SOFA, SAPS, LODS, SAPS II and OASIS. Two types of experiment are conducted: a benchmarking experiment and a real-time experiment. Among the models compared, the gradient boosting algorithm obtained a high AUROC of 0.920. Ref. [27] predicts hospital mortality at an early stage of admission through time series analysis of intensive care unit patient data, using different data mining techniques. Traditional scoring systems such as APACHE, SAPS and SOFA are used to obtain scores. A total of 4,000 ICU patients are selected from the MIMIC database, and 37 time series variables are taken from the first 48 h after admission. The Synthetic Minority Oversampling Technique (SMOTE) is applied to the original data (original and smote) to modify the dataset; missing data are handled by replacement with the mean (rep1), and SMOTE is then applied (rep1 and smote).
After the missing data are replaced, the EM-imputation algorithm (rep2) is applied. Finally, results are obtained using three classifiers: Random Forest (RF), Partial Decision Tree (PART) and Bayesian Network (BN). Of the three, Random Forest obtained the best results, with an AUROC of 0.83 ± 0.03 at 48 h on rep1, 0.82 ± 0.03 at 40 h on original, rep1 and smote, and 0.82 ± 0.03 at 48 h on rep2 and smote.
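The oversampling step in Ref. [27] relies on SMOTE, whose core idea can be shown in a few lines. The sketch below is a simplified illustration, not the implementation used in the paper: it creates each synthetic minority sample by interpolating between a randomly chosen minority sample and one of its k nearest minority neighbours. The function name `smote`, its parameters and the toy points are all hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples by interpolating between neighbours.

    `minority` is a list of feature vectors (lists of floats) from the
    minority class, containing at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Three minority-class points; every synthetic point lies on a segment
# joining two of them.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote(minority, n_new=5)
```

Because the interpolation needs complete feature vectors, missing values have to be imputed first, which is consistent with the replacement variants (rep1, rep2) described above being combined with SMOTE.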
Sepsis is one of the causes of a high mortality rate and should be treated quickly, because sepsis [28] increases the risk of death after discharge from hospital. The objective of the paper is to develop a model for one-year mortality prediction. A total of 5,650 patients admitted with sepsis were selected from the MIMIC-III database and divided into 70% for training and 30% for testing. The stochastic gradient boosting method is used to develop the one-year mortality prediction model. Variables are selected using the Least Absolute Shrinkage and Selection Operator (LASSO), and the AUROC is calculated. An AUROC of 0.8039 (95% confidence interval: 0.8033–0.8045) is obtained on the testing set. Finally, it is observed that the stochastic gradient boosting ensemble algorithm is more accurate for one-year mortality prediction than the traditional scoring systems (SAPS, OASIS, MPM or SOFA).
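The AUROC values quoted in this section can be read as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. A minimal pairwise sketch of that definition follows; it is quadratic in the number of patients and meant for illustration only, and the function name `auroc` and the toy scores are hypothetical.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Two survivors (0) and two deaths (1) with predicted risks.
value = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the figures of about 0.80 to 0.92 reported above indicate good but imperfect discrimination.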
Deep learning has been applied successfully to various large and complex datasets. It is a newer technique that has outperformed traditional techniques. A multi-scale deep convolutional neural network (ConvNet) model for mortality prediction is proposed in Ref. [29]. The dataset is taken from the MIMIC-III database, and 22 different variables are extracted from the measurements of the first 48 h for each patient. A ConvNet is a multilayer neural network in which a discrete convolution operation is applied. The convolutional neural network models were developed using the Python packages Keras and TensorFlow as the backend. The proposed model achieves a ROC AUC of 0.8735 (± 0.0025), which is comparable to state-of-the-art deep learning models.
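The discrete convolution operation on which ConvNets are built can be illustrated on a one-dimensional signal. The sketch below computes a "valid" convolution (no padding) in plain Python. It is not the multi-scale architecture of Ref. [29], and the function name `conv1d`, the signal and the kernel are hypothetical. Note that deep learning libraries commonly apply the kernel without flipping it (cross-correlation), whereas the mathematical definition used here flips it.

```python
def conv1d(signal, kernel):
    """'Valid' discrete convolution of a 1D signal with a kernel.

    Implements (x * w)[n] = sum_k x[n - k] * w[k], keeping only the
    outputs where the kernel fully overlaps the signal.
    """
    k = len(kernel)
    flipped = kernel[::-1]  # convolution flips the kernel
    return [
        sum(signal[i + j] * flipped[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# A difference kernel applied to a steadily rising series gives a
# constant response.
response = conv1d([1, 2, 3, 4, 5], [1, 0, -1])  # [2, 2, 2]
```

In a ConvNet the kernel weights are learned rather than fixed, and kernels of different lengths respond to patterns at different time scales in the patient's measurements.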