Team members included: Shusaku Asai, Saahithi Rao, Michael Tang, Yi Feng

This blog post will describe the process of analyzing ICU data using machine learning methods and creating a dashboard using Streamlit. I will also describe what I learned while completing this project. The data used was obtained from the Medical Information Mart for Intensive Care (MIMIC-III) and can be found at https://mimic.mit.edu/.

View dashboard visualization: https://share.streamlit.io/delashu/bios823_project/main/scripts/dashboard/icu_dash.py

View GitHub repository: https://github.com/delashu/BIOS823_Project

Our primary aim was to predict death in the ICU. Specifically, we wanted to use patient demographics, prescribed medications, procedures, and admission information to predict the probability of a patient dying in the ICU. Mortality risk predictions can be useful for hospitals to determine where to allocate resources and for doctors to better understand their patients' outcomes.

Source: https://link.springer.com/chapter/10.1007/978-3-319-43742-2_21

In order to obtain access to the data, we had to complete CITI training. Because the data set was so large, we began by using a subset of the data, which can be found here: https://physionet.org/content/mimiciii-demo/1.4/. We used four different data sets: admissions, ICU, procedures, and prescriptions.

The first step was to perform feature engineering. We initialized an SQL database using sqlite3 with the necessary data frames. We queried the data using SQL as shown in the example code below. We initially tried using Google BigQuery, but determined that a local SQLite database would be equally effective.

import sqlite3
import pandas as pd

# connect to the database and read each table into a data frame
con = sqlite3.connect('MIMIC3_full.db')
admitdf = pd.read_sql('select * from admission', con)
icudf = pd.read_sql('select * from icu', con)
procdf = pd.read_sql('select * from procedure', con)
drugdf = pd.read_sql('select * from prescription', con)
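
For reference, here is a minimal sketch of how such a database can be initialized, assuming the standard MIMIC-III CSV file names (the table names match the queries above):

import sqlite3
import pandas as pd

# write each raw CSV into the database as a table
con = sqlite3.connect('MIMIC3_full.db')
for csv_file, table in [('ADMISSIONS.csv', 'admission'),
                        ('ICUSTAYS.csv', 'icu'),
                        ('PROCEDUREEVENTS_MV.csv', 'procedure'),
                        ('PRESCRIPTIONS.csv', 'prescription')]:
    pd.read_csv(csv_file).to_sql(table, con, if_exists='replace', index=False)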

Feature Engineering

We identified the procedures and prescriptions with the most occurrences in the data. Then, we created crosstables with the top 8 procedures and top 20 prescriptions in order to limit the number of features that would be included in the model.

import pickle

# identify the 20 most frequently prescribed drug codes
top_twenty = drugdf['formulary_drug_cd'].value_counts().head(20).index.tolist()
prescription_list = ['ACET325', 'CALG1I', 'D5W1000', 'D5W250', 'FURO40I',
       'HEPA5I', 'INSULIN', 'KCL20P', 'KCL20PM', 'KCLBASE2', 'LR1000',
       'MAG2PM', 'METO25', 'MORP2I', 'NACLFLUSH', 'NS1000', 'NS250', 'NS500',
       'VANC1F', 'VANCOBASE']

# save the list so it can be reused during feature engineering
with open("../../crosstables/prescription_list.txt", "wb") as dl:
    pickle.dump(prescription_list, dl)
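
We built the analogous top-8 list for procedures; a sketch mirroring the prescription code (the output file name is an assumption):

# identify the 8 most frequent procedure categories and save them
procedure_list = procdf['ordercategoryname'].value_counts().head(8).index.tolist()
with open("../../crosstables/procedure_list.txt", "wb") as pl:
    pickle.dump(procedure_list, pl)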

Next, we created a function to perform the feature engineering. This function takes the four data frames as inputs, cleans and merges the data, and returns an analytic data frame to be used for modeling. The code below shows an example of how the data frames were merged and how we dealt with missingness (the code is not exhaustive).

Using the lists of top procedures and medications, we subsetted the relevant data frames to those lists and pivoted the data to a wide format, so that procedures and prescriptions could be used as features (columns) in the model. We converted NAs in these columns to 0 (the patient did not have that procedure or medication). Similarly, for our outcome of death, we converted NAs to 0 because we could not confirm that the patient had died. Using domain knowledge of ICU diagnoses with a high likelihood of leading to death (https://pubmed.ncbi.nlm.nih.gov/17083735/), we categorized diagnoses into four groups and labelled the remaining diagnoses as "Other". We also determined the two wards with the most deaths and labelled the remaining wards as "Other".

One limitation: we should have used a similar grouping for admission location and first care unit. We later realized that the full data contained category values that never appeared in the demo data, which forced us to reevaluate all of the features when modeling. A sketch of one possible fix follows.
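
One way to guard against unseen category levels is to dummy-encode new data and then align its columns to those seen during training; a minimal sketch (train_df and new_df are hypothetical data frames):

# dummy-encode the training data and record its columns
train_dummies = pd.get_dummies(train_df)

# dummy-encode new data, adding any missing columns as 0 and dropping unseen ones
new_dummies = pd.get_dummies(new_df).reindex(columns=train_dummies.columns, fill_value=0)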

# merge ICU stays with admissions and clean up the duplicated ID column
icu_admin = pd.merge(icudf, admitdf, how='left', on='hadm_id')
icu_full = (
    icu_admin
    .drop(columns=['subject_id_y'])
    .rename(columns={"subject_id_x": "subject_id"})
)

import numpy as np

# subset to the top procedures and drop rows without an ICU stay ID
procdf = procdf[procdf['ordercategoryname'].isin(procedure_list)].reset_index()
procdf = procdf[procdf['icustay_id'].notna()]
# count each procedure type per ICU stay
myproc_counts = procdf.groupby(['subject_id', 'icustay_id', 'ordercategoryname']).size().reset_index(name='counts')

# pivot to wide format with one column per procedure
myproc_counts_long = myproc_counts.pivot(index=['subject_id', 'icustay_id'],
                                         columns='ordercategoryname',
                                         values='counts').reset_index()
# a missing count means the patient did not have that procedure
myproc_counts_long = myproc_counts_long.replace(np.nan, 0)

# convert NAs in the outcome column to 0 (death not confirmed)
analyticdf['hospital_expire_flag'] = analyticdf['hospital_expire_flag'].fillna(0)

# categorize diagnoses into four groups (based on domain knowledge);
# np.select takes the first matching condition, and na=False treats
# missing diagnoses as non-matches
conditions = [
    analyticdf['diagnosis'].str.contains("congestive heart failure", case=False, na=False),
    analyticdf['diagnosis'].str.contains("sepsis", case=False, na=False),
    analyticdf['diagnosis'].str.contains("seizure|stroke|tia", case=False, na=False),
    analyticdf['diagnosis'].str.contains("acute cholangitis|gi bleed|lung failure|liver failure", case=False, na=False),
    analyticdf['diagnosis'].str.contains("myocardial infarction", case=False, na=False),
]
choices = ["CV Failure", "Sepsis", "CNS Failure", "Organ Failure", "CV Failure"]
analyticdf['diagnosis'] = np.select(conditions, choices, default="Other")
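
Putting these pieces together, the feature-engineering function had roughly the following shape (a condensed sketch with an illustrative function name; the real function also pivots prescriptions and builds the diagnosis and ward labels):

def make_analytic_df(admitdf, icudf, procdf, drugdf):
    # merge ICU stays with admissions
    icu_full = (pd.merge(icudf, admitdf, how='left', on='hadm_id')
                  .drop(columns=['subject_id_y'])
                  .rename(columns={'subject_id_x': 'subject_id'}))
    # count top procedures per ICU stay and pivot to wide format
    proc_wide = (procdf[procdf['ordercategoryname'].isin(procedure_list)]
                 .groupby(['subject_id', 'icustay_id', 'ordercategoryname'])
                 .size().reset_index(name='counts')
                 .pivot(index=['subject_id', 'icustay_id'],
                        columns='ordercategoryname', values='counts')
                 .reset_index().fillna(0))
    # merge features and handle missingness in the outcome
    analyticdf = pd.merge(icu_full, proc_wide, how='left',
                          on=['subject_id', 'icustay_id'])
    analyticdf['hospital_expire_flag'] = analyticdf['hospital_expire_flag'].fillna(0)
    return analyticdf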

Model Building

Now, let's discuss the modeling! The analytic data frame from feature engineering was loaded and categorical variables were converted to dummy variables. Training and test datasets were created using an 80/20 split as shown below. We ran several classification models, including a dummy classifier, logistic regression, decision tree, KNN, SVC, random forest, XGBoost random forest, and CatBoost. We assessed these models using area under the curve (AUC). Results are shown below.

from sklearn.model_selection import train_test_split

# stratified 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1125, stratify=y)

Model                   AUC    STD
DummyClassifier         0.497  0.15
LogisticRegression      0.700  0.22
DecisionTreeClassifier  0.597  0.14
KNeighborsClassifier    0.640  0.21
SVC                     0.714  0.17
RandomForestClassifier  0.762  0.14
XGBRFClassifier         0.738  0.15
CatBoostClassifier      0.749  0.20
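
The AUC values and standard deviations above come from cross-validation on the training data; a sketch of how such a comparison loop might look (model list abbreviated):

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# cross-validated AUC for each candidate model
for model in [DummyClassifier(), LogisticRegression(max_iter=1000), RandomForestClassifier()]:
    scores = cross_val_score(model, X_train, y_train, scoring='roc_auc')
    print(f"{type(model).__name__:<22} AUC:{scores.mean():.3f} STD: {scores.std():.2f}")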

We also performed model stacking to further explore the candidate models and help us choose one. We chose the XGBoost random forest classifier for our prediction model and performed a grid search to determine the best hyperparameters. Finally, we created a confusion matrix, ROC curve, precision-recall curve, and learning curve to assess our model; these plots showed that our XGBoost random forest model performed well. The mean prediction accuracy of the model trained on the demo data was 76%. The code below shows the baseline fit, the grid search, and the resulting best parameters.

import xgboost
from sklearn import model_selection
from sklearn.metrics import roc_auc_score

# baseline XGBoost random forest fit and test-set performance
clf = xgboost.XGBRFClassifier(eval_metric='logloss')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
roc_auc_score(y_test, clf.predict(X_test))
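
For the stacking step, a minimal sketch with scikit-learn's StackingClassifier (the base learners shown are assumptions; we combined several of the models above):

import xgboost
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier()),
                ('xgbrf', xgboost.XGBRFClassifier(eval_metric='logloss'))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
stack.score(X_test, y_test)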

clf_ = xgboost.XGBRFClassifier(eval_metric='logloss')
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0, 0.5, 1, 1.5, 2, 5],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [4, 5, 6, 7],
}

# perform grid search to determine the best parameters
clf = model_selection.GridSearchCV(
    clf_, params, n_jobs=-1,
).fit(X_train, y_train)
clf.score(X_test, y_test)
clf_best = xgboost.XGBRFClassifier(**clf.best_params_)

# clf.best_params_:
# {'min_child_weight': 1, 'gamma': 1, 'colsample_bytree': 1.0, 'max_depth': 7}
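
The evaluation plots mentioned above can be produced directly from the fitted model; a sketch using scikit-learn's display helpers (scikit-learn 1.0+; our actual plotting code may have differed):

from sklearn.metrics import (ConfusionMatrixDisplay, PrecisionRecallDisplay,
                             RocCurveDisplay)

# confusion matrix, ROC curve, and precision-recall curve on the test set
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
RocCurveDisplay.from_estimator(clf, X_test, y_test)
PrecisionRecallDisplay.from_estimator(clf, X_test, y_test)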

Using the best parameters from the demo-data model, we ran the XGBoost random forest classifier on the full data and performed another grid search. Knowing roughly where the optimum lay let us specify a much smaller parameter grid, so the search ran far more efficiently; without that information, it would have taken over 24 hours to complete. The grid search produced the best parameters, which we then used to make predictions. We computed the same diagnostics on the full-data model (confusion matrix, precision-recall curve, ROC curve, and learning curves) and saved the model as a pickle file to be used in our dashboard. The prediction accuracy was 90%, substantially higher than our baseline model.

# narrower grid, centered on the best demo-data parameters
clf_ = xgboost.XGBRFClassifier(eval_metric='logloss')
params = {
    'min_child_weight': [1, 5],
    'gamma': [1, 1.5],
    'colsample_bytree': [0.8, 1.0],
    'max_depth': [4, 7],
}
clf = model_selection.GridSearchCV(
    clf_, params, n_jobs=-1,
).fit(X_train, y_train)
clf.best_params_
clf.score(X_test, y_test)
clf_best = xgboost.XGBRFClassifier(**clf.best_params_)
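
With the final parameters chosen, the tuned model can be refit and pickled for the dashboard; a sketch (the file name is an assumption):

import pickle

# refit on the training data and persist the tuned model for the dashboard
clf_best.fit(X_train, y_train)
with open('best_model.pkl', 'wb') as f:
    pickle.dump(clf_best, f)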

Dashboard Creation

The final step was creating the product. We chose to visualize the data and output predictions from our model through a dashboard. The dashboard was built with Streamlit: it takes user inputs for the different features (characteristics of a patient) and outputs a predicted probability of that patient dying. We integrated interactive exploratory plots of the continuous and categorical variables in our dataset. We also added user inputs for each feature in our model and used them to build a data frame, similar to one row of our original analytic data frame, that we could feed into the model to make a prediction. The dashboard could be useful for a doctor, who could input the characteristics of a patient and see the predicted probability of that patient dying. Below is a sample of the code used in the dashboard.

st.subheader("**II: Predictive Modeling**")
st.markdown("Input patient information below to output risk of ICU death.")
# col5, col6, col7 are columns created with st.columns (not shown)
los_select = col5.number_input('Length of Stay', value=5)
ACE_select = col6.number_input('ACET325', value=0, step=1)
CAL_select = col7.number_input('CALG1I', value=0, step=1)

# build a single-row pandas data frame from the user inputs
# (further inputs such as DW_select and DW2_select are collected the same way)
predat = {'los': [los_select], 'ACET325': [ACE_select], 'CALG1I': [CAL_select],
          'D5W1000': [DW_select], 'D5W250': [DW2_select]}
pred_df = pd.DataFrame(predat)

# the data frame's columns need to match the features of the model;
# set categorical (dummy) features based on the user input
if fw_select == "52":
    pred_df['first_wardid_52'] = 1
elif fw_select == "Other":
    pred_df['first_wardid_Other'] = 1

# convert the pandas data frame to a numpy array
pred_array = pred_df.to_numpy()

# make the prediction: probability of the positive class (death)
prediction = clf.predict_proba(pred_array[:1])[:, 1]
# express the death prediction as a score from 0 to 100
model_score = round(100 * prediction[0], 2)

# output the score
st.subheader("**Patient Death Prediction Score**")
st.metric(label="Model Risk Prediction Score", value=str(model_score) + "%")
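
On the dashboard side, the pickled model is loaded back at startup; a minimal sketch (file path assumed; in practice the load can be wrapped in Streamlit's caching so it runs only once):

import pickle

# load the model trained on the full data
with open('best_model.pkl', 'rb') as f:
    clf = pickle.load(f)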

Individual Contributions

I contributed to various components of this project. I helped with the feature engineering: I explored the data, particularly the missingness, and decided how to deal with those values. For example, I relabelled NAs as "Other" in many of the columns and as 0 in the outcome column. Once the crosstables of prescriptions and procedures had been created, I helped write the function that takes the four data frames of interest, merges and cleans the data, and outputs an analytic data frame. I helped merge the data frames and confirm that the data looked like what we expected, and I determined how to categorize diagnoses.

Once we had the analytic data frame, I ran the baseline modeling on the demo data. This involved processing the data and splitting it into training and test sets before running several classification models. I converted categorical variables into dummy variables, ran the models, and compared the AUC and SD of each. I performed stacking and the grid search to determine the best parameters, which I passed along to be used for the full data, and I created plots of the confusion matrix, ROC curve, precision-recall curve, and learning curves.

I saved the best model and used it to create a skeleton of model predictions in our dashboard until the full data could be used. I matched the user inputs to the model features by creating a data frame from the user inputs with the same order and number of features as the model expects. This was a long process: we didn't want the user inputs on the dashboard to look clunky, so we had to use many if/else statements to build the data frame. I then deployed the best model onto the dashboard, which takes the user inputs and predicts the probability of a patient dying in the ICU. Finally, I helped with publishing the dashboard.

Through completing this project, I learned how to apply machine learning algorithms in Python and improved both my content knowledge of classification models and my Python proficiency in running models and making predictions. It was difficult to prioritize this project alongside many other group projects, but I learned to set goals for each team meeting and chip away at tasks to keep things moving. I learned to draw on the skill sets of my team members, some of whom were good at envisioning the dashboard and its different components while others were comfortable working with big datasets, and to build on my own strengths: thinking through how the analytic data should look and trying different analytic methods. This project was a good culmination of the course, as we got to implement the concepts we had learned and had the freedom to create a data product of our choice. I enjoyed exploring "real world" data, since it felt like we were tackling a very relevant problem, and I enjoyed the problem solving needed to build the analytic data frame and make model predictions from user inputs.