This post explains how to build a dashboard visualization with Streamlit. The dashboard presents data about doctorate recipients, including demographic information, field of study, and postgraduation plans. The data come from the National Center for Science and Engineering Statistics (NCSES), and the datasets can be found here: https://ncses.nsf.gov/pubs/nsf19301/data. Analyses were performed with the pandas library, and visualizations were created with the matplotlib and plotly libraries along with Streamlit's widget features.

View dashboard visualization: https://share.streamlit.io/saahithirao/bios-823-blog/hw4.py

View code: https://github.com/saahithirao/bios-823-blog/blob/master/hw4.py

To run the dashboard locally, download the script and run: streamlit run hw4.py

Doctorate recipients by gender & race from 2008-2017

This visualization displays an interactive data table of doctorate recipients by gender and race from 2008 to 2017. The user can pick a year in the sidebar to display data for that year and can choose to view the data by gender or race. No single dataset contained all of this information, so I combined two: one with data on females and one with data on males. I extracted the necessary rows from each, as shown below for the female table, and merged the two dataframes. Then, using Streamlit's widget features, I created a sidebar that lets the user select a specific year and filter the data by gender and race to explore it further and make comparisons. Code for this is shown below and linked above.

import pandas as pd

# Load the NCSES table; header=3 takes the column names from the fourth row of the sheet.
df = pd.read_excel("https://ncses.nsf.gov/pubs/nsf19301/assets/data/tables/sed17-sr-tab021.xlsx", header=3)
df = df.rename(columns={'Ethnicity, race, and citizenship status': 'Race'})

# Drop the citizenship, visa, Hispanic, and ethnicity rows,
# leaving the totals row and the race rows.
df_female = (
    df
    .drop(df[df['Race'].str.contains('citizen')].index)
    .drop(df[df['Race'].str.contains('visa')].index)
    .drop(df[df['Race'].str.contains('Hispanic')].index)
    .drop(df[df['Race'].str.contains('Ethnicity')].index)
    .reset_index(drop=True)
)
df_female["Gender"] = "Female"
df_female
   Race                               2008   2009     2010   2011   2012   2013   2014   2015   2016   2017  Gender
0  All doctorate recipients          22494  23187  22488.0  22700  23527  24366  24816  25354  25256  25495  Female
1  American Indian or Alaska Native     70     78     65.0     79     65     64     55     77     73     62  Female
2  Asian                              4795   4741   4662.0   4904   5117   5392   5465   5598   5521   5605  Female
3  Black or African American          1396   1552   1403.0   1396   1481   1554   1503   1660   1662   1756  Female
4  White                             12704  13058  12525.0  12572  12941  13264  13324  13523  13623  13413  Female
5  More than one race                  298    370    418.0    399    484    497    488    535    605    594  Female
6  Other race or race not reported     222    238    205.0    191    202    204    162    167    193    321  Female
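import streamlit as st

st.sidebar.header("User Input")
# range() excludes its stop value, so the stop must be 2018 to include 2017.
selected_year = st.sidebar.selectbox('Year', list(reversed(range(2008, 2018))))
# load_data is not shown here; see hw4.py (linked above).
phds = load_data(selected_year)
unique_gender = phds.Gender.unique()
# Every gender is selected by default.
select_gender = st.sidebar.multiselect('Gender', unique_gender, unique_gender)
df_selected = phds[phds.Gender.isin(select_gender)]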
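
The `load_data` helper called above is not shown in this post. The following is a minimal sketch of how the two gender tables might be combined and then filtered by year, assuming wide tables with one column per year. The numbers are made-up toy values, not NCSES data, and this `load_data` is a hypothetical stand-in rather than the function in hw4.py.

```python
import pandas as pd

# Toy stand-ins for df_female and df_male (made-up numbers, not NCSES data).
df_female = pd.DataFrame({
    "Race": ["All doctorate recipients", "Asian"],
    2016: [100, 40],
    2017: [110, 45],
    "Gender": "Female",
})
df_male = pd.DataFrame({
    "Race": ["All doctorate recipients", "Asian"],
    2016: [120, 50],
    2017: [125, 55],
    "Gender": "Male",
})

# Stack the two tables; ignore_index gives a clean 0..n-1 index.
df_all = pd.concat([df_female, df_male], ignore_index=True)


def load_data(year):
    """Hypothetical helper: keep Race, Gender, and the requested year column."""
    return df_all[["Race", "Gender", year]].rename(columns={year: "Recipients"})


phds = load_data(2017)
# Same filtering pattern as the sidebar multiselect, with a fixed selection.
selected = phds[phds["Gender"].isin(["Female"])]
print(selected)
```

Whether the year columns are integers or strings depends on how the spreadsheet is parsed, so the lookup key may need adjusting for the real tables.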

Number of doctorate recipients by gender over time

This interactive plot shows the number of doctorate recipients over time, separated by gender. The user can hover over a point to view the data for that year. Males and females follow a similar trend, with a slight decrease in the number of PhDs in 2010 followed by a steady increase until 2015. The gap between males and females, however, does not appear to be narrowing over time, which suggests that a gender disparity in doctorate attainment persists. The code below shows how the plot was created.

import plotly.express as px

# Keep only the totals row, then reshape from wide (one column per year) to long.
df_select = df_female[df_female["Race"] == "All doctorate recipients"]
df2 = df_select.drop(['Gender'], axis=1)
df_long = pd.melt(df2, id_vars=['Race'], var_name='Year', value_name='phds')
df_long['Gender'] = 'Female'  # a scalar is broadcast to every row

# df_male is the male counterpart of df_female (its construction is not shown).
df_select_male = df_male[df_male["Race"] == "All doctorate recipients"]
df3 = df_select_male.drop(['Gender'], axis=1)
df_long2 = pd.melt(df3, id_vars=['Race'], var_name='Year', value_name='phds')
df_long2['Gender'] = 'Male'

df_combine_plot = pd.concat([df_long, df_long2], ignore_index=True)

fig = px.line(df_combine_plot,
              x='Year', y='phds', color='Gender',
              labels={"phds": "Number of PhDs"})
fig.update_traces(mode='markers+lines')  # draw a marker at each year
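
To make the reshaping step concrete, the sketch below applies `pd.melt` to the female totals row from the table shown earlier and computes the year-over-year change. The `change` column is added here for illustration only and is not part of the dashboard code; storing the year labels as strings is also an assumption.

```python
import pandas as pd

# Female "All doctorate recipients" totals, copied from the table above.
years = [str(y) for y in range(2008, 2018)]
totals = [22494, 23187, 22488, 22700, 23527, 24366, 24816, 25354, 25256, 25495]
wide = pd.DataFrame([["All doctorate recipients", *totals]],
                    columns=["Race", *years])

# Wide to long: one row per year instead of one column per year.
long = pd.melt(wide, id_vars=["Race"], var_name="Year", value_name="phds")
long["Gender"] = "Female"

# Difference from the previous year (NaN for the first year).
long["change"] = long["phds"].diff()
print(long[["Year", "phds", "change"]])
```

For the female totals, the only year-over-year decreases are in 2010 and 2016.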

Summary of doctorate recipients across years by gender

This static table shows summary statistics for doctorate recipients by gender, aggregated across all years. It complements the plot above by summarizing the overall number of PhDs awarded to each gender.

summary = pd.DataFrame({'Gender': ['Female','Male'],
            'Min': [df_long['phds'].min(), df_long2['phds'].min()],
            'Mean' : [df_long['phds'].mean(), df_long2['phds'].mean()], 
            'Median': [df_long['phds'].median(), df_long2['phds'].median()],
            'Max': [df_long['phds'].max(), df_long2['phds'].max()]})
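
The same kind of table can also be produced from a single long-format frame with a group-by. This is a sketch of an alternative, not the approach used in the dashboard; the `toy` frame holds made-up values and only mimics the shape of `df_combine_plot`.

```python
import pandas as pd

# Toy long-format data (made-up values) shaped like df_combine_plot.
toy = pd.DataFrame({
    "Gender": ["Female"] * 3 + ["Male"] * 3,
    "phds": [1, 2, 3, 2, 4, 6],
})

# One row per gender, one named column per statistic.
summary = (
    toy.groupby("Gender")["phds"]
       .agg(Min="min", Mean="mean", Median="median", Max="max")
       .reset_index()
)
print(summary)
```

The group-by version scales to any number of groups without repeating the statistics for each one by hand.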

Visualizing all doctorate recipients by field of study in 2017

Turning to another aspect of the data, we now look more closely at doctorates by field of study. This requires a new dataset. I extracted the row for all doctorate recipients and transposed it so that each field of study becomes a row. To visualize the number of doctorates by field of study, I created a simple bar plot. The user can hover over a bar to display information specific to that field. The plot shows which fields awarded the most PhDs. The code is shown below, and the plot is displayed in the dashboard.

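dat = pd.read_excel("https://ncses.nsf.gov/pubs/nsf19301/assets/data/tables/sed17-sr-tab054.xlsx", header=3)
dat_all = dat.iloc[[0]]  # first row: all doctorate recipients

# Transpose so each field becomes a row, then use the first row as the column name.
dat_T = dat_all.T
dat_T = dat_T.rename(columns=dat_T.iloc[0]).reset_index()
dat_T = dat_T.iloc[1:]  # drop the row that held the old label
dat_T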

fig = px.bar(dat_T, x='index', y='All doctorate recipients (number)c',  
             labels={'All doctorate recipients (number)c':'All Doctorates', 'index':'Field of Study'})
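
The transpose step can be hard to follow without seeing the shapes involved. Below is a minimal sketch of an equivalent reshaping using `set_index` before transposing, on a toy one-row frame; the field names and counts are made up, and the axis label `Field of Study` is chosen here for illustration.

```python
import pandas as pd

# Toy one-row table shaped like dat_all (made-up field names and counts).
toy = pd.DataFrame({
    "Characteristic": ["All doctorate recipients (number)"],
    "Life sciences": [30],
    "Engineering": [20],
})

# Move the label into the index first so that, after transposing,
# it becomes the column name and each field becomes a row.
tidy = (
    toy.set_index("Characteristic")
       .T
       .rename_axis("Field of Study")
       .reset_index()
)
tidy.columns.name = None  # drop the leftover "Characteristic" axis label
print(tidy)
```

Setting the index before transposing avoids having to promote and then discard a header row afterwards.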

Visualizing field of study by gender in 2017

The visualizations above broke down the number of doctorate recipients by gender. Here we look at doctorate recipients by both gender and field of study. To build this stacked bar plot, I transposed the data so that each field of study is a row, then created separate bars for males and females, with the female bar stacked on top of the male bar. The user can hover over a bar to view the data for a specific field and gender. The plot shows which fields had a lower female-to-male ratio, and vice versa. The code is shown below, and the figure is displayed in the dashboard linked above.

from plotly import graph_objects as go
# 'ale' matches both the 'Male' and 'Female' rows.
to_plot = dat[dat['Characteristic'].str.contains('ale')]

plot = to_plot.T
plot2 = (
            plot.
            rename(columns=plot.iloc[0]).
            drop(plot.index[0]).
            reset_index().
            drop(columns=['Female doctorate recipients (number)', 'Male doctorate recipients (number)'])
)

fig2 = go.Figure(
    data=[
        go.Bar(
            name="Male",
            x=plot2["index"],
            y=plot2["Male"],
            offsetgroup=1,
        ),
        go.Bar(
            name="Female",
            x=plot2["index"],
            y=plot2["Female"],
            offsetgroup=1,
            base=plot2["Male"],
            hovertext=[f'Percent: {val}' for val in plot2["Female"]]
        )
    ],
    layout=go.Layout(
        title="Percent of Doctorate Recipients by Broad Field of Study and Gender",
        yaxis_title="Percent"
    )
)