This blog post explains how I built a dashboard visualization with Streamlit. The dashboard presents data about doctorate recipients: demographic information, field of study, and postgraduation plans. The data come from the National Center for Science and Engineering Statistics (NCSES), and the datasets can be found here: https://ncses.nsf.gov/pubs/nsf19301/data. Analyses were performed with the pandas library, and visualizations were created with the matplotlib and plotly libraries along with Streamlit's widget features.

View dashboard visualization: https://share.streamlit.io/saahithirao/bios-823-blog/hw4.py

View code: https://github.com/saahithirao/bios-823-blog/blob/master/hw4.py

The code can be downloaded to a personal machine and run from the command line with: streamlit run hw4.py

Doctorate recipients by gender & race from 2008-2017
This visualization displays an interactive data table of doctorate recipients by gender and race from 2008 to 2017. The user can pick a year in the sidebar to display data for that year and can choose to view the data by gender or race. Since no single dataset contained all of this information, I combined two datasets: one with data on females and one with data on males. I extracted the necessary information from each, as shown below for the female table, and merged the two dataframes. Then, using Streamlit's widget features, I created a sidebar that lets the user select a year and filter the data by gender and race to explore it further and make comparisons. Code for this is shown below and linked above.

import pandas as pd

# Read the female table; header=3 means the column names are on the fourth row of the sheet
df = pd.read_excel("https://ncses.nsf.gov/pubs/nsf19301/assets/data/tables/sed17-sr-tab021.xlsx", header=3)
df = df.rename(columns={'Ethnicity, race, and citizenship status':'Race'})

# Drop the citizenship, visa, and ethnicity rows so that only the race rows remain
df_female = (
    df[~df['Race'].str.contains('citizen|visa|Hispanic|Ethnicity')]
    .reset_index(drop=True)
)
df_female["Gender"] = "Female"
df_female
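
The merge step described above is not shown in this excerpt. A minimal sketch of one way to do it, assuming the male table has been cleaned into a frame with the same columns, is to stack the two frames with pd.concat. The female values below are copied from the table in this post; the male values are invented placeholders used only to make the sketch runnable.

```python
import pandas as pd

# Two rows and two year columns copied from the female table in this post.
df_female = pd.DataFrame({
    "Race": ["All doctorate recipients", "Asian"],
    2008: [22494, 4795],
    2009: [23187, 4741],
    "Gender": "Female",
})

# Placeholder male values, invented for illustration only (not real data).
df_male = pd.DataFrame({
    "Race": ["All doctorate recipients", "Asian"],
    2008: [100, 10],
    2009: [110, 11],
    "Gender": "Male",
})

# Stack the frames row-wise; ignore_index=True renumbers the rows 0..n-1.
df_combined = pd.concat([df_female, df_male], ignore_index=True)
print(df_combined)
```

Because both frames carry a Gender column, the stacked result can be filtered by gender later without any extra bookkeeping.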

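import streamlit as st

st.sidebar.header("User Input")
# range() excludes its stop value, so 2018 is needed to include 2017
selected_year = st.sidebar.selectbox('Year', list(reversed(range(2008, 2018))))
# load_data is not shown in this excerpt; see the linked code
phds = load_data(selected_year)
unique_gender = phds.Gender.unique()
# Every gender is selected by default
select_gender = st.sidebar.multiselect('Gender', unique_gender, unique_gender)
df_selected = phds[phds.Gender.isin(select_gender)]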
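
The load_data helper is not shown above. As a rough sketch of what such a helper might do, the hypothetical version below takes the merged table and returns the Race, Gender, and count columns for one year. Its signature, the `data` parameter, and the "Number of PhDs" column name are my assumptions, and the male counts are invented placeholders; the female counts are copied from the table in this post. The Streamlit widgets are replaced by hard-coded selections so the sketch runs without a dashboard.

```python
import pandas as pd

# Small stand-in for the merged table; the male counts are invented placeholders.
combined = pd.DataFrame({
    "Race": ["All doctorate recipients", "Asian"] * 2,
    2016: [25256, 5521, 200, 20],
    2017: [25495, 5605, 210, 21],
    "Gender": ["Female", "Female", "Male", "Male"],
})

def load_data(year, data=combined):
    """Hypothetical helper: return Race, Gender, and the counts for one year."""
    return data[["Race", "Gender", year]].rename(columns={year: "Number of PhDs"})

# Stand-in for the sidebar's year selection.
phds = load_data(2017)

# Same filter as the sidebar code, with the multiselect result hard-coded.
select_gender = ["Female"]
df_selected = phds[phds.Gender.isin(select_gender)]
print(df_selected)
```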

Number of doctorate recipients by gender over time

This visualization displays an interactive plot of the number of doctorate recipients over time, separated by gender. The user can hover over a point to view the data for that year. The plot shows a similar trend for males and females: a slight decrease in the number of PhDs in 2010, followed by a steady increase until 2015. The gap between males and females, however, does not appear to narrow over time, which suggests a persistent gender disparity in who receives a doctorate. The code below shows how the plot was created.

import plotly.express as px

# Keep the total row and reshape the year columns into long format (one row per year)
df_select = df_female[df_female["Race"] == "All doctorate recipients"]
df2 = df_select.drop(['Gender'], axis=1)
df_long = pd.melt(df2, id_vars=['Race'], var_name='Year', value_name='phds')
df_long['Gender'] = 'Female'  # a scalar is broadcast to every row

# df_male comes from the corresponding male table (see the linked code)
df_select_male = df_male[df_male["Race"] == "All doctorate recipients"]
df3 = df_select_male.drop(['Gender'], axis=1)
df_long2 = pd.melt(df3, id_vars=['Race'], var_name='Year', value_name='phds')
df_long2['Gender'] = 'Male'

df_combine_plot = pd.concat([df_long, df_long2], ignore_index=True)

fig = px.line(df_combine_plot, 
                x='Year', y='phds', color='Gender', 
                labels = {
                    "phds": "Number of PhDs"
                })
fig.update_traces(mode='markers+lines')
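
To make the reshaping step concrete, here is a small sketch of what pd.melt does to one wide row. It uses the first three years of the "All doctorate recipients" row from the female table in this post; the variable names `wide` and `long_df` are mine.

```python
import pandas as pd

# One wide row: a column per year (values copied from the female table).
wide = pd.DataFrame({
    "Race": ["All doctorate recipients"],
    2008: [22494],
    2009: [23187],
    2010: [22488],
})

# melt turns each year column into its own row, keeping Race as an identifier.
long_df = pd.melt(wide, id_vars=["Race"], var_name="Year", value_name="phds")
long_df["Gender"] = "Female"
print(long_df)
```

The long format is what px.line expects: one column for the x axis (Year), one for the y axis (phds), and one to color by (Gender).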

Summary of doctorate recipients across years by gender

This static table shows summary statistics (minimum, mean, median, and maximum) of the yearly number of doctorate recipients, aggregated across all years, for each gender. It complements the plot above by summarizing the overall trend in PhDs awarded by gender.

summary = pd.DataFrame({'Gender': ['Female','Male'],
            'Min': [df_long['phds'].min(), df_long2['phds'].min()],
            'Mean' : [df_long['phds'].mean(), df_long2['phds'].mean()], 
            'Median': [df_long['phds'].median(), df_long2['phds'].median()],
            'Max': [df_long['phds'].max(), df_long2['phds'].max()]})
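
An alternative to building the table by hand is to group the combined long-format frame by gender and let pandas compute the statistics. This is only a sketch of that option, not the approach used in the dashboard. The female counts are the 2008 to 2010 values from the table in this post; the male counts are invented placeholders.

```python
import pandas as pd

# Long-format data: three real female counts, three placeholder male counts.
df_all = pd.DataFrame({
    "Gender": ["Female"] * 3 + ["Male"] * 3,
    "Year": [2008, 2009, 2010] * 2,
    "phds": [22494, 23187, 22488, 10, 20, 30],
})

# One row per gender, one column per statistic.
summary = (
    df_all.groupby("Gender")["phds"]
    .agg(["min", "mean", "median", "max"])
    .reset_index()
)
print(summary)
```

With groupby, adding another group or another statistic needs no extra lines of bookkeeping.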

Visualizing all doctorate recipients by field of study in 2017

Turning to another aspect of the data, we now look more closely at doctorates by field of study. This requires a new dataset. I extracted the row for all doctorate recipients and transposed it so that each field of study becomes a row. To visualize the number of doctorates by field of study, I created a simple bar plot. The user can hover over a bar to display information specific to that field. The plot shows which fields awarded more PhDs. The code is shown below and the plot is displayed in the dashboard.

dat = pd.read_excel("https://ncses.nsf.gov/pubs/nsf19301/assets/data/tables/sed17-sr-tab054.xlsx", header=3)
dat_all = dat.iloc[[0]]

dat_T = dat_all.T
dat_T = dat_T.rename(columns=dat_T.iloc[0]).reset_index()
dat_T = dat_T.iloc[1:]
dat_T

fig = px.bar(dat_T, x='index', y='All doctorate recipients (number)c',  
             labels={'All doctorate recipients (number)c':'All Doctorates', 'index':'Field of Study'})
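
The transpose step can be hard to picture, so here is a small sketch on a toy row. The field names "Field A" and "Field B" and their values are placeholders, not NCSES data, and the rename uses an explicit dict rather than passing the first row directly as the code above does.

```python
import pandas as pd

# A single wide row: one label column plus one column per field (placeholder data).
row = pd.DataFrame({
    "Characteristic": ["All doctorate recipients"],
    "Field A": [30],
    "Field B": [12],
})

t = row.T                                  # old column names become the index
t = t.rename(columns={0: t.iloc[0, 0]})    # name the value column after the row label
t = t.reset_index().iloc[1:]               # move fields into an 'index' column, drop the label row
print(t)
```

The result has one row per field, which is the shape px.bar needs for x='index'.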

Visualizing field of study by gender in 2017

The visualizations above broke down the number of doctorate recipients by gender. Here, we look at doctorate recipients by both gender and field of study. To create this stacked bar plot, I transposed the data and drew separate bars for males and females, with the female bar stacked on top of the male bar. The user can hover over a bar to view the data for a specific field and gender. The plot shows which fields had a lower female-to-male ratio and which had a higher one. The code is shown below and the figure is displayed in the dashboard linked above.

from plotly import graph_objects as go
to_plot = dat[dat['Characteristic'].str.contains('ale')]

plot = to_plot.T
plot2 = (
            plot.
            rename(columns=plot.iloc[0]).
            drop(plot.index[0]).
            reset_index().
            drop(columns=['Female doctorate recipients (number)', 'Male doctorate recipients (number)'])
)

fig2 = go.Figure(
    data=[
        go.Bar(
            name="Male",
            x=plot2["index"],
            y=plot2["Male"],
            offsetgroup=1,
        ),
        go.Bar(
            name="Female",
            x=plot2["index"],
            y=plot2["Female"],
            offsetgroup=1,
            base=plot2["Male"],
            hovertext= [f'Count: {val}' for val in plot2["Female"]]
        )
    ],
    layout=go.Layout(
        title="Percent of Doctorate Recipients by Broad Field of Study and Gender",
        yaxis_title="Percent"
    )
)

Output of df_female (female doctorate recipients by race, 2008-2017):

   Race                              2008   2009   2010     2011   2012   2013   2014   2015   2016   2017   Gender
0  All doctorate recipients          22494  23187  22488.0  22700  23527  24366  24816  25354  25256  25495  Female
1  American Indian or Alaska Native  70     78     65.0     79     65     64     55     77     73     62     Female
2  Asian                             4795   4741   4662.0   4904   5117   5392   5465   5598   5521   5605   Female
3  Black or African American         1396   1552   1403.0   1396   1481   1554   1503   1660   1662   1756   Female
4  White                             12704  13058  12525.0  12572  12941  13264  13324  13523  13623  13413  Female
5  More than one race                298    370    418.0    399    484    497    488    535    605    594    Female
6  Other race or race not reported   222    238    205.0    191    202    204    162    167    193    321    Female