Visualizing characteristics of doctorate recipients
The goal of this blog post is to explain the process of creating a dashboard visualization using Streamlit. The dashboard visualizes data about doctorate recipients, including demographic information, field of study, and postgraduation plans. The data come from the National Center for Science and Engineering Statistics (NCSES), and the datasets can be found here: https://ncses.nsf.gov/pubs/nsf19301/data. Analyses were performed using the pandas library, and visualizations were created using the matplotlib and plotly libraries along with Streamlit's widget features.
View dashboard visualization: https://share.streamlit.io/saahithirao/bios-823-blog/hw4.py
View code: https://github.com/saahithirao/bios-823-blog/blob/master/hw4.py This code can be downloaded to a personal machine and run from a terminal with: streamlit run hw4.py
Doctorate recipients by gender & race from 2008-2017
This visualization displays an interactive data table of doctorate recipients by gender and race from 2008 to 2017. The user can click on a year in the sidebar to display data for that year and can choose to view the data by gender or race. Since no single dataset contained all of this information, I combined two datasets: one with data on females and one with data on males. I extracted the necessary information from each, as shown below for the female table, and merged the two dataframes. Then, using Streamlit's widget features, I created a sidebar that lets the user select a specific year and/or filter the data by gender and race to explore the data further and make comparisons. Code for this is shown below and linked above.
import pandas as pd
import streamlit as st
df = pd.read_excel("https://ncses.nsf.gov/pubs/nsf19301/assets/data/tables/sed17-sr-tab021.xlsx", header=3)
df = df.rename(columns={'Ethnicity, race, and citizenship status':'Race'})
df_female = (
    df.
    drop(df[df['Race'].str.contains('citizen')].index.tolist()).
    drop(df[df['Race'].str.contains('visa')].index.tolist()).
    drop(df[df['Race'].str.contains('Hispanic')].index.tolist()).
    drop(df[df['Race'].str.contains('Ethnicity')].index.tolist()).
    reset_index().
    drop(columns = ['index'])
)
df_female["Gender"] = "Female"
df_female
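st.sidebar.header("User Input")
# range() excludes its upper bound, so 2018 is needed to include 2017
selected_year = st.sidebar.selectbox('Year', list(reversed(range(2008, 2018))))
# load_data is not defined in this excerpt; see the linked hw4.py
phds = load_data(selected_year)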
unique_gender = phds.Gender.unique()
select_gender = st.sidebar.multiselect('Gender', unique_gender, unique_gender)
df_selected = phds[(phds.Gender.isin(select_gender))]
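The merge step described above does not appear in the excerpt. A minimal sketch of how the two tables might be stacked and then filtered is given here, using invented counts. The names `female_toy`, `male_toy`, and `load_data_sketch` are hypothetical stand-ins, and the real `load_data` in hw4.py may work differently. The sketch also assumes the year columns are integers, which may not match the spreadsheet.

```python
import pandas as pd

# Toy stand-ins for the cleaned female and male tables (counts are invented).
female_toy = pd.DataFrame({
    "Race": ["All doctorate recipients", "Asian"],
    2016: [100, 10],
    2017: [110, 12],
    "Gender": "Female",
})
male_toy = pd.DataFrame({
    "Race": ["All doctorate recipients", "Asian"],
    2016: [120, 15],
    2017: [125, 14],
    "Gender": "Male",
})

# Stack the two tables so that each row carries its own Gender label.
df_combined = pd.concat([female_toy, male_toy], ignore_index=True)


def load_data_sketch(year):
    """Hypothetical helper: keep only the columns needed for one year."""
    return df_combined[["Race", "Gender", year]]


phds_toy = load_data_sketch(2017)
# Mimic the sidebar multiselect by keeping only the chosen genders.
selected_toy = phds_toy[phds_toy["Gender"].isin(["Female"])]
print(selected_toy)
```

In the dashboard, the year and the list of genders would come from the sidebar widgets rather than being hard-coded.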
Number of doctorate recipients by gender over time
This visualization displays an interactive plot of the number of doctorate recipients over time, separated by gender. The user can hover over points to view data for that year. The plot shows a similar trend over time for males and females, with a slight decrease in the number of PhDs in 2010 and a steady increase until 2015. The gap between males and females, however, does not seem to be decreasing over time, which suggests that a gender disparity in doctorate attainment persists. The code below shows how the plot was created.
import plotly.express as px
df_select = df_female[df_female["Race"] == "All doctorate recipients"]
df2 = (df_select.drop(['Gender'], axis=1))
df_long = pd.melt(df2,id_vars=['Race'],var_name='Year', value_name='phds')
df_long['Gender'] = 'Female'  # a scalar is broadcast to every row, however many years there are
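# df_male is not built in this excerpt; it is assumed to be prepared from the male table the same way as df_female
df_select_male = df_male[df_male["Race"] == "All doctorate recipients"]
df3 = (df_select_male.drop(['Gender'], axis=1))
df_long2 = pd.melt(df3,id_vars=['Race'],var_name='Year', value_name='phds')
df_long2['Gender'] = 'Male'  # a scalar is broadcast to every row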
df_combine_plot = pd.concat([df_long, df_long2], ignore_index=True)
fig = px.line(df_combine_plot,
              x='Year', y='phds', color='Gender',
              labels = {
                  "phds": "Number of PhDs"
              })
fig.update_traces(mode='markers+lines')
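The reshaping step is the part of this code that is easiest to misread, so here is a small sketch of what `pd.melt` does to a wide table with one column per year. The counts are invented, and the integer year columns are an assumption about how the spreadsheet is parsed.

```python
import pandas as pd

# Toy wide table: one row, one column per year (counts are invented).
wide = pd.DataFrame({
    "Race": ["All doctorate recipients"],
    2015: [100],
    2016: [105],
    2017: [110],
})

# melt turns each year column into its own row: one (Year, phds) pair per row.
long_df = pd.melt(wide, id_vars=["Race"], var_name="Year", value_name="phds")
long_df["Gender"] = "Female"  # scalar label applied to every row
print(long_df)
```

The long format is what `px.line` expects when `color` is used to split the series: one row per point, with the grouping variable in its own column.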
Summary of doctorate recipients across years by gender
This static table displays summary statistics for doctorate recipients by gender, aggregated across years. It complements the plot above by summarizing the overall trends in PhDs awarded by gender.
summary = pd.DataFrame({'Gender': ['Female','Male'],
                        'Min': [df_long['phds'].min(), df_long2['phds'].min()],
                        'Mean' : [df_long['phds'].mean(), df_long2['phds'].mean()],
                        'Median': [df_long['phds'].median(), df_long2['phds'].median()],
                        'Max': [df_long['phds'].max(), df_long2['phds'].max()]})
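An alternative way to build the same kind of table, sketched here as an option rather than as the approach used in the dashboard, is to group the combined long-format data by gender and aggregate once. The data frame `toy_long` and its counts are invented for illustration.

```python
import pandas as pd

# Toy long-format data, shaped like the combined plot data (counts are invented).
toy_long = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "phds": [100, 110, 120, 130, 150, 140],
})

# Named aggregation: each keyword becomes an output column.
summary_sketch = (
    toy_long.groupby("Gender")["phds"]
    .agg(Min="min", Mean="mean", Median="median", Max="max")
    .reset_index()
)
print(summary_sketch)
```

This scales to more than two groups without adding another pair of expressions per statistic.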
Visualizing all doctorate recipients by field of study in 2017
Turning to another aspect of the data, we will now look more closely at doctorates by field of study. This requires a new dataset. I transposed the data and extracted the row for all doctorate recipients. To visualize the number of doctorates by field of study, I created a simple bar plot. The user can hover over bars to display information specific to that field of study. This plot illustrates which fields awarded more PhDs. The code is shown below and the plot is displayed in the dashboard.
dat = pd.read_excel("https://ncses.nsf.gov/pubs/nsf19301/assets/data/tables/sed17-sr-tab054.xlsx", header=3)
dat_all = dat.iloc[[0]]
dat_T = dat_all.T
dat_T = dat_T.rename(columns=dat_T.iloc[0]).reset_index()
dat_T = dat_T.iloc[1:]
dat_T
fig = px.bar(dat_T, x='index', y='All doctorate recipients (number)c',
             labels={'All doctorate recipients (number)c':'All Doctorates', 'index':'Field of Study'})
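The transposition above can be hard to follow without seeing the table, so here is a small sketch of the same idea on a toy frame. The column name `Characteristic` comes from the later code in this post. The field names and numbers are invented, and the sketch uses `set_index` instead of the `rename`/`iloc` steps above, so it illustrates the idea without reproducing the original code.

```python
import pandas as pd

# Toy table shaped like the source: a label column, then one column per field (numbers invented).
toy_fields = pd.DataFrame({
    "Characteristic": ["All doctorate recipients (number)", "Male doctorate recipients (number)"],
    "Life sciences": [200, 90],
    "Engineering": [150, 110],
})

# Make the label column the index, keep the first row, and transpose so fields become rows.
by_field = (
    toy_fields.set_index("Characteristic")
    .iloc[[0]]
    .T
    .rename_axis("Field of Study")
    .reset_index()
)
by_field.columns.name = None  # drop the leftover "Characteristic" axis label
print(by_field)
```

After this step, each field of study is a row, which is the shape a bar plot needs for its x and y arguments.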
Visualizing field of study by gender in 2017
In the visualizations above, we saw the number of doctorate recipients broken down by gender. Here, we will look at doctorate recipients by both gender and field of study. To create this stacked bar plot, I transposed the data into a wide format and drew separate bars for males and females. The user can hover over the bars to view data for a specific field by gender. This plot illustrates which fields had a lower female-to-male ratio and which had a higher one. The code is shown below and the figure is displayed in the dashboard linked above.
from plotly import graph_objects as go
to_plot = dat[dat['Characteristic'].str.contains('ale')]
plot = to_plot.T
plot2 = (
    plot.
    rename(columns=plot.iloc[0]).
    drop(plot.index[0]).
    reset_index().
    drop(columns=['Female doctorate recipients (number)', 'Male doctorate recipients (number)'])
)
fig2 = go.Figure(
    data=[
        go.Bar(
            name="Male",
            x=plot2["index"],
            y=plot2["Male"],
            offsetgroup=1,
        ),
        go.Bar(
            name="Female",
            x=plot2["index"],
            y=plot2["Female"],
            offsetgroup=1,
            base=plot2["Male"],
            hovertext= [f'Count: {val}' for val in plot2["Female"]]
        )
    ],
    layout=go.Layout(
        title="Percent of Doctorate Recipients by Broad Field of Study and Gender",
        yaxis_title="Percent"
    )
)
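The female-to-male comparison that the plot conveys visually can also be computed directly. The following is a minimal sketch on invented percentages. The column names `index`, `Male`, and `Female` mirror those used in `plot2`, while `toy_pct`, `female_to_male`, and `ranked` are hypothetical names introduced here.

```python
import pandas as pd

# Toy percentages by field (invented), shaped like plot2.
toy_pct = pd.DataFrame({
    "index": ["Life sciences", "Engineering", "Education"],
    "Male": [45.0, 75.0, 30.0],
    "Female": [55.0, 25.0, 70.0],
})

# Ratio above 1 means more female than male recipients in that field.
toy_pct["female_to_male"] = toy_pct["Female"] / toy_pct["Male"]

# Sort so the fields with the lowest female-to-male ratio come first.
ranked = toy_pct.sort_values("female_to_male").reset_index(drop=True)
print(ranked)
```

A table like this could sit next to the stacked bar plot for readers who want exact values instead of relative bar heights.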