Creating Effective Visualizations about Malaria
The goal of this blog post is to create and explain three effective visualizations of malaria incidence and deaths, using the datasets from the GitHub repository linked above. The seaborn and plotly libraries are used to create static and interactive plots, respectively.
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import plotly.offline as offline
import seaborn as sns
The first step is to read in the data and explore the observations and variables in each dataset. The first thing I noticed was that most rows represent countries, but every dataset also contains rows that are not countries (they have no country code). I thought these observations would be interesting to explore because they aggregate the data, by world region or income level for example, in a way that is well suited to visualization. I also renamed some columns for ease of use later on.
url_inc = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_inc.csv'
url_deaths = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_deaths.csv"
url_deaths_age = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_deaths_age.csv"
df_inc = pd.read_csv(url_inc)
df_deaths = pd.read_csv(url_deaths)
df_deaths_age = pd.read_csv(url_deaths_age)
df_deaths = df_deaths.rename(columns={
    'Deaths - Malaria - Sex: Both - Age: Age-standardized (Rate) (per 100,000 people)': 'Deaths'})
df_inc = df_inc.rename(columns={
    "Incidence of malaria (per 1,000 population at risk) (per 1,000 population at risk)": "Incidence"})
# rows without a country code are aggregates rather than countries
df1 = df_inc[df_inc.Code.isnull()].groupby("Entity")
df1.head()
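As a minimal sketch of the filter used above, the following builds a small made-up table (the values are invented for illustration and are not taken from the real dataset) and shows how a missing `Code` separates aggregate rows from country rows. The names `toy`, `aggregates`, and `aggregate_names` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_inc; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "Entity": ["Kenya", "Kenya", "South Asia", "South Asia", "Low income"],
    "Code": ["KEN", "KEN", None, None, None],
    "Year": [2000, 2005, 2000, 2005, 2000],
    "Incidence": [200.0, 180.0, 50.0, 40.0, 300.0],
})

# Rows with no country code are aggregates (regions, income groups).
aggregates = toy[toy["Code"].isnull()]

# Listing the distinct names makes it easy to pick groups to plot.
aggregate_names = sorted(aggregates["Entity"].unique())
print(aggregate_names)
```

Running the same two lines against `df_inc` should list the region and income-group names that the plots below select from.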
The first plot explores malaria incidence by region and by income level. I chose seaborn because I wanted two customizable, side-by-side static plots, and seaborn handles this well and works directly with pandas dataframes.
This plot illustrates that Sub-Saharan Africa consistently has the highest malaria incidence, but that its incidence is decreasing over time at a much faster rate than in the other world regions. Comparing the region plot to the income plot, we see that lower income levels have higher incidence rates. Also, East Asia and Pacific and Latin America and the Caribbean have incidence rates similar to those of the highest income group shown in the income plot.
region = ["East Asia & Pacific", "South Asia", "Sub-Saharan Africa", "Latin America & Caribbean"]
income = ["Low income", "Middle income", "Lower middle income", "Upper middle income"]
df_region = df_inc[df_inc.Entity.isin(region)]
df_income = df_inc[df_inc.Entity.isin(income)]
fig, ax = plt.subplots(1, 2, figsize=(17, 5), sharey=True)
# left panel: incidence by world region
g = sns.lineplot(
    data=df_region,
    x='Year',
    y='Incidence',
    hue='Entity',
    ax=ax[0]
)
g.set(xlabel="Year", ylabel="Incidence of Malaria (per 1,000 population at risk)", title='Malaria Incidence by Region')
# right panel: incidence by income level, sharing the y-axis with the left panel
g2 = sns.lineplot(
    data=df_income,
    x='Year',
    y='Incidence',
    hue='Entity',
    ax=ax[1],
    palette="Blues_r"
)
g2.set(xlabel="Year", ylabel="", title='Malaria Incidence by Income Level of Countries')
plt.show()
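The claim that one region is falling faster than another can also be checked numerically. The sketch below is one way to do that, assuming a table shaped like `df_region`. The region names and values here are invented for illustration, and `toy_region`, `wide`, and `change` are hypothetical names.

```python
import pandas as pd

# Invented values for illustration; substitute df_region from the post.
toy_region = pd.DataFrame({
    "Entity": ["Region A", "Region A", "Region B", "Region B"],
    "Year": [2000, 2015, 2000, 2015],
    "Incidence": [400.0, 200.0, 20.0, 15.0],
})

# Put years in columns so the first and last years can be compared directly.
wide = toy_region.pivot(index="Entity", columns="Year", values="Incidence")
first_year, last_year = wide.columns.min(), wide.columns.max()

# Absolute change (cases per 1,000) and relative change (percent).
change = pd.DataFrame({
    "absolute_change": wide[last_year] - wide[first_year],
    "percent_change": (wide[last_year] - wide[first_year]) / wide[first_year] * 100,
})
print(change)
```

Reporting both columns is a deliberate choice: a region with a high starting incidence can show the largest absolute drop even when its percentage drop is similar to that of other regions.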
After seeing in the plot above that Sub-Saharan Africa has the highest malaria incidence of any world region, I wanted to explore incidence within Africa in more detail.
I thought a map would be an effective way to visualize incidence rates in Africa. The first step was to assign each country to a continent. Since the dataset does not include continents, I used a lookup table of country codes and continent values and merged it with the incidence dataset on the country code. Then I subset the data to Africa to view the change in incidence over time among African countries.
I chose the plotly library because I wanted an interactive map, and plotly works well for both maps and interactive visualizations.
The plot is built from plotly choropleth traces. First, I created a subset of the data for each year, defined a choropleth trace for it, and appended the trace to a list. Next, I created the slider, with one step per year that controls which trace is visible. I referenced this blog post to create the slider object: https://medium.com/@anikanacey/adventures-in-plotly-an-interactive-choropleth-map-646f6a2f4e3a. Lastly, I defined the layout (a map of Africa with country borders) on which the data is plotted and used plotly's offline module to render the figure. The plot is interactive and shows the data for each country on hover.
This map illustrates the change in malaria incidence over time in Africa, with the dark oranges showing a shift in malaria cases across the continent from East Africa to West Africa. Note: if the map does not appear, refresh the webpage.
url2 = "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
df_cont = pd.read_csv(url2)
df_cont = df_cont[["name", "alpha-3", "region"]]
# rename column for merge
df_cont = df_cont.rename(columns = {'alpha-3': 'Code'})
df_map = (df_inc[df_inc.Code.notnull()])
# merge data sets and subset to Africa
df_africa = pd.merge(df_map, df_cont, how="left", on="Code")
df_africa = df_africa.loc[df_africa['region']== "Africa"]
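A left merge silently leaves `region` empty for any code that is missing from the lookup table, so it can be worth checking for unmatched rows before filtering. The sketch below shows one way to do this with the `indicator` option of `pd.merge`. The tables are invented for illustration, and "ZZZ" is a hypothetical code chosen so that it has no match.

```python
import pandas as pd

# Invented rows for illustration; "ZZZ" is a hypothetical unmatched code.
toy_inc = pd.DataFrame({
    "Entity": ["Kenya", "Nigeria", "Somewhere"],
    "Code": ["KEN", "NGA", "ZZZ"],
    "Incidence": [100.0, 300.0, 5.0],
})
toy_lookup = pd.DataFrame({
    "name": ["Kenya", "Nigeria"],
    "Code": ["KEN", "NGA"],
    "region": ["Africa", "Africa"],
})

# indicator=True adds a "_merge" column recording where each row was found.
merged = pd.merge(toy_inc, toy_lookup, how="left", on="Code", indicator=True)

# "left_only" rows exist in the incidence data but not in the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Entity"].tolist()
print(unmatched)
```

If the list is not empty, those countries would be dropped by the `region == "Africa"` filter regardless of where they are located.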
slider_data = []
year_steps = []
years = sorted(df_africa['Year'].unique())
for i, year in enumerate(years):
    # create dataset for each year
    df_year = df_africa.loc[df_africa['Year'] == year]
    one_year = dict(
        type='choropleth',
        locations=df_year['name'],
        z=df_year['Incidence'].astype(float),
        locationmode='country names',
        colorscale="oranges",
        # show only the first year until the slider is moved
        visible=(i == 0),
        colorbar={'title': 'Malaria Incidence (per 1,000 population at risk)'})
    slider_data.append(one_year)
for i, year in enumerate(years):
    # create one slider step per year; each step shows only that year's trace
    step = dict(method='restyle',
                args=['visible', [False] * len(slider_data)],
                # label the step with the actual year instead of assuming 5-year spacing
                label=str(year))
    step['args'][1][i] = True
    year_steps.append(step)
sliders = [dict(active=0, pad={"t": 1}, steps=year_steps)]
layout = dict(title='Incidence of Malaria in Africa from 2000 - 2015',
              geo=dict(scope='africa', showcountries=True, projection={'type': 'equirectangular'}),
              sliders=sliders)
fig_africa = dict(data=slider_data, layout=layout)
offline.iplot(fig_africa)
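The slider logic is easier to follow in isolation. The sketch below uses plain Python (no plotly import) to build the same kind of step dictionaries: each step carries a list of booleans, one per trace, with only its own position set to `True`. The helper name `build_steps` is hypothetical, and the year list assumes the four five-year snapshots implied by the plot title.

```python
def build_steps(labels):
    """Return one 'restyle' step per label; step i shows only trace i."""
    steps = []
    for i, label in enumerate(labels):
        # Start with every trace hidden, then reveal the one for this step.
        visible = [False] * len(labels)
        visible[i] = True
        steps.append({
            "method": "restyle",
            "args": ["visible", visible],
            "label": str(label),
        })
    return steps


steps = build_steps([2000, 2005, 2010, 2015])
for step in steps:
    print(step["label"], step["args"][1])
```

Because each step gets its own list, changing one step's visibility mask cannot affect another step, and the labels come from the data rather than from a formula.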
The third and final plot illustrates deaths from malaria by continent over time. After looking at incidence, I wanted to visualize deaths from malaria across the whole world and compare the result with the previous two plots.
I once again used the lookup table of country codes and continents, this time to group the deaths dataset by continent. I first created a scatter plot, but it was difficult to see trends over time, so I opted for a line plot. I used a filled area line plot in plotly to distinguish continents that have many deaths from malaria (large area) from continents that do not. I also added labels to each line, so when you hover over a line, you can see the data for that specific country.
This plot illustrates the change in deaths from malaria over time. The exact numbers on the y-axis are not particularly relevant, but the area each continent covers on the plot is informative. Africa covers the largest area, showing that it has the highest number of deaths, which aligns with the incidence rates we saw in Africa earlier. Each line in the plot represents one country. Asia and the Americas cover very little area, and Oceania begins with a large area that shrinks to very little over time, which shows that deaths in these continents are few in comparison to Africa.
df2_deaths = df_deaths[df_deaths.Code.notnull()]
# merging continents and deaths dataset
df_combine = pd.merge(df2_deaths, df_cont, how="left", on="Code")
df_combine = df_combine.dropna()
fig_cont = px.area(df_combine,
                   x="Year", y="Deaths",
                   color="region", line_group="name",
                   labels={
                       "region": "Continent",
                       "name": "Country",
                       "Deaths": "Malaria Deaths (per 100,000 people)"
                   },
                   title="Deaths from Malaria by Continent over Time (as measured by area on plot)")
# adjusting figure size
fig_cont.update_layout(
    autosize=False,
    width=900,
    height=700)
offline.iplot(fig_cont)
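Because the area plot stacks one band per country, a continent's total area depends partly on how many countries it contains. One possible complement, sketched below under that assumption, is to collapse the data to a single value per continent and year before plotting. The rows are invented for illustration, and the unweighted mean of country rates is an assumption made here, not the method used in this post; it does not account for population size.

```python
import pandas as pd

# Invented rows for illustration; substitute df_combine from the post.
toy_deaths = pd.DataFrame({
    "name": ["Country A", "Country B", "Country C",
             "Country A", "Country B", "Country C"],
    "region": ["Africa", "Africa", "Asia", "Africa", "Africa", "Asia"],
    "Year": [1990, 1990, 1990, 2016, 2016, 2016],
    "Deaths": [100.0, 60.0, 4.0, 50.0, 30.0, 2.0],
})

# One row per continent and year: the unweighted mean of the country rates.
by_continent = (
    toy_deaths.groupby(["region", "Year"], as_index=False)["Deaths"].mean()
)
print(by_continent)
```

The resulting table has one line per continent, so it could be passed to a line or area plot without the `line_group` argument.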