Analysis of Income's Impact on Health

Nicholas Breden

Introduction

Staying healthy is an important aspect of people's lives; however, having the ability to do so is not something everyone can afford. Whether it is the price of a gym membership or the higher cost of healthier foods (source), being healthy is not accessible to everyone. But how much does income actually impact someone's health? The purpose of this project is to look deeper into that question while walking through the data science pipeline: data collection, data processing, exploratory data analysis, machine learning, and conclusions.

Data Collection

The first step in the data science pipeline is data collection, so we need to find a dataset relevant to the topic.

The data used in this project is from the Behavioral Risk Factor Surveillance System (BRFSS), a survey conducted by the CDC every year. It is a telephone survey meant to collect information on health-related behaviors: "BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world" (source). The validity of the data is important to consider at the start of any project, and since this data comes from the CDC, it is reasonable to treat it as reliable.

Initially, 2020's BRFSS data was used; however, analyzing it left a little to be desired. The biggest problem was that the 2020 data does not contain any information on diet. The 2019 data does include questions about diet, so it will be used instead. While it would be nice to use the more recent 2020 data, 2019 is still recent enough and contains more of the information we need. The 2019 data can be retrieved here.

That link leads to a page containing many downloads; the one used in this project is the "2019 BRFSS Data (SAS Transport Format)". Clicking it downloads a zipped version of the data in SAS's transport format, which, once unzipped, can be imported directly into pandas using the pd.read_sas() function. The same page also links to the codebook for the data (here), which was used to figure out the meaning of each variable in the data.
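For reproducibility, the download and unzip steps can also be scripted with the Python standard library. Below is a minimal sketch; the URL and file names are assumptions based on the layout of the CDC download page and should be verified before running.

In [ ]:
# Sketch: fetch and unzip the 2019 BRFSS SAS transport file.
# NOTE: the URL and file names here are assumptions -- check them against
# the CDC download page before running.
import urllib.request
import zipfile

url = "https://www.cdc.gov/brfss/annual_data/2019/files/LLCP2019XPT.zip"
urllib.request.urlretrieve(url, "LLCP2019XPT.zip")

# Extract the .XPT file into the current directory
with zipfile.ZipFile("LLCP2019XPT.zip") as zf:
    zf.extractall(".")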

Data Cleaning

Data cleaning is the process of transforming the raw data into a more usable format. Here, the raw data from the BRFSS survey contains a lot of information that is not relevant to the topic at hand: specifically, there are 342 variables in the SAS output, which have to be sifted through to keep only what is useful.

To begin this process, we need to import the Python modules that are going to be used throughout this project.

In [ ]:
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandasql as ps

from scipy import stats

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

With those imports out of the way, the next step is to import the data.

In [ ]:
# Read in data from XPT
#   iterator=False and chunksize=None result in the output being a DataFrame
#   encoding="utf-8" keeps the data from being of type bytes in the DataFrame
#   feed it to pd.DataFrame for proper linting
original_data_fragmented = pd.DataFrame(pd.read_sas("LLCP2019.XPT_", format="xport", iterator=False, chunksize=None, encoding="utf-8"))

# Original data is fragmented, copy it here to get a de-fragmented frame
original_data = original_data_fragmented.copy()
original_data.head()
/home/nbreden/.local/lib/python3.8/site-packages/pandas/io/sas/sas_xport.py:475: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  df[x] = v
Out[ ]:
_STATE FMONTH IDATE IMONTH IDAY IYEAR DISPCODE SEQNO _PSU CTELENM1 ... _VEGESU1 _FRTLT1A _VEGLT1A _FRT16A _VEG23A _FRUITE1 _VEGETE1 _FLSHOT7 _PNEUMO3 _AIDTST4
0 1.0 1.0 01182019 01 18 2019 1100.0 2019000001 2.019000e+09 1.0 ... 114.0 1.0 1.0 1.0 1.0 5.397605e-79 5.397605e-79 2.0 1.0 2.0
1 1.0 1.0 01132019 01 13 2019 1100.0 2019000002 2.019000e+09 1.0 ... 121.0 1.0 1.0 1.0 1.0 5.397605e-79 5.397605e-79 1.0 1.0 2.0
2 1.0 1.0 01182019 01 18 2019 1100.0 2019000003 2.019000e+09 1.0 ... 164.0 1.0 1.0 1.0 1.0 5.397605e-79 5.397605e-79 1.0 2.0 2.0
3 1.0 1.0 01182019 01 18 2019 1200.0 2019000004 2.019000e+09 1.0 ... NaN 9.0 9.0 1.0 1.0 1.000000e+00 1.000000e+00 9.0 9.0 NaN
4 1.0 1.0 01042019 01 04 2019 1100.0 2019000005 2.019000e+09 1.0 ... 178.0 1.0 1.0 1.0 1.0 5.397605e-79 5.397605e-79 2.0 1.0 2.0

5 rows × 342 columns

As previously mentioned, there are 342 columns of data here, most of which are not relevant. The description of each column is provided in the 2019 BRFSS codebook (link). From there, the columns of interest for this project can be identified, and the DataFrame can be cut down to only those.

The columns of interest are:

Column Description
_STATE State of residence
INCOME2 Annual household income from all sources (stratified into different categories)
GENHLTH "Would you say that in general your health is"
_BMI5 Body Mass Index (Calculated)
EXERANY2 Whether they have exercised in the past week or not
_SEX Calculated sex variable (accounts for gender identity)
EDUCA "What is the highest grade or year of school you completed?"
PA2MIN_ Calculated amount of physical activity per week (minutes)
_FRUTSU1 Calculated number of fruits consumed per day
_VEGESU1 Calculated number of vegetables consumed per day
In [ ]:
# Columns to keep are keys, renamed columns are the values
columns_to_keep = {
    "_STATE": "state",
    "INCOME2": "income",
    "GENHLTH": "general_health",
    "_BMI5": "bmi",
    "EXERANY2": "any_exercise",
    "_SEX": "sex",
    "EDUCA": "education",
    "PA2MIN_": "physical_activity_per_week",
    "_FRUTSU1": "fruits_per_day",
    "_VEGESU1": "vegetables_per_day"
}

# Get columns from keys of columns_to_keep
data_raw = original_data[list(columns_to_keep.keys())].copy()

# Rename columns from values of columns_to_keep
data_raw.rename(columns=columns_to_keep, inplace=True)

data_raw.head()
Out[ ]:
state income general_health bmi any_exercise sex education physical_activity_per_week fruits_per_day vegetables_per_day
0 1.0 3.0 3.0 2817.0 2.0 2.0 3.0 NaN 200.0 114.0
1 1.0 5.0 4.0 1854.0 1.0 2.0 5.0 778.0 100.0 121.0
2 1.0 7.0 3.0 3162.0 1.0 2.0 6.0 190.0 114.0 164.0
3 1.0 6.0 4.0 2030.0 NaN 2.0 5.0 NaN NaN NaN
4 1.0 99.0 2.0 2148.0 2.0 2.0 5.0 NaN 143.0 178.0

An important part of data cleaning is handling NaN (Not a Number) values. These can have different meanings in different contexts; in this case, a NaN could mean the respondent was not asked a question or did not answer it. Either way, the NaN values should be dropped from the data. While doing this, we ran into a problem: in the column "physical_activity_per_week", values of 0 were not present and were instead recorded as NaN. This is a problem because calling .dropna() on the pandas DataFrame would drop all of those rows, heavily skewing the data, since people with 0 minutes of physical activity would no longer be present. To solve this, before dropping the NaN values we iterate through "any_exercise", and if the response there is no, we set "physical_activity_per_week" to 0 minutes. After that, we can safely drop NaN values, since "physical_activity_per_week" was the only column with this issue.

Dropping NaN values is not the only cleaning this data needs. For the questions asked, refusing to respond or not knowing the answer are valid responses. In these cases the value in the data is not NaN but a special reserved value for each question (which can be looked up in the codebook). These values need to be dropped as well, since they do not carry any information that is interesting for this project.

In [ ]:
# Copy data, drop any rows with NaN values
data_dropped = data_raw.copy()

# Before dropping NaN, any physical_activity_per_week value that is NaN could be 0
# so if any_exercise says NO, set physical activity per week to 0 instead of NaN
data_dropped["physical_activity_per_week"] = data_dropped.apply(lambda row: row["physical_activity_per_week"] if row["any_exercise"] != 2 else 0, axis=1)

# Now we can drop NaN
data_dropped = data_dropped.dropna()

# Drop all rows where the given column equals the given value
def drop_row_val(df: pd.DataFrame, column: str, value):
    return df.drop(df[df[column] == value].index)

# Dict containing values to drop when rows have that value in that column
# Keys => column
# Value => Value in column to drop
values_to_drop = {
    "income": [77, 99],
    "general_health": [7, 9],
    "education": [9]
}

# Go through values_to_drop, dropping the specified values from the columns
for column, values in values_to_drop.items():
    for val in values:
        data_dropped = drop_row_val(data_dropped, column, val)

data_dropped.head()
Out[ ]:
state income general_health bmi any_exercise sex education physical_activity_per_week fruits_per_day vegetables_per_day
0 1.0 3.0 3.0 2817.0 2.0 2.0 3.0 0.0 200.0 114.0
1 1.0 5.0 4.0 1854.0 1.0 2.0 5.0 778.0 100.0 121.0
2 1.0 7.0 3.0 3162.0 1.0 2.0 6.0 190.0 114.0 164.0
6 1.0 7.0 2.0 3298.0 1.0 1.0 6.0 270.0 120.0 134.0
8 1.0 5.0 3.0 2399.0 1.0 1.0 4.0 120.0 200.0 192.0

The data is looking much better now: no NaN values, and values of 0 are present in the "physical_activity_per_week" column. The next step in data cleaning is to make the data more readable and usable. For example, a BMI of 2817 is not a reasonable BMI; that value appears because the raw data has two implied decimal places that are lost when simply reading in the data. "fruits_per_day" and "vegetables_per_day" also have two implied decimal places that should be added back. Some columns hold integers that represent categories or string values, such as "state" = 1 representing Alabama, or "education" = 3 representing "Some highschool" (these values are again retrieved from the codebook). These should be transformed from their integer representation to the actual value.

In [ ]:
# Dictionaries containing conversion from values in CDC data to more 
# readable values.

int_to_state = {
    1: "Alabama",
    2: "Alaska",
    4: "Arizona",
    5: "Arkansas",
    6: "California",
    8: "Colorado",
    9: "Connecticut",
    10: "Delaware",
    11: "District of Columbia",
    12: "Florida",
    13: "Georgia",
    15: "Hawaii",
    16: "Idaho",
    17: "Illinois",
    18: "Indiana",
    19: "Iowa",
    20: "Kansas",
    21: "Kentucky",
    22: "Louisiana",
    23: "Maine",
    24: "Maryland",
    25: "Massachusetts",
    26: "Michigan",
    27: "Minnesota",
    28: "Mississippi",
    29: "Missouri",
    30: "Montana",
    31: "Nebraska",
    32: "Nevada",
    33: "New Hampshire",
    34: "New Jersey",
    35: "New Mexico",
    36: "New York",
    37: "North Carolina",
    38: "North Dakota",
    39: "Ohio",
    40: "Oklahoma",
    41: "Oregon",
    42: "Pennsylvania",
    44: "Rhode Island",
    45: "South Carolina",
    46: "South Dakota",
    47: "Tennessee",
    48: "Texas",
    49: "Utah",
    50: "Vermont",
    51: "Virginia",
    53: "Washington",
    54: "West Virginia",
    55: "Wisconsin",
    56: "Wyoming",
    66: "Guam",
    72: "Puerto Rico"
}

int_to_sex = {
    1: "M",
    2: "F"
}

int_to_education = {
    1: "Kindergarten",
    2: "Elementary",
    3: "Some highschool",
    4: "Highschool",
    5: "Some college / technical school",
    6: "College",
    9: "Refused"
}

int_to_exercise = {
    1: 1,
    2: 0,
}

int_to_income = {
    1: "<= $9,999",
    2: "$10,000 - $14,999",
    3: "$15,000 - $19,999",
    4: "$20,000 - $24,999",
    5: "$25,000 - $34,999",
    6: "$35,000 - $49,999",
    7: "$50,000 - $74,999",
    8: ">= $75,000",
}

int_to_general_health = {
    1: "Excellent",
    2: "Very good",
    3: "Good",
    4: "Fair",
    5: "Poor",
    7: "Don't know / Not Sure",
    9: "Refused"
}

# Copy data so it can be gone back to if needed
data = data_dropped.copy()

# Add in the implied two decimal places for columns that need it
for column in ["bmi", "fruits_per_day", "vegetables_per_day"]:
    data[column] = data[column].apply(lambda x: x/100)

# Replace values in columns from the dictionaries created above
data = data.replace({
    "state": int_to_state, 
    "sex": int_to_sex,
    "education": int_to_education,
    "any_exercise": int_to_exercise,
    "income": int_to_income,
    "general_health": int_to_general_health
    }
)

data.head()
Out[ ]:
state income general_health bmi any_exercise sex education physical_activity_per_week fruits_per_day vegetables_per_day
0 Alabama $15,000 - $19,999 Good 28.17 0.0 F Some highschool 0.0 2.00 1.14
1 Alabama $25,000 - $34,999 Fair 18.54 1.0 F Some college / technical school 778.0 1.00 1.21
2 Alabama $50,000 - $74,999 Good 31.62 1.0 F College 190.0 1.14 1.64
6 Alabama $50,000 - $74,999 Very good 32.98 1.0 M College 270.0 1.20 1.34
8 Alabama $25,000 - $34,999 Good 23.99 1.0 M Highschool 120.0 2.00 1.92

Sometimes in data cleaning it can be useful to add a column derived from another column. Here we are going to add a column containing the weight classification based on BMI, which will be used later in the machine learning section. The BMI-to-weight classifications are defined by the CDC here, and are as follows:

BMI Weight Classification
bmi < 18.5 Underweight
18.5 ≤ bmi < 25 Healthy weight
25 ≤ bmi < 30 Overweight
30 ≤ bmi Obese
In [ ]:
# Define function to convert float bmi to string weight classification
def convert_bmi_to_range(bmi_):
    if bmi_ < 18.5:
        return "underweight"
    elif bmi_ < 25:
        return "healthy"
    elif bmi_ < 30:
        return "overweight"
    else:
        return "obese"

# Create new column of bmi class
data["bmi_class"] = data.apply(lambda row: convert_bmi_to_range(row["bmi"]), axis=1)
data.head()
Out[ ]:
state income general_health bmi any_exercise sex education physical_activity_per_week fruits_per_day vegetables_per_day bmi_class
0 Alabama $15,000 - $19,999 Good 28.17 0.0 F Some highschool 0.0 2.00 1.14 overweight
1 Alabama $25,000 - $34,999 Fair 18.54 1.0 F Some college / technical school 778.0 1.00 1.21 healthy
2 Alabama $50,000 - $74,999 Good 31.62 1.0 F College 190.0 1.14 1.64 obese
6 Alabama $50,000 - $74,999 Very good 32.98 1.0 M College 270.0 1.20 1.34 obese
8 Alabama $25,000 - $34,999 Good 23.99 1.0 M Highschool 120.0 2.00 1.92 healthy

Now that data collection and data cleaning is done, the next step in the data science pipeline is exploratory data analysis.

Exploratory Data Analysis

The purpose of exploratory data analysis is to get an idea of what trends are present in the data, and what should be looked into more. A good way to do this is to plot various variables against one another and see if it looks like there could be a relationship there.

Since this project is interested in the relationship between income and health, we'll first write a function that, given the SELECT clause of a SQL query for the y value (e.g., an average), plots that value against the categories of income. Each call computes the given statistic within each income bracket in the data.

In [ ]:
def plot_by_income(select_query_part: str, title: str, ylabel: str, scale=None):
    # Create ordering for income levels used in graphs
    income_mapping = {level: i for i, level in enumerate(int_to_income.values())}
    curr_data = ps.sqldf(f"""
    SELECT {select_query_part} as "value", income
    FROM data 
    GROUP BY income
    """)
    # Create ordering that will sort rows in ascending order of income
    key = curr_data["income"].map(income_mapping)

    # Sort rows in curr_data using key generated above
    curr_data = curr_data.iloc[key.argsort()]

    # Plot the data
    plt.figure(figsize=(10, 3))
    bars = plt.bar(curr_data["income"], curr_data["value"], width=0.5)
    plt.title(title)
    plt.xlabel("Household Income Levels (USD)")
    plt.ylabel(ylabel)
    
    plt.xticks(rotation=45)
    if scale is not None:
        plt.ylim(scale)
    plt.bar_label(bars, [round(i, 2) for i in curr_data["value"]])
    return bars
In [ ]:
plot_by_income("AVG(bmi)", 
               "Average BMI by Income",
               "BMI",
               scale=[27.5, 30])
Out[ ]:
<BarContainer object of 8 artists>

BMI vs. income does show the trend we expected going into this project: as income increases, BMI decreases. It isn't perfect, with BMI increasing from <= $9,999 to $10,000 - $14,999, but the general trend is definitely there and should be investigated further.

In [ ]:
plot_by_income("AVG(physical_activity_per_week)", 
               "Average Minutes of Physical Activity in Past Week by Income",
               "Physical Activity in Past Week (minutes)",
               scale=[200,500])
Out[ ]:
<BarContainer object of 8 artists>

Average minutes of physical activity vs. income shows a much stronger trend than BMI did. The difference between the lowest and highest income levels is 110.88 extra minutes of physical activity per week, which is a pretty significant amount.

In [ ]:
plot_by_income("AVG(fruits_per_day)", 
               "Fruits Consumed Per Day by Income",
               "Fruits Consumed Per Day",
               scale=[1.2, 1.55])
Out[ ]:
<BarContainer object of 8 artists>

Fruits consumed per day vs. income, while showing a trend, is not that strong. The increase from the lowest to the highest income bracket is 0.1 fruits per day. So, while there may be a relationship here, it might not produce interesting results for this project.

In [ ]:
plot_by_income("AVG(vegetables_per_day)", 
               "Vegetables Consumed Per Day by Income",
               "Vegetables Consumed Per Day",
               scale=[1.5, 2.5])
Out[ ]:
<BarContainer object of 8 artists>

Vegetables consumed per day vs. income has a much larger difference than fruit: the increase from the lowest to the highest income bracket is 0.42 vegetables per day. Even so, that is still not a big difference in absolute terms and might not yield interesting results either.

In all, the results of the exploratory data analysis are positive: there does seem to be a relationship between income and the various health-related variables, though the strength of that relationship varies significantly. We can use machine learning to gain more insight into this relationship, which is the next step in the data science pipeline.

Machine Learning to Provide Analysis

Machine learning can provide analysis in many ways; in this project we will attempt to predict various health factors using income, education, and sex. Education and sex are included to give the models more to go on.

The first step here is to handle the categorical variables that will be used for prediction: income, education, and sex. To do this we will use one-hot encoding, which turns each possible category of the original column into a new column of 1's and 0's indicating whether the row's value is that category. The one-hot encoded columns can then be fed into the machine learning models used later on.

In [ ]:
# List to hold column names of each new column, used in machine learning section
explanatory_columns = []

# List of columns to encode
columns_to_encode = ["income", "education", "sex"]

# List of resulting dataframes from encoding, later combined into one
data_parts = [data]
for col in columns_to_encode:
    # Create one hot encoding of current column
    temp_data = pd.get_dummies(data[col], prefix=col)

    # Store the encoding
    data_parts.append(temp_data)

    # Add column names to column name list
    explanatory_columns += list(temp_data.columns)

# Combine all data back into one data frame
one_hot_data = pd.concat(data_parts, axis=1)

# Drop the columns that were encoded
one_hot_data.drop(["state", "income", "general_health", "education"], axis=1, inplace=True)

one_hot_data.head()
Out[ ]:
bmi any_exercise sex physical_activity_per_week fruits_per_day vegetables_per_day bmi_class income_$10,000 - $14,999 income_$15,000 - $19,999 income_$20,000 - $24,999 ... income_<= $9,999 income_>= $75,000 education_College education_Elementary education_Highschool education_Kindergarten education_Some college / technical school education_Some highschool sex_F sex_M
0 28.17 0.0 F 0.0 2.00 1.14 overweight 0 1 0 ... 0 0 0 0 0 0 0 1 1 0
1 18.54 1.0 F 778.0 1.00 1.21 healthy 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
2 31.62 1.0 F 190.0 1.14 1.64 obese 0 0 0 ... 0 0 1 0 0 0 0 0 1 0
6 32.98 1.0 M 270.0 1.20 1.34 obese 0 0 0 ... 0 0 1 0 0 0 0 0 0 1
8 23.99 1.0 M 120.0 2.00 1.92 healthy 0 0 0 ... 0 0 0 0 1 0 0 0 0 1

5 rows × 23 columns

As you can see, there are now a lot of columns of just 1's and 0's, which is exactly what is needed for the machine learning models. The two types of models we'll use are logistic regression and decision trees. These were chosen because they work well on classification problems with this type of data.

To make this easier to repeat for different response variables, we will define one function to fit a logistic regression model and a separate function to fit a decision tree model. Each takes X and Y values, splits them into training and test partitions, trains the model, and finally reports the accuracy and plots the results. Splitting into training and test data is important because it evaluates the model on data it has not seen during training, which guards against overfitting and gives a more honest view of how well the model generalizes.

In [ ]:
def fit_logistic(X, Y, title):

    # Divide dataset into current training / testing partitions
    x_train, x_test, y_train, y_test = train_test_split(X, Y)

    # Create model and fit to the training data
    model = LogisticRegression()
    model.fit(x_train, y_train)

    # Record the score of the model using the test data
    score = model.score(x_test, y_test)

    # Create a plot with space for two subplots in it, the test data and 
    # the model data
    figure, axis = plt.subplots(1, 2, facecolor="white")

    # Add title for whole plot
    figure.suptitle(title, fontsize=24, x=0.75)

    # Get the frequency of each label in test data
    pie_data = pd.Series(y_test).value_counts()

    # Plot pie chart of test data
    axis[0].pie(pie_data, labels=pie_data.index, autopct='%1.1f%%')
    axis[0].set_title("Test Data", fontsize=18)
    
    # Get the frequency of each label in model's prediction
    pie_data = pd.Series(model.predict(x_test)).value_counts()

    # Plot pie chart of model's prediction
    axis[1].pie(pie_data, labels=pie_data.index, autopct='%1.1f%%')
    axis[1].set_title("Model's Prediction", fontsize=18)
    
    # Add spacing between the two pie charts
    plt.subplots_adjust(right=1.5, top=.75)
    
    # Show subplots, print rounded accuracy
    plt.show()

    print("Accuracy: ", round(score, 4))
    return score
In [ ]:
def fit_decision_tree(X, Y, title):
    # Divide dataset into current training / testing partitions
    x_train, x_test, y_train, y_test = train_test_split(X, Y)

    # Create model and fit to the training data
    model = DecisionTreeClassifier()
    model.fit(x_train, y_train)

    # Record the score of the model using the test data
    score = model.score(x_test, y_test)

    # Create a plot with space for two subplots in it, the test data and 
    # the model data
    figure, axis = plt.subplots(1, 2, facecolor="white")

    # Add title for whole plot
    figure.suptitle(title, fontsize=24, x=0.75)

    # Get the frequency of each label in test data
    pie_data = pd.Series(y_test).value_counts()

    # Plot pie chart of test data
    axis[0].pie(pie_data, labels=pie_data.index, autopct='%1.1f%%')
    axis[0].set_title("Test Data", fontsize=18)
    
    # Get the frequency of each label in model's prediction
    pie_data = pd.Series(model.predict(x_test)).value_counts()

    # Plot pie chart of model's prediction
    axis[1].pie(pie_data, labels=pie_data.index, autopct='%1.1f%%')
    axis[1].set_title("Model's Prediction", fontsize=18)
    
    # Add spacing between the two pie charts
    plt.subplots_adjust(right=1.5, top=.75)
    
    # Show subplots, print rounded accuracy
    plt.show()
    print("Accuracy: ", round(score, 4))
    return score
In [ ]:
def fit_both(X, Y, regression_title, decision_title):
    return fit_logistic(X, Y, regression_title), fit_decision_tree(X, Y, decision_title)

With those functions defined, next we will create a dictionary to hold the results of each model for each response variable. This will be used later to plot the overall results. After that, it is time to do the machine learning.

In [ ]:
# Create dictionary containing results for each type of model and response variable
results = {
    "bmi": {},
    "exercise": {},
    "fruit": {},
    "vegetable": {}
}
In [ ]:
# Predicting BMI
# Scale x values
scaler = StandardScaler().fit(one_hot_data[explanatory_columns])
x = pd.DataFrame(scaler.transform(one_hot_data[explanatory_columns]))
y = np.ravel(data["bmi_class"])
results["bmi"]["logistic_regression"], results["bmi"]["decision_tree"] = \
    fit_both(x, 
             y, 
             "Predicting BMI using Income, Education, Sex - Logistic Regression",
             "Predicting BMI using Income, Education, Sex - Decision Tree")
Accuracy:  0.3999
Accuracy:  0.4009

First up is predicting BMI. The accuracy seems at odds with the pie charts: from the charts we would expect accuracy much higher than 40%. But the charts only show that the overall proportions of weight classifications are similar; individual respondents were still misclassified often, just in ways that happen to produce similar category percentages. A 40% accuracy is not good, meaning that predicting BMI from these variables is weaker than the exploratory data analysis suggested. We'll see similar findings in the next models.
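One way to see the per-respondent disagreement that the pie charts hide is a confusion matrix, which counts respondents for every (actual, predicted) pair. Below is a minimal sketch; it assumes a fitted model and its test split are available as model, x_test, and y_test (in this project those live inside the fitting functions, so they would need to be returned, or the snippet placed inside fit_logistic).

In [ ]:
# Sketch: a confusion matrix makes row-level errors visible even when the
# overall class proportions look similar. Assumes `model`, `x_test`, and
# `y_test` are in scope (e.g., returned from fit_logistic above).
from sklearn.metrics import confusion_matrix

labels = ["underweight", "healthy", "overweight", "obese"]
y_pred = model.predict(x_test)
cm = confusion_matrix(y_test, y_pred, labels=labels)

# Rows are actual classes, columns are predicted classes; the diagonal
# holds the correctly classified respondents.
print(pd.DataFrame(cm, index=labels, columns=labels))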

In [ ]:
# Any exercise
x = one_hot_data[explanatory_columns] 
y = np.ravel(data["any_exercise"])
results["exercise"]["logistic_regression"], results["exercise"]["decision_tree"] = \
    fit_both(x, 
             y, 
             "Predicting Exercise using Income, Education, Sex - Logistic Regression",
             "Predicting Exercise using Income, Education, Sex - Decision Tree")
Accuracy:  0.7414
Accuracy:  0.7457

Predicting whether someone has exercised in the past week went poorly. The 74% accuracy for both logistic regression and the decision tree is simply the result of a binary classification problem in which the model predicts "yes" for almost 100% of inputs: it will then be correct for exactly the fraction of respondents who actually answered "yes".
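A quick way to confirm that suspicion is to compare against a majority-class baseline. Below is a minimal sketch using scikit-learn's DummyClassifier, reusing the x and y from the cell above; if the real models only match this baseline, they have learned little beyond the base rate.

In [ ]:
# Sketch: a classifier that always predicts the most frequent class gives
# the accuracy floor any real model must beat.
from sklearn.dummy import DummyClassifier

x_train, x_test, y_train, y_test = train_test_split(x, y)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(x_train, y_train)

print("Base rate of 'yes':", round(float(np.mean(y_train)), 4))
print("Baseline accuracy: ", round(baseline.score(x_test, y_test), 4))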

In [ ]:
# More than one fruit per day
x = one_hot_data[explanatory_columns] 
y = np.ravel(np.array([1 if i > 1 else 0 for i in data["fruits_per_day"]]))
results["fruit"]["logistic_regression"], results["fruit"]["decision_tree"] = \
    fit_both(x, 
             y, 
             "Predicting Eating a Fruit a Day using Income, Education, Sex - Logistic Regression", 
             "Predicting Eating a Fruit a Day using Income, Education, Sex - Decision Tree")
Accuracy:  0.5473
Accuracy:  0.5485

Predicting eating a fruit a day looks reasonable from the pie charts, but it must suffer from the same problem as the weight classification model, which explains the accuracy being lower than the charts alone would suggest.

In [ ]:
# More than one vegetable per day
x = one_hot_data[explanatory_columns] 
y = np.ravel(np.array([1 if i > 1 else 0 for i in data["vegetables_per_day"]]))
results["vegetable"]["logistic_regression"], results["vegetable"]["decision_tree"] = \
    fit_both(x, 
             y, 
             "Predicting Eating a Vegetable a Day using Income, Education, Sex - Logistic Regression", 
             "Predicting Eating a Vegetable a Day using Income, Education, Sex - Decision Tree")
Accuracy:  0.7993
Accuracy:  0.7971

Predictably, the vegetables-per-day model has the same problem as the fruits-per-day model.

In [ ]:
# Go through each response variable, and each model for that variable and
# plot its accuracy
for response in results.keys():
    for model in results[response]:
        plt.barh([response.capitalize() + " - " + model.replace("_", " ").capitalize()], 1, height=0.5, alpha=0.55, color="red")
        plt.barh([response.capitalize() + " - " + model.replace("_", " ").capitalize()], results[response][model], height=0.5, color="green")
plt.title("Accuracies of: Response Variables - Model")
plt.xlabel("Accuracy (percentage)")
Out[ ]:
Text(0.5, 0, 'Accuracy (percentage)')

Here the accuracy of each model for each response variable is plotted. However, this paints a much brighter picture of the models than reality; as seen from the pie charts above, the models were not that good. This can mean either that the relationship between these variables is not strong enough to build a good model on, or that the models themselves are flawed. If it is the former, there is nothing to do except draw conclusions; if it is the latter, more investigation into the models is warranted to see if they can be improved.
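A first step in that investigation would be cross-validation, which repeats the train/test evaluation over several folds and shows whether the accuracies above are stable or an artifact of one particular split. Below is a minimal sketch for the BMI model, assuming x and y are re-created as in the BMI cell above.

In [ ]:
# Sketch: 5-fold cross-validation of the BMI model. A low mean accuracy
# with a small spread across folds suggests the limitation is the data,
# not an unlucky train/test split.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), x, y, cv=5)
print("Fold accuracies:", np.round(scores, 4))
print("Mean accuracy:  ", round(scores.mean(), 4))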

We will investigate whether there is a relationship by performing a one-way ANOVA test on the variables of interest. The null hypothesis of a one-way ANOVA is that the mean of each sample is the same; if the resulting p-value is small enough (< 0.05), we can reject the null hypothesis and conclude that the group means are not all the same. This lets us see whether there is, statistically speaking, a difference in each variable across income levels.

To use a one-way ANOVA test the following assumptions must be met:

  1. Simple random samples, randomized experiments.
  2. Independent samples
  3. Normal populations
  4. Equal standard deviations

All of these hold, so we can use the one-way ANOVA; it is also worth noting that the one-way ANOVA test is robust to moderate violations of assumptions 3 and 4.
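Assumption 4 can also be spot-checked rather than just asserted. Below is a minimal sketch for BMI: a groupby shows the standard deviation within each income bracket directly, and Levene's test in scipy formally compares the group variances.

In [ ]:
# Sketch: spot-check the equal-standard-deviation assumption for BMI.
# Levene's test has a null hypothesis of equal variances across groups.
groups = [g["bmi"] for _, g in data.groupby("income")]
print(data.groupby("income")["bmi"].std())
print(stats.levene(*groups))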

To begin, we will create a function that performs the one-way ANOVA test, with each sample being the specified column split by income level. The test will then be run for physical activity per week, BMI, and fruits and vegetables consumed per day.

In [ ]:
def f_oneway_income(col):
    return stats.f_oneway(
                          data[data["income"] == "<= $9,999"][col],
                          data[data["income"] == "$10,000 - $14,999"][col],
                          data[data["income"] == "$15,000 - $19,999"][col],
                          data[data["income"] == "$20,000 - $24,999"][col],
                          data[data["income"] == "$25,000 - $34,999"][col],
                          data[data["income"] == "$35,000 - $49,999"][col],
                          data[data["income"] == "$50,000 - $74,999"][col],
                          data[data["income"] == ">= $75,000"][col]
                )
In [ ]:
print("Physical activity per week: \t", f_oneway_income("physical_activity_per_week"))
print("BMI: \t\t\t\t", f_oneway_income("bmi"))
print("Fruits consumed per day: \t", f_oneway_income("fruits_per_day"))
print("Vegetables consumed per day: \t", f_oneway_income("vegetables_per_day"))
Physical activity per week: 	 F_onewayResult(statistic=67.59798591548409, pvalue=5.624305933296415e-98)
BMI: 				 F_onewayResult(statistic=232.0294496946514, pvalue=0.0)
Fruits consumed per day: 	 F_onewayResult(statistic=20.657227476068872, pvalue=5.611956432631034e-28)
Vegetables consumed per day: 	 F_onewayResult(statistic=121.09195644059491, pvalue=1.7958447724739562e-178)

All of the p-values are < 0.05, so we reject the null hypothesis that the mean is the same in each income category for each variable of interest. This means there is a difference across income levels for each of the categories we investigated, even though the trained models above do not necessarily show the same thing. We believe this is not the fault of the models, but rather of what data is available here; this will be discussed more in the interpretation section.
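It is worth noting that ANOVA only says that at least one group mean differs; it does not say which income brackets differ from which. A post-hoc pairwise comparison localizes the differences. Below is a minimal sketch comparing BMI between the lowest and highest brackets with Welch's t-test (with many pairs, a correction such as Bonferroni would be needed).

In [ ]:
# Sketch: pairwise follow-up to the ANOVA. Compares mean BMI between the
# lowest and highest income brackets without assuming equal variances.
low = data[data["income"] == "<= $9,999"]["bmi"]
high = data[data["income"] == ">= $75,000"]["bmi"]
print(stats.ttest_ind(low, high, equal_var=False))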

Interpretation

The final step in the data science pipeline is interpreting the findings and making final conclusions. In the exploratory data analysis step we saw a relationship between income and the other categories of interest (physical activity, BMI, fruits consumed per day, and vegetables consumed per day). Our next step was to see whether we could predict those categories using machine learning models, which overall was not very successful. We believe this is not the fault of the models but of the data at hand: the models are set up correctly to learn the relationship, but they do not produce accurate results. So, while we did find a statistically significant difference in each category's mean across income levels, we believe the factors available in this data are not enough to make accurate predictions for these categories.

If we were to revisit this topic, a more comprehensive dataset should be used to improve the likelihood that the machine learning models would be successful.

Hopefully this has been an interesting and informative walk through the data science pipeline through the lens of income's impact on health.