Regression, Optimization, and Life Expectancy

Regression, Optimization, and Life Expectancy#

In this project we had to find a dataset suitible for linear regression. Through testing, fitting, and optimizing models we needed to Interpret the model and its performance in terms of the application.

# basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# models classes
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn import cluster
from sklearn import svm
# datasets
from sklearn import datasets, linear_model
# model selection tools
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, ShuffleSplit
from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score

Description:#

This Dataset contains statistics from 2000 to 2015 across 193 countries, and contains health and economic data from the WHO and UN. There are 2938 rows with 22 columns with features that include the country and year collected along side metrics that show adult mortality rates, infant deaths, education, various reported disease numbers/metrics, population, and more. In this prediction task Ill be attempting is to determine Life expectancy based off of Adult mortality rate and other features. Ill remove some non contiguous features such as country, year, and status since they are not contiguous. Linear regression is a good choice for targeting Life Expectancy since these features, specifically things like Adult Mortality and Schooling, should intuitively increase or decrease alongside it.

lifeExp_df = pd.read_csv('LifeExpectancyData.csv')
lifeExp_df.head()

	Country	Year	Status	Life expectancy	Adult Mortality	infant deaths	Alcohol	percentage expenditure	Hepatitis B	Measles	...	Polio	Total expenditure	Diphtheria	HIV/AIDS	GDP	Population	thinness 1-19 years	thinness 5-9 years	Income composition of resources	Schooling
0	Afghanistan	2015	Developing	65.0	263.0	62	0.01	71.279624	65.0	1154	...	6.0	8.16	65.0	0.1	584.259210	33736494.0	17.2	17.3	0.479	10.1
1	Afghanistan	2014	Developing	59.9	271.0	64	0.01	73.523582	62.0	492	...	58.0	8.18	62.0	0.1	612.696514	327582.0	17.5	17.5	0.476	10.0
2	Afghanistan	2013	Developing	59.9	268.0	66	0.01	73.219243	64.0	430	...	62.0	8.13	64.0	0.1	631.744976	31731688.0	17.7	17.7	0.470	9.9
3	Afghanistan	2012	Developing	59.5	272.0	69	0.01	78.184215	67.0	2787	...	67.0	8.52	67.0	0.1	669.959000	3696958.0	17.9	18.0	0.463	9.8
4	Afghanistan	2011	Developing	59.2	275.0	71	0.01	7.097109	68.0	3013	...	68.0	7.87	68.0	0.1	63.537231	2978599.0	18.2	18.2	0.454	9.5

5 rows × 22 columns

Cleaning:#

lifeExp_df.columns

Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')

lifeExp_df = lifeExp_df.rename(columns={'Life expectancy ': 'Life expectancy'})

lifeExp_df.shape

(2938, 22)

lifeExp_df['Life expectancy'].describe()

count    2928.000000
mean       69.224932
std         9.523867
min        36.300000
25%        63.100000
50%        72.100000
75%        75.700000
max        89.000000
Name: Life expectancy, dtype: float64

lifeExp_clean = lifeExp_df.drop(columns=['Year','Country', 'Status'])
lifeExp_clean.head(1)

	Life expectancy	Adult Mortality	infant deaths	Alcohol	percentage expenditure	Hepatitis B	Measles	BMI	under-five deaths	Polio	Total expenditure	Diphtheria	HIV/AIDS	GDP	Population	thinness 1-19 years	thinness 5-9 years	Income composition of resources	Schooling
0	65.0	263.0	62	0.01	71.279624	65.0	1154	19.1	83	6.0	8.16	65.0	0.1	584.25921	33736494.0	17.2	17.3	0.479	10.1

lifeExp_clean.shape

(2938, 19)

lifeExp_clean = lifeExp_clean.dropna()
lifeExp_clean.shape

(1649, 19)

Instantiate/fit:#

lifeExp_allFeat_X = lifeExp_clean.drop(columns=['Life expectancy'])

lifeExp_target_y = lifeExp_clean['Life expectancy'] 

X_train, X_test, y_train, y_test = train_test_split(lifeExp_allFeat_X, lifeExp_target_y, test_size=0.25)

lifeExp_reg = linear_model.LinearRegression()
lifeExp_reg.fit(X_train, y_train)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

y_pred = lifeExp_reg.predict(X_test)

plt.scatter(y_test, y_pred, color='blue')  #Pred
plt.scatter(y_test, y_test, color='black')  #Actual 
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")

Text(0, 0.5, 'Predicted Values')

png

r2_score(y_test,y_pred)

0.8373483396212271

mean_abs_error = np.sqrt(mean_squared_error(y_test,y_pred))
mean_abs_error

3.659670213439647

Test on 25% held out test data#

The r2_score shows an 82% accuracy score which is close to 1, showing a very good variance and good fit to the model. The mean_abs_error shows the models predictions tend to be 3.7 years off from the true correct values, this again shows the model is fit well. The line plot above also is a good visual showing the model is finding the general trend of life expectancy correctly.

# draw vertical lines from each data point to its predict value
[plt.plot([yt, yt], [yt, yp], color='red', linewidth=1, markevery=[0], marker ='.')
                 for yp, yt in zip(y_pred,y_test)];

# plot these last so they are visually on top
plt.scatter(y_test, y_pred, color='blue')
plt.scatter(y_test, y_test, color='black')

plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")

Text(0, 0.5, 'Predicted Values')

png

lifeExp_reg.coef_

array([-1.60815216e-02,  1.00404499e-01, -6.42892162e-02,  3.05878087e-04,
       -6.51225345e-03, -4.00758558e-06,  3.01255941e-02, -7.61223293e-02,
        7.78260211e-03,  1.33933106e-01,  1.90298753e-02, -4.57770510e-01,
        2.21243401e-05,  4.21432307e-10, -3.29351695e-02, -4.88844146e-02,
        8.90015376e+00,  8.79985746e-01])

Coefficients and Residuals#

The positive coefficients are showing features that increase the life expectancy while the negatives are doing the opposite. We can also see which features are contributing the most to determining the models predictions by how close they are to zero. I f we match up the index’s to the cols in the clean_df the first number should be Adult Mortality and the second is infant deaths, since infant deaths is a tangible number and not a rate it makes sense it would have a higher effect on predicting Life expectancy since infants would generally be around the age of 1-3.

Repeat the split, train, and test steps 5 times#

r2_cva = cross_val_score(lifeExp_reg, lifeExp_allFeat_X, lifeExp_target_y)
r2_cva

array([0.83532928, 0.77805879, 0.83898485, 0.70770205, 0.78634137])

np.mean(r2_cva)

0.7892832697235778

Using Kfold method we see that there can be a possibility of a lower score, specifiacally we can see a lower score of 70.7%. We have an average around 78%, this is still a high enough score to make me trust the model.

Interpret the model and its performance in terms of the application#

I think in general this model shows there is a trend fairly well, I personally don’t think it would be good enough for real world use due to some of its lower splits being right at 70 and even at its highest being 83%. Sure we can see a trend of life expectancy going up but the information from the graphs and scores don’t really tell us why they go up just only that they can go up. It would be more useful to try linear regression on individual statistics to find specific trends rather than a general one. We can however learn from these statistics that there is a way to increase mortality rate we just now need to dig deeper with a more complex model, as I do believe machine learning can help us with these types of statistics.

Testing on One Feature#

Logically adult mortality seemed linked to life expectancy, where a higher mortality rate would lead to a lower Life expectancy so I wanted to see if the model/data reflected this idea. Looking at lifeExp_reg.coef_ Adult Mortality got a score close to zero around -1 meaning it has lesser effect on the model so I wanted to see why. After Using this one feature It seems like the line like figure shown in the plot below might have to do with its low score prevoiusly. A coefficient of “-0.04908872” shows us as Adult Mortality increases life expectancy decreases. This makes sense, if people are dying sooner than mortality rates should rise. The residuals in the start of the plot show a huge discrepancy, with such a low mortality rate we also have some extremely low life expectancy ages. This could mean some other demographic data was causing a lower age expectancy in specific countries or other historical factors.

lifeExp_X1 = lifeExp_clean['Adult Mortality'].values[:,np.newaxis]

lifeExp_X1.shape

(1649, 1)

lifeExp_y1 = lifeExp_clean['Life expectancy'] 

lifeExp_y1.shape

(1649,)

X1_train, X1_test, y1_train, y1_test = train_test_split(lifeExp_X1, lifeExp_y1, test_size=0.25)

lifeExp_reg1 = linear_model.LinearRegression()
lifeExp_reg1.fit(X1_train, y1_train)

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

lifeExp_reg1.coef_

array([-0.05063989])

y1_pred = lifeExp_reg1.predict(X1_test)

plt.scatter(X1_test, y1_test, color = 'black') #Actual Data
plt.scatter(X1_test, y1_pred, color='blue') #Prediction

plt.xlabel("Adult Mortality")
plt.ylabel("Life expectancy")

Text(0, 0.5, 'Life expectancy')

png

r2_score(y1_test,y1_pred)

0.42099041138443094

mean_abs_error1 = np.sqrt(mean_squared_error(y1_test,y1_pred))
mean_abs_error1

6.983226243958276

Test on 25% held out test data and measure the fit with two metrics and one plot#

The r2 score is low which probably has to do with the large line like figure in the actual data at the start of the plot. The mean_abs_error shows that on average the model’s prediction is about six and a half years off for life expectancy.

# plot line prediction
plt.plot(X1_test, y1_pred, color='blue', linewidth=3);

# draw vertical lines from each data point to its predict value
[plt.plot([x,x],[yp,yt], color='red', linewidth=3, markevery=[0], marker ='^')
                 for x, yp, yt in zip(X1_test, y1_pred,y1_test)];

# plot these last so they are visually on top
plt.scatter(X1_test,y1_test,  color='black');

plt.xlabel("Adult Mortality")
plt.ylabel("Life expectancy")

Text(0, 0.5, 'Life expectancy')

png

Optimizing a more complex regression model#

Instantiate, Fit, and Score#

dt = tree.DecisionTreeRegressor()
dt.fit(X_train, y_train)

DecisionTreeRegressor()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

dt.score(X_train, y_train)

1.0

params_dt = {'criterion' : ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],'max_depth': [2,5,10,20],'min_samples_split': [2,5,10,20], 'min_samples_leaf': [2,5,10,20]}

dt_opt = GridSearchCV(dt,params_dt)

dt_opt.fit(X_train, y_train)

GridSearchCV(estimator=DecisionTreeRegressor(max_depth=5, min_samples_leaf=5,
                                         min_samples_split=10),
         param_grid={&#x27;criterion&#x27;: [&#x27;squared_error&#x27;, &#x27;friedman_mse&#x27;,
                                   &#x27;absolute_error&#x27;, &#x27;poisson&#x27;],
                     &#x27;max_depth&#x27;: [2, 5, 10, 20],
                     &#x27;min_samples_leaf&#x27;: [2, 5, 10, 20],
                     &#x27;min_samples_split&#x27;: [2, 5, 10, 20]})</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class="sk-container" hidden><div class="sk-item sk-dashed-wrapped"><div class="sk-label-container"><div class="sk-label fitted sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-26" type="checkbox" ><label for="sk-estimator-id-26" class="sk-toggleable__label fitted sk-toggleable__label-arrow fitted">&nbsp;&nbsp;GridSearchCV<a class="sk-estimator-doc-link fitted" rel="noreferrer" target="_blank" href="https://scikit-learn.org/1.4/modules/generated/sklearn.model_selection.GridSearchCV.html">?<span>Documentation for GridSearchCV</span></a><span class="sk-estimator-doc-link fitted">i<span>Fitted</span></span></label><div class="sk-toggleable__content fitted"><pre>GridSearchCV(estimator=DecisionTreeRegressor(max_depth=5, min_samples_leaf=5,
                                         min_samples_split=10),
         param_grid={&#x27;criterion&#x27;: [&#x27;squared_error&#x27;, &#x27;friedman_mse&#x27;,
                                   &#x27;absolute_error&#x27;, &#x27;poisson&#x27;],
                     &#x27;max_depth&#x27;: [2, 5, 10, 20],
                     &#x27;min_samples_leaf&#x27;: [2, 5, 10, 20],
                     &#x27;min_samples_split&#x27;: [2, 5, 10, 20]})</pre></div> </div></div><div class="sk-parallel"><div class="sk-parallel-item"><div class="sk-item"><div class="sk-label-container"><div class="sk-label fitted sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-27" type="checkbox" ><label for="sk-estimator-id-27" class="sk-toggleable__label fitted sk-toggleable__label-arrow fitted">estimator: DecisionTreeRegressor</label><div class="sk-toggleable__content fitted"><pre>DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, min_samples_split=10)</pre></div> </div></div><div class="sk-serial"><div class="sk-item"><div class="sk-estimator fitted sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-28" type="checkbox" ><label for="sk-estimator-id-28" class="sk-toggleable__label fitted sk-toggleable__label-arrow fitted">&nbsp;DecisionTreeRegressor<a class="sk-estimator-doc-link fitted" rel="noreferrer" target="_blank" href="https://scikit-learn.org/1.4/modules/generated/sklearn.tree.DecisionTreeRegressor.html">?<span>Documentation for DecisionTreeRegressor</span></a></label><div class="sk-toggleable__content fitted"><pre>DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, min_samples_split=10)</pre></div> </div></div></div></div></div></div></div></div></div>


dt_opt.score(X_train, y_train)


0.9772450825467713


dt_opt.best_params_


{'criterion': 'poisson',
 'max_depth': 20,
 'min_samples_leaf': 5,
 'min_samples_split': 10}

Examining the best fit model vs the Origninal#

The grid search determined poisson max_depth=20, min_samples_leaf=5, min_samples_split=10 while the default would have been squared_error, max_depth=None, min_samples_leaf=1, min_samples_split=2. Doing this has given a proper score where the default was giving a perfect 1 which points to over fitting. Ive changed max_depth to try higher numbers and seems to always pick the larger number. After 70 the it takes to long to compute and times out, the more depth available the more accurate the score will be. It also picks Poisson over squared_error, looking at the time it takes to fit/score poisson takes much less time. When I did a 10 fold on the data it showed almost the same exact score which means its a stable and not over fitting data.

dt_5cv_df = pd.DataFrame(dt_opt.cv_results_)
dt_5cv_df

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_criterion	param_max_depth	param_min_samples_leaf	param_min_samples_split	params	split0_test_score	split1_test_score	split2_test_score	split3_test_score	split4_test_score	mean_test_score	std_test_score	rank_test_score
0	0.005177	0.004539	0.002995	0.004118	squared_error	2	2	2	{'criterion': 'squared_error', 'max_depth': 2,...	0.753203	0.722127	0.759607	0.718842	0.688256	0.728407	0.025812	196
1	0.003815	0.004433	0.003431	0.003555	squared_error	2	2	5	{'criterion': 'squared_error', 'max_depth': 2,...	0.753203	0.722127	0.759607	0.718842	0.688256	0.728407	0.025812	196
2	0.005172	0.003385	0.000000	0.000000	squared_error	2	2	10	{'criterion': 'squared_error', 'max_depth': 2,...	0.753203	0.722127	0.759607	0.718842	0.688256	0.728407	0.025812	223
3	0.004086	0.005005	0.002006	0.004013	squared_error	2	2	20	{'criterion': 'squared_error', 'max_depth': 2,...	0.753203	0.722127	0.759607	0.718842	0.688256	0.728407	0.025812	196
4	0.004460	0.003937	0.001391	0.000695	squared_error	2	5	2	{'criterion': 'squared_error', 'max_depth': 2,...	0.753203	0.722127	0.759607	0.718842	0.688256	0.728407	0.025812	196
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
251	0.009104	0.001611	0.000820	0.001641	poisson	20	10	20	{'criterion': 'poisson', 'max_depth': 20, 'min...	0.919820	0.893255	0.898688	0.912637	0.905190	0.905918	0.009503	39
252	0.008193	0.000048	0.000000	0.000000	poisson	20	20	2	{'criterion': 'poisson', 'max_depth': 20, 'min...	0.913402	0.882487	0.894023	0.912446	0.874443	0.895360	0.015636	92
253	0.007534	0.001742	0.001531	0.001881	poisson	20	20	5	{'criterion': 'poisson', 'max_depth': 20, 'min...	0.913402	0.882487	0.894023	0.912446	0.874443	0.895360	0.015636	94
254	0.008284	0.000119	0.000814	0.001628	poisson	20	20	10	{'criterion': 'poisson', 'max_depth': 20, 'min...	0.913402	0.882487	0.894023	0.912446	0.874443	0.895360	0.015636	92
255	0.007509	0.001728	0.000880	0.001760	poisson	20	20	20	{'criterion': 'poisson', 'max_depth': 20, 'min...	0.913402	0.882487	0.894514	0.912446	0.874443	0.895458	0.015629	90

256 rows × 17 columns

More Folds#

dt_opt_10F = GridSearchCV(dt,params_dt, cv=10)
dt_opt_10F.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=DecisionTreeRegressor(),
         param_grid={&#x27;criterion&#x27;: [&#x27;squared_error&#x27;, &#x27;friedman_mse&#x27;,
                                   &#x27;absolute_error&#x27;, &#x27;poisson&#x27;],
                     &#x27;max_depth&#x27;: [2, 5, 10, 20],
                     &#x27;min_samples_leaf&#x27;: [2, 5, 10, 20],
                     &#x27;min_samples_split&#x27;: [2, 5, 10, 20]})</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class="sk-container" hidden><div class="sk-item sk-dashed-wrapped"><div class="sk-label-container"><div class="sk-label fitted sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-33" type="checkbox" ><label for="sk-estimator-id-33" class="sk-toggleable__label fitted sk-toggleable__label-arrow fitted">&nbsp;&nbsp;GridSearchCV<a class="sk-estimator-doc-link fitted" rel="noreferrer" target="_blank" href="https://scikit-learn.org/1.4/modules/generated/sklearn.model_selection.GridSearchCV.html">?<span>Documentation for GridSearchCV</span></a><span class="sk-estimator-doc-link fitted">i<span>Fitted</span></span></label><div class="sk-toggleable__content fitted"><pre>GridSearchCV(cv=10, estimator=DecisionTreeRegressor(),
         param_grid={&#x27;criterion&#x27;: [&#x27;squared_error&#x27;, &#x27;friedman_mse&#x27;,
                                   &#x27;absolute_error&#x27;, &#x27;poisson&#x27;],
                     &#x27;max_depth&#x27;: [2, 5, 10, 20],
                     &#x27;min_samples_leaf&#x27;: [2, 5, 10, 20],
                     &#x27;min_samples_split&#x27;: [2, 5, 10, 20]})</pre></div> </div></div><div class="sk-parallel"><div class="sk-parallel-item"><div class="sk-item"><div class="sk-label-container"><div class="sk-label fitted sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-34" type="checkbox" ><label for="sk-estimator-id-34" class="sk-toggleable__label fitted sk-toggleable__label-arrow fitted">estimator: DecisionTreeRegressor</label><div class="sk-toggleable__content fitted"><pre>DecisionTreeRegressor()</pre></div> </div></div><div class="sk-serial"><div class="sk-item"><div class="sk-estimator fitted sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-35" type="checkbox" ><label for="sk-estimator-id-35" class="sk-toggleable__label fitted sk-toggleable__label-arrow fitted">&nbsp;DecisionTreeRegressor<a class="sk-estimator-doc-link fitted" rel="noreferrer" target="_blank" href="https://scikit-learn.org/1.4/modules/generated/sklearn.tree.DecisionTreeRegressor.html">?<span>Documentation for DecisionTreeRegressor</span></a></label><div class="sk-toggleable__content fitted"><pre>DecisionTreeRegressor()</pre></div> </div></div></div></div></div></div></div></div></div>


dt_opt_10F.score(X_train, y_train)


0.9726365933735092

Regression, Optimization, and Life Expectancy

Contents

Regression, Optimization, and Life Expectancy#

Description:#

Cleaning:#

Instantiate/fit:#

Test on 25% held out test data#

Coefficients and Residuals#

Repeat the split, train, and test steps 5 times#

Interpret the model and its performance in terms of the application#

Testing on One Feature#

Test on 25% held out test data and measure the fit with two metrics and one plot#

Optimizing a more complex regression model#

Instantiate, Fit, and Score#

Examining the best fit model vs the Origninal#

More Folds#