Introduction

The objective of this paper is to find and showcase connections between various heart measurements and the presence of heart disease in patients. The data used in this paper includes 303 medical records from patients at risk of heart disease. The 14 terms recorded are as follows:

Medical Records
Term      Definition
age       Age in years
sex       Male/Female
cp        Chest pain level (0-4)
trestbps  Resting blood pressure (mmHg)
chol      Serum cholesterol (mg/dl)
fbs       Fasting blood sugar (high/low)
restecg   Resting electrocardiographic normality
thalach   Maximum heart rate achieved
exang     Exercise induced angina (yes/no)
oldpeak   ST depression induced by exercise relative to rest
slope     Peak exercise ST segment slope (up, flat, down)
ca        Number of major vessels colored by fluoroscopy (0-3)
thal      Heart defect (none/fixed/reversible)
num       Presence of heart disease (none/levels of severity 1-4)

Can we show all the data at once?

Yes (but we shouldn’t.)

The above chart represents all 14 variables in a 2-dimensional plot. The one big observation that can be made is that patients with heart disease (colored) are grouped on the right and patients without (black) are grouped on the left. Other notable groupings are that males are more likely to be closer to the top, while females tend to fall towards the bottom; older people sit higher up than younger people; and the higher the heart rate, the farther left the person tends to fall. Aside from these observations, you can vaguely interpret what the arrows mean. However, it is not easy to make any concrete observations from this image alone. Plus, only 37.55% of the variability is explained in this chart, so any observations drawn from it are not very reliable.
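The 37.55% figure is a share of total variance. Here is a minimal sketch of how such a share is computed, assuming the chart is a principal-component style plot of standardized variables (an assumption, since the method is not named here). The eigenvalues are made up for illustration and are not the values behind the chart.

```python
def explained_share(eigenvalues, k=2):
    """Fraction of total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical eigenvalues for 14 standardized variables (they sum to 14).
toy_eigenvalues = [3.3, 1.9, 1.3, 1.2, 1.0, 0.9, 0.9,
                   0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

share = explained_share(toy_eigenvalues, k=2)
print(f"First two components explain {share:.2%} of the variance")
```

With 14 standardized variables the total variance is 14, so a two-axis plot that explains a little over a third of it leaves most of the variation unseen.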

How can we visualize the data more effectively?

By conducting factor analysis and rotating the data in two different ways, we can group the variables in two separate ways, each containing interesting observations. The first grouping was found using a Varimax rotation and contains 11 of the 14 variables; fbs, chol, and restecg were deemed too independent from the other variables to include. The second grouping was found using a Promax rotation and only includes the quantitative variables (thalach, oldpeak, age, chol, and trestbps) plus the presence of heart disease, num. The first rotation resulted in 4 groups and the second rotation resulted in 2 groups. The results are shown below:

Varimax Grouping
group1   group2  group3    group4
cp       sex     age       slope
num      thal    trestbps  oldpeak
exang            ca
thalach

Promax Grouping
group1   group2
thalach  chol
oldpeak  trestbps
num      age
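To make the rotation step less of a black box, here is a minimal sketch of a Varimax rotation for the two-factor case, followed by the "assign each variable to its strongest factor" grouping. It uses the closed-form planar rotation on the raw (not Kaiser-normalized) criterion. The loadings and variable names are hypothetical, this is not the routine used for the tables above, and the oblique Promax rotation is not covered.

```python
import math

def varimax_pair(loadings):
    """Rotate a two-factor loading matrix to maximize the raw varimax criterion.

    loadings: list of (a, b) pairs, one per variable.
    Returns (rotated_loadings, angle_in_radians).
    """
    p = len(loadings)
    u = [a * a - b * b for a, b in loadings]
    v = [2 * a * b for a, b in loadings]
    A, B = sum(u), sum(v)
    C = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    D = 2 * sum(ui * vi for ui, vi in zip(u, v))
    angle = 0.25 * math.atan2(D - 2 * A * B / p, C - (A * A - B * B) / p)
    c, s = math.cos(angle), math.sin(angle)
    rotated = [(a * c + b * s, -a * s + b * c) for a, b in loadings]
    return rotated, angle

def varimax_criterion(loadings):
    """Sum over both factors of the variance of the squared loadings."""
    p = len(loadings)
    total = 0.0
    for j in range(2):
        squared = [row[j] ** 2 for row in loadings]
        mean = sum(squared) / p
        total += sum((x - mean) ** 2 for x in squared) / p
    return total

def assign_groups(loadings, names):
    """Put each variable in the group of the factor it loads on most strongly."""
    groups = {1: [], 2: []}
    for name, (a, b) in zip(names, loadings):
        groups[1 if abs(a) >= abs(b) else 2].append(name)
    return groups

# Hypothetical example: a clean two-group structure hidden by a 30 degree turn.
names = ["var_a", "var_b", "var_c", "var_d"]
simple = [(0.8, 0.0), (0.7, 0.0), (0.0, 0.9), (0.0, 0.6)]
theta = math.radians(30)
unrotated = [(a * math.cos(theta) - b * math.sin(theta),
              a * math.sin(theta) + b * math.cos(theta)) for a, b in simple]

rotated, angle = varimax_pair(unrotated)
groups = assign_groups(rotated, names)
print(groups)
```

After the rotation each toy variable loads on essentially one factor, which is what makes a grouping table like the ones above readable.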

Now that we have the data split into smaller groups, we can begin to properly visualize it. Some of the groups ended up still not being useful for analysis, namely group 1 in Varimax and group 2 in Promax. However, the rest of the groups are quite small and worth looking into!

Exercise, Age, and ST segments

The chart on the left is based on group 4 from the Varimax rotation. It shows the amount of ST segment depression during exercise relative to rest, grouped by the slope category of the ST segment at peak exercise. The regression line shows that there is a linear relationship between the variables, and the boxplots show that as the slope decreases, the mean depression increases, as does its variability.

The chart on the right shows that the max heart rate achieved and age have a negative linear relationship, and that the points are grouped by the same ST slope categories used on the left. Each group has its own regression line, with patients with upsloping ST segments having the highest max heart rates, flat ST segment patients having the lowest, and downsloping ST segment patients somewhere in the middle.

How can we use this data to predict heart disease?

Using group 1 from our Promax rotation, we link the Max Heart Rate and ST Depression variables together. The ellipses show where 95% of each type lie, and there is a large area where the not-present group does not overlap with the present group. This shows that those with a low-to-medium maximum heart rate and moderate-to-severe exercise-induced ST depression are the most likely to have heart disease. With this in hand, we can use the observations made in the previous two charts about how age and peak exercise slope relate to a heart disease diagnosis.
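For readers curious about what a 95% ellipse means numerically, here is a minimal sketch of a normal-theory data ellipse: a point is inside when its squared Mahalanobis distance from the group center is below the 95% chi-square cutoff with 2 degrees of freedom. This is an assumption about the construction (plotting tools may use a slightly different rule), and the points are hypothetical, not taken from the 303 records.

```python
import math

def mean_and_cov(points):
    """Sample mean and 2x2 sample covariance of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def inside_ellipse(point, center, cov, level=0.95):
    """True if point lies inside the normal-theory data ellipse at `level`."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - center[0], point[1] - center[1]
    # Squared Mahalanobis distance via the inverse of the 2x2 covariance.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    # The chi-square quantile with 2 degrees of freedom has a closed form.
    threshold = -2.0 * math.log(1.0 - level)
    return d2 <= threshold

# Hypothetical (max heart rate, ST depression) pairs for one group.
present = [(120, 2.5), (130, 1.8), (110, 3.0), (140, 1.2), (125, 2.0), (135, 2.2)]
center, cov = mean_and_cov(present)

print(inside_ellipse((128, 2.0), center, cov))   # near the center of the group
print(inside_ellipse((200, 0.0), center, cov))   # far from the group
```

A new patient who falls inside one group's ellipse but outside the other's is the easy case; the overlap region is where the two variables alone cannot separate the groups.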

For example, it was previously shown that those with an older age and a flat ST slope at peak exercise have a lower max heart rate, so we can use the new chart to infer that that same group has a higher chance of having heart disease. Similarly, as shown in the chart on the top left, those with downsloping ST segments have a higher average ST Depression, which also implies a higher chance of heart disease.

Can we use this data to predict heart disease in future patients?

Using all the connections observed above, along with many others that are harder to see, I created a model to predict whether or not somebody has heart disease based on the 13 other variables provided in the medical records.

About the model:

In our observations, there are 139 patients with heart disease out of 303 total patients. With this prior information of a 45.87% positive rate, we can use linear discriminant analysis to predict whether or not a new patient has heart disease given the 13 variables.
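To show how the prior enters linear discriminant analysis, here is a minimal one-predictor sketch. The model described above uses all 13 predictors; a single predictor is used here only to keep the arithmetic visible. The heart-rate values are hypothetical, and only the 139/303 prior comes from the records.

```python
import math

def fit_lda_1d(values, labels):
    """Estimate class means, a pooled variance, and class priors."""
    classes = sorted(set(labels))
    n = len(values)
    means, priors, ss = {}, {}, 0.0
    for k in classes:
        xs = [x for x, y in zip(values, labels) if y == k]
        means[k] = sum(xs) / len(xs)
        priors[k] = len(xs) / n
        ss += sum((x - means[k]) ** 2 for x in xs)
    pooled_var = ss / (n - len(classes))
    return means, pooled_var, priors

def predict_lda_1d(x, means, pooled_var, priors):
    """Return the class with the largest linear discriminant score."""
    def score(k):
        return (x * means[k] / pooled_var
                - means[k] ** 2 / (2 * pooled_var)
                + math.log(priors[k]))
    return max(means, key=score)

# Hypothetical max heart rates: class 0 = no disease, class 1 = disease.
heart_rates = [172, 165, 160, 158, 150, 140, 132, 125, 145, 120]
disease = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

means, pooled_var, _ = fit_lda_1d(heart_rates, disease)
# Replace the toy priors with the rate observed in the 303 records.
priors = {0: 164 / 303, 1: 139 / 303}

print(predict_lda_1d(170, means, pooled_var, priors))
print(predict_lda_1d(118, means, pooled_var, priors))
```

Because the no-disease prior is slightly larger, the decision boundary shifts a little toward the disease group's mean, so borderline patients lean towards a negative prediction.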

After being run 1000 times with random samples, this model correctly predicts whether or not someone has heart disease 83.27% of the time, given the other variables. Furthermore, it is 79.68% accurate in predicting a true case and 86.28% accurate in predicting a false case.
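The repeated-sampling procedure can be sketched as follows: split the records at random, fit on one part, score on the other, and pool the results over many runs. The split size is not stated above, so `test_frac=0.2` is an assumption, and `fit_midpoint` and `predict_midpoint` are hypothetical stand-ins for the real model, run on made-up data.

```python
import random

def repeated_holdout(rows, labels, fit, predict, runs=1000, test_frac=0.2, seed=0):
    """Pool confusion counts over many random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    n_test = max(1, round(test_frac * len(rows)))
    tp = tn = fp = fn = 0
    for _ in range(runs):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        for i in test:
            guess = predict(model, rows[i])
            if labels[i] == 1 and guess == 1:
                tp += 1
            elif labels[i] == 1:
                fn += 1
            elif guess == 0:
                tn += 1
            else:
                fp += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
    }

# Hypothetical stand-in classifier: a threshold halfway between class means.
def fit_midpoint(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2, mean1 < mean0

def predict_midpoint(model, x):
    threshold, disease_is_low = model
    return int((x < threshold) == disease_is_low)

# Made-up max heart rates: class 0 = no disease, class 1 = disease.
heart_rates = [172, 165, 160, 158, 150, 140, 132, 125, 145, 120]
disease = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

summary = repeated_holdout(heart_rates, disease, fit_midpoint, predict_midpoint, runs=200)
print(summary)
```

Scoring only on records the model did not see during fitting is what keeps the reported percentages from being overly optimistic.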

Let’s test on five random patients:

Note: the num variable is the true value and is not included when given to the model.

Patient Info:
age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  num
52   0    3   136       196   0    2        169      0      0.1      2      0   3     0
55   0    4   180       327   0    1        117      1      3.4      2      0   3     1
49   1    3   118       149   0    2        126      0      0.8      1      3   3     1
74   0    2   120       269   0    2        121      1      0.2      1      1   3     0
54   0    3   160       201   0    0        163      0      0.0      1      1   3     0
## Prediction:
## [1] 0 1 1 0 0
## Levels: 0 1

Using the model:

Given our new patient, Gertrude Smith, a 60-year-old female with non-anginal pain, a resting blood pressure of 102 mm Hg, a cholesterol measurement of 318 mg/dl, low fasting blood sugar, normal resting electrocardiographic results, a maximum heart rate of 160 beats/minute, no exercise-induced angina, no ST depression induced by exercise relative to rest, an upsloping peak ST segment, only 1 colored major vessel, and a normal thal diagnosis, we can predict whether or not she has heart disease by putting her info into the model:

Smith, Gertrude:
age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
60   0    3   102       318   0    0        160      0      0        1      1   3
## Prediction:
## [1] 0
## Levels: 0 1

The model predicts that Gertrude does not have heart disease.

As stated above, the model is only 86.28% accurate in giving a true negative, so there is still a 13.72% chance that the model returned a false negative and Gertrude does have heart disease. However, 86.28% is a pretty high likelihood, and it would likely be more beneficial to look for another cause of Gertrude’s symptoms before testing for heart disease.
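A related question is how often a negative prediction turns out to be correct, which also depends on the true-positive rate and on how common heart disease is. Here is a minimal sketch using Bayes' rule, assuming the 79.68% and 86.28% figures are the true-positive and true-negative rates and that Gertrude comes from a population with the same 45.87% disease rate as the sample. Both are assumptions.

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """P(no disease | negative prediction) by Bayes' rule."""
    true_negatives = specificity * (1.0 - prevalence)
    false_negatives = (1.0 - sensitivity) * prevalence
    return true_negatives / (true_negatives + false_negatives)

npv = negative_predictive_value(
    sensitivity=0.7968,     # rate of correctly flagged true cases (from the text)
    specificity=0.8628,     # rate of correctly cleared false cases (from the text)
    prevalence=139 / 303,   # share of patients with heart disease in the sample
)
print(f"Chance a negative prediction is correct: {npv:.1%}")
```

Under these assumptions a negative prediction is correct about 83% of the time, close to but not the same as the figures quoted above, so the practical conclusion about Gertrude is unchanged.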

Can we use our observations made earlier to help make predictions?

I created a reduced model using only the age, thalach, slope, and oldpeak variables, as used in the charts above. This resulted in an accuracy of 67.11%, a 62.07% true positive rate, and a 70.21% true negative rate. While this is a pretty big drop in quality from the 83.27%, 79.68%, and 86.28% of the full model, the reduced model only contains 4 predictors compared to the 13 of the full model. Referring back to the “Patient Info” table above, let’s try out the reduced model on the same 5 patients as our full model, as well as Gertrude:

## Patient prediction:
## [1] 0 1 1 1 0
## Levels: 0 1
## Gertrude prediction:
## [1] 0
## Levels: 0 1

As shown here, this model is not as accurate as the full model. However, it still made the same negative prediction about Gertrude as the full model. Since age is already known without testing and the other 3 variables can be obtained with a single test, performing that test first would be the quickest way to form a general impression of a patient before performing further testing.
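The comparison on the five test patients can be tallied directly. This small sketch uses only the values printed above: the true num column and the two prediction vectors.

```python
truth = [0, 1, 1, 0, 0]          # num column from the Patient Info table
full_model = [0, 1, 1, 0, 0]     # full-model predictions
reduced_model = [0, 1, 1, 1, 0]  # reduced-model predictions

def confusion_counts(truth, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for t, p in zip(truth, predicted):
        if t == 1:
            counts["tp" if p == 1 else "fn"] += 1
        else:
            counts["tn" if p == 0 else "fp"] += 1
    return counts

full_counts = confusion_counts(truth, full_model)
reduced_counts = confusion_counts(truth, reduced_model)

# Patients (numbered from 1) on which the two models disagree.
disagreements = [i + 1 for i, (a, b) in enumerate(zip(full_model, reduced_model)) if a != b]

print(full_counts)
print(reduced_counts)
print(disagreements)
```

The full model gets all five patients right, while the reduced model's single miss is a false positive on the fourth patient, the 74-year-old.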

Conclusion

Quite a few observations can be made from the 14 terms listed in the 303 patients’ medical records. I found that, of the 13 predictor variables, some of the most important and easiest to visualize are the ST Slope, ST Depression Induced by Exercise, Maximum Heart Rate, and Age. Using the 13 predictor variables to predict the presence of heart disease, a model can make predictions with 83.27% accuracy, and a model using only the 4 aforementioned variables can be used with an accuracy of 67.11%. Since the model was only trained on 303 patients, the accuracy has potential to grow as more cases are introduced.