
Project 2 - Classifying Heart Disease

Introducing the problem:

The problem that I wanted to tackle was heart disease. A lot of people in this world are affected by heart disease, and I wanted to figure out what exactly causes it. I felt this could be done by making predictions from data that has already been collected. Heart disease can sometimes be due to genetics, but most of the time it is due to taking bad care of the body or to old age. I wanted to see how strongly something like cholesterol correlates with actual heart disease.

Introducing the data:

The dataset that I picked for this analysis is one that I found on Kaggle. It has plenty of columns that I felt were beneficial to the study. These are the features of the dataset:

  1. Age: Age in years

  2. Sex: Sex (1 = male; 0 = female)

  3. ChestPain: Chest pain type (typical, asymptomatic, nonanginal, nontypical)

  4. RestBP: Resting blood pressure

  5. Chol: Serum cholesterol in mg/dl

  6. Fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

  7. RestECG: Resting electrocardiographic results

  8. MaxHR: Maximum heart rate achieved

  9. ExAng: Exercise induced angina (1 = yes; 0 = no)

  10. Oldpeak: ST depression induced by exercise relative to rest

  11. Slope: Slope of the peak exercise ST segment

  12. Ca: Number of major vessels colored by fluoroscopy (0 - 3)

  13. Thal: (3 = normal; 6 = fixed defect; 7 = reversible defect)

  14. target: AHD - Diagnosis of heart disease (1 = yes; 0 = no)

Pre-Processing

The first step that I took was importing the actual data. The second step was checking the head to make sure that all 14 columns were there, which they were. I also checked the data types of the columns: all of them were int64 except for a single column, Age, which was a float. The main thing I did was check whether there were any null values that would have to be removed. Luckily, there were no null values to be seen. These were all the pre-processing steps that were taken.
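For reference, a minimal sketch of what these checks might look like in pandas is below; the file name heart.csv and the variable names are assumptions made for illustration, not necessarily the exact code I ran.

import pandas as pd

# load the Kaggle CSV (file name assumed here)
df = pd.read_csv("heart.csv")

print(df.head())           # confirm all 14 columns came through
print(df.dtypes)           # check each column's data type
print(df.isnull().sum())   # count missing values per column (all zeros in this data)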


Data Understanding/Visualizations

The first thing that I wanted to do was check whether there was a correlation between a person's age and actually having heart disease. So, using seaborn, I plotted a graph with age on the X axis and the hue showing whether the individual had heart disease or not. Orange means the individual did have heart disease; blue means the individual did not.
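A rough sketch of how a plot like this can be produced with seaborn's countplot, using the df loaded earlier; the column names age and target are assumed from the dataset description and may differ slightly from the ones in my notebook.

import seaborn as sns
import matplotlib.pyplot as plt

# one bar per age value, split by diagnosis (hue)
sns.countplot(data=df, x="age", hue="target")
plt.show()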

[Figure: count of patients at each age, colored by heart disease diagnosis]

As can be seen in the figure above, there really wasn't any correlation between a person's age and whether they had heart disease or not. This could be due to many reasons, but I believe that genetics may be one of them. Another thing that came up while I was doing some more visuals, and that I thought was interesting, is the ratio of women to men in this study that had heart disease. Keep in mind, before you see the next graphic, that for the X axis:

1 = Male

0 = Female

Orange = No confirmed heart disease

Blue = Confirmed heart disease

[Figure: count of patients by sex, colored by heart disease diagnosis]

As you can see from the image above, in the sample collected there were more males with heart disease than females. This was really interesting to see, because beforehand one would think that a person's sex wouldn't affect the chances of getting heart disease, but the data says otherwise. These are all the visualizations I felt I needed to see before moving on to the next step.

Modeling and Evaluation

The first thing that I decided to do was build a decision tree. I looked back at the data types I had and made sure that there were no strings in there, as that would make the decision tree unusable. Once I confirmed there were no strings, I moved on to the next step, which was setting the target. In my case, the target to be predicted was quite clearly the target column, which records whether or not a patient had heart disease. This can be seen in the screenshot below.
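As a rough sketch of this step, the features and target could be separated as follows; the DataFrame name df and the exact column selection are assumptions for illustration.

# target column: 1 = heart disease, 0 = no heart disease
y = df["target"]

# every other column is used as a predictor
X = df.drop(columns=["target"])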

[Screenshot: selecting the feature columns and the target column]

After doing that, I went on to split the data into train and test sets. I decided to train on 80 percent of the data and test on the remaining 20 percent. This can be seen in the image below.
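A minimal sketch of an 80/20 split with scikit-learn, assuming the X and y defined above; the random_state value is an arbitrary choice for illustration, not necessarily the one I used.

from sklearn.model_selection import train_test_split

# hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)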

[Screenshot: splitting the data into an 80% training set and a 20% test set]

After this, I made a decision tree and fit it to the X and y training data. The first thing I did was import tree from sklearn; after doing this, I fit the tree with X_train and y_train. Then I started to predict and score the model. The model scored 0.7704918032786885, which is around 77% accuracy; that is not bad considering the dataset that was used. You can see all the steps described above in the picture below.
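A sketch of the fit, predict, and score steps, using the train/test split from above; the classifier is assumed to use sklearn's default hyperparameters, and the variable names are illustrative.

from sklearn import tree

# fit a decision tree classifier on the training split
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

# predict on the held-out 20% and check accuracy
y_pred = clf.predict(X_test)
print(clf.score(X_test, y_test))   # accuracy on the test set (about 0.77 here)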

[Screenshot: fitting the decision tree and scoring it on the test set]

Then it was time to draw the actual tree itself. This is shown in the screenshot below, which covers what I did to get the tree itself and what I did to visualize it.
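A sketch of one way to visualize the fitted tree with sklearn's plot_tree, reusing clf and X from above; the figure size, class labels, and styling options are my own choices for readability, not necessarily the ones used in the screenshot.

import matplotlib.pyplot as plt
from sklearn import tree

# draw the fitted tree with feature names on the split nodes
plt.figure(figsize=(20, 10))
tree.plot_tree(clf, feature_names=list(X.columns), class_names=["no disease", "disease"], filled=True, fontsize=6)
plt.show()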

[Screenshot: code used to draw the tree, followed by the resulting decision tree figure]

As you can see, that is the decision tree above. While it may be a little hard to read, you can always zoom in to the picture and get all the details that might be necessary. The nodes I see a lot of involve columns like age, chol, and thal. The reason I used a decision tree was that I believed it would be the best visualization for seeing which columns were being used the most, and because I felt it could further help my understanding. I now wanted to see which of these columns had the most importance; the code that I used to make the importance chart can be seen below.
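A rough sketch of building such a chart from the tree's feature_importances_ attribute, reusing clf and X from above; plotting the importances as a horizontal bar chart is my own choice here and may not match the screenshot exactly.

import pandas as pd
import matplotlib.pyplot as plt

# pair each feature name with the importance the tree assigned to it
importances = pd.Series(clf.feature_importances_, index=X.columns)

# sort and plot so the most important features stand out
importances.sort_values().plot(kind="barh")
plt.show()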

[Screenshot: code used to compute the feature importances]

After running this code segment, I got the following results, which were quite shocking.

[Figure: feature importance chart]

According to this, the three most important features affecting the target column (whether heart disease was actually present in the patient) were cp, ca, and age; cp stands for chest pain, and ca stands for the number of major vessels colored by fluoroscopy. I wanted to go into a little more detail than this, so the next thing I decided to get was a classification report. The code and the actual classification report can be seen below.
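A minimal sketch of producing the report with scikit-learn, assuming the test-set predictions from the earlier step are stored in y_pred.

from sklearn.metrics import classification_report

# precision, recall, f1-score, and accuracy for both classes
print(classification_report(y_test, y_pred))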

[Screenshot: classification report code and output]

Here you can see the predicting aspect of the model. This also shows that the accuracy score for the model was, again, around 77%. I'm sure this is because there were some outliers: patients who had all the trends pointing toward heart disease but didn't actually have it, and vice versa. This, I feel, is where genetics and luck come into play for some of these cases.

Storytelling

This was really interesting, and I was really glad that I was able to do all of this research with the data and get to the bottom of which features were most strongly tied to whether an individual did indeed have heart disease. Initially I thought that cholesterol would have a big effect on whether people had heart disease or not. This is something that you hear about all the time: "don't eat this food, don't eat that food, it's bad for you, it's going to make your cholesterol go through the roof." At first, that was what I imagined was the reason people would have heart disease. That is not to say there weren't other variables in play, such as genetics; I just thought the main reason would be about not being healthy. As you can see, however, that was proven wrong. The features that mattered most for whether individuals had heart disease were chest pain, the number of major vessels colored using fluoroscopy, and age, which all had a much bigger impact than I initially thought.

Impact

I think that this research was conducted in good faith, and I think it could help people get an early detection of heart disease. If individuals have chest pains, they should definitely get them checked out by a doctor, and if a fluoroscopy is available to them, that wouldn't be a bad idea either. As much as those factors have an impact, it is also worth remembering that keeping in shape and eating healthier will lower your chances of having high cholesterol; even if cholesterol had a low impact here, it is better to have no risk at all. I don't believe that any of the research I conducted could be used in bad faith, but I think the dataset that was used could be. This dataset points to what the most important features for having heart disease are, and I think a person with ill intentions could use that to push the opposite effect from what was intended: for example, rather than eating healthy to lower cholesterol, they could go crazy with junk food.

References/Code

https://www.kaggle.com/datasets/volodymyrgavrysh/heart-disease

Download the code here


