PORTFOLIO
Project 4 Clustering
Introduce the problem:
​
The question I want to answer is whether wine can be classified by its properties. Specifically, I want to find out whether properties such as acidity can show if a wine is red, white, or something else. I think this could help people find the kind of wine they might like, based on the variables in the data that I got from Kaggle. It is an interesting question to ask and an interesting dataset to explore.
​
What is clustering and how does it work?
​
Clustering is a method for grouping data points so that the points in one group are more similar to each other than to the points in other groups. k-means clustering is an unsupervised machine learning algorithm. Its main goal is to group similar data points together and reveal patterns that are not visible at a glance. In this method, k, the number of clusters, must be defined in advance. The algorithm then repeatedly assigns each point to its nearest cluster center and moves each center to the mean of its assigned points, until the assignments stop changing.
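​
To make the idea concrete, here is a minimal sketch of k-means written with only the Python standard library. It is not the code used in this project; the function names (`squared_distance`, `kmeans`), the random starting centers, and the tiny example points are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Group `points` (a list of tuples) into k clusters.

    Returns (labels, centers). The starting centers are k points picked
    at random, which is an assumption; other initializations exist.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # nothing changed, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Made-up example: two well-separated groups of 2-D points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centers = kmeans(points, k=2)
```

The cluster numbers themselves are arbitrary. What matters is which points end up sharing a label.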
​
Introduce the data:
​
The dataset comes from Kaggle. Its attributes are Alcohol, Malic Acid, Ash, Alcalinity of Ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. All the wines in this dataset come from the same region in Italy.
​
Pre-Processing:
​
The first thing I checked was that all the columns listed for the dataset were actually present, as can be seen in the picture below.
​
[Picture: the columns listed in the dataset]
​
Next, I checked for null values. Luckily, nothing in the data was null. I also checked the data types. As you can see, every column is stored as a float or an int. Those were all the preprocessing steps that I took with the data.
​
[Picture: null-value check and the data types of the columns]
​
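The three checks described above (columns present, no nulls, numeric types) can be sketched with only the standard library. This is an illustration under assumptions and not the code from this project: `inspect_csv` and `infer_type` are hypothetical helpers, and `SAMPLE` is a tiny stand-in with assumed column names, not the real Kaggle file.

```python
import csv
import io


def infer_type(values):
    """Return 'int', 'float', or 'str' for a column of text values."""
    present = [v.strip() for v in values if v and v.strip()]
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue  # this cast failed, so try the next, looser type
    return "str"


def inspect_csv(text, expected_columns):
    """Report missing columns, empty cells per column, and column types."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    header = reader.fieldnames or []
    missing = [c for c in expected_columns if c not in header]
    nulls = {
        c: sum(1 for r in rows if (r[c] or "").strip() == "") for c in header
    }
    types = {c: infer_type([r[c] for r in rows]) for c in header}
    return missing, nulls, types


# Stand-in sample with assumed column names (not the real dataset).
SAMPLE = """Alcohol,Magnesium,Color_Intensity
14.23,127,5.64
13.2,100,4.38
13.16,101,5.68
"""

missing, nulls, types = inspect_csv(
    SAMPLE, ["Alcohol", "Magnesium", "Color_Intensity"]
)
```

An empty `missing` list and all-zero `nulls` counts would match what I saw in the data, where no preprocessing beyond these checks was needed.
​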
Data Understanding/Visualizations
​
The first thing I wanted to see was the relationships between the features. I chose to show this with a pair plot of all the features in the dataset, which is shown in the picture below.
​
[Picture: pair plot of all the features in the dataset]
​
As the pair plot above shows, there are many relationships between the different columns. One thing I noticed was a distinctive relationship between the alcohol level and the color intensity of the wine. I wanted to explore this a little further, so I made another pair plot based on color intensity.
​
[Picture: pair plot based on color intensity]
​
As you can see here, I made a correlation graph centered on color intensity. I also made another graph relating magnesium level, alcohol level, and color intensity. This is shown in the graph below.
​
[Picture: graph of magnesium level, alcohol level, and color intensity]
​
As you can see, the cool thing here is that when the alcohol level is low, the color intensity tends to be on the lighter side. The opposite is also true: when the alcohol level is high, the color tends to be more intense. We can also see that there is no major correlation between magnesium and color intensity.
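​
One way to put a number on a relationship like this is the Pearson correlation coefficient, which is close to 1 when two columns rise together, close to -1 when one falls as the other rises, and close to 0 when there is no linear relationship. Below is a minimal sketch with a hypothetical helper named `pearson`; the short lists are made-up numbers, not values from the wine data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


# Made-up numbers: the second list rises with the first, the third falls.
rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

Applied to the real columns, a value near 0 for magnesium and color intensity would agree with what the graph suggests.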
​
Modeling
​
The algorithm that I chose for my dataset was k-means. I chose it because the dataset is smaller than most, with few columns and data types. The first thing I did was check how the variance changes with the number of clusters, which tells us how many clusters we should have. This is shown in the picture below.
​
[Picture: variance plotted against the number of clusters]
​
The number of clusters to pick is found by looking for the steepest dropoff in the plot and taking the number of clusters closest to it. In this case, that number is 2. I then created the model itself, which can also be seen below. The next thing I did was make a figure that shows the algorithm in use.
​
[Picture: the k-means model and the resulting clusters]
​
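The rule described above, picking the number of clusters right after the steepest dropoff, can be sketched as a small function. This is an illustration only: `steepest_drop_k` is a hypothetical helper, and the variance values in the example are invented to show the shape of such a curve, not the numbers from my plot.

```python
def steepest_drop_k(variances):
    """Return the k that follows the steepest drop in the curve.

    variances[i] is the within-cluster variance measured with
    k = i + 1 clusters, so the list starts at k = 1.
    """
    if len(variances) < 2:
        raise ValueError("need values for at least k = 1 and k = 2")
    # drops[i] is how much the variance falls going from k = i + 1 to k = i + 2.
    drops = [variances[i] - variances[i + 1] for i in range(len(variances) - 1)]
    steepest = max(range(len(drops)), key=drops.__getitem__)
    return steepest + 2


# Invented curve for k = 1..5: the biggest fall is between k = 1 and k = 2.
example_k = steepest_drop_k([1000.0, 400.0, 300.0, 250.0, 230.0])
```

One caveat with this simple rule is that the first drop is usually the largest, so it tends to favor 2. It is worth also looking at where the curve visibly flattens out.
​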
Storytelling
​
I think it is pretty cool what I was able to make with the amount of data in this set. The clear divide in the plot shows two clusters that separate the wines by color intensity and alcohol level. I think this answers the question I started with: it shows that a higher alcohol level does not necessarily determine the color intensity of the wine. Rather, it is the color intensity that predicts the alcohol level.
​
Impact
​
I think this study can have a very positive impact on people who drink wine, whether daily or only every now and then. However, the research could also have a negative impact on society if someone wanted to use this information in a harmful way. For example, someone could choose which wine to steal by judging how acidic it is from its color intensity. This is a concern that could become a security threat to any store.
​
References
​
https://www.kaggle.com/datasets/harrywang/wine-dataset-for-clustering
​
Code
​






