PORTFOLIO
Project 3 Linear Regression
Introduction:
​
The dataset that I decided to use is from a Kaggle competition. It consists of many columns describing residential properties. Kaggle describes the competition as follows: "You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition." The point of the competition is for people to get their hands on a dataset and use regression methods to draw insights from the data. I will be looking for correlations between particular features and the sale price, and at what those correlations mean for the value of the homes.
​
What is Linear Regression:
​
Linear regression is a supervised machine learning model that fits a straight line to the relationship between the independent and dependent variables. There are two types: simple linear regression and multiple linear regression. Simple linear regression has only one independent variable, which is used to predict the dependent variable. Multiple linear regression has more than one independent variable, and the model fits a relationship between all of them and the dependent variable.
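To make the difference between the two types concrete, here is a minimal sketch using NumPy's least-squares solver. The data is synthetic and the names (area, quality, price, fit_linear) are made up for illustration; it is not the code or the data used in this project.

```python
import numpy as np

# Synthetic, noise-free data (hypothetical values, not the Kaggle dataset).
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=50)          # first independent variable
quality = rng.integers(1, 11, size=50)          # second independent variable
price = 20000 + 100 * area + 15000 * quality    # dependent variable

def fit_linear(features, target):
    """Least-squares fit; returns [intercept, coef_1, ..., coef_k]."""
    X = np.column_stack([np.ones(len(target)), features])
    coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coeffs

# Simple linear regression: one independent variable.
simple = fit_linear(area, price)

# Multiple linear regression: two independent variables.
multiple = fit_linear(np.column_stack([area, quality]), price)
```

Because the synthetic price is an exact linear function of both variables, the multiple fit recovers the intercept and both slopes, while the simple fit has to absorb the effect of quality into a single line.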
​
Experiment 1: Understanding the Data/Pre Processing
​
The first thing I needed to do was check whether there were any null values in the data. I checked, and 19 columns had null values in them. I glanced over the column names, and they didn't seem to have much correlation with the sale price of the house, so I decided to drop those columns entirely. This can be seen in the image below.
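A null check and column drop of this kind can be sketched with pandas as follows. The tiny DataFrame and the column name ColWithNulls are hypothetical stand-ins for illustration; the actual 19 dropped columns are not listed in the text.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real data.
df = pd.DataFrame({
    "GrLivArea": [1500, 1200, 1800],
    "OverallQual": [7, 6, 8],
    "ColWithNulls": [np.nan, "a", np.nan],   # hypothetical column with nulls
    "SalePrice": [200000, 160000, 250000],
})

# Count nulls per column and keep the names of columns that have any.
null_counts = df.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

# Drop those columns as a whole.
cleaned = df.drop(columns=cols_with_nulls)
```

Dropping whole columns is the simplest option; it is a judgment call that only makes sense when the dropped columns are not expected to help predict the target.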
​
​
​
​
After doing this, I wanted to see which features had the strongest correlation with the sale price. After making a heat map, I saw the following:
​
​
As you can see, the higher the number in the rightmost column of the heat map, the stronger that feature's correlation with the sale price. The two columns with the highest correlation to the sale price were GrLivArea and OverallQual. OverallQual stands for the overall quality of the property, and GrLivArea stands for "Above grade (ground) living area square feet". This makes sense: the quality of a house and its square footage are the two variables most likely to affect its price.
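The ranking that the heat map shows visually can also be read off as numbers. The sketch below uses pandas' DataFrame.corr on a small made-up table; the values are invented for illustration and only the column names come from the dataset.

```python
import pandas as pd

# Made-up rows; only the column names match the real dataset.
df = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 4],
    "GrLivArea": [1200, 1500, 1700, 2100, 2600, 1000],
    "LotArea": [9000, 7000, 12000, 8000, 10000, 11000],
    "SalePrice": [150000, 180000, 210000, 260000, 330000, 120000],
})

# Pearson correlation of every feature with SalePrice, strongest first.
corr_with_price = (
    df.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

# The two features most correlated with the sale price.
top_two = corr_with_price.index[:2].tolist()
print(corr_with_price)
```

Sorting the SalePrice column of the correlation matrix gives the same information as the last column of the heat map, in a form that is easier to compare.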
​
Experiment 1: Modeling
​
The next thing I wanted to do was create some linear regression plots to look at. These show the regression line for each of the two columns that I found were most correlated with the sale price.
​
​
​
​
Experiment 1: Evaluation
​
To evaluate the model, I used the same kind of metric that Kaggle uses to grade the contest: RMSE, which stands for Root Mean Square Error. I calculated the values, and these are the results.
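For reference, RMSE is the square root of the average squared difference between the actual and predicted values. A minimal sketch in plain Python, with made-up prices, looks like this (the function name rmse and the numbers are illustrative only):

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up sale prices and predictions.
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 330000]
error = rmse(actual, predicted)
print(round(error, 2))
```

Because the errors are squared before averaging, one large miss (here the 30000 error) raises the RMSE more than several small ones, and the result is in the same unit as the target, dollars in this case.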
​
​
​
​
​
​
​
​
The Root Mean Square Error tells us that predictions of the sale price are typically off by about 43000 dollars. I think this RMSE is pretty good, since house values can range anywhere from hundreds of thousands of dollars to millions. With that in mind, I consider 43112 an excellent RMSE.
​
Experiment 2
​
The next thing I wanted to try, unlike in the experiment above, was predicting the sale price of houses from a column with one of the weaker correlations to the sale price. The column I decided to go with was LotArea. As the name suggests, this column is the lot area of the property. I was very surprised that, according to the heat map, this column only had a 0.26 correlation with the sale price, since I feel that people with bigger lots tend to spend more money on their houses. The first thing I decided to check was the linear regression chart.
​
​
​
​
​
As you can see from the chart above, this is what happens when there is only a weak relationship between the target feature and the column being compared to it. You can tell from the graph because most of the blue points don't follow the trend of the black line. I also went ahead and evaluated the RMSE.
​
​
​
​
​
​
​
​
​
As you can see, the RMSE for LotArea is far larger than it was for the best-correlated columns. This makes sense, and it shows that the correlation heat map really does its job. The RMSE increased by about 36182, which is a huge difference, and it shows how helpful a regression plot and an evaluation metric such as RMSE can be.
Experiment 3
For this next experiment, I decided to try different regression models and see which one fit the data from my first experiment best. I reused the first experiment simply because it has the best-correlated data. The image below is a reminder of what we got from the linear model.
​
​
​
​
The next model that I tried was SVR (Support Vector Regression).
​
​
​
​
As you can see, with SVR the RMSE is a couple of thousand higher than with the linear model. The next model I wanted to look at was Bayesian ridge regression. This is what the model came out with.
​
​
​
​
​
​
​
​
​
As you can see, the RMSE for this model was very similar to that of the previous model, SVR. It was also nearly as far from the RMSE of the original linear model. So, based on the few models tried here, the original linear model has the lowest error, with an RMSE of about 43000.
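A comparison like this one can be sketched with scikit-learn by fitting each model on the same split and computing the RMSE of each. The data below is synthetic, the default hyperparameters are an assumption, and the numbers it prints are not the results reported above.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for two features and a noisy price (hypothetical data).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(500, 3000, 200), rng.integers(1, 11, 200)])
y = 20000 + 100 * X[:, 0] + 15000 * X[:, 1] + rng.normal(0, 20000, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default settings for every model; tuning is left out of this sketch.
models = {
    "LinearRegression": LinearRegression(),
    "SVR": SVR(),
    "BayesianRidge": BayesianRidge(),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

for name, score in results.items():
    print(f"{name}: RMSE = {score:.0f}")
```

Using one shared train/test split keeps the comparison fair, since every model is scored on exactly the same held-out rows. SVR in particular is sensitive to feature scaling and to its C and epsilon settings, so its default-setting score should not be read as the best it can do.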
​
Impact​
​
I believe this project could definitely be used for good, because it shows people what they should be looking for in their future house. Houses are a place to live, but I believe they can also be a great investment. Even though a house slowly ages over time, depending on its location there is a high possibility that its market value goes up, based on the development that happens around it and on what people are looking for in a house. I therefore think this project was very informative about which of a house's key features are most strongly correlated with its sale price. With all good, however, may come bad: this data and research could easily be used by people who scam others for a living. If a person lives in a house that is highly priced because of external features that are easy to spot and were shown in this analysis, that person might become a more attractive target for scammers.
​
Conclusion​
​
What I mainly learned in this project was how to fit data to linear models and then assess the results with evaluation methods. I also learned that there are many evaluation methods that can be used for many different things. One of the main takeaways is that I can actually see myself using linear models and evaluation methods in a workplace. I feel that some of the steps I took definitely helped the model perform better. For example, taking out some of the unnecessary columns helped the model reach its full potential and give a lower RMSE, which basically means the model is more accurate. I did not systematically test removing and including features, but I did remove some and found that the RMSE got lower after certain columns were removed.
​
Sources
​
​
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html
​
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
​
Code
​
​
​
​
​
​
​
​
​
​
​
​
​
​
​









