When we buy most things in our lives, we need to compare, because there are many brands and alternative models. Doing this manually can take a lot of effort. In these cases, what would it be like if there were a system that calculated the average price of a product from the features we give it? That would make our job a lot easier, wouldn't it? Many big companies in marketing and other sectors have already started to use systems like this. We can use this approach even when buying a home, a car, or small computer parts. That's why I wanted to build a project around it.
The Scenario of the Project
Let's say I'm working for a company that produces RAM, and my boss has asked me for help with pricing a new model that we will produce soon. He wants me to create a model using the price data of RAM modules with similar features from our competitors, and he expects this model to give accurate prices.
Or suppose I want to build a nice computer, buying the parts one by one, and I'd like to know the average prices of those parts. First I want to buy the RAM. For that, I will build a model with machine learning algorithms and turn it into a formula. That way, I will know the average price of the RAM I want.
Web Scraping for the Dataset
I needed a detailed dataset with a realistic price list for this project. Searching the web for a ready-made dataset would have been difficult and inadequate, so I found a shopping site with detailed features and prices for RAM modules, chose the required pages, and started scraping them. (I don't have a special API for this website, so I scraped it manually.)
I used the BeautifulSoup library for this and cleaned up the data to create a strong, useful DataFrame.
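A minimal sketch of the scraping step with BeautifulSoup. The HTML structure and class names here are illustrative assumptions, not the real site's markup, and the snippet parses a hardcoded string instead of fetching a live page:

```python
# Hypothetical listing markup: each product in a <div class="product">.
# Real scraping would fetch the page first (e.g. with requests).
from bs4 import BeautifulSoup

html = """
<div class="product"><span class="name">RAM 8GB 3200MHz</span>
<span class="price">450 TL</span></div>
<div class="product"><span class="name">RAM 16GB 3600MHz</span>
<span class="price">890 TL</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for card in soup.find_all("div", class_="product"):
    name = card.find("span", class_="name").get_text(strip=True)
    price = card.find("span", class_="price").get_text(strip=True)
    rows.append({"name": name, "price": price})

print(rows)
```

From a list of dicts like this, `pd.DataFrame(rows)` gives the starting DataFrame.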
After finishing the web scraping I inspected the dataset; I was curious about its shape and its columns. Then I could edit, rename, and clean the index, the columns, and anything else that mattered to me.
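The inspection and cleanup step can be sketched like this with pandas; the toy column names are assumptions based on the article, not the exact scraped schema:

```python
import pandas as pd

# Toy stand-in for the scraped data
df = pd.DataFrame({
    "GB": [8, 16, 32],
    "Speed_MHz": [3200, 3600, 3200],
    "Price TL": [450.0, 890.0, 1700.0],
})

print(df.shape)             # (rows, columns)
print(df.columns.tolist())

# Normalize column names, e.g. "Price TL" -> "Price_TL"
df.columns = [c.replace(" ", "_") for c in df.columns]
df = df.reset_index(drop=True)
```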
My data had one categorical column (Memory_Type), so I applied one-hot encoding with dummy variables to make it usable in the machine learning models.
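One-hot encoding the Memory_Type column can be done with pandas `get_dummies`; the category values below (DDR3/DDR4) are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "GB": [8, 16, 8],
    "Memory_Type": ["DDR4", "DDR4", "DDR3"],
})

# drop_first=True avoids a redundant column for linear models
df = pd.get_dummies(df, columns=["Memory_Type"], drop_first=True)
print(df.columns.tolist())
```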
My dataset had empty values in the Price_TL column. At the same time, prices varied widely because of the different GB and speed values, so it wouldn't make sense to fill them with mean() or drop() them directly. That's why I used the K-Nearest Neighbors (KNN) algorithm to fill these null values.
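KNN-based imputation can be done with scikit-learn's `KNNImputer`: each missing price is estimated from the k rows with the most similar feature values. The numbers here are toy values, not the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "GB":        [8,    8,      16,  16,  32],
    "Speed_MHz": [3200, 3200, 3600, 3600, 3200],
    "Price_TL":  [450,  np.nan, 890, 910, np.nan],
})

# Each NaN price is replaced by the mean of its 2 nearest neighbors
imputer = KNNImputer(n_neighbors=2)
df[:] = imputer.fit_transform(df)
print(df["Price_TL"].isna().sum())
```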
When I checked several rows, the imputed values looked fine. With that, my dataset was ready for the machine learning models. After analyzing the dataset, I wanted to check whether there were any outliers.
As we can see in the graphs above, the outliers were not at problematic values. So we can make our final checks on the DataFrame and move on to the machine learning and modeling side.
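The same check can be done numerically with the IQR rule, the quantity a boxplot visualizes: values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are flagged. The prices below are made up for illustration:

```python
import pandas as pd

prices = pd.Series([450, 470, 500, 520, 560, 890, 910, 950, 1700])
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())  # [1700]
```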
Model Algorithms and Train-Test Performance
To predict the sale prices I use the following regression algorithms: Linear Regression, Ridge Regression, Lasso Regression, Polynomial Regression, PLS Regression, Principal Component Regression (PCR), and finally Elastic Net. All of these can be implemented easily in Python with the scikit-learn package. Finally, I will decide which model best suits this case by evaluating each of them with the evaluation metrics scikit-learn provides.
Cross-Validation and Model Tuning
Cross-validation is an essential tool in the data scientist's toolbox. It lets us make better use of our data. When we build a machine learning model, we often split our data into training and validation/test sets. The training set is used to train the model, and the validation/test set is used to validate it on data it has never seen before. The classic approach is a simple 80%-20% split, sometimes with different ratios like 70%-30% or 90%-10%. In cross-validation, we do more than one split: we can do 3, 5, 10, or any K number of splits.
By doing this, we're able to use all our examples both for training and for testing, while still evaluating the learning algorithm on examples it has never seen before.
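A quick sketch of that idea with scikit-learn's `KFold`: with k=5, every example lands in the test fold exactly once across the five splits:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(100, 1)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

seen = []
for train_idx, test_idx in kf.split(X):
    print(len(train_idx), len(test_idx))  # 80 20 on each split
    seen.extend(test_idx)

# Across all folds, every example was tested exactly once
assert sorted(seen) == list(range(100))
```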
And finally, by looking at the mean_absolute_error() values, we can see which model gives us the least error.
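The comparison can be run with `cross_val_score`; scikit-learn reports MAE as a negative score, so the sign is flipped back. Synthetic data again, only two of the models, and the numbers are placeholders for the real RAM dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(scale=0.2, size=120)

for model in (LinearRegression(), Ridge(alpha=1.0)):
    # 5-fold cross-validated mean absolute error (lower is better)
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(type(model).__name__, round(mae, 3))
```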
Model Testing and Training Conclusions
Analyzing the report, we can see that the cross-validation score of the Polynomial Ridge regression model is the highest, which makes it the most suitable model for our dataset. It is followed by the Linear Regression model.
We see the same ranking if we look at the mean absolute error values. We can conclude that Polynomial Ridge is the best model for our RAM price prediction problem.
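The winning "Polynomial Ridge" model can be expressed as a scikit-learn pipeline: polynomial feature expansion followed by Ridge regression. The degree and alpha below are assumptions, not the tuned values from the article:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(0, 2, size=(80, 2))
y = 3 * X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.05, size=80)

# Polynomial expansion + Ridge in one estimator
poly_ridge = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=0.1))
poly_ridge.fit(X, y)
print(round(poly_ridge.score(X, y), 3))
```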
We can also see that, rounding the output values, every model scores around 0.84 (84%) or 0.85 (85%), which means our models perform well on this dataset and can be used to solve real-world problems. So we can derive a formula from the parameters given by our model, and with this formula we can calculate the price of a RAM module from its properties.
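Turning a fitted linear model into an explicit pricing formula means reading off `intercept_` and `coef_`: price ≈ intercept + Σ(coef_i · feature_i). The data, feature names, and coefficients below are synthetic placeholders for the real fitted model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))              # hypothetical [GB, Speed_MHz] features
y = 50.0 * X[:, 0] + 0.2 * X[:, 1] + 100.0  # noiseless toy target

model = LinearRegression().fit(X, y)
terms = " + ".join(
    f"{c:.2f}*{name}" for c, name in zip(model.coef_, ["GB", "Speed_MHz"])
)
print(f"Price_TL = {model.intercept_:.2f} + {terms}")
```

On the noiseless toy data the fit recovers the generating formula exactly; on the real dataset the coefficients would come out of the tuned Polynomial Ridge model instead.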