What is Regression Analysis?
Regression analysis is one of the core concepts in machine learning. It falls under supervised learning, where an algorithm models the relationship between an output variable (y) and one or more independent variables (x). In simpler terms, regression analysis is a machine learning tool that helps us predict an output value from the available data points. It predicts continuous values such as height, weight, temperature, length, price, etc.
Example of how to apply regression analysis in a real-life scenario: suppose a friend asks us to suggest a car with the best mileage. If we want to predict the mileage of each car (y), the input features (x1, x2, …) can be the weight of the car, engine efficiency, tire design, transmission design, etc. By using regression analysis we can predict the mileage of each car and, in turn, give our friend sound advice.
How to evaluate a regression model?
Though there are several metrics to evaluate a regression model, here we will talk about three of the most widely used:
- Coefficient of Determination
- Mean Square Error(MSE)
- Mean Absolute Error
Coefficient of Determination, also called R-squared (R²) – Measures how much of the variance in the dependent variable is explained by the independent variables. The value of R² typically lies between zero and one; the higher the value, the better the regression model fits the data. The statsmodels or scikit-learn packages can be used to calculate R² in Python. Also, follow this link to know more about model selection using the R² measure.
Mean Square Error (MSE) – A measure of how close the fitted line is to the data points, computed as the average of the squared prediction errors. Its value ranges from 0 to ∞; the lower the MSE, the better the model, with 0 being a perfect fit. The scikit-learn package can be used to calculate MSE in Python.
Mean Absolute Error (MAE) – A measure of the errors between paired observations expressing the same phenomenon. It is similar to MSE, but we average the absolute values of the errors instead of their squares, which makes it less sensitive to outliers. Its value ranges from 0 to ∞; the lower the MAE, the better the model, with 0 being a perfect fit. The scikit-learn package can be used to calculate MAE in Python.
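All three metrics above are available in scikit-learn. A minimal sketch, using made-up prediction values purely for illustration:

```python
# Sketch: computing R^2, MSE, and MAE with scikit-learn.
# The y_true / y_pred values below are made-up numbers for illustration.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = [3.0, 5.0, 2.5, 7.0]  # observed values
y_pred = [2.8, 5.4, 2.9, 6.6]  # model predictions

r2 = r2_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(f"R^2: {r2:.3f}, MSE: {mse:.3f}, MAE: {mae:.3f}")
```

Note that MSE penalizes large individual errors more heavily than MAE because of the squaring.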
To study other evaluation metrics, check out the interesting links section at the end of this article.
Types of Regression Analysis –
Various types of regression are used in data science and machine learning. Even though each type has its own specific use case, at the core, all regression methods explore the effect of the independent variables on the dependent variable. Some of the important regression types are given below:
- Linear Regression – Linear regression models the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis) using a best-fit straight line. The variable we want to predict is called the dependent variable, and the variable used to predict it is called the independent variable. If we have only one input variable (x), it is called simple linear regression; if we have multiple input variables, it is called multiple linear regression. To know in-depth about linear regression, follow this link.
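A minimal sketch of simple linear regression with scikit-learn; the data is synthetic (y = 2x + 1 with no noise) just to show the API:

```python
# Fit a simple linear regression on a single input feature.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one input variable (x)
y = np.array([3.0, 5.0, 7.0, 9.0])          # follows y = 2x + 1 exactly

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

prediction = model.predict([[5.0]])  # predict y for an unseen x = 5
```

With multiple columns in X, the same code performs multiple linear regression.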
- Polynomial Regression – To understand polynomial regression, first, we should understand what a polynomial equation is. A polynomial of order 'n' can be written as

y = a₀ + a₁x + a₂x² + … + aₙxⁿ

where the order 'n' is the largest exponent in any of the terms.

Hence, a polynomial regression equation of order 'k' will be represented as

y = b₀ + b₁x + b₂x² + … + bₖxᵏ
In this type of regression, the original variable is transformed into polynomial variables of a given degree and then modeled using a linear model. Hence, if we look at this equation from the coefficients' point of view rather than the 'x' point of view, a polynomial regression equation is still a linear model. It is used in place of simple linear regression when we have to model a non-linear dataset, as it helps us capture the curvilinear relationship between the independent and dependent variables. However, unlike simple linear regression, which uses a best-fit straight line, here the data points are best fitted using a polynomial curve. In short, polynomial regression is a linear model with some modifications that let it fit more of the data points.
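The transform-then-fit idea can be sketched with scikit-learn's PolynomialFeatures; the data here is synthetic (y = x², a curve a straight line cannot fit):

```python
# Polynomial regression: expand x into polynomial features, then fit a
# plain linear model on the expanded features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]])
y = (X ** 2).ravel()  # curvilinear relationship: y = x^2

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

pred = model.predict([[4.0]])  # the degree-2 model recovers y = x^2
```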
- Bayesian Regression – Also called Bayesian linear regression, it is used in cases where we have insufficient data or poorly distributed data. Here we formulate linear regression using probability distributions rather than point estimates. The output y is not estimated as a single value but is assumed to be drawn from a probability distribution – typically a normal (Gaussian) distribution characterized by a mean and variance. The model parameters (the regression coefficients, not the inputs x) are also assumed to come from distributions, and after observing the data they are described by a posterior probability distribution. In problems where we have limited data or some prior knowledge that we want to use in our model, this approach can both incorporate prior information and express our uncertainty. Also, we can improve our initial estimates as we gather more and more data – the essence of the Bayesian approach.
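One way to see the uncertainty in practice is scikit-learn's BayesianRidge, which can return a standard deviation alongside each point prediction. The training data below is synthetic:

```python
# Bayesian linear regression sketch: predictions come with an
# uncertainty estimate (standard deviation), not just a point value.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, size=30)  # noisy line

model = BayesianRidge().fit(X, y)
mean, std = model.predict([[5.0]], return_std=True)
print(f"prediction: {mean[0]:.2f} +/- {std[0]:.2f}")
```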
- Ridge Regression – When the independent variables are highly correlated with each other (multicollinearity), ordinary linear or polynomial regression becomes unstable: small changes in the data can produce large swings in the estimated coefficients. Ridge regression addresses this by introducing a small amount of bias in exchange for better long-term predictions. The amount of bias added is controlled by the ridge regression penalty. Ridge regression is a regularization technique, hence it is also used to reduce model complexity. To learn more about what regularization is, follow this link.
- Lasso Regression – It is another regularization technique similar to ridge regression, and hence it is also used to reduce model complexity. However, it has the added benefit that lasso regression enforces sparsity on the learned weights: some coefficients are driven exactly to zero, effectively performing feature selection. To learn more about what regularization is, follow this link.
- Logistic Regression – One of the most popular machine learning algorithms, it is a classification algorithm used to predict a binary outcome based on a set of independent variables. The logistic regression model predicts categorical outcomes such as 0 or 1, True or False, Yes or No, etc. To know in-depth about logistic regression, follow this link.
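The difference between the two regularized variants above can be sketched with synthetic data in which only the first of five features actually matters:

```python
# Ridge vs. lasso: both shrink coefficients, but lasso can drive the
# irrelevant ones exactly to zero (sparsity).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, size=100)  # 4 irrelevant features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefs:", np.round(ridge.coef_, 3))
print("lasso coefs:", np.round(lasso.coef_, 3))
# lasso typically zeroes out the irrelevant coefficients;
# ridge only shrinks them toward zero
```

The penalty strengths (alpha) here are illustrative choices; in practice they are tuned, e.g. by cross-validation.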
Use cases of regression analysis –
Regression analysis has various use cases spread over multiple domains, such as:
- Marketing – to forecast the pricing and sales of a product and to measure the effectiveness of marketing campaigns.
- Financial Industry – to forecast stock prices and to analyze and evaluate risk by understanding trends in the data.
- Medicine – to forecast the effects and side effects of various medicines and to help develop generic medicines for various diseases.
- Manufacturing – to analyze and evaluate the relationships between various data points in order to improve the efficiency of manufacturing processes.
A few pointers to keep in mind while applying regression analysis –
The most important thing to keep in mind while applying regression analysis is to understand the problem statement correctly, because the choice of model follows from it. For example, if we have to predict a continuous value (a forecasting problem), a regression model such as linear regression is appropriate; if we have a classification problem, logistic regression should be used.
In this article, I tried to explain regression analysis in simple terms. If you have any questions related to the post, put them in the comment section and I will do my best to answer them. Also, do check out the interesting links related to this topic below.
- Ways to evaluate a regression model – https://towardsdatascience.com/ways-to-evaluate-regression-models-77a3ff45ba70
- How to know if data is linear or non-linear – https://vitalflux.com/how-know-data-linear-non-linear/#:~:text=If%20the%20least%20square%20error,when%20dealing%20with%20regression%20problem.
- Machine Learning Project on Linear Regression – https://www.youtube.com/watch?v=iRCaMnR_bpA&t=838s
- Linear Regression vs Logistic Regression – https://www.youtube.com/watch?v=OCwZyYH14uw
- Five applications of Regression Analysis in Business – https://www.newgenapps.com/blogs/business-applications-uses-regression-analysis-advantages/