# Linear Regression

## What is Linear Regression?

Linear regression quantifies the relationship between one or more predictor variables and an outcome variable. It is commonly used for predictive analysis and modeling. For example, it can be used to quantify the relative effects of age, gender, and diet (the predictor variables) on height (the outcome variable). It is also referred to as multiple regression, multivariate regression, ordinary least squares (OLS), or simply regression. This article walks through examples of linear regression, including a simple linear regression example and a multiple linear regression example.

## Types of linear regression

**Simple linear regression:** It is used to estimate the relationship between two quantitative variables. You can use simple linear regression when you want to know:

- How strong the relationship is between two variables (e.g. the relationship between rainfall and soil erosion).
- The value of the dependent variable at a certain value of the independent variable (e.g. the amount of soil erosion at a certain level of rainfall).

**Assumptions:**

- Linearity: the relationship between the independent and dependent variable is linear, so the line of best fit through the data points is a straight line (rather than a curve or some sort of grouping factor).
- Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.
- Normality: the data follow a normal distribution. In practice this is checked on the errors (residuals) of the model.
- Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

If your data do not meet the assumptions of homoscedasticity or normality, you may be able to use a nonparametric test instead, such as the Spearman rank test.
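As a rough illustration of the Spearman rank test mentioned as a nonparametric alternative, the sketch below computes Spearman's rank correlation as the Pearson correlation of the ranks. The function names and the data are made up for this example and are not from the article; it computes only the coefficient, not a significance test.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical data: a curved but strictly increasing relationship.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
rho = spearman_rho(xs, ys)
print(rho)
```

Because the ranks only capture ordering, a monotonic but non-linear relationship like this one still yields a perfect rank correlation.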

## Performing simple linear regression

The formula for simple linear regression is:

$$y = \beta_0 + \beta_1 x + e$$

- $y$ is the predicted value of the dependent variable for any given value of the independent variable ($x$).
- $\beta_0$ is the intercept, the predicted value of $y$ when $x$ is 0.
- $\beta_1$ is the regression coefficient: how much we expect $y$ to change as $x$ increases by one unit.
- $x$ is the independent variable (the variable we expect is influencing $y$).
- $e$ is the **error** of the estimate, or how much variation there is in our estimate of the regression coefficient.

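Linear regression finds the line of best fit through your data by searching for the regression coefficient ($\beta_1$) that minimizes the total error (e) of the model. In ordinary least squares, that total error is the sum of the squared differences between the observed and predicted values.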

While you can perform a linear regression by hand, this is a tedious process, so most people use statistical programs to help them quickly analyze the data.
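To make the fitting step concrete, here is a minimal sketch of the closed-form least-squares solution for a single predictor. The function name `fit_simple_ols` and the rainfall and erosion numbers are made up for illustration; a statistical package would normally do this for you.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on the line y = 1 + 2x.
rainfall = [0.0, 1.0, 2.0, 3.0, 4.0]
erosion = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_ols(rainfall, erosion)
print(b0, b1)  # intercept 1.0, slope 2.0
```

With noisy real data the points would not fall exactly on the line, and the same formulas would return the line with the smallest sum of squared residuals.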

**Multiple linear regression:** It is used to estimate the relationship between two or more independent variables and a dependent variable. You can use multiple linear regression if you want to know:

- How strong the relationship is between two or more independent variables and a dependent variable (e.g. how rainfall, temperature and the amount of fertilizer added affect crop growth).
- The value of the dependent variable at a certain value of the independent variables (for example, the expected yield of a crop at certain levels of precipitation, temperature and fertilizer application).

**Assumptions:**

Multiple linear regression makes all of the same assumptions as simple linear regression:

- **Linearity**: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.
- **Homogeneity of variance (homoscedasticity)**: the size of the error in our prediction doesn’t change significantly across the values of the independent variable.
- **Normality**: the data follow a normal distribution.
- **Independence of observations**: the observations in the dataset were collected using statistically valid methods, and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated ($r^2$ > ~0.6), then only one of them should be used in the regression model.
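The correlation check between independent variables could be sketched as follows. This is an illustrative helper, not a standard routine: the names `pearson_r` and `highly_correlated_pairs` and the sample values are assumptions for this example, and the 0.6 threshold is the rule of thumb quoted above.

```python
import math
from itertools import combinations

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def highly_correlated_pairs(predictors, r2_threshold=0.6):
    """Return (name_a, name_b, r_squared) for pairs above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r2 = pearson_r(a, b) ** 2
        if r2 > r2_threshold:
            flagged.append((name_a, name_b, r2))
    return flagged

# Hypothetical predictor values, made up for illustration.
predictors = {
    "rainfall": [1, 2, 3, 4, 5],
    "temperature": [2, 4, 6, 8, 11],
    "fertilizer": [3, 1, 4, 1, 5],
}
flagged = highly_correlated_pairs(predictors)
for name_a, name_b, r2 in flagged:
    print(f"{name_a} and {name_b}: r^2 = {r2:.2f}")
```

In this made-up data only rainfall and temperature are flagged, so one of the two would be dropped before fitting the model.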

## How to perform a multiple linear regression

The formula for multiple linear regression is:

$$y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \varepsilon$$

- $y$ = the predicted value of the dependent variable
- $\beta_0$ = the y-intercept (the value of y when all other parameters are set to 0)
- $\beta_1 X_1$ = the regression coefficient ($\beta_1$) of the first independent variable ($X_1$) (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value)
- … = do the same for however many independent variables you are testing
- $\beta_n X_n$ = the regression coefficient of the last independent variable
- $\varepsilon$ = model error (a.k.a. how much variation there is in our estimate of y)
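One way to estimate the coefficients of a multiple regression is to solve the normal equations $(X^T X)\beta = X^T y$. The sketch below does this in plain Python with Gaussian elimination; the helper names and the data are assumptions for illustration, and real analyses would normally rely on a statistical package instead.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_multiple_ols(rows, ys):
    """rows: one list of predictor values per observation.
    Returns [b0, b1, ..., bn] with b0 the intercept."""
    design = [[1.0] + list(r) for r in rows]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
ys = [1, 3, 4, 6, 8]
coefs = fit_multiple_ols(rows, ys)
print([round(c, 6) for c in coefs])
```

The singular-system check also shows why perfectly collinear predictors are a problem: the normal equations then have no unique solution.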

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

- The regression coefficients that lead to the smallest overall model error.
- The *t*-statistic of the overall model.
- The associated *p*-value (how likely it is that the *t*-statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables were true).

It then calculates the *t*-statistic and *p*-value for each regression coefficient in the model.
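For a single coefficient, the *t*-statistic is the estimated coefficient divided by its standard error. The sketch below works this out for the slope of a simple regression, using made-up data and an illustrative function name. It stops at the *t*-statistic: turning it into a *p*-value requires the *t* distribution with n − 2 degrees of freedom, which is usually taken from a statistics library.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (slope, standard_error, t) for a simple linear regression."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and residual variance (n - 2 degrees of freedom).
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    residual_variance = sse / (n - 2)
    standard_error = math.sqrt(residual_variance / sxx)
    return slope, standard_error, slope / standard_error

# Hypothetical noisy data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, se, t = slope_t_statistic(xs, ys)
print(f"slope={slope:.3f} se={se:.3f} t={t:.3f}")
```

A larger absolute *t* means the estimated coefficient is large relative to its uncertainty, which corresponds to a smaller *p*-value.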

**Example of simple and multiple linear regression:**

The table below shows some data from the early days of the Italian clothing company Benetton. Each row in the table shows Benetton’s sales for a year and the amount spent on advertising that year. In this case, our outcome of interest is sales: it is what we want to predict. If we use advertising as the predictor variable, linear regression estimates that **Sales = 168 + 23 Advertising**. That is, if advertising expenditure is increased by one million euros, sales are expected to increase by 23 million euros, and with no advertising we would expect sales of 168 million euros.

Linear regression with a single predictor variable is known as *simple regression*. In real-world applications, there is typically more than one predictor variable. Such regressions are called *multiple regression*. Returning to the Benetton example, we can include a year variable in the regression, which gives the result that **Sales = 323 + 14 Advertising + 47 Year**. The interpretation of this equation is that every extra million euros of advertising expenditure is expected to lead to an extra 14 million euros of sales, and that sales grow due to non-advertising factors by 47 million euros per year.
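The two fitted Benetton equations can be turned into small prediction functions, as sketched below. The coefficients are the ones quoted above; the function names and the input values are hypothetical, and the coding of the year variable (assumed here to be a simple year index) is not stated in the article.

```python
def predict_sales_simple(advertising):
    """Sales (million euros) from advertising (million euros): 168 + 23 * Advertising."""
    return 168 + 23 * advertising

def predict_sales_multiple(advertising, year):
    """Sales (million euros): 323 + 14 * Advertising + 47 * Year.
    `year` is assumed to be a year index; the article does not say how it is coded."""
    return 323 + 14 * advertising + 47 * year

# Hypothetical inputs chosen only to show how the equations are read.
print(predict_sales_simple(10))       # 168 + 230 = 398
print(predict_sales_multiple(10, 2))  # 323 + 140 + 94 = 557

# One extra million euros of advertising changes the prediction by the coefficient.
print(predict_sales_simple(11) - predict_sales_simple(10))            # 23
print(predict_sales_multiple(11, 2) - predict_sales_multiple(10, 2))  # 14
```

The advertising coefficient drops from 23 to 14 once year is included, because part of the growth in sales is now attributed to the passage of time rather than to advertising.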