Introduction to regression analysis:

Machine learning tasks can be gathered into the four following categories:

This article focuses on regression analysis. Specifically, it describes the basics of this task and illustrates its main concepts using the California housing dataset.

The structure of this article is the following:

  1. What is regression?
  2. Difference between regression and classification
  3. Types of regression: Due to the large number of regression models, we introduce the most common ones.
  4. How to choose the correct model
  5. Linear model: From all the available regression models, this article focuses on the theory and assumptions of the linear model.
  6. Linear model example: Analyze the California Housing dataset with a linear regression model.
  7. Other regression analysis examples

1. What is regression?

Regression analysis is defined in Wikipedia as:

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the ‘outcome variable’) and one or more independent variables (often called ‘predictors’, ‘covariates’, or ‘features’).

The terms you will often hear in connection with regression analysis are:

  • Dependent variable or target variable: Variable to predict.
  • Independent variable or predictor variable: Variables to estimate the dependent variable.
  • Outlier: Observation that differs significantly from other observations. Outliers should be identified and handled with care, since they can distort the fitted model.
  • Multicollinearity: Situation in which two or more independent variables are highly linearly related.
  • Homoscedasticity or homogeneity of variance: Situation in which the variance of the error term is the same across all values of the independent variables.

Regression analysis is primarily used for two distinct purposes. First, it is widely used for prediction and forecasting, which overlaps with the field of machine learning. Second, it is also used to infer causal relationships between independent and dependent variables.

2. Difference between regression and classification

Regression and classification are both supervised learning methods, which means that they use labeled training data to train their models and make predictions. Therefore, those two tasks are often categorized under the same group in machine learning.

The main difference between them is the output variable. While in regression, the output is numerical or continuous, in classification, the output is categorical or discrete. This defines the way classification and regression evaluate the predictions:

  • Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.
  • Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.
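As a small illustration of the two points above, the sketch below scores a handful of predictions with each metric. The arrays are invented for the example and do not come from any dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: labels are categories, so we count exact matches.
y_true_cls = np.array([0, 1, 1, 0, 1])
y_pred_cls = np.array([0, 1, 0, 0, 1])
accuracy = accuracy_score(y_true_cls, y_pred_cls)  # 4 of 5 labels match

# Regression: targets are continuous, so we measure how far off we are.
y_true_reg = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_reg = np.array([2.5, 5.0, 3.0, 8.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

print(f"accuracy = {accuracy:.2f}, RMSE = {rmse:.3f}")
```

Accuracy would be of little use on the regression arrays, because a continuous prediction almost never matches the true value exactly.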

The following link gathers some loss functions when training both classification and regression models.

There is some overlap between the algorithms for classification and regression; for example:

  • A classification algorithm may predict a continuous value, but the continuous value is in the form of a probability for a class label.
  • A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.

Some algorithms can be used for both classification and regression with small modifications, such as decision trees and artificial neural networks. Other algorithms cannot easily be used for both problem types: linear regression is tied to regression predictive modeling, and logistic regression to classification predictive modeling [1].

3. Types of regression

There are various types of regression used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of the independent variables on the dependent variable. Here we mention some important types of regression:

  1. Linear Regression
  2. Polynomial Regression
  3. Support Vector Regression
  4. Decision Tree Regression
  5. Random Forest Regression
  6. Ridge Regression
  7. Lasso Regression
  8. Logistic Regression (despite its name, used for classification)

Explaining each of them in detail would take several articles, so if the reader is interested in further information about regressors, I recommend reading [2, 3, 4].

4. How to choose the correct regression model?

In my opinion, this is the most difficult task, not only in regression but in machine learning in general.

Even though I think that experience is probably the right answer to this question, some tips [5, 6] are:

  1. Linear models are the most common and most straightforward to use. If you have a continuous dependent variable, linear regression is probably the first type you should consider.
  2. If the dependent variable is continuous and your data shows multicollinearity or has a lot of independent variables, you can try, for example, ridge or lasso models. You can select the final model based on R-squared or RMSE.
  3. If you are working with count data, you can try Poisson, quasi-Poisson, and negative binomial regression.
  4. To avoid overfitting, we can use cross-validation to evaluate the models. Ridge, lasso, and elastic net regression techniques can be used to correct overfitting issues.
  5. Try support vector regression when the relationship between the variables is non-linear.
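To make tips 2 and 4 a bit more concrete, here is a minimal sketch that compares three candidate models by cross-validated RMSE. The data is synthetic (generated with make_regression), and the alpha values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration): 200 samples, 10 features.
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0),
}

# 5-fold cross-validated RMSE for each candidate (lower is better).
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()

for name, value in sorted(cv_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: cross-validated RMSE = {value:.2f}")
```

scikit-learn reports the score as a negative RMSE so that higher is always better, which is why the sign is flipped before printing.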

5. Linear model

The most common model in regression analysis is linear regression. This model finds the relationship between the independent and dependent variables by fitting a linear equation. The most common method for fitting this regression line is using least-squares, which calculates the best-fitting line that minimizes the sum of the squares of the vertical deviations from each data point to the line.
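The least-squares idea can be sketched in a few lines of NumPy. The five points below are made up and lie exactly on the line y = 1 + 2x, so the fit should recover that intercept and slope.

```python
import numpy as np

# Toy data lying exactly on the line y = 1 + 2x (made up for the example).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix with a column of ones for the intercept.
A = np.column_stack([np.ones_like(x), x])

# Solve min ||A @ beta - y||^2, i.e. the least-squares problem.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

With noisy data the same call returns the line that minimizes the sum of squared vertical deviations rather than an exact fit.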

Building a linear regression model is only half of the work. In order to actually be usable in practice, the model should conform to the assumptions of linear regression [7,8]:

  1. Linear in parameters.
  2. The sample is representative of the population at large.
  3. The independent variables are measured with no error.
  4. The number of observations must be greater than the number of independent variables.
  5. No multicollinearity among the independent variables.
  6. The mean of residuals is zero.
  7. Normality of residuals.
  8. The independent variables and residuals are uncorrelated.
  9. Homoscedasticity of residuals (the variance of the residuals is constant across observations).
  10. No autocorrelation of residuals (Applicable especially for time series data). Mathematically, the variance-covariance matrix of the errors is diagonal.

It is difficult to fulfill all these assumptions, so practitioners have developed a variety of methods to maintain some or all of these desirable properties in real-world settings. The articles in [9, 10] explain some examples.

6. Linear model example

To showcase some of the concepts previously introduced, we fit a linear regression model to the California housing dataset. Here is the code along with a brief explanation for each block.

First, we import the required libraries.

Next, we load the housing data from the scikit-learn library:

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

To learn more about the features, we print california_housing_dataset.DESCR:

- MedInc median income in block
- HouseAge median house age in block
- AveRooms average number of rooms
- AveBedrms average number of bedrooms
- Population block population
- AveOccup average house occupancy
- Latitude house block latitude
- Longitude house block longitude

These are the eight independent variables from which we can predict the value of a house. House prices are given by the variable AveHouseVal, which is our dependent variable.

We now load the data into a pandas dataframe using pd.DataFrame.
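A minimal sketch of the loading and DataFrame steps could look like this. The helper names bunch_to_frame and load_housing_frame are hypothetical, and the target column is called AveHouseVal only to follow this article's naming (scikit-learn itself labels the target MedHouseVal). Because fetch_california_housing downloads the data on first use, the helper is tried here on a tiny stand-in with invented numbers.

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.utils import Bunch


def bunch_to_frame(dataset, target_name="AveHouseVal"):
    """Return features and target of a scikit-learn Bunch as one DataFrame."""
    frame = pd.DataFrame(dataset.data, columns=dataset.feature_names)
    frame[target_name] = dataset.target
    return frame


def load_housing_frame():
    """Fetch the dataset (network access on first call) and build the frame."""
    return bunch_to_frame(fetch_california_housing())


# Tiny stand-in Bunch with invented numbers, so the helper can be tried offline.
toy = Bunch(data=[[8.3, 41.0], [7.2, 21.0]],
            feature_names=["MedInc", "HouseAge"],
            target=[4.5, 3.6])
toy_frame = bunch_to_frame(toy)
print(toy_frame)
```

Calling load_housing_frame() instead would return the full dataset with its eight feature columns plus the target.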

Data preprocessing

After loading the data, it’s good practice to check whether there are any missing values. We count the number of missing values (none for this dataset) for each feature using isnull().
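The check itself is a one-liner. The sketch below runs it on a small invented frame with one deliberate gap, standing in for the housing data.

```python
import numpy as np
import pandas as pd

# Invented frame with one gap, standing in for the housing data.
df = pd.DataFrame({
    "MedInc": [8.3, np.nan, 7.2],
    "HouseAge": [41.0, 21.0, 52.0],
})

# isnull() marks missing cells; sum() counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```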

Exploratory Data Analysis

Let’s first plot the target variable AveHouseVal as a function of Latitude and Longitude. The resulting plot should trace the outline of the state of California, USA. As observed, houses located close to the sea are more expensive than the rest.

Next, we create a correlation matrix that measures the linear relationships between the variables.


  • To fit a linear regression model, we select those features that have a high correlation with our dependent variable AveHouseVal. By looking at the correlation matrix we can see that MedInc has a strong positive correlation with AveHouseVal (0.69). The other two variables with the highest correlation are HouseAge and AveRooms.
  • An important point when selecting features for a linear regression model is to check for multicollinearity. For example, the features Latitude and Longitude are strongly correlated (0.92 in magnitude), so we should not include both of them simultaneously in our regression model. Since the correlation between the variables MedInc, HouseAge and AveRooms is not high, we consider those three variables for our regression model.
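The two checks described above can be sketched as follows. The helper correlated_pairs and its 0.8 threshold are hypothetical choices, and the data is synthetic (column b is built as a noisy copy of column a), not the housing data.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, features, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame[features].corr().abs()
    pairs = []
    for i, first in enumerate(features):
        for second in features[i + 1:]:
            if corr.loc[first, second] > threshold:
                pairs.append((first, second))
    return pairs


# Synthetic stand-in data (not the housing data): b is nearly a copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})
demo["target"] = 2.0 * demo["a"] + rng.normal(size=500)

# Correlation of each feature with the target, strongest first.
target_corr = demo.corr()["target"].drop("target")
order = target_corr.abs().sort_values(ascending=False).index
print(target_corr.reindex(order))

# Pairs of features that should not enter the model together.
flagged = correlated_pairs(demo, ["a", "b", "c"])
print(flagged)
```

On the housing frame, the same helper would be called with the eight feature names and is expected to flag the Latitude and Longitude pair.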

Training and testing the model

We use scikit-learn’s LinearRegression to train our model on the training set, and then evaluate it on both the training and test sets.
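A minimal sketch of this step is shown below. The helper fit_and_score, the 80/20 split and the random seed are assumptions for illustration, and the three-column input is synthetic data standing in for the selected features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(X, y, test_size=0.2, random_state=0):
    """Fit on the training split only, then report RMSE and R^2 on both splits."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    splits = {"train": (X_train, y_train), "test": (X_test, y_test)}
    scores = {}
    for name, (X_part, y_part) in splits.items():
        pred = model.predict(X_part)
        scores[name] = {
            "rmse": float(np.sqrt(mean_squared_error(y_part, pred))),
            "r2": float(r2_score(y_part, pred)),
        }
    return model, scores


# Synthetic stand-in for three selected features (invented data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.5, 0.3, -0.7]) + 0.1 * rng.normal(size=300)
model, scores = fit_and_score(X_demo, y_demo)
print(scores)
```

A test score much worse than the training score would be a sign of overfitting.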

Distributions of the residuals:

Let’s plot the distribution of the residuals so that we can verify the assumptions of the linear model afterwards.
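As a rough sketch of what this looks like in code, the block below computes the residuals of a fitted model and prints a text histogram. The data is synthetic, and a plotting library could of course be used instead of the text output.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (invented): linear signal plus Gaussian noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)  # observed minus predicted

# Text histogram: counts over 10 equal-width bins, one '#' per two points.
counts, edges = np.histogram(residuals, bins=10)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:6.2f} to {right:6.2f}  {'#' * int(count // 2)}")
print(f"mean residual = {residuals.mean():.2e}")
```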

Verification of the assumptions:

  1. Linear in parameters (Okay)
  2. The sample is representative of the population at large (Supposed)
  3. The independent variables are measured with no error (Supposed)
  4. The number of observations must be greater than number of independent variables (Okay)
  5. No multicollinearity within independent variables (Okay for the studied variables)
  6. The mean of residuals is zero (Okay)
  7. Normality of residuals (No)
  8. The independent variables and residuals are uncorrelated (Not checked)
  9. Homoscedasticity of residuals (The variance of the residuals is constant across observations) (No)
  10. No autocorrelation of residuals (Applicable especially for time series data). Mathematically, the variance–covariance matrix of the errors is diagonal (Not checked)
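Several of the items above can be checked numerically. The sketch below is one possible way to do it, not the procedure used for the verdicts in the list: the helper residual_diagnostics is hypothetical, the homoscedasticity check is a simple rank-correlation heuristic rather than a formal test, and the data is synthetic.

```python
import numpy as np
from scipy import stats


def residual_diagnostics(X, fitted, residuals):
    """Summarise a few residual checks; interpretation is left to the reader."""
    X = np.asarray(X, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    # Assumption 7: D'Agostino-Pearson test (a small p-value suggests non-normality).
    normality_p = stats.normaltest(residuals).pvalue
    # Assumption 8: Pearson correlation of each feature with the residuals.
    feature_corr = [float(np.corrcoef(X[:, j], residuals)[0, 1])
                    for j in range(X.shape[1])]
    # Assumption 9 (heuristic): does the residual size grow with the fitted value?
    rho, _ = stats.spearmanr(fitted, np.abs(residuals))
    # Assumption 10: Durbin-Watson statistic (near 2 suggests no autocorrelation).
    durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
    return {
        "mean": float(residuals.mean()),  # Assumption 6
        "normality_p": float(normality_p),
        "feature_corr": feature_corr,
        "abs_resid_vs_fitted": float(rho),
        "durbin_watson": float(durbin_watson),
    }


# Synthetic data that satisfies the assumptions by construction.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(500, 3))
y_demo = 1.0 + X_demo @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=500)
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
fitted = A @ beta
report = residual_diagnostics(X_demo, fitted, y_demo - fitted)
print(report)
```

Note that for an ordinary least-squares fit with an intercept, the residual mean and the feature correlations are zero on the training data by construction, so those two checks are more informative on held-out data.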

The linear regression model is not the best model for this dataset, so we will look at other models in future articles.

Just as a hint, here is a good pointer on how to continue:

  • One of the big problems with non-normality in the residuals and heteroscedasticity is that the amount of error in your model is not consistent across the full range of your observed data. This means that the model's predictive ability is not the same across the full range of the dependent variable. Transforming the dependent variable can help to correct this, but it makes the interpretation more difficult. If the square-root transformation does not fully normalize your data, you can also try an inverse transformation. The strength of transformations tends to go from 1. Logarithmic, 2. Square Root, 3. Inverse (1/x).
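One way to try a transformation of the dependent variable in scikit-learn is TransformedTargetRegressor, which fits on the transformed target and maps predictions back to the original scale. The sketch below uses a logarithmic transformation on synthetic data whose logarithm is linear in the features; this is an invented setup, not a result on the housing data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic positive target whose log is linear in X (invented data).
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
y = np.exp(1.0 + X @ np.array([0.8, -0.5]) + 0.1 * rng.normal(size=400))

# Plain linear fit on the raw target.
plain = LinearRegression().fit(X, y)

# Fit on log(y); predictions are mapped back with exp automatically.
logged = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp).fit(X, y)

plain_r2 = plain.score(X, y)
logged_r2 = logged.score(X, y)
print(f"R^2 on y, plain fit:      {plain_r2:.3f}")
print(f"R^2 on y, log-target fit: {logged_r2:.3f}")
```

The coefficients of the inner model (logged.regressor_.coef_) now describe effects on log(y) rather than on y, which is the interpretation cost mentioned above.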

7. Other regression analysis examples

If the readers are interested, I recommend trying the following datasets:

About the Boston housing dataset, I recommend reading this other article.

Also, in this web link, there is a collection of some thematically related datasets that are suitable for different types of regression analysis.

Finally, many datasets can be found in:

Thank you for reading !!


[1] Machine Learning Mastery, Difference between classification and regression in machine learning

[2] Analytics Vidhya, 7 regression techniques you should know!

[3] ListenData, 15 types of regression in data science

[4] Javatpoint, Regression analysis in machine learning

[5] ListenData, How to choose the correct regression model?

[6] Statistics by Jim, Choosing the Correct Type of Regression Analysis

[7] R statistics, Assumptions of linear regression

[8] University of Colorado, Linear model assumptions and diagnosis

[9] Medium, Linear Regression and its assumptions

[10] Medium, How to Tackle Your Next Regression Problem
