When plotting data, some values of a variable may lie within the expected range when viewed on their own, yet fall far from the expected value once the variable is plotted against another variable.
Once we understand what causes these outliers, we can handle them by dropping the affected records, imputing the values, or leaving them as they are.
- To analyze a set of values, we must make sure that all values in the same column are on the same scale.
- For example, if the data contains the top speeds of cars from different companies, the whole column should be expressed in either meters/sec or miles/sec, not a mix of the two.
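The scale check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `unit` field, and the helper `to_meters_per_sec` are hypothetical, and meters per second is picked arbitrarily as the target scale.

```python
# Conversion factors to meters per second.
# 1 mile = 1609.344 m and 1 hour = 3600 s, so both factors are exact.
TO_METERS_PER_SEC = {
    "m/s": 1.0,
    "km/h": 1000 / 3600,
    "mph": 1609.344 / 3600,
}


def to_meters_per_sec(value, unit):
    """Convert a speed to meters per second; reject unknown units."""
    try:
        return value * TO_METERS_PER_SEC[unit]
    except KeyError:
        raise ValueError(f"unknown speed unit: {unit!r}") from None


# Hypothetical records with top speeds reported in mixed units.
cars = [
    {"model": "car_a", "top_speed": 250.0, "unit": "km/h"},
    {"model": "car_b", "top_speed": 155.0, "unit": "mph"},
    {"model": "car_c", "top_speed": 70.0, "unit": "m/s"},
]

# After conversion, the whole column shares one scale and can be compared.
top_speed_mps = [to_meters_per_sec(c["top_speed"], c["unit"]) for c in cars]
print(top_speed_mps)
```

Raising an error on an unknown unit, rather than passing the value through, keeps a mislabeled record from silently ending up on the wrong scale.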
We can analyze the data using the following approach:
- “Bi means two and variate means variable, so here there are two variables. The analysis is related to causes and the relationship between the two variables.”
- This type of data involves two different variables. The analysis deals with causes and relationships, and its goal is to find out how the two variables are related.
- Example: Temperature and ice cream sales in the summer season form a bivariate data set. The table shows that the two variables are positively related: as the temperature increases, sales also increase. Bivariate data analysis thus involves comparisons, relationships, causes, and explanations. The two variables are often plotted on the X and Y axes of a graph to make the data easier to understand, with one variable treated as independent and the other as dependent.
There are three types of bivariate analysis.
1. Bivariate Analysis of two Numerical Variables (Numerical-Numerical):
A scatter plot represents individual pieces of data using dots. These plots make it easier to see if two variables are related to each other. The resulting pattern indicates the type (linear or non-linear) and strength of the relationship between two variables.
In a scatterplot, we describe the overall pattern with descriptions of direction, form, and strength. Deviations from the pattern are still called outliers.
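Direction and strength can also be summarized with a single number. The sketch below computes the Pearson correlation coefficient by hand for the temperature and ice cream example; the data values and the helper name `pearson_r` are made up for illustration. Note that this coefficient only measures linear association, so a strong non-linear pattern in a scatter plot can still produce a value near zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Made-up observations: temperature (independent) and sales (dependent).
temperature_c = [20, 25, 30, 35, 40]
ice_cream_sales = [2000, 2500, 4000, 5000, 6100]

r = pearson_r(temperature_c, ice_cream_sales)
# The sign gives the direction; a magnitude near 1 means a strong linear pattern.
print(f"r = {r:.3f}")
```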
2. Bivariate Analysis of two categorical Variables (Categorical-Categorical): To find the relationship between two categorical variables, we can use the following methods:
- Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. Each cell shows the count or count% of observations in that combination of row and column categories.
- Stacked Column Chart: This method is more of a visual form of a Two-way table.
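A two-way table of count and count% can be built with nothing more than a counter. The following is a minimal sketch: the two categorical variables (fuel type and transmission) and the records are hypothetical, and count% is taken relative to the grand total, which is one of several reasonable conventions (row or column percentages are equally common).

```python
from collections import Counter


def two_way_table(pairs):
    """Return (counts, percent) as nested dicts: table[row][column]."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    total = sum(counts.values())
    # Counter returns 0 for combinations that never occur.
    table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
    percent = {r: {c: 100 * counts[(r, c)] / total for c in cols} for r in rows}
    return table, percent


# Hypothetical observations: (fuel type, transmission) for eight cars.
records = [
    ("petrol", "manual"), ("petrol", "automatic"),
    ("diesel", "manual"), ("petrol", "manual"),
    ("diesel", "automatic"), ("diesel", "manual"),
    ("petrol", "manual"), ("diesel", "manual"),
]

counts, percent = two_way_table(records)
for fuel in counts:
    print(fuel, counts[fuel], percent[fuel])
```

The same nested counts are what a stacked column chart would draw: one column per row category, split into segments by the column categories.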
3. Bivariate Analysis of one numerical and one categorical variable (Numerical-Categorical)
There are several techniques for handling outliers. We cover four of them here:
a. Dropping/trimming the outlier data: You omit the outlier values.
b. Capping the outlier data: You replace the outlier values with the upper or lower bound. Outliers that lie above the upper bound are replaced with the upper bound value, and outliers that lie below the lower bound are replaced with the lower bound value.
c. Replacing with new values: You replace the outlier values with the mean, median, or mode.
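d. Applying a log transformation: You convert every value to its logarithm, which compresses large values and pulls extreme points closer to the rest of the data. (This works only on numeric data, and the values must be positive.)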
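The four techniques can be sketched as follows. The text does not say how the upper and lower bounds are chosen, so this example assumes the common 1.5 × IQR rule; that rule, the helper names, and the sample data are all assumptions made for illustration.

```python
import math
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper bounds from the k * IQR rule (an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trim(values, lower, upper):
    """a. Drop values outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def cap(values, lower, upper):
    """b. Replace values beyond a bound with that bound."""
    return [min(max(v, lower), upper) for v in values]


def impute_median(values, lower, upper):
    """c. Replace outliers with the median of the remaining values."""
    median = statistics.median(trim(values, lower, upper))
    return [v if lower <= v <= upper else median for v in values]


def log_transform(values):
    """d. Natural log of each value; every value must be positive."""
    return [math.log(v) for v in values]


# Hypothetical sample with one obvious outlier (95).
data = [10, 12, 11, 13, 12, 14, 11, 95]
lower, upper = iqr_bounds(data)

print("trimmed:", trim(data, lower, upper))
print("capped: ", cap(data, lower, upper))
print("imputed:", impute_median(data, lower, upper))
print("logged: ", [round(v, 2) for v in log_transform(data)])
```

Here the median is computed from the non-outlier values so that the outlier does not influence its own replacement; using the mean or mode instead only changes one line.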