Mean, Median, and Mode (now with Python!)


Mean, median, and mode are the most commonly used measures of central tendency in descriptive statistics. Everyone learns them in school…so I hope you paid attention back then, because they’ll be very relevant in this article. But even if you didn’t, don’t worry, that’s what I am here for!

Even though these topics are basic, they are extremely handy in the real world. If you’ve done any Data Science, Data Analysis, or Machine Learning project, then you’ll know exactly what I am talking about.

These three are brought up all the time and for good reasons.

And don’t worry, this article won’t just teach you about mean, median, and mode, but also why they are used so much in Data Science and related fields, and how to implement them in Python.

Mean

The mean, also called the average, is the sum of a series divided by the number of items in that series.

The definition above is correct and concise…but it’s a bit too boring. So let’s add in LaTeX and Python to change the mood a bit.

Suppose you have 5 children, and you give each one of them a few apples. In a Python list, this would look something like this.

[Image: Python list of apples]
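
The list itself might look like the sketch below. The values are an assumption, reconstructed from the figures quoted in this article (a total of 244, one child with 239 apples, two children with 2 apples, and one child with 1, which leaves one child with none). The order of the values and the name apples are assumptions too.

```python
# Apples given to each of the 5 children.
# Values reconstructed from the totals quoted in the article; the order is assumed.
apples = [2, 239, 0, 2, 1]

print(apples)
```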

If we apply the definition of mean given above to this list, we get something like this.

[Image: Mean of apples]
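
Written out in LaTeX, the calculation could look like this. The symbols (x-bar for the mean, x_i for each value, n for the number of values) are an assumed notation, and the values assume the reconstructed list 2, 239, 0, 2, 1.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
        = \frac{2 + 239 + 0 + 2 + 1}{5}
        = \frac{244}{5}
        = 48.8
```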

The sum of all the apples distributed among the 5 children comes out to 244. If we divide this by the number of children, we get the average number of apples: how many apples each child would have if they were shared out equally. In this example, it is 48.8.

We can certainly calculate the mean by hand by adding up all the values in the series and dividing the sum by the number of values, but in real-life problems the series would be far too large to work on manually. So let’s use Python to calculate the mean. There are two methods for this.

  1. Write the code yourself
  2. Use the NumPy module

Writing by yourself –

[Image: Mean of apples using Python]
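
A minimal sketch of the hand-written version, using only built-in functions. The list values are assumed (reconstructed from the totals quoted in this article), and the names apples and mean are my own.

```python
# Assumed values, reconstructed from the totals quoted in the article.
apples = [2, 239, 0, 2, 1]

# Mean = sum of the values divided by how many values there are.
mean = sum(apples) / len(apples)

print(mean)  # 48.8
```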

Using NumPy –

[Image: Mean of apples using NumPy]
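
With NumPy, the same calculation is a single call to np.mean. This is a sketch that assumes NumPy is installed and uses the same reconstructed list values.

```python
import numpy as np

# Assumed values, reconstructed from the totals quoted in the article.
apples = [2, 239, 0, 2, 1]

# np.mean accepts a plain Python list and returns a NumPy float.
mean = np.mean(apples)

print(mean)  # 48.8
```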

As you can see, both methods yield the same result as the manual calculation: 48.8. And this is a problem.

One of the children had 239 apples (like the guy in the math questions with 400 watermelons), which is an absurd amount of apples. This number increases the total by quite a lot, which in turn increases the mean. In other words, a few unusually large (or small) values in a series can pull the mean towards them. That’s why the mean we got doesn’t accurately represent how many apples a typical child has.

This is a huge problem, and to deal with it we use the median and mode.

Median

The median is the middle value of a series sorted in ascending order.

Getting the middle value in a series with an odd number of total items is easy. In the example I gave earlier, the median is 2. But how can we get the median for a series with an even number of total items? For such a series, we identify the middle two values and take their mean.

Let’s visualize this better using Python and LaTeX.

Suppose a new child joined the group of children from the previous example and I gave them 3 apples. Now I have two series: apples1, the series without the 6th child, and apples2, the series with the 6th child. If I write them in Python, they would look something like this.

[Image: Lists of apples]
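
The two lists might look like this. The first five values are assumed (reconstructed from the totals quoted in this article, in an arbitrary order); the 3 at the end of apples2 is the 6th child.

```python
# Without the 6th child (values assumed, order arbitrary).
apples1 = [2, 239, 0, 2, 1]

# With the 6th child, who got 3 apples.
apples2 = [2, 239, 0, 2, 1, 3]

print(apples1)
print(apples2)
```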

If I sort them they would look something like this.

[Image: Sorted lists of apples]
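
Sorting can be sketched with the built-in sorted function, which returns a new list in ascending order. The list values are assumed, and the names sorted1 and sorted2 are my own.

```python
apples1 = [2, 239, 0, 2, 1]        # assumed values, without the 6th child
apples2 = [2, 239, 0, 2, 1, 3]     # with the 6th child, who got 3 apples

# sorted() leaves the original lists untouched and returns sorted copies.
sorted1 = sorted(apples1)
sorted2 = sorted(apples2)

print(sorted1)  # [0, 1, 2, 2, 239]
print(sorted2)  # [0, 1, 2, 2, 3, 239]
```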

Now that I have both the series sorted, I just need to pick the middle value from both of them. For the first series, it would be 2. And for the second series, it would be the mean of 2 and 2 (because this series has an even number of items), that is 2.

Of course, I can do this manually by writing the lists down by hand and sorting them, but as we learned earlier, this is not the best approach. So let’s use Python to calculate the median. And just like with the mean, there are two primary methods for calculating the median.

  1. Write the code yourself
  2. Use the NumPy module

Writing by yourself –

Let’s first create a function called median for this purpose. Once we have this, we’ll feed both the series into it to get the desired results.

[Image: Function of Median]
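
One possible way to write such a function is sketched below. It follows the definition given earlier (sort, then take the middle value, or the mean of the two middle values for an even count). The error for an empty series is my own addition.

```python
def median(series):
    """Return the middle value of a series.

    For an even number of items, return the mean of the two middle values.
    """
    ordered = sorted(series)  # sorted copy; the input is left untouched
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty series is undefined")

    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle value.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2
```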

Now that the function is created, let’s apply it to the two lists.

[Image: Median using Python]
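
Applied to the two lists, this could look like the sketch below. A compact median helper is included so that the snippet runs on its own, and the list values are assumed.

```python
def median(series):
    # Compact helper, included so that this snippet runs on its own.
    ordered = sorted(series)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


apples1 = [2, 239, 0, 2, 1]        # assumed values, without the 6th child
apples2 = [2, 239, 0, 2, 1, 3]     # with the 6th child, who got 3 apples

median1 = median(apples1)
median2 = median(apples2)

print(median1)  # 2
print(median2)  # 2.0
```

The second result is a float because it is the mean of the two middle values.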

Using NumPy –

[Image: Median using NumPy]
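
With NumPy, np.median does the sorting and the middle-value logic in one call. This sketch assumes NumPy is installed and uses the assumed list values.

```python
import numpy as np

apples1 = [2, 239, 0, 2, 1]        # assumed values, without the 6th child
apples2 = [2, 239, 0, 2, 1, 3]     # with the 6th child, who got 3 apples

# np.median handles both odd and even lengths and returns a float.
median1 = np.median(apples1)
median2 = np.median(apples2)

print(median1)  # 2.0
print(median2)  # 2.0
```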

As you can see, the outputs from both methods match the result we got manually. And it’s nowhere near as big as 48.8, so that’s a good thing. Unlike the mean, the median is barely affected by a few extreme values, so it gives a much more reliable picture of how many apples a typical child has.

Both mean and median can be used to get a picture of the type of data we are dealing with. If the mean and median aren’t too far apart, the data is probably fairly symmetric with no extreme outliers. But if the difference is significant, like in the example with 5 children, it suggests that the data is skewed or has a few outliers. In the example I took, that outlier was 239.
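
To see this check in action, here is a small sketch that compares the mean and median with and without the outlier. It uses Python’s built-in statistics module rather than the hand-written code or NumPy, and the list values are assumed.

```python
from statistics import mean, median

with_outlier = [2, 239, 0, 2, 1]   # assumed values, including the child with 239
without_outlier = [2, 0, 2, 1]     # same list with the 239 removed

# A large gap between mean and median hints at skew or outliers.
print(mean(with_outlier), median(with_outlier))        # 48.8 2
print(mean(without_outlier), median(without_outlier))  # 1.25 1.5
```

Once the 239 is removed, the two measures land close together.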

Mode

Mode is probably the simplest to understand out of the three concepts. It’s nothing but the item with the maximum number of occurrences in a series.

If we go back to the example of the 5 children with apples, you’ll see that the number 2 occurs twice, as two children have 2 apples each. Thus the mode for this series is 2. But suppose a 6th child joined in and I gave them 1 apple. Now 2 children have 2 apples and 2 children have 1 apple. In such a case, there are two modes, 1 and 2, as both have the same number of occurrences.

And again…doing this manually is very inefficient, so let’s use Python for this. Just like with mean and median, there are two ways to calculate the mode in Python, but this time we’ll use Pandas instead of NumPy, as NumPy doesn’t have a built-in function for the mode.

Writing the code by yourself –

Let’s create a function by the name of mode which takes in a series as a parameter.

[Image: Function of Mode]
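
One way such a function could be written is sketched below. It counts occurrences with collections.Counter and returns a list, so that ties (two or more modes) are all reported. Returning a sorted list and raising on an empty series are my own design choices.

```python
from collections import Counter


def mode(series):
    """Return a sorted list of the most frequent value(s) in a series."""
    if len(series) == 0:
        raise ValueError("mode of an empty series is undefined")

    counts = Counter(series)          # value -> number of occurrences
    highest = max(counts.values())    # the largest occurrence count

    # Keep every value that reaches the highest count, so ties are preserved.
    return sorted(value for value, count in counts.items() if count == highest)
```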

Now let’s use this function on the series from our example.

[Image: Mode using Python]
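
Applied to the example, this could look like the sketch below. A compact mode helper is included so that the snippet runs on its own. The list values are assumed, and apples3 is a hypothetical name for the series where the 6th child got 1 apple.

```python
from collections import Counter


def mode(series):
    # Compact helper, included so that this snippet runs on its own.
    counts = Counter(series)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


apples1 = [2, 239, 0, 2, 1]        # assumed values, 5 children
apples3 = [2, 239, 0, 2, 1, 1]     # hypothetical name: 6th child got 1 apple

modes1 = mode(apples1)
modes3 = mode(apples3)

print(modes1)  # [2]
print(modes3)  # [1, 2]
```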

Using Pandas –

[Image: Mode using Pandas]
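
With Pandas, the mode method of a Series returns every mode, sorted, as a new Series. This sketch assumes Pandas is installed; the list values are assumed and apples3 is a hypothetical name for the series with the extra child.

```python
import pandas as pd

apples1 = pd.Series([2, 239, 0, 2, 1])       # assumed values, 5 children
apples3 = pd.Series([2, 239, 0, 2, 1, 1])    # hypothetical name: 6th child got 1 apple

# Series.mode() returns a Series, so convert it to a list for printing.
modes1 = apples1.mode().tolist()
modes3 = apples3.mode().tolist()

print(modes1)  # [2]
print(modes3)  # [1, 2]
```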

As you can see, both methods yield the same results as the manual one, meaning that we have working methods for calculating the mode. And just like the median, the mode isn’t pulled to one side by a few large values, which makes it more robust than the mean.

Both median and mode are common choices for filling in null values in your data, because they aren’t skewed by a few large values. The median suits numeric columns, while the mode also works for categorical ones.

TLDR: using mean, median, and mode together can give you a very good idea of the type of data you are working with. So it’s a good habit to check these three before you proceed with anything else.
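
As a rough sketch of what that replacement could look like in Pandas, the snippet below fills one missing value with the median and, for comparison, with the mean. The series is made up for illustration (the assumed apple values plus one missing entry), and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Made-up series: the assumed apple values plus one missing entry.
apples = pd.Series([2, 239, np.nan, 2, 1, 0])

# median() and mean() skip missing values by default.
filled_median = apples.fillna(apples.median())
filled_mean = apples.fillna(apples.mean())

print(filled_median.tolist())  # [2.0, 239.0, 2.0, 2.0, 1.0, 0.0]
print(filled_mean.tolist())    # [2.0, 239.0, 48.8, 2.0, 1.0, 0.0]
```

Filling with the median gives the missing child 2 apples, while filling with the mean gives them 48.8, which is far from what most of the children actually have.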

