Understanding Measures of Centrality: A Deep Dive into Mean, Median, and Mode
by @oumaymaabassi

November 20th, 2024


One of the most important steps of any data-related project is Exploratory Data Analysis (EDA). It is crucial to explore the distribution of the data to understand it and efficiently decide on the next steps. A straightforward way to explore data distribution is to study its central tendencies through measures of centrality. In this article, we’ll explore the three primary measures of centrality: Mean, Median, and Mode. We’ll discuss their strengths and weaknesses, along with practical examples using SQL and Python.

The Mean: The Average Joe

The mean, often referred to as the average, is calculated by summing all the values in a dataset and dividing by the number of values. It’s a straightforward way to find a central value.


Let’s say we have a table named Sales with a column Revenue. The following query will give us the mean revenue per sale:


SELECT AVG(Revenue) AS MeanRevenue
FROM Sales;


When using Python, we can obtain the same result by importing Pandas and running the following code:


import pandas as pd

# Load your data into a pandas DataFrame; 'sales.csv' is a placeholder file name
df = pd.read_csv('sales.csv')
mean_revenue = df['Revenue'].mean()

Strengths and Weaknesses

The mean leaves no man behind, meaning every data point is taken into consideration. Although this allows it to give us a holistic view of the data, it makes it highly susceptible to outliers.
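
To see how much a single outlier can move the mean, here is a minimal sketch. It uses Python's built-in statistics module rather than Pandas so that it runs without a data file, and the revenue figures are made up purely for illustration:

```python
from statistics import mean

# Hypothetical revenue figures for five sales (illustrative values only)
revenues = [100, 120, 110, 130, 90]
mean_without_outlier = mean(revenues)

# Add one unusually large sale and recompute the mean
revenues_with_outlier = revenues + [5000]
mean_with_outlier = mean(revenues_with_outlier)

print(mean_without_outlier)  # 110
print(mean_with_outlier)     # 925
```

One extra sale out of six is enough to move the mean from 110 to 925, a value that none of the typical sales comes close to.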


The Median: The Middle Ground

The median is the middle value when the data is sorted, meaning there is an equal number of data points before and after it.

Note: If there’s an even number of observations, the median is obtained by averaging the two middle values.
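
As a quick illustration of the odd and even cases, here is a small sketch using Python's built-in statistics module (the numbers are arbitrary example values, not taken from any real dataset):

```python
from statistics import median

odd_count = [30, 10, 20]        # sorted: 10, 20, 30
even_count = [40, 10, 30, 20]   # sorted: 10, 20, 30, 40

# Odd number of observations: the single middle value is returned
median_odd = median(odd_count)

# Even number of observations: the two middle values (20 and 30) are averaged
median_even = median(even_count)

print(median_odd)   # 20
print(median_even)  # 25.0
```

The input does not need to be sorted beforehand; the function sorts a copy of the data internally.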

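We can obtain the median revenue using the following SQL query. The window-function form shown here, with OVER(), is the syntax SQL Server expects; other dialects differ (PostgreSQL, for example, uses PERCENTILE_CONT as an ordered-set aggregate without OVER()):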


-- The window function returns the median on every row; DISTINCT collapses it to one value
SELECT DISTINCT
    PERCENTILE_CONT(0.5)
        WITHIN GROUP (ORDER BY Revenue) OVER () AS MedianRevenue
FROM Sales;


Using Python, however, obtaining the median is much more straightforward:


median_revenue = df['Revenue'].median()

Strengths and Weaknesses

Since the median depends only on the order of the data points, it is largely unaffected by outliers, making it a good indicator of the center when the data is not symmetrically distributed. However, this strength is also its weakness: because it ignores how large or small the other values are, it does not reflect the whole dataset.
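
Both sides of this trade-off can be shown with one small sketch. It uses Python's built-in statistics module, and the two lists are invented example data that differ only in their largest value:

```python
from statistics import mean, median

typical = [90, 100, 110, 120, 130]
with_outlier = [90, 100, 110, 120, 5000]  # largest value replaced by an extreme one

median_typical = median(typical)
median_outlier = median(with_outlier)
mean_typical = mean(typical)
mean_outlier = mean(with_outlier)

print(median_typical, median_outlier)  # 110 110
print(mean_typical, mean_outlier)      # 110 1084
```

The median stays at 110 in both cases, which is the robustness we want, but it also means the median alone cannot tell these two datasets apart. The mean, by contrast, jumps from 110 to 1084.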


The Mode: The Most Frequent Value

The mode is the value that appears most frequently in a dataset. It’s particularly useful for categorical data where we want to know the most common category.


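To find the mode in SQL, we can execute the following query. Note that LIMIT 1 returns a single row, so if several values tie for the highest frequency, only one of them is reported: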


SELECT Revenue, COUNT(*) AS Frequency
FROM Sales
GROUP BY Revenue
ORDER BY Frequency DESC
LIMIT 1;


And for the Pythonistas out there, it’s even simpler:


# mode() returns a Series, because several values can tie for most frequent
mode_revenue = df['Revenue'].mode()

Strengths and Weaknesses

The mode is useful for providing categorical insights, making it ideal for identifying the most common category or categories (bimodal or multimodal) in non-numeric datasets. However, it has limitations: if all values are unique, there may be no mode at all, and it tends to be less relevant for continuous data compared to the mean or median.
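
The three situations described above (a single mode, several modes, and all values unique) can be sketched with Python's built-in statistics.multimode, which returns every most-frequent value as a list. The payment-method labels and prices below are hypothetical example data:

```python
from statistics import multimode

# One clear winner: "card" appears three times
payments = ["card", "cash", "card", "transfer", "cash", "card"]
single_mode = multimode(payments)

# Two categories tie for most frequent (bimodal)
tied_payments = ["card", "cash", "card", "cash", "transfer"]
two_modes = multimode(tied_payments)

# Every value is unique, so every value is reported as a "mode"
unique_prices = [10.5, 20.1, 30.7]
no_real_mode = multimode(unique_prices)

print(single_mode)   # ['card']
print(two_modes)     # ['card', 'cash']
print(no_real_mode)  # [10.5, 20.1, 30.7]
```

When the result contains as many entries as the data itself, that is a sign the mode carries little information, which is typically the case for continuous measurements.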


The most efficient way to understand a dataset’s central tendencies is by considering all three measures of centrality. Each provides unique insights into the data distribution, allowing us to build a more comprehensive understanding of its shape.


  • If mean ≈ median ≈ mode, the data is likely symmetrically distributed (think bell-shaped curve).
  • If mean > median, it suggests a right-skewed (positively skewed) distribution, where higher values pull the mean upwards.
  • If mean < median, it indicates a left-skewed (negatively skewed) distribution, meaning lower values pull the mean down.


By observing all three measures, we get a clearer understanding of the data. If there’s a large difference between the mean and median, that could be a red flag for outliers or a skewed distribution, which would require further analysis or potential data transformations.
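
The comparison between mean and median can be turned into a rough first check. The sketch below is only a heuristic: the function name skew_hint and the 5% tolerance (relative to the data range) are assumptions made for this example, not an established rule, and the datasets are invented:

```python
from statistics import mean, median

def skew_hint(values, rel_tol=0.05):
    """Rough skew hint from comparing mean and median (heuristic only)."""
    m, med = mean(values), median(values)
    spread = max(values) - min(values)
    # Treat the two as "approximately equal" if they differ by less than
    # rel_tol of the data range (an arbitrary threshold for illustration)
    if spread == 0 or abs(m - med) <= rel_tol * spread:
        return "roughly symmetric"
    return "right-skewed" if m > med else "left-skewed"

symmetric = [1, 2, 3, 4, 5]                  # mean 3, median 3
right_tail = [1, 2, 2, 3, 3, 3, 4, 50]       # mean 8.5, median 3.0
left_tail = [1, 47, 48, 48, 49, 49, 49, 50]  # mean 42.625, median 48.5

print(skew_hint(symmetric))   # roughly symmetric
print(skew_hint(right_tail))  # right-skewed
print(skew_hint(left_tail))   # left-skewed
```

A hint like this only flags where to look next; a histogram or a proper skewness statistic would be the natural follow-up before deciding on any transformation.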


Conclusion: The Bigger Picture

Exploring central tendencies is a critical part of exploratory data analysis. The mean, median, and mode each serve different purposes, and understanding when and how to use them is essential for interpreting data correctly. Whether you’re working with continuous or categorical data, applying these measures helps uncover patterns and insights that guide the next steps, whether that means cleaning, transforming, or modeling.