Understanding Mean, Median, and Mode in Machine Learning | DevDuniya

In machine learning, understanding measures of central tendency is fundamental. These statistics summarize where the values of a dataset are centered, and they support many aspects of model building and analysis. Let’s explore three key measures: mean, median, and mode.

Mean

The mean, often referred to as the average, is calculated by summing all values in a dataset and then dividing the sum by the total number of values.

Formula:

Mean = (Sum of all values) / (Number of values)

Example:

import numpy as np

num_list = [54, 3, 12, 65, 34, 23, 16, 9, 3, 8, 12, 3]
mean = np.mean(num_list) 
print("Mean =", mean) 

Output:

Mean = 20.166666666666668

Use Cases of Mean:

  • Serves as a building block in many statistical and machine learning calculations.
  • Useful for understanding the average value of a dataset.
  • Caveat: sensitive to outliers, because a single extreme value can shift the mean significantly.

Median

The median is the middle value in a dataset when the data is arranged in ascending or descending order. If the dataset has an even number of values, the median is the average of the two middle values.

Example:

import numpy as np

num_list = [54, 3, 12, 65, 34, 23, 16, 9, 3, 8, 12, 3]
median = np.median(num_list)
print("Median =", median)

Output:

Median = 12.0
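The odd/even rule in the definition can also be worked through by hand. The following is a minimal sketch in plain Python; the function name manual_median is hypothetical and used only for illustration, and it is not how np.median is implemented internally.

```python
def manual_median(values):
    """Return the median of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("manual_median() requires at least one value")
    ordered = sorted(values)      # arrange the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd count: a single middle value exists
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


num_list = [54, 3, 12, 65, 34, 23, 16, 9, 3, 8, 12, 3]
print("Median =", manual_median(num_list))
```

With 12 values, the two middle entries of the sorted list are both 12, so the sketch agrees with the NumPy result of 12.0.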

Use Cases of Median:

  • Less sensitive to outliers compared to the mean.
  • Provides a more robust measure of central tendency when dealing with skewed distributions or datasets with extreme values.
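To make the difference in outlier sensitivity concrete, here is a small sketch that appends one extreme value to the same list and compares both measures before and after. The value 1000 is an arbitrary, hypothetical outlier chosen for illustration, and the sketch uses Python's standard-library statistics module rather than NumPy to stay dependency-free.

```python
from statistics import mean, median

num_list = [54, 3, 12, 65, 34, 23, 16, 9, 3, 8, 12, 3]
# 1000 is a hypothetical outlier added purely for illustration
with_outlier = num_list + [1000]

mean_before, mean_after = mean(num_list), mean(with_outlier)
median_before, median_after = median(num_list), median(with_outlier)

print("Mean:  ", round(mean_before, 2), "->", round(mean_after, 2))
print("Median:", median_before, "->", median_after)
```

The mean rises from roughly 20.17 to roughly 95.54, while the median stays at 12, which is why the median is usually the safer summary when extreme values are present.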

Mode

The mode is the value that appears most frequently in a dataset. A dataset can have multiple modes (multimodal) or no mode if all values occur with equal frequency.

Example:

from scipy import stats

num_list = [54, 3, 12, 65, 34, 23, 16, 9, 3, 8, 12, 3]
mode = stats.mode(num_list) 
print("Mode =", mode)

Output:

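Mode = ModeResult(mode=array([3]), count=array([3]))

Note: this output comes from older SciPy releases. In newer releases (the default changed around SciPy 1.11), stats.mode returns scalars rather than one-element arrays for one-dimensional input, so the printed result looks more like ModeResult(mode=3, count=3). The mode itself is available as mode.mode and its frequency as mode.count.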
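stats.mode reports a single value even when several values tie. To cover the multimodal and no-mode cases from the definition, a minimal sketch using collections.Counter from the standard library could look like the following. The helper name find_modes is hypothetical, and treating "all values equally frequent" as "no mode" is a convention taken from the definition above, not a universal rule.

```python
from collections import Counter


def find_modes(values):
    """Return all modes of values, or [] if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []                      # empty input has no mode
    highest = max(counts.values())
    # If there are several distinct values and all share the same
    # frequency, follow the convention that the data has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    # Otherwise return every value that reaches the highest frequency.
    return [value for value, c in counts.items() if c == highest]


print(find_modes([54, 3, 12, 65, 34, 23, 16, 9, 3, 8, 12, 3]))  # one mode
print(find_modes([1, 2, 2, 3, 3]))                               # two modes
print(find_modes([1, 2, 3]))                                     # no mode
```

For the three calls above, the sketch returns [3], [2, 3], and an empty list respectively.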

Use Cases of Mode:

  • Identifying the most common value or category in a dataset.
  • Useful in categorical data analysis and understanding the most frequent occurrences.
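Because the mode only counts occurrences, it also works for non-numeric data, where a mean or median is undefined. Here is a short sketch with a made-up list of color labels; the data and variable names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical categorical feature values
colors = ["red", "blue", "green", "blue", "red", "blue"]

counts = Counter(colors)
# most_common(1) returns a list with the single most frequent (value, count) pair
most_common_value, frequency = counts.most_common(1)[0]

print("Mode =", most_common_value, "| count =", frequency)
```

Here "blue" appears three times, more than any other label, so it is reported as the mode.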

Choosing the Right Measure

Which measure of central tendency to use depends on the characteristics of the dataset and the goals of the analysis.

  • Mean: Suitable for symmetric distributions without significant outliers.
  • Median: More robust for skewed distributions or datasets with outliers.
  • Mode: Useful for identifying the most frequent values or categories.

By understanding these key measures and their appropriate use cases, you can gain valuable insights into your data and make more informed decisions in your machine learning projects.
