


7 Essential Data Quality Checks with Pandas
Image by Author

 

As a data professional, you're probably familiar with the cost of poor data quality. For all data projects, big or small, you should perform essential data quality checks.

There are dedicated libraries and frameworks for data quality assessment. But if you're a beginner, you can run simple yet important data quality checks with pandas. And this tutorial will teach you how.

We'll use the California Housing Dataset from scikit-learn for this tutorial.

 

Fetch the Dataset

We'll use the California housing dataset from scikit-learn's datasets module. The dataset contains over 20,000 records with eight numeric features and a target median house value.

Let's read the dataset into a pandas dataframe df:

from sklearn.datasets import fetch_california_housing
import pandas as pd

# Fetch the California housing dataset
data = fetch_california_housing()

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target column
df['MedHouseVal'] = data.target

 

For a detailed description of the dataset, run data.DESCR as shown:
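
print(data.DESCR)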

 

Output of data.DESCR

 

Let's get some basic information about the dataset:
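
df.info()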

 

Here's the output:

Output >>>

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

 

Since the features are numeric, let's also get the summary statistics using the describe() method:
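
df.describe()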

 

Output of df.describe()

 

1. Check for Missing Values

Real-world datasets often have missing values. To analyze the data and build models, you need to handle these missing values.

To ensure data quality, you should check that the fraction of missing values is within a specific tolerance limit. You can then impute the missing values using suitable imputation strategies.

The first step, therefore, is to check for missing values across all features in the dataset.

This code checks for missing values in each column of the dataframe df:

# Check for missing values in the DataFrame
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

 

The result is a pandas Series that shows the count of missing values for each column:

Output >>>

Missing Values:
MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

 

As seen, there are no missing values in this dataset.
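
If a column did have missing values, a minimal sketch of the tolerance check and a simple median imputation might look like this (the 5% tolerance limit is an assumption for illustration):

# Fraction of missing values per column
missing_frac = df.isnull().mean()

# Flag columns that exceed an assumed 5% tolerance limit
tolerance = 0.05
print(missing_frac[missing_frac > tolerance])

# Impute missing values in numeric columns with the column median
df = df.fillna(df.median(numeric_only=True))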

 

2. Check for Duplicate Records

Duplicate records in the dataset can skew analysis. So you should check for and drop duplicate records as needed.

Here's the code to identify and return duplicate rows in df. If there are any duplicate rows, they will be included in the result:

# Check for duplicate rows in the DataFrame
duplicate_rows = df[df.duplicated()]
print("Duplicate Rows:")
print(duplicate_rows)

 

The result is an empty dataframe, meaning there are no duplicate records in the dataset:

Output >>>

Duplicate Rows:
Empty DataFrame
Columns: [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, MedHouseVal]
Index: []
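
Since there are no duplicates, there's nothing to drop here. But as a sketch of the follow-up step, drop_duplicates() removes duplicate rows; the subset version below uses Latitude and Longitude only as an illustrative choice of columns:

# Drop exact duplicate rows, keeping the first occurrence
df_deduped = df.drop_duplicates(keep='first')

# Or treat rows as duplicates based on a subset of columns
# (Latitude/Longitude is an illustrative choice, not a recommendation)
df_deduped_subset = df.drop_duplicates(subset=['Latitude', 'Longitude'], keep='first')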

 

3. Check Data Types

When analyzing a dataset, you'll often need to transform or scale one or more features. To avoid unexpected errors when performing such operations, it's important to check whether the columns are all of the expected data type.

This code checks the data types of each column in the dataframe df:

# Check the data types of each column in the DataFrame
data_types = df.dtypes
print("Data Types:")
print(data_types)

 

Here, all numeric features are of float data type, as expected:

Output >>>

Data Types:
MedInc         float64
HouseAge       float64
AveRooms       float64
AveBedrms      float64
Population     float64
AveOccup       float64
Latitude       float64
Longitude      float64
MedHouseVal    float64
dtype: object
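
If you need to go a step further, you can compare each column's dtype against an expected mapping and coerce mismatches. This is a minimal sketch; the expected_dtypes dict here is an assumption for this dataset:

# Expected data type for each column (assumed for this dataset)
expected_dtypes = {col: 'float64' for col in df.columns}

# Report and coerce columns whose dtype differs from the expectation
for col, expected in expected_dtypes.items():
    if str(df[col].dtype) != expected:
        print(f"{col}: expected {expected}, got {df[col].dtype}")
        # Coerce to numeric; entries that can't be parsed become NaN
        df[col] = pd.to_numeric(df[col], errors='coerce')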

 

4. Check for Outliers

Outliers are data points that are significantly different from the other points in the dataset. If you remember, we ran the describe() method on the dataframe.

Based on the quartile values and the maximum value, you could've identified that a subset of features contains outliers. Specifically, these features:

  • MedInc
  • AveRooms
  • AveBedrms
  • Population

One approach to handling outliers is to use the interquartile range (IQR), the difference between the 75th and 25th percentiles. If Q1 is the 25th percentile and Q3 is the 75th percentile, then the interquartile range is given by Q3 - Q1.

We then use the quartiles and the IQR to define the interval [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]. All points outside this interval are considered outliers:

columns_to_check = ['MedInc', 'AveRooms', 'AveBedrms', 'Population']

# Function to find records with outliers
def find_outliers_pandas(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers

# Find records with outliers for each specified column
outliers_dict = {}

for column in columns_to_check:
    outliers_dict[column] = find_outliers_pandas(df, column)

# Print the records with outliers for each column
for column, outliers in outliers_dict.items():
    print(f"Outliers in '{column}':")
    print(outliers)
    print("\n")

 

Outliers in 'AveRooms' column | Truncated output of the outliers check
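
How you handle the flagged records depends on the problem. One common option, sketched below, is to cap values at the IQR bounds with clip() instead of dropping records:

# Cap outliers at the IQR bounds instead of dropping them
for column in columns_to_check:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    df[column] = df[column].clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)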

 

5. Validate Numerical Ranges

An important check for numeric features is to validate their range. This ensures that all observations of a feature take on values in an expected range.

This code validates that the 'MedInc' values fall within an expected range and identifies data points that don't meet this criterion:

# Check the numerical value range for the 'MedInc' column
valid_range = (0, 16)
value_range_check = df[~df['MedInc'].between(*valid_range)]
print("Value Range Check (MedInc):")
print(value_range_check)

 

You can try this for other numeric features of your choice. Here, we see that all values in the 'MedInc' column lie in the expected range:

Output >>>

Value Range Check (MedInc):
Empty DataFrame
Columns: [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, MedHouseVal]
Index: []
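
You can apply the same pattern to other features. For example, since these are California records, the coordinates should fall within the state's rough bounding box; the bounds below are approximate assumptions:

# Approximate bounding box for California (assumed bounds)
lat_out_of_range = df[~df['Latitude'].between(32.0, 42.0)]
lon_out_of_range = df[~df['Longitude'].between(-125.0, -114.0)]
print(f"Records outside latitude range: {len(lat_out_of_range)}")
print(f"Records outside longitude range: {len(lon_out_of_range)}")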

 

6. Check Cross-Column Consistency

Most datasets contain related features. So it's important to include checks based on logically relevant relationships between columns (or features).

While features, taken individually, may take on values in the expected range, the relationship between them may be inconsistent.

Here is an example for our dataset: in a valid record, 'AveRooms' should typically be greater than or equal to 'AveBedrms'.

# AveRooms should not be smaller than AveBedrms
invalid_data = df[df['AveRooms'] < df['AveBedrms']]
print("Invalid Records (AveRooms < AveBedrms):")
print(invalid_data)

 

In the California housing dataset we're working with, we see that there are no such invalid records:

Output >>>

Invalid Records (AveRooms < AveBedrms):
Empty DataFrame
Columns: [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, MedHouseVal]
Index: []
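
As you add more of these checks, it can help to keep them organized. One way, sketched below, is to collect the rules as named boolean conditions and report violations per rule (the second rule is an illustrative assumption):

# Named cross-column consistency rules
rules = {
    'AveRooms >= AveBedrms': df['AveRooms'] >= df['AveBedrms'],
    'AveOccup > 0': df['AveOccup'] > 0,  # illustrative assumption
}

for name, condition in rules.items():
    violations = df[~condition]
    print(f"Rule '{name}': {len(violations)} violating record(s)")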

 

7. Check for Inconsistent Data Entry

Inconsistent data entry is a common data quality issue in most datasets. Examples include:

  • Inconsistent formatting in datetime columns 
  • Inconsistent logging of categorical variable values 
  • Recording of readings in different units

In our dataset, we've verified the data types of columns and identified outliers. But you can also run checks for inconsistent data entry.

Let's whip up a simple example to check whether all the date entries have consistent formatting.

Here we use regular expressions in conjunction with pandas' apply() function to check whether all date entries are in the YYYY-MM-DD format:

import pandas as pd
import re

data = {'Date': ['2023-10-29', '2023-11-15', '23-10-2023', '2023/10/29', '2023-10-30']}
df = pd.DataFrame(data)

# Define the expected date format
date_format_pattern = r'^\d{4}-\d{2}-\d{2}$'  # YYYY-MM-DD format

# Function to check if a date value matches the expected format
def check_date_format(date_str, date_format_pattern):
    return re.match(date_format_pattern, date_str) is not None

# Apply the format check to the 'Date' column
date_format_check = df['Date'].apply(lambda x: check_date_format(x, date_format_pattern))

# Identify and retrieve entries that don't follow the expected format
non_adherent_dates = df[~date_format_check]

if not non_adherent_dates.empty:
    print("Entries that don't follow the expected format:")
    print(non_adherent_dates)
else:
    print("All dates are in the expected format.")

 

This returns the entries that don't follow the expected format:

Output >>>

Entries that don't follow the expected format:
         Date
2  23-10-2023
3  2023/10/29
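
An alternative to the regex approach is to let pandas parse the dates with a strict format and coerce failures to NaT; the non-conforming entries then show up as NaT:

# Parse with a strict format; non-matching entries become NaT
parsed_dates = pd.to_datetime(df['Date'], format='%Y-%m-%d', errors='coerce')
print(df[parsed_dates.isna()])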

 

Wrapping Up

In this tutorial, we went over common data quality checks with pandas.

When you're working on smaller data analysis projects, these data quality checks with pandas are a good starting point. Depending on the problem and the dataset, you can include additional checks.

If you're interested in learning data analysis, check out the guide 7 Steps to Mastering Data Wrangling with Pandas and Python.
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.


