Drop Non-numeric Columns From Pandas DataFrame

As a developer working on a data analysis project in Python for one of my clients, I often encounter datasets that contain a mix of numeric and non-numeric columns. Sometimes, I need to perform calculations that only work with numeric data, and in these situations, I need to filter out all non-numeric columns from my DataFrame.

In this article, I’ll share three effective methods to drop non-numeric columns from a Pandas DataFrame. These techniques have saved me countless hours in my data analysis projects, and I’m confident they’ll help you too.

Let’s dive in!

Method 1: Use select_dtypes() to Keep Only Numeric Columns

The simplest way to drop non-numeric columns is to use the pandas select_dtypes() method, which filters columns based on their data types.

Let’s start with a simple example:

import pandas as pd
import numpy as np

# Creating a sample DataFrame with mixed column types
data = {
    'Name': ['John Smith', 'Sarah Johnson', 'Mike Williams', 'Emily Davis'],
    'Age': [32, 28, 45, 36],
    'Salary': [75000, 82000, 95000, 67000],
    'Department': ['Marketing', 'IT', 'Finance', 'HR'],
    'Performance_Score': [4.2, 3.8, 4.5, 4.0],
    'Date_Joined': ['2019-05-12', '2020-02-15', '2017-11-01', '2021-08-23']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df.dtypes)
print(df.head())

# Keep only numeric columns
numeric_df = df.select_dtypes(include=['number'])
print("\nDataFrame with only numeric columns:")
print(numeric_df.dtypes)
print(numeric_df.head())

In this example, select_dtypes(include=['number']) returns a new DataFrame containing only the numeric columns. The dtypes portion of the output looks like this:

Original DataFrame:
Name                  object
Age                    int64
Salary                 int64
Department            object
Performance_Score    float64
Date_Joined           object
dtype: object

DataFrame with only numeric columns:
Age                    int64
Salary                 int64
Performance_Score    float64
dtype: object

This method is clean and efficient, perfect for quick data preprocessing.
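
The include argument also accepts other dtype selectors, which helps when "numeric" should mean something narrower or broader. The following is a minimal sketch using hypothetical sample data (the column names are made up for illustration). It shows that 'number' does not pick up boolean columns unless 'bool' is requested as well:

```python
import pandas as pd

# Hypothetical sample data, not taken from the example above
df = pd.DataFrame({
    'Item': ['Desk', 'Chair', 'Lamp'],
    'Quantity': [4, 12, 7],
    'Unit_Price': [249.99, 89.5, 34.0],
    'In_Stock': [True, False, True],
})

# 'number' covers integer and float columns, but not booleans
numeric_only = df.select_dtypes(include='number')

# Add 'bool' explicitly if True/False flags should be kept as well
numeric_and_bool = df.select_dtypes(include=['number', 'bool'])

# Narrow the selection to floating-point columns only
floats_only = df.select_dtypes(include='float')

print(list(numeric_only.columns))
print(list(numeric_and_bool.columns))
print(list(floats_only.columns))
```

Whether boolean flags count as numeric depends on the calculation you plan to run, so it is worth deciding that explicitly.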

Method 2: Use pd.to_numeric() with errors='coerce'

Another approach I frequently use combines the pandas pd.to_numeric() function with some DataFrame manipulation. This method is particularly useful when you want to try converting columns to a numeric format before deciding whether to drop them.

Here’s how to implement it:

import pandas as pd

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4],
    'B': ['5', '6', '7', '8'],
    'C': ['a', 'b', 'c', 'd'],
    'D': [10.5, 11.2, 12.8, 9.7],
    'E': ['10.5', '11.2', 'twelve', '9.7']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df.dtypes)

# Function to check if a column can be converted to numeric
def is_numeric(column):
    # Try to convert to numeric, with errors='coerce'
    numeric_column = pd.to_numeric(column, errors='coerce')
    # Check if the column has any non-NaN values after conversion
    return not numeric_column.isna().all()

# Filter columns
numeric_columns = [col for col in df.columns if is_numeric(df[col])]
numeric_df = df[numeric_columns]

print("\nDataFrame with only numeric columns:")
print(numeric_df.dtypes)

The output would show:

Original DataFrame:
A      int64
B     object
C     object
D    float64
E     object
dtype: object

DataFrame with only numeric columns:
A      int64
B     object
D    float64
E     object
dtype: object

In this case, columns A, B, D, and E are kept because at least one value in each can be converted to a number, while column C is dropped because none of its values can. Column E is kept even though it contains 'twelve', since is_numeric() only requires one convertible value.

Note that columns B and E are still object types in the result. If you want to convert them, you can add a step (the 'twelve' entry in column E becomes NaN):

# Convert all remaining columns to numeric types
# (apply() returns a new DataFrame, which avoids a SettingWithCopyWarning)
numeric_df = numeric_df.apply(pd.to_numeric, errors='coerce')
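
If keeping a column on the strength of a single convertible value is too lenient, the check can be tightened. The sketch below is one possible variant, not part of the original method: is_fully_numeric() is a hypothetical helper that keeps a column only when every non-missing value converts cleanly.

```python
import pandas as pd

def is_fully_numeric(column):
    """Return True only if every non-missing value converts to a number."""
    converted = pd.to_numeric(column, errors='coerce')
    # A value failed to parse if it was present before but is NaN afterwards
    failed = converted.isna() & column.notna()
    return not failed.any()

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ['5', '6', '7', '8'],
    'C': ['a', 'b', 'c', 'd'],
    'E': ['10.5', '11.2', 'twelve', '9.7'],
})

strict_columns = [col for col in df.columns if is_fully_numeric(df[col])]
# Safe to convert without coercion, since every value is known to parse
strict_df = df[strict_columns].apply(pd.to_numeric)

print(strict_columns)
print(strict_df.dtypes)
```

With this stricter rule, column E is dropped along with column C because of its 'twelve' entry.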

Method 3: Use DataFrame.drop() with Custom Function

The third method involves creating a custom function to identify non-numeric columns and then using the drop() method to remove them:

import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {
    'Product': ['Laptop', 'Smartphone', 'Tablet', 'Monitor'],
    'Price': [1200, 800, 350, 250],
    'Stock': [45, 120, 75, 30],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Accessories'],
    'Rating': [4.5, 4.8, 4.2, 4.0],
    'Last_Updated': ['2023-01-15', '2023-02-10', '2023-01-28', '2023-02-05']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df.dtypes)
print(df.head())

# Function to check if a column is NOT numeric
def is_non_numeric_column(df, column):
    return not np.issubdtype(df[column].dtype, np.number)

# Get list of non-numeric columns
non_numeric_cols = [col for col in df.columns if is_non_numeric_column(df, col)]

# Drop non-numeric columns
numeric_df = df.drop(columns=non_numeric_cols)

print("\nDataFrame after dropping non-numeric columns:")
print(numeric_df.dtypes)
print(numeric_df.head())

The output:

Original DataFrame:
Product          object
Price             int64
Stock             int64
Category         object
Rating          float64
Last_Updated     object
dtype: object

DataFrame after dropping non-numeric columns:
Price       int64
Stock       int64
Rating    float64
dtype: object

This method gives you more control over the filtering process and can be customized based on specific requirements.
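
One caveat: np.issubdtype() expects NumPy dtypes, so it can raise a TypeError on pandas extension types such as categorical or nullable integer columns. A minimal sketch of the same idea using pd.api.types.is_numeric_dtype() instead, with a hypothetical variation of the product data (note that this check also treats boolean columns as numeric):

```python
import pandas as pd

# Hypothetical data with pandas extension dtypes mixed in
df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Monitor'],
    'Price': [1200, 350, 250],
    'Stock': pd.array([45, None, 30], dtype='Int64'),  # nullable integers
    'Category': pd.Categorical(['Electronics', 'Electronics', 'Accessories']),
})

# is_numeric_dtype() understands both NumPy and pandas extension dtypes
non_numeric_cols = [
    col for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
]

numeric_df = df.drop(columns=non_numeric_cols)
print(non_numeric_cols)
print(list(numeric_df.columns))
```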

Bonus Method: Use DataFrame.describe() to Identify Numeric Columns

Here’s a bonus method that leverages the fact that describe() by default only includes numeric columns:

import pandas as pd

# Sample DataFrame
data = {
    'Employee': ['John Smith', 'Sarah Johnson', 'Robert Brown'],
    'Department': ['Sales', 'Marketing', 'IT'],
    'Years_Employed': [5, 3, 7],
    'Salary': [65000, 58000, 72000],
    'Performance': [0.85, 0.92, 0.78]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df.head())

# Get numeric columns using describe()
numeric_columns = df.describe().columns
numeric_df = df[numeric_columns]

print("\nDataFrame with only numeric columns:")
print(numeric_df.head())

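This method is quick to write, although it is less flexible than the previous approaches and computes summary statistics only to discard them. Recent pandas versions also include datetime columns in the default describe() output, so check the result if your data contains dates.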
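
One edge case is worth checking before relying on this trick: when a DataFrame has no numeric columns at all, describe() summarizes the non-numeric columns instead, so the selection keeps everything. A minimal sketch with a hypothetical text-only DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with no numeric columns
text_only = pd.DataFrame({
    'Employee': ['John Smith', 'Sarah Johnson'],
    'Department': ['Sales', 'Marketing'],
})

# describe() falls back to summarizing the text columns
fallback_columns = list(text_only.describe().columns)
print(fallback_columns)

# select_dtypes() returns an empty selection in the same situation
selected_columns = list(text_only.select_dtypes(include='number').columns)
print(selected_columns)
```

For that reason, select_dtypes() is the safer choice when the input might not contain any numeric data.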

Real-World Application: Data Preprocessing for Machine Learning

Let’s apply what we’ve learned to a more practical scenario. Suppose we’re preparing a dataset for a machine learning model that predicts housing prices in California:

import pandas as pd
import numpy as np

# Sample California housing dataset
data = {
    'Address': ['123 Main St, San Francisco', '456 Oak Ave, Los Angeles', '789 Pine Rd, San Diego'],
    'Zip_Code': ['94102', '90001', '92101'],
    'Price': [1250000, 950000, 875000],
    'Bedrooms': [3, 4, 3],
    'Bathrooms': [2.5, 3.0, 2.0],
    'Square_Feet': [1850, 2200, 1650],
    'Year_Built': [1985, 2002, 1992],
    'Neighborhood': ['Downtown', 'Hollywood', 'Gaslamp'],
    'School_Rating': [8.5, 7.2, 8.9]
}

housing_df = pd.DataFrame(data)
print("Original Housing DataFrame:")
print(housing_df.head())

# Machine learning models typically need numeric features
# Method 1: Using select_dtypes()
numeric_housing_df = housing_df.select_dtypes(include=['number'])
print("\nNumeric Housing Data for ML Model:")
print(numeric_housing_df.head())

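# We might want to keep the Zip_Code as it could be relevant
# Let's try to convert it to numeric
# (note: this drops any leading zeros and treats the code as a quantity)
housing_df['Zip_Code'] = pd.to_numeric(housing_df['Zip_Code'], errors='coerce')

# Now get all numeric columns (checking for numeric dtypes directly
# is more reliable than excluding object columns)
numeric_columns = [col for col in housing_df.columns if pd.api.types.is_numeric_dtype(housing_df[col])]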
final_housing_df = housing_df[numeric_columns]

print("\nFinal Housing Data for ML Model (including Zip Code):")
print(final_housing_df.head())

In this example, we’ve filtered out non-numeric columns from a housing dataset, which is a common preprocessing step before training models in machine learning.

Handling Special Cases

Sometimes, you might encounter columns that look numeric but are stored as strings, or columns with mixed numeric and non-numeric values. Here’s how to handle these special cases:

import pandas as pd

# DataFrame with mixed types
data = {
    'A': ['1', '2', '3', '4'],  # Strings that look like numbers
    'B': ['1.5', '2.5', 'three', '4.5'],  # Mixed numeric and non-numeric
    'C': [1, 2, 3, 4],  # Pure numeric
    'D': ['a', 'b', 'c', 'd']  # Pure non-numeric
}

df = pd.DataFrame(data)
print("Original DataFrame with mixed types:")
print(df.dtypes)
print(df.head())

# Try to convert all columns to numeric
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])
    except ValueError:
        # If column contains any non-numeric values, try coercing
        numeric_values = pd.to_numeric(df[col], errors='coerce')
        # If at least 75% of values can be converted to numeric, keep the column
        if numeric_values.count() / len(numeric_values) >= 0.75:
            df[col] = numeric_values
        else:
            # Mark for dropping
            df[col] = None

# Drop columns that are now all None
df = df.dropna(axis=1, how='all')

print("\nDataFrame after handling special cases:")
print(df.dtypes)
print(df.head())

This approach allows you to handle columns that are mostly numeric but might contain a few non-numeric values.
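
The same logic can be wrapped in a reusable function that leaves the input DataFrame untouched. This is a sketch under stated assumptions: coerce_mostly_numeric() and its min_fraction parameter are hypothetical names introduced here, and missing values in the original data count against the threshold.

```python
import pandas as pd

def coerce_mostly_numeric(df, min_fraction=0.75):
    """Return a new DataFrame with mostly numeric columns converted and the rest dropped."""
    kept = {}
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors='coerce')
        # count() ignores NaN, so this is the share of values that parsed
        if len(converted) > 0 and converted.count() / len(converted) >= min_fraction:
            kept[col] = converted
    return pd.DataFrame(kept, index=df.index)

df = pd.DataFrame({
    'A': ['1', '2', '3', '4'],           # strings that look like numbers
    'B': ['1.5', '2.5', 'three', '4.5'],  # mixed numeric and non-numeric
    'C': [1, 2, 3, 4],                   # pure numeric
    'D': ['a', 'b', 'c', 'd'],           # pure non-numeric
})

lenient = coerce_mostly_numeric(df)                   # keeps B, with one NaN
strict = coerce_mostly_numeric(df, min_fraction=1.0)  # drops B as well

print(list(lenient.columns))
print(list(strict.columns))
```

Raising min_fraction to 1.0 keeps only columns in which every value converts, which is the safer default when NaN values would cause problems downstream.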

I’ve found these methods invaluable in my data preprocessing workflows. The right choice depends on your specific needs:

Use select_dtypes() for quick and clean filtering
Use pd.to_numeric() with error handling for more flexible conversion attempts
Use custom functions with drop() when you need more control over the filtering criteria

Remember, dropping non-numeric columns is just one step in the data preprocessing pipeline. Always ensure that you’re not discarding important information that could be encoded differently (like one-hot encoding categorical variables) before training your models.
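
As an illustration of that last point, here is a minimal sketch of one-hot encoding with pd.get_dummies(), using a trimmed-down, hypothetical version of the housing data. Instead of discarding the Neighborhood column, it is turned into numeric indicator columns:

```python
import pandas as pd

# Trimmed-down, hypothetical version of the housing data
housing_df = pd.DataFrame({
    'Price': [1250000, 950000, 875000],
    'Bedrooms': [3, 4, 3],
    'Neighborhood': ['Downtown', 'Hollywood', 'Gaslamp'],
})

# Replace the text column with one 0/1 indicator column per category
encoded_df = pd.get_dummies(housing_df, columns=['Neighborhood'], dtype=int)

print(list(encoded_df.columns))
print(encoded_df.head())
```

Every column in encoded_df is now numeric, so the neighborhood information survives the preprocessing step rather than being dropped.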