How to Remove All Non-numeric Characters in Pandas

When working with real-world data, I often encounter messy text containing a mix of numbers and other characters. Sometimes, I need to extract the numeric values from these strings for calculations or analysis.

Pandas makes this data cleaning process much easier, but there’s no single built-in function called “remove_non_numeric()”. Instead, we need to use a combination of methods to get the job done.

In this tutorial, I’ll share four different methods for removing all non-numeric characters in Pandas, along with practical examples from my decade of Python experience.

Methods to Remove All Non-numeric Characters in Pandas

Now, let me walk through four ways to remove all non-numeric characters in Pandas, with a working example for each.

Method 1: Use str.replace() with Regular Expressions

The simplest way to remove non-numeric characters is to use Pandas’ string method str.replace() with a regular expression pattern.

Here’s how you can do it:

import pandas as pd

# Sample DataFrame with mixed string data
df = pd.DataFrame({
    'Product_ID': ['ABC123', 'DEF456', 'GHI789'],
    'Price': ['$99.99', '€49.95', '£29.99'],
    'Phone': ['(555) 123-4567', '555.987.6543', '555-321-7890']
})

# Remove non-numeric characters from the Phone column
df['Phone_Clean'] = df['Phone'].str.replace(r'\D', '', regex=True)

print(df)

Output:

  Product_ID   Price           Phone Phone_Clean
0     ABC123  $99.99  (555) 123-4567   5551234567
1     DEF456  €49.95   555.987.6543   5559876543
2     GHI789  £29.99   555-321-7890   5553217890


In this example, the pattern \D matches any non-digit character, and str.replace() replaces every match with an empty string.

This method is simple and works well for most cases. The regex=True parameter is important to ensure the pattern is interpreted as a regular expression.
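If you're unsure what regex=True actually changes, here's a minimal sketch (the one-row Series is just for illustration):

```python
import pandas as pd

s = pd.Series(['(555) 123-4567'])

# regex=True: \D is a pattern meaning "any non-digit character"
print(s.str.replace(r'\D', '', regex=True).iloc[0])   # 5551234567

# regex=False: the two characters "\D" are matched literally,
# so nothing in this string is replaced
print(s.str.replace(r'\D', '', regex=False).iloc[0])  # (555) 123-4567
```

Without regex=True, Pandas looks for the literal substring "\D" and leaves the phone number untouched.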

Method 2: Use Lambda Function with filter()

Another approach is to use a lambda function with Python's built-in filter() function to keep only the numeric characters:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Order_ID': ['ORD-12345', 'ORD-67890', 'ORD-24680'],
    'Amount': ['$1,234.56', '$789.01', '$2,468.10'],
})

# Remove non-numeric characters using lambda and filter
df['Order_ID_Clean'] = df['Order_ID'].apply(lambda x: ''.join(filter(str.isdigit, x)))
df['Amount_Clean'] = df['Amount'].apply(lambda x: ''.join(filter(str.isdigit, x)))

print(df)

Output:

     Order_ID    Amount Order_ID_Clean Amount_Clean
0  ORD-12345  $1,234.56          12345      123456
1  ORD-67890    $789.01          67890       78901
2  ORD-24680  $2,468.10          24680      246810


This method uses Python’s built-in filter() function along with str.isdigit() to keep only the digit characters from each string. The join() method then combines these characters back into a string.

Notice that this method removes all non-digit characters, including decimal points and commas. This works great for IDs but may not be suitable for monetary values, where you need to keep the decimal point.
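If you do need the decimal point, one variation on this approach is to widen the filter predicate. This is a sketch of that idea, not the only way to do it:

```python
import pandas as pd

amounts = pd.Series(['$1,234.56', '$789.01', '$2,468.10'])

# Keep digits AND the decimal point; drop everything else
keep = lambda ch: ch.isdigit() or ch == '.'
cleaned = amounts.apply(lambda x: ''.join(filter(keep, x)))

print(cleaned.tolist())  # ['1234.56', '789.01', '2468.10']
```

The currency symbols and thousands separators are stripped, but the decimal point survives, so the values can still be converted to floats.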

Method 3: Use pd.to_numeric() with errors='coerce'

If your goal is to convert strings to actual numeric values (not just strings containing only digits), Pandas provides a convenient function called to_numeric():

import pandas as pd

# Sample DataFrame with numeric values as strings
df = pd.DataFrame({
    'Sales': ['$1,200', '$3,450', 'N/A', '$890'],
    'Units': ['120 pcs', '345 pcs', 'Out of stock', '89 pcs']
})

# First remove currency symbols, commas, and other characters
df['Sales_Clean'] = df['Sales'].str.replace(r'[^\d.]', '', regex=True)
df['Units_Clean'] = df['Units'].str.replace(r'\D', '', regex=True)

# Convert to actual numeric values
df['Sales_Numeric'] = pd.to_numeric(df['Sales_Clean'], errors='coerce')
df['Units_Numeric'] = pd.to_numeric(df['Units_Clean'], errors='coerce')

print(df)

Output:

    Sales         Units Sales_Clean Units_Clean  Sales_Numeric  Units_Numeric
0  $1,200        120 pcs        1200         120         1200.0          120.0
1  $3,450        345 pcs        3450         345         3450.0          345.0
2     N/A  Out of stock                                     NaN            NaN
3    $890         89 pcs         890          89          890.0           89.0


This approach is particularly useful when you want to perform mathematical operations on the cleaned data. The errors='coerce' parameter converts any invalid numeric strings to NaN values rather than raising an error.

I’m using two different regex patterns here:

  • [^\d.] matches any character that’s not a digit or decimal point
  • \D matches any non-digit character

The first pattern is better for monetary values, where you want to keep decimal points, while the second is better for whole numbers.
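To see the difference between the two patterns on a single value, here's a quick demonstration using Python's re module:

```python
import re

price = '$1,234.56'

# [^\d.] keeps digits and the decimal point
print(re.sub(r'[^\d.]', '', price))  # 1234.56

# \D keeps digits only, so the decimal point is lost
print(re.sub(r'\D', '', price))      # 123456
```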

Method 4: Use str.extract() to Pull Out Numeric Portions

If you need to extract specific numeric patterns from strings, Pandas' str.extract() method is very useful:

import pandas as pd

# Sample DataFrame with product codes and measurements
df = pd.DataFrame({
    'Product': ['iPhone 13 Pro', 'Samsung Galaxy S22', 'Google Pixel 6'],
    'Dimensions': ['146.7 x 71.5 x 7.65 mm', '146.0 x 70.6 x 7.6 mm', '158.6 x 74.8 x 8.9 mm']
})

# Extract the first number from each product name
df['Product_Number'] = df['Product'].str.extract(r'(\d+)')

# Extract all three dimensions separately
df[['Height', 'Width', 'Thickness']] = df['Dimensions'].str.extract(r'([\d.]+)\s*x\s*([\d.]+)\s*x\s*([\d.]+)')

# Convert the extracted strings to float
df[['Height', 'Width', 'Thickness']] = df[['Height', 'Width', 'Thickness']].astype(float)

print(df)

Output:

              Product              Dimensions  Product_Number  Height  Width  Thickness
0       iPhone 13 Pro  146.7 x 71.5 x 7.65 mm              13   146.7   71.5       7.65
1  Samsung Galaxy S22   146.0 x 70.6 x 7.6 mm              22   146.0   70.6       7.60
2      Google Pixel 6   158.6 x 74.8 x 8.9 mm               6   158.6   74.8       8.90

This method is particularly useful when you need to extract specific numeric patterns from more complex strings. The regular expression pattern inside extract() uses capture groups (the parts in parentheses) to pull out the exact numbers you want.

The beauty of this approach is that it can handle complex patterns and extract multiple numeric values at once, as shown with the dimensions example.
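A related tool worth knowing is str.extractall(), which returns every match rather than just the first. This short sketch pulls all the numbers out of one dimension string:

```python
import pandas as pd

dims = pd.Series(['146.7 x 71.5 x 7.65 mm'])

# extractall returns one row per match, indexed by (row, match number)
all_numbers = dims.str.extractall(r'([\d.]+)')
print(all_numbers[0].tolist())  # ['146.7', '71.5', '7.65']
```

This is handy when you don't know in advance how many numbers each string contains.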


Handle Special Cases: Negative Numbers and Decimals

When working with financial or scientific data, you might need to preserve negative signs and decimal points:

import pandas as pd

# Sample DataFrame with financial data
df = pd.DataFrame({
    'Amount': ['+$1,234.56', '-$789.01', '$2,468.10'],
    'Change': ['+15.2%', '-7.8%', '+0.3%']
})

# Preserve negative signs and decimal points
df['Amount_Clean'] = df['Amount'].str.replace(r'[^\d.-]', '', regex=True)
df['Change_Clean'] = df['Change'].str.replace(r'[^\d.-]', '', regex=True)

# Convert to numeric values
df['Amount_Numeric'] = pd.to_numeric(df['Amount_Clean'])
df['Change_Numeric'] = pd.to_numeric(df['Change_Clean'])

print(df)

Output:

       Amount  Change  Amount_Clean  Change_Clean  Amount_Numeric  Change_Numeric
0  +$1,234.56  +15.2%       1234.56          15.2         1234.56            15.2
1    -$789.01   -7.8%       -789.01          -7.8         -789.01            -7.8
2   $2,468.10   +0.3%       2468.10           0.3         2468.10             0.3

In this example, I used the pattern [^\d.-] which preserves digits, decimal points, and minus signs while removing everything else. This is crucial for financial data, where the negative sign carries important meaning.
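One caveat worth noting: because [^\d.-] also keeps hyphens that appear in the middle of text, the cleaned string may not be a valid number. Pairing it with to_numeric(errors='coerce') surfaces those cases as NaN; the range example below is hypothetical:

```python
import pandas as pd

vals = pd.Series(['+$1,234.56', '-$789.01', 'from 10-20 units'])

# The interior hyphen in '10-20' survives the character class
loose = vals.str.replace(r'[^\d.-]', '', regex=True)
print(loose.tolist())  # ['1234.56', '-789.01', '10-20']

# '10-20' is not a valid number, so errors='coerce' yields NaN
numeric = pd.to_numeric(loose, errors='coerce')
print(numeric.tolist())  # [1234.56, -789.01, nan]
```

If stray hyphens are common in your data, inspect the NaN rows before trusting the cleaned column.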


Performance Considerations for Large DataFrames

When working with large datasets, performance becomes important. Here’s how the different methods stack up:

  1. str.replace(): Generally fast and efficient for most operations.
  2. apply() with lambda: Slower for large DataFrames as it applies Python-level functions.
  3. to_numeric(): Very efficient for converting to numeric types.
  4. str.extract(): Great for complex patterns, but can be slower than simple replacements.

For a DataFrame with millions of rows, I recommend using vectorized operations like str.replace() or to_numeric() rather than apply() with lambda functions.
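If you want to check this on your own data, a rough timing harness might look like the following (the sample Series and repeat counts are arbitrary, and results vary by machine and pandas version):

```python
import timeit
import pandas as pd

s = pd.Series(['abc123-456'] * 10_000)

# Sanity check: both approaches should produce identical results
replace_out = s.str.replace(r'\D', '', regex=True)
apply_out = s.apply(lambda x: ''.join(filter(str.isdigit, x)))
assert replace_out.equals(apply_out)

# Time each approach over a few repetitions
t_replace = timeit.timeit(lambda: s.str.replace(r'\D', '', regex=True), number=3)
t_apply = timeit.timeit(lambda: s.apply(lambda x: ''.join(filter(str.isdigit, x))), number=3)

print(f'str.replace: {t_replace:.3f}s  apply+filter: {t_apply:.3f}s')
```

Always verify the two approaches agree on your actual data before comparing their speed.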

In my experience working with US customer data, I’ve found the str.replace() method to be the most versatile for cleaning up phone numbers, zip codes, and social security numbers, where you need to strip out all formatting characters.

All these methods have their place depending on your specific needs. I hope these examples from my years of Python data wrangling help you clean your data more effectively.
