In this Python blog, I will explain how to remove non-ASCII characters from a string in Python using different methods, with illustrative examples.
What is ASCII?
ASCII (American Standard Code for Information Interchange) is a character encoding standard that represents each character as a number between 0 and 127. It covers English letters, digits, punctuation, and some control characters. Any character outside of this range is considered a non-ASCII character.
When working with text in Python, we might come across strings that contain non-ASCII characters. These could be special symbols, diacritics, or characters from non-Latin scripts. In some applications, such as when dealing with legacy systems or when we want to standardize textual data, we might need to remove these non-ASCII characters.
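To make the 0 to 127 range concrete, here is a small sketch that uses Python's built-in ord() to look up the code points of a few characters. The helper name is_ascii_char is made up for this illustration.

```python
# Hypothetical helper: True when a single character falls in the ASCII range.
def is_ascii_char(ch):
    return ord(ch) < 128

# Print each character, its code point, and whether it is ASCII.
for ch in ["A", "z", "7", "~", "é", "ā", "€"]:
    print(repr(ch), ord(ch), is_ascii_char(ch))
```

The first four characters have code points below 128, while "é" (233), "ā" (257), and "€" (8364) fall outside the ASCII range.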
Methods to remove non-ASCII characters from a string in Python
There are several different methods to remove non-ASCII characters from a Python string:
- Using a For-Loop
- Using Regular Expressions
- Using the encode() and decode() Methods
- Using List Comprehension
- Using Python’s String Methods
- Using the filter() Function
- Using map() and lambda
Let’s see them one by one using some demonstrative examples:
Method 1: Remove non-ASCII characters in Python using a for loop
This method iterates over each character in the string and checks its Unicode code point with the ord() function. If the code point is less than 128 (that is, the character is ASCII), the character is kept; otherwise, it is dropped. The example writes this loop compactly as a generator expression inside ''.join().
For example: A local library in San José, California is digitizing its card catalog. Their legacy system can’t handle non-ASCII characters, so they need a way to standardize the city names in their records with Python.
def remove_non_ascii(text):
    # Keep only characters whose code point is in the ASCII range (0-127)
    return ''.join(i for i in text if ord(i) < 128)

text = "San José, California"
clean_text = remove_non_ascii(text)
print(clean_text)
Output: The ord() check removes the accented “é” in “San José”, turning it into “San Jos”. The state name “California” remains unchanged as it contains only ASCII characters.
San Jos, California
This way we can remove non-ASCII characters from a Python string using the ord() function with a for loop.
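The same logic can also be spelled out as an explicit for loop, which may be easier to follow for beginners. This is only a sketch; the function name remove_non_ascii_loop is chosen for this illustration.

```python
def remove_non_ascii_loop(text):
    kept = []                    # collect the characters we want to keep
    for ch in text:
        if ord(ch) < 128:        # ASCII code points are 0..127
            kept.append(ch)
    return ''.join(kept)

print(remove_non_ascii_loop("San José, California"))  # San Jos, California
```

Collecting the characters in a list and joining once at the end avoids building a new string on every iteration.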
Method 2: Strip non-ASCII characters in Python using regular expressions
This method uses Python’s re module to find and remove any character outside the ASCII range. The regular expression r'[^\x00-\x7F]+' matches one or more consecutive non-ASCII characters, and the re.sub() function replaces each match with an empty string.
For instance: The Tourism Board of Hawaii is creating brochures with Python. They’ve sourced content from various authors, and some have used native spellings like “Mānoa”. To keep the brochure consistent and easily readable by all tourists, they decided to use only ASCII characters.
import re

def remove_non_ascii(text):
    # Replace every run of characters outside \x00-\x7F with nothing
    return re.sub(r'[^\x00-\x7F]+', '', text)

text = "Mānoa, Hawaii"
clean_text = remove_non_ascii(text)
print(clean_text)
Output: The “ā” character in the string is non-ASCII. The re.sub() function matches and removes it, resulting in “Mnoa”. “Hawaii” remains unaffected.
Mnoa, Hawaii
This way the sub() function from the re module helps us remove non-ASCII characters from a Python string.
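If silently dropping characters is not desirable, the same pattern can substitute a visible placeholder instead. The sketch below assumes a placeholder of '?' and precompiles the pattern for reuse; the names NON_ASCII and replace_non_ascii are made up for this example.

```python
import re

# Compile once so the pattern can be reused across many strings.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def replace_non_ascii(text, placeholder='?'):
    # Each run of consecutive non-ASCII characters becomes one placeholder.
    return NON_ASCII.sub(placeholder, text)

print(replace_non_ascii("Mānoa, Hawaii"))   # M?noa, Hawaii
print(replace_non_ascii("日本語 ok"))        # ? ok
```

Because of the + quantifier, the three consecutive characters in the second example collapse into a single placeholder rather than three.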
Method 3: Remove all non-ASCII characters from a string in Python using the encode() and decode() methods
This method first encodes the string as bytes using the ASCII codec. Any non-ASCII characters are skipped because of the 'ignore' error handler. The byte string is then decoded back into a regular Python string, now without the non-ASCII characters.
Scenario: A local radio station in Nevada is launching a new website. While uploading the list of featured artists, they notice special characters in some names, like “Zoë”. Their web font doesn’t display these characters properly, so they need a quick way to remove them in Python.
def remove_non_ascii(text):
    # 'ignore' drops characters the ASCII codec cannot encode
    return text.encode('ascii', 'ignore').decode('ascii')

text = "Zoë, Nevada"
clean_text = remove_non_ascii(text)
print(clean_text)
Output: The character “ë” in the string is non-ASCII. The encode() and decode() methods together remove this character, leaving “Zo” behind. “Nevada” remains the same.
Zo, Nevada
This way we can use the encode() and decode() methods to remove non-ASCII characters from a Python string.
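A common variation on this method, not part of the example above, is to normalize the string with the standard unicodedata module before encoding, so that an accented letter keeps its base letter instead of vanishing. The following is a minimal sketch under that assumption; the function name to_ascii_keep_base is hypothetical.

```python
import unicodedata

def to_ascii_keep_base(text):
    # NFKD splits "ë" into "e" plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize('NFKD', text)
    # The combining mark is non-ASCII, so 'ignore' drops it and keeps the "e".
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii_keep_base("Zoë, Nevada"))   # Zoe, Nevada
print(to_ascii_keep_base("Straße"))        # Strae
```

This only helps for characters that decompose into an ASCII letter plus combining marks. Characters without such a decomposition, like “ß” in the second example, are still removed.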
Method 4: Remove non-ASCII characters from a string in Python using list comprehension
A list comprehension is a more concise way to write the for-loop method. This approach evaluates each character’s Unicode code point and keeps only the characters within the ASCII range.
Example: A tech startup in Boisé, Idaho is migrating its user database to a new platform. Some user entries have non-ASCII characters in city names. To ensure smooth migration and compatibility, they decided to standardize all city names to contain only ASCII characters.
def remove_non_ascii(text):
    # Build a list of the ASCII characters, then join them into a string
    return ''.join([i for i in text if ord(i) < 128])

text = "Boisé, Idaho"
clean_text = remove_non_ascii(text)
print(clean_text)
Output: The accented character “é” in the string is detected and removed. This transforms the city name to “Bois”. “Idaho” is untouched as it’s purely ASCII.
Bois, Idaho
This way we can use list comprehension to remove non-ASCII characters from a Python string.
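In a migration scenario like this one, the function would typically be applied to many records at once. As a rough sketch, a second list comprehension can clean a whole list of city names; the sample records below are invented for illustration.

```python
def remove_non_ascii(text):
    return ''.join([ch for ch in text if ord(ch) < 128])

# Hypothetical sample records, not real data.
cities = ["Boisé, Idaho", "San José, California", "Austin, Texas"]

# Apply the cleaner to every record in one pass.
clean_cities = [remove_non_ascii(city) for city in cities]
print(clean_cities)
# ['Bois, Idaho', 'San Jos, California', 'Austin, Texas']
```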
Method 5: Remove non-ASCII characters from a string using Python’s string methods
Python strings have several built-in methods, and one of them is isascii(), available in Python 3.7 and later. The str.isascii() method returns True if all characters in a string are ASCII. By applying this method to each character in the text, we can easily filter out any non-ASCII characters.
Example: A software company in Texas is creating an online portal for employee registrations. They realize that names with special characters, like “Chloë”, can cause issues with their legacy database system. They need a way in Python to filter out such characters during the registration process.
def remove_non_ascii(text):
    # Keep a character only if str.isascii() reports it as ASCII
    return ''.join(char for char in text if char.isascii())

text = "Chloë, Texas"
clean_text = remove_non_ascii(text)
print(clean_text)
Output: The isascii() string method identifies the “ë” in the string as a non-ASCII character, so it is removed, resulting in “Chlo”. “Texas” remains unchanged.
Chlo, Texas
This way the string isascii() method helps us remove non-ASCII characters from a Python string.
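Since isascii() also works on a whole string, it can serve as a quick check before doing any per-character work. The sketch below adds such an early return; whether it pays off depends on how often the input is already pure ASCII, which is an assumption about your data.

```python
def remove_non_ascii(text):
    # Early return: nothing to strip if the whole string is already ASCII.
    if text.isascii():
        return text
    return ''.join(ch for ch in text if ch.isascii())

print("Texas".isascii())                 # True
print("Chloë".isascii())                 # False
print("".isascii())                      # True (an empty string counts as ASCII)
print(remove_non_ascii("Chloë, Texas"))  # Chlo, Texas
```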
Method 6: Remove all non-ASCII characters in Python using the filter() function
The built-in filter() function takes a function and an iterable as arguments and yields only the items for which the function returns True. Here, str.isascii() is used as the filter function so that only ASCII characters are kept.
Example: A car dealership in Arizona is importing a large CSV of customer details from their European branch. Names like “Renée” contain non-ASCII characters. To keep their CRM system error-free, they decided to filter out these characters with Python.
def remove_non_ascii(text):
    # filter() yields only the characters for which str.isascii is True
    return ''.join(filter(str.isascii, text))

text = "Renée, Arizona"
clean_text = remove_non_ascii(text)
print(clean_text)
Output: The character “é” in the string is non-ASCII. The filter() function with the isascii() string method strips it out, leaving “Rene”. “Arizona” stays the same.
Rene, Arizona
This way we can use the filter() function with the isascii() method to remove non-ASCII characters from a Python string.
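One detail worth seeing in action: in Python 3, filter() returns a lazy filter object rather than a string, which is why the result is passed to ''.join(). The sketch below shows this and also an assumed alternative predicate, is_ascii, based on ord() for interpreters older than Python 3.7 that lack str.isascii().

```python
text = "Renée, Arizona"

kept = filter(str.isascii, text)   # a lazy filter object, not a string
print(type(kept).__name__)         # filter
print(''.join(kept))               # Rene, Arizona

# Assumed alternative predicate that does not rely on str.isascii()
def is_ascii(ch):
    return ord(ch) < 128

print(''.join(filter(is_ascii, text)))  # Rene, Arizona
```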
Method 7: Remove non-ASCII characters from a string in Python using map() with a lambda function
The map() function applies a given function to every item in an iterable. In this method, a lambda function checks each character of the string, returning the character itself if it is ASCII and an empty string otherwise.
Example: A historical museum in Colorado is archiving letters from early settlers. Some names, like “Långe”, contain non-ASCII characters. For uniformity in their digital archive system, they decided to store names using only ASCII characters.
def remove_non_ascii(text):
    # Map each character to itself if ASCII, otherwise to an empty string
    return ''.join(map(lambda x: x if x.isascii() else '', text))

text = "Långe, Colorado"
clean_text = remove_non_ascii(text)
print(clean_text)
Output: The character “å” in the string is identified as non-ASCII and is removed by map() with the lambda function, resulting in “Lnge”. “Colorado” remains unaffected.
Lnge, Colorado
This way we can use the map() function with a lambda function to remove non-ASCII characters from a Python string.
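Because the lambda decides what each character maps to, the same pattern can substitute a replacement rather than an empty string. The sketch below assumes a small hand-written table, SUBSTITUTES, which is purely hypothetical and would need to be extended for real data.

```python
# Hypothetical hand-written substitutions; extend as your data requires.
SUBSTITUTES = {'å': 'a', 'ø': 'o', 'ß': 'ss'}

def replace_non_ascii(text):
    # ASCII characters pass through; others use the table or are dropped.
    return ''.join(
        map(lambda ch: ch if ch.isascii() else SUBSTITUTES.get(ch, ''), text)
    )

print(replace_non_ascii("Långe, Colorado"))  # Lange, Colorado
print(replace_non_ascii("Zoë"))              # Zo
```

Characters missing from the table, such as “ë” in the second call, are still removed.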
In this tutorial, we covered seven different methods to remove non-ASCII characters from a string in Python: a for loop with the ord() function, the re.sub() function, the encode() and decode() methods, list comprehension, the isascii() string method, the filter() function with isascii(), and map() with a lambda function. I have explained each method with a demonstrative example.
Which method to use depends on the developer’s preference and the specific use case. Whichever technique they choose, they can easily standardize their Python strings to contain only ASCII characters.
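As a quick sanity check that the choice really is a matter of preference, the sketch below runs all seven approaches on the same inputs and confirms they agree. The function names (by_loop, by_regex, and so on) are invented for this comparison.

```python
import re

def by_loop(text):
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
    return ''.join(out)

def by_regex(text):
    return re.sub(r'[^\x00-\x7F]+', '', text)

def by_codec(text):
    return text.encode('ascii', 'ignore').decode('ascii')

def by_comprehension(text):
    return ''.join([ch for ch in text if ord(ch) < 128])

def by_isascii(text):
    return ''.join(ch for ch in text if ch.isascii())

def by_filter(text):
    return ''.join(filter(str.isascii, text))

def by_map(text):
    return ''.join(map(lambda ch: ch if ch.isascii() else '', text))

METHODS = [by_loop, by_regex, by_codec, by_comprehension,
           by_isascii, by_filter, by_map]
samples = ["San José, California", "Mānoa, Hawaii",
           "Zoë, Nevada", "Långe, Colorado", ""]

for sample in samples:
    # Collect the distinct results; one distinct value means all agree.
    results = {method(sample) for method in METHODS}
    assert len(results) == 1, results
    print(repr(sample), '->', repr(results.pop()))
```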
I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working on Python, machine learning, and artificial intelligence for the last 5 years. During this time I have gained expertise in various Python libraries such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, and Scikit-Learn, for various clients in the United States, Canada, the United Kingdom, Australia, New Zealand, etc.