This Python tutorial explains how to remove non-ASCII characters from strings in Python, with several examples.
It covers the following topics:
- Remove Non ASCII Characters Python Pandas
- Remove Non ASCII Characters Python
- Remove Non ASCII Characters Python Regex
- Remove Non ASCII Characters From CSV Python
- Remove Non ASCII Characters From File Python
- Strip Out Non ASCII Characters Python
- Pyspark Replace Non ASCII Characters Python
- Remove Non-ASCII Characters From Text Python
- Python Remove Non-ASCII Characters From Bytes
ASCII stands for American Standard Code for Information Interchange. It defines 128 characters (codes 0 to 127), which include the letters, digits, and punctuation found on a US keyboard. Any character outside that range is non-ASCII; such characters are common in text written in languages other than English.
For example, Chinese, Japanese, and Hindi text consists of non-ASCII characters. In this tutorial, we will learn how to remove non-ASCII characters in Python.
You might be wondering what non-ASCII characters look like. Here is a sample:
¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇËÎÐÖÑרψϑωϖℵℜ←↑→↓↔↵⇐⇑⇒⇓⇔∀
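To make the distinction concrete, here is a minimal sketch that checks whether a string is pure ASCII and lists the characters that are not. The helper name `non_ascii_chars` and the sample text are made up for this illustration, and `str.isascii()` assumes Python 3.7 or later.

```python
def non_ascii_chars(text):
    """Return the characters in text whose code point is above 127."""
    return [ch for ch in text if ord(ch) > 127]

sample = "Caf© 100°"
print(sample.isascii())          # False, the string contains © and °
print(non_ascii_chars(sample))   # ['©', '°']
print("plain text".isascii())    # True
```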
Remove Non ASCII Characters Python Pandas
- In this section, we will learn how to remove non-ASCII characters with pandas.
- By chaining the encode() and decode() string methods we can remove non-ASCII characters from a pandas Series or DataFrame column. encode() converts a string to bytes using a given encoding, and decode() converts bytes back to a Unicode string.
Source Code:
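import pandas as pd
# A Series whose values mix ASCII and non-ASCII characters
df = pd.Series(['m©ª«zy', '¤¥uw', 'ÆÇval 672'])
# Encode to ASCII, dropping what cannot be encoded, then decode back to str
d = df.str.encode('ascii', 'ignore').str.decode('ascii')
print("After removing non-ascii:")
print(d)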
The code above imports pandas and creates a Series whose values mix ASCII and non-ASCII characters.
str.encode('ascii', 'ignore') encodes each value as ASCII and, because the error handler is 'ignore', silently drops every character that ASCII cannot represent. str.decode('ascii') then converts the resulting bytes back to strings.
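The same chain can be applied to one column of a DataFrame. The sketch below is only an illustration: the column names and values are invented, and it assumes that missing values should stay missing, which is how the `.str` accessor treats them.

```python
import pandas as pd

# Hypothetical data: one text column (with a missing value) and one numeric column
df = pd.DataFrame({"name": ["m©ª«zy", None, "ÆÇval 672"], "score": [1, 2, 3]})

# The .str accessor skips missing values, so they remain missing in the result
df["name_ascii"] = df["name"].str.encode("ascii", "ignore").str.decode("ascii")
print(df)
```

Writing the result to a new column keeps the original text available for comparison.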
Remove Non ASCII Characters Python
- In this program, we will discuss how to remove non-ASCII characters from a string in Python 3.
- Here we apply the str.encode() method. First create a string that contains some non-ASCII characters. Then call encode() with the 'ascii' encoding and the 'ignore' error handler, which produces a bytes object without the non-ASCII characters. Finally, call decode() to convert the bytes back into a string.
Example:
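new_str = "¡¢£ Py¼½¾thon is a be¹ºst prog®¯°ramming language±²³."
print("Original string:", new_str)
# Characters that cannot be encoded as ASCII are dropped by the 'ignore' handler
new_val = new_str.encode("ascii", "ignore")
# Convert the ASCII bytes back to a str
updated_str = new_val.decode()
print("After removing non-ascii:")
print(updated_str)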
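The second argument to encode() decides what happens to characters that ASCII cannot represent. The sketch below compares 'ignore' with 'replace', and also shows an optional Unicode normalization step that keeps the base letter of an accented character. The normalization step is an addition that is not part of the example above, and the function name `to_ascii` is hypothetical. The sample uses escape sequences so that the exact characters are unambiguous.

```python
import unicodedata

def to_ascii(text, errors="ignore", normalize=False):
    """Encode to ASCII and back; optionally decompose accented letters first."""
    if normalize:
        # NFKD splits 'é' into 'e' plus a combining accent; the accent is then dropped
        text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", errors).decode("ascii")

sample = "caf\u00e9 au lait \u00a92024"        # "café au lait ©2024"
print(to_ascii(sample))                        # caf au lait 2024
print(to_ascii(sample, errors="replace"))      # caf? au lait ?2024
print(to_ascii(sample, normalize=True))        # cafe au lait 2024
```

Whether to drop, mark, or transliterate a character depends on how the cleaned text will be used, so treat the defaults here as one possible choice.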
Remove Non ASCII Characters Python Regex
- Let us see how to remove non-ASCII characters with a regular expression.
- In this program, we use a regular expression to remove the non-ASCII characters from a string. A regular expression describes a pattern to search for in a string, and Python's built-in re module provides regular expression support.
Source Code:
import re
String_value = 'JoÂÃÄÅhn iרψs a goωϖℵod b¡¢oy'
print("Original string:", String_value)
# Replace every character outside the ASCII range with an empty string
new_result = re.sub(r'[^\x00-\x7f]', "", String_value)
print("After removing non-ASCII characters from string:")
print(new_result)
The code imports the re module and stores the input in the variable String_value.
The re.sub() function replaces every match of the pattern [^\x00-\x7f], that is, every character outside the ASCII range, with an empty string, and the result is stored in new_result.
Printing new_result displays the cleaned string.
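When the same pattern is used many times, it can be compiled once and reused. The sketch below also uses `subn()`, which returns the number of replacements along with the new string. The names `NON_ASCII` and `remove_non_ascii` and the sample text are assumptions made for this illustration.

```python
import re

# Matches any single character outside the ASCII range U+0000 to U+007F
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def remove_non_ascii(text, replacement=""):
    """Return (cleaned_text, number_of_characters_replaced)."""
    return NON_ASCII.subn(replacement, text)

cleaned, count = remove_non_ascii("Jo©hn is a goωod b→oy")
print(cleaned)  # John is a good boy
print(count)    # 3
```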
Remove Non ASCII Characters From CSV Python
- In this section, we will learn how to remove non-ASCII characters from the contents of a CSV file in Python.
- To do this, we read the file into a pandas DataFrame and apply the encode() and decode() methods to its text values.
Source Code:
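import pandas as pd
data = pd.read_csv('test1.csv', encoding='unicode_escape')
def strip_non_ascii(value):
    # Only strings are cleaned; numbers and missing values pass through unchanged
    if isinstance(value, str):
        return value.encode("ascii", "ignore").decode("ascii")
    return value
# A DataFrame has no encode() method, so apply the function to every cell
updated = data.apply(lambda col: col.map(strip_non_ascii))
print("After removing non-ascii:")
print(updated)
A DataFrame has no encode() attribute, so calling data.encode() directly raises an AttributeError and removes nothing. The cleanup therefore has to be applied to the individual string values, as shown above. Note also that this changes only the DataFrame in memory; the CSV file itself is not updated unless the result is written back, for example with to_csv().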
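If pandas is not needed for anything else, the standard-library csv module can do the same cell-by-cell cleanup. This is a minimal sketch under stated assumptions: it uses in-memory io.StringIO objects in place of real files so that it runs as-is, and the sample rows and the function name `clean_csv` are invented.

```python
import csv
import io

def clean_csv(reader, writer):
    """Copy rows from reader to writer, dropping non-ASCII characters from each cell."""
    for row in reader:
        writer.writerow([cell.encode("ascii", "ignore").decode("ascii") for cell in row])

# In-memory stand-ins for files opened with open(path, newline="", encoding="utf-8")
source = io.StringIO("id,city\n1,Par©is\n2,Ber→lin\n")
target = io.StringIO()
clean_csv(csv.reader(source), csv.writer(target, lineterminator="\n"))
print(target.getvalue())
```

With real files, writing to a separate output file rather than overwriting the input keeps the original data recoverable.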
Strip Out Non ASCII Characters Python
- Here we can see how to strip out non-ASCII characters in Python.
- This example uses the re.sub() function with the pattern '[^\x00-\x7f]'. The range \x00-\x7f covers the ASCII codes 0 to 127, and the leading ^ negates it, so the pattern matches every character outside that range. The input string is new_str, and the result, new_result, contains no non-ASCII characters.
Source Code:
import re
new_str='Australia©ª«Germany'
new_result = re.sub(r'[^\x00-\x7f]', "", new_str)
print(new_result)
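Before deleting anything, it can help to see exactly which characters the pattern matches. The sketch below lists each match with its code point, then shows a variant that replaces each run of non-ASCII characters with a single space so that neighbouring words are not glued together. The variant pattern with `+` is an addition to the example above.

```python
import re

text = "Australia©ª«Germany"

# Every character matched by [^\x00-\x7f], with its Unicode code point in hex
for ch in re.findall(r"[^\x00-\x7f]", text):
    print(ch, hex(ord(ch)))

# Replacing each run of matches with one space keeps the two words apart
print(re.sub(r"[^\x00-\x7f]+", " ", text))  # Australia Germany
```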
How to strip out Non-ASCII characters in Python
In this program, we combine the ord() function with a loop over the string to strip out non-ASCII characters.
In Python, the ord() function accepts a string containing a single character and returns that character's Unicode code point as an integer. ASCII characters have code points below 128, so the code point tells us whether a character is ASCII.
Example:
new_val = "Mi©ª«chal is a³´µ¶·good b½¾¿oy"
new_res = ''.join([m if ord(m) < 128 else ' ' for m in new_val])
print("After replacing non-ASCII characters:", new_res)
The code creates a string, new_val, that contains non-ASCII characters.
The list comprehension keeps each character whose code point is below 128 and substitutes a space for every other character; join() then combines the results into a new string. Note that the non-ASCII characters are replaced with spaces rather than deleted, so the output contains one space for each of them.
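The choice between deleting a character and substituting a space affects how the surrounding words end up. The following sketch wraps the ord() test in a function with a configurable placeholder and collapses the runs of spaces that substitution leaves behind. The function name `strip_non_ascii` and the sample string are assumptions for this illustration.

```python
def strip_non_ascii(text, placeholder=""):
    """Replace each non-ASCII character with placeholder (deleted by default)."""
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)

sample = "Mi©ª«chal is a³´µgood boy"
print(strip_non_ascii(sample))                 # Michal is agood boy

spaced = strip_non_ascii(sample, placeholder=" ")
# split() followed by join() collapses the runs of spaces left by the placeholder
print(" ".join(spaced.split()))                # Mi chal is a good boy
```

Neither result is perfect here: deletion joins "a" and "good", while substitution splits "Michal". Which one is preferable depends on where the stray characters sit in the data.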
Pyspark Replace Non ASCII Characters Python
- In this section, we will learn how to replace non-ASCII characters in PySpark.
- PySpark is the Python API for Apache Spark. It is intended for datasets that are too large to process comfortably with pandas on a single machine, because Spark can distribute the work across a cluster.
- The example imports Row, which represents a row of data in a DataFrame, along with SparkContext and SparkSession, which are used to create the DataFrame.
Source Code:
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import udf
# Start a local Spark context and session
sc = SparkContext('local')
spark = SparkSession(sc)
df = spark.createDataFrame([
    (0, "Pyt¾¿hon Deve¾¿loper"),
    (1, "Jav±²³a Deve±²³loper"),
    (2, "S£¤¥ql Datab£¤¥ase"),
    (3, "Mongodb database")
], ["id", "words"])
def ascii_ignore(x):
    # Drop every character that cannot be encoded as ASCII
    return x.encode('ascii', 'ignore').decode('ascii')
# Wrap the Python function as a UDF (the return type defaults to string)
ascii_udf = udf(ascii_ignore)
# Add the cleaned text as a new column named "foo"
df.withColumn("foo", ascii_udf('words')).show()
The code creates a DataFrame with spark.createDataFrame() whose words column contains non-ASCII characters.
The udf() function wraps ascii_ignore as a user-defined function (UDF), a reusable function that Spark applies to each value of a column. withColumn() then stores the cleaned text in a new column named foo.
Remove Non-ASCII Characters From Text Python
- In this section, we will learn how to remove non-ASCII characters from a text in Python.
- Here we use the replace() method. str.replace() is a built-in string method that returns a copy of the string in which every occurrence of a given substring is replaced with a new string, which may be empty.
Source Code:
new_ele = "England Germanyℜ←↑→China"
new_result = new_ele.replace('ℜ←↑→', '')
print("After removing non-ASCII characters from text:", new_result)
The code creates a string, new_ele, and uses str.replace() to replace the specific sequence 'ℜ←↑→' with an empty string.
Printing new_result displays the string without those characters. Keep in mind that replace() removes only the exact substring it is given, so any other non-ASCII characters in the text would remain.
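When the unwanted characters are known but do not always appear together in the same order, str.translate() with a deletion table removes each of them wherever it occurs. This is a sketch of that alternative, not the method used above; the sample text and the variable names are invented.

```python
# Characters to delete; this particular set is only an example
unwanted = "ℜ←↑→"
delete_table = str.maketrans("", "", unwanted)

text = "England ←Germany→ ℜChina↑"
print(text.translate(delete_table))      # England Germany China

# replace() looks for the exact sequence, which does not occur in this text
print(text.replace(unwanted, "") == text)  # True
```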
Python Remove Non-ASCII Characters From Bytes
- In this section, we will learn how to remove non-ASCII characters from bytes in Python.
- This example encodes the string to UTF-8 bytes with encode() and then applies a bytes regular expression that replaces every non-ASCII byte with a space.
Source Code:
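import re
new_str = 'Oliva→↓↔↵⇐Elijah'
# Encode to UTF-8 bytes; each of these arrows becomes three bytes, all above 0x7f
m = new_str.encode('utf8')
# Replace every non-ASCII byte with a space, using a bytes pattern
z = re.sub(rb'[^\x00-\x7f]', rb' ', m)
print(z)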
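Because one non-ASCII character can occupy several bytes in UTF-8, replacing byte by byte produces several spaces per character. If the goal is simply to drop those bytes, the regular expression is not required. The sketch below shows two alternatives that rely only on built-in bytes behaviour; the shortened sample string is an assumption.

```python
data = "Oliva→↓Elijah".encode("utf-8")
print(len("→".encode("utf-8")))  # 3, this arrow is three bytes in UTF-8

# Keep only byte values below 128; iterating over bytes yields integers
ascii_only = bytes(b for b in data if b < 128)
print(ascii_only)  # b'OlivaElijah'

# Decoding with the 'ignore' handler drops the same bytes and returns a str
print(data.decode("ascii", "ignore"))  # OlivaElijah
```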
In this tutorial, we have learned how to remove non-ASCII characters in Python. We covered the following topics:
- Remove Non ASCII Characters Python Pandas
- Remove Non ASCII Characters Python
- Remove Non ASCII Characters Python Regex
- Remove Non ASCII Characters From CSV Python
- Remove Non ASCII Characters From File Python
- Strip Out Non ASCII Characters Python
- Pyspark Replace Non ASCII Characters Python
- Remove Non-ASCII Characters From Text Python
- Python Remove Non-ASCII Characters From Bytes
I am Bijay Kumar, a Microsoft MVP in SharePoint. Apart from SharePoint, I have been working with Python, machine learning, and artificial intelligence for the last 5 years. During this time I have gained expertise in various Python libraries, such as Tkinter, Pandas, NumPy, Turtle, Django, Matplotlib, TensorFlow, SciPy, and Scikit-Learn, for clients in the United States, Canada, the United Kingdom, Australia, New Zealand, and elsewhere.