Remove non-ASCII characters Python

This Python tutorial explains how to remove non-ASCII characters from strings in Python, with several examples.

We will cover these topics:

  • Remove Non ASCII Characters Python Pandas
  • Remove Non ASCII Characters Python
  • Remove Non ASCII Characters Python Regex
  • Remove Non ASCII Characters From CSV Python
  • Remove Non ASCII Characters From File Python
  • Strip Out Non ASCII Characters Python
  • Pyspark Replace Non ASCII Characters Python
  • Remove Non-ASCII Characters From Text Python
  • Python Remove Non-ASCII Characters From Bytes

ASCII stands for American Standard Code for Information Interchange. Every character on a US keyboard has an ASCII code, a number in the range 0 to 127. Non-ASCII characters are all characters outside that range; they appear most often in text written in languages other than English, and they also include many symbols and accented letters.

For example, Chinese, Japanese, and Hindi text consists of non-ASCII characters. In this tutorial, we will learn how to remove non-ASCII characters in Python.

You might be wondering what non-ASCII characters look like. Here is a sample:

¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇËÎÐÖÑרψϑωϖℵℜ←↑→↓↔↵⇐⇑⇒⇓⇔∀
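
Before removing anything, it can help to check whether a string contains non-ASCII characters at all. The following is a minimal sketch; the helper name find_non_ascii and the sample string are made up for illustration.

```python
def find_non_ascii(text):
    """Return the non-ASCII characters in text, in order of appearance."""
    # ASCII covers code points 0 through 127
    return [ch for ch in text if ord(ch) > 127]

sample = "café ©2024"
print(sample.isascii())        # str.isascii() requires Python 3.7 or later
print(find_non_ascii(sample))
```

Here str.isascii() gives a quick yes/no answer, while the helper shows which characters are responsible.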

Remove Non ASCII Characters Python Pandas

  • In this section, we will learn how to remove non-ASCII characters in Python pandas.
  • By using the encode() and decode() methods we can remove non-ASCII characters from a pandas Series or DataFrame column. In Python, str.encode() converts a string to bytes using a given encoding, and bytes.decode() converts those bytes back to a Unicode string.

Source Code:

import pandas as pd

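# A Series of strings that mix ASCII and non-ASCII characters
df = pd.Series(['m©ª«zy', '¤¥uw', 'ÆÇval 672'])
# Encode to ASCII, dropping characters that cannot be encoded, then decode back to str
d = df.str.encode('ascii', 'ignore').str.decode('ascii')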
print("After removing non-ascii:")
print(d)

In the above code, we first import the pandas library and then create a Series (named df here) whose strings mix ASCII and non-ASCII characters.

Next, str.encode() encodes each string to ASCII. Passing ‘ignore’ as the error handler drops every character that cannot be encoded, which removes the non-ASCII characters. str.decode() then converts the resulting bytes back to strings.
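
The second argument to encode() selects an error handler, and ‘ignore’ is only one of the options. The sketch below compares three standard handlers on a plain Python string; the variable names are chosen for illustration.

```python
text = "m©ª«zy"

# 'ignore' silently drops characters that ASCII cannot represent
dropped = text.encode("ascii", "ignore").decode("ascii")

# 'replace' substitutes a question mark for each such character
marked = text.encode("ascii", "replace").decode("ascii")

# The default handler, 'strict', raises an exception instead
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(dropped, marked, strict_failed)
```

Using ‘replace’ can be useful when you want to see where characters were removed rather than lose them silently.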


Remove Non ASCII Characters Python

  • In this program, we will discuss how to remove non-ASCII characters in Python 3.
  • Here we can apply the str.encode() method to remove non-ASCII characters from a string. First, create a string that contains some non-ASCII characters. Then call encode() to encode the string to ASCII, and finally call decode() to convert the resulting bytes back into a new string.

Example:

new_str = "¡¢£ Py¼½¾thon is a be¹ºst prog®¯°ramming language±²³."

print("Original string:",new_str)
new_val = new_str.encode("ascii", "ignore")
updated_str = new_val.decode()

print("After removing non-ascii:")
print(updated_str)
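
Note that encoding with ‘ignore’ deletes accented letters entirely. If you would rather keep the base letter (for example, turn é into e), one possible variation is to normalize the string with the standard unicodedata module first. This is a different technique from the one above, shown only as a sketch; the function name to_ascii_keep_base is hypothetical.

```python
import unicodedata

def to_ascii_keep_base(text):
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are non-ASCII, so 'ignore' drops them and keeps the base letter
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii_keep_base("Café Zürich"))
```

Characters that have no ASCII base letter, such as © or →, are still removed.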


Remove Non ASCII Characters Python Regex

  • Let us see how to remove non-ASCII characters in Python using a regular expression (regex).
  • In this program, we will use a regular expression to remove the non-ASCII characters from a string. A regular expression describes a pattern to search for in a string, and Python's ‘re’ module provides regex support.

Source Code:

import re

String_value='JoÂÃÄÅhn iרψs a goωϖℵod b¡¢oy'
print("Original string:",String_value)
new_result = re.sub(r'[^\x00-\x7f]', "", String_value)

print("After removing non-ASCII characters from string: ")
print(new_result)

In the above code, we first import the re module and then create a string in the variable ‘String_value’.

Next, we use the re.sub() function to remove the non-ASCII characters from the string and store the result in the variable ‘new_result’.

Printing ‘new_result’ displays the updated string.
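
If the same pattern is applied to many strings, it can be compiled once and reused. The following sketch wraps it in a small function; the names NON_ASCII and strip_non_ascii, and the replacement parameter, are assumptions made for this example.

```python
import re

# Compile once so the pattern can be reused across many strings
NON_ASCII = re.compile(r'[^\x00-\x7f]')

def strip_non_ascii(text, replacement=""):
    # `replacement` is a hypothetical convenience parameter for this sketch
    return NON_ASCII.sub(replacement, text)

print(strip_non_ascii("b¡¢oy"))
print(strip_non_ascii("b¡¢oy", "?"))
```

Passing a visible replacement such as "?" makes it easier to spot where characters were removed.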


Remove Non ASCII Characters From CSV Python

  • In this section, we will learn how to remove non-ASCII characters from CSV files in Python.
  • Here we read the CSV file with pandas and then apply encode() and decode() to the string values in the DataFrame.

Source Code:

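import pandas as pd

# Encoding kept from the original example; pass the file's real encoding if you know it
data = pd.read_csv('test1.csv', encoding='unicode_escape')

def to_ascii(value):
    # Only strings are cleaned; numbers and missing values pass through unchanged
    if isinstance(value, str):
        return value.encode("ascii", "ignore").decode("ascii")
    return value

# A DataFrame has no encode() method, so apply the helper to every cell, column by column
data = data.apply(lambda col: col.map(to_ascii))

print("After removing non-ascii:")
print(data)

Calling encode() directly on the DataFrame, as in data.encode("ascii", "ignore"), raises an AttributeError because a DataFrame has no encode() method. The code above instead cleans each string value individually. Note that this changes only the DataFrame in memory; the CSV file on disk stays the same unless you write the result back, for example with data.to_csv().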
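
For a plain text or CSV file, the same idea can also be applied without pandas by copying the file through an ASCII writer. The sketch below assumes the source file is UTF-8 encoded; the function name clean_file and the temporary file names are invented for the demonstration.

```python
import os
import tempfile

def clean_file(src_path, dst_path, src_encoding="utf-8"):
    # Writing with encoding='ascii' and errors='ignore' drops non-ASCII characters
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="ascii", errors="ignore", newline="") as dst:
        for line in src:
            dst.write(line)

# Demonstration with a throwaway CSV file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    dst = os.path.join(tmp, "out.csv")
    with open(src, "w", encoding="utf-8", newline="") as f:
        f.write("name,city\nJos\u00e9,M\u00fcnchen\n")
    clean_file(src, dst)
    with open(dst, encoding="ascii") as f:
        cleaned = f.read()

print(cleaned)
```

Writing to a separate output file keeps the original intact, which is usually safer than overwriting it in place.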


Strip Out Non ASCII Characters Python

  • Here we can see how to strip out non-ASCII characters in Python.
  • In this example, we use the re.sub() function with the pattern ‘[^\x00-\x7f]’. The range \x00-\x7f covers the ASCII codes 0 to 127, and the leading ^ negates it, so the pattern matches every character outside that range. re.sub() replaces each match in the input string ‘new_str’ with an empty string, so printing ‘new_result’ displays a string that contains no non-ASCII characters.

Source Code:

import re
new_str='Australia©ª«Germany'

new_result = re.sub(r'[^\x00-\x7f]', "", new_str)
print(new_result)


How to strip out Non-ASCII characters in Python

In this program, we combine the ord() function with a loop (here, a list comprehension) to replace the non-ASCII characters in a string.

In Python, ord() accepts a single character and returns its Unicode code point as an integer. ASCII characters have code points below 128, so comparing ord(m) with 128 tells us whether a character is ASCII.

Example:

new_val = "Mi©ª«chal is a³´µ¶·good b½¾¿oy"
 
new_res = ''.join([m if ord(m) < 128 else ' ' for m in new_val])

print("After replacing non-ASCII characters: ", new_res)

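In the above code, we first create a string ‘new_val’ that contains non-ASCII characters.

The list comprehension keeps each character whose ord() value is below 128 and substitutes a space for every other character; join() then combines the results into the new string. Note that this version replaces non-ASCII characters with spaces rather than deleting them. Use '' instead of ' ' to remove them entirely.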
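
Replacing each character with a space can leave runs of several spaces. One way to tidy this up, sketched here with a hypothetical helper named replace_non_ascii, is to collapse the whitespace afterwards.

```python
def replace_non_ascii(text, filler=" "):
    # Swap every character outside the ASCII range for `filler`
    swapped = "".join(ch if ord(ch) < 128 else filler for ch in text)
    # split() with no arguments splits on any whitespace run, so join() collapses repeats
    return " ".join(swapped.split())

print(replace_non_ascii("Mi©ª«chal is a³´µ¶·good b½¾¿oy"))
```

Be aware that split() also removes leading and trailing whitespace, which may or may not be what you want.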


Pyspark Replace Non ASCII Characters Python

  • In this section, we will learn how to replace non-ASCII characters in PySpark.
  • If you want to run a Python application on Apache Spark, you can use the PySpark library. For example, when an application has to process large datasets, PySpark can distribute the work across a cluster, which can make it faster than pandas for data that does not fit comfortably on one machine.
  • In this example, we import Row from pyspark.sql, which represents a row of data in a DataFrame (it is not used directly here), and then use a SparkContext and a SparkSession to create a DataFrame.

Source Code:

from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import udf

sc = SparkContext('local')
spark = SparkSession(sc)

df = spark.createDataFrame([
    (0, "Pyt¾¿hon Deve¾¿loper"),
    (1, "Jav±²³a Deve±²³loper"),
    (2, "S£¤¥ql Datab£¤¥ase"),
    (3, "Mongodb database")
], ["id", "words"])

def ascii_ignore(x):
    # Drop every character that cannot be encoded as ASCII
    return x.encode('ascii', 'ignore').decode('ascii')

# Wrap the Python function so Spark can apply it to a column
ascii_udf = udf(ascii_ignore)

# Add a new column "foo" holding the cleaned text
df.withColumn("foo", ascii_udf('words')).show()

In the above code, we first create a DataFrame with spark.createDataFrame and put some non-ASCII characters in its ‘words’ column.

We then wrap the ascii_ignore() function in a UDF (user-defined function), which lets Spark apply a Python function to every value in a column.
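
One thing to watch for: Spark passes None to a Python UDF when a cell is null, and calling encode() on None raises an error. A null-safe variant of the function could look like the sketch below. The name ascii_ignore_safe is hypothetical, and the function is shown as plain Python so that it runs without a Spark installation.

```python
def ascii_ignore_safe(value):
    # Spark passes None to a Python UDF for null cells; return it unchanged
    if value is None:
        return None
    return value.encode("ascii", "ignore").decode("ascii")

print(ascii_ignore_safe("Jav±²³a Deve±²³loper"))
print(ascii_ignore_safe(None))
```

It can be wrapped with udf() in the same way as ascii_ignore. Spark also provides a built-in regexp_replace function in pyspark.sql.functions, which may be worth considering because it avoids the overhead of a Python UDF.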


Remove Non-ASCII Characters From Text Python

  • In this section, we will learn how to remove non-ASCII characters from text in Python.
  • Here we can use the replace() method to remove specific non-ASCII characters from a string. In Python, str.replace() is a built-in string method that replaces every occurrence of a given substring with a new string, which can be empty. It removes only the exact text you pass to it, so it suits cases where you already know which characters to remove.

Source Code:

new_ele = "England Germanyℜ←↑→China"
 
new_result = new_ele.replace('ℜ←↑→', '')
print("Removing non-ASCII characters from text : ", new_result)

In the above code, we first create a string ‘new_ele’ and then use str.replace() to replace the specific non-ASCII sequence ‘ℜ←↑→’ with an empty string.

Printing ‘new_result’ displays the new string without that sequence. Any other non-ASCII characters would remain, because replace() matches only the exact substring it is given.
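
When the unwanted characters are known but do not always appear together as one sequence, str.translate() can delete each of them individually. This is a sketch of that alternative; the character set and the sample text are only an illustration.

```python
# Characters to delete; this particular set is only an illustration
unwanted = "ℜ←↑→"

# str.maketrans with a third argument maps each listed character to None (deletion)
table = str.maketrans("", "", unwanted)

text = "Eng→land Ger←many Chiℜna"
print(text.translate(table))
```

Unlike replace(), this removes each listed character wherever it occurs, regardless of order.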


Python Remove Non-ASCII Characters From Bytes

  • In this section, we will learn how to remove non-ASCII characters from bytes in Python.
  • Let us see how to encode a string to bytes with encode() and then replace the non-ASCII bytes using a bytes regular expression.

Source Code:

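import re
new_str = 'Oliva→↓↔↵⇐Elijah'

# Encode to UTF-8 bytes; each non-ASCII character becomes several bytes
m = new_str.encode('utf8')
# Replace every byte outside the ASCII range with a space, using a bytes pattern
z = re.sub(rb'[^\x00-\x7f]', rb' ', m)
print(z)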
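
If the goal is to drop the non-ASCII bytes rather than replace them with spaces, a regular expression is not strictly needed. The following sketch shows two alternatives that use only built-in bytes operations; the variable names are chosen for illustration.

```python
raw = 'Oliva→↓↔↵⇐Elijah'.encode('utf8')

# Iterating over bytes yields integers, so keep only values below 128
filtered = bytes(b for b in raw if b < 128)

# Decoding with errors='ignore' skips bytes that are not valid ASCII
decoded = raw.decode('ascii', 'ignore')

print(filtered)
print(decoded)
```

The first form keeps the result as bytes, while the second returns a str.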


In this tutorial, we have learned how to remove non-ASCII characters in Python. We covered these topics:

  • Remove Non ASCII Characters Python Pandas
  • Remove Non ASCII Characters Python
  • Remove Non ASCII Characters Python Regex
  • Remove Non ASCII Characters From CSV Python
  • Remove Non ASCII Characters From File Python
  • Strip Out Non ASCII Characters Python
  • Pyspark Replace Non ASCII Characters Python
  • Remove Non-ASCII Characters From Text Python
  • Python Remove Non-ASCII Characters From Bytes