How to Read Large CSV Files in Python?

I faced the challenge of reading large CSV files while working on a project that involved analyzing millions of rows of sales data from various states across the USA. In this tutorial, I will explain how to read large CSV files in Python and share the techniques and tools that helped me overcome these challenges.

Read Large CSV Files in Python

Reading large CSV files can be problematic due to memory constraints and processing time. For instance, loading a CSV file with millions of rows and hundreds of columns into memory can cause your system to slow down or even crash. Additionally, the time required to process such large datasets can be significant.
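Before choosing an approach, it can help to check how big the file actually is on disk. This is a minimal sketch using only the standard library; the file name matches the sample used throughout this tutorial, and the snippet writes a tiny stand-in file so it runs on its own:

```python
import csv
import os

filename = 'large_sales_data.csv'

# Write a small sample file so this snippet is self-contained
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Date', 'Sales', 'Location'])
    writer.writerow(['2024-01-01', '500', 'New York'])

# File size on disk, converted from bytes to megabytes
size_mb = os.path.getsize(filename) / (1024 * 1024)
print(f'{filename} is {size_mb:.4f} MB')
```

As a rough rule of thumb, a DataFrame can take several times the on-disk size once loaded, so a file approaching your available RAM is a good candidate for the chunked or dask-based approaches below.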

1. Use the CSV Module

The built-in csv module in Python is a simple way to read CSV files. However, it may not be the most efficient for very large files.

Suppose we have the following data in a CSV file named large_sales_data.csv:

Date,Sales,Location
2024-01-01,500,New York
2024-01-02,750,Chicago
The following code streams the file row by row:

import csv

filename = 'large_sales_data.csv'

with open(filename, mode='r', newline='') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        print(row)

Because csv.reader yields one row at a time, its memory footprint stays small, but the pure-Python parsing is slow for very large files, and you have to build any analysis logic yourself.
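If you prefer rows keyed by column name instead of positional lists, csv.DictReader streams the file in exactly the same way. This sketch also totals the Sales column as a simple example of row-by-row processing; it recreates the sample file inline so it is self-contained:

```python
import csv

filename = 'large_sales_data.csv'

# Write the sample data so this snippet runs on its own
with open(filename, 'w', newline='') as f:
    f.write('Date,Sales,Location\n')
    f.write('2024-01-01,500,New York\n')
    f.write('2024-01-02,750,Chicago\n')

total = 0
with open(filename, newline='') as f:
    for row in csv.DictReader(f):
        total += int(row['Sales'])  # access fields by header name

print(total)  # prints 1250
```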

2. Use pandas for Large CSV Files

pandas is a powerful library for data manipulation and analysis. It provides the read_csv function, which is more efficient than the csv module. Here’s how you can use it:

import pandas as pd

filename = 'large_sales_data.csv'
data = pd.read_csv(filename)
print(data.head())

pandas handles large files better, but you might still run into memory issues with extremely large datasets.
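One way to curb pandas' memory use is to load only the columns you need and choose compact dtypes, using the usecols and dtype parameters of read_csv. A sketch, reusing the sample columns from above (the specific dtypes are just illustrative choices):

```python
import pandas as pd

filename = 'large_sales_data.csv'

# Write the sample data so this snippet runs on its own
with open(filename, 'w', newline='') as f:
    f.write('Date,Sales,Location\n')
    f.write('2024-01-01,500,New York\n')
    f.write('2024-01-02,750,Chicago\n')

# Skip the Date column, use a 32-bit integer for Sales, and store
# the low-cardinality Location column as a pandas categorical
data = pd.read_csv(
    filename,
    usecols=['Sales', 'Location'],
    dtype={'Sales': 'int32', 'Location': 'category'},
)
print(data.dtypes)
```

Dropping unneeded columns and downcasting numeric types can cut memory use substantially on wide files with many repeated string values.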

3. Optimize Memory Usage with dask

dask is a parallel computing library that integrates seamlessly with pandas. It allows you to process large datasets that don’t fit into memory by breaking them into smaller chunks. Here’s an example:

import dask.dataframe as dd

filename = 'large_sales_data.csv'
data = dd.read_csv(filename)
print(data.head())

dask reads the CSV file in chunks, enabling you to work with datasets larger than your system’s memory.

4. Read CSV Files in Chunks

Another approach to handle large CSV files is to read them in chunks using pandas. This method allows you to process the file in smaller, more manageable pieces. Here’s an example:

import pandas as pd

filename = 'large_sales_data.csv'
chunksize = 100000  # Number of rows per chunk

for chunk in pd.read_csv(filename, chunksize=chunksize):
    # Process each chunk
    print(chunk.head())

By processing the file in chunks, you can significantly reduce memory usage and avoid crashes.
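Chunked reading pairs naturally with running aggregates: process each chunk, fold its result into a total, and let the chunk be garbage-collected. This sketch sums the Sales column across chunks; the tiny chunksize of 1 just demonstrates the mechanics on the sample data:

```python
import pandas as pd

filename = 'large_sales_data.csv'

# Write the sample data so this snippet runs on its own
with open(filename, 'w', newline='') as f:
    f.write('Date,Sales,Location\n')
    f.write('2024-01-01,500,New York\n')
    f.write('2024-01-02,750,Chicago\n')

total_sales = 0
for chunk in pd.read_csv(filename, chunksize=1):
    # Each chunk is an ordinary DataFrame; aggregate it, then
    # let it go out of scope before the next chunk is read
    total_sales += chunk['Sales'].sum()

print(total_sales)  # 1250
```

In practice, a chunksize in the tens or hundreds of thousands of rows balances memory use against per-chunk overhead.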

5. Parallel Processing with multiprocessing

If you need to speed up the processing of large CSV files, you can use the multiprocessing module to read and process the file in parallel. Here’s an example:

import pandas as pd
from multiprocessing import Pool

def process_chunk(chunk):
    # Process each chunk (replace with your own logic)
    print(chunk.head())

# The __main__ guard lets worker processes be spawned safely
# on platforms that use the "spawn" start method (Windows, macOS)
if __name__ == '__main__':
    filename = 'large_sales_data.csv'
    chunksize = 100000

    # read_csv with chunksize returns an iterator of DataFrames
    chunks = pd.read_csv(filename, chunksize=chunksize)

    # Distribute the chunks across a pool of worker processes
    with Pool() as pool:
        pool.map(process_chunk, chunks)

This method leverages multiple CPU cores to process the file faster. Keep in mind that each chunk is pickled and sent to a worker process, so the approach pays off mainly when the per-chunk work is CPU-bound.

Conclusion

In this tutorial, I helped you learn how to read large CSV files in Python. I explained several methods to achieve this: using the built-in csv module, using pandas, optimizing memory usage with dask, reading CSV files in chunks, and parallel processing with multiprocessing.
