How to Create a Scatter Plot in Pandas

As a Python developer who has spent years wrangling data, I’ve found that nothing reveals the relationship between two variables faster than a scatter plot.

Whether I am analyzing housing prices in California or tracking tech stock trends on the NASDAQ, a scatter plot is my go-to tool for spotting outliers.

In this tutorial, I will show you exactly how to generate scatter plots directly from your Pandas DataFrames using various methods I use in my daily workflow.

The Basics of Pandas Scatter Plots

Before we get into the advanced styling, let’s look at the simplest way to get a plot on your screen.

Pandas has a built-in .plot() method that wraps around Matplotlib, making it incredibly convenient for quick data exploration.

For our examples, let’s use a dataset representing different cities in the USA, looking at their population density and average rent prices.

import pandas as pd
import matplotlib.pyplot as plt

# Sample data: US City Metrics
data = {
    'City': ['New York', 'San Francisco', 'Austin', 'Seattle', 'Miami', 'Chicago', 'Denver', 'Boston'],
    'Population_Density': [27012, 18633, 3006, 8791, 12239, 11847, 4674, 13976],
    'Avg_Rent': [3800, 3500, 1800, 2200, 2500, 2100, 1900, 3100]
}

df = pd.DataFrame(data)

# The standard kind='scatter' approach
df.plot(kind='scatter', x='Population_Density', y='Avg_Rent', color='blue', title='US City Density vs Rent')

plt.show()

In my experience, kind='scatter' is a readable, explicit way to write this, which helps with long-term maintenance.

Method 1: Use the DataFrame.plot.scatter() Wrapper

While kind='scatter' works great, Pandas also provides a direct accessor called .plot.scatter().

I personally prefer this syntax because it feels more “Pythonic” and allows for better IDE autocompletion.

In this example, let’s look at a US-specific scenario: The relationship between years of experience and annual salary in the American tech industry.

import pandas as pd
import matplotlib.pyplot as plt

# Tech Salary Data in the USA
salary_data = {
    'Years_Experience': [1, 2, 3, 5, 8, 10, 12, 15, 20],
    'Annual_Salary_USD': [75000, 82000, 95000, 120000, 155000, 175000, 210000, 240000, 300000]
}

df_salary = pd.DataFrame(salary_data)

# Creating the plot using the direct scatter method
df_salary.plot.scatter(x='Years_Experience', 
                       y='Annual_Salary_USD', 
                       grid=True, 
                       figsize=(10, 6),
                       title='Tech Salary Growth in the USA')

plt.ylabel('Salary (USD)')
plt.xlabel('Years of Experience')
plt.show()

Adding grid=True is a small tip I always recommend. It makes it much easier for your stakeholders to pinpoint specific data points on the Y-axis.
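To show that the two spellings are interchangeable, here is a minimal sketch that reuses a few rows of the city data from earlier. Both calls return a Matplotlib Axes object, and Pandas fills in the axis labels from the column names. The variable names ax_kind and ax_accessor are only illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Population_Density': [27012, 18633, 3006, 8791],
    'Avg_Rent': [3800, 3500, 1800, 2200],
})

# Same plot, two spellings
ax_kind = df.plot(kind='scatter', x='Population_Density', y='Avg_Rent')
ax_accessor = df.plot.scatter(x='Population_Density', y='Avg_Rent', grid=True)

# Both return a Matplotlib Axes, so any Axes method works afterwards
ax_accessor.set_title('US City Density vs Rent')
```

Call plt.show() as in the examples above to display the figures.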

Method 2: Add Color Maps (c) and Point Size (s)

When I am dealing with multi-dimensional data, a simple 2D plot isn’t enough. I often use color and size to represent a third and fourth variable.

Suppose we are looking at US gas stations. We want to see the relationship between the number of pumps (X) and daily customers (Y), color-code each station by gas price, and size each marker by the store's square footage.

# US Gas Station Performance
station_data = {
    'Pumps': [4, 8, 12, 16, 6, 10, 14, 20],
    'Daily_Customers': [200, 450, 700, 1100, 310, 600, 850, 1500],
    'Gas_Price': [3.10, 3.45, 3.80, 4.10, 3.20, 3.60, 3.95, 4.50],
    'Store_Size_SQFT': [500, 1500, 2500, 4000, 800, 2000, 3000, 5000]
}

df_stations = pd.DataFrame(station_data)

# Using 'c' for color and 's' for size
df_stations.plot.scatter(x='Pumps', 
                         y='Daily_Customers', 
                         c='Gas_Price', 
                         s=df_stations['Store_Size_SQFT'] * 0.1, 
                         colormap='viridis', 
                         alpha=0.7,
                         title='US Gas Station Analytics')

plt.show()

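In this code, the Store_Size_SQFT column is multiplied by 0.1 before being passed to the s parameter (size). I do this because s is interpreted as the marker area in points squared, so the raw square footage values would produce markers that are far too large.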
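A fixed multiplier like 0.1 only works for one particular range of values. As an alternative, here is a rough sketch that rescales any column into a chosen range of marker areas. The helper scale_marker_area and its default limits of 20 and 400 are my own assumptions for illustration, not part of Pandas or Matplotlib.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df_stations = pd.DataFrame({
    'Pumps': [4, 8, 12, 16, 6, 10, 14, 20],
    'Daily_Customers': [200, 450, 700, 1100, 310, 600, 850, 1500],
    'Store_Size_SQFT': [500, 1500, 2500, 4000, 800, 2000, 3000, 5000],
})

def scale_marker_area(values, min_area=20, max_area=400):
    """Hypothetical helper: linearly rescale values to marker areas (points squared)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values identical: use one mid-sized marker for every point
        return np.full(values.shape, (min_area + max_area) / 2)
    return min_area + (values - values.min()) / span * (max_area - min_area)

sizes = scale_marker_area(df_stations['Store_Size_SQFT'])
ax = df_stations.plot.scatter(x='Pumps', y='Daily_Customers', s=sizes, alpha=0.7)
```

The smallest store gets the minimum area and the largest store gets the maximum, whatever units the column is in.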

Method 3: Group and Plot Multiple Series

Often, I need to compare two groups in the same chart, such as “New York” vs. “Texas”.

The trick here is to capture the Axes object (ax) returned by the first plot call and pass it into the subsequent plot calls with ax=ax. This overlays the data on the same chart.

# Comparing Two States: High School Scores
ny_scores = {'Math': [88, 92, 95, 78], 'Science': [90, 85, 99, 82]}
tx_scores = {'Math': [75, 80, 85, 90], 'Science': [70, 78, 88, 92]}

df_ny = pd.DataFrame(ny_scores)
df_tx = pd.DataFrame(tx_scores)

# Overlaying two scatter plots
ax = df_ny.plot.scatter(x='Math', y='Science', color='blue', label='New York')
df_tx.plot.scatter(x='Math', y='Science', color='red', label='Texas', ax=ax)

plt.title('Comparison of Student Scores by State')
plt.show()

I find this method invaluable when I am presenting competitive analysis or A/B testing results to my team.
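When both groups live in a single DataFrame, the same overlay idea can be combined with groupby(). The sketch below assumes a State column that I added for illustration, along with a hypothetical colors lookup. Each group is drawn onto one shared Axes.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same scores as above, stored in one DataFrame with an assumed 'State' column
df_scores = pd.DataFrame({
    'State': ['New York'] * 4 + ['Texas'] * 4,
    'Math': [88, 92, 95, 78, 75, 80, 85, 90],
    'Science': [90, 85, 99, 82, 70, 78, 88, 92],
})

colors = {'New York': 'blue', 'Texas': 'red'}  # illustrative color lookup

fig, ax = plt.subplots()
for state, group in df_scores.groupby('State'):
    # Passing ax=ax draws every group on the same chart
    group.plot.scatter(x='Math', y='Science', color=colors[state], label=state, ax=ax)

ax.set_title('Comparison of Student Scores by State')
```

This scales to any number of groups, as long as the lookup has a color for each one.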

Handle Overlapping Data with Alpha and Jitter

One common headache I face is “overplotting,” where many dots land on the exact same coordinate. This hides the true density of the data.

To solve this, I use the alpha parameter to make dots semi-transparent.

If I have hundreds of retail transactions at a US Walmart location that share the same price point, a low alpha value such as 0.3 will show darker clusters where more shoppers are buying.

# Large dataset simulation
import numpy as np

# Simulating 500 shoppers in a US store
np.random.seed(42)
df_shoppers = pd.DataFrame({
    'Items_Bought': np.random.randint(1, 15, 500),
    'Total_Spent': np.random.randint(10, 200, 500)
})

df_shoppers.plot.scatter(x='Items_Bought', y='Total_Spent', alpha=0.3, color='forestgreen')
plt.title('US Retail Shopping Patterns (Alpha Transparency)')
plt.show()
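Transparency alone cannot separate points that sit on exactly the same integer value, which is where jitter comes in. Pandas has no built-in jitter option that I know of, so this sketch adds a small amount of uniform random noise to the X values by hand. The Items_Jittered column and the jitter_width of 0.3 are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
df_shoppers = pd.DataFrame({
    'Items_Bought': rng.integers(1, 15, 500),
    'Total_Spent': rng.integers(10, 200, 500),
})

# Add uniform noise in [-0.3, 0.3) so points stacked on one integer spread out
jitter_width = 0.3
df_shoppers['Items_Jittered'] = (
    df_shoppers['Items_Bought']
    + rng.uniform(-jitter_width, jitter_width, len(df_shoppers))
)

ax = df_shoppers.plot.scatter(x='Items_Jittered', y='Total_Spent',
                              alpha=0.3, color='forestgreen')
ax.set_xlabel('Items Bought (jittered)')
```

Keep the jitter width below 0.5 for integer data so that neighboring values do not blend into each other, and label the axis so readers know the positions are perturbed.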

Customize Your Scatter Plot for Professional Reports

If you are publishing your findings on a blog or a company report, the default Matplotlib styles can look a bit dated.

I always take a few extra seconds to clean up the labels and add a professional color palette.

Here is a full example that creates a “Publication-Ready” scatter plot using US Census-style data on Education vs. Income.

import pandas as pd
import matplotlib.pyplot as plt

# US Education and Income Data
data = {
    'Education_Years': [12, 14, 16, 18, 20, 12, 16, 18, 14, 16],
    'Annual_Income_K': [45, 55, 85, 120, 150, 42, 90, 115, 60, 88],
    'State_Tax_Rate': [0.05, 0.06, 0.04, 0.07, 0.05, 0.03, 0.08, 0.06, 0.04, 0.05]
}

df_edu = pd.DataFrame(data)

# Advanced customization
ax = df_edu.plot.scatter(
    x='Education_Years', 
    y='Annual_Income_K', 
    c='State_Tax_Rate', 
    colormap='RdYlGn', 
    s=100, 
    edgecolor='black', 
    linewidth=1,
    sharex=False  # Works around older pandas versions hiding the X-axis label when a colorbar is drawn
)

ax.set_title('Impact of Education on Income in the USA', fontsize=14, pad=20)
ax.set_xlabel('Years of Education', fontsize=12)
ax.set_ylabel('Annual Income (in $1,000s)', fontsize=12)

plt.tight_layout()
plt.show()

Using edgecolor='black' makes the points pop, especially when the markers span many different colors.
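If the default look still feels dated, one option is to apply one of Matplotlib's bundled style sheets. This is a small sketch using the 'ggplot' style with a trimmed-down version of the education data. The choice of style is mine, not a recommendation from any particular style guide.

```python
import pandas as pd
import matplotlib.pyplot as plt

df_edu = pd.DataFrame({
    'Education_Years': [12, 14, 16, 18, 20],
    'Annual_Income_K': [45, 55, 85, 120, 150],
})

# The style applies only to figures created inside the with-block
with plt.style.context('ggplot'):
    ax = df_edu.plot.scatter(x='Education_Years', y='Annual_Income_K',
                             s=100, edgecolor='black')
    ax.set_title('Impact of Education on Income in the USA')
```

You can list the styles installed on your machine with plt.style.available.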

Why use Pandas over Matplotlib for Scatter Plots?

You might wonder why I don’t just use plt.scatter() every time.

The reason is development speed, not execution speed. When your data is already in a DataFrame, Pandas automatically labels the axes with the column names and builds the legend from the label you pass.

If I use Matplotlib directly, I have to manually extract each column and write extra lines of code for the legend. Pandas does the “heavy lifting” so I can focus on the analysis.
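To make the comparison concrete, here is a side-by-side sketch using part of the salary data from Method 1. The Matplotlib version sets the axis labels by hand, while the Pandas version picks them up from the column names. The names ax_mpl and ax_pd are just for this illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

df_salary = pd.DataFrame({
    'Years_Experience': [1, 2, 3, 5, 8, 10],
    'Annual_Salary_USD': [75000, 82000, 95000, 120000, 155000, 175000],
})

# Matplotlib directly: extract each column and label the axes by hand
fig, ax_mpl = plt.subplots()
ax_mpl.scatter(df_salary['Years_Experience'], df_salary['Annual_Salary_USD'])
ax_mpl.set_xlabel('Years_Experience')
ax_mpl.set_ylabel('Annual_Salary_USD')

# Pandas: one call, axis labels come from the column names
ax_pd = df_salary.plot.scatter(x='Years_Experience', y='Annual_Salary_USD')
```

Both approaches end up with a Matplotlib Axes, so you can mix them freely once the plot exists.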

Common Errors to Avoid

Throughout my years of coding, I’ve seen beginners trip up on the same few things. Here is how to avoid them:

Missing 'x' or 'y': Unlike some other plots, a scatter plot requires both X and Y columns. If you miss one, df.plot(kind='scatter') raises a ValueError, while df.plot.scatter() raises a TypeError for the missing argument.
Non-Numeric Data: Numbers stored as strings will not plot as numeric values. I always ensure my columns are float or int before plotting. Use df['col'] = pd.to_numeric(df['col']) if you get an error.
The Colorbar Fix: Sometimes when you add a colorbar, the X-axis label disappears. This is a known issue in older pandas versions, and adding sharex=False inside the plot function works around it.
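The first two pitfalls can be reproduced in a few lines. This sketch uses a made-up DataFrame in which the customer counts are stored as text, including one unparseable entry. It records which exception each call style raises when y is missing, then cleans the column with pd.to_numeric(). The errors='coerce' option is my addition here. It turns values that cannot be parsed into NaN instead of raising.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Pumps': [4, 8, 12, 16],
    'Daily_Customers': ['200', '450', 'n/a', '1100'],  # numbers stored as text
})

# Missing y: record which exception each call style raises
missing_y_errors = []
for call in (lambda: df.plot(kind='scatter', x='Pumps'),
             lambda: df.plot.scatter(x='Pumps')):
    try:
        call()
    except (ValueError, TypeError) as exc:
        missing_y_errors.append(type(exc).__name__)

# Convert text to numbers; errors='coerce' turns unparseable entries into NaN
df['Daily_Customers'] = pd.to_numeric(df['Daily_Customers'], errors='coerce')
ax = df.plot.scatter(x='Pumps', y='Daily_Customers')
```

Rows that become NaN are simply left off the chart, so it is worth checking df['Daily_Customers'].isna().sum() before trusting the plot.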

Summary of Scatter Plot Methods

| Method | Best For | Difficulty |
| --- | --- | --- |
| `df.plot(kind='scatter')` | Quick data checks | Easy |
| `df.plot.scatter()` | Standard development | Easy |
| `ax` parameter overlay | Comparing two datasets | Intermediate |
| `c` and `s` arguments | Multidimensional analysis | Intermediate |

I hope you found this tutorial helpful! Creating scatter plots in Pandas is an essential skill for any data scientist working with Python.

Whether you are looking at US economic trends or personal health data, these techniques will help you visualize your data clearly and effectively.

Bijay Kumar

Bijay Kumar is an experienced Python and AI professional who enjoys helping developers learn modern technologies through practical tutorials and examples. His expertise includes Python development, Machine Learning, Artificial Intelligence, automation, and data analysis using libraries like Pandas, NumPy, TensorFlow, Matplotlib, SciPy, and Scikit-Learn. At PythonGuides.com, he shares in-depth guides designed for both beginners and experienced developers. More about us.
