Convert a DataFrame to Matrix in Python (4 Methods)

While I was working on a machine learning project, I needed to feed my Pandas DataFrame into a model that required a matrix format. This is a common scenario in data science: you’ve done all your data wrangling in Pandas, but now you need to convert that structured DataFrame into a matrix for mathematical operations.

Python offers various easy ways to convert a DataFrame to a matrix. In this tutorial, I will walk you through the most effective methods I’ve used in my decade of Python development.

Let’s get into these methods with practical examples using real-world data!

Matrix vs. DataFrame

Before we start converting, let me clarify what we’re talking about:

A DataFrame is a 2D labeled data structure whose columns can each hold a different type; think of it as a spreadsheet or SQL table.

A Matrix is a rectangular array of numbers arranged in rows and columns, typically used for mathematical operations.

The key difference? DataFrames have labels and can hold mixed data types, while matrices are typically numeric and without labels.

1. Use the to_numpy() Method

The pandas to_numpy() method is my go-to approach for converting a DataFrame to a matrix. It’s clean, efficient, and part of the Pandas API.

Let’s see it in action with a simple stock portfolio example:

import pandas as pd

# Sample stock data
df = pd.DataFrame({
    'Stock': ['AAPL', 'MSFT', 'AMZN', 'GOOGL', 'TSLA'],
    'Price': [185.92, 420.55, 178.35, 173.52, 237.91],
    'Shares': [10, 5, 8, 12, 7]
})

# Convert only numeric columns to NumPy array
numeric_array = df[['Price', 'Shares']].to_numpy()

print(numeric_array)

Output:

[[185.92  10.  ]
 [420.55   5.  ]
 [178.35   8.  ]
 [173.52  12.  ]
 [237.91   7.  ]]

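The to_numpy() method returns a NumPy ndarray, which for a DataFrame is a 2D array that you can use as a matrix. Notice how I selected only the Price and Shares columns with df[['Price', 'Shares']], so the result is purely numeric. Because Price holds floats and Shares holds integers, the whole array is upcast to floats.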

This method is recommended by the pandas documentation as the preferred way to obtain a NumPy array from a DataFrame.
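As a small sketch of the optional parameters, assuming the same stock data as above, to_numpy() also accepts dtype and copy arguments. The variable names here are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Stock": ["AAPL", "MSFT", "AMZN", "GOOGL", "TSLA"],
    "Price": [185.92, 420.55, 178.35, 173.52, 237.91],
    "Shares": [10, 5, 8, 12, 7],
})
numeric = df[["Price", "Shares"]]

# Default: pandas picks a common dtype (float64 here, since ints and floats mix)
default_array = numeric.to_numpy()

# Request a specific dtype, e.g. float32 to halve the memory footprint
float32_array = numeric.to_numpy(dtype="float32")

# copy=True guarantees the array does not share memory with the DataFrame
independent = numeric.to_numpy(copy=True)
independent[0, 0] = 0.0  # changes only the array, not df

print(default_array.dtype, float32_array.dtype)
print(df.loc[0, "Price"])
```

Asking for copy=True is a reasonable habit whenever you plan to modify the array in place and want the DataFrame left untouched.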

2. Use the values Attribute

The values attribute is another common way to convert a DataFrame to a matrix. While slightly older than to_numpy(), it’s still widely used:

# Using the values attribute
matrix_values = df.values
print("\nMatrix using .values attribute:")
print(matrix_values)

# Only numeric columns
numeric_matrix_values = df.iloc[:, 1:].values
print("\nNumeric Matrix using .values:")
print(numeric_matrix_values)

Output:

Matrix using .values attribute:
[['AAPL' 185.92 10]
 ['MSFT' 420.55 5]
 ['AMZN' 178.35 8]
 ['GOOGL' 173.52 12]
 ['TSLA' 237.91 7]]

Numeric Matrix using .values:
[[185.92  10.  ]
 [420.55   5.  ]
 [178.35   8.  ]
 [173.52  12.  ]
 [237.91   7.  ]]

While values and to_numpy() often produce the same result, there’s an important difference: to_numpy() gives you more control over the output through parameters like dtype, copy, and na_value, and it handles pandas extension arrays (such as nullable integers) better.
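To illustrate the extension-array point, here is a minimal sketch that assumes a hypothetical DataFrame whose Shares column uses the nullable Int64 dtype. The default conversion of such columns has varied across pandas versions, so the example states dtype and na_value explicitly instead of relying on defaults:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last share count is missing (pd.NA)
df_nullable = pd.DataFrame({
    "Price": [185.92, 420.55, 178.35],
    "Shares": pd.array([10, 5, None], dtype="Int64"),
})

# Convert the nullable column to plain floats, mapping pd.NA to NaN
shares_float = df_nullable["Shares"].to_numpy(dtype="float64", na_value=np.nan)

# Stack the columns into one float matrix
matrix = np.column_stack([df_nullable["Price"].to_numpy(), shares_float])

print(matrix)
print(matrix.dtype)
```

The values attribute has no equivalent of na_value, which is one practical reason to prefer to_numpy() when missing values are involved.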

3. Create a Matrix with NumPy Functions

Sometimes you need more control over how the matrix is constructed. NumPy’s own functions can help with this:

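import numpy as np

# Using NumPy's array function (copies the data by default)
np_matrix = np.array(df.iloc[:, 1:])
print("\nMatrix using NumPy array function:")
print(np_matrix)

# Creating a matrix of a specific type
# Note: NumPy's documentation discourages np.matrix in new code;
# it is shown here only because some older libraries still expect it
float_matrix = np.matrix(df.iloc[:, 1:].values, dtype=float)
print("\nFloat Matrix:")
print(float_matrix)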

Output:

Matrix using NumPy array function:
[[185.92  10.  ]
 [420.55   5.  ]
 [178.35   8.  ]
 [173.52  12.  ]
 [237.91   7.  ]]

Float Matrix:
[[185.92  10.  ]
 [420.55   5.  ]
 [178.35   8.  ]
 [173.52  12.  ]
 [237.91   7.  ]]

The NumPy approach gives you flexibility in specifying the data type and structure of your matrix. This is particularly useful when you’re working with algorithms that expect specific matrix formats.

When I was building a recommendation system for an e-commerce site, I needed to convert user-item interaction data into a specific matrix format for collaborative filtering. The NumPy approach gave me the precise control I needed.
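Since NumPy’s documentation no longer recommends the np.matrix class, a regular ndarray combined with the @ operator is usually enough for matrix math. The following sketch assumes the stock DataFrame from earlier and uses np.asarray() to fix the dtype:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Stock": ["AAPL", "MSFT", "AMZN", "GOOGL", "TSLA"],
    "Price": [185.92, 420.55, 178.35, 173.52, 237.91],
    "Shares": [10, 5, 8, 12, 7],
})

# 1D arrays: the dot product gives the total portfolio value
prices = df["Price"].to_numpy()
shares = df["Shares"].to_numpy()
portfolio_value = prices @ shares

# 2D array with an explicit float dtype
numeric = np.asarray(df[["Price", "Shares"]], dtype=float)

# Matrix product of the transpose with the array itself (a 2x2 result)
gram = numeric.T @ numeric

print(f"Portfolio value: {portfolio_value:.2f}")
print(gram.shape)
```

With ndarrays, * stays element-wise and @ performs matrix multiplication, which avoids the surprises that np.matrix introduces by redefining *.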

4. Convert to Sparse Matrix

For large datasets with many zero values (like in natural language processing or recommendation systems), a sparse matrix can be more memory efficient:

import numpy as np
import pandas as pd
from scipy import sparse

# Sample large, sparse DataFrame (mostly zeros)
sparse_data = pd.DataFrame(np.zeros((1000, 1000)))
sparse_data.iloc[10:15, 10:15] = np.random.rand(5, 5)  # Add some non-zero values

# Convert to sparse matrix
sparse_matrix = sparse.csr_matrix(sparse_data.values)
print("\nSparse Matrix Info:")
print(f"Shape: {sparse_matrix.shape}")
print(f"Non-zero elements: {sparse_matrix.count_nonzero()}")
print(f"Memory usage: {sparse_matrix.data.nbytes + sparse_matrix.indptr.nbytes + sparse_matrix.indices.nbytes} bytes")

# Compare to dense matrix memory
dense_memory = sparse_data.values.nbytes
print(f"Dense matrix memory: {dense_memory} bytes")
print(f"Memory savings: {dense_memory - (sparse_matrix.data.nbytes + sparse_matrix.indptr.nbytes + sparse_matrix.indices.nbytes)} bytes")

I’ve used this approach when working with text data for a sentiment analysis project. The document-term matrix was extremely sparse (mostly zeros), and using a sparse matrix format saved gigabytes of memory.
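In practice, sparse data often arrives in long format (one row per non-zero entry), and building a dense DataFrame first would defeat the purpose. The sketch below assumes a hypothetical user-item ratings table and builds the sparse matrix directly from coordinates; the column names and values are made up for illustration:

```python
import pandas as pd
from scipy import sparse

# Hypothetical long-format interactions: one row per (user, item) rating
interactions = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3", "u3"],
    "item": ["book", "lamp", "book", "desk", "lamp"],
    "rating": [5.0, 3.0, 4.0, 2.0, 1.0],
})

# Map each label to an integer position (codes follow order of first appearance)
user_codes, user_labels = pd.factorize(interactions["user"])
item_codes, item_labels = pd.factorize(interactions["item"])

# Build the matrix from (value, (row, column)) triples; duplicates would be summed
matrix = sparse.csr_matrix(
    (interactions["rating"].to_numpy(), (user_codes, item_codes)),
    shape=(len(user_labels), len(item_labels)),
)

print(matrix.shape)
print(matrix.nnz)
print(matrix[0, 1])  # rating that u1 gave to lamp
```

Keeping user_labels and item_labels around lets you translate row and column positions back to the original names later.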

Real-World Application: Machine Learning Preprocessing

One of the most common reasons to convert a DataFrame to a matrix is for machine learning. Here’s an example using a small sample of housing price data:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Sample housing data
housing_data = {
    'Area_sqft': [1200, 1500, 1800, 1100, 2200, 2500, 1900, 3000],
    'Bedrooms': [2, 3, 3, 2, 4, 4, 3, 5],
    'Age_years': [15, 10, 5, 20, 7, 3, 12, 8],
    'Distance_downtown_miles': [5.2, 4.7, 7.8, 2.3, 8.5, 9.2, 5.1, 12.7],
    'Price_USD': [250000, 320000, 380000, 210000, 450000, 520000, 390000, 650000]
}

df_housing = pd.DataFrame(housing_data)

# Split features and target
X_df = df_housing.drop('Price_USD', axis=1)
y_df = df_housing['Price_USD']

# Convert to matrices for scikit-learn
X_matrix = X_df.to_numpy()
y_matrix = y_df.to_numpy()

# Split data and train model
X_train, X_test, y_train, y_test = train_test_split(X_matrix, y_matrix, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print(f"Model coefficients: {model.coef_}")
print(f"Model intercept: {model.intercept_}")
print(f"R² score: {model.score(X_test, y_test)}")

In this example, I converted the feature DataFrame and target Series to NumPy arrays before feeding them into the scikit-learn model. This is a typical workflow in data science projects.
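One thing to keep in mind is that to_numpy() drops the row and column labels. A minimal sketch of keeping them around, assuming a cut-down version of the housing features and a hand-written standardization step, looks like this:

```python
import numpy as np
import pandas as pd

# Cut-down feature table, for illustration only
X_df = pd.DataFrame({
    "Area_sqft": [1200, 1500, 1800, 1100],
    "Bedrooms": [2, 3, 3, 2],
})

# Labels are lost here: X is a plain float array
X = X_df.to_numpy(dtype=float)

# Standardize each column (zero mean, unit variance) with plain NumPy
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Reattach the original labels to get a DataFrame back
X_scaled_df = pd.DataFrame(X_scaled, columns=X_df.columns, index=X_df.index)

print(X_scaled_df)
```

The same pattern works for model outputs: pairing X_df.columns with an array of coefficients makes the numbers much easier to interpret.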

Work with Mixed Data Types

Sometimes your DataFrame contains non-numeric data that you want to include in your matrix. Here’s how to handle that:

# Original DataFrame with mixed types
print("\nOriginal DataFrame with mixed types:")
print(df)

# Convert to matrix with object dtype (preserves strings)
mixed_matrix = df.to_numpy()
print("\nMixed-type matrix:")
print(mixed_matrix)
print(f"Data type: {mixed_matrix.dtype}")

# If you need a truly numeric matrix, you'll need to encode categorical data
from sklearn.preprocessing import LabelEncoder

# Create a copy to avoid modifying the original
df_encoded = df.copy()
encoder = LabelEncoder()
df_encoded['Stock'] = encoder.fit_transform(df_encoded['Stock'])

numeric_matrix = df_encoded.to_numpy()
print("\nEncoded matrix (all numeric):")
print(numeric_matrix)
print(f"Data type: {numeric_matrix.dtype}")

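Encoding like this is necessary when working with machine learning algorithms that require all-numeric input. Keep in mind that LabelEncoder assigns arbitrary integer codes and is intended by scikit-learn for target labels; for feature columns, the codes imply an ordering that usually does not exist, so one-hot encoding is often the safer choice.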
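As one alternative to integer codes, here is a short sketch using pd.get_dummies() to one-hot encode the Stock column before conversion. It assumes the same stock DataFrame as above:

```python
import pandas as pd

df = pd.DataFrame({
    "Stock": ["AAPL", "MSFT", "AMZN", "GOOGL", "TSLA"],
    "Price": [185.92, 420.55, 178.35, 173.52, 237.91],
    "Shares": [10, 5, 8, 12, 7],
})

# Replace the Stock column with one indicator column per ticker
df_onehot = pd.get_dummies(df, columns=["Stock"], dtype=float)

# All columns are numeric now, so the result is a float matrix
onehot_matrix = df_onehot.to_numpy()

print(df_onehot.columns.tolist())
print(onehot_matrix.shape)
print(onehot_matrix.dtype)
```

Each row has exactly one 1.0 among the indicator columns, so no artificial ordering between tickers is introduced.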

Performance Considerations

When working with large datasets, performance matters. Here’s a quick comparison of the methods:

import time

# Create a large DataFrame
large_df = pd.DataFrame(np.random.rand(10000, 100))

# Time to_numpy() (perf_counter is better suited to short intervals than time.time)
start = time.perf_counter()
_ = large_df.to_numpy()
numpy_time = time.perf_counter() - start

# Time values
start = time.perf_counter()
_ = large_df.values
values_time = time.perf_counter() - start

# Time np.array()
start = time.perf_counter()
_ = np.array(large_df)
nparray_time = time.perf_counter() - start

print(f"to_numpy() time: {numpy_time:.6f} seconds")
print(f"values time: {values_time:.6f} seconds")
print(f"np.array() time: {nparray_time:.6f} seconds")

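In my experience, to_numpy() and values have very similar performance, while np.array() can be slower because it copies the data by default. When all columns share one dtype, to_numpy() and values can often return a view of the existing data instead of a copy. Single measurements like these are noisy, so treat the numbers as a rough guide.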
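For steadier numbers, a sketch using the standard library’s timeit module repeats each call many times and averages the result. The repetition count of 50 is an arbitrary choice, and np.shares_memory() is used to confirm that np.array() produced an independent copy:

```python
import timeit

import numpy as np
import pandas as pd

large_df = pd.DataFrame(np.random.rand(10_000, 100))
repeats = 50  # arbitrary; raise it for more stable averages

# Total time for `repeats` calls of each conversion
timings = {
    "to_numpy()": timeit.timeit(lambda: large_df.to_numpy(), number=repeats),
    ".values": timeit.timeit(lambda: large_df.values, number=repeats),
    "np.array()": timeit.timeit(lambda: np.array(large_df), number=repeats),
}

for name, total in timings.items():
    print(f"{name:<12} {total / repeats * 1e6:10.1f} us per call")

# np.array() copies by default, so its result is independent of the DataFrame
copied = np.array(large_df)
print(np.shares_memory(copied, large_df.to_numpy()))
```

Exact timings depend on your machine and on the pandas and NumPy versions installed, so it is worth running this on your own data before drawing conclusions.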

I hope you found this article helpful for understanding how to convert DataFrames to matrices in Python. Each method has its place depending on your specific needs.

Whether you’re working on machine learning, data visualization, or numerical computations, the ability to efficiently move between DataFrames and matrices is an essential skill in the Python data science toolkit.
