
ENH: Disable Numpy memory allocation while concat #59956


Closed
1 of 3 tasks
sandeyshc opened this issue Oct 4, 2024 · 8 comments
Labels
Enhancement · Needs Info (Clarification about behavior needed to assess issue) · Performance (Memory or execution speed performance)

Comments

@sandeyshc

Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

We have sparse data with many null values. While reading it using pandas with PyArrow, it doesn't consume much memory because of pandas' internal compression logic. However, during concatenation, NumPy allocates memory that isn't actually used, causing our Python script to fail with memory allocation errors. Can you provide an option to disable the NumPy memory allocation when concatenating DataFrames along axis=1?
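
Roughly, the pattern looks like this (the file names and number of parts below are made-up placeholders, not from our actual job):

import pandas as pd

# Reading the individual parts stays cheap in memory (per the description above).
dfs = [pd.read_parquet(f"part_{i}.parquet", engine="pyarrow") for i in range(5)]

# The axis=1 concatenation is where the large NumPy allocation is reported.
result = pd.concat(dfs, axis=1)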

Feature Description

pd.concat(df_list, axis=1, numpy_allocation=False)

Alternative Solutions

At the very least, could you point us to how we could change the underlying C/C++ code internally and adapt it for our purpose?

Additional Context

Please let me know if I am wrong.

@sandeyshc added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Oct 4, 2024
@rhshadrach
Member

Can you provide a reproducible example?

@rhshadrach added the Needs Info (Clarification about behavior needed to assess issue) and Performance (Memory or execution speed performance) labels and removed the Needs Triage label on Oct 5, 2024
@chaoyihu
Contributor

chaoyihu commented Oct 9, 2024

Something like this? If my script is correct, I did not observe extra memory allocation during concatenation.

Reproducible script
import tracemalloc
import pandas as pd
import numpy as np

# Write a 100,000 x 10 CSV that is roughly 90% NaN, with scattered 0.1 values.
def create_sparse_data():
    ROW = 100_000
    COL = 10
    df = pd.DataFrame(np.nan, index=range(ROW), columns=range(COL))
    sparsity = 0.9
    rs = np.random.choice(np.arange(0, ROW),
                          size=int((1 - sparsity) * ROW * COL),
                          replace=False)
    for r in rs:
        df.loc[r, np.random.randint(COL)] = 0.1
    df.to_csv("data.csv", index=False)

# Decorator that reports current and peak traced Python memory for one call.
def mem_profiler(func):
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        result = func(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"Current memory usage: {current / 1024:.6f} KiB")
        print(f"Peak memory usage: {peak / 1024:.6f} KiB")
        return result 
    return wrapper

@mem_profiler
def read_sparse_df():
    df = pd.read_csv("data.csv", engine="pyarrow") 
    return df

@mem_profiler
def concat_sparse_df(df_list):
    df = pd.concat(df_list, axis=1)
    return df

def main():
    create_sparse_data()

    print("read #################################")
    df_list = []
    n_to_concat = 5
    for i in range(n_to_concat):
        print(f"# {i}")
        df = read_sparse_df()
        df_list.append(df)

    print("concat #################################")
    print("Length of df_list:", len(df_list))
    repetition = 5
    for rep in range(repetition):
        df = concat_sparse_df(df_list)

if __name__ == "__main__":
    main()
Result:
# 0
Current memory usage: 320.005859 KiB
Peak memory usage: 2391.936523 KiB
# 1
Current memory usage: 3.383789 KiB
Peak memory usage: 2310.313477 KiB
# 2
Current memory usage: 3.383789 KiB
Peak memory usage: 2310.313477 KiB
# 3
Current memory usage: 3.330078 KiB
Peak memory usage: 2310.313477 KiB
# 4
Current memory usage: 3.383789 KiB
Peak memory usage: 2310.313477 KiB
concat #################################
Length of df_list: 5
Current memory usage: 29.715820 KiB
Peak memory usage: 30.512695 KiB
Current memory usage: 4.015625 KiB
Peak memory usage: 5.911133 KiB
Current memory usage: 3.195312 KiB
Peak memory usage: 5.911133 KiB
Current memory usage: 3.195312 KiB
Peak memory usage: 5.911133 KiB
Current memory usage: 3.195312 KiB
Peak memory usage: 5.911133 KiB
INSTALLED VERSIONS

commit : 7c0ee27
python : 3.10.14
python-bits : 64
OS : Linux
OS-release : 6.8.0-44-generic
Version : #44-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 13 13:35:26 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1237.g7c0ee27e6c
numpy : 1.26.4
dateutil : 2.9.0
pip : 24.0
Cython : 3.0.10
sphinx : 7.4.7
IPython : 8.26.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.0
fastparquet : 2024.5.0
fsspec : 2024.6.1
html5lib : 1.1
hypothesis : 6.108.3
gcsfs : 2024.6.1
jinja2 : 3.1.4
lxml.etree : 5.2.2
matplotlib : 3.9.1
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.4
psycopg2 : 2.9.9
pymysql : 1.4.6
pyarrow : 17.0.0
pyreadstat : 1.2.7
pytest : 8.3.1
python-calamine : None
pytz : 2024.1
pyxlsb : 1.0.10
s3fs : 2024.6.1
scipy : 1.14.0
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.6.0
xlrd : 2.0.1
xlsxwriter : 3.1.9
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

@sandeyshc
Author

Hi,

Sorry for the delayed response. Below are my points:

  1. I ran the same script that @chaoyihu shared, and I see different memory-usage numbers, as shown below:
read #################################
# 0
Current memory usage: 277.455078 KiB
Peak memory usage: 2404.784180 KiB
# 1
Current memory usage: 3.606445 KiB
Peak memory usage: 2313.016602 KiB
# 2
Current memory usage: 3.477539 KiB
Peak memory usage: 2312.965820 KiB
# 3
Current memory usage: 3.524414 KiB
Peak memory usage: 2312.950195 KiB
# 4
Current memory usage: 3.508789 KiB
Peak memory usage: 2312.934570 KiB
concat #################################
Length of df_list: 5
Current memory usage: 39094.959961 KiB
Peak memory usage: 39101.418945 KiB
Current memory usage: 39068.648438 KiB
Peak memory usage: 39075.121094 KiB
Current memory usage: 39067.968750 KiB
Peak memory usage: 39074.175781 KiB
Current memory usage: 39068.226562 KiB
Peak memory usage: 39074.269531 KiB
Current memory usage: 39067.914062 KiB
Peak memory usage: 39073.949219 KiB
  2. I convert the DataFrame into a sparse format, concatenate it, and then convert it back to a dense format to write to a Parquet file. This consumes double the memory because of the conversion between sparse and dense. Instead, can we read the data directly in a sparse format and write it without converting back to dense? We use the sparse format because our DataFrame contains many NaN values, making it inherently sparse, and we want to use less memory and less time.
for i in range(0, n):
    df = (
        pd.read_parquet(
            file_name[0],
            engine='pyarrow',
            filters=[('BUCKET_NUM', '>=', bucket_start), ('BUCKET_NUM', '<=', bucket_end)],
        )
        .drop('BUCKET_NUM', axis=1)
        .set_index(index)
        .apply(pd.arrays.SparseArray)
    )
    dfs.append(df)

final_df = pd.concat(dfs, axis=1)
final_df.sparse.to_dense().to_parquet(new_file, engine='pyarrow', compression='snappy')
  3. Sometimes it fails with the error below, because the memory is allocated up front even though the full allocation is never actually used:
    numpy._core._exceptions._ArrayMemoryError: Unable to allocate 51.3 GiB for an array with shape (68836, 100000) and data type float64
  4. Is there any way to spill the data to disk when it exceeds available memory and continue processing? (A rough chunked-to-disk sketch follows this list.)
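
One possible workaround (a rough sketch only; the file names, bucket ranges, and the "ID" index column are placeholders, and the input files are assumed to hold different columns for the same index): process one BUCKET_NUM range at a time and write each concatenated chunk straight to disk, so only a single chunk is ever dense in memory.

import pandas as pd

def process_bucket_range(files, bucket_start, bucket_end, index_cols, out_file):
    # Read only one BUCKET_NUM range from each input file.
    dfs = []
    for name in files:
        df = (
            pd.read_parquet(
                name,
                engine="pyarrow",
                filters=[("BUCKET_NUM", ">=", bucket_start),
                         ("BUCKET_NUM", "<=", bucket_end)],
            )
            .drop("BUCKET_NUM", axis=1)
            .set_index(index_cols)
        )
        dfs.append(df)
    # Only this chunk is held dense in memory; it is written out immediately.
    pd.concat(dfs, axis=1).to_parquet(out_file, engine="pyarrow", compression="snappy")

# Placeholder ranges and files; pick bucket ranges small enough to fit in memory.
for i, (start, end) in enumerate([(0, 9), (10, 19)]):
    process_bucket_range(["features_a.parquet", "features_b.parquet"],
                         start, end, "ID", f"chunk_{i}.parquet")

This keeps the axis=1 concatenation but bounds the dense allocation to one bucket range at a time instead of the full (68836, 100000) array.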

@sandeyshc
Author

sandeyshc commented Oct 9, 2024

Just to add: the pandas DataFrame is consuming a lot of space.

df = (
    pd.read_parquet(
        file_name[0],
        engine='pyarrow',
        filters=[('BUCKET_NUM', '>=', bucket_start), ('BUCKET_NUM', '<=', bucket_end)],
    )
    .drop('BUCKET_NUM', axis=1)
    .set_index(index)
)
memory_usage_per_column = df.memory_usage(deep=True)
total_memory_usage = memory_usage_per_column.sum()
print(f"Memory usage per column (in bytes): {memory_usage_per_column}")
print(f"\nTotal memory usage (in bytes): {total_memory_usage}")
# get_data_size is a user-defined helper (definition not shown here)
data_size = get_data_size(df)
print(f"Exact size of the data (in bytes): {data_size}")


Total memory usage (in bytes): 815536221
Exact size of the data (in bytes): 141846808
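
(get_data_size above is a helper whose definition is not shown here. A hypothetical version that only counts the bytes of the non-null numeric values, ignoring the index, would look roughly like the sketch below; counting only non-null values would also explain why the result is much smaller than memory_usage(deep=True) on a mostly-NaN frame.)

import numpy as np
import pandas as pd

def approx_values_size(df: pd.DataFrame) -> int:
    # Hypothetical helper (not the original get_data_size): bytes needed for
    # the non-null values of numeric columns, ignoring the index entirely.
    total = 0
    for col in df.columns:
        s = df[col]
        if isinstance(s.dtype, np.dtype) and s.dtype.kind in "iufb":
            total += int(s.notna().sum()) * s.dtype.itemsize
        else:
            total += int(s.memory_usage(index=False, deep=True))
    return total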

@rhshadrach
Member

@sandeyshc - can you share the output of pd.show_versions()?

@sandeyshc
Author

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-513.24.1.el8_9.x86_64
Version : #1 SMP Thu Mar 14 14:20:09 EDT 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 2.1.0
pytz : 2024.1
dateutil : 2.9.0
setuptools : 72.2.0
pip : 24.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.5.0
fsspec : 2024.6.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

@anzber
Contributor

anzber commented Oct 11, 2024

@rhshadrach
I've tried the same script with the latest pandas 3.0.0.dev0+1374.g0c24b20bd9 and with pandas 2.2.2.
Memory consumption:
  • on 3.0.0.dev0+1374.g0c24b20bd9 it looks like @chaoyihu's output
  • on 2.2.2 it looks like @sandeyshc's output

I think this problem is already fixed.

@rhshadrach
Member

Thanks for the reproducer @chaoyihu and the investigation @anzber - I can also reproduce this on 2.2.x but not on main. Closing.
