
ENH: Disable Numpy memory allocation while concat #59956


Closed
1 of 3 tasks
sandeyshc opened this issue Oct 4, 2024 · 8 comments
Labels
Enhancement · Needs Info (Clarification about behavior needed to assess issue) · Performance (Memory or execution speed performance)

Comments

@sandeyshc

Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

We have sparse data with many null values. While reading it using pandas with PyArrow, it doesn't consume much memory because of pandas' internal compression logic. However, during concatenation, NumPy allocates memory that isn't actually used, causing our Python script to fail with memory allocation errors. Can you provide an option to disable the NumPy memory allocation when concatenating DataFrames along axis=1?
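
Roughly, the pattern looks like this (the file names and number of parts below are made-up placeholders, not from our actual job):

import pandas as pd

# Reading the individual parts stays cheap in memory (per the description above).
dfs = [pd.read_parquet(f"part_{i}.parquet", engine="pyarrow") for i in range(5)]

# The axis=1 concatenation is where the large NumPy allocation is reported.
result = pd.concat(dfs, axis=1)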

Feature Description

pd.concat(df_list, axis=1, numpy_allocation=False)

Alternative Solutions

At the very least, could you point us to how we could change the underlying C/C++ code internally and adapt it for our purpose?

Additional Context

Please let me know if I am wrong.

@sandeyshc added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Oct 4, 2024
@rhshadrach
Member

Can you provide a reproducible example?

@rhshadrach added the Needs Info (Clarification about behavior needed to assess issue) and Performance (Memory or execution speed performance) labels and removed the Needs Triage label on Oct 5, 2024
@chaoyihu
Contributor

chaoyihu commented Oct 9, 2024

Something like this? If my script is correct, I did not observe extra memory allocation during concatenation.

Reproducible script
import tracemalloc
import pandas as pd
import numpy as np

# Write a 100,000 x 10 CSV that is roughly 90% NaN, with scattered 0.1 values.
def create_sparse_data():
    ROW = 100_000
    COL = 10
    df = pd.DataFrame(np.nan, index=range(ROW), columns=range(COL))
    sparsity = 0.9
    rs = np.random.choice(np.arange(0, ROW),
                          size=int((1 - sparsity) * ROW * COL),
                          replace=False)
    for r in rs:
        df.loc[r, np.random.randint(COL)] = 0.1
    df.to_csv("data.csv", index=False)

# Decorator that reports current and peak traced Python memory for one call.
def mem_profiler(func):
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        result = func(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"Current memory usage: {current / 1024:.6f} KiB")
        print(f"Peak memory usage: {peak / 1024:.6f} KiB")
        return result 
    return wrapper

@mem_profiler
def read_sparse_df():
    df = pd.read_csv("data.csv", engine="pyarrow") 
    return df

@mem_profiler
def concat_sparse_df(df_list):
    df = pd.concat(df_list, axis=1)
    return df

def main():
    create_sparse_data()

    print("read #################################")
    df_list = []
    n_to_concat = 5
    for i in range(n_to_concat):
        print(f"# {i}")
        df = read_sparse_df()
        df_list.append(df)

    print("concat #################################")
    print("Length of df_list:", len(df_list))
    repetition = 5
    for rep in range(repetition):
        df = concat_sparse_df(df_list)

if __name__ == "__main__":
    main()
Result:
# 0
Current memory usage: 320.005859 KiB
Peak memory usage: 2391.936523 KiB
# 1
Current memory usage: 3.383789 KiB
Peak memory usage: 2310.313477 KiB
# 2
Current memory usage: 3.383789 KiB
Peak memory usage: 2310.313477 KiB
# 3
Current memory usage: 3.330078 KiB
Peak memory usage: 2310.313477 KiB
# 4
Current memory usage: 3.383789 KiB
Peak memory usage: 2310.313477 KiB
concat #################################
Length of df_list: 5
Current memory usage: 29.715820 KiB
Peak memory usage: 30.512695 KiB
Current memory usage: 4.015625 KiB
Peak memory usage: 5.911133 KiB
Current memory usage: 3.195312 KiB
Peak memory usage: 5.911133 KiB
Current memory usage: 3.195312 KiB
Peak memory usage: 5.911133 KiB
Current memory usage: 3.195312 KiB
Peak memory usage: 5.911133 KiB
INSTALLED VERSIONS

commit : 7c0ee27
python : 3.10.14
python-bits : 64
OS : Linux
OS-release : 6.8.0-44-generic
Version : #44-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 13 13:35:26 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1237.g7c0ee27e6c
numpy : 1.26.4
dateutil : 2.9.0
pip : 24.0
Cython : 3.0.10
sphinx : 7.4.7
IPython : 8.26.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.0
fastparquet : 2024.5.0
fsspec : 2024.6.1
html5lib : 1.1
hypothesis : 6.108.3
gcsfs : 2024.6.1
jinja2 : 3.1.4
lxml.etree : 5.2.2
matplotlib : 3.9.1
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.4
psycopg2 : 2.9.9
pymysql : 1.4.6
pyarrow : 17.0.0
pyreadstat : 1.2.7
pytest : 8.3.1
python-calamine : None
pytz : 2024.1
pyxlsb : 1.0.10
s3fs : 2024.6.1
scipy : 1.14.0
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.6.0
xlrd : 2.0.1
xlsxwriter : 3.1.9
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

@sandeyshc
Author

Hi,

Sorry for the delayed response. Below are my points:

  1. I ran the same script that @chaoyihu shared, and I see different memory-usage numbers, as shown below:
read #################################
# 0
Current memory usage: 277.455078 KiB
Peak memory usage: 2404.784180 KiB
# 1
Current memory usage: 3.606445 KiB
Peak memory usage: 2313.016602 KiB
# 2
Current memory usage: 3.477539 KiB
Peak memory usage: 2312.965820 KiB
# 3
Current memory usage: 3.524414 KiB
Peak memory usage: 2312.950195 KiB
# 4
Current memory usage: 3.508789 KiB
Peak memory usage: 2312.934570 KiB
concat #################################
Length of df_list: 5
Current memory usage: 39094.959961 KiB
Peak memory usage: 39101.418945 KiB
Current memory usage: 39068.648438 KiB
Peak memory usage: 39075.121094 KiB
Current memory usage: 39067.968750 KiB
Peak memory usage: 39074.175781 KiB
Current memory usage: 39068.226562 KiB
Peak memory usage: 39074.269531 KiB
Current memory usage: 39067.914062 KiB
Peak memory usage: 39073.949219 KiB
  2. I convert the DataFrame into a sparse format, concatenate it, and then convert it back to a dense format to write to a Parquet file. This consumes double the memory because of the conversion between sparse and dense. Instead, can we read the data directly in a sparse format and write it without converting back to dense? We use the sparse format because our DataFrame contains many NaN values, making it inherently sparse, and we want to use less memory and less time.
for i in range(0, n):
    df = (
        pd.read_parquet(
            file_name[0],
            engine='pyarrow',
            filters=[('BUCKET_NUM', '>=', bucket_start), ('BUCKET_NUM', '<=', bucket_end)],
        )
        .drop('BUCKET_NUM', axis=1)
        .set_index(index)
        .apply(pd.arrays.SparseArray)
    )
    dfs.append(df)

final_df = pd.concat(dfs, axis=1)
final_df.sparse.to_dense().to_parquet(new_file, engine='pyarrow', compression='snappy')
  3. Sometimes it fails with the error below, because the memory is allocated up front even though the full allocation is never actually used:
    numpy._core._exceptions._ArrayMemoryError: Unable to allocate 51.3 GiB for an array with shape (68836, 100000) and data type float64
  4. Is there any way to spill the data to disk when it exceeds available memory and continue processing? (A rough chunked-to-disk sketch follows this list.)
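
One possible workaround (a rough sketch only; the file names, bucket ranges, and the "ID" index column are placeholders, and the input files are assumed to hold different columns for the same index): process one BUCKET_NUM range at a time and write each concatenated chunk straight to disk, so only a single chunk is ever dense in memory.

import pandas as pd

def process_bucket_range(files, bucket_start, bucket_end, index_cols, out_file):
    # Read only one BUCKET_NUM range from each input file.
    dfs = []
    for name in files:
        df = (
            pd.read_parquet(
                name,
                engine="pyarrow",
                filters=[("BUCKET_NUM", ">=", bucket_start),
                         ("BUCKET_NUM", "<=", bucket_end)],
            )
            .drop("BUCKET_NUM", axis=1)
            .set_index(index_cols)
        )
        dfs.append(df)
    # Only this chunk is held dense in memory; it is written out immediately.
    pd.concat(dfs, axis=1).to_parquet(out_file, engine="pyarrow", compression="snappy")

# Placeholder ranges and files; pick bucket ranges small enough to fit in memory.
for i, (start, end) in enumerate([(0, 9), (10, 19)]):
    process_bucket_range(["features_a.parquet", "features_b.parquet"],
                         start, end, "ID", f"chunk_{i}.parquet")

This keeps the axis=1 concatenation but bounds the dense allocation to one bucket range at a time instead of the full (68836, 100000) array.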

@sandeyshc
Author

sandeyshc commented Oct 9, 2024

Just to add: the pandas DataFrame is consuming a lot of space.

df = (
    pd.read_parquet(
        file_name[0],
        engine='pyarrow',
        filters=[('BUCKET_NUM', '>=', bucket_start), ('BUCKET_NUM', '<=', bucket_end)],
    )
    .drop('BUCKET_NUM', axis=1)
    .set_index(index)
)
memory_usage_per_column = df.memory_usage(deep=True)
total_memory_usage = memory_usage_per_column.sum()
print(f"Memory usage per column (in bytes): {memory_usage_per_column}")
print(f"\nTotal memory usage (in bytes): {total_memory_usage}")
# get_data_size is a user-defined helper (definition not shown here)
data_size = get_data_size(df)
print(f"Exact size of the data (in bytes): {data_size}")


Total memory usage (in bytes): 815536221
Exact size of the data (in bytes): 141846808
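
(get_data_size above is a helper whose definition is not shown here. A hypothetical version that only counts the bytes of the non-null numeric values, ignoring the index, would look roughly like the sketch below; counting only non-null values would also explain why the result is much smaller than memory_usage(deep=True) on a mostly-NaN frame.)

import numpy as np
import pandas as pd

def approx_values_size(df: pd.DataFrame) -> int:
    # Hypothetical helper (not the original get_data_size): bytes needed for
    # the non-null values of numeric columns, ignoring the index entirely.
    total = 0
    for col in df.columns:
        s = df[col]
        if isinstance(s.dtype, np.dtype) and s.dtype.kind in "iufb":
            total += int(s.notna().sum()) * s.dtype.itemsize
        else:
            total += int(s.memory_usage(index=False, deep=True))
    return total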

@rhshadrach
Member

@sandeyshc - can you share the output of pd.show_versions()?

@sandeyshc
Author

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-513.24.1.el8_9.x86_64
Version : #1 SMP Thu Mar 14 14:20:09 EDT 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 2.1.0
pytz : 2024.1
dateutil : 2.9.0
setuptools : 72.2.0
pip : 24.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.5.0
fsspec : 2024.6.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

@anzber
Contributor

anzber commented Oct 11, 2024

@rhshadrach
I've tried the same script with the latest pandas 3.0.0.dev0+1374.g0c24b20bd9 and with pandas 2.2.2.
Memory consumption:
  • on 3.0.0.dev0+1374.g0c24b20bd9 it looks like @chaoyihu's output
  • on 2.2.2 it looks like @sandeyshc's output

I think this problem is already fixed.

@rhshadrach
Member

Thanks for the reproducer @chaoyihu and the investigation @anzber - I can also reproduce this on 2.2.x but not on main. Closing.
