DataFrame.sum() creates temporary copy in memory #16788
Comments
I would recommend that you install bottleneck:

In [11]: %timeit df.sum(axis=1)
677 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: pd.options.compute.use_bottleneck = False

In [13]: %timeit df.sum(axis=1)
3.5 s ± 130 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
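For reference, a minimal sketch of toggling this option via the public `pd.set_option` API (the flag only takes effect when the optional bottleneck package is actually installed):

```python
import pandas as pd

# Enable pandas' use of the optional bottleneck accelerator for
# nan-aware reductions such as sum(). Setting the flag is harmless
# even when bottleneck is not installed; it simply has no effect.
pd.set_option("compute.use_bottleneck", True)
print(pd.get_option("compute.use_bottleneck"))  # → True
```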
Thanks for the tip. For my purposes, the workaround that I described works just as well. My point was that maybe the default …
It does seem like there is an avoidable copy here (line 287 in 664348c).
Well, if you can avoid the copy without changing any semantics, sure.
So the reason for the unavoidable copy is that we use …
Oh, right. I suppose this could be left open, but the right answer is to use …
I have quite a similar problem even with bottleneck: it takes up all of my memory and freezes my computer. Is there a way to avoid this? https://stackoverflow.com/questions/45350545/how-to-avoid-this-memory-leak-caused-by-dataframe-sum-in-pandas
@karkirowle your example is not similar at all. This issue is about summing floats; you are summing strings, which is horribly inefficient and memory-hungry.
I no longer see memory > 7 GB on master, and the copy is avoided here (lines 326 to 332 in 50a59ba).
Somehow, DataFrame.sum() always seems to create a temporary copy of the data frame in memory.
Code Sample
First, we create a large, 3.7 GB DataFrame with many columns:
Output:
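The original code block did not survive extraction; a minimal sketch of building a wide float frame could look like the following (scaled far down from the 3.7 GB used in the report, and all shapes here are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the lost snippet: a wide float64 frame.
# The report used ~3.7 GB; this is scaled down so it runs anywhere.
n_rows, n_cols = 10_000, 500
df = pd.DataFrame(np.random.rand(n_rows, n_cols))
print(df.values.nbytes)  # bytes held by the underlying float64 block
```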
Next, we want to sum over the rows:
Output:
By monitoring the memory consumption of Python during this step using top (execution takes 2-3 seconds on my machine), I can see that consumption temporarily doubles, indicating that a copy of the entire frame is created in memory. However, the following code achieves the same result without creating a copy.
Output:
Problem description
Creating a copy of the data frame seems unnecessary for summing (numpy does it without creating a copy). The current implementation of DataFrame.sum() makes it impossible to sum over data frames if there isn't enough memory available to create a copy.
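As an illustration of the claim that numpy can reduce without copying (a hedged check: whether `to_numpy()` returns a view of the original buffer depends on the pandas version and on the frame holding a single dtype):

```python
import numpy as np
import pandas as pd

arr = np.random.rand(10_000, 50)
df = pd.DataFrame(arr)

# For a single-dtype frame, to_numpy() may be a view of the original
# buffer, in which case numpy's reduction allocates only the result,
# not a second copy of the data.
print(np.shares_memory(arr, df.to_numpy()))
row_sums = np.sum(df.to_numpy(), axis=1)
```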
Output of pd.show_versions():
INSTALLED VERSIONS
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: 1.6.1
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None