
PERF: unnecessary materialization of a MultiIndex.values when introspecting memory_usage #14308

Merged (1 commit) on Sep 28, 2016

jreback
Contributor

@jreback jreback commented Sep 27, 2016

In [2]: import string
   ...: import pandas as pd
   ...: import numpy as np
   ...: 
   ...: def memory_usage(f):
   ...:     return f.memory_usage(deep=True).sum()
   ...: 
   ...: N = 100
   ...: M = len(string.ascii_uppercase)
   ...: df = pd.DataFrame({'value' : np.random.randn(N*M)},
   ...:                   index=pd.MultiIndex.from_product([list(string.ascii_uppercase),
   ...:                                                     pd.date_range('20160101',periods=N)],
   ...:                                                    names=['id','date'])
   ...:                   )
   ...: 
   ...: 
   ...: stacked = df.unstack('id')
   ...: 
   ...: assert df.values.nbytes == stacked.values.nbytes
   ...: 

In [3]: memory_usage(df)
Out[3]: 145600

In [4]: memory_usage(stacked)
Out[4]: 21600

In [7]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2600 entries, (A, 2016-01-01 00:00:00) to (Z, 2016-04-09 00:00:00)
Data columns (total 1 columns):
value    2600 non-null float64
dtypes: float64(1)
memory usage: 142.2 KB

In [8]: stacked.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2016-01-01 to 2016-04-09
Freq: D
Data columns (total 26 columns):
(value, A)    100 non-null float64
(value, B)    100 non-null float64
(value, C)    100 non-null float64
(value, D)    100 non-null float64
(value, E)    100 non-null float64
(value, F)    100 non-null float64
(value, G)    100 non-null float64
(value, H)    100 non-null float64
(value, I)    100 non-null float64
(value, J)    100 non-null float64
(value, K)    100 non-null float64
(value, L)    100 non-null float64
(value, M)    100 non-null float64
(value, N)    100 non-null float64
(value, O)    100 non-null float64
(value, P)    100 non-null float64
(value, Q)    100 non-null float64
(value, R)    100 non-null float64
(value, S)    100 non-null float64
(value, T)    100 non-null float64
(value, U)    100 non-null float64
(value, V)    100 non-null float64
(value, W)    100 non-null float64
(value, X)    100 non-null float64
(value, Y)    100 non-null float64
(value, Z)    100 non-null float64
dtypes: float64(26)
memory usage: 21.1 KB

with this PR

In [2]: memory_usage(df)
Out[2]: 27088

In [3]: memory_usage(stacked)
Out[3]: 21600

In [4]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2600 entries, (A, 2016-01-01 00:00:00) to (Z, 2016-04-09 00:00:00)
Data columns (total 1 columns):
value    2600 non-null float64
dtypes: float64(1)
memory usage: 26.5 KB

In [5]: stacked.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2016-01-01 to 2016-04-09
Freq: D
Data columns (total 26 columns):
(value, A)    100 non-null float64
(value, B)    100 non-null float64
(value, C)    100 non-null float64
(value, D)    100 non-null float64
(value, E)    100 non-null float64
(value, F)    100 non-null float64
(value, G)    100 non-null float64
(value, H)    100 non-null float64
(value, I)    100 non-null float64
(value, J)    100 non-null float64
(value, K)    100 non-null float64
(value, L)    100 non-null float64
(value, M)    100 non-null float64
(value, N)    100 non-null float64
(value, O)    100 non-null float64
(value, P)    100 non-null float64
(value, Q)    100 non-null float64
(value, R)    100 non-null float64
(value, S)    100 non-null float64
(value, T)    100 non-null float64
(value, U)    100 non-null float64
(value, V)    100 non-null float64
(value, W)    100 non-null float64
(value, X)    100 non-null float64
(value, Y)    100 non-null float64
(value, Z)    100 non-null float64
dtypes: float64(26)
memory usage: 21.1 KB
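
The gap between the two `memory_usage(df)` figures comes from how the index is measured: either as a materialized array of tuples, or as the compact levels-plus-codes representation. The following is a minimal sketch of that comparison, assuming a recent pandas on Python 3; the variable names `compact` and `materialized` are illustrative, and the exact byte counts will differ from the 2016 figures above.

```python
import string
import sys

import pandas as pd

N = 100
idx = pd.MultiIndex.from_product(
    [list(string.ascii_uppercase), pd.date_range("2016-01-01", periods=N)],
    names=["id", "date"],
)

# Compact representation: the level values plus the integer codes.
compact = idx.memory_usage(deep=True)

# Materialized representation: an object array holding one tuple per row.
# The array stores one pointer per row, and each tuple is its own Python object.
values = idx.values
materialized = values.nbytes + sum(sys.getsizeof(t) for t in values)

print("compact:", compact, "materialized:", materialized)
```

The tuple-based estimate here counts only the tuple containers, not the objects inside them, so it understates the true cost of materialization.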

@jreback jreback added the Bug and Performance (memory or execution speed performance) labels Sep 27, 2016
@jreback jreback added this to the 0.19.0 milestone Sep 27, 2016
@jreback jreback force-pushed the memory branch 2 times, most recently from d018eb8 to c47e0ba Compare September 27, 2016 23:37
# we are overwriting our base class to avoid
# computing .values here which could materialize
# a tuple representation unnecessarily
return self.nbytes
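
The idea behind the override above can be sketched outside of pandas internals. This is not the code from the PR: `multiindex_nbytes` is a hypothetical helper, and it assumes a pandas version where the integer positions are exposed as `MultiIndex.codes` (they were called `labels` when this PR was written). It sums the memory of each level and each codes array, so `.values` is never touched.

```python
import pandas as pd


def multiindex_nbytes(mi, deep=False):
    """Hypothetical helper: estimate MultiIndex memory without using mi.values."""
    # Each level is an ordinary Index holding the unique values of that level.
    level_bytes = sum(level.memory_usage(deep=deep) for level in mi.levels)
    # Each codes array maps rows to positions within the corresponding level.
    code_bytes = sum(codes.nbytes for codes in mi.codes)
    return level_bytes + code_bytes


mi = pd.MultiIndex.from_product([list("abc"), range(4)], names=["id", "n"])
print(multiindex_nbytes(mi), multiindex_nbytes(mi, deep=True))
```

Passing `deep` through to the levels is what lets object-dtype levels report the size of the strings they hold, which is the point raised in the review comments that follow.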
Member

Don't we still need to recurse into the levels to get an accurate total if deep=True?

Contributor Author

already done in .nbytes (see a little below).

it actually ONLY does deep

Contributor Author

hmm, actually it *doesn't* do deep... let me see if I can fix that too

Member

OK, sounds good

Contributor Author

updated
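
For readers following the thread, the shallow versus deep distinction discussed above can be seen on a plain object-dtype index. This is an illustrative sketch assuming a recent pandas, not part of the PR: shallow usage counts only the array of pointers, while `deep=True` also adds the size of the Python objects they point to.

```python
import pandas as pd

# An object-dtype Index stores pointers to Python strings.
idx = pd.Index(["alpha", "beta", "gamma"], dtype=object)

shallow = idx.memory_usage(deep=False)  # pointer array only
deep = idx.memory_usage(deep=True)      # pointers plus the string objects

print("shallow:", shallow, "deep:", deep)
```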

@jreback jreback changed the title from "PERF: uncessary materialization of a MultiIndex.values when introspecting memory_usage" to "PERF: unnecessary materialization of a MultiIndex.values when introspecting memory_usage" Sep 28, 2016
codecov-io commented Sep 28, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14308 into master will increase coverage by <.01%

@@             master     #14308   diff @@
==========================================
  Files           140        140          
  Lines         50614      50619     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43155      43161     +6   
+ Misses         7459       7458     -1   
  Partials          0          0          

Powered by Codecov. Last update c084bc1...c2e74e8

@jreback jreback merged commit 5033a4a into pandas-dev:master Sep 28, 2016