PERF: unnecessary materialization of a MultiIndex.values when introspecting memory_usage #14308

jreback · 2016-09-27T22:37:23Z

In [2]: import string
   ...: import pandas as pd
   ...: import numpy as np
   ...: 
   ...: def memory_usage(f):
   ...:     return f.memory_usage(deep=True).sum()
   ...: 
   ...: N = 100
   ...: M = len(string.ascii_uppercase)
   ...: df = pd.DataFrame({'value' : np.random.randn(N*M)},
   ...:                   index=pd.MultiIndex.from_product([list(string.ascii_uppercase),
   ...:                                                     pd.date_range('20160101',periods=N)],
   ...:                                                    names=['id','date'])
   ...:                   )
   ...: 
   ...: 
   ...: stacked = df.unstack('id')
   ...: 
   ...: assert df.values.nbytes == stacked.values.nbytes
   ...: 

In [3]: memory_usage(df)
Out[3]: 145600

In [4]: memory_usage(stacked)
Out[4]: 21600

In [7]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2600 entries, (A, 2016-01-01 00:00:00) to (Z, 2016-04-09 00:00:00)
Data columns (total 1 columns):
value    2600 non-null float64
dtypes: float64(1)
memory usage: 142.2 KB

In [8]: stacked.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2016-01-01 to 2016-04-09
Freq: D
Data columns (total 26 columns):
(value, A)    100 non-null float64
(value, B)    100 non-null float64
(value, C)    100 non-null float64
(value, D)    100 non-null float64
(value, E)    100 non-null float64
(value, F)    100 non-null float64
(value, G)    100 non-null float64
(value, H)    100 non-null float64
(value, I)    100 non-null float64
(value, J)    100 non-null float64
(value, K)    100 non-null float64
(value, L)    100 non-null float64
(value, M)    100 non-null float64
(value, N)    100 non-null float64
(value, O)    100 non-null float64
(value, P)    100 non-null float64
(value, Q)    100 non-null float64
(value, R)    100 non-null float64
(value, S)    100 non-null float64
(value, T)    100 non-null float64
(value, U)    100 non-null float64
(value, V)    100 non-null float64
(value, W)    100 non-null float64
(value, X)    100 non-null float64
(value, Y)    100 non-null float64
(value, Z)    100 non-null float64
dtypes: float64(26)
memory usage: 21.1 KB

With this PR:

In [2]: memory_usage(df)
Out[2]: 27088

In [3]: memory_usage(stacked)
Out[3]: 21600

In [4]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2600 entries, (A, 2016-01-01 00:00:00) to (Z, 2016-04-09 00:00:00)
Data columns (total 1 columns):
value    2600 non-null float64
dtypes: float64(1)
memory usage: 26.5 KB

In [5]: stacked.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2016-01-01 to 2016-04-09
Freq: D
Data columns (total 26 columns):
(value, A)    100 non-null float64
(value, B)    100 non-null float64
(value, C)    100 non-null float64
(value, D)    100 non-null float64
(value, E)    100 non-null float64
(value, F)    100 non-null float64
(value, G)    100 non-null float64
(value, H)    100 non-null float64
(value, I)    100 non-null float64
(value, J)    100 non-null float64
(value, K)    100 non-null float64
(value, L)    100 non-null float64
(value, M)    100 non-null float64
(value, N)    100 non-null float64
(value, O)    100 non-null float64
(value, P)    100 non-null float64
(value, Q)    100 non-null float64
(value, R)    100 non-null float64
(value, S)    100 non-null float64
(value, T)    100 non-null float64
(value, U)    100 non-null float64
(value, V)    100 non-null float64
(value, W)    100 non-null float64
(value, X)    100 non-null float64
(value, Y)    100 non-null float64
(value, Z)    100 non-null float64
dtypes: float64(26)
memory usage: 21.1 KB
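
The drop for `df` above (145600 to 27088 bytes) is consistent with measuring the MultiIndex through its levels and integer codes instead of through a materialized array of tuples. A rough sketch of that idea follows; it is not the code from this PR. `compact_size` is a hypothetical helper, and the sketch assumes a recent pandas where the integer positions are exposed as `.codes` (they were called `.labels` in 2016).

```python
import string

import numpy as np
import pandas as pd

# Same shape of index as in the description: 26 ids x 100 dates.
mi = pd.MultiIndex.from_product(
    [list(string.ascii_uppercase), pd.date_range("20160101", periods=100)],
    names=["id", "date"],
)


def compact_size(index):
    """Hypothetical helper: estimate bytes held by levels and codes only."""
    # Each level stores the unique values once (26 ids, 100 dates).
    level_bytes = sum(level.memory_usage(deep=True) for level in index.levels)
    # Each codes array stores one small integer per row and level.
    code_bytes = sum(np.asarray(codes).nbytes for codes in index.codes)
    return level_bytes + code_bytes


compact = compact_size(mi)

# For comparison: the pointer array alone that .values would allocate,
# before counting the tuple objects it points to.
pointer_bytes = len(mi) * np.dtype(object).itemsize

print("levels + codes:", compact)
print("pointer array of .values:", pointer_bytes)
```

The exact numbers depend on the pandas version and platform, so none are quoted here.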

shoyer · 2016-09-27T23:44:37Z

pandas/indexes/multi.py

+        # we are overwriting our base class to avoid
+        # computing .values here which could materialize
+        # a tuple representation unnecessarily
+        return self.nbytes


Don't we still need to recurse into the levels to get an accurate total if deep=True?

already done in .nbytes (see a little below).

it actually ONLY does deep

hmm, actually it *doesn't* do deep... let me see if I can fix that too

I'm not sure this is true.

Looking into the index base class, we compute the size of objects in memory_usage:
https://github.com/pydata/pandas/blob/37f95cef85834207db0930e863341efb285e38a2/pandas/core/base.py#L1040

but not in nbytes:
https://github.com/pydata/pandas/blob/37f95cef85834207db0930e863341efb285e38a2/pandas/core/base.py#L848

OK, sounds good
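
The distinction raised in this thread, `nbytes` versus `memory_usage(deep=True)`, can be seen on a plain object-dtype Index. This is a minimal sketch assuming a reasonably recent pandas; the index contents are made up for illustration.

```python
import pandas as pd

# An object-dtype Index holds pointers to Python string objects.
idx = pd.Index(["alpha", "beta", "gamma"], dtype=object)

# nbytes counts only the underlying array of pointers.
shallow = idx.nbytes

# memory_usage(deep=True) additionally interrogates the objects themselves,
# so it also counts the strings the pointers refer to.
deep = idx.memory_usage(deep=True)

print("nbytes:", shallow)
print("memory_usage(deep=True):", deep)
```

For object data the deep figure should come out larger than `nbytes`, which is why returning `self.nbytes` alone would not be an accurate answer for `deep=True`.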

codecov-io · 2016-09-28T12:18:49Z

Current coverage is 85.26% (diff: 100%)

Merging #14308 into master will increase coverage by <.01%

@@             master     #14308   diff @@
==========================================
  Files           140        140          
  Lines         50614      50619     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43155      43161     +6   
+ Misses         7459       7458     -1   
  Partials          0          0

Powered by Codecov. Last update c084bc1...c2e74e8

jreback force-pushed the memory branch from aa87a45 to b52af49 Compare September 27, 2016 22:38

jreback added the Bug and Performance (Memory or execution speed performance) labels Sep 27, 2016

jreback added this to the 0.19.0 milestone Sep 27, 2016

jreback force-pushed the memory branch 2 times, most recently from d018eb8 to c47e0ba Compare September 27, 2016 23:37

shoyer reviewed Sep 27, 2016

jreback force-pushed the memory branch from c47e0ba to 82319f9 Compare September 27, 2016 23:55

jreback changed the title ~~PERF: uncessary materialization of a MultiIndex.values when introspecting memory_usage~~ PERF: unnecessary materialization of a MultiIndex.values when introspecting memory_usage Sep 28, 2016

jreback force-pushed the memory branch from 82319f9 to 84240fc Compare September 28, 2016 10:45

PERF: unnecessary materialization of a MultiIndex.values when introsp…

c2e74e8

…ecting memory closes pandas-dev#14308

jreback force-pushed the memory branch from 84240fc to c2e74e8 Compare September 28, 2016 12:52

jreback merged commit 5033a4a into pandas-dev:master Sep 28, 2016
