Example of High Memory Usage of MultiIndex 

This is a dupe of #1752, in which @jreback recommended a new issue be opened. I have a (hopefully) reproduce-able example which shows how using MultiIndex can explode RAM usage.

At the top I define a function to get the memory of the current Python process. This will be called as I create arrays to get the "true" space of each object.

``` python
import numpy as np
import pandas as pd
import os
import psutil
import numba as nb

def print_mem():
    proc = psutil.Process(os.getpid())
    print proc.memory_info()[0] / float(2**20), "MB"
print_mem()
>> 74.20703125 MB
```

I need some functions to construct a plausible MultiIndex which matches roughly the actual data I'm using.

``` python
@nb.jit(nopython=True, nogil=True)
def nbf_number_new_level(repetitions):
    assert len(repetitions.shape) == 1

    n = repetitions.sum()
    ret = np.zeros(shape=n, dtype=np.uint64)

    i = 0
    for j in range(len(repetitions)):
        n_repetitions = repetitions[j]
        c = 0
        for k in range(n_repetitions):
            ret[i] = c
            c += 1; i += 1
    return ret

def expand_index(idx, repetitions, new_name=None):
    assert len(idx) == len(repetitions), "Index and repetitions must be the same length"
    repetitions = np.array(repetitions)
    assert np.all(repetitions > 0), "All repetitions must be > 0"

    names = list(idx.names) if isinstance(idx, pd.MultiIndex) else [idx.name]
    names.append(new_name)

    new_level_value = nbf_number_new_level(repetitions)

    value_arrays = [idx.get_level_values(i).repeat(repetitions) for i in range(idx.nlevels)] \
    if isinstance(idx, pd.MultiIndex) else \
    [idx.values.repeat(repetitions)]

    value_arrays.append(new_level_value)

    ret = pd.MultiIndex.from_arrays(value_arrays)
    ret.names = names
    return ret
```

Constructing the actual MultiIndex:

``` python
randomizer = np.random.RandomState(12345)

nhousehholds = 200000
hhids = pd.Index(np.arange(nhousehholds), name='hhid')
print nhousehholds, "households"
print_mem()
>>200000 households
>>83.95703125 MB

persons_per_hh = randomizer.randint(1, 9, size=nhousehholds)
person_ids = expand_index(hhids, persons_per_hh, new_name='pid')
print len(person_ids), "persons"
print_mem()
>>898254 persons
>>105.6953125 MB

tours_per_person = randomizer.choice([1, 2, 3], size=len(person_ids), replace=True, p=[0.85, 0.1, 0.05])
tour_ids = expand_index(person_ids, tours_per_person, new_name='tour_id')
print len(tour_ids), "tours"
print_mem()
>>1078155 tours
>>121.734375 MB

trips_per_tour = randomizer.choice([2,3,4,5], size=len(tour_ids), replace=True, p=[0.6, 0.25, 0.1, 0.05])
trip_ids = expand_index(tour_ids, trips_per_tour, new_name='trip_id')
print len(trip_ids), "trips"
print_mem()

trips = pd.DataFrame(index=trip_ids)
trips['origin'] = randomizer.randint(101, 200, size=len(trips))
trips['destination'] = randomizer.randint(101, 200, size=len(trips))
print_mem()

>>2803731 trips
>>150.17578125 MB
>>171.70703125 MB
```

Based on this, my fake trip table is taking up about 171.7 - 121.7 = 50MB of RAM. I haven't done any fancy indexing, just initialized the table. But, when I call `trips.info()` (which appears to instantiate the array of PyTuples) this happens:

``` python
print trips.info()
print_mem()
>><class 'pandas.core.frame.DataFrame'>
MultiIndex: 2803731 entries, (0, 0, 0, 0) to (199999, 0, 0, 1)
Data columns (total 2 columns):
origin         int32
destination    int32
dtypes: int32(2)
memory usage: 42.8+ MB
None
723.64453125 MB
```

My process's memory usage balloons to 723MB!. Doing the math, the cached indexer takes up 723.6 - 171.7 = 551 MB, a **tenfold** increase over the actual DataFrame!

For this fake dataset, this is not so much of a problem, but my production code is 20x the size and I soak up 27 GB of RAM when I as much as look at my trips table. 

Any performance tips would be appreciated; the only thing I can think of right now is "don't use a MultiIndex". It sounds like a 'proper' fix for this lies deep in the indexing internals which is far above my pay grade. But I wanted to at least log this issue with the community.

Output of `pd.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.4
pip: 8.1.1
setuptools: 0.6c11
Cython: 0.22
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.1.1
numexpr: 2.5.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
jinja2: 2.7.3
boto: 2.36.0
pandas_datareader: None
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Example of High Memory Usage of MultiIndex #13904

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Example of High Memory Usage of MultiIndex #13904

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions