High memory usage for MultiIndexes #1752
Well, the solution is not to persist the array of tuples (it is cached the first time it's constructed). I would like to improve the internals of MultiIndex-- for example, the hash table footprint could be significantly improved by using a custom C tuple of ints to represent each value. If you write me the relevant data structure code and hash function, I'll see what I can do in a future release (or I'll eventually do it myself).
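As a rough sketch of what such a key type and hash function might look like (nothing here is from pandas itself; the three-level layout and the mixing constant are illustrative choices):

```c
#include <stdint.h>

/* Hypothetical fixed-width key for a 3-level MultiIndex: one integer
   label code per level, with no per-entry PyObject overhead. */
typedef struct {
    int64_t labels[3];
} mi_key_t;

/* Mix the per-level codes together (hash_combine-style, using the
   64-bit golden-ratio constant). */
static inline uint32_t mi_key_hash(mi_key_t k)
{
    uint64_t h = 0;
    for (int i = 0; i < 3; i++)
        h ^= (uint64_t)k.labels[i] + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
    return (uint32_t)h;
}

static inline int mi_key_equal(mi_key_t a, mi_key_t b)
{
    return a.labels[0] == b.labels[0] &&
           a.labels[1] == b.labels[1] &&
           a.labels[2] == b.labels[2];
}
```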
It'd also be nice to make the hash table cutoff for large sorted indexes configurable.
I'm not worrying about persistence overhead, I'm worried about memory overhead. Obviously my computers do have more than 500 megabytes of RAM, but every experiment involves dozens of these data sets, and right now using pandas ~halves the number that I can load simultaneously (and many data sets are larger to start with). I already run out of memory before I run out of CPUs. What is this giant array of tuples for? Is it possible to avoid it entirely? [sorry for the truncated notification, Firefox somehow decided that I was done with the comment after one sentence.]
The array of tuples is for reindexing. It's actually the PyTuple objects that are the problem-- they are keys in the hash table. What I'm saying is that if you want a hash table, it needs to contain data structures that are not Python objects. I'll investigate some options to fix it; I could spend a whole 'nother year on pandas's internals improving these things, but it's hard when I have no funding and lots of other priorities.
I think if you store the labels in a single C-contiguous 2D array and use …
Okay, right, that makes it clearer what's going on :-). I think the right abstraction here might be numpy structured dtypes, which just represent several arbitrary C objects concatenated into a single buffer? Basically the same idea as your 2D array, but it would make it easy to work with mixed-length integers if desired. (This is what I was thinking of when I mentioned in the original email that a single 64-bit array could suffice -- if you intern the labels in this case to int8, int8, int32, then each "tuple" is only 6 bytes.) For that matter it might be possible to pretty much generalize the Int64Index code to efficiently handle any such dtypes -- a fixed-size chunk of raw memory is a fixed-size chunk of raw memory. You'd just have to check arr.dtype.itemsize once on the way in and then everything else is the same.
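To make the arithmetic concrete, a hypothetical C equivalent of that structured dtype, with the int8/int8/int32 widths from the example above:

```c
#include <assert.h>
#include <stdint.h>

/* Rough C equivalent of a numpy structured dtype like
   [('a', 'i1'), ('b', 'i1'), ('c', 'i4')]: three interned label
   codes packed into 6 contiguous bytes per row (numpy packs
   structured dtypes by default, so padding must be disabled here). */
#pragma pack(push, 1)
typedef struct {
    int8_t  a;
    int8_t  b;
    int32_t c;
} packed_row_t;
#pragma pack(pop)

int main(void)
{
    /* 1 + 1 + 4 = 6 bytes per "tuple", versus dozens of bytes for a
       3-element PyTuple plus its boxed int elements. */
    assert(sizeof(packed_row_t) == 6);
    return 0;
}
```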
Yeah, I agree that's the "right" solution. It would be a very painful change to make just thinking about it-- for hashing I imagine you would just want to hash the bytes.
Yeah, just a string hash. Easier than writing a new hash function for int64_t[]'s, I guess.
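For example, something like FNV-1a over the raw bytes would do; this is just one standard byte hash, not necessarily what was actually used:

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over an arbitrary fixed-size memory buffer. Works equally
   well whether the buffer holds int64_t[] or packed mixed-width ints. */
static inline uint64_t fnv1a(const void *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}
```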
...I missed the part where the khash API doesn't have any way to pass ancillary data to the hash/equality functions. Like the data structure size. Suck. Could use length-prefixed (Pascal) strings, I guess, though it would be an ugly kluge.
You could make a modified version of the khash table that does the right thing. I haven't tried yet, though.
Well, here's a hacked khash that supports hash tables of fixed-size memory buffers:
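The hacked code itself isn't reproduced above. As a simplified sketch of the idea using stock khash, with the buffer width baked in at compile time (the actual hack instead threads a user_data pointer through, as discussed below):

```c
#include <stdint.h>
#include <string.h>
#include "khash.h"   /* klib's khash.h, vendored by pandas */

/* Stock khash cannot pass ancillary data (like a buffer size) to its
   hash/equality hooks, so here the row width is fixed at compile time. */
#define ROW_NBYTES 6

/* X31 string hash, adapted to a known fixed length. */
static inline khint_t row_hash(const char *p)
{
    khint_t h = 0;
    for (int i = 0; i < ROW_NBYTES; i++)
        h = (h << 5) - h + (khint_t)(unsigned char)p[i];
    return h;
}
#define row_equal(a, b) (memcmp((a), (b), ROW_NBYTES) == 0)

/* Map from a row (a pointer into the label buffer) to its position. */
KHASH_INIT(row, const char *, int64_t, 1, row_hash, row_equal)
```

Insertion then follows the usual khash pattern: `khash_t(row) *h = kh_init(row);`, then `k = kh_put(row, h, row_ptr, &ret); kh_value(h, k) = i;` for each row.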
Alright, very good. This work won't make upcoming 0.8.2 (which contains a load of bugfixes) but I'd like to get it done soon.
Cool. For reference, here's a thread on how to work with the "raw memory" behind structured dtypes in Cython. So far it sounds like you just have to use PyArray_DATA directly, but it's not too bad.
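In C terms, the access pattern being described boils down to pointer arithmetic over the array's base pointer. A hypothetical helper (assuming a 1-D, C-contiguous structured array, inside an extension module that has called import_array()):

```c
#include <Python.h>
#include <numpy/arrayobject.h>
#include <string.h>

/* Treat each record of a structured array as an opaque fixed-size
   byte buffer. Illustrative only; assumes the caller has already
   validated that the array is 1-D and C-contiguous. */
static int rows_equal(PyArrayObject *arr, npy_intp i, npy_intp j)
{
    const char *data = (const char *)PyArray_DATA(arr);
    npy_intp itemsize = PyArray_ITEMSIZE(arr);
    return memcmp(data + i * itemsize,
                  data + j * itemsize,
                  (size_t)itemsize) == 0;
}
```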
Any sense of the performance impact of the user_data addition on the other hash tables?
It should be extraordinarily small. Analysis: memory-wise, we're adding a single pointer to each hash table. This is negligible.

Speed-wise, we're passing this pointer into a number of functions, of which the hashing function is probably called most often. So, worst case, this requires for each hash that we (1) look up the value of the pointer and (2) push it to the stack. However, the pointer is probably already in the cache, since it's stored right next to the root pointer for the hash table, it's a trivially predictable memory access, etc. (Note that we are not dereferencing the pointer; we're just reading its value from the hash struct.) And pushing it to the stack is also absurdly cheap, like a few cycles.

But even this is probably too pessimistic an analysis, because for khash, all of these functions that are now taking an extra argument are short, inlineable, and the compiler can see that they never actually use this extra argument. So in practice it probably gets optimized out altogether. My bet is that it is literally indistinguishable from zero.
Actually, I'm being silly -- looking at the code again, the vast majority of your hash/comparison functions are defined as macros, which literally throw away the user_data argument before the compiler even sees the code. So the performance impact really should be nil. (If you really want to be sure, you could just make sure that all of the hash/comparison functions are defined as macros like this.)
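For instance, khash's stock int64 hash is a macro; a hypothetical user_data-aware spelling would just accept and discard the extra argument in the preprocessor:

```c
/* Stock khash: */
#define kh_int64_hash_func(key) (khint32_t)((key) >> 33 ^ (key) ^ (key) << 11)

/* Hypothetical user_data-aware spelling; same expansion, and the
   extra argument never reaches the compiler at all. */
#define kh_int64_hash_func_ud(key, user_data) \
    (khint32_t)((key) >> 33 ^ (key) ^ (key) << 11)
```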
@njsmith just noticed this issue (this is not really the same issue, but relevant). Since the indexing code has changed significantly, I am going to close this. Pls reopen if you have a nice example that involves memory overhead after indexing (a new issue will put it back on the radar). See #8456. So in 0.15.1 (bug-fix release next week) I was able to use this really old pickle. The reported memory usage was completely wrong for a MultiIndex. Here's approx the usage.

And here's the reported usage (which is correct; this includes the actual memory used by the levels as well).
This was on the list, but so it doesn't get forgotten: https://groups.google.com/d/msg/pydata/EaPRB4KNeTQ/Y-3kG5gW3xMJ
Summary: with current pandas master, storing a large multi-index takes a lot of memory once it has been indexed into: ~200 megabytes for a 1 mega-entry index, as compared to ~20 megabytes pickled.
The offending dataframe is available here (this is different from the url in the email thread): http://vorpus.org/~njs/tmp/df-pandas-master.pickle.gz
Partly this is just a bug; there is a large object array of tuples that Wes says shouldn't exist, but seems to anyway.
I was going to also say that it would be nice to be able to just opt out of the hash table entirely, but code inspection suggests that this already happens for indexes with more than _SIZE_CUTOFF = 1000000 entries (see pandas/src/engines.pyx). The array above is just under this threshold. Perhaps this cutoff should be lowered for indices that have higher memory overhead, like multiindexes? Or even be user-controllable when the index is created? (The code I'm using to load in and set up these dataframes already takes elaborate measures to manage memory overhead, so this wouldn't be a hassle.)