High memory usage for MultiIndexes #1752


Closed
njsmith opened this issue Aug 10, 2012 · 17 comments

Labels: Enhancement, MultiIndex, Performance

Comments

njsmith commented Aug 10, 2012

This was on the list, but so it doesn't get forgotten: https://groups.google.com/d/msg/pydata/EaPRB4KNeTQ/Y-3kG5gW3xMJ

Summary: with current pandas master, storing a large MultiIndex takes a lot of memory once it has been indexed into: roughly 200 MB for an index with ~1 million entries, versus roughly 20 MB pickled.

The offending dataframe is available here (this is different from the url in the email thread): http://vorpus.org/~njs/tmp/df-pandas-master.pickle.gz

Partly this is just a bug: there is a large object array of tuples that Wes says shouldn't exist, but it does anyway.

I was also going to say that it would be nice to be able to opt out of the hash table entirely, but code inspection suggests that this already happens for indexes with more than pandas/src/engines.pyx:_SIZE_CUTOFF = 1000000 entries. The index above is just under this threshold. Perhaps this cutoff should be lowered for indexes that have higher memory overhead, like MultiIndexes? Or even be user-controllable when the index is created? (The code I'm using to load and set up these dataframes already takes elaborate measures to manage memory overhead, so this wouldn't be a hassle.)
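
A minimal sketch of the cutoff logic being described (illustrative names only, not the actual engines.pyx code; per the discussion below, the fallback applies to sorted indexes):

#include <stddef.h>

/* Sketch: above the size cutoff, a sorted index can skip building a
   hash table and fall back to binary search. Names are illustrative. */
#define SIZE_CUTOFF 1000000

static int should_build_hash_table(size_t n_entries, int is_monotonic)
{
    if (is_monotonic && n_entries > SIZE_CUTOFF)
        return 0;   /* binary search on the sorted values instead */
    return 1;       /* small enough: build the hash table */
}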

wesm commented Aug 10, 2012

Well, the solution is not to persist the array of tuples (it is cached the first time it's constructed). I would like to improve the internals of MultiIndex; for example, the hash table footprint could be significantly reduced by using a custom C tuple of ints to represent each value. If you write me the relevant data structure code and hash function, I'll see what I can do in a future release (or I'll eventually do it myself).

wesm commented Aug 10, 2012

It'd also be nice to make the hash table cutoff for large sorted indexes configurable.

njsmith commented Aug 10, 2012

I'm not worried about persistence overhead; I'm worried about memory overhead. Obviously my computers do have more than 500 megabytes of RAM, but every experiment involves dozens of these data sets, and right now using pandas roughly halves the number that I can load simultaneously (and many data sets are larger to start with). I already run out of memory before I run out of CPUs.

What is this giant array of tuples for? Is it possible to avoid it entirely?

[Sorry for the truncated notification; Firefox somehow decided that I was done with the comment after one sentence.]

wesm commented Aug 10, 2012

The array of tuples is for reindexing. It's actually the PyTuple objects that are the problem: they are the keys in the hash table. What I'm saying is that if you want a hash table, it needs to contain data structures that are not Python objects. I'll investigate some options to fix it; I could spend a whole 'nother year improving pandas's internals, but it's hard when I have no funding and lots of other priorities.

wesm commented Aug 10, 2012

I think if you store the labels in a single C-contiguous 2D array and use int64_t* as the hash table key type, you'd be in business.
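
A minimal sketch of what that could look like (illustrative code, not pandas internals; the hash is a loose FNV-1a-style adaptation over whole words, and the row width would come from the number of MultiIndex levels):

#include <stddef.h>
#include <stdint.h>

/* Hash one row of a C-contiguous (n_rows x n_levels) int64_t label
   array. The key is just a pointer into that array, so no PyTuple
   objects are ever created. */
static uint32_t hash_label_row(const int64_t *row, size_t n_levels)
{
    uint64_t h = 1469598103934665603ULL;      /* FNV-1a offset basis */
    for (size_t i = 0; i < n_levels; i++) {
        h ^= (uint64_t)row[i];
        h *= 1099511628211ULL;                /* FNV-1a prime */
    }
    return (uint32_t)(h ^ (h >> 32));
}

/* Two keys are equal iff every label code matches. */
static int label_row_equal(const int64_t *a, const int64_t *b,
                           size_t n_levels)
{
    for (size_t i = 0; i < n_levels; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}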

njsmith commented Aug 10, 2012

Okay, right, that makes it clearer what's going on :-).

I think the right abstraction here might be numpy structured dtypes, which just represent several arbitrary C objects concatenated into a single buffer? Basically the same idea as your 2D array, but it would make it easy to work with mixed-length integers if desired. (This is what I was thinking of when I mentioned in the original email that a single 64-bit array could suffice -- if you intern the labels in this case to int8, int8, int32, then each "tuple" is only 6 bytes.) For that matter it might be possible to pretty much generalize the Int64Index code to efficiently handle any such dtypes -- a fixed-size chunk of raw memory is a fixed-size chunk of raw memory. You'd just have to check arr.dtype.itemsize once on the way in and then everything else is the same.
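
In C terms, the packed record such a structured dtype describes is just a struct with no padding; a small sketch (field names made up) of the 6-byte case mentioned above:

#include <assert.h>
#include <stdint.h>

/* The packed 6-byte record an (int8, int8, int32) structured dtype
   describes; numpy stores records unaligned by default, hence the
   packing pragma. */
#pragma pack(push, 1)
typedef struct {
    int8_t  level0;   /* hypothetical interned label codes */
    int8_t  level1;
    int32_t level2;
} packed_key;
#pragma pack(pop)

int main(void)
{
    /* Each "tuple" really is only 6 bytes, and a hash table can treat
       it as an opaque fixed-size buffer. */
    assert(sizeof(packed_key) == 6);
    return 0;
}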

wesm commented Aug 10, 2012

Yeah, I agree that's the "right" solution. Just thinking about it, it would be a very painful change to make. For hashing, I imagine you would just want to hash the bytes.

njsmith commented Aug 10, 2012

Yeah, just a string hash. Easier than writing a new hash function for int64_t[]'s, I guess.

njsmith commented Aug 10, 2012

...I missed the part where the khash API doesn't have any way to pass ancillary data to the hash/equality functions, like the data structure size. Suck. I could use length-prefixed (Pascal) strings, I guess, though it would be an ugly kluge.
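
The kluge in question would look roughly like this (an illustrative sketch: each key carries its own length in its first byte, so the hash and equality functions need no out-of-band size):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length-prefixed ("Pascal string") key: byte 0 holds the payload
   length, so no ancillary data has to reach these functions. */
static uint32_t hash_pkey(const unsigned char *key)
{
    size_t len = key[0];
    uint32_t h = 2166136261u;     /* FNV-1a over the payload bytes */
    for (size_t i = 1; i <= len; i++) {
        h ^= key[i];
        h *= 16777619u;
    }
    return h;
}

static int pkey_equal(const unsigned char *a, const unsigned char *b)
{
    return a[0] == b[0] && memcmp(a + 1, b + 1, a[0]) == 0;
}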

wesm commented Aug 10, 2012

You could make a modified version of the khash table that does the right thing. I haven't tried yet, though.

njsmith commented Aug 10, 2012

Well, here's a hacked khash that supports hash tables of fixed-size memory buffers:
https://github.com/njsmith/pandas/commits/smaller-multi-index

cdef kh_fixstr_t *h = kh_init_fixstr(<void *>item_size)  # user_data carries the fixed key size
kh_get_fixstr(h, <char *>&buf)                           # look up a raw fixed-size buffer
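
The shape of the change, paraphrased in C (illustrative names, not the actual branch): the table struct grows a user_data pointer, and the hash/equality callbacks receive it, so a fixed item size can ride along with the table:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a khash-style table carrying ancillary data. */
typedef struct {
    void *user_data;   /* here: the fixed key size, smuggled as a void* */
    /* ... buckets, flags, occupancy counters, etc. ... */
} fixstr_table;

static uint32_t fixstr_hash(const char *key, void *user_data)
{
    size_t item_size = (size_t)user_data;   /* recovered at call time */
    uint32_t h = 2166136261u;               /* FNV-1a over the bytes */
    for (size_t i = 0; i < item_size; i++) {
        h ^= (unsigned char)key[i];
        h *= 16777619u;
    }
    return h;
}

static int fixstr_equal(const char *a, const char *b, void *user_data)
{
    return memcmp(a, b, (size_t)user_data) == 0;
}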

wesm commented Aug 13, 2012

Alright, very good. This work won't make the upcoming 0.8.2 (which contains a load of bug fixes), but I'd like to get it done soon.

njsmith commented Aug 13, 2012

Cool. For reference, here's a thread on how to work with the "raw memory" behind structured dtypes in Cython:
https://groups.google.com/group/cython-users/browse_thread/thread/eac959af1a11dbc3

So far it sounds like you just have to use PyArray_DATA directly, but that's not too bad.
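
In C that boils down to something like the following (a sketch using the NumPy C API; error checking and the actual hash table insertion are omitted):

#include <Python.h>
#include <numpy/arrayobject.h>

/* Walk the raw packed records behind a structured array; each record
   is a ready-made fixed-size hash key. */
static void index_records(PyArrayObject *arr)
{
    char *buf = (char *)PyArray_DATA(arr);
    npy_intp n = PyArray_DIM(arr, 0);
    npy_intp itemsize = PyArray_ITEMSIZE(arr);

    for (npy_intp i = 0; i < n; i++) {
        char *record = buf + i * itemsize;
        /* hash `record` over `itemsize` bytes and insert it here */
        (void)record;
    }
}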

wesm commented Sep 20, 2012

Any sense of the performance impact of the user_data addition on the other hash tables?

njsmith commented Sep 20, 2012

It should be extraordinarily small.

Analysis: memory-wise, we're adding a single pointer to each hash table. This is negligible. Speed-wise, we're passing this pointer into a number of functions, of which the hashing function is probably called most often. So, worst case, each hash call has to (1) look up the value of the pointer and (2) push it onto the stack. However, the pointer is probably already in the cache, since it's stored right next to the root pointer for the hash table, it's a trivially predictable memory access, etc. (Note that we are not dereferencing the pointer; we're just reading its value from the hash struct.) And pushing it onto the stack is also absurdly cheap, a few cycles at most. But even this is probably too pessimistic an analysis, because for khash, all of these functions that now take an extra argument are short and inlineable, and the compiler can see that they never actually use the extra argument. So in practice it probably gets optimized out altogether.

So my bet is that it is literally indistinguishable from zero.

njsmith commented Sep 20, 2012

Actually, I'm being silly. Looking at the code again, the vast majority of the hash/comparison functions are defined as macros, which literally throw away the user_data argument before the compiler even sees the code. So the performance impact really should be nil. (If you really want to be sure, you could just make sure that all of the hash/comparison functions are defined as macros like this.)
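
That is, something along these lines (illustrative, khash-style; the stock khash int64 hash macro takes only the key):

/* The extra user_data parameter disappears during preprocessing, so
   the compiled code is identical to the single-argument version. */
#define kh_int64_hash_func(key, user_data) \
    (uint32_t)((key) >> 33 ^ (key) ^ (key) << 11)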

jreback commented Nov 2, 2014

@njsmith I just noticed this issue (it's not really the same issue, but it's relevant). Since the indexing code has changed significantly, I am going to close this. Please reopen if you have a nice example that involves memory overhead after indexing (a new issue will put it back on the radar).

See #8456.

This is in 0.15.1 (a bug-fix release coming next week).

I was able to use this really old pickle.

The reported memory usage for a MultiIndex was completely wrong. Here's the approximate actual usage of the labels:

# 0.15.0
In [15]: sum([ l.nbytes for l in df.index.labels ])/(1024*1024.0)
Out[15]: 22.46484375

# 0.15.1/master
In [2]: sum([ l.nbytes for l in df.index.labels ])/(1024*1024.0)
Out[2]: 5.6162109375

And here's the reported usage (which is now correct; it includes the actual memory used by the levels as well):

In [6]: df.memory_usage(index=True)
Out[6]: 
Index    7486690
lle      7852032
lhz      7852032
MiPf     7852032
LLPf     7852032
RLPf     7852032
LMPf     7852032
RMPf     7852032
LDFr     7852032
RDFr     7852032
LLFr     7852032
RLFr     7852032
LMFr     7852032
RMFr     7852032
LMCe     7852032
RMCe     7852032
MiCe     7852032
MiPa     7852032
LDCe     7852032
RDCe     7852032
LDPa     7852032
RDPa     7852032
LMOc     7852032
RMOc     7852032
LLTe     7852032
RLTe     7852032
LLOc     7852032
RLOc     7852032
MiOc     7852032
A2       7852032
HEOG     7852032
rle      7852032
rhz      7852032
dtype: int64
