High memory usage for MultiIndexes #1752
Well, the solution is not to persist the array of tuples (it is cached the first time it's constructed). I would like to improve the internals of MultiIndex-- for example, the hash table footprint could be significantly improved by using a custom C tuple of ints to represent each value. If you write me the relevant data structure code and hash function, I'll see what I can do in a future release (or I'll eventually do it myself).
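As a rough sketch of what such a key type and hash function might look like (nothing here is from pandas itself; the three-level layout and the mixing constant are illustrative choices):

```c
#include <stdint.h>

/* Hypothetical fixed-width key for a 3-level MultiIndex: one integer
   label code per level, with no per-entry PyObject overhead. */
typedef struct {
    int64_t labels[3];
} mi_key_t;

/* Mix the per-level codes together (hash_combine-style, using the
   64-bit golden-ratio constant). */
static inline uint32_t mi_key_hash(mi_key_t k)
{
    uint64_t h = 0;
    for (int i = 0; i < 3; i++)
        h ^= (uint64_t)k.labels[i] + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
    return (uint32_t)h;
}

static inline int mi_key_equal(mi_key_t a, mi_key_t b)
{
    return a.labels[0] == b.labels[0] &&
           a.labels[1] == b.labels[1] &&
           a.labels[2] == b.labels[2];
}
```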
It'd also be nice to make the hash table cutoff for large sorted indexes configurable.
I'm not worrying about persistence overhead, I'm worried about memory overhead. Obviously my computers do have more than 500 megabytes of RAM, but every experiment involves dozens of these data sets, and right now using pandas ~halves the number that I can load simultaneously (and many data sets are larger to start with). I already run out of memory before I run out of CPUs. What is this giant array of tuples for? Is it possible to avoid it entirely? [sorry for the truncated notification, Firefox somehow decided that I was done with the comment after one sentence.]
The array of tuples is for reindexing. It's actually the PyTuple objects that are the problem-- they are keys in the hash table. What I'm saying is that if you want a hash table, it needs to contain data structures that are not Python objects. I'll investigate some options to fix it; I could spend a whole 'nother year on pandas's internals improving these things, but it's hard when I have no funding and lots of other priorities.
I think if you store the labels in a single C-contiguous 2D array and use …
Okay, right, that makes it clearer what's going on :-). I think the right abstraction here might be numpy structured dtypes, which just represent several arbitrary C objects concatenated into a single buffer? Basically the same idea as your 2D array, but it would make it easy to work with mixed-length integers if desired. (This is what I was thinking of when I mentioned in the original email that a single 64-bit array could suffice -- if you intern the labels in this case to int8, int8, int32, then each "tuple" is only 6 bytes.) For that matter it might be possible to pretty much generalize the Int64Index code to efficiently handle any such dtypes -- a fixed-size chunk of raw memory is a fixed-size chunk of raw memory. You'd just have to check arr.dtype.itemsize once on the way in and then everything else is the same.
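To make the arithmetic concrete, a hypothetical C equivalent of that structured dtype, with the int8/int8/int32 widths from the example above:

```c
#include <assert.h>
#include <stdint.h>

/* Rough C equivalent of a numpy structured dtype like
   [('a', 'i1'), ('b', 'i1'), ('c', 'i4')]: three interned label
   codes packed into 6 contiguous bytes per row (numpy packs
   structured dtypes by default, so padding must be disabled here). */
#pragma pack(push, 1)
typedef struct {
    int8_t  a;
    int8_t  b;
    int32_t c;
} packed_row_t;
#pragma pack(pop)

int main(void)
{
    /* 1 + 1 + 4 = 6 bytes per "tuple", versus dozens of bytes for a
       3-element PyTuple plus its boxed int elements. */
    assert(sizeof(packed_row_t) == 6);
    return 0;
}
```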
Yeah, I agree that's the "right" solution. It would be a very painful change to make just thinking about it-- for hashing I imagine you would just want to hash the bytes.
Yeah, just a string hash. Easier than writing a new hash function for int64_t[]'s, I guess.
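For example, something like FNV-1a over the raw bytes would do; this is just one standard byte hash, not necessarily what was actually used:

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over an arbitrary fixed-size memory buffer. Works equally
   well whether the buffer holds int64_t[] or packed mixed-width ints. */
static inline uint64_t fnv1a(const void *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}
```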
...I missed the part where the khash API doesn't have any way to pass ancillary data to the hash/equality functions. Like the data structure size. Suck. Could use length-prefixed (Pascal) strings, I guess, though it would be an ugly kluge.
You could make a modified version of the khash table that does the right thing. I haven't tried yet, though.
Well, here's a hacked khash that supports hash tables of fixed-size memory buffers:
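The hacked code itself isn't reproduced above. As a simplified sketch of the idea using stock khash, with the buffer width baked in at compile time (the actual hack instead threads a user_data pointer through, as discussed below):

```c
#include <stdint.h>
#include <string.h>
#include "khash.h"   /* klib's khash.h, vendored by pandas */

/* Stock khash cannot pass ancillary data (like a buffer size) to its
   hash/equality hooks, so here the row width is fixed at compile time. */
#define ROW_NBYTES 6

/* X31 string hash, adapted to a known fixed length. */
static inline khint_t row_hash(const char *p)
{
    khint_t h = 0;
    for (int i = 0; i < ROW_NBYTES; i++)
        h = (h << 5) - h + (khint_t)(unsigned char)p[i];
    return h;
}
#define row_equal(a, b) (memcmp((a), (b), ROW_NBYTES) == 0)

/* Map from a row (a pointer into the label buffer) to its position. */
KHASH_INIT(row, const char *, int64_t, 1, row_hash, row_equal)
```

Insertion then follows the usual khash pattern: `khash_t(row) *h = kh_init(row);`, then `k = kh_put(row, h, row_ptr, &ret); kh_value(h, k) = i;` for each row.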
Alright, very good. This work won't make upcoming 0.8.2 (which contains a load of bugfixes) but I'd like to get it done soon.
Cool. For reference, here's a thread on how to work with the "raw memory" behind structured dtypes in Cython. So far it sounds like you just have to use PyArray_DATA directly, but it's not too bad.
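In C terms, the access pattern being described boils down to pointer arithmetic over the array's base pointer. A hypothetical helper (assuming a 1-D, C-contiguous structured array, inside an extension module that has called import_array()):

```c
#include <Python.h>
#include <numpy/arrayobject.h>
#include <string.h>

/* Treat each record of a structured array as an opaque fixed-size
   byte buffer. Illustrative only; assumes the caller has already
   validated that the array is 1-D and C-contiguous. */
static int rows_equal(PyArrayObject *arr, npy_intp i, npy_intp j)
{
    const char *data = (const char *)PyArray_DATA(arr);
    npy_intp itemsize = PyArray_ITEMSIZE(arr);
    return memcmp(data + i * itemsize,
                  data + j * itemsize,
                  (size_t)itemsize) == 0;
}
```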
Any sense of the performance impact of the user_data addition on the other hash tables?
It should be extraordinarily small. Analysis: memory-wise, we're adding a single pointer to each hash table. This is negligible.

Speed-wise, we're passing this pointer into a number of functions, of which the hashing function is probably called most often. So, worst case, this requires for each hash that we (1) look up the value of the pointer and (2) push it to the stack. However, the pointer is probably already in the cache, since it's stored right next to the root pointer for the hash table, it's a trivially predictable memory access, etc. (Note that we are not dereferencing the pointer; we're just reading its value from the hash struct.) And pushing it to the stack is also absurdly cheap, like a few cycles.

But even this is probably too pessimistic an analysis, because for khash, all of these functions that are now taking an extra argument are short, inlineable, and the compiler can see that they never actually use this extra argument. So in practice it probably gets optimized out altogether. My bet is that it is literally indistinguishable from zero.
Actually, I'm being silly -- looking at the code again, the vast majority of your hash/comparison functions are defined as macros, which literally throw away the user_data argument before the compiler even sees the code. So the performance impact really should be nil. (If you really want to be sure, you could just make sure that all of the hash/comparison functions are defined as macros like this.)
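For instance, khash's stock int64 hash is a macro; a hypothetical user_data-aware spelling would just accept and discard the extra argument in the preprocessor:

```c
/* Stock khash: */
#define kh_int64_hash_func(key) (khint32_t)((key) >> 33 ^ (key) ^ (key) << 11)

/* Hypothetical user_data-aware spelling; same expansion, and the
   extra argument never reaches the compiler at all. */
#define kh_int64_hash_func_ud(key, user_data) \
    (khint32_t)((key) >> 33 ^ (key) ^ (key) << 11)
```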
@njsmith just noticed this issue (this is not really the same issue, but relevant). Since the indexing code has changed significantly, I am going to close this. Pls reopen if you have a nice example that involves memory overhead after indexing (a new issue will put it back on the radar). See #8456. So in 0.15.1 (bug-fix release next week) I was able to use this really old pickle. The reported memory usage was completely wrong for a MultiIndex. Here's approx the usage.

And here's the reported usage (which is correct; this includes the actual memory used by the levels as well).
This was on the list, but so it doesn't get forgotten: https://groups.google.com/d/msg/pydata/EaPRB4KNeTQ/Y-3kG5gW3xMJ
Summary: with current pandas master, storing a large multi-index takes a lot of memory once it has been indexed into: ~200 megabytes for a 1 mega-entry index, as compared to ~20 megabytes pickled.
The offending dataframe is available here (this is different from the url in the email thread): http://vorpus.org/~njs/tmp/df-pandas-master.pickle.gz
Partly this is just a bug; there is a large object array of tuples that Wes says shouldn't exist, but seems to anyway.
I was going to also say that it would be nice to be able to just opt out of the hash table entirely, but code inspection suggests that this already happens for indexes with more than _SIZE_CUTOFF = 1000000 entries (see pandas/src/engines.pyx). The array above is just under this threshold. Perhaps this cutoff should be lowered for indices that have higher memory overhead, like multiindexes? Or even be user-controllable when the index is created? (The code I'm using to load in and set up these dataframes already takes elaborate measures to manage memory overhead, so this wouldn't be a hassle.)