Example of High Memory Usage of MultiIndex #13904
I don't know much about the indexing internals, but is it possible to design a "fast-path" MultiIndex when, say, level arrays are all integers? And fall back on the original behaviour when the MultiIndex gets more complicated?
The problem here is that an array of tuples is created and cached by some operations. But the tuples aren't needed for many things you'd do with a MultiIndex. Then, you can still do most things with the MultiIndex.
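A minimal sketch of that distinction (illustrative only, not from the original comment):

```python
import numpy as np
import pandas as pd

# A moderately large MultiIndex built from two integer levels.
mi = pd.MultiIndex.from_product([np.arange(2000), np.arange(500)])

# Label-based ops (e.g. .loc slicing, groupby(level=...)) can work from the
# integer codes alone.  Accessing .values, however, materializes an object
# array with one PyTuple per row -- the source of the memory blow-up:
tuples = mi.values
print(type(tuples[0]), len(tuples))  # <class 'tuple'> 1000000
```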
can a simpler example be created for this? (simpler in that it has no numba)
yes, you don't ever actually materialize the tuples, unless something forces them to be built
the act of printing a frame is probably the culprit here somewhere
Actually, one could try disabling the caching of the tuples and see if the perf suite suffers.
@jreback If dependence on Numba is a problem, then the example can be rewritten without it. @chris-b1 I tried your trick, but unfortunately it just breaks all sorts of things.
EDIT: Only
I'm on 0.18.1 as well - the slice I wrote doesn't call the tuple-materializing path.
I've also found that
Yeah, this is actually more problematic than I realized - it's the same problem noted in the original issue - that the tuples are what's being used for the underlying hash table - it's just that some ops, like groupby and slice, are smart enough to not need it.
It looks to me that, because of this problem, MultiIndex is fundamentally broken for large arrays. They just take up too much RAM to be useful in production code. MultiIndex needs to be refactored so that PyTuples aren't used for the underlying hash table.
I'm under a time crunch, so the most I can do for the moment is refactor my code to drop as many MultiIndexes as possible (of which I am using many). But I'm interested in learning about the indexing internals, so I may revisit this.
I agree that the underlying implementation just doesn't scale, and probably was always intended to be replaced. I'm sure a well thought out PR fixing the internals would be accepted, but given the discussion around internals refactoring, this might be the type of thing that is punted to "pandas 2.0"
actually this could / should be done independently of pandas 2.0. This is clearly an area that could have improvement, but the API is not touched. so @PeterKucirek @chris-b1 improvements here in the current environment are welcome.
The previous thread contains a link to a stale branch which looks to implement a fix. So anyone working on this may not have to start from zero.
Totally with you on this. The complicated part is hashing, so we would need to devise a custom hash table implementation that does not rely on materialized PyTuple objects.
so @PeterKucirek if you are interested in trying this out (still a WIP): this uses the new-ish hashing algos.
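The WIP link itself was lost from the scrape; the core idea, as eventually merged, is to hash each row of the MultiIndex to a single uint64 rather than building one PyTuple per row. A sketch using the public API that later shipped (`pd.util.hash_pandas_object`), shown purely for illustration:

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([np.arange(2000), np.arange(500)])

# One uint64 per row, computed level by level -- no PyTuples required.
hashes = pd.util.hash_pandas_object(mi)
print(hashes.dtype, len(hashes))  # uint64 1000000
```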
Here's my simple test (all done in separate sessions), comparing master with this PR.
Ok, I will try it out and report back.
closes pandas-dev#13904 BUG: a qualifier (+) would always display with a MultiIndex, regardless of whether it needed deep introspection for memory usage. PERF: rework MultiIndex.is_monotonic as per @ssanderson's idea
closes pandas-dev#13904 Creates an efficient MultiIndexHashTable in Cython. This allows us to efficiently store a multi-index for fast indexing (.get_loc() and .get_indexer()), replacing the current tuple-based (and GIL-holding) use of the PyObject hash table. This uses the pandas.tools.hashing routines to hash each of the 'values' of a MI to a single uint64. So this makes MI more memory friendly and much more efficient. You get these speedups because the creation of the hashtable is now much more efficient. Author: Jeff Reback <[email protected]> Closes pandas-dev#15245 from jreback/mi and squashes the following commits: 7df6c34 [Jeff Reback] PERF: high memory in MI
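At the API level nothing changes; lookups simply no longer need the tuple array. A small illustration (the printed positions follow from the from_product ordering):

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([np.arange(2000), np.arange(500)])
s = pd.Series(np.arange(len(mi)), index=mi)

# With the hash-based engine, building the lookup table no longer requires
# an object array of one PyTuple per row.
print(s.index.get_loc((1500, 250)))                # 750250
print(s.index.get_indexer([(0, 0), (1999, 499)]))  # [     0 999999]
```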
@jreback @PeterKucirek This is my pull request address: mahsa1991ebrahimian-netflix-OutOfMemory
you would have to show a reproducible example - pls open a new issue
I switched to a newer version of Pandas, which is lighter on memory for large MultiIndexes. I don't recall the specific version, but I'm mostly using 0.23 for work, and that version definitely has better performance.
@jreback here is the new issue I created, could you please have a look?
the issue should be opened on the pandas tracker, and it should be reproducible with code
Sorry, but I don't get your point about the pandas tracker; I am a bit new to GitHub. Could you please clarify what you mean by "on the pandas tracker", and how my code is not reproducible?
this is the pandas tracker; the post should have an example
Thank you, it is done now.
This is a dupe of #1752, in which @jreback recommended a new issue be opened. I have a (hopefully) reproducible example which shows how using MultiIndex can explode RAM usage.
At the top I define a function to get the memory of the current Python process. This will be called as I create arrays to get the "true" space of each object.
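The helper itself didn't survive the copy; a minimal sketch of what it presumably looked like, assuming the psutil package (the name `mem` is hypothetical):

```python
import os
import psutil

def mem() -> float:
    """Return the resident memory of the current process, in MB."""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 ** 2

print(f"baseline: {mem():.1f} MB")
```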
I need some functions to construct a plausible MultiIndex that roughly matches the actual data I'm using.
Constructing the actual MultiIndex:
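The construction code was also lost; below is a rough plain numpy/pandas reconstruction (no Numba). The sizes, level names (origin, destination), and column name (trips) are guesses chosen to plausibly match the ~50 MB table described next:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def zone_ids(n_zones):
    """Integer zone IDs standing in for the real origin/destination codes."""
    return np.arange(1, n_zones + 1, dtype=np.int64)

def fake_trip_table(n_origins=2000, n_destinations=2000):
    """A DataFrame of trip counts indexed by (origin, destination)."""
    mi = pd.MultiIndex.from_product(
        [zone_ids(n_origins), zone_ids(n_destinations)],
        names=["origin", "destination"],
    )
    return pd.DataFrame({"trips": rng.random(len(mi))}, index=mi)

trips = fake_trip_table()
print(f"{len(trips):,} rows")  # 4,000,000 rows
```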
Based on this, my fake trip table is taking up about 171.7 - 121.7 = 50 MB of RAM. I haven't done any fancy indexing, just initialized the table. But when I call trips.info() (which appears to instantiate the array of PyTuples), my process's memory usage balloons to 723 MB! Doing the math, the cached indexer takes up 723.6 - 171.7 = 551.9 MB, roughly a tenfold increase over the actual DataFrame!
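That measurement pattern, repeated with the hypothetical mem() helper and trips table sketched above (numbers will vary by machine and pandas version):

```python
before = mem()
trips.info()  # on pandas of this era, info()/repr paths materialize the tuple array
after = mem()
print(f"trips.info() added {after - before:.1f} MB")
```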
For this fake dataset, this is not so much of a problem, but my production code is 20x the size and I soak up 27 GB of RAM when I so much as look at my trips table.
Any performance tips would be appreciated; the only thing I can think of right now is "don't use a MultiIndex". It sounds like a 'proper' fix for this lies deep in the indexing internals, which is far above my pay grade. But I wanted to at least log this issue with the community.
Output of pd.show_versions():