CoW: Clear dead references every time we add a new one #55008
Conversation
I wasn't sure if it was better to comment on the past issue, the .copy() issue, or create a new issue... I think this may hurt performance more; here is some quick code:
The list approach is now O(n) per insertion; these informal numbers show that it's much worse than WeakSet in every case. Cleaning every 2^16 insertions or something like that would help, but IMO it's not easy to choose the right value, and it will create corner cases where performance degrades much more than the predictable regression from using WeakSet(). Also, if you change your mind, I'd like to contribute code too, no matter the approach. Update: there was a bug in the code; the list seems to be worse than WeakSet from n=1 instead of n=15.
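The benchmark snippet being referred to is not the one below; this is only a hedged sketch of the kind of comparison described, where `Dummy` stands in for a tracked block/DataFrame and the list variant mimics scanning for dead references on every insertion:

```python
import time
import weakref


class Dummy:  # stand-in for a block/DataFrame being tracked
    pass


def bench_list(n_views: int) -> float:
    # Mimics the current approach: weakrefs in a list, with dead references
    # filtered out on every insertion (O(n) per insert, O(n^2) overall).
    referenced: list = []
    keep_alive = []  # keep every view alive so the list never shrinks (worst case)
    start = time.perf_counter()
    for _ in range(n_views):
        obj = Dummy()
        keep_alive.append(obj)
        referenced = [r for r in referenced if r() is not None]
        referenced.append(weakref.ref(obj))
    return time.perf_counter() - start


def bench_weakset(n_views: int) -> float:
    # Mimics the alternative: a WeakSet prunes dead entries automatically.
    referenced = weakref.WeakSet()
    keep_alive = []
    start = time.perf_counter()
    for _ in range(n_views):
        obj = Dummy()
        keep_alive.append(obj)
        referenced.add(obj)
    return time.perf_counter() - start


for n in (10, 100, 1_000, 10_000):
    print(n, bench_list(n), bench_weakset(n))
```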
Maybe we can use a regular set instead and use the finalize parameter of the weakref to delete it from the set? Or is that basically just what WeakSet does?
The weakrefs aren't hashable, so that won't work.
I think weakrefs are actually hashable:
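A small sketch of both points, using a hypothetical `Block` stand-in object: a `weakref.ref` is hashable as long as its referent is hashable, and a plain set plus a weakref callback prunes itself much like the idea suggested above:

```python
import weakref


class Block:  # hypothetical stand-in for an internal pandas block object
    pass


blk = Block()
ref = weakref.ref(blk)
print(hash(ref))  # works: a weakref is hashable as long as its referent is hashable

# The "regular set + callback" idea: the callback removes the weakref
# from the set as soon as the referent is garbage collected.
referenced: set = set()


def _on_dead(wr):
    referenced.discard(wr)


referenced.add(weakref.ref(blk, _on_dead))
del blk
print(len(referenced))  # 0 on CPython: the callback pruned the dead reference
```

This is essentially what WeakSet does internally: it stores weakrefs in a regular set and registers a removal callback on each of them.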
Not really formal benchmarks, but if you think about it, WeakSet would probably already be implemented on top of set() if that were faster. Now, something that needs to be changed in order to use WeakSet instead of a list again is to make Index (and maybe something else?) hashable.
FWIW the benchmarks aren't really representative; it is very, very unlikely that any user has 100 different variables referencing dataframes that point to the same data. I don't think there are many realistic cases where even 10 different objects are alive.
Doesn't that only reinforce that WeakSet is better suited for the situation? It's faster in most cases (it's already +10% faster at n=1) and doesn't require handcrafted solutions.
The WeakSet slows us down when creating objects that are zero-copy right now; that was noticeable. Can you show a pandas example where this is actually noticeable and not noise? Operations have quite some overhead. The list implementation is in the nanosecond range with 20 elements, which is very, very cheap. I am not convinced that this will have a noticeable impact. I am happy to discuss other options if there is something where this would slow us down.
Just to understand better, do you mean operations like DataFrame._from_arrays, as in the original PR?
Yes.
I think the easiest way to reach worst cases unintentionally is when you have tons of slices. I'm still not sure if these cases are common enough, though.
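As a rough, hedged illustration of the "tons of slices" scenario, using the internal `refs.referenced_blocks` attribute that also appears later in this thread (exact counts depend on the pandas version):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((1_000, 4)))
# Each slice is a zero-copy view that registers a weakref on the shared block,
# so the tracking list grows roughly with the number of live views.
views = [df.iloc[i : i + 10] for i in range(500)]
print(len(df._mgr.blocks[0].refs.referenced_blocks))  # on the order of the number of live views
```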
I think there are a few options here:
Let me know if I'm overthinking 😅
If I am reading the benchmarks correctly, the difference is 10 µs with 30,000 views? If so, I also agree that the theory here should take a backseat to what's likely to happen in practice.
Yeah, you are overthinking. I don't care much if someone creates 10k views of a single object. What I wanted initially is a real-world pandas use case where this matters, not a while loop where you call the same method over and over again.
It's actually 10 seconds.
Perfect.
Ah OK. 10 seconds is definitely worse than 10 µs, but I don't think we should hold this up for that. Of course performance can always be improved in follow-ups, but getting the right functionality first is most important. I like the attention to detail @wangwillian0, and there are probably other places in the code base where we can take a critical review like that; I just don't think this particular case is the most urgent to optimize.
Just for completeness: I might be wrong here as well. We will switch if this shows up negatively in our benchmarks in any way.
Owee, I'm MrMeeseeks, Look at me. There seems to be a conflict; please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the backport label once the PR is merged. If these instructions are inaccurate, feel free to suggest an improvement.
(cherry picked from commit 7134f2c)
@phofl Can you take a look if you have the time? I don't have a list, but if you go to
And in addition to our benchmarks, in the meantime we also got several reports (#55256, #55245, apache/arrow#38260), so there were actual use cases hitting this. The last two of those linked issues are essentially cases where you have repeated column access in wide dataframes (typically getting each column as a Series object). A small illustration of this:

```python
import numpy as np
import pandas as pd

N = 10_000
df = pd.DataFrame(np.random.normal(size=(50, N)))
result = list(df.items())
```

This has become slower in 2.1.1 compared to 2.1.0:
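A hedged sketch of how that slowdown could be measured with the standard library's timeit, recreating the dataframe in the setup so the item_cache is not reused (as noted below); the actual numbers depend on the installed pandas version:

```python
import timeit

setup = (
    "import numpy as np; import pandas as pd; "
    "df = pd.DataFrame(np.random.normal(size=(50, 10_000)))"
)
# Re-run the setup for every repetition so the item_cache is not reused,
# and time a single pass over all columns each time.
times = timeit.repeat("list(df.items())", setup=setup, number=1, repeat=5)
print(min(times))
```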
(I need to recreate the dataframe in the timeit setup, otherwise the item_cache would get used from the second run, and when using that, there is no speed difference.)

For further illustration, my understanding is that the reason we originally did this was to ensure the list of weakrefs gets cleaned up regularly, because if it keeps growing, that can start to consume quite some memory. For this example (and with accessing each column only once):

```python
>>> res = list(df.items())
>>> referenced_blocks = df._mgr.blocks[0].refs.referenced_blocks
>>> len(referenced_blocks)
10001
>>> import sys
>>> (sys.getsizeof(referenced_blocks) + sys.getsizeof(referenced_blocks[0]) * len(referenced_blocks)) / 1024**2
0.8442459106445312
```

The weakrefs use almost one MB (which is of course not that much, but if this is done repeatedly without cleaning up, it can grow).

Something that also worsens the problem is that it's not enough for those column objects to go out of scope, because they are cached in the parent object:

```python
>>> df._mgr.blocks[0].refs._clear_dead_references()
>>> len(df._mgr.blocks[0].refs.referenced_blocks)
10001
>>> del res
>>> df._mgr.blocks[0].refs._clear_dead_references()
>>> len(df._mgr.blocks[0].refs.referenced_blocks)
10001
>>> df._clear_item_cache()
>>> df._mgr.blocks[0].refs._clear_dead_references()
>>> len(df._mgr.blocks[0].refs.referenced_blocks)
1
```

So in this specific case, even if the application that iterates over the columns doesn't keep track of those columns, the problem still occurs (the weakref list keeps growing, and thus slows down column access when we try to clear dead refs every time).
Yeah, I'll look into this today and see if exponential backoff helps with these problems (hopefully).
Thinking through the possible solutions (including the ones listed by @wangwillian0 above):
@wangwillian0 I didn't understand everything you wrote in the overview above at #55008 (comment), so if I am missing something, let me know. Without trying to code anything yet, my thought is that for 2.1.2, let's try to do the "quick" fix of only calling
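A minimal sketch of the "exponential backoff" idea mentioned above, assuming a simplified tracker class (illustrative names like `ReferenceTracker` and `clear_threshold`; this is not the actual pandas implementation):

```python
import weakref


class ReferenceTracker:
    """Illustrative sketch (not the actual pandas code): keep weakrefs in a list,
    but only scan for dead references once the list grows past an adaptive
    threshold, instead of scanning on every single insertion."""

    def __init__(self) -> None:
        self.referenced_blocks: list = []
        self.clear_threshold = 500  # illustrative starting value

    def _clear_dead_references(self) -> None:
        if len(self.referenced_blocks) <= self.clear_threshold:
            return  # skip the O(n) scan most of the time
        self.referenced_blocks = [
            ref for ref in self.referenced_blocks if ref() is not None
        ]
        if len(self.referenced_blocks) > self.clear_threshold // 2:
            # Most tracked references are genuinely alive: double the threshold
            # (exponential backoff) so a long-lived set of views does not
            # trigger a full rescan on every new insertion.
            self.clear_threshold *= 2

    def add_reference(self, blk) -> None:
        self._clear_dead_references()
        self.referenced_blocks.append(weakref.ref(blk))


class Blk:  # stand-in for an internal block object
    pass


tracker = ReferenceTracker()
for _ in range(100_000):
    tracker.add_reference(Blk())  # dead refs are only pruned once the list passes the threshold
```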
Hi, sorry for the delay.
I mentioned the possibility of making Index hashable because I thought it was immutable. Should a new issue be created? Or maybe at least a rephrasing of the documentation?
I created a PR with an alternative solution based on the code of WeakSet at #55539:
The 20% was added by just declaring an extra function in the constructor, so I don't think there are a lot of elegant alternatives.
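For context, CPython's WeakSet follows roughly this shape internally: the constructor defines a removal callback that prunes dead references, which is the kind of "extra function in the constructor" referred to above. A very simplified sketch (not the code from #55539):

```python
import weakref


class RefSet:
    """Simplified sketch of the WeakSet pattern: the removal callback is defined
    inside __init__, which is the per-instance overhead mentioned above."""

    def __init__(self) -> None:
        self.data: set = set()

        def _remove(wr, selfref=weakref.ref(self)):
            # Runs when a tracked object is garbage collected; prune its weakref.
            owner = selfref()
            if owner is not None:
                owner.data.discard(wr)

        self._remove = _remove

    def add(self, obj) -> None:
        self.data.add(weakref.ref(obj, self._remove))

    def __len__(self) -> int:
        return len(self.data)
```

The callback holds only a weak reference to the container itself, so registering it does not create a reference cycle that would keep the tracker alive.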
I am not entirely sure rephrasing will help there (without making it needlessly complex for most users, or maybe putting it in a note further down): the sequence itself is immutable, in the sense that you can't change values with setitem; it's only certain attributes that are settable.
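For example (the exact error message may vary between pandas versions):

```python
import pandas as pd

idx = pd.Index([1, 2, 3])
try:
    idx[0] = 10  # setitem on the values is not allowed
except TypeError as exc:
    print(exc)
idx.name = "my_index"  # but some attributes, like .name, are settable
print(idx)
```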
I'd like to try that one before going back to a set.