API / CoW: always return new objects for column access (don't use item_cache) #49450
Conversation
I think you're right on both fronts. If this isn't necessary anymore, I'd be OK with getting rid of it.
Seems reasonable
One of the tricky cases that the tests turned up is how we decide in the groupby code whether to exclude a column (for nuisance columns / `numeric_only`) or to include the groups as a column in the output (if `as_index=False`). The code that checks for this case (and that checks whether the passed Series actually corresponds to a column in the calling dataframe) is located at pandas/core/groupby/grouper.py, lines 842 to 854 in 645b499.
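Paraphrased, that check boils down to an identity comparison (a simplified sketch, not the exact source):

```python
# Simplified paraphrase of the check in grouper.py: a passed grouper only
# counts as "a column of obj" if it is the very same object that column
# access returns.
def is_in_obj(gpr, obj) -> bool:
    if not hasattr(gpr, "name"):
        return False
    try:
        return gpr is obj[gpr.name]  # identity, not equality
    except (KeyError, IndexError):
        return False
```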
So this relies on identity (an `is` check comparing the passed Series with the corresponding column of the calling dataframe).

To illustrate the behavioural consequence of this (with the main branch, not using the changes in the current PR, but using a manual `.copy()` of the key column to mimic always getting a new object):

```python
>>> df = pd.DataFrame({"key": ['a', 'b', 'a'], "col": [1, 2, 3]})
>>> df.groupby(df["key"]).sum()  # the same output as `df.groupby("key").sum()`
     col
key
a      4
b      2
>>> df.groupby(df["key"], as_index=False).sum()  # the same output as `df.groupby("key", as_index=False).sum()`
  key  col
0   a    4
1   b    2
>>> df.groupby(df["key"].copy()).sum()
FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
     col
key
a      4
b      2
>>> df.groupby(df["key"].copy(), as_index=False).sum()
FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
   col
0    4
1    2
```

So the last case actually doesn't include the keys in the output (is this a bug on main?), and the two cases passing a copied Series also trigger the warning (and will raise an error in the future, because the "key" column is not being automatically excluded).
Does this problem get easier once `numeric_only=None` is gone?
Not really, it would only make it more urgent to fix, since the current warning would actually become an error (so the code example would stop working). The main alternative to the identity check is doing an actual equality check (if the Series' name matches one of the column names, then check whether that Series' values are equal to the column). I will check how expensive this is compared to the actual groupby operation.
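A minimal sketch of what such a fallback could look like (the helper name and exact checks are illustrative, not the actual implementation):

```python
import pandas as pd

def series_matches_column(ser: pd.Series, df: pd.DataFrame) -> bool:
    # Only fall back to comparing values when the name matches a column;
    # Series.equals checks shape, index and values (treating NaNs as equal).
    return ser.name in df.columns and ser.equals(df[ser.name])
```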
I did some timings to explore the cost of checking for equality (when passing a Series with a name that matches a column name), taking the extreme case of a single column to aggregate (so the cost of the check is relatively large compared to the actual grouped reduction operation). For an integer group key, the cost seems to be relatively small (2-3% of the time that the groupby takes):
But for string group values, the equality check is much more expensive (almost as much time as the groupby itself):
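The original benchmark listings are not preserved here; a sketch of the kind of comparison being timed (sizes and names are illustrative):

```python
import numpy as np
import pandas as pd

N = 1_000_000
df = pd.DataFrame({
    "key": np.random.choice(list("abcdefghij"), N),  # string keys (the slow case)
    "col": np.random.randn(N),
})
key = df["key"].copy()  # a copy, so the identity check fails

# In IPython:
# %timeit key.equals(df["key"])   # cost of the equality fallback
# %timeit df.groupby(key).sum()   # cost of the actual grouped reduction
```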
I can't directly think of a better way of checking or keeping track that two Series objects are identical. In the short term, since this additional equality check is 1) only done when CoW is enabled, 2) only done when you actually provide a Series with a name that is present in the dataframe, and 3) only costly if all values are actually the same (so only if you are actually passing a column that is present in the dataframe, and not a derived column), I think it is fine to use this equality check.
Any more comments here? (cc @phofl @mroeschke)
This sounds like it could work but would be really tough to maintain (I already struggle with the `_cacher` stuff). Sounds like the perf hit is only in a few corner cases. I think we're OK not to stress about it for now; we can revisit if we get complaints.
Maybe clever usage of `np.shares_memory`?
```python
df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})
df_orig = df.copy()

s1 = df["a"]
```
Is it worth testing the other indexing methods that will return a column as a Series (`loc`/`iloc`)?
Good call, `loc` seems to use the same caching, so good to cover that in the test as well. Updated.
This also appears to need a whatsnew note.
That's an interesting idea. We would need to look at the actual memory address as well (e.g. …)
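The exact snippet is not preserved above; one way such an address check could look (purely illustrative, not the actual proposal):

```python
import numpy as np

def same_data_pointer(a: np.ndarray, b: np.ndarray) -> bool:
    # Compare the addresses of the first element of each array's buffer.
    return a.__array_interface__["data"][0] == b.__array_interface__["data"][0]
```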
Thanks for updating this PR. I think it might be worth it to remove item_cache entirely, even with the perf hit. I was just investigating #29411, and it seems like anything that iterates through columns with …
Personally, I would like to keep that out of this PR. This one is specifically for CoW, where this new behaviour is actually required to implement the semantics properly, and it doesn't change any default behaviour. If we want to remove the item_cache entirely, that requires a more elaborate discussion about the potential short-term impacts of doing so (not only performance but also behaviour, as it is a breaking change, so we would have to see whether that's something we want to do for 2.0). (And to be clear, I am all for getting rid of the item_cache, but a route towards that could also be just leaving the current item_cache as-is; when we then move towards CoW by default (e.g. pandas 3.0), we automatically get rid of it.)
Yeah, agreed. I just decided to post here since there already seemed to be a bit of discussion on item_cache, but I'll open up an issue to discuss removing item_cache entirely. (Also going to look into potential improvements for …)
Yeah, so we actually still call …
Can we check if a slice over the first element of both arrays shares memory (e.g. …)? This should remove the need to check for different starting points?
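The original snippet is missing above; the idea, as I understand it, would be something like this (illustrative sketch):

```python
import numpy as np

def maybe_same_view(a: np.ndarray, b: np.ndarray) -> bool:
    # If the first elements occupy the same memory and the lengths match,
    # the two views presumably point at the same data.
    return len(a) == len(b) and np.shares_memory(a[:1], b[:1])
```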
Couple of comments
I suppose this is in response to #49450 (comment)? I am not sure how that helps. Your code snippet returns True, but what we actually want is something that returns False in such a case (you can have two Series objects where the underlying array is a different slice of the same parent array, like …)
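The concrete example is cut off above, but the situation being described can be illustrated like this (my own example, not the original one):

```python
import numpy as np

arr = np.arange(10)
a = arr[0:5]
b = arr[2:7]  # a different, overlapping slice of the same parent array

np.shares_memory(a, b)  # True, even though a and b are not the same data
```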
@phofl can you have a look at my question above?
Sorry, I missed your question. So if the first entry of the array shares memory and the length is equal, shouldn't they be the same?
These are two different starting points in the same array, but that returns False. What am I missing?
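The snippet being discussed is not preserved; the following reproduces the behaviour being described (my own example):

```python
import numpy as np

arr = np.arange(10)
a = arr[0:5]
b = arr[1:6]  # two different starting points in the same array

np.shares_memory(a[:1], b[:1])  # False: the first elements do not overlap
```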
I was just running your snippet, and there the starting point was the same ... and I didn't think further ;)
I will look into optimizing this equality check in a follow-up (it's for a corner case that doesn't affect normal non-CoW mode anyway, and it's a bit more complicated than just using …)
-> #50730
Currently, we use an item cache for DataFrame columns -> Series. Whenever we access a certain column, we cache the resulting Series in `df._item_cache`, and the next time we access a column, we first check if that column already exists in the cache and, if so, return it directly.

I suppose this was done to make repeated column access faster (although the Series construction overhead for this fast path has also improved, I think). But it also has some behavioural consequences, i.e. Series objects from column access can be identical objects, depending on the context:
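The original example is not preserved in this text; a minimal illustration of the cached-identity behaviour (on main, without CoW):

```python
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> s1 = df["a"]
>>> s2 = df["a"]
>>> s1 is s2  # both accesses return the very same cached object
True
```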
This can actually already cause very subtle differences in our standard behaviour. But specifically for Copy-on-Write, we need every indexing operation to return a new object in order for the CoW mechanism to work correctly. So the fact that in the example above two separate indexing operations return the same object (`s1` and `s2`) is a problem / bug in the CoW implementation (mutating `s1` would trigger CoW and not mutate the parent `df`, but because of the caching and the objects being identical, `s2` is still changed as well, violating the copy/view rules under CoW).

So in this PR I am fixing that by never making use of the item cache if we are using Copy-on-Write. This also turns up some corner cases in the tests ...
xref #48998