
API / CoW: always return new objects for column access (don't use item_cache) #49450


Merged

Conversation


@jorisvandenbossche jorisvandenbossche commented Nov 1, 2022

Currently, we use an item cache for DataFrame columns -> Series. Whenever we access a certain column, we cache the resulting Series in df._item_cache, and the next time we access a column, we first check whether that column already exists in the cache and, if so, return it directly.
I suppose this was done to make repeated column access faster (although the Series construction overhead for this fast path has also improved, I think). But it also has behavioral consequences, i.e. Series objects from column access can be identical objects, depending on the context:

>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> s1 = df["a"]
>>> s2 = df["a"]
>>> df['b'] = 10 # set existing column -> clears the item cache
>>> s3 = df["a"]
>>> s1 is s2
True
>>> s1 is s3
False

This can already cause very subtle differences in our standard behaviour. But specifically for Copy-on-Write, we need every indexing operation to return a new object in order for the CoW mechanism to work correctly. So the fact that in the example above two separate indexing operations return the same object (s1 and s2) is a problem / bug in the CoW implementation (mutating s1 would trigger CoW and not mutate the parent df, but because of the caching and the objects being identical, s2 is still changed as well, violating the copy/view rules under CoW).
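
As a minimal sketch of that violation (assuming CoW is enabled via the mode.copy_on_write option and the item cache is still in place, i.e. main before this PR; illustrative, not an exact reproducer):

>>> pd.set_option("mode.copy_on_write", True)
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> s1 = df["a"]
>>> s2 = df["a"]  # with the item cache, s1 and s2 are the same object
>>> s1.iloc[0] = 100  # CoW correctly leaves the parent df untouched ...
>>> df["a"].tolist()
[1, 2, 3]
>>> s2.iloc[0]  # ... but s2 changes as well, because s2 is s1
100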

So in this PR I am fixing that by never making use of the item cache when Copy-on-Write is enabled. This also turns up some corner cases in the tests ...

xref #48998

@jbrockmendel
Member

I suppose this was done for making repeated column access faster (although the Series construction overhead for this fast path case also has improved I think)

I think you're right on both fronts. If this isn't necessary anymore, I'd be OK with getting rid of it.

Member

@WillAyd WillAyd left a comment

Seems reasonable

@jorisvandenbossche
Member Author

One of the tricky cases that the tests turned up is how, in the groupby code, we decide whether to exclude a column (for nuisance columns / numeric_only) or include the groups as a column in the output (if as_index=False) when a Series (instead of the column name) is passed as the groupby key.

The code that checks for this case (and to see whether the passed Series actually corresponds to a column in the calling dataframe) is located at:

# if the grouper is obj[name]
def is_in_obj(gpr) -> bool:
    if not hasattr(gpr, "name"):
        return False
    try:
        return gpr is obj[gpr.name]
    except (KeyError, IndexError, InvalidIndexError):
        # IndexError reached in e.g. test_skip_group_keys when we pass
        # lambda here
        # InvalidIndexError raised on key-types inappropriate for index,
        # e.g. DatetimeIndex.get_loc(tuple())
        return False

and so this relies on identity (gpr is obj[gpr.name]), which stops working with the change in this PR to return a new object every time the column is accessed.

To illustrate the behavioural consequence of this (with the main branch, not using the changes in the current PR, but using copy() to trigger grouping by a Series that is not identical):

>>> df = pd.DataFrame({"key": ['a', 'b', 'a'], "col": [1, 2, 3]})
>>> df.groupby(df["key"]).sum()  # the same output as `df.groupby("key").sum()`
     col
key     
a      4
b      2
>>> df.groupby(df["key"], as_index=False).sum()  # the same output as `df.groupby("key", as_index=False).sum()`
  key  col
0   a    4
1   b    2

>>> df.groupby(df["key"].copy()).sum()
FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
     col
key     
a      4
b      2

>>> df.groupby(df["key"].copy(), as_index=False).sum()
FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
   col
0    4
1    2

So the last case actually doesn't include the keys in the output (is this a bug on main?), and the two cases passing a copied Series also trigger the warning (and will raise an error in the future, because the "key" column is not automatically excluded).

@jbrockmendel
Member

Does this problem get easier once numeric_only=None is gone?

@jorisvandenbossche
Member Author

Not really, it would only make it more urgent to fix, since the current warning would actually become an error (so the code example would stop working).

The main alternative to the identity check is doing an actual equality check (if the Series' name matches one of the column names, then check whether that Series' values are equal to the column). I will check how expensive this is compared to the actual groupby operation.
This would also affect the behaviour when you actually pass a Series that is not directly coming out of a dataframe but happens to have the same name and values as one of the columns of that dataframe. But I can't directly think of any actual use cases where this would be a problem.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 4, 2022

I did some timings to explore the cost of checking for equality (when passing a Series with a name that matches a column name), taking the extreme case of a single column to aggregate (so the cost of the check is relatively large compared to the actual grouped reduction operation).

For an integer group key, the cost seems to be relatively small (2-3% of the time that the groupby takes):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": list(range(10))*10000, "col": np.random.randn(100000)})
s1 = df["key"].copy()
s2 = df["key"].copy()

In [38]: %timeit df.groupby("key").mean()
1.38 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [41]: %timeit s1.equals(s2)
42.4 µs ± 405 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But for string group values, the equality check is much more expensive (taking almost as much time as the groupby itself):

df = pd.DataFrame({"key": list("abcdefghij")*10000, "col": np.random.randn(100000)})
s1 = df["key"].copy()
s2 = df["key"].copy()

In [46]: %timeit df.groupby("key").mean()
3.29 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [47]: %timeit s1.equals(s2)
2.94 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The time taken by equals here is almost entirely due to lib.array_equivalent_object. This could be optimized a little for strings (if you were using StringDtype), but only by ~30% based on a quick test.

I can't directly think of a better way of checking or keeping track of whether two Series objects are identical.
We could actually use the current machinery that keeps track of the parent dataframe (Series._cacher, which is currently being used to update the DataFrame's item_cache when we update the Series), and then check whether two Series objects with the same name have the same parent dataframe in their cache. But it would be nice if we could actually get rid of this code with CoW, and I am not sure it is worth keeping it for this performance issue in a corner case of groupby.

In the short term, since this additional equality check is 1) only done when CoW is enabled, 2) only done when you actually provide a Series with a name that is present in the dataframe, and 3) only costly if all values are actually the same (so only if you are actually passing a column that is present in the dataframe, and not a derived column), I think it is fine to use this equality check.
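
As a rough sketch of what such a check could look like (hypothetical helper, not the exact code added in this PR; obj is the calling DataFrame and gpr the passed group key):

def is_column_of(gpr, obj) -> bool:
    if not hasattr(gpr, "name"):
        return False
    try:
        # identity still catches the case where the item cache handed out
        # the same object (non-CoW mode)
        if gpr is obj[gpr.name]:
            return True
        # under CoW every column access returns a new object, so fall back
        # to an equality check; only reached when gpr.name matches a column
        return gpr.equals(obj[gpr.name])
    except (KeyError, IndexError):
        return False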

@jorisvandenbossche
Member Author

Any more comments here? (cc @phofl @mroeschke)

@jbrockmendel
Member

jbrockmendel commented Nov 18, 2022

We could actually use the current machinery that keeps track of the parent dataframe [...] But it would be nice if we could actually get rid of this code

This sounds like it could work but would be really tough to maintain (I already struggle with the _cacher stuff).

Sounds like the perf hit is only in a few corner cases. I think we're OK not stressing about it for now; we can revisit if we get complaints.

I can't directly think of a better way of checking or keeping track that two Series objects are identical.

maybe clever usage of np.shares_memory?

df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})
df_orig = df.copy()

s1 = df["a"]
Member

Is it worth testing the other indexing methods that will return a column as a Series (loc/iloc)?

Member Author

Good call, loc seems to use the same caching, so good to cover that in the test as well. Updated.
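
A rough sketch of the kind of assertion being discussed (hypothetical test, not the exact one added in this PR):

import pandas as pd

def test_column_access_returns_new_object_under_cow():
    with pd.option_context("mode.copy_on_write", True):
        df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
        # both plain __getitem__ and .loc column access should return new objects
        assert df["a"] is not df["a"]
        assert df.loc[:, "a"] is not df.loc[:, "a"]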

Member

@mroeschke mroeschke left a comment

Also appears to need a whatsnew note.

@jorisvandenbossche
Member Author

I can't directly think of a better way of checking or keeping track that two Series objects are identical.

maybe clever usage of np.shares_memory?

That's an interesting idea. We would need to look at the actual memory address as well (e.g. ndarray.__array_interface__["data"] from Python), to catch the fact that two slices of the same array (of the same length, with some overlap but starting at a different location in the parent array) share memory but are not identical.
So let's leave it as is for now, but if this turns out to be a perf issue in practice, that could be a potential way to tackle it.
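
As a minimal sketch of that kind of check for plain NumPy-backed columns (the helper name is made up; __array_interface__["data"] gives the start address of the underlying buffer):

import numpy as np

def same_array(a, b) -> bool:
    # only treat two arrays as "the same column" if they start at the same
    # memory address and have the same shape and strides; this rules out
    # overlapping-but-shifted slices that np.shares_memory would accept
    return (
        a.__array_interface__["data"] == b.__array_interface__["data"]
        and a.shape == b.shape
        and a.strides == b.strides
    )

na = np.array([1, 2, 3, 4])
same_array(na[:3], na[:3])   # True
same_array(na[:3], na[1:4])  # False, even though np.shares_memory(na[:3], na[1:4]) is True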

@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Jan 3, 2023
@lithomas1
Member

lithomas1 commented Jan 3, 2023

Thanks for updating this PR. I think it might be worth it to remove item_cache entirely even with the perf hit.

I was just investigating #29411, and it seems like anything that iterates through columns with items() (stuff like round(), memory_usage()) is affected (it seems to double memory usage, but I'll have to double-check with ASV), while with something like sort_values memory usage is flat after garbage collection.

@jorisvandenbossche
Member Author

I think it might be worth it to remove item_cache entirely even with the perf hit.

Personally I would like to keep that out of this PR. This one is specifically for CoW, where this new behaviour is actually required to implement the semantics properly, and doesn't change any default behaviour.

If we want to remove the item_cache entirely, that requires a more elaborate discussion about potential short term impacts of doing this (not only performance but also behavioural, as it is a breaking change, so we would have to see if that's something we want to do for 2.0).

(and to be clear, I am all for getting rid of the item_cache, but a route towards that could also be to just leave the current item_cache as is, and then, when we move towards CoW by default (e.g. pandas 3.0), we automatically get rid of it)

@lithomas1
Member

lithomas1 commented Jan 3, 2023

I think it might be worth it to remove item_cache entirely even with the perf hit.

Personally I would like to keep that out of this PR. This one is specifically for CoW, where this new behaviour is actually required to implement the semantics properly, and doesn't change any default behaviour.

If we want to remove the item_cache entirely, that requires a more elaborate discussion about potential short term impacts of doing this (not only performance but also behavioural, as it is a breaking change, so we would have to see if that's something we want to do for 2.0).

(and to be clear, I am all for getting rid of the item_cache, but a route towards that could also be to just leave the current item_cache as is, and then, when we move towards CoW by default (e.g. pandas 3.0), we automatically get rid of it)

Yeah, agreed. I just decided to post here since there already seemed to be a bit of discussion on item_cache, but I'll open up an issue to discuss removing item_cache entirely. (Also going to look into potential improvements for array_equivalent_object.)

For this PR, though, do you think it would make sense to still clear the item cache with CoW enabled?
(Nevermind, I misunderstood how CoW works)

@jorisvandenbossche
Member Author

For this PR, though, do you think it would make sense to still clear the item cache with CoW enabled?
(Nevermind, I misunderstood how CoW works)

Yeah, so we actually still call self._clear_item_cache() in various places, but because we never update the _item_cache when CoW is enabled, this operation is a no-op (it's clearing an empty dict).

@phofl
Member

phofl commented Jan 7, 2023

Can we check whether a slice over the first element of both arrays shares memory? E.g.

na = np.array([1, 2, 3])
na2 = na[:]
np.shares_memory(na[:1], na2[:1])

This should remove the need to check for different starting points?

Member

@phofl phofl left a comment

Couple of comments

@jorisvandenbossche
Member Author

Can we check if a slice over the first element of both arrays share memory? ... This should remove the need to check for different starting points?

I suppose this is in response to #49450 (comment)? I am not sure how that helps. Your code snippet returns True, but what we actually want is something that returns False in such a case (you can have two Series objects where the underlying array is a different slice of the same parent array, like na[:1] and na2[:1], but while those share memory, they are _not_ equal).
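
To illustrate the concern (overlapping slices of the same parent array share memory but hold different values):

na = np.array([1, 2, 3, 4])
np.shares_memory(na[:3], na[1:4])  # True: the slices overlap in the parent array
np.array_equal(na[:3], na[1:4])    # False: [1, 2, 3] vs [2, 3, 4]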

@jorisvandenbossche
Member Author

@phofl can you have a look at my question above?

@phofl
Member

phofl commented Jan 13, 2023

Sorry, I missed your question.
I can't reproduce that this returns True if the starting point is not the same. Just to be sure that I understand you correctly: we are comparing a column from a DataFrame with our group key, which is a Series, correct?

So if the first entry of the array shares memory and the length is equal, shouldn't they be the same?

na = np.array([1, 2, 3, 4])
np.shares_memory(na[:1], na[1:2])

These are two different starting points in the same array, but that returns False. What am I missing?

@jorisvandenbossche
Member Author

I can't reproduce that this returns True, if the starting point is not the same.

I was just running your snippet, and there the starting point was the same .. and didn't think further ;)

@jorisvandenbossche
Member Author

I will look into optimizing this equality check in a follow-up (it's for a corner case that doesn't affect normal non-CoW mode anyway, and it's a bit more complicated than just using np.shares_memory for extension arrays)

@jorisvandenbossche jorisvandenbossche merged commit b1b56d4 into pandas-dev:main Jan 13, 2023
@jorisvandenbossche jorisvandenbossche deleted the cow-no-item-cache branch January 13, 2023 12:56
@jorisvandenbossche
Member Author

I will look into optimizing this equality check in a follow-up (it's for a corner case that doesn't affect normal non-CoW mode anyway, and it's a bit more complicated than just using np.shares_memory for extension arrays)

-> #50730
