
API / CoW: always return new objects for column access (don't use item_cache) #49450


Merged

Conversation


@jorisvandenbossche jorisvandenbossche commented Nov 1, 2022

Currently, we use an item cache for DataFrame columns -> Series. Whenever we access a certain column, we cache the resulting Series in df._item_cache, and the next time we access a column, we first check whether that column already exists in the cache and, if so, return it directly.
I suppose this was done to make repeated column access faster (although the Series construction overhead for this fast path has also improved, I think). But it also has behavioral consequences, i.e. Series objects from column access can be identical objects, depending on the context:

>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> s1 = df["a"]
>>> s2 = df["a"]
>>> df['b'] = 10 # set existing column -> clears the item cache
>>> s3 = df["a"]
>>> s1 is s2
True
>>> s1 is s3
False

This can already cause very subtle differences in our standard behaviour. But specifically for Copy-on-Write, we need every indexing operation to return a new object in order for the CoW mechanism to work correctly. So the fact that in the example above two separate indexing operations return the same object (s1 and s2) is a problem / bug in the CoW implementation (mutating s1 would trigger CoW and not mutate the parent df, but because of the caching and the objects being identical, s2 is still changed as well, violating the copy/view rules under CoW).
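
As a minimal sketch of that violation (assuming CoW is enabled via the mode.copy_on_write option and the item cache is still in place, i.e. main before this PR; illustrative, not an exact reproducer):

>>> pd.set_option("mode.copy_on_write", True)
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> s1 = df["a"]
>>> s2 = df["a"]  # with the item cache, s1 and s2 are the same object
>>> s1.iloc[0] = 100  # CoW correctly leaves the parent df untouched ...
>>> df["a"].tolist()
[1, 2, 3]
>>> s2.iloc[0]  # ... but s2 changes as well, because s2 is s1
100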

So in this PR I am fixing that by never making use of the item cache when Copy-on-Write is enabled. This also turns up some corner cases in the tests ...

xref #48998

@jbrockmendel
Member

I suppose this was done for making repeated column access faster (although the Series construction overhead for this fast path case also has improved I think)

I think you're right on both fronts. If this isn't necessary anymore, I'd be OK with getting rid of it.

Member

@WillAyd WillAyd left a comment

Seems reasonable

@jorisvandenbossche
Member Author

One of the tricky cases that the tests turned up is how, in the groupby code, we decide whether to exclude a column (for nuisance columns / numeric_only) or include the groups as a column in the output (if as_index=False) when a Series (instead of the column name) is passed as the groupby key.

The code that checks for this case (and to see whether the passed Series actually corresponds to a column in the calling dataframe) is located at:

# if the grouper is obj[name]
def is_in_obj(gpr) -> bool:
    if not hasattr(gpr, "name"):
        return False
    try:
        return gpr is obj[gpr.name]
    except (KeyError, IndexError, InvalidIndexError):
        # IndexError reached in e.g. test_skip_group_keys when we pass
        # lambda here
        # InvalidIndexError raised on key-types inappropriate for index,
        # e.g. DatetimeIndex.get_loc(tuple())
        return False

and so this relies on identity (gpr is obj[gpr.name]), which stops working with the change in this PR to return a new object every time the column is accessed.

To illustrate the behavioural consequence of this (with the main branch, not using the changes in the current PR, but using copy() to trigger grouping by a Series that is not identical):

>>> df = pd.DataFrame({"key": ['a', 'b', 'a'], "col": [1, 2, 3]})
>>> df.groupby(df["key"]).sum()  # the same output as `df.groupby("key").sum()`
     col
key     
a      4
b      2
>>> df.groupby(df["key"], as_index=False).sum()  # the same output as `df.groupby("key", as_index=False).sum()`
  key  col
0   a    4
1   b    2

>>> df.groupby(df["key"].copy()).sum()
FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
     col
key     
a      4
b      2

>>> df.groupby(df["key"].copy(), as_index=False).sum()
FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
   col
0    4
1    2

So the last case actually doesn't include the keys in the output (is this a bug on main?), and the two cases passing a copied Series also trigger the warning (and will raise an error in the future, because the "key" column is not automatically excluded).

@jbrockmendel
Member

Does this problem get easier once numeric_only=None is gone?

@jorisvandenbossche
Member Author

Not really, it would only make it more urgent to fix, since the current warning would actually become an error (so the code example would stop working).

The main alternative to the identity check is doing an actual equality check (if the Series' name matches one of the column names, then check whether that Series' values are equal to the column). I will check how expensive this is compared to the actual groupby operation.
This would also affect the behaviour when you actually pass a Series that is not directly coming out of a dataframe but happens to have the same name and values as one of the columns of that dataframe. But I can't directly think of any actual use cases where this would be a problem.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 4, 2022

I did some timings to explore the cost of checking for equality (when passing a Series with a name that matches a column name), taking the extreme case of a single column to aggregate (so the cost of the check is relatively large compared to the actual grouped reduction operation).

For an integer group key, the cost seems to be relatively small (2-3% of the time that the groupby takes):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": list(range(10))*10000, "col": np.random.randn(100000)})
s1 = df["key"].copy()
s2 = df["key"].copy()

In [38]: %timeit df.groupby("key").mean()
1.38 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [41]: %timeit s1.equals(s2)
42.4 µs ± 405 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But for string group values, the equality check is much more expensive (taking almost as much time as the groupby itself):

df = pd.DataFrame({"key": list("abcdefghij")*10000, "col": np.random.randn(100000)})
s1 = df["key"].copy()
s2 = df["key"].copy()

In [46]: %timeit df.groupby("key").mean()
3.29 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [47]: %timeit s1.equals(s2)
2.94 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The time taken by equals here is almost entirely due to lib.array_equivalent_object. This could be optimized a little for strings (if you were using StringDtype), but only by ~30% based on a quick test.

I can't directly think of a better way of checking or keeping track of whether two Series objects are identical.
We could actually use the current machinery that keeps track of the parent dataframe (Series._cacher, which is currently being used to update the DataFrame's item_cache when we update the Series), and then check whether two Series objects with the same name have the same parent dataframe in their cache. But it would be nice if we could actually get rid of this code with CoW, and I am not sure it is worth keeping it for this performance issue in a corner case of groupby.

In the short term, since this additional equality check is 1) only done when CoW is enabled, 2) only done when you actually provide a Series with a name that is present in the dataframe, and 3) only costly if all values are actually the same (so only if you are actually passing a column that is present in the dataframe, and not a derived column), I think it is fine to use this equality check.
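
As a rough sketch of what such a check could look like (hypothetical helper, not the exact code added in this PR; obj is the calling DataFrame and gpr the passed group key):

def is_column_of(gpr, obj) -> bool:
    if not hasattr(gpr, "name"):
        return False
    try:
        # identity still catches the case where the item cache handed out
        # the same object (non-CoW mode)
        if gpr is obj[gpr.name]:
            return True
        # under CoW every column access returns a new object, so fall back
        # to an equality check; only reached when gpr.name matches a column
        return gpr.equals(obj[gpr.name])
    except (KeyError, IndexError):
        return False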

@jorisvandenbossche
Member Author

Any more comments here? (cc @phofl @mroeschke)

@jbrockmendel
Member

jbrockmendel commented Nov 18, 2022

We could actually use the current machinery that keeps track of the parent dataframe [...] But it would be nice if we could actually get rid of this code

This sounds like it could work but would be really tough to maintain (I already struggle with the _cacher stuff).

Sounds like the perf hit is only in a few corner cases. I think we're OK not stressing about it for now; we can revisit if we get complaints.

I can't directly think of a better way of checking or keeping track that two Series objects are identical.

maybe clever usage of np.shares_memory?

df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0.1, 0.2, 0.3]})
df_orig = df.copy()

s1 = df["a"]
Member

Is it worth testing the other indexing methods that will return a column as a Series (loc/iloc)?

Member Author

Good call, loc seems to use the same caching, so good to cover that in the test as well. Updated.
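
A rough sketch of the kind of assertion being discussed (hypothetical test, not the exact one added in this PR):

import pandas as pd

def test_column_access_returns_new_object_under_cow():
    with pd.option_context("mode.copy_on_write", True):
        df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
        # both plain __getitem__ and .loc column access should return new objects
        assert df["a"] is not df["a"]
        assert df.loc[:, "a"] is not df.loc[:, "a"]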

Member

@mroeschke mroeschke left a comment

Also appears to need a whatsnew note.

@jorisvandenbossche
Member Author

I can't directly think of a better way of checking or keeping track that two Series objects are identical.

maybe clever usage of np.shares_memory?

That's an interesting idea. We would need to look at the actual memory address as well (e.g. ndarray.__array_interface__["data"] from Python), to catch the fact that two slices of the same array (of the same length, with some overlap but starting at a different location in the parent array) share memory but are not identical.
So let's leave it as is for now, but if this turns out to be a perf issue in practice, that could be a potential way to tackle it.
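
As a minimal sketch of that kind of check for plain NumPy-backed columns (the helper name is made up; __array_interface__["data"] gives the start address of the underlying buffer):

import numpy as np

def same_array(a, b) -> bool:
    # only treat two arrays as "the same column" if they start at the same
    # memory address and have the same shape and strides; this rules out
    # overlapping-but-shifted slices that np.shares_memory would accept
    return (
        a.__array_interface__["data"] == b.__array_interface__["data"]
        and a.shape == b.shape
        and a.strides == b.strides
    )

na = np.array([1, 2, 3, 4])
same_array(na[:3], na[:3])   # True
same_array(na[:3], na[1:4])  # False, even though np.shares_memory(na[:3], na[1:4]) is True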

@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Jan 3, 2023
@lithomas1
Member

lithomas1 commented Jan 3, 2023

Thanks for updating this PR. I think it might be worth it to remove item_cache entirely even with the perf hit.

I was just investigating #29411, and it seems like anything that iterates through columns with items() (stuff like round(), memory_usage()) is affected (it seems to double memory usage, but I'll have to double-check with ASV), while with something like sort_values memory usage is flat after garbage collection.

@jorisvandenbossche
Member Author

I think it might be worth it to remove item_cache entirely even with the perf hit.

Personally I would like to keep that out of this PR. This one is specifically for CoW, where this new behaviour is actually required to implement the semantics properly, and doesn't change any default behaviour.

If we want to remove the item_cache entirely, that requires a more elaborate discussion about potential short term impacts of doing this (not only performance but also behavioural, as it is a breaking change, so we would have to see if that's something we want to do for 2.0).

(and to be clear, I am all for getting rid of the item_cache, but a route towards that could also be to just leave the current item_cache as is, and then, when we move towards CoW by default (e.g. pandas 3.0), we automatically get rid of it)

@lithomas1
Member

lithomas1 commented Jan 3, 2023

I think it might be worth it to remove item_cache entirely even with the perf hit.

Personally I would like to keep that out of this PR. This one is specifically for CoW, where this new behaviour is actually required to implement the semantics properly, and doesn't change any default behaviour.

If we want to remove the item_cache entirely, that requires a more elaborate discussion about potential short term impacts of doing this (not only performance but also behavioural, as it is a breaking change, so we would have to see if that's something we want to do for 2.0).

(and to be clear, I am all for getting rid of the item_cache, but a route towards that could also be to just leave the current item_cache as is, and then, when we move towards CoW by default (e.g. pandas 3.0), we automatically get rid of it)

Yeah, agreed. I just decided to post here since there already seemed to be a bit of discussion on item_cache, but I'll open up an issue to discuss removing item_cache entirely. (Also going to look into potential improvements for array_equivalent_object.)

For this PR, though, do you think it would make sense to still clear the item cache with CoW enabled?
(Nevermind, I misunderstood how CoW works)

@jorisvandenbossche
Member Author

For this PR, though, do you think it would make sense to still clear the item cache with CoW enabled?
(Nevermind, I misunderstood how CoW works)

Yeah, so we actually still call self._clear_item_cache() in various places, but because we never update the _item_cache when CoW is enabled, this operation is a no-op (it's clearing an empty dict).

@phofl
Member

phofl commented Jan 7, 2023

Can we check whether a slice over the first element of both arrays shares memory? E.g.

na = np.array([1, 2, 3])
na2 = na[:]
np.shares_memory(na[:1], na2[:1])

This should remove the need to check for different starting points?

Member

@phofl phofl left a comment

Couple of comments

@jorisvandenbossche
Member Author

Can we check if a slice over the first element of both arrays share memory? ... This should remove the need to check for different starting points?

I suppose this is in response to #49450 (comment)? I am not sure how that helps. Your code snippet returns True, but what we actually want is something that returns False in such a case (you can have two Series objects where the underlying array is a different slice of the same parent array, like na[:1] and na2[:1], but while those share memory, they are _not_ equal).
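
To illustrate the concern (overlapping slices of the same parent array share memory but hold different values):

na = np.array([1, 2, 3, 4])
np.shares_memory(na[:3], na[1:4])  # True: the slices overlap in the parent array
np.array_equal(na[:3], na[1:4])    # False: [1, 2, 3] vs [2, 3, 4]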

@jorisvandenbossche
Member Author

@phofl can you have a look at my question above?

@phofl
Member

phofl commented Jan 13, 2023

Sorry, I missed your question.
I can't reproduce that this returns True if the starting point is not the same. Just to be sure that I understand you correctly: we are comparing a column from a DataFrame with our group key, which is a Series, correct?

So if the first entry of the array shares memory and the length is equal, shouldn't they be the same?

na = np.array([1, 2, 3, 4])
np.shares_memory(na[:1], na[1:2])

These are two different starting points in the same array, but that returns False. What am I missing?

@jorisvandenbossche
Member Author

I can't reproduce that this returns True, if the starting point is not the same.

I was just running your snippet, and there the starting point was the same .. and didn't think further ;)

@jorisvandenbossche
Member Author

I will look into optimizing this equality check in a follow-up (it's for a corner case that doesn't affect normal non-CoW mode anyway, and it's a bit more complicated than just using np.shares_memory for extension arrays)

@jorisvandenbossche jorisvandenbossche merged commit b1b56d4 into pandas-dev:main Jan 13, 2023
@jorisvandenbossche jorisvandenbossche deleted the cow-no-item-cache branch January 13, 2023 12:56
@jorisvandenbossche
Member Author

I will look into optimizing this equality check in a follow-up (it's for a corner case that doesn't affect normal non-CoW mode anyway, and it's a bit more complicated than just using np.shares_memory for extension arrays)

-> #50730
