PERF: Remove _item_cache #50547

Closed
lithomas1 opened this issue Jan 3, 2023 · 7 comments · Fixed by #56614
Labels: Needs Discussion (requires discussion from core team before further action), Performance (memory or execution speed performance)

Comments

@lithomas1
Member

Discussion copied over from #49450

From the OP of #49450 (which discusses turning on the _item_cache for CoW):

Context:

Currently, we use an item cache for DataFrame columns -> Series. Whenever we access a certain column, we cache the resulting Series in df._item_cache; the next time we access that column, we first check whether it already exists in the cache and, if so, return the cached object directly. I suppose this was done to make repeated column access faster (although the Series construction overhead for this fast-path case has also improved, I think). But it also has behavioral consequences, i.e. Series objects from column access can be identical objects, depending on the context:

>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> s1 = df["a"]
>>> s2 = df["a"]
>>> df["b"] = 10  # set existing column -> clears the item cache
>>> s3 = df["a"]
>>> s1 is s2
True
>>> s1 is s3
False
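
For anyone skimming, the mechanism is roughly the following. This is a minimal sketch of the caching behavior described above, not the actual frame.py implementation; the class and method names are purely illustrative:

# Simplified sketch of the item-cache behavior (illustrative names only):
class CachedFrame:
    def __init__(self, data):
        self._data = data        # column name -> raw values
        self._item_cache = {}    # column name -> boxed column object

    def __getitem__(self, key):
        # Cache hit: hand back the exact same object as last time.
        if key in self._item_cache:
            return self._item_cache[key]
        boxed = list(self._data[key])  # stand-in for Series construction
        self._item_cache[key] = boxed
        return boxed

    def __setitem__(self, key, values):
        self._data[key] = values
        # Setting a column invalidates everything previously handed out,
        # which is why s3 above is a different object than s1/s2.
        self._item_cache.clear()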

This caching can also have other side effects, though. While investigating #29411, I found that methods like memory_usage (and, from a quick glance at frame.py, possibly round and duplicated as well) that iterate through all the columns by calling .items() will actually cause all the columns to be cached in _item_cache, which blows up memory usage.
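
A quick way to observe this, for reference (hedged: it pokes at the private _item_cache attribute, so it only applies to pandas versions that still have the cache):

import pandas as pd

df = pd.DataFrame({f"col{i}": range(10) for i in range(1_000)})
print(len(df._item_cache))  # 0 -- nothing cached yet
df.memory_usage()           # iterates all columns via .items()
print(len(df._item_cache))  # 1000 -- every column is now boxed and cached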

This might be tricky to do, though, as Joris noted, since it would be a behavior change.
We should discuss here how we want to go about it (does it need a deprecation?).

lithomas1 added the Performance and Needs Discussion labels on Jan 3, 2023
@mroeschke
Member

I would be +1 to remove _item_cache.

An intermediate option is to use a functools.lru_cache and have a global parameter controlling the cache size (None = current behavior)?
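
Something along these lines, as a rough sketch of that idea (CACHE_SIZE here stands in for the hypothetical global option; this is not a concrete proposal for the actual internals):

import functools

# Hypothetical global option; None would mean unbounded, i.e. today's behavior.
CACHE_SIZE = 128

class Frame:
    def __init__(self, data):
        self._data = data  # column name -> raw values
        # Per-frame LRU cache over column names: old entries get evicted
        # instead of accumulating for the lifetime of the frame.
        self._cached_box = functools.lru_cache(maxsize=CACHE_SIZE)(self._box_column)

    def _box_column(self, key):
        return list(self._data[key])  # stand-in for Series construction

    def __getitem__(self, key):
        return self._cached_box(key)

    def __setitem__(self, key, values):
        self._data[key] = values
        self._cached_box.cache_clear()  # same invalidation as today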

@jorisvandenbossche
Member

What would be the benefit / goal of using functools.lru_cache instead?

I found that methods ... that iterate through all the columns by calling .items() will actually cause all the columns to be cached in _item_cache, which blows up memory usage.

Specifically for this aspect: those columns will normally all be views on the dataframe data, so I don't fully understand how this would blow up memory usage.

@mroeschke
Member

By allowing users to set a cache size for an LRU _item_cache, we could at least bound (or nearly disable) how much is cached.

@lithomas1
Member Author

I found that methods ... that iterate through all the columns by calling .items() will actually cause all the columns to be cached in _item_cache, which blows up memory usage.

Specifically for this aspect: those columns will normally all be views on the dataframe data, so I don't fully understand how this would blow up memory usage.

Hm, looking at psutil, it probably doesn't double the memory usage.
(In issue #29411, usage only seems to double after a bunch of frames are created. That code uses psutil, though, which measures process memory, including things like the memory the interpreter itself is using, so the before/after numbers might not be accurate.)

I'm not sure there are any good alternatives, though. I think DataFrame.memory_usage only includes column and (optionally) Index memory usage.
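
For reference, that is indeed what it reports today: per-column byte counts, plus the index by default:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df.memory_usage())             # Index, a, b -> byte counts
print(df.memory_usage(index=False))  # columns only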

Any ideas?

@jorisvandenbossche
Member

I looked at the example in #29411 and commented over there: I think the increased memory usage mostly comes from the Series creation overhead, and because the example uses a very wide dataframe (50,000 columns of 10 rows each), this Series creation overhead becomes significant (for long dataframes, you won't notice this).
Of course, it is still the _item_cache that keeps those Series objects (unnecessarily) alive. But specifically for memory_usage, I think we can also avoid calling .items() altogether in the first place.
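
A rough way to see that overhead (treat the numbers as illustrative only; they vary by pandas version and platform):

import tracemalloc
import numpy as np
import pandas as pd

data = np.arange(10, dtype="int64")  # 80 bytes of actual payload
tracemalloc.start()
boxed = [pd.Series(data) for _ in range(10_000)]
current, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()
# The per-Series cost is typically hundreds of bytes -- far more than
# the 80 bytes of data -- which is what adds up across 50,000 columns.
print(current / len(boxed))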

@lithomas1
Member Author

Thanks a bunch for the help there.

Avoiding .items() sounds like a good fix.
I'll probably still give this a go, though, to see the perf impact without the item cache.

@jbrockmendel
Member

Agreed, item_cache isn't worth the trouble.

Technically, getting rid of it will be an API change, so let's do it now for 2.0.
