PERF: Remove _item_cache #50547

Closed
lithomas1 opened this issue Jan 3, 2023 · 7 comments · Fixed by #56614
Labels: Needs Discussion (requires discussion from core team before further action), Performance (memory or execution speed performance)

Comments

@lithomas1
Member

Discussion copied over from #49450

From the OP of #49450 (which discusses turning on the _item_cache for CoW):

Context:

Currently, we use an item cache for DataFrame columns -> Series. Whenever we access a certain column, we cache the resulting Series in df._item_cache; the next time we access that column, we first check whether it already exists in the cache and, if so, return the cached object directly. I suppose this was done to make repeated column access faster (although the Series construction overhead for this fast-path case has also improved, I think). But it also has behavioral consequences, i.e. Series objects from column access can be identical objects, depending on the context:

>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> s1 = df["a"]
>>> s2 = df["a"]
>>> df["b"] = 10  # set existing column -> clears the item cache
>>> s3 = df["a"]
>>> s1 is s2
True
>>> s1 is s3
False
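
For anyone skimming, the mechanism is roughly the following. This is a minimal sketch of the caching behavior described above, not the actual frame.py implementation; the class and method names are purely illustrative:

# Simplified sketch of the item-cache behavior (illustrative names only):
class CachedFrame:
    def __init__(self, data):
        self._data = data        # column name -> raw values
        self._item_cache = {}    # column name -> boxed column object

    def __getitem__(self, key):
        # Cache hit: hand back the exact same object as last time.
        if key in self._item_cache:
            return self._item_cache[key]
        boxed = list(self._data[key])  # stand-in for Series construction
        self._item_cache[key] = boxed
        return boxed

    def __setitem__(self, key, values):
        self._data[key] = values
        # Setting a column invalidates everything previously handed out,
        # which is why s3 above is a different object than s1/s2.
        self._item_cache.clear()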

This caching can also have other side effects, though. While investigating #29411, I found that methods like memory_usage (and, from a quick glance at frame.py, possibly round and duplicated as well) that iterate through all the columns by calling .items() will actually cause all the columns to be cached in _item_cache, which blows up memory usage.
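
A quick way to observe this, for reference (hedged: it pokes at the private _item_cache attribute, so it only applies to pandas versions that still have the cache):

import pandas as pd

df = pd.DataFrame({f"col{i}": range(10) for i in range(1_000)})
print(len(df._item_cache))  # 0 -- nothing cached yet
df.memory_usage()           # iterates all columns via .items()
print(len(df._item_cache))  # 1000 -- every column is now boxed and cached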

This might be tricky to do, though, as Joris noted, since it would be a behavior change.
We should discuss here how we want to go about it (does it need a deprecation?).

lithomas1 added the Performance and Needs Discussion labels on Jan 3, 2023
@mroeschke
Member

I would be +1 to remove _item_cache.

An intermediate option is to use a functools.lru_cache and have a global parameter controlling the cache size (None = current behavior)?
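
Something along these lines, as a rough sketch of that idea (CACHE_SIZE here stands in for the hypothetical global option; this is not a concrete proposal for the actual internals):

import functools

# Hypothetical global option; None would mean unbounded, i.e. today's behavior.
CACHE_SIZE = 128

class Frame:
    def __init__(self, data):
        self._data = data  # column name -> raw values
        # Per-frame LRU cache over column names: old entries get evicted
        # instead of accumulating for the lifetime of the frame.
        self._cached_box = functools.lru_cache(maxsize=CACHE_SIZE)(self._box_column)

    def _box_column(self, key):
        return list(self._data[key])  # stand-in for Series construction

    def __getitem__(self, key):
        return self._cached_box(key)

    def __setitem__(self, key, values):
        self._data[key] = values
        self._cached_box.cache_clear()  # same invalidation as today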

@jorisvandenbossche
Member

What would be the benefit / goal of using functools.lru_cache instead?

I found that methods ... that iterate through all the columns by calling .items() will actually cause all the columns to be cached in _item_cache, which blows up memory usage.

Specifically for this aspect: those columns will normally all be views on the dataframe data, so I don't fully understand how this would blow up memory usage.

@mroeschke
Member

By allowing users to set a cache size for an LRU _item_cache, we could at least bound (or nearly disable) how much is cached.

@lithomas1
Member Author

I found that methods ... that iterate through all the columns by calling .items() will actually cause all the columns to be cached in _item_cache, which blows up memory usage.

Specifically for this aspect: those columns will normally all be views on the dataframe data, so I don't fully understand how this would blow up memory usage.

Hm, looking at psutil, it probably doesn't double the memory usage.
(In issue #29411, usage only seems to double after a bunch of frames are created. That code uses psutil, though, which measures process memory, including things like the memory the interpreter itself is using, so the before/after numbers might not be accurate.)

I'm not sure there are any good alternatives, though. I think DataFrame.memory_usage only includes column and (optionally) Index memory usage.
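
For reference, that is indeed what it reports today: per-column byte counts, plus the index by default:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df.memory_usage())             # Index, a, b -> byte counts
print(df.memory_usage(index=False))  # columns only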

Any ideas?

@jorisvandenbossche
Member

I looked at the example in #29411 and commented over there: I think the increased memory usage mostly comes from the Series creation overhead, and because the example uses a very wide dataframe (50,000 columns of 10 rows each), this Series creation overhead becomes significant (for long dataframes, you won't notice this).
Of course, it is still the _item_cache that keeps those Series objects (unnecessarily) alive. But specifically for memory_usage, I think we can also avoid calling .items() altogether in the first place.
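
A rough way to see that overhead (treat the numbers as illustrative only; they vary by pandas version and platform):

import tracemalloc
import numpy as np
import pandas as pd

data = np.arange(10, dtype="int64")  # 80 bytes of actual payload
tracemalloc.start()
boxed = [pd.Series(data) for _ in range(10_000)]
current, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()
# The per-Series cost is typically hundreds of bytes -- far more than
# the 80 bytes of data -- which is what adds up across 50,000 columns.
print(current / len(boxed))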

@lithomas1
Member Author

Thanks a bunch for the help there.

Avoiding .items() sounds like a good fix.
I'll probably still give this a go, though, to see the perf impact without the item cache.

@jbrockmendel
Member

Agreed, item_cache isn't worth the trouble.

Technically, getting rid of it will be an API change, so let's do it now for 2.0.
