Improve pandas dataframe inspection #319

Merged: 2 commits merged on Sep 5, 2024

Changes from 1 commit
17 changes: 14 additions & 3 deletions src/inspectorscripts.ts
@@ -77,7 +77,9 @@ def _jupyterlab_variableinspector_getsizeof(x):
     elif __torch and isinstance(x, __torch.Tensor):
         return x.element_size() * x.nelement()
     elif __pd and type(x).__name__ == 'DataFrame':
-        return x.memory_usage().sum()
+        # DO NOT CALL df.memory_usage() as this can be very costly
+        # to the point of crashing the kernel
+        return "?"
Collaborator:

Suggested change:

-        return "?"
+        with pd.option_context("mode.copy_on_write", True):
+            return x.memory_usage().sum()

Member Author:
Nope, still the same issue with this.

Running a profiler while inspecting a 10 rows x 500_000 columns DataFrame, I see this takes more than 6 seconds to run. When inspecting a 10 rows x 50_000_000 columns DataFrame, the laptop runs out of memory and crashes.

Without doing any memory_usage computation, inspection takes less than 10 ms.
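
(For reference, a rough standalone sketch, not from this PR, of how such a measurement could be reproduced; the exact timings depend on the machine and pandas version.)

import time

import numpy as np
import pandas as pd

# Very wide DataFrame, mirroring the 10 rows x 500_000 columns case profiled above.
df = pd.DataFrame(np.zeros((10, 500_000)))

start = time.perf_counter()
size = df.memory_usage().sum()  # accounts per column, so it gets slow with many columns
print(f"memory_usage().sum() -> {size} bytes in {time.perf_counter() - start:.2f}s")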

Member Author (@martinRenou), Sep 4, 2024:
Thanks for the suggestion though!

I wonder if we could add a small condition on the shape: if the shape is in the tens of thousands or more, we don't compute the memory usage?

Collaborator:
Ok, what about using a lazy approximation?

Suggested change:

-        return "?"
+        return x.head(1).memory_usage().sum() * len(x)

Member Author (@martinRenou), Sep 4, 2024:
That is still quite slow when there are many columns (500_000 here; it takes 6.8 seconds to compute).

[Screenshot from 2024-09-04 14-42-27]

Note that this means adding 6.8 seconds of delay between each cell execution once the variable is defined.

Collaborator:
Hmm, I guess we could use some kind of cache. It should be feasible to write a function giving a rough estimate with something like x.dtypes.map(size_of).sum() * len(x), where size_of would take the dtype and compute its size or return one from a cache.

Member Author:
Mapping over all the columns with x.dtypes.map will still be quite slow for a DataFrame with lots of columns. Also, invalidating the cache may be hard?

Collaborator:
It should be rather fast. You could do something like:

sum([
    size_of(dtype) * count
    for dtype, count in x.dtypes.value_counts().items()
]) * len(x)

The harder part is implementing size_of
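
(For illustration, one possible shape of such a size_of helper; this is purely hypothetical and not part of the PR, relying on numpy's dtype.itemsize and giving only a rough lower bound for object or extension dtypes.)

from functools import lru_cache

import numpy as np


@lru_cache(maxsize=None)
def size_of(dtype):
    """Approximate per-element size in bytes for a dtype (hypothetical helper)."""
    try:
        return np.dtype(dtype).itemsize
    except TypeError:
        # Object and extension dtypes (strings, categoricals, ...) have no fixed
        # itemsize; fall back to a pointer-sized lower bound.
        return 8


def approx_nbytes(df):
    # Sum per distinct dtype instead of mapping over every column, as suggested above.
    return sum(
        size_of(dtype) * count
        for dtype, count in df.dtypes.value_counts().items()
    ) * len(df)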

Member Author:
Would you be fine if, in this PR, I just add a watchdog that does:

# It seems a large number of rows does not impact performance, only a large number of columns does
if len(x.columns) < 10_000:
    return x.memory_usage().sum()
else:
    return "?"

And we can open a follow-up issue for a faster calculation of the memory usage, pointing to this discussion?

This would at least fix the crashing issue we're seeing on our side.

Collaborator:
Sure, sounds fine!

     else:
         return sys.getsizeof(x)

@@ -136,8 +138,17 @@ def _jupyterlab_variableinspector_getcontentof(x):
             content += f'"{key}": {x[key]}'
         content += ", ...}"
     elif __pd and isinstance(x, __pd.DataFrame):
-        colnames = ', '.join(x.columns.map(str))
-        content = "Columns: %s" % colnames
+        if len(x.columns) <= _jupyterlab_variableinspector_maxitems:
+            colnames = ', '.join(x.columns.map(str))
+            content = "Columns: %s" % colnames
+        else:
+            content = "Columns: "
+            for idx in range(_jupyterlab_variableinspector_maxitems):
+                if idx > 0:
+                    content += ", "
+                content += str(x.columns[idx])
+            content += ", ..."
+        return content
     elif __pd and isinstance(x, __pd.Series):
         content = str(x.values).replace(" ", ", ")[1:-1]
         content = content.replace("\\n", "")
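
(As a quick illustration of the truncation branch added above, a standalone sketch using a local maxitems as a stand-in for _jupyterlab_variableinspector_maxitems.)

import pandas as pd

maxitems = 10  # stand-in for _jupyterlab_variableinspector_maxitems

df = pd.DataFrame({f"col{i}": [0] for i in range(1_000)})

if len(df.columns) <= maxitems:
    content = "Columns: " + ", ".join(df.columns.map(str))
else:
    # Only the first `maxitems` column names are listed, followed by an ellipsis.
    content = "Columns: " + ", ".join(str(c) for c in df.columns[:maxitems]) + ", ..."

print(content)  # prints the first 10 column names, then ", ..."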