-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Use find_stack_level in pandas.core #44358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending green(ish)
hmm can you rebase again. nothing should be failing except the 3.10 sdist test (or there is an issue) |
very nice @rhshadrach |
FYI, this had a ~10-20% slowdown on a decent number of benchmarks (e.g. https://pandas.pydata.org/speed/pandas/#groupby.Categories.time_groupby_nosort?python=3.8&Cython=0.29.24&commits=01b86edb). Full list at https://pandas.pydata.org/speed/pandas/#regressions?sort=1&dir=desc. Is that a worthwhile tradeoff? Also https://pandas.pydata.org/speed/pandas/#index_cached_properties.IndexCache.time_is_all_dates?python=3.8&Cython=0.29.24&p-index_type='RangeIndex'&commits=01b86edb is a lot larger, but I don't know if that's realted. |
Most likely not - I'll do some benchmarking and see where this is coming from. If it's really widespread, might need to revert. I should be able to check this out in the next few days. |
@TomAugspurger - Is there a way to get the table from the full list as tabular data? |
After digging into this a bit, it does seem worth it to incur the slow down in at least a lot of these benchmarks. It takes about 3ms on my machine to call find_stack_trace. In a benchmark that runs in 10ms like the first example posted above, this is regarded as significant. However, in this particular instance, find_stack_level is called only once. I think that 3ms overhead is worth it to get a better error message to the user and to lessen maintenance (moving code around, calling from multiple places, etc). @TomAugspurger @jreback - Is there a consensus that if find_stack_level is called O(1) times (with a small constant), then this is worth the overhead? What would be concerning is if there is code where a warning is generated multiple times, making the overhead significantly greater. Indeed this is happening in at least two cases, and it should definitely be reverted. To track this, I've run the test suite while outputting (to a file) any test that hit
There can be lots of false positives here, e.g. a for loop in the test. Digging into the top offenders, below are the tests and on the right are comments from investigating the instance:
Called many times means there is e.g. a for loop in the test itself, giving rise to false positives.The two cases I mentioned above that should be reverted are related to MultiIndexing and DataFrame.describe. |
Yeah I think that's a worthwhile tradeoff. |
Thanks @TomAugspurger. I've gone through the list of regressions associated with this PR. All regressions are either (a) less than 4ms or (b) call find_stack_level once (a few of these are timed just slightly greater than 4ms) except for:
All three of these no longer call find_stack_level after doing the reverts mentioned in #44358 (comment) |
Part of #44347
This handles most of the stacklevels in core. A few more will take a bit more effort, separating out into a separate PR. After all stacklevels are replaced, will be removing
check_stacklevel=False
from the tests.