-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: make is_list_like handle non iterable numpy-like arrays correctly #35127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @znicholls! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-02-13 08:43:11 UTC |
c5d9c17
to
ca94b7d
Compare
This is likely to have a non-trivial effect on performance. It might be time to implement a separate is_list_like (and possibly is_scalar) with less-strict/EA-customizable behavior (cc @jorisvandenbossche I know this has come up w/r/t geopandas) Can you identify a subset of is_list_like usages where we need the less-strict behavior? |
I can check with the other pint-pandas developers. Can I also ask which benchmarks I should look at to examine whether performance has taken a hit? |
yeah, sorry, |
I think it would have to be |
I think the test failures are unrelated to our change?
|
To have an idea about this, I timed the following cases:
So for numpy arrays the slowdown is only very limited (I assume because in this case we needed to look up |
It's not only EAs, though, but eg also for object dtype (see #26333, #27911) But indeed, a less strict |
These were the tests which started passing when deleting the
|
It looks like using numpy.iterable speeds up the list and array cases but not the numpy scalar one (which I think makes sense as you have to do some more thinking on that one). (This comment is pending tests passing...) import numpy as np
import pandas as pd
a1 = [1, 2, 3]
a2 = np.array([1, 2, 3])
a3 = np.array([1, 2, 3])[0]
%timeit pd.api.types.is_list_like(a1)
289 ns ± 16.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
447 ns ± 7.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master
%timeit pd.api.types.is_list_like(a2)
302 ns ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
466 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master
%timeit pd.api.types.is_list_like(a3)
596 ns ± 2.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
440 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master |
Having dug a bit deeper, we're going to need to back to the drawing board on this. My impression is that in order to be handled by I'm happy to close this in the meantime. However it would be great if we could reopen #22582 to at least add pint-pandas to the docs (buggy as it is). |
I would be nice to talk to geopandas devs to see if we can join forces. We might be facing the same problem. |
@hgrecco you are alread talking to them ;) (well, at least, I am one of the geopandas devs) |
It seems that
(if we decide to use this, I would prefer doing the code above explicitly instead of using But so basically it is a choosing between |
@jorisvandenbossche thanks for the clarification. @znicholls Have you tried reinstating |
Reverting the changes as suggested by @hgrecco leads to a slowdown for lists, no change for numpy arrays and a speedup for numpy scalars (essentially the reverse of previously) import numpy as np
import pandas as pd
a1 = [1, 2, 3]
a2 = np.array([1, 2, 3])
a3 = np.array([1, 2, 3])[0]
%timeit pd.api.types.is_list_like(a1)
640 ns ± 6.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
447 ns ± 7.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master
%timeit pd.api.types.is_list_like(a2)
414 ns ± 17.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
466 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master
%timeit pd.api.types.is_list_like(a3)
385 ns ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) #PR
440 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master |
e7319e7
to
f527f3c
Compare
We still have the |
@jbrockmendel I went through and identified where different |
The backstory is #35131. Hopefully that explains why we are looking to make such a change? (@jorisvandenbossche just double checking no progress has been made elsewhere that I've missed?) A concern about performance impacts was raised in this comment #35127 (comment). After some discussion, we arrived at #35127 (comment). For completeness, we also did some checks about performance penalties, e.g. #35127 (comment), #35127 (comment) and #35127 (comment). None of our solutions was perfect.
That makes sense, because I am also unsure. I've only got to step 1 of #35127 (comment). Based on the 'hot spots' identified in master...znicholls:handle-extension-arrays, I am looking for advice about whether we could use less strict (potentially slower) implementations of If we can use less strict implementations, my only proposed implementation is the changes made in this PR i.e. master...znicholls:non-iterable-duck-arrays. In short, the less strict implementation would check for objects which are numpy-style zero-dimensional arrays with |
@znicholls what is the status here |
I would greatly appreciate your (or another pandas developer) thoughts on #35127 (comment). I'm not sure I can sum it up any more succinctly than what is in that comment. |
i think this is reasonable. of course asv's will tell us whether its an acceptable perf hit. trying to balance additional complexity (seems minor in this case) with perf vs enhancements. |
Ok great thank you, I will start on that path then. |
Now pd.core.dtypes.inference.is_list_like correctly identifies numpy-like scalars as not being iterable
Co-authored-by: keewis <[email protected]>
I'm not completely sure why, but reverting here for simplicity
Also avoid np.iterable
f9cf44b
to
928842f
Compare
Thanks for the PR draft, but it appears this draft has stalled in development. Probably best if this change starts in a smaller PR with continuation of #39790 (let us know if you'd like to continue in that PR. Closing. |
What exactly is it that is missing here? There seems to be quite a number of issues in many downstream projects, all stalling on #35131. I myself would love to see pint-pandas move forward and would be willing to picking up this PR. If I understand correctly, the main concern with the proposed solution here is the slight performance regression. To me, that then asks for performance re-optimisation rather than introducing two different versions of As for the performance re-optimisation: It seems there is no regression on actual numpy arrays, only on real lists. Under the assumption that real lists are the only relevant non-array types, we might be able to get by with a simple short-cut test for actual lists. Would an iteration on this PR along these lines be appreciated? |
@burnpanck like anything in pandas a proper well formed and tested PR would be considered |
I just rebased #39830 so you can see the real diff. It's only 59 lines and only 1 in the actual codebase (the rest are all tests). https://github.com/pandas-dev/pandas/compare/master...znicholls:update-is-list-like?expand=1 Thanks for picking up @burnpanck, as you can tell I had a few goes but always hit a wall at some point |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff