ENH: make is_list_like handle non iterable numpy-like arrays correctly #35127

znicholls · 2020-07-05T05:27:10Z

closes ENH: Identify numpy-like zero-dimensional arrays as non-iterable #35131
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2020-07-05T05:27:13Z

Hello @znicholls! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-13 08:43:11 UTC

pandas/_libs/lib.pyx

jbrockmendel · 2020-07-05T23:47:32Z

This is likely to have a non-trivial effect on performance. It might be time to implement a separate is_list_like (and possibly is_scalar) with less-strict/EA-customizable behavior (cc @jorisvandenbossche I know this has come up w/r/t geopandas)

Can you identify a subset of is_list_like usages where we need the less-strict behavior?

znicholls · 2020-07-06T00:10:57Z

Can you identify a subset of is_list_like usages where we need the less-strict behavior?

I can check with the other pint-pandas developers.

Can I also ask which benchmarks I should look at to examine whether performance has taken a hit?

keewis · 2020-07-06T00:13:26Z

I'm not completely sure why, but reverting here for simplicity

yeah, sorry, getattr only checks for AttributeError but pint raises TypeError

znicholls · 2020-07-06T00:16:15Z

yeah, sorry, getattr only checks for AttributeError but pint raises TypeError

I think it would have to be getattr(obj, "ndim", 1) != 0 i.e. you assume everything is iterable unless you have evidence otherwise.

znicholls · 2020-07-06T00:56:07Z

I think the test failures are unrelated to our change?

# CI / Web and docs failure text (I think this is the relevant bit)
...
WARNING: autodoc: failed to import method 'sparse.from_spmatrix' from module 'DataFrame'; the following exception was raised:
No module named 'DataFrame'
WARNING: autodoc: failed to import method 'sparse.to_coo' from module 'DataFrame'; the following exception was raised:
No module named 'DataFrame'
WARNING: autodoc: failed to import method 'sparse.to_dense' from module 'DataFrame'; the following exception was raised:
No module named 'DataFrame'
WARNING: autodoc: failed to import method 'sparse.from_coo' from module 'Series'; the following exception was raised:
No module named 'Series'
WARNING: autodoc: failed to import method 'sparse.to_coo' from module 'Series'; the following exception was raised:
No module named 'Series'
...

jorisvandenbossche · 2020-07-06T07:38:53Z

This is likely to have a non-trivial effect on performance.

To have an idea about this, I timed the following cases:

a1 = [1, 2, 3]
a2 = np.array([1, 2, 3])
a3 = np.array([1, 2, 3])[0]

In [2]: %timeit pd.api.types.is_list_like(a1)      
390 ns ± 1.56 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # master  
553 ns ± 1.77 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # PR

In [3]: %timeit pd.api.types.is_list_like(a2) 
405 ns ± 0.972 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # master
408 ns ± 0.745 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # PR

In [4]: %timeit pd.api.types.is_list_like(a3)  
395 ns ± 1.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # master
411 ns ± 1.55 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)    PR

So for numpy arrays the slowdown is only very limited (I assume because in this case we needed to look up obj.ndim anyway, so then then getattr(obj, "ndim") seems to be similar). For actual lists, the slowdown is more significant, though (but I would think the array cases are most relevant for performance).

jorisvandenbossche · 2020-07-06T07:46:01Z

It might be time to implement a separate is_list_like (and possibly is_scalar) with less-strict/EA-customizable behavior

It's not only EAs, though, but eg also for object dtype (see #26333, #27911)

But indeed, a less strict is_list_like check depending on the dtype of the involved data might be worth exploring, although not that easy I think. The first thing would be to identify the cases where this less strict case is actually important. I suppose the problems in pint-pandas mainly occur in indexing + arithmetic/comparison operations ?

andrewgsavage · 2020-07-06T07:51:35Z

These were the tests which started passing when deleting the __iter__ method from the scalar class to make is_list_like return False:

TestGroupby::test_groupby_apply_identity
TestMethods::test_fillna_copy_frame
TestMethods::test_fillna_copy_series
TestMissing::test_fillna_scalar
TestMissing::test_fillna_series
TestMissing::test_fillna_frame
TestReshaping::test_unstack[series-index1]
TestReshaping::test_unstack[series-index3]
TestReshaping::test_unstack[frame-index1]
TestReshaping::test_unstack[frame-index3]
TestSetitem::test_setitem_mask_broadcast[loc]
TestSetitem::test_setitem_sequence_broadcasts[True]
TestSetitem::test_setitem_integer_array[True-list]
TestSetitem::test_setitem_integer_array[True-integer-array]
TestSetitem::test_setitem_integer_array[True-numpy-array]
TestSetitem::test_setitem_slice[True]
TestSetitem::test_setitem_loc_iloc_slice

znicholls · 2020-07-07T00:30:47Z

It looks like using numpy.iterable speeds up the list and array cases but not the numpy scalar one (which I think makes sense as you have to do some more thinking on that one). (This comment is pending tests passing...)

import numpy as np
import pandas as pd

a1 = [1, 2, 3]
a2 = np.array([1, 2, 3])
a3 = np.array([1, 2, 3])[0]


%timeit pd.api.types.is_list_like(a1)
289 ns ± 16.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
447 ns ± 7.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a2)
302 ns ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
466 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a3)
596 ns ± 2.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
440 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

znicholls · 2020-07-07T01:06:53Z

Can you identify a subset of is_list_like usages where we need the less-strict behavior?

Having dug a bit deeper, we're going to need to back to the drawing board on this. My impression is that in order to be handled by is_list_like, is_scalar and item_from_zerodim correctly, the class has to basically be a numpy array (or sub-class thereof). I'll go back to the pint-pandas devs and we can discuss how we go forward there, including identifying the cases where a less strict is_list_like, is_scalar and item_from_zerodim would solve our problems.

I'm happy to close this in the meantime. However it would be great if we could reopen #22582 to at least add pint-pandas to the docs (buggy as it is).

hgrecco · 2020-07-07T03:34:28Z

I would be nice to talk to geopandas devs to see if we can join forces. We might be facing the same problem.

jorisvandenbossche · 2020-07-07T14:12:56Z

@hgrecco you are alread talking to them ;) (well, at least, I am one of the geopandas devs)
See also my comment at #35131 (comment): although very much related, I think it's not exactly the same. As for GeoPandas, the scalar elements we put in a Series are actually iterable (in contrast to the (duck-typed) numpy scalars, which are iterable in theory (have the dunder method), but raise once you try to iterate).

jorisvandenbossche · 2020-07-07T14:16:31Z

It looks like using numpy.iterable speeds up the list and array cases

It seems that np.iterable is actually just a try/except of iter(value):

    try:
        iter(y)
    except TypeError:
        return False
    return True

(if we decide to use this, I would prefer doing the code above explicitly instead of using np.iterable, as I think most people are not familiar with that numpy method, and the code is simple).

But so basically it is a choosing between isinstance(value, Iterable) + special case check to exclude ndim==0 vs try/except of iter(value) (not fully sure if there are other consequences of the choice between both)

hgrecco · 2020-07-07T14:27:55Z

@jorisvandenbossche thanks for the clarification.

@znicholls Have you tried reinstating getattr(obj, "ndim", 1) != 0 and moving it first? Python's and/or are shortcircuit operators so we might want to put first those checks which are fast and likely to return false.

znicholls · 2020-07-07T22:57:03Z

Reverting the changes as suggested by @hgrecco leads to a slowdown for lists, no change for numpy arrays and a speedup for numpy scalars (essentially the reverse of previously)

import numpy as np
import pandas as pd

a1 = [1, 2, 3]
a2 = np.array([1, 2, 3])
a3 = np.array([1, 2, 3])[0]


%timeit pd.api.types.is_list_like(a1)
640 ns ± 6.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
447 ns ± 7.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a2)
414 ns ± 17.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
466 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a3)
385 ns ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) #PR
440 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

znicholls · 2020-07-07T22:57:56Z

We still have the is_scalar and item_from_zerodim issues of course...

znicholls · 2020-07-08T03:34:37Z

@jbrockmendel I went through and identified where different is_list_like and is_scalar would be needed, see master...znicholls:handle-extension-arrays. Is that helpful?

znicholls · 2021-01-10T22:32:41Z

The backstory is #35131. Hopefully that explains why we are looking to make such a change? (@jorisvandenbossche just double checking no progress has been made elsewhere that I've missed?)

A concern about performance impacts was raised in this comment #35127 (comment). After some discussion, we arrived at #35127 (comment).

For completeness, we also did some checks about performance penalties, e.g. #35127 (comment), #35127 (comment) and #35127 (comment). None of our solutions was perfect.

I am not sure what your proposed implementation even looks like.

That makes sense, because I am also unsure. I've only got to step 1 of #35127 (comment). Based on the 'hot spots' identified in master...znicholls:handle-extension-arrays, I am looking for advice about whether we could use less strict (potentially slower) implementations of is_list_like in these places.

If we can use less strict implementations, my only proposed implementation is the changes made in this PR i.e. master...znicholls:non-iterable-duck-arrays. In short, the less strict implementation would check for objects which are numpy-style zero-dimensional arrays with not (hasattr(obj, "ndim") and obj.ndim == 0) rather than not (util.is_array(obj) and obj.ndim == 0) because the latter only works for numpy arrays and doesn't work for numpy-style arrays (such as Pint).

jreback · 2021-02-11T00:39:00Z

@znicholls what is the status here

znicholls · 2021-02-11T01:37:36Z

what is the status here

I would greatly appreciate your (or another pandas developer) thoughts on #35127 (comment). I'm not sure I can sum it up any more succinctly than what is in that comment.

jreback · 2021-02-11T01:45:32Z

If we can use less strict implementations, my only proposed implementation is the changes made in this PR i.e. master...znicholls:non-iterable-duck-arrays. In short, the less strict implementation would check for objects which are numpy-style zero-dimensional arrays with not (hasattr(obj, "ndim") and obj.ndim == 0) rather than not (util.is_array(obj) and obj.ndim == 0) because the latter only works for numpy arrays and doesn't work for numpy-style arrays (such as Pint).

i think this is reasonable. of course asv's will tell us whether its an acceptable perf hit. trying to balance additional complexity (seems minor in this case) with perf vs enhancements.

znicholls · 2021-02-11T01:48:36Z

i think this is reasonable. of course asv's will tell us whether its an acceptable perf hit. trying to balance additional complexity (seems minor in this case) with perf vs enhancements.

Ok great thank you, I will start on that path then.

znicholls · 2021-02-13T08:20:09Z

@jreback I have tried to start small with #39790, if you could take a look that would be great

Now pd.core.dtypes.inference.is_list_like correctly identifies numpy-like scalars as not being iterable

Co-authored-by: keewis <[email protected]>

I'm not completely sure why, but reverting here for simplicity

Also avoid np.iterable

mroeschke · 2021-07-11T23:27:17Z

Thanks for the PR draft, but it appears this draft has stalled in development. Probably best if this change starts in a smaller PR with continuation of #39790 (let us know if you'd like to continue in that PR. Closing.

burnpanck · 2021-11-25T22:34:53Z

What exactly is it that is missing here? There seems to be quite a number of issues in many downstream projects, all stalling on #35131. I myself would love to see pint-pandas move forward and would be willing to picking up this PR.

If I understand correctly, the main concern with the proposed solution here is the slight performance regression. To me, that then asks for performance re-optimisation rather than introducing two different versions of is_list_like that do slightly different things - the latter is calling for all kinds of subtle bugs. The size of the diff in #39830 (2500 changed lines) with the solution presented here (150 changed lines) also speaks for this approach here.

As for the performance re-optimisation: It seems there is no regression on actual numpy arrays, only on real lists. Under the assumption that real lists are the only relevant non-array types, we might be able to get by with a simple short-cut test for actual lists. Would an iteration on this PR along these lines be appreciated?

jreback · 2021-11-25T22:55:58Z

@burnpanck like anything in pandas a proper well formed and tested PR would be considered

znicholls · 2021-11-26T00:54:47Z

The size of the diff in #39830 (2500 changed lines) with the solution presented here (150 changed lines) also speaks for this approach here.

I just rebased #39830 so you can see the real diff. It's only 59 lines and only 1 in the actual codebase (the rest are all tests). https://github.com/pandas-dev/pandas/compare/master...znicholls:update-is-list-like?expand=1

Thanks for picking up @burnpanck, as you can tell I had a few goes but always hit a wall at some point

znicholls changed the title ~~Non iterable duck arrays~~ is_list_like correctly handles non iterable duck arrays Jul 5, 2020

This was referenced Jul 5, 2020

Shall we create a QuantityArray class? hgrecco/pint#1128

Open

ENH: Identify numpy-like zero-dimensional arrays as non-iterable #35131

Closed

znicholls changed the title ~~is_list_like correctly handles non iterable duck arrays~~ is_list_like correctly handles non iterable numpy-like arrays Jul 5, 2020

znicholls changed the title ~~is_list_like correctly handles non iterable numpy-like arrays~~ make is_list_like handle non iterable numpy-like arrays correctly Jul 5, 2020

znicholls force-pushed the non-iterable-duck-arrays branch 2 times, most recently from c5d9c17 to ca94b7d Compare July 5, 2020 23:11

znicholls marked this pull request as ready for review July 5, 2020 23:11

keewis reviewed Jul 5, 2020

View reviewed changes

pandas/_libs/lib.pyx Outdated Show resolved Hide resolved

znicholls changed the title ~~make is_list_like handle non iterable numpy-like arrays correctly~~ ENH: make is_list_like handle non iterable numpy-like arrays correctly Jul 6, 2020

znicholls force-pushed the non-iterable-duck-arrays branch from e7319e7 to f527f3c Compare July 7, 2020 22:57

znicholls marked this pull request as draft July 8, 2020 03:34

Add failing test

0ecb8c2

znicholls and others added 11 commits February 13, 2021 19:38

Add __array__ method to mock numpy-like

ad5f714

TST: GH35131 Add failing test of numpy-like array handling

139493f

ENH: GH35131 Implement fix which allows numpy-like handling

ba59fb8

Now pd.core.dtypes.inference.is_list_like correctly identifies numpy-like scalars as not being iterable

Simplify ndim check

451fbfc

Co-authored-by: keewis <[email protected]>

Revert change because it broke tests

57e186b

I'm not completely sure why, but reverting here for simplicity

Use slightly clearer logic

590c2a2

Update to use numpy iterable

27bd4e4

Add failing is_scalar tests

d9a2c8f

Revert to relying on python's shortcircuit operators

764e7b1

Also avoid np.iterable

Make a mess

088dff8

Fix missing module

928842f

znicholls force-pushed the non-iterable-duck-arrays branch from f9cf44b to 928842f Compare February 13, 2021 08:43

znicholls mentioned this pull request Feb 16, 2021

ENH: Hand numpy-like arrays with is_list_like #39830

Closed

4 tasks

mroeschke closed this Jul 11, 2021

dcoukos mentioned this pull request Aug 24, 2021

ENH: Change implementation of pandas.api.types.is_list_like to check for iterability in more pythonic way #43196

Closed

znicholls mentioned this pull request Aug 31, 2021

ENH: unit of measurement / physical quantities #10349

Closed

znicholls mentioned this pull request Nov 23, 2021

RTFM and could not make pint[Mt CO2] work as a unit openscm/openscm-units#32

Closed

burnpanck mentioned this pull request Nov 26, 2021

ENH: [Draft] Fix issue #35131 Identify zero-dimensional duck arrays as non-iterable #44626

Merged

4 tasks

znicholls mentioned this pull request May 6, 2023

Removing pint-pandas dependency openscm/scmdata#234

Open

Uh oh!

ENH: make is_list_like handle non iterable numpy-like arrays correctly #35127

ENH: make is_list_like handle non iterable numpy-like arrays correctly #35127

Uh oh!

Conversation

znicholls commented Jul 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pep8speaks commented Jul 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2021-02-13 08:43:11 UTC

Uh oh!

Uh oh!

jbrockmendel commented Jul 5, 2020

Uh oh!

znicholls commented Jul 6, 2020

Uh oh!

keewis commented Jul 6, 2020

Uh oh!

znicholls commented Jul 6, 2020

Uh oh!

znicholls commented Jul 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jorisvandenbossche commented Jul 6, 2020

Uh oh!

jorisvandenbossche commented Jul 6, 2020

Uh oh!

andrewgsavage commented Jul 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

znicholls commented Jul 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

znicholls commented Jul 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hgrecco commented Jul 7, 2020

Uh oh!

jorisvandenbossche commented Jul 7, 2020

Uh oh!

jorisvandenbossche commented Jul 7, 2020

Uh oh!

hgrecco commented Jul 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

znicholls commented Jul 7, 2020

Uh oh!

znicholls commented Jul 7, 2020

Uh oh!

znicholls commented Jul 8, 2020

Uh oh!

znicholls commented Jan 10, 2021

Uh oh!

jreback commented Feb 11, 2021

Uh oh!

znicholls commented Feb 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Feb 11, 2021

Uh oh!

znicholls commented Feb 11, 2021

Uh oh!

znicholls commented Feb 13, 2021

Uh oh!

mroeschke commented Jul 11, 2021

Uh oh!

burnpanck commented Nov 25, 2021

Uh oh!

jreback commented Nov 25, 2021

Uh oh!

znicholls commented Nov 26, 2021

Uh oh!

Uh oh!

znicholls commented Jul 5, 2020 •

edited

Loading

pep8speaks commented Jul 5, 2020 •

edited

Loading

znicholls commented Jul 6, 2020 •

edited

Loading

andrewgsavage commented Jul 6, 2020 •

edited

Loading

znicholls commented Jul 7, 2020 •

edited

Loading

znicholls commented Jul 7, 2020 •

edited

Loading

hgrecco commented Jul 7, 2020 •

edited

Loading

znicholls commented Feb 11, 2021 •

edited

Loading