Skip to content

BUG: Bug in loc raised ValueError when setting value via boolean list #37761

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
May 25, 2021

Conversation

phofl
Copy link
Member

@phofl phofl commented Nov 11, 2020

Not quite sure if conversion to numpy array is necessary, but is_dtype_bool does not accept lists

@phofl phofl added the Indexing Related to indexing on series/frames, not to indexes themselves label Nov 11, 2020
@@ -602,7 +603,7 @@ def _get_setitem_indexer(self, key):
"""
Convert a potentially-label-based key into a positional indexer.
"""
if self.name == "loc":
if self.name == "loc" or is_bool_dtype(np.array(key)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it only lists we want here, or would com.is_bool_indexer(key) work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is better, changed it :)

@@ -1955,6 +1955,13 @@ def test_loc_setitem_dt64tz_values(self):
result = s2["a"]
assert result == expected

def test_loc_setitem_boolean_list(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets put this in TestLocBooleanMask

can this be shared by DataFrame?

few words after the GH ref to the effect of "specifically list key, not arraylike"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added a DataFrame part to the test, parametrization would be quite ugly here

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also the original issue wa about iloc, can you test that as well (not sure if this is a different bug then)

@@ -1644,6 +1644,18 @@ def test_loc_setitem_mask_td64_series_value(self):
assert expected == result
tm.assert_frame_equal(df, df_copy)

def test_loc_setitem_boolean_list(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize the rhs by array as well as list

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh i see the comment, but still think can parameterize.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@phofl
Copy link
Member Author

phofl commented Nov 13, 2020

cc @jreback Thanks very much, iloc actually was a different issue.

The validation in
https://github.com/pandas-dev/pandas/blob/b96665777c083516c9cdf21d25b7d9ccf70c81bd/pandas/core/indexers.py#L153:L157
did not allow boolean lists. We have two options to fix this for iloc. I converted a boolean list to an array in the indexing module, because I was not sure if this validation was done on purpose this way. Alternatively we could adjust the referenced validation to allow boolean lists too

def test_loc_setitem_boolean_list(self, func, rhs_func):
# GH#20438 testing specifically list key, not arraylike
ser = Series([0, 1, 2])
getattr(ser, func)[[True, False, True]] = rhs_func([5, 10])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also parameterize on the indexing type as well (e.g. list / array / pd.Index)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Index is not allowed for iloc, added the remaining ones

@jreback
Copy link
Contributor

jreback commented Nov 13, 2020

cc @jbrockmendel

@@ -476,6 +476,7 @@ Indexing
- Bug in :meth:`DataFrame.loc` returned requested key plus missing values when ``loc`` was applied to single level from :class:`MultiIndex` (:issue:`27104`)
- Bug in indexing on a :class:`Series` or :class:`DataFrame` with a :class:`CategoricalIndex` using a listlike indexer containing NA values (:issue:`37722`)
- Bug in :meth:`DataFrame.xs` ignored ``droplevel=False`` for columns (:issue:`19056`)
- Bug in :meth:`Series.loc` raised an Error when input was filtered with a boolean list and values to set were a list with lower dimension (:issue:`20438`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a specific exception we can say instead of Error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueError, added it

@pytest.mark.parametrize("indexing_func", [list, np.array])
@pytest.mark.parametrize("func", ["loc", "iloc"])
@pytest.mark.parametrize("rhs_func", [list, np.array])
def test_loc_setitem_boolean_list(self, func, rhs_func, indexing_func):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its only the *-iloc-list cases that fail on master. at its heart, is this an iloc bug or a loc bug?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iloc, thx.

def test_loc_setitem_boolean_list(self, func, rhs_func, indexing_func):
# GH#20438 testing specifically list key, not arraylike
ser = Series([0, 1, 2])
getattr(ser, func)[indexing_func([True, False, True])] = rhs_func([5, 10])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the np.array case, do we care if the ndarray has bool vs object dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -602,7 +602,7 @@ def _get_setitem_indexer(self, key):
"""
Convert a potentially-label-based key into a positional indexer.
"""
if self.name == "loc":
if self.name == "loc" or com.is_bool_indexer(key):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i haven't fully formed an opinion, but i think another alternative would be to properly handle boolean masks in Block.setitem.

Copy link
Member Author

@phofl phofl Nov 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, loc is not failing, this means we do not need this or. Removed it. Must have mixed something up before

Copy link
Member Author

@phofl phofl Nov 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lists can be handled in Block.setitem already, but arrays from dtype object don't work. Lists are killed by the validation mentioned above in pandas.core.indexers. If we let them through, setitem seems to be able to handle them in the iloc testcases

@jbrockmendel
Copy link
Member

does this have a perf impact? IIRC is_bool_indexer isn't completely cheap

@phofl phofl changed the title [BUG]: Bug in loc raised error when setting value via boolean list [BUG]: Bug in loc raised ValueError when setting value via boolean list Nov 14, 2020
@phofl phofl changed the title [BUG]: Bug in loc raised ValueError when setting value via boolean list BUG: Bug in loc raised ValueError when setting value via boolean list Nov 14, 2020
@phofl
Copy link
Member Author

phofl commented Nov 15, 2020

Ran indexing asvs. Seems a bit weird

before           after         ratio
     [50ae0bfc]       [b5dd4534]
     <master>         <20438>   
-         249±9ms          225±8ms     0.90  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-        30.5±2ms       27.4±0.4ms     0.90  indexing.ChainIndexing.time_chained_indexing(None)
-         252±7ms          218±8ms     0.87  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-         255±7ms          215±5ms     0.84  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
-         181±3ms          144±2ms     0.80  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-         185±4ms          147±2ms     0.79  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-         190±9ms          150±3ms     0.79  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        43.1±1μs       26.1±0.9μs     0.60  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-     1.86±0.04ms      1.11±0.04ms     0.60  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-     1.70±0.05ms         953±30μs     0.56  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-         308±8μs          166±6μs     0.54  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        296±10μs          159±4μs     0.54  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        86.8±3μs         44.3±1μs     0.51  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@jbrockmendel
Copy link
Member

couple comments, needs rebase otherwise LGTM

� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
�	pandas/tests/indexing/test_iloc.py
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 20, 2020
� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
�	pandas/tests/indexing/test_iloc.py
@phofl phofl removed the Stale label Dec 21, 2020
@phofl
Copy link
Member Author

phofl commented Dec 21, 2020

ping @jbrockmendel ?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 21, 2021
@jbrockmendel
Copy link
Member

It looks like in master the test case raises in check_setitem_lengths. would it make sense to patch at that point?

@phofl
Copy link
Member Author

phofl commented Jan 28, 2021

We could check for lists there too and count the True occurrences. Implemented it.

@jreback jreback added this to the 1.3 milestone May 2, 2021
� Conflicts:
�	pandas/tests/indexing/test_iloc.py
@jreback
Copy link
Contributor

jreback commented May 24, 2021

@phofl can you rebase (i am not sure when it was before), and ping on green.

@phofl
Copy link
Member Author

phofl commented May 24, 2021

@jreback greenish, we have same number of failures as on master

@simonjayhawkins simonjayhawkins merged commit e4d0c1d into pandas-dev:master May 25, 2021
@simonjayhawkins
Copy link
Member

Thanks @phofl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

iloc boolean mask setitem
4 participants