Skip to content

Fix passing empty label to df drop #21515

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jun 21, 2018
Merged

Fix passing empty label to df drop #21515

merged 14 commits into from
Jun 21, 2018

Conversation

alimcmaster1
Copy link
Member

@alimcmaster1 alimcmaster1 commented Jun 17, 2018

-Drop method in indexes/base.py, docs say KeyError should only be raised if none of labels are found in selected axis. However pd.DataFrame(index=[1,2,3]).drop([1, 4]) throws.

-Makes behaviour consistent for .drop() across unique/non-unique indexes.
Both the below will now raise a KeyError

  • pd.DataFrame(index=[1,2,3]).drop([1, 4])
  • pd.DataFrame(index=[1,1,3]).drop([1, 4])

-Remove unused var indexer and _axis

@pep8speaks
Copy link

pep8speaks commented Jun 17, 2018

Hello @alimcmaster1! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 21, 2018 at 08:11 Hours UTC

@alimcmaster1
Copy link
Member Author

Test failure in: test_multilevel.py looking into this

# Check if label doesn't exist along axis
if len(labels):
labels_missing = (~np.array([label in axis
for label in labels])).any()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(axis.get_indexer_for(labels) == -1).any() ?

if errors == 'raise' and indexer.all():
raise KeyError('{} not found in axis'.format(labels))
# Check if label doesn't exist along axis
if len(labels):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need to check if the list is (non-)empty. The rule "if any label is missing" is fine (trivially false) with no labels.

"""
arr_dtype = 'object' if self.dtype == 'object' else None
labels = com._index_labels_to_array(labels, dtype=arr_dtype)
indexer = self.get_indexer(labels)
mask = indexer == -1
if mask.any():
if mask.any() and len(mask):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: this shouldn't be needed.

([1, 1, 2], [], [1, 1, 2]),
([1, 2, 3], [2], [1, 3]),
([1, 1, 3], [1], [3]),
])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to parametrize separately index and drop_labels (and construct expected_index as [l for l in index if not l in drop_labels])?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree will do

Copy link
Member

@toobaz toobaz Jun 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already better, but what I meant was something like

@pytest.mark.parametrize('index', [[1, 2, 3], [1, 1, 2], [1, 2, 2]])
@pytest.mark.parametrize('drop_labels', [[], [1]])


@pytest.mark.parametrize('index, drop_labels, error_key', [
([1, 2, 3], [1, 4], 'not contained in axis'),
([1, 2, 2], [1, 4], 'not found in axis'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest instead to make the two errors the same (e.g. fix the one in Index.drop() in pandas/core/indexes/base.py). There is no real reason why the wording should differ.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that makes sense

@alimcmaster1
Copy link
Member Author

Thanks @toobaz updated as per your comments and fixed that test. ( When level is not None is true we still need to call indexer.all()

@toobaz
Copy link
Member

toobaz commented Jun 18, 2018

When level is not None is true we still need to call indexer.all()

OK, I see the problem, pd.MultiIndex.from_product([[1, 2]]*2).drop([3], level=0) does not actually raise ( #18561 ). OK to fix it elsewhere, but please refer to that issue in a comment.
(I think the indexer.all() only solves some specific case by chance)

# GH 21494
expected_index = [i for i in index if i not in drop_labels]
df = pd.DataFrame(index=index).drop(drop_labels)
assert (df.index.values == expected_index).all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alway do a tm.assert_* comparison.

this test is in /series/ but you are testing /dataframe can you move. its ok to leave if you make this a series (and add to /frame/ as well)



@pytest.mark.parametrize('index, drop_labels, error_key', [
([1, 2, 3], [1, 4], 'not found in axis'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the message doesn't need to be a parameter

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jun 18, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew, 0.23.2 is ok

@codecov
Copy link

codecov bot commented Jun 18, 2018

Codecov Report

Merging #21515 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21515      +/-   ##
==========================================
+ Coverage   91.92%   91.92%   +<.01%     
==========================================
  Files         153      153              
  Lines       49564    49560       -4     
==========================================
- Hits        45560    45558       -2     
+ Misses       4004     4002       -2
Flag Coverage Δ
#multiple 90.32% <100%> (ø) ⬆️
#single 41.8% <25%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 94.97% <ø> (-0.01%) ⬇️
pandas/core/indexes/base.py 96.62% <100%> (ø) ⬆️
pandas/core/generic.py 96.22% <100%> (+0.09%) ⬆️
pandas/core/sorting.py 98.19% <0%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f91a704...454cb7e. Read the comment docs.


else:
labels = _ensure_object(com._index_labels_to_array(labels))
if level is not None:
if not isinstance(axis, MultiIndex):
raise AssertionError('axis must be a MultiIndex')
indexer = ~axis.get_level_values(level).isin(labels)
# GH 18561 MultiIndex.drop should raise if label is absent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line before the comment

except AttributeError:
pass
result = dropped
result = self.reindex(**{axis_name: new_axis})

else:
labels = _ensure_object(com._index_labels_to_array(labels))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment to indicate this is the non-unique case

@@ -3515,3 +3515,24 @@ def test_functions_no_warnings(self):
with tm.assert_produces_warning(False):
df['group'] = pd.cut(df.value, range(0, 105, 10), right=False,
labels=labels)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move with other drop tests in test_axis_select_reindex



@pytest.mark.parametrize('index, drop_labels', [
([1, 2, 3], []),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to test_base with other test_drop tests

@alimcmaster1
Copy link
Member Author

Thanks @jreback updated as per your comments, also when building the docs I noticed we didn't have the whatsnew entry for 0.23.2 in whatsnew.rst, this intended or want me to add?

@@ -1565,6 +1565,79 @@ def test_drop_tuple(self, values, to_drop):
for drop_me in to_drop[1], [to_drop[1]]:
pytest.raises(KeyError, removed.drop, drop_me)

def test_drop_unique_and_non_unique_index(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any way to parameterize some of this (e.g. you can break into 2 tests if that makes it easier)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I told you to move them here, but on 2nd thought, there they belong in test_alter_index (where the tests were originally). sorry about that.

test_base is for generically testing Index/Series types and not doing that here so kind of superfluous to put the tests here.

@alimcmaster1
Copy link
Member Author

Np @jreback moved them to original file and added parameterization

@jreback jreback added this to the 0.23.2 milestone Jun 20, 2018
@jreback
Copy link
Contributor

jreback commented Jun 20, 2018

@toobaz pls approve & merge when satisfied. lgtm.

Copy link
Member

@toobaz toobaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just two minor comments on tests

([1, 1, 2], []),
([1, 2, 3], [2]),
([1, 1, 3], [1]),
])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think (see my previous comment) I would rather write this as:

@pytest.mark.parametrize('index', [[1, 2, 3], [1, 1, 2]])
@pytest.mark.parametrize('drop_labels', [[], [1]])

Do you have any reason to explictly parametrize the combinations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with you, your approach seems more compact. Let me update

@pytest.mark.parametrize('index, drop_labels', [
([1, 2, 3], [1, 4]),
([1, 2, 2], [1, 4]),
])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly as above, I don't think you need to parametrize drop_labels.

@jreback , what do you think? Do we have a general rule concerning making parameter combinations more explicit/more compact?

# bad axis
(range(3), list('abc'), ('a',),
0, KeyError, 'not found in axis'),
(range(3), list('abc'), 'one',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jreback that't much neater than my attempt!

@alimcmaster1
Copy link
Member Author

@toobaz updated as per your comments, thanks

([1, 1, 2], []),
([1, 2, 3], [2]),
([1, 1, 3], [1]),
])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with you, your approach seems more compact. Let me update

OK, then here the same applies ;-)

tm.assert_frame_equal(frame, pd.DataFrame(index=expected_index))

@pytest.mark.parametrize('index', [[1, 2, 3], [1, 2, 2]])
@pytest.mark.parametrize('drop_labels', [[1, 4]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also include an entirely missing list (e.g. [4, 5]). So that we at least give this decorator a reason to exist ;-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toobaz thanks updated :)

@toobaz toobaz merged commit f4fba9e into pandas-dev:master Jun 21, 2018
@toobaz
Copy link
Member

toobaz commented Jun 21, 2018

@alimcmaster1 thanks!

@alimcmaster1
Copy link
Member Author

thanks for the help @toobaz

@alimcmaster1 alimcmaster1 deleted the df-drop-errors-2 branch June 21, 2018 08:38
jorisvandenbossche pushed a commit that referenced this pull request Jun 29, 2018
jorisvandenbossche pushed a commit that referenced this pull request Jul 2, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants