Skip to content

subset keyword argument will include last column if incorrect keys are given #8303

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
arijun opened this issue Sep 17, 2014 · 13 comments · Fixed by #8384
Closed

subset keyword argument will include last column if incorrect keys are given #8303

arijun opened this issue Sep 17, 2014 · 13 comments · Fixed by #8384
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@arijun
Copy link

arijun commented Sep 17, 2014

If you give a non-existent key to the subset argument it will default to the last column. For example:

In [3]: d = pd.DataFrame({'a':[1],'b':[2],'c':[np.nan]})
In [4]: len(d.dropna(subset=['x']))
Out[4]: 0

It seems the problem is in the line where dropna takes the subset:

agg_obj = self.take(ax.get_indexer_for(subset),axis=agg_axis)

Here, ax.get_indexer_for(subset) will return -1 as a sentinel value for any keys that were not found in subset, and self.take interprets the -1 as a request for the last column.

@jreback
Copy link
Contributor

jreback commented Sep 17, 2014

yep would call that a buggy....pull-request welcome!

@jreback jreback added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Sep 17, 2014
@jreback jreback modified the milestones: 0.16, 0.15.1 Sep 17, 2014
@mcjcode
Copy link
Contributor

mcjcode commented Sep 22, 2014

Would it be best for dropna to raise a KeyError exception, like d['x'] would in the example above? or to return all of the rows, the thinking being that column 'x' contains no nan's whatsoever (in fact it contains no values whatsoever, since 'x' doesn't even exist).

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

no this is very easy to fix
just filter the -1 from the indexer before passing to take

@mcjcode
Copy link
Contributor

mcjcode commented Sep 22, 2014

Easy enough, but does this mean we don't support location based indices for the subset argument? I mean, what if someone actually wrote d.dropna(subset=[-1]) intending to drop the rows where the last column is nan?

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

only support labels - you have the whole complication the of having to have 2 methods to disambiguate

and get indexer converts labels to locations

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

use your example (with subset = [-1]) as a test
that will KeyError I think

@mcjcode
Copy link
Contributor

mcjcode commented Sep 22, 2014

Yes, I like just supporting column labels here. I do note though, that when I do this

import pandas as pd
import numpy as np
print(pd.version.version)
d = pd.DataFrame({'a':[1,2],'b':[np.nan,3],'c':[4,np.nan]})
print(d.dropna(subset=[-1]))

I get this (i.e. it works as if it was dropping rows where the last were nan)

0.14.1-429-g0eb9023
   a   b  c
0  1 NaN  4

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

the indexer returns -1 (meaning it can't find the key) and take returns the last value

so you will need to filter out -1 or maybe if any -1 should raise
(if u have 1 good subset then should it work if another is not found)

we usually do this (eg in loc)

though I think in this case prob should just a raise of any columns are not found
as prob a user error

@jreback jreback modified the milestones: 0.15.0, 0.15.1 Sep 23, 2014
@arijun
Copy link
Author

arijun commented Sep 23, 2014

As far as raising an error goes, I was using subset specifically because I didn't care which of ['Sym', 'Symbol', 'SYMBOL'] it dropped on.

It's a feature that could (and probably should) be reproduced by a more general function, but would be missed (by me at least) if it was removed without replacement.

It's worth noting dropna is currently inconsistent with duplicated (and therefore drop_duplicates), which raise a KeyError.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2014

yes, I think this will increase consistency between drop_duplicates/duplicated,drop (which will raise a KeyError for an invalid column) and dropna (which currently does not)

@arijun
@mcjcode
pls see this related issue as well: #6599

@hayd what do you think?

@arijun
Copy link
Author

arijun commented Sep 23, 2014

I wouldn't mind doing this, but I've never developed for a large project and so might have problems with some aspects like

  • VS's compiler seems to not work for building the c code in python packages
  • I don't know the testing procedures here
  • I don't really know the contribution workflow in general.

If you're still interested in having me contribute, can someone please point me to where I can get help with these sorts of issues.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2014

oh, @mcjcode already has a soln (to this immediate problem). love to have contributions, see here for more info (and windows section on how to built for that). All of the above are covered.

Start and issue and lmk when you need help!

https://github.com/pydata/pandas/wiki

@hayd
Copy link
Contributor

hayd commented Sep 23, 2014

Just to throw in an edge case: column label -1.

Subset arg is clear (we can ignore positions):

Labels along other axis to consider,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants