ENH: add percentage threshold to DataFrame.dropna #35299
This generally would just be equivalent to computing thresh from the percentage and the length of the axis.
No, not really: this method can be applied over the row or column axis and thus takes each row/column into account separately. @mroeschke
I'd be +0 if we could roll this into the existing thresh argument. Let's wait for opinions from other members of the core team.
Thanks @erfannariman for the report. I would agree with @mroeschke without a more detailed proposal/discussion. Can you update the OP using the Feature Request template instead? (It is available when raising an issue, but also included below.)

Is your feature request related to a problem?
[this should provide a description of what the problem is, e.g. "I wish I could use pandas to do [...]"]

Describe the solution you'd like
[this should provide a description of the feature request]

API breaking implications
[this should provide a description of how this feature will affect the API]

Describe alternatives you've considered
[this should provide a description of any alternative solutions or features you've considered]

Additional context
[add any other context, code examples, or references to existing implementations about the feature request here]

# Your code here, if applicable
Yes, updated the OP @simonjayhawkins; this idea came up late at night and I just started writing without a proper ticket request, sorry.
IMO this wouldn't make much sense: first, I think it would be confusing; second, when setting a threshold I think it's important that it can be applied to rows or to columns, and that each row or column is treated separately.
Can you provide an example where this distinction matters?
@WillAyd sure, I added a more extensive example in the OP, see example 2.
I still don't see how this example shows that the existing thresh argument couldn't cover this. Why would accepting a float between 0.0 and 1.0 (0.4 in the example) not produce the same output? The length of axis 1 is 4, and 0.6 * 4 = 2.4 (rounded up to 3) produces the same output.

So I still don't understand the distinction, even with the added example.
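For concreteness, the equivalence being claimed looks something like this (a minimal sketch; the made-up 4-column frame mirrors the example's axis length, and the ceil rounding is an assumption):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan], "B": [np.nan, np.nan], "C": [1, 1], "D": [1, np.nan]})

# drop rows with more than 40% missing values, i.e. require at least
# ceil(0.6 * 4) = 3 non-NA values per row
df.dropna(thresh=int(np.ceil(0.6 * df.shape[1])))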
Basically in your message you are saying two things, so I will react to both of them:

This can still be used, and in my opinion it should not be deprecated. But there are situations (and I find myself in that situation more often) where defining an absolute, discrete number is not preferable: the shape of your data changes over time, so an absolute number does not tell you anything, and you want a standardized metric, which is the percentage.

Technically this does produce the same output, but I think it would be confusing to have the same argument accept values that cover different functionality, in this case hard discrete values vs. a proportion. Then we would need to deprecate one, i.e. make the choice between the two options stated above. And that means it would be API breaking. So in my opinion, adding a separate perc argument is the cleaner option.
The discussion is about the API for adding the requested functionality. I'm +1 on adding that functionality. I don't think there has been any suggestion of deprecation so far. The crux of the discussion is that we should exhaust other options before adding new keywords. One option suggested is adding more utility to the existing thresh parameter. From the docs, http://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.dropna.html :

thresh : int, optional
Require that many non-NA values.
so an option could be to define thresh as something like:
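A minimal sketch of what that dual-typed handling could mean (a hypothetical helper, not actual pandas code):

import math

def resolve_thresh(thresh, axis_length):
    # hypothetical: an int is an absolute count of required non-NA values,
    # a float in (0.0, 1.0) is a fraction of the axis length
    if isinstance(thresh, float):
        if not 0.0 < thresh < 1.0:
            raise ValueError("a float thresh must be strictly between 0 and 1")
        return math.ceil(thresh * axis_length)
    return thresh

resolve_thresh(3, 4)    # -> 3 (unchanged int behaviour)
resolve_thresh(0.6, 4)  # -> 3 (60% of a length-4 axis, rounded up)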
at first this appears to be a non-breaking API change, but it could be confusing. However, the undocumented behaviour is that thresh already accepts floats today and treats them as counts, so this is not a great solution, but it has to be considered against the alternative of adding a new keyword.

There is also the how parameter; another option is that this parameter could take a percentage as a string, e.g. '40%' (see the sketch after this comment). Specifying numerical values as strings is not great either, but a percentage maybe fits this parameter better, where 'any' could be an alias for '0%' and 'all' an alias for '100%' (or vice-versa).

stepping back a bit, maybe the API is already confusing because we have two parameters, how and thresh.
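A sketch of that string-percentage idea (hypothetical parsing, not pandas code; the 'any'/'all' mapping is one possible choice, as the comment notes it could be the other way around):

import math

def parse_how(how, axis_length):
    # hypothetical: 'any' and 'all' become percentage aliases
    aliases = {"any": "100%", "all": "0%"}
    how = aliases.get(how, how)
    if isinstance(how, str) and how.endswith("%"):
        # require that fraction of non-NA values, rounded up
        return math.ceil(float(how[:-1]) / 100 * axis_length)
    return how

parse_how("40%", 10)  # -> 4
parse_how("any", 10)  # -> 10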
I would not be +1 on this; it would be confusing to add multiple functionalities to a single argument where the only difference for the user is whether it's an int or a float.
Not sure if I understand what you mean; wouldn't that amount to the same thing?
after some discussion, using the existing API seems enough. IIUC the result from example 2 in the OP can be easily achieved without the addition of additional parameters:
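Presumably something along these lines (a sketch using boolean indexing; the 40% threshold is the one from the OP's example, the frame is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [np.nan, np.nan, np.nan], "C": [1, 2, 3]})

# keep rows with at most 40% missing values
df.loc[df.isna().mean(axis=1) <= 0.4]

# keep columns with at most 40% missing values
df.loc[:, df.isna().mean(axis=0) <= 0.4]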
Gotcha. Yeah I can see how that may lead to confusion.
Right, that's what I was referencing in #35299 (comment), which is why I wasn't too enthusiastic to add a separate perc keyword.
from %prun?
Where do we draw the line on how "user friendly" pandas has to be and how much we want to abstract away? Because following the logic of "it's equivalent to computing it yourself", many convenience arguments would never be added.
Not sure if I understand this comment, could you elaborate? @simonjayhawkins
it was just to show an example of a parameter taking an int and a float to represent a number or percentage.
Agreed, adding the functionality is a good idea. We just need to make sure the API design is also good.
Sounds good, what is the way forward from here? Do we need other devs to share their opinion?
could "winsorize" be the appropriate keyword/name? |
Would that be a replacement for "thresh"?
That's what I had in mind, yes.
I would still be in favor of adding a new perc argument.
@jbrockmendel @simonjayhawkins is there a way to make more progress on this ticket? I would like to work on it, but the discussion above has not been settled yet.
-1 on trying to combine number of values and perc in the same parameter

-0 on adding perc: one more parameter is just more complexity, and now you have 2 for the same thing; would be better to just doc this i think

not completely against, but if we did add anything then it should match the sample parameter names
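(For reference, DataFrame.sample is the existing precedent for splitting a count and a fraction across two keywords; a small sketch:)

import pandas as pd

df = pd.DataFrame({"A": range(10)})

df.sample(n=3)      # an absolute number of rows
df.sample(frac=0.3) # a fraction of the rows (here also 3)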
Can you elaborate what you mean by "now you have 2 for the same thing"? You mean when both thresh and perc are passed?
you can already compute the perc if you wanted, it's just a function of the length. so this has to be weighed against an added keyword, which is increasing complexity.
I think your point boils down to this comment I made above:
If more members don't see the use case for this, I can close this issue. Although I still think this would be a useful feature for the data science stack, specifically the cleaning part. At the minimum I would like to add this to the docs.
It's not really clear to me whether we have a consensus on how to move forward with this ENH. I would have time to finish the PR, or close it if the change is undesirable.
Based on the discussion above I don't see this getting a lot of traction. It's not necessarily that it is "undesirable", but rather that we have an already very large API and a high bar for expanding it, which I'm not sure this meets.
I see and I can agree on that. Thanks everyone for taking the time to review. Will close this and the PR.
I think this is really useful and should be added. Here are some reasons why:

In summary, I think it solves many more problems than it creates. Finally, I think things should be as easy/thoughtless as possible for a user. In my humble opinion: compute is cheap, thinking isn't 😄
@JoshuaC3 pls read the discussion above. what does thresh=1 vs thresh=1.0 mean?
That would be the same as I proposed with adding the perc argument.
right @erfannariman, agree it's the same as you proposed above. -0 on that (as on adding additional keywords), but others can weigh in. we don't have percentages in any other methods (e.g. limit for fillna, rolling), so that's why i am especially resistant to this seemingly small change.
I have read the discussion above and addressed this in point 4. Perhaps I wasn't clear enough about why the "1/1.0" pain-point is not an issue... The current functionality of thresh already accepts floats without complaint, with odd results (see below).

All this said and done, I don't really care whether the API ends up being a float-aware thresh or a separate perc keyword:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((5, 5)))
df.loc[1:1, 1:1] = np.nan

df.dropna(thresh=-12.4, axis=1)  # ??!! weird existing float behaviour: nothing is dropped
df.dropna(thresh=2.5, axis=1)    # ??!! weird existing float behaviour: acts like thresh=3,
                                 # since the non-NA count is just compared against thresh
df.dropna(thresh=4, axis=0)

Thanks all
@JoshuaC3 obviously we should check for invalid values, happy to take a PR for that.
@jreback Is that to say checking for valid values and the mixing of int and float, or just the checking for values?
checking values
And any change of heart on the int+float thresh based on this, or should I propose another API change?
no, the possible api change is orthogonal
Is your feature request related to a problem?
When doing feature selection for models, or when handling missing data, in most cases you want to dynamically set a threshold and drop the columns which exceed it. A percentage is a more general metric for how much data is missing per row or per column. Right now we have the thresh argument, which accepts an int, but in most cases this is not desirable. For example, when running the code in production, the shape of the data changes over time, so a percentage threshold would make more sense to me. Or when you don't know the data well enough: what does "a number" mean in that case? A percentage is standardized.

Besides that, thresh right now is not flexible across the column and index axes. For data cleaning you might want to drop rows which exceed the threshold, but for feature selection you want to treat each column separately.
Describe the solution you'd like
Have a percentage threshold for both the column and the index axis.
API breaking implications
Not that I can think of, but not 100% sure.
Describe alternatives you've considered
Writing my own custom functions.
Additional context
Locally I created a version where a perc argument is added (see the linked draft PR):

example 1:

example 2:
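A made-up frame consistent with the description that follows (row 3 ends up 75% missing, row 8 50% missing, column C 60% missing):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 4)), columns=list("ABCD"))
df.loc[[0, 1, 5, 6], "C"] = np.nan   # column C: 6 of 10 values end up missing (60%)
df.loc[3, ["A", "B", "C"]] = np.nan  # row 3: 3 of 4 values missing (75%)
df.loc[8, ["C", "D"]] = np.nan       # row 8: 2 of 4 values missing (50%)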
As we can see, index 3 (axis=0) has 75% missing values and column C (axis=1) has 60% missing values. With the perc argument we can specify what the threshold is, and also whether to apply it row-wise or column-wise.

Example, considered per row:
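Continuing from the frame above (perc is the hypothetical argument from the draft PR, so the proposed call is shown as a comment; the live line is the equivalent with today's API):

# proposed (hypothetical, from the draft PR):
# df.dropna(perc=0.4, axis=0)

# equivalent with the existing API: drops rows 3 and 8
df.loc[df.isna().mean(axis=1) <= 0.4]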
Above we can see rows 3 and 8 got dropped because they had > 40% missing values.

Same command but with axis=1, so we consider the percentage threshold per column, and we see that column C got dropped because it had 60% missing values:
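And the column-wise variant (again, the proposed perc call is hypothetical and shown as a comment):

# proposed (hypothetical, from the draft PR):
# df.dropna(perc=0.4, axis=1)

# equivalent with the existing API: drops column C (60% missing)
df.loc[:, df.isna().mean(axis=0) <= 0.4]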