ENH: add percentage threshold to DataFrame.dropna #35299

erfannariman · 2020-07-15T21:01:28Z

Is your feature request related to a problem?
When doing feature selection for models, or with handling missing data, in most cases you want to dynamically set a threshold to drop column which exceed this. A percentage is a more general metric to use to see how many data is missing data per row or per column. Right now we have the thresh argument, which accepts an int, but in most cases this is not desirable. For example when running the code in production, the shape of the data changes over time and thus a percentage threshold would make more sense to me. Or when you don't know the data well enough, what does "a number" mean in this case. Percentage is standarized.

Besides that, the thresh right now is not flexible to go over column or index axis. For data cleaning you might want to drop rows which exceed the threshold, but for feature selection, you want to treat each column separately.

Describe the solution you'd like
Have a percentage threshold for both column and index axis.

API breaking implications
Not that I can think of, but not 100% sure.

Describe alternatives you've considered
Writing own custom functions.

Additional context

Locally I created a version where the perc argument is created (see linked draft PR):

example 1:

>>> df = pd.DataFrame({'col': ["A", "A", "B", "B"],
...                    'A': [80, np.nan, np.nan, np.nan],
...                    'B': [80, np.nan, 76, 67]})
>>> df
  col     A     B
0   A  80.0  80.0
1   A   NaN   NaN
2   B   NaN  76.0
3   B   NaN  67.0

>>> df.dropna(perc=0.5)
  col     A     B
0   A  80.0  80.0
2   B   NaN  76.0
3   B   NaN  67.0

>>> df.dropna(perc=0.5, axis=1)
  col     B
0   A  80.0
1   A   NaN
2   B  76.0
3   B  67.0

example 2:

>>> df = pd.DataFrame(np.random.randint(1, 10, (10,4)), columns=list('ABCD'))
>>> df.loc[3, 'B':] = np.nan
>>> df.loc[8, :'B'] = np.nan
>>> df.loc[1:6, 'C'] = np.nan
>>> df
     A    B    C    D
0  7.0  2.0  3.0  7.0
1  3.0  2.0  NaN  9.0
2  2.0  2.0  NaN  2.0
3  7.0  NaN  NaN  NaN 
4  1.0  9.0  NaN  4.0
5  9.0  9.0  NaN  1.0
6  2.0  2.0  NaN  6.0
7  9.0  3.0  5.0  6.0
8  NaN  NaN  9.0  5.0
9  7.0  7.0  3.0  1.0

As we can see, index 3 (axis=0) has 75% missing values and column C (axis=1) has 60% missing values. With the percentage argument we can specify what the threshold is, but also consider row wise or column wise:

Example consider per row:

>>> df.dropna(perc=.4, axis=0)
     A    B    C    D
0  7.0  2.0  3.0  7.0
1  3.0  2.0  NaN  9.0
2  2.0  2.0  NaN  2.0
4  1.0  9.0  NaN  4.0
5  9.0  9.0  NaN  1.0
6  2.0  2.0  NaN  6.0
7  9.0  3.0  5.0  6.0
9  7.0  7.0  3.0  1.0

Above we can see row 3 and 8 got dropped because these had > 40% missing values.

>>> df.dropna(perc=.4, axis=1)
     A    B    D
0  7.0  2.0  7.0
1  3.0  2.0  9.0
2  2.0  2.0  2.0
3  7.0  NaN  NaN
4  1.0  9.0  4.0
5  9.0  9.0  1.0
6  2.0  2.0  6.0
7  9.0  3.0  6.0
8  NaN  NaN  5.0
9  7.0  7.0  1.0

Same command but with axis=1, so we can consider percentage threshold per column, and we see that column C got dropped because it had 60% missing values

The text was updated successfully, but these errors were encountered:

mroeschke · 2020-07-16T02:46:09Z

This generally would just be equivalent to df.dropna(thresh=round(len(df) * percentage)) correct?

erfannariman · 2020-07-16T07:48:21Z

This generally would just be equivalent to df.dropna(thresh=round(len(df) * percentage)) correct?

No not really, this method can be applied over row or column axis and thus takes each row / column separately into account. @mroeschke

mroeschke · 2020-07-16T16:13:53Z

I'd be +0 if we could roll this into thresh instead of adding a new parameter.

Let's wait for opinions from other members of the core team

simonjayhawkins · 2020-07-16T18:24:13Z

Thanks @erfannariman for the report. I would agree with @mroeschke without a more detailed proposal/discussion. Can you update the OP using the Feature Request template instead. (available when raising issue but also included in details below)

Is your feature request related to a problem?

[this should provide a description of what the problem is, e.g. "I wish I could use pandas to do [...]"]

Describe the solution you'd like

[this should provide a description of the feature request, e.g. "DataFrame.foo should get a new parameter bar that [...]", try to write a docstring for the desired feature]

API breaking implications

[this should provide a description of how this feature will affect the API]

Describe alternatives you've considered

[this should provide a description of any alternative solutions or features you've considered]

Additional context

[add any other context, code examples, or references to existing implementations about the feature request here]

# Your code here, if applicable

simonjayhawkins · 2020-07-16T18:31:36Z

I'd be +0 if we could roll this into thresh instead of adding a new parameter.

thres currently accepts an int. would allowing a float here make sense or would that lead to a more confusing api, with 1 having a different meaning to 1.0

erfannariman · 2020-07-16T18:54:10Z

Thanks @erfannariman for the report. I would agree with @mroeschke without a more detailed proposal/discussion. Can you update the OP using the Feature Request template instead.

Yes, updated OP @simonjayhawkins , this idea came up late at night and just started writing without a proper ticket request, sorry.

thres currently accepts an int. would allowing a float here make sense or would that lead to a more confusing api, with 1 having a different meaning to 1.0

IMO this wouldn't make much sense, first of all I think this would be confusing, second when setting a threshold I think it's important that rows or columns can be threated and also important that they can be treated separately.

WillAyd · 2020-07-21T19:34:04Z

This generally would just be equivalent to df.dropna(thresh=round(len(df) * percentage)) correct?

No not really, this method can be applied over row or column axis and thus takes each row / column separately into account. @mroeschke

Can you provide an example where this distinction matters?

erfannariman · 2020-07-22T13:11:19Z

@WillAyd sure, I added a more extensive example in OP, see example 2.

simonjayhawkins · 2020-07-22T13:41:51Z

@WillAyd sure, I added a more extensive example in OP, see example 2.

I still don't see how this example means that the thresh parameter cannot be used. thresh also uses the axis parameter, it's just an absolute discrete number instead of a ratio to the axis length.

so why would accepting a float between 0.0 and 1.0 (0.4 in the example) not produce the same output?

the length of axis 1 is 4. 0.6*4 = 2.4 (rounded to 3)
the length of axis 0 is 10. 0.6*10 = 6

produces the same output.

>>> df.dropna(thresh=3, axis=0)
     A    B    C    D
0  5.0  7.0  4.0  5.0
1  2.0  3.0  NaN  3.0
2  9.0  8.0  NaN  5.0
4  2.0  6.0  NaN  4.0
5  6.0  6.0  NaN  8.0
6  4.0  3.0  NaN  9.0
7  7.0  4.0  3.0  2.0
9  3.0  3.0  7.0  2.0
>>>
>>> df.dropna(thresh=6, axis=1)
     A    B    D
0  5.0  7.0  5.0
1  2.0  3.0  3.0
2  9.0  8.0  5.0
3  9.0  NaN  NaN
4  2.0  6.0  4.0
5  6.0  6.0  8.0
6  4.0  3.0  9.0
7  7.0  4.0  2.0
8  NaN  NaN  2.0
9  3.0  3.0  2.0
>>>

second when setting a threshold I think it's important that rows or columns can be threated and also important that they can be treated separately.

so I still don't understand the distinction, with the added example.

erfannariman · 2020-07-22T16:36:18Z

Basically in your message you are saying two things, so I will react on both of them:

I still don't see how this example means that the thresh parameter cannot be used. thresh also uses the axis parameter, it's just an absolute discrete number instead of a ratio to the axis length.

This can still be used and in my opinion this should not be deprecated, but there are situations (and I find myself more in that situation) where defining a absolute discrete number is not preferable, since the shape of your data changes, so a absolute number does not say anything and you want a standarized metric, which is the percentage.

so why would accepting a float between 0.0 and 1.0 (0.4 in the example) not produce the same output?

Technically this does produce the same output, but I think it would be confusing to have the same argument accept values which cover different functionality, in this case hard discrete values, vs a proportion.

So then we would need to deprecate one, so make the choice between the two stated above in bold. And that means it would be API breaking. So in my opinion, adding a percentage argument would make it more distinct for the user in which scenario to use what argument.

@simonjayhawkins

simonjayhawkins · 2020-07-22T18:02:06Z

The discussion is about the api for adding the requested functionality. I'm +1 on adding that functionality. I don't think there has been any suggestion of deprecation so far.

The crux of the discussion that we should exhaust other options before adding new keywords.

One option suggested is adding more utiltiy to the thresh parameter.

from the docs, http://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.dropna.html

thresh : int, optional
Require that many non-NA values.

so an option could be to define thresh something like

thresh : int or float, optional
Require that many non-NA values if int or proportion of total if float (must be between 0.0 and 1.0)

at first this appears to be a non-breaking api change, but could be confusing.

However, the undocumented behaviour is that thresh currently accepts a float. (so I actually didn't need to round in the example I gave before)

so this is not a great solution but has to be considered against the alternative of adding a new keyword.

There is also the how parameter. This currently accepts a string and can currently be any or all.

another option is that this parameter could take a percentage as a string, e.g. '40%'

specifying numerical values as strings is not great either, but the percentage maybe fits this parameter better, where 'any' could be an alias for '0%' and 'all' an alias for '100%' (or vice-versa)

stepping back a bit, maybe the api is already confusing because we have two parameters, thresh and how controlling the criteria for dropping rows or columns. These parameters should not be specified together, so maybe there should have just been one parameter for this.

erfannariman · 2020-07-27T15:59:08Z

thresh : int or float, optional
Require that many non-NA values if int or proportion of total if float (must be between 0.0 and 1.0)

I would not be +1 on this, it would be confusing to add multiple functionality's in a single argument where the only difference for the user is if it's a int or float. I still think by letting the thresh argument be and add perc or percentage would be more user friendly and make a clear distinguishment when to use what.

specifying numerical values as strings is not great either, but the percentage maybe fits this parameter better, where 'any' could be an alias for '0%' and 'all' an alias for '100%' (or vice-versa)

Not sure if I understand what you mean, wouldn't any be > 0%? I actually liked the how argument as a user and it wasn't really confusing. But I would like to hear more opinions about this matter.

simonjayhawkins · 2020-07-27T19:57:54Z

@mroeschke

I'd be +0 if we could roll this into thresh instead of adding a new parameter.

after some discussion, using the thresh parameter may lead to confusion.

IIUC the result from example 2 in the OP can be easily achieved without the addition of additional parameters.

>>> # df.dropna(perc=.4, axis=0)
>>> df.dropna(thresh=(1 - 0.4) * len(df.columns))
     A    B    C    D
0  3.0  8.0  3.0  7.0
1  8.0  8.0  NaN  6.0
2  7.0  7.0  NaN  2.0
4  8.0  5.0  NaN  1.0
5  2.0  7.0  NaN  5.0
6  1.0  3.0  NaN  3.0
7  1.0  3.0  5.0  4.0
9  6.0  6.0  2.0  6.0
>>>
>>> # df.dropna(perc=.4, axis=1)
>>> df.dropna(axis=1, thresh=(1 - 0.4) * len(df))
     A    B    D
0  3.0  8.0  7.0
1  8.0  8.0  6.0
2  7.0  7.0  2.0
3  2.0  NaN  NaN
4  8.0  5.0  1.0
5  2.0  7.0  5.0
6  1.0  3.0  3.0
7  1.0  3.0  4.0
8  NaN  NaN  6.0
9  6.0  6.0  6.0
>>>

mroeschke · 2020-07-27T22:06:41Z

Gotcha. Yeah I can see how that may lead to confusion.

IIUC the result from example 2 in the OP can be easily achieved without the addition of additional parameters.

Right that's what I was referencing in #35299 (comment), which why I wasn't too enthusiastic to add a separate perc parameter as a shorthand for that.

simonjayhawkins · 2020-07-28T17:39:05Z

from %prun?

-l <limit>
  you can place restrictions on what or how much of the
  profile gets printed. The limit value can be:

     * A string: only information for function names containing this string
       is printed.

     * An integer: only these many lines are printed.

     * A float (between 0 and 1): this fraction of the report is printed
       (for example, use a limit of 0.4 to see the topmost 40% only).

erfannariman · 2020-07-28T19:01:22Z

Where do we draw the line of how "user friendly" pandas has to be and how much we want to abstract away? Because following the logic of df.dropna(axis=1, thresh=(1 - 0.4) * len(df)), we could also apply the same for Series.mean for example, because that is the same as Series.sum / len(df).

from %prun?

Not sure if I understand this comment, could you elaborate? @simonjayhawkins

simonjayhawkins · 2020-07-29T10:50:29Z

Not sure if I understand this comment, could you elaborate? @simonjayhawkins

it was just to show an example of a parameter taking an int and a float to represent a number or percentage.

Where do we draw the line of how "user friendly" pandas has to be and how much we want to abstract away? Because following the logic of df.dropna(axis=1, thresh=(1 - 0.4) * len(df)), we could also apply the same for Series.mean for example, because that is the same as Series.sum / len(df).

Agreed. adding the functionality is a good idea. We just need to make sure the api design is also good.

erfannariman · 2020-08-04T09:35:07Z

Agreed. adding the functionality is a good idea. We just need to make sure the api design is also good.

Sounds good, what is the way forward from here? Do we need other devs to share their opinion?

jbrockmendel · 2020-09-11T17:41:43Z

could "winsorize" be the appropriate keyword/name?

erfannariman · 2020-09-13T18:48:27Z

Would that be as replacement for "thresh"?

jbrockmendel · 2020-09-13T19:00:23Z

Would that be as replacement for "thresh"?

Thats what I had in mind, yes.

erfannariman · 2020-09-13T19:08:39Z

I would still be in favor of adding a new perc argument, this needs deprecation, besides the other arguments I gave above about more clear distinguishment than adding a float or int.

erfannariman · 2020-09-15T11:59:25Z

@jbrockmendel @simonjayhawkins is there a way to make more progress on this ticket? Would like to work on it, but the discussion above is not yet decided upon.

jreback · 2020-09-15T12:50:24Z

-1 on trying to combine number of value and per in the same parameter

-0 on adding perc- one more parameter is just more complexity and now u have 2 for the same thing

would be better to just doc this i think

not completely against - but did anything then should match sample parameter names

erfannariman · 2020-09-15T13:51:27Z

Can you elaborate what you mean with "now u have 2 for the same thing". You mean when perc is added? But arn't they distinguished by the functionality? @jreback

jreback · 2020-09-16T02:10:01Z

you can already compute the perc if u wanted it's just a function of the length

so weighing this vs added keyword which is increasing complexity

erfannariman · 2020-09-16T22:06:10Z

I think your point boiles down to this comment I made above:

Where do we draw the line of how "user friendly" pandas has to be and how much we want to abstract away? Because following the logic of df.dropna(axis=1, thresh=(1 - 0.4) * len(df)), we could also apply the same for Series.mean for example, because that is the same as Series.sum / len(df).

@jreback

If more members don't see the usecase for this, I can close this issue. Although I still think this would be a useful future for the data science stack and specifically the cleaning part.

At the minimum I would like to add this to the docs.

erfannariman · 2020-09-25T11:55:33Z

It's not really clear for me if we have a consensus how to move forward with this ENH. I would have time to finish this PR, or close it if the change is undesirable.

WillAyd · 2020-09-29T20:43:44Z

Based on the discussion above I don't see this getting a lot of traction. It's not necessarily that it is "undesirable" but rather that we have an already very large API and have a high bar for expanding that, which I'm not sure this meets

erfannariman · 2020-09-29T23:45:37Z

I see and I can agree on that. Thanks everyone for taking the time to review. Will close this and the PR.

JoshuaC3 · 2021-03-29T12:45:39Z

#40676

JoshuaC3 · 2021-03-29T13:17:45Z

I think this is really useful, should be added. Here are some reasons why:

I regularly but up against this but it has always been easier to split my method-chain and add: df.dropna(axis=1, thresh=0.4 * len(df)) than raise an Issue/EHN. Basically, I think it is being under-reported as a desire.
df.dropna(axis=1, thresh=0.4 * len(df)) causes you to split a method-chain (which is very readable).
df.dropna(axis=1, thresh=0.4 * len(df)) doesn't convert nicely to df.dropna(axis=0, thresh=0.4 * df.shape[0]) - in fact, this is the primary reason I raised this. I changed df.dropna(axis=0, thresh=0.4 * len(df)) (notice the len(df) and the axis=0 are not different dimension without thinking and got untold errors. It isn't blatantly obvious this was the error.
There seems to be some concern around the int vs float arguments causing confusion. I would disagree with this. It is unlikely that you wish to drop all your data as with 1.0 float vs the single row 1 int. Additionally, 0.0 and 0 will behave the same, so no issue there. Anyone wanting to use a fraction/percentage/decimal clearly has the intuition and intention to do so, as does an int user. Adding the floats (0, 1.0) doesn't affect the current functionality/API. Finally, as @simonjayhawkins showed, it isn't uncommon to use types like this. ENH: add percentage threshold to DataFrame.dropna #35299 (comment)

In summary, I think it solves many more problems than it creates.

Finally, I think things should be as easy/thoughtless as possible for a user. In my humble opinion - compute is cheap, think isn't 😄

jreback · 2021-03-29T13:43:37Z

@JoshuaC3 pls read the discussion above. what does thresh=1 mean? that is the confusion. we cannot possibly mix this in as that will create a disaster. when I said new api, you can propose thread_pct or suggest deprecating thresh / rename in favor of a new argument. -1 on mixing these.

erfannariman · 2021-03-29T13:46:13Z

@JoshuaC3 pls read the discussion above. what does thresh=1 mean? that is the confusion. we cannot possibly mix this in as that will create a disaster. when I said new api, you can propose thread_pct or suggest deprecating thresh / rename in favor of a new argument. -1 on mixing these.

That would be the same as I proposed with adding the perc argument combined with axis=0 / axis=1 right? Which I still see useful btw. Mixing thresh and floats and ints would be -1, I agree.

jreback · 2021-03-29T14:07:28Z

right @erfannariman agree its the same as you proposed

about -0 in that (as adding additional keywords) but others can weigh in

we don't have percentage in any other methods (eg limit for fill a, rolling) so that's why i am especially resistant to this seemingly small change

JoshuaC3 · 2021-03-29T15:05:29Z

I have read the discussion above and addressed this in point 4. Perhaps I wasn't clear enough about why the "1/1.0" pain-point is not an issue...

The current functionality of thresh is essentially: "bigger-thresh -> more dropping".

Therefore, in the float context, thresh=0.0 is drop nothing. Exactly the same as thresh=0 in the int context.
In the float context, thresh=1.0 is drop everything. I can't see why you would want to allow this anyway, you can just use drop. Therefore, we can set the floats to be (0, 1.0) in mathematical notation. That is, everything between 0 and 1 but not including 0 and 1 would be considered a float.
It makes even less sense to have -0.3 or +1.3 as values passed. to thresh, which is currently allowed and does nothing. I'm not sure why or what this means... Therefore, the (0, 1) restriction makes total sense to me and should throw an error when a float outside of that range.

All this said and done, I don't really care if the API is thresh, thresh_pct, q like in .quantile() or xxxxx, it is the functionality that I find most useful and am very happy to change stance on this if it means it gets accepted!

JoshuaC3 · 2021-03-29T15:08:18Z

df = pd.DataFrame(np.ones((5, 5)))

df.loc[1:1, 1:1] = np.nan

df.dropna(thresh=-12.4, axis=1) #??!! weird existing float behaviour
df.dropna(thresh=2.5, axis=1) #??!! weird existing float behaviour
df.dropna(thresh=4, axis=0)

Thanks all

jreback · 2021-03-29T15:18:45Z

@JoshuaC3 obviously we should check for invalid values, happy to take a PR for that.

JoshuaC3 · 2021-03-30T07:25:23Z

@jreback Is that to say checking for values and the mixing of int and float or just the checking for values?

jreback · 2021-03-30T10:39:27Z

checking values

JoshuaC3 · 2021-03-30T11:58:44Z

And any change in heart on the int+float thresh based on this:
#35299 (comment)

or should I propose another API change?

jreback · 2021-03-30T12:24:56Z

no the possible api change is orthogonal

erfannariman added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 15, 2020

erfannariman changed the title ~~ENH: add perc to DataFrame.dropna~~ ENH: add percentage threshold to DataFrame.dropna Jul 15, 2020

erfannariman mentioned this issue Jul 15, 2020

ENH: add percentage threshold to dropna #35300

Closed

5 tasks

mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 16, 2020

simonjayhawkins added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate API Design labels Jul 28, 2020

erfannariman closed this as completed Sep 29, 2020

jreback mentioned this issue Mar 29, 2021

ENH: Allow dropna to accept floats [0, 1] as thresh values #40676

Open

ENH: add percentage threshold to DataFrame.dropna #35299

ENH: add percentage threshold to DataFrame.dropna #35299

Comments

erfannariman commented Jul 15, 2020 • edited Loading

mroeschke commented Jul 16, 2020

erfannariman commented Jul 16, 2020

mroeschke commented Jul 16, 2020

simonjayhawkins commented Jul 16, 2020

Is your feature request related to a problem?

Describe the solution you'd like

API breaking implications

Describe alternatives you've considered

Additional context

simonjayhawkins commented Jul 16, 2020

erfannariman commented Jul 16, 2020 • edited Loading

WillAyd commented Jul 21, 2020

erfannariman commented Jul 22, 2020

simonjayhawkins commented Jul 22, 2020

erfannariman commented Jul 22, 2020 • edited Loading

simonjayhawkins commented Jul 22, 2020

erfannariman commented Jul 27, 2020

simonjayhawkins commented Jul 27, 2020

mroeschke commented Jul 27, 2020 • edited Loading

simonjayhawkins commented Jul 28, 2020

erfannariman commented Jul 28, 2020

simonjayhawkins commented Jul 29, 2020

erfannariman commented Aug 4, 2020

jbrockmendel commented Sep 11, 2020

erfannariman commented Sep 13, 2020 • edited Loading

jbrockmendel commented Sep 13, 2020

erfannariman commented Sep 13, 2020

erfannariman commented Sep 15, 2020

jreback commented Sep 15, 2020

erfannariman commented Sep 15, 2020

jreback commented Sep 16, 2020

erfannariman commented Sep 16, 2020 • edited Loading

erfannariman commented Sep 25, 2020

WillAyd commented Sep 29, 2020

erfannariman commented Sep 29, 2020

JoshuaC3 commented Mar 29, 2021

JoshuaC3 commented Mar 29, 2021

jreback commented Mar 29, 2021

erfannariman commented Mar 29, 2021 • edited Loading

jreback commented Mar 29, 2021

JoshuaC3 commented Mar 29, 2021

JoshuaC3 commented Mar 29, 2021 • edited Loading

jreback commented Mar 29, 2021

JoshuaC3 commented Mar 30, 2021

jreback commented Mar 30, 2021

JoshuaC3 commented Mar 30, 2021

jreback commented Mar 30, 2021

erfannariman commented Jul 15, 2020 •

edited

Loading

erfannariman commented Jul 16, 2020 •

edited

Loading

erfannariman commented Jul 22, 2020 •

edited

Loading

mroeschke commented Jul 27, 2020 •

edited

Loading

erfannariman commented Sep 13, 2020 •

edited

Loading

erfannariman commented Sep 16, 2020 •

edited

Loading

erfannariman commented Mar 29, 2021 •

edited

Loading

JoshuaC3 commented Mar 29, 2021 •

edited

Loading