Skip to content

ENH: add percentage threshold to DataFrame.dropna #35299

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
erfannariman opened this issue Jul 15, 2020 · 42 comments
Closed

ENH: add percentage threshold to DataFrame.dropna #35299

erfannariman opened this issue Jul 15, 2020 · 42 comments
Labels
API Design Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action

Comments

@erfannariman
Copy link
Member

erfannariman commented Jul 15, 2020

Is your feature request related to a problem?
When doing feature selection for models, or with handling missing data, in most cases you want to dynamically set a threshold to drop column which exceed this. A percentage is a more general metric to use to see how many data is missing data per row or per column. Right now we have the thresh argument, which accepts an int, but in most cases this is not desirable. For example when running the code in production, the shape of the data changes over time and thus a percentage threshold would make more sense to me. Or when you don't know the data well enough, what does "a number" mean in this case. Percentage is standarized.

Besides that, the thresh right now is not flexible to go over column or index axis. For data cleaning you might want to drop rows which exceed the threshold, but for feature selection, you want to treat each column separately.

Describe the solution you'd like
Have a percentage threshold for both column and index axis.

API breaking implications
Not that I can think of, but not 100% sure.

Describe alternatives you've considered
Writing own custom functions.

Additional context

Locally I created a version where the perc argument is created (see linked draft PR):

example 1:

>>> df = pd.DataFrame({'col': ["A", "A", "B", "B"],
...                    'A': [80, np.nan, np.nan, np.nan],
...                    'B': [80, np.nan, 76, 67]})
>>> df
  col     A     B
0   A  80.0  80.0
1   A   NaN   NaN
2   B   NaN  76.0
3   B   NaN  67.0

>>> df.dropna(perc=0.5)
  col     A     B
0   A  80.0  80.0
2   B   NaN  76.0
3   B   NaN  67.0

>>> df.dropna(perc=0.5, axis=1)
  col     B
0   A  80.0
1   A   NaN
2   B  76.0
3   B  67.0

example 2:

>>> df = pd.DataFrame(np.random.randint(1, 10, (10,4)), columns=list('ABCD'))
>>> df.loc[3, 'B':] = np.nan
>>> df.loc[8, :'B'] = np.nan
>>> df.loc[1:6, 'C'] = np.nan
>>> df
     A    B    C    D
0  7.0  2.0  3.0  7.0
1  3.0  2.0  NaN  9.0
2  2.0  2.0  NaN  2.0
3  7.0  NaN  NaN  NaN 
4  1.0  9.0  NaN  4.0
5  9.0  9.0  NaN  1.0
6  2.0  2.0  NaN  6.0
7  9.0  3.0  5.0  6.0
8  NaN  NaN  9.0  5.0
9  7.0  7.0  3.0  1.0

As we can see, index 3 (axis=0) has 75% missing values and column C (axis=1) has 60% missing values. With the percentage argument we can specify what the threshold is, but also consider row wise or column wise:

Example consider per row:

>>> df.dropna(perc=.4, axis=0)
     A    B    C    D
0  7.0  2.0  3.0  7.0
1  3.0  2.0  NaN  9.0
2  2.0  2.0  NaN  2.0
4  1.0  9.0  NaN  4.0
5  9.0  9.0  NaN  1.0
6  2.0  2.0  NaN  6.0
7  9.0  3.0  5.0  6.0
9  7.0  7.0  3.0  1.0

Above we can see row 3 and 8 got dropped because these had > 40% missing values.

>>> df.dropna(perc=.4, axis=1)
     A    B    D
0  7.0  2.0  7.0
1  3.0  2.0  9.0
2  2.0  2.0  2.0
3  7.0  NaN  NaN
4  1.0  9.0  4.0
5  9.0  9.0  1.0
6  2.0  2.0  6.0
7  9.0  3.0  6.0
8  NaN  NaN  5.0
9  7.0  7.0  1.0

Same command but with axis=1, so we can consider percentage threshold per column, and we see that column C got dropped because it had 60% missing values

@erfannariman erfannariman added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 15, 2020
@erfannariman erfannariman changed the title ENH: add perc to DataFrame.dropna ENH: add percentage threshold to DataFrame.dropna Jul 15, 2020
@mroeschke
Copy link
Member

This generally would just be equivalent to df.dropna(thresh=round(len(df) * percentage)) correct?

@erfannariman
Copy link
Member Author

This generally would just be equivalent to df.dropna(thresh=round(len(df) * percentage)) correct?

No not really, this method can be applied over row or column axis and thus takes each row / column separately into account. @mroeschke

@mroeschke
Copy link
Member

I'd be +0 if we could roll this into thresh instead of adding a new parameter.

Let's wait for opinions from other members of the core team

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 16, 2020
@simonjayhawkins
Copy link
Member

Thanks @erfannariman for the report. I would agree with @mroeschke without a more detailed proposal/discussion. Can you update the OP using the Feature Request template instead. (available when raising issue but also included in details below)

Is your feature request related to a problem?

[this should provide a description of what the problem is, e.g. "I wish I could use pandas to do [...]"]

Describe the solution you'd like

[this should provide a description of the feature request, e.g. "DataFrame.foo should get a new parameter bar that [...]", try to write a docstring for the desired feature]

API breaking implications

[this should provide a description of how this feature will affect the API]

Describe alternatives you've considered

[this should provide a description of any alternative solutions or features you've considered]

Additional context

[add any other context, code examples, or references to existing implementations about the feature request here]

# Your code here, if applicable

@simonjayhawkins
Copy link
Member

I'd be +0 if we could roll this into thresh instead of adding a new parameter.

thres currently accepts an int. would allowing a float here make sense or would that lead to a more confusing api, with 1 having a different meaning to 1.0

@erfannariman
Copy link
Member Author

erfannariman commented Jul 16, 2020

Thanks @erfannariman for the report. I would agree with @mroeschke without a more detailed proposal/discussion. Can you update the OP using the Feature Request template instead.

Yes, updated OP @simonjayhawkins , this idea came up late at night and just started writing without a proper ticket request, sorry.

thres currently accepts an int. would allowing a float here make sense or would that lead to a more confusing api, with 1 having a different meaning to 1.0

IMO this wouldn't make much sense, first of all I think this would be confusing, second when setting a threshold I think it's important that rows or columns can be threated and also important that they can be treated separately.

@WillAyd
Copy link
Member

WillAyd commented Jul 21, 2020

This generally would just be equivalent to df.dropna(thresh=round(len(df) * percentage)) correct?

No not really, this method can be applied over row or column axis and thus takes each row / column separately into account. @mroeschke

Can you provide an example where this distinction matters?

@erfannariman
Copy link
Member Author

@WillAyd sure, I added a more extensive example in OP, see example 2.

@simonjayhawkins
Copy link
Member

@WillAyd sure, I added a more extensive example in OP, see example 2.

I still don't see how this example means that the thresh parameter cannot be used. thresh also uses the axis parameter, it's just an absolute discrete number instead of a ratio to the axis length.

so why would accepting a float between 0.0 and 1.0 (0.4 in the example) not produce the same output?

the length of axis 1 is 4. 0.6*4 = 2.4 (rounded to 3)
the length of axis 0 is 10. 0.6*10 = 6

produces the same output.

>>> df.dropna(thresh=3, axis=0)
     A    B    C    D
0  5.0  7.0  4.0  5.0
1  2.0  3.0  NaN  3.0
2  9.0  8.0  NaN  5.0
4  2.0  6.0  NaN  4.0
5  6.0  6.0  NaN  8.0
6  4.0  3.0  NaN  9.0
7  7.0  4.0  3.0  2.0
9  3.0  3.0  7.0  2.0
>>>
>>> df.dropna(thresh=6, axis=1)
     A    B    D
0  5.0  7.0  5.0
1  2.0  3.0  3.0
2  9.0  8.0  5.0
3  9.0  NaN  NaN
4  2.0  6.0  4.0
5  6.0  6.0  8.0
6  4.0  3.0  9.0
7  7.0  4.0  2.0
8  NaN  NaN  2.0
9  3.0  3.0  2.0
>>>

second when setting a threshold I think it's important that rows or columns can be threated and also important that they can be treated separately.

so I still don't understand the distinction, with the added example.

@erfannariman
Copy link
Member Author

erfannariman commented Jul 22, 2020

Basically in your message you are saying two things, so I will react on both of them:

I still don't see how this example means that the thresh parameter cannot be used. thresh also uses the axis parameter, it's just an absolute discrete number instead of a ratio to the axis length.

This can still be used and in my opinion this should not be deprecated, but there are situations (and I find myself more in that situation) where defining a absolute discrete number is not preferable, since the shape of your data changes, so a absolute number does not say anything and you want a standarized metric, which is the percentage.


so why would accepting a float between 0.0 and 1.0 (0.4 in the example) not produce the same output?

Technically this does produce the same output, but I think it would be confusing to have the same argument accept values which cover different functionality, in this case hard discrete values, vs a proportion.

So then we would need to deprecate one, so make the choice between the two stated above in bold. And that means it would be API breaking. So in my opinion, adding a percentage argument would make it more distinct for the user in which scenario to use what argument.

@simonjayhawkins

@simonjayhawkins
Copy link
Member

The discussion is about the api for adding the requested functionality. I'm +1 on adding that functionality. I don't think there has been any suggestion of deprecation so far.

The crux of the discussion that we should exhaust other options before adding new keywords.

One option suggested is adding more utiltiy to the thresh parameter.

from the docs, http://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.dropna.html

thresh : int, optional
Require that many non-NA values.

so an option could be to define thresh something like

thresh : int or float, optional
Require that many non-NA values if int or proportion of total if float (must be between 0.0 and 1.0)

at first this appears to be a non-breaking api change, but could be confusing.

However, the undocumented behaviour is that thresh currently accepts a float. (so I actually didn't need to round in the example I gave before)

so this is not a great solution but has to be considered against the alternative of adding a new keyword.

There is also the how parameter. This currently accepts a string and can currently be any or all.

another option is that this parameter could take a percentage as a string, e.g. '40%'

specifying numerical values as strings is not great either, but the percentage maybe fits this parameter better, where 'any' could be an alias for '0%' and 'all' an alias for '100%' (or vice-versa)

stepping back a bit, maybe the api is already confusing because we have two parameters, thresh and how controlling the criteria for dropping rows or columns. These parameters should not be specified together, so maybe there should have just been one parameter for this.

@erfannariman
Copy link
Member Author

thresh : int or float, optional
Require that many non-NA values if int or proportion of total if float (must be between 0.0 and 1.0)

I would not be +1 on this, it would be confusing to add multiple functionality's in a single argument where the only difference for the user is if it's a int or float. I still think by letting the thresh argument be and add perc or percentage would be more user friendly and make a clear distinguishment when to use what.

specifying numerical values as strings is not great either, but the percentage maybe fits this parameter better, where 'any' could be an alias for '0%' and 'all' an alias for '100%' (or vice-versa)

Not sure if I understand what you mean, wouldn't any be > 0%? I actually liked the how argument as a user and it wasn't really confusing. But I would like to hear more opinions about this matter.

@simonjayhawkins
Copy link
Member

@mroeschke

I'd be +0 if we could roll this into thresh instead of adding a new parameter.

after some discussion, using the thresh parameter may lead to confusion.

IIUC the result from example 2 in the OP can be easily achieved without the addition of additional parameters.

>>> # df.dropna(perc=.4, axis=0)
>>> df.dropna(thresh=(1 - 0.4) * len(df.columns))
     A    B    C    D
0  3.0  8.0  3.0  7.0
1  8.0  8.0  NaN  6.0
2  7.0  7.0  NaN  2.0
4  8.0  5.0  NaN  1.0
5  2.0  7.0  NaN  5.0
6  1.0  3.0  NaN  3.0
7  1.0  3.0  5.0  4.0
9  6.0  6.0  2.0  6.0
>>>
>>> # df.dropna(perc=.4, axis=1)
>>> df.dropna(axis=1, thresh=(1 - 0.4) * len(df))
     A    B    D
0  3.0  8.0  7.0
1  8.0  8.0  6.0
2  7.0  7.0  2.0
3  2.0  NaN  NaN
4  8.0  5.0  1.0
5  2.0  7.0  5.0
6  1.0  3.0  3.0
7  1.0  3.0  4.0
8  NaN  NaN  6.0
9  6.0  6.0  6.0
>>>

@mroeschke
Copy link
Member

mroeschke commented Jul 27, 2020

Gotcha. Yeah I can see how that may lead to confusion.

IIUC the result from example 2 in the OP can be easily achieved without the addition of additional parameters.

Right that's what I was referencing in #35299 (comment), which why I wasn't too enthusiastic to add a separate perc parameter as a shorthand for that.

@simonjayhawkins simonjayhawkins added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate API Design labels Jul 28, 2020
@simonjayhawkins
Copy link
Member

from %prun?

-l <limit>
  you can place restrictions on what or how much of the
  profile gets printed. The limit value can be:

     * A string: only information for function names containing this string
       is printed.

     * An integer: only these many lines are printed.

     * A float (between 0 and 1): this fraction of the report is printed
       (for example, use a limit of 0.4 to see the topmost 40% only).

@erfannariman
Copy link
Member Author

Where do we draw the line of how "user friendly" pandas has to be and how much we want to abstract away? Because following the logic of df.dropna(axis=1, thresh=(1 - 0.4) * len(df)), we could also apply the same for Series.mean for example, because that is the same as Series.sum / len(df).

from %prun?

Not sure if I understand this comment, could you elaborate? @simonjayhawkins

@simonjayhawkins
Copy link
Member

Not sure if I understand this comment, could you elaborate? @simonjayhawkins

it was just to show an example of a parameter taking an int and a float to represent a number or percentage.

Where do we draw the line of how "user friendly" pandas has to be and how much we want to abstract away? Because following the logic of df.dropna(axis=1, thresh=(1 - 0.4) * len(df)), we could also apply the same for Series.mean for example, because that is the same as Series.sum / len(df).

Agreed. adding the functionality is a good idea. We just need to make sure the api design is also good.

@erfannariman
Copy link
Member Author

Agreed. adding the functionality is a good idea. We just need to make sure the api design is also good.

Sounds good, what is the way forward from here? Do we need other devs to share their opinion?

@jbrockmendel
Copy link
Member

could "winsorize" be the appropriate keyword/name?

@erfannariman
Copy link
Member Author

erfannariman commented Sep 13, 2020

Would that be as replacement for "thresh"?

@jbrockmendel
Copy link
Member

Would that be as replacement for "thresh"?

Thats what I had in mind, yes.

@erfannariman
Copy link
Member Author

I would still be in favor of adding a new perc argument, this needs deprecation, besides the other arguments I gave above about more clear distinguishment than adding a float or int.

@erfannariman
Copy link
Member Author

@jbrockmendel @simonjayhawkins is there a way to make more progress on this ticket? Would like to work on it, but the discussion above is not yet decided upon.

@jreback
Copy link
Contributor

jreback commented Sep 15, 2020

-1 on trying to combine number of value and per in the same parameter

-0 on adding perc- one more parameter is just more complexity and now u have 2 for the same thing

would be better to just doc this i think

not completely against - but did anything then should match sample parameter names

@erfannariman
Copy link
Member Author

Can you elaborate what you mean with "now u have 2 for the same thing". You mean when perc is added? But arn't they distinguished by the functionality? @jreback

@jreback
Copy link
Contributor

jreback commented Sep 16, 2020

you can already compute the perc if u wanted it's just a function of the length

so weighing this vs added keyword which is increasing complexity

@erfannariman
Copy link
Member Author

erfannariman commented Sep 16, 2020

I think your point boiles down to this comment I made above:

Where do we draw the line of how "user friendly" pandas has to be and how much we want to abstract away? Because following the logic of df.dropna(axis=1, thresh=(1 - 0.4) * len(df)), we could also apply the same for Series.mean for example, because that is the same as Series.sum / len(df).

@jreback

If more members don't see the usecase for this, I can close this issue. Although I still think this would be a useful future for the data science stack and specifically the cleaning part.

At the minimum I would like to add this to the docs.

@erfannariman
Copy link
Member Author

It's not really clear for me if we have a consensus how to move forward with this ENH. I would have time to finish this PR, or close it if the change is undesirable.

@WillAyd
Copy link
Member

WillAyd commented Sep 29, 2020

Based on the discussion above I don't see this getting a lot of traction. It's not necessarily that it is "undesirable" but rather that we have an already very large API and have a high bar for expanding that, which I'm not sure this meets

@erfannariman
Copy link
Member Author

I see and I can agree on that. Thanks everyone for taking the time to review. Will close this and the PR.

@JoshuaC3
Copy link

#40676

@JoshuaC3
Copy link

I think this is really useful, should be added. Here are some reasons why:

  1. I regularly but up against this but it has always been easier to split my method-chain and add: df.dropna(axis=1, thresh=0.4 * len(df)) than raise an Issue/EHN. Basically, I think it is being under-reported as a desire.

  2. df.dropna(axis=1, thresh=0.4 * len(df)) causes you to split a method-chain (which is very readable).

  3. df.dropna(axis=1, thresh=0.4 * len(df)) doesn't convert nicely to df.dropna(axis=0, thresh=0.4 * df.shape[0]) - in fact, this is the primary reason I raised this. I changed df.dropna(axis=0, thresh=0.4 * len(df)) (notice the len(df) and the axis=0 are not different dimension without thinking and got untold errors. It isn't blatantly obvious this was the error.

  4. There seems to be some concern around the int vs float arguments causing confusion. I would disagree with this. It is unlikely that you wish to drop all your data as with 1.0 float vs the single row 1 int. Additionally, 0.0 and 0 will behave the same, so no issue there. Anyone wanting to use a fraction/percentage/decimal clearly has the intuition and intention to do so, as does an int user. Adding the floats (0, 1.0) doesn't affect the current functionality/API. Finally, as @simonjayhawkins showed, it isn't uncommon to use types like this. ENH: add percentage threshold to DataFrame.dropna #35299 (comment)

In summary, I think it solves many more problems than it creates.

Finally, I think things should be as easy/thoughtless as possible for a user. In my humble opinion - compute is cheap, think isn't 😄

@jreback
Copy link
Contributor

jreback commented Mar 29, 2021

@JoshuaC3 pls read the discussion above. what does thresh=1 mean? that is the confusion. we cannot possibly mix this in as that will create a disaster. when I said new api, you can propose thread_pct or suggest deprecating thresh / rename in favor of a new argument. -1 on mixing these.

@erfannariman
Copy link
Member Author

erfannariman commented Mar 29, 2021

@JoshuaC3 pls read the discussion above. what does thresh=1 mean? that is the confusion. we cannot possibly mix this in as that will create a disaster. when I said new api, you can propose thread_pct or suggest deprecating thresh / rename in favor of a new argument. -1 on mixing these.

That would be the same as I proposed with adding the perc argument combined with axis=0 / axis=1 right? Which I still see useful btw. Mixing thresh and floats and ints would be -1, I agree.

@jreback
Copy link
Contributor

jreback commented Mar 29, 2021

right @erfannariman agree its the same as you proposed

about -0 in that (as adding additional keywords) but others can weigh in

we don't have percentage in any other methods (eg limit for fill a, rolling) so that's why i am especially resistant to this seemingly small change

@JoshuaC3
Copy link

I have read the discussion above and addressed this in point 4. Perhaps I wasn't clear enough about why the "1/1.0" pain-point is not an issue...

The current functionality of thresh is essentially: "bigger-thresh -> more dropping".

  1. Therefore, in the float context, thresh=0.0 is drop nothing. Exactly the same as thresh=0 in the int context.
  2. In the float context, thresh=1.0 is drop everything. I can't see why you would want to allow this anyway, you can just use drop. Therefore, we can set the floats to be (0, 1.0) in mathematical notation. That is, everything between 0 and 1 but not including 0 and 1 would be considered a float.
  3. It makes even less sense to have -0.3 or +1.3 as values passed. to thresh, which is currently allowed and does nothing. I'm not sure why or what this means... Therefore, the (0, 1) restriction makes total sense to me and should throw an error when a float outside of that range.

All this said and done, I don't really care if the API is thresh, thresh_pct, q like in .quantile() or xxxxx, it is the functionality that I find most useful and am very happy to change stance on this if it means it gets accepted!

@JoshuaC3
Copy link

JoshuaC3 commented Mar 29, 2021

df = pd.DataFrame(np.ones((5, 5)))

df.loc[1:1, 1:1] = np.nan

df.dropna(thresh=-12.4, axis=1) #??!! weird existing float behaviour
df.dropna(thresh=2.5, axis=1) #??!! weird existing float behaviour
df.dropna(thresh=4, axis=0)

Thanks all

@jreback
Copy link
Contributor

jreback commented Mar 29, 2021

@JoshuaC3 obviously we should check for invalid values, happy to take a PR for that.

@JoshuaC3
Copy link

@jreback Is that to say checking for values and the mixing of int and float or just the checking for values?

@jreback
Copy link
Contributor

jreback commented Mar 30, 2021

checking values

@JoshuaC3
Copy link

And any change in heart on the int+float thresh based on this:
#35299 (comment)

or should I propose another API change?

@jreback
Copy link
Contributor

jreback commented Mar 30, 2021

no the possible api change is orthogonal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants