Skip to content

DEPR: Dropping nuisance columns in core.apply #43740

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 tasks
rhshadrach opened this issue Sep 25, 2021 · 10 comments
Closed
2 tasks

DEPR: Dropping nuisance columns in core.apply #43740

rhshadrach opened this issue Sep 25, 2021 · 10 comments
Labels
API - Consistency Internal Consistency of API/Behavior Apply Apply, Aggregate, Transform, Map Deprecate Functionality to remove in pandas Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply

Comments

@rhshadrach
Copy link
Member

rhshadrach commented Sep 25, 2021

Ref: #41475, #39993 (comment)

To keep the API consistent between groupby / non-groupby UDF methods (apply, agg, transform), we should not allow silent failure in core.apply.

@rhshadrach rhshadrach added Deprecate Functionality to remove in pandas Apply Apply, Aggregate, Transform, Map API - Consistency Internal Consistency of API/Behavior labels Sep 25, 2021
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Sep 25, 2021
@rhshadrach rhshadrach changed the title API: Dropping nuisance columns in NDFrame.agg, transform, apply API: Dropping nuisance columns in core.apply Sep 25, 2021
@rhshadrach rhshadrach changed the title API: Dropping nuisance columns in core.apply DEPR: Dropping nuisance columns in core.apply Sep 25, 2021
@rhshadrach rhshadrach added the Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply label Sep 25, 2021
@emirkmo
Copy link

emirkmo commented Oct 13, 2021

Can I ask why this and other silent drops are being deprecated? It was a major convenience function for working with mixed data. Maybe it would make sense to add an override of some sort or a way to simply select and apply the function on valid rows? So instead of existing code breaking without any help provided, provide a drop_non_numeric() method that does the same, and reference it in the depreciation message, so it's a simple fix to old code. Or allow an ignore parameter like old scipy code, where nan_policy could be set to raise or ignore, so for pandas allow passing in, drop_nuisance = True? Sorry if this is not the place to post this. I sometimes have a hard time finding the right place.

As it stands, its not always trivial to prune arrays in old code, and especially to replicate previous behavior (where certain dtypes were pruned and others weren't).

Please let me know if there is better place to voice this concern and I will do so. Cheers!

@rhshadrach
Copy link
Member Author

Thanks for the feedback here @emirkmo.

Can I ask why this and other silent drops are being deprecated?

Others may have different/additional reasons. For me, when it comes to methods such as agg and apply that can take a user-defined function, how can pandas tells when a column is a nuisance column? Currently, we essentially define a "nuisance column" as one that raises a TypeError. This can lead to silent (unintended) failure and odd behaviors.

From a user standpoint, I think a user should be surprised if the following code does not result in a DataFrame with index 'a' and columns 'b' and 'c'.

df[['a', 'b', 'c']].groupby('a').sum()

With silent drops, that may not be the case. Internally, not being able to rely on an output shape gives rise to edge cases (e.g. all columns are nuisance columns) and is a source of bugs.

@emirkmo
Copy link

emirkmo commented Oct 18, 2021

Thanks for the response, the reasons for dropping make a lot of sense. It seems like there is no ideological opposition to adding a method to keep the current behavior as an unsupported (depreciation notice or warning) compatibility compromise? It has existed for so long that it has become part of the expected output of Pandas operations. Many codes rely on it.

Now every single pandas read in literally hundreds of scripts just in my case will require me to go back and add in often complicated logic for dropping certain dtypes. Having a convenience function for backward compatibility doesn't seem like a bad idea? At least to extend the transition period.

Do you think I should put this an issue? I would appreciate your feedback.

@CAM-Gerlach
Copy link
Contributor

I'm also not sure if this is the place (happy to open an issue), but I concur with @emirkmo ; this seems likely to break a large amount of existing user code (including a substantial amount of my own) without offering (or at least clearly describing) a straightforward alternative that replicates the previous behavior for the many use-cases that rely on it.

For example, I heavily rely on this behavior when doing feature extraction/generation for ML, to automatically generate means, sums, and custom aggregated metrics only on input columns where these operations are numerically sensible. Without this feature, I'd have to manually filter the data to just columns with supported data types before every call, which is tedious, verbose and potentially less robust and performant than the current BTAFTP approach.

Similarly, I'll often use groupby and apply including as part of interactive data exploration, sometimes even on-demand with other members of my research team, which gives me a quick way of summarizing more specific data and columns, and running "crosstab"-style analyses on various factors. This would make it far more time-consuming and high-friction to use particularly in an interactive setting, greatly slowing down the feedback loop for data exploration.

Is it possible to add a parameter as part of the API...somewhere...to select the desired behavior, instead of completely removing useful functionality?

@emirkmo
Copy link

emirkmo commented Oct 26, 2021

I hope you can provide some guidance to us @rhshadrach on whether it is appropriate to open up an issue to track dropping of nuisance columns not just in core.apply but elsewhere as well. At least, this sort of silent dropping is extremely useful in many contexts and to my recollection was widely relied upon when Pandas first became popular back in 2012(?).. In the past, one would have often have to manually clean data each time, arduously, before loading it into say a numpy recarray. Pandas opened up the possibility to just start working with mixed data without necessarily investing a huge amount of time pre-cleaning, as was necessary using other existing python solutions.

It's sad to see what was a selling point or at least widely relied upon functionality dropped, even though I understand that inconsistent blackbox behavior like this is heavily frowned upon. It's still extremely helpful for data analysis contexts.

@rhshadrach
Copy link
Member Author

Thanks for the feedback here @CAM-Gerlach and @emirkmo. Does

df.select_dtypes(exclude='object')

satisfy the needs here?

@CAM-Gerlach
Copy link
Contributor

Thanks @rhshadrach . It covers one common case, that of strings and other objects with the most common built-in aggregation functions, but doesn't cover the many other numpy/pandas datetypes that are not compatible with such functions (datetimes, timedeltas, periods, categoricals, fixed-length strings, etc), and neither does it cover other and custom aggregation functions, which may expect different data types or have different constraints.

df.select_dtypes(include=["number", "bool"]) should more or less handle the first set of cases much more completely than the initial suggestion (AFAIK, but there may still be cases I didn't think of that it doesn't cover), but it doesn't address the second, and still substantially increases code verbosity and required complexity (particularly for quick data exploration, one of the primary use cases of groupby) while increasing fragility. And of course, users need to know and memorize it, which gets to be a lot of boilerplate for what was previously a relatively simple and elegant invocation.

@rhshadrach
Copy link
Member Author

rhshadrach commented Nov 7, 2021

and still substantially increases code verbosity and required complexity

An example where it substantially increases code verbosity would be helpful. I see it as a single additional call (column selection in some form), and your comment above suggests to me this might be mistaken.

which gets to be a lot of boilerplate for what was previously a relatively simple and elegant invocation

By ignoring any TypeError that occurs, we are guessing that the user didn't want to aggregate a particular column. This is especially true with user-provided functions (as opposed to e.g. "mean"), where a TypeError could be an issue with the function implementation rather than desired behavior. Having users tell us what they want to aggregate avoids silent failure, which can be hard to debug. I agree this puts more onus on the user, which when viewed alone in isolation is negative, but I would not describe this as boilerplate.

@jbrockmendel
Copy link
Member

@rhshadrach what's the status here? the two PRs linked in the OP are both merged

@rhshadrach
Copy link
Member Author

@jbrockmendel - I believe I left this open because there was some opposition to the changes. I'm still behind the deprecation, and it seems like conversation has ceased. I'm going to close for now, and can reopen if anyone would like to continue the discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Apply Apply, Aggregate, Transform, Map Deprecate Functionality to remove in pandas Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply
Projects
None yet
Development

No branches or pull requests

4 participants