-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DEPR: Dropping nuisance columns in core.apply #43740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Can I ask why this and other silent drops are being deprecated? It was a major convenience function for working with mixed data. Maybe it would make sense to add an override of some sort or a way to simply select and apply the function on valid rows? So instead of existing code breaking without any help provided, provide a drop_non_numeric() method that does the same, and reference it in the depreciation message, so it's a simple fix to old code. Or allow an ignore parameter like old scipy code, where nan_policy could be set to raise or ignore, so for pandas allow passing in, drop_nuisance = True? Sorry if this is not the place to post this. I sometimes have a hard time finding the right place. As it stands, its not always trivial to prune arrays in old code, and especially to replicate previous behavior (where certain dtypes were pruned and others weren't). Please let me know if there is better place to voice this concern and I will do so. Cheers! |
Thanks for the feedback here @emirkmo.
Others may have different/additional reasons. For me, when it comes to methods such as From a user standpoint, I think a user should be surprised if the following code does not result in a DataFrame with index 'a' and columns 'b' and 'c'.
With silent drops, that may not be the case. Internally, not being able to rely on an output shape gives rise to edge cases (e.g. all columns are nuisance columns) and is a source of bugs. |
Thanks for the response, the reasons for dropping make a lot of sense. It seems like there is no ideological opposition to adding a method to keep the current behavior as an unsupported (depreciation notice or warning) compatibility compromise? It has existed for so long that it has become part of the expected output of Pandas operations. Many codes rely on it. Now every single pandas read in literally hundreds of scripts just in my case will require me to go back and add in often complicated logic for dropping certain dtypes. Having a convenience function for backward compatibility doesn't seem like a bad idea? At least to extend the transition period. Do you think I should put this an issue? I would appreciate your feedback. |
I'm also not sure if this is the place (happy to open an issue), but I concur with @emirkmo ; this seems likely to break a large amount of existing user code (including a substantial amount of my own) without offering (or at least clearly describing) a straightforward alternative that replicates the previous behavior for the many use-cases that rely on it. For example, I heavily rely on this behavior when doing feature extraction/generation for ML, to automatically generate means, sums, and custom aggregated metrics only on input columns where these operations are numerically sensible. Without this feature, I'd have to manually filter the data to just columns with supported data types before every call, which is tedious, verbose and potentially less robust and performant than the current BTAFTP approach. Similarly, I'll often use groupby and apply including as part of interactive data exploration, sometimes even on-demand with other members of my research team, which gives me a quick way of summarizing more specific data and columns, and running "crosstab"-style analyses on various factors. This would make it far more time-consuming and high-friction to use particularly in an interactive setting, greatly slowing down the feedback loop for data exploration. Is it possible to add a parameter as part of the API...somewhere...to select the desired behavior, instead of completely removing useful functionality? |
I hope you can provide some guidance to us @rhshadrach on whether it is appropriate to open up an issue to track dropping of nuisance columns not just in core.apply but elsewhere as well. At least, this sort of silent dropping is extremely useful in many contexts and to my recollection was widely relied upon when Pandas first became popular back in 2012(?).. In the past, one would have often have to manually clean data each time, arduously, before loading it into say a numpy recarray. Pandas opened up the possibility to just start working with mixed data without necessarily investing a huge amount of time pre-cleaning, as was necessary using other existing python solutions. It's sad to see what was a selling point or at least widely relied upon functionality dropped, even though I understand that inconsistent blackbox behavior like this is heavily frowned upon. It's still extremely helpful for data analysis contexts. |
Thanks for the feedback here @CAM-Gerlach and @emirkmo. Does
satisfy the needs here? |
Thanks @rhshadrach . It covers one common case, that of strings and other objects with the most common built-in aggregation functions, but doesn't cover the many other numpy/pandas datetypes that are not compatible with such functions (datetimes, timedeltas, periods, categoricals, fixed-length strings, etc), and neither does it cover other and custom aggregation functions, which may expect different data types or have different constraints.
|
An example where it substantially increases code verbosity would be helpful. I see it as a single additional call (column selection in some form), and your comment above suggests to me this might be mistaken.
By ignoring any TypeError that occurs, we are guessing that the user didn't want to aggregate a particular column. This is especially true with user-provided functions (as opposed to e.g. "mean"), where a TypeError could be an issue with the function implementation rather than desired behavior. Having users tell us what they want to aggregate avoids silent failure, which can be hard to debug. I agree this puts more onus on the user, which when viewed alone in isolation is negative, but I would not describe this as boilerplate. |
@rhshadrach what's the status here? the two PRs linked in the OP are both merged |
@jbrockmendel - I believe I left this open because there was some opposition to the changes. I'm still behind the deprecation, and it seems like conversation has ceased. I'm going to close for now, and can reopen if anyone would like to continue the discussion. |
Ref: #41475, #39993 (comment)
To keep the API consistent between groupby / non-groupby UDF methods (apply, agg, transform), we should not allow silent failure in
core.apply
.The text was updated successfully, but these errors were encountered: