-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DEPR: remove use of nan_as_null
from callers of __dataframe__
#54846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
No one was using it yet and it seemed easier to clean it up from libraries than to ask everyone to add support for it - see data-apis/dataframe-api#125.
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @rgommers
EDIT: please disregard my approval, I'll come back to this next week - sorry for having been too fast on this one
pandas/core/frame.py
Outdated
Whether to tell the DataFrame to overwrite null values in the data | ||
with ``NaN`` (or ``NaT``). | ||
`nan_as_null` is DEPRECATED and has no effect. Please avoid using | ||
it; it will be removed in a future release (after Aug 2024). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it; it will be removed in a future release (after Aug 2024). | |
it; it will be removed in a future release (after Aug 2024). | |
.. deprecated:: 2.2.0 |
pandas/core/frame.py
Outdated
Whether to tell the DataFrame to overwrite null values in the data | ||
with ``NaN`` (or ``NaT``). | ||
`nan_as_null` is DEPRECATED and has no effect. Please avoid using | ||
it; it will be removed in a future release (after Aug 2024). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The date is wrong, this will be removed in 3.0 which will happen before that.
Also: This needs a FutureWarning if a value is passed by the user and we need tests for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent here was to keep it around for a year in all dataframe libraries, to avoid unnecessary breakage. This works best if we remove usage of the keyword from all callers of __dataframe__
now, and then from the signature a year later.
Is there a reason to not do this? Removing it sooner seems to not have any upsides, only more disruption. I can remove the word "deprecated" from the PR if it bothers you?
This needs a FutureWarning
This is not a good idea. The end user can do nothing about this. This is a between-dataframes communication mechanism, so emitting user-facing warnings has no purpose.
It's also not a behavior change but a removal of the keyword without any observable changes to how __dataframe__
behavior, so if a warning were needed it should be a deprecation warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't do API breaking changes in minor versions, so we can remove it in 3.0 or 4.0 (I don't have a preference), but not in a minor version in between (the date and lack of a warning bothers me, deprecated itself is fine).
You can start with a DeprecationWarning
, that is fine, but it needs a FutureWarning
at some point before the keyword is actually removed.
It is part of our API docs (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.__dataframe__.html), so there is no guarantee that it is only used by library developers and we have to consider this a public method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey all
I've looked at this a bit more - it's true that technically it may be meant to only be used between dataframe libraries in from_dataframe
, but:
- as Patrick noted, it's part of the pandas API docs (and was added there as part of the PR which introduced the interchange protocol)
- there's a blog post which uses
DataFrame.__dataframe__
directy: https://www.thomasjpfan.com/2023/05/accessing-data-from-pythons-dataframe-interchange-protocol/
So, I'd suggest treating this as any other public-facing pandas method, and going through the usual deprecation process
In any case, I've had a look at https://github.com/search?q=__dataframe__%28&type=code&p=1, and it doesn't look like there's many users of __dataframe__
anyway, and nobody seems to be passing anything non-default values to nan_as_null
anyway. If we want to remove it, I'd suggest doing it earlier rather than later, before anyone gets the idea to try using it
This is not a good idea.
generic communication comment - could we not make such statements about maintainers' reviews please? thanks 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the too brief and too certain statement, I should have written "I don't think it is a good idea to emit ...".
I will note that the whole premise of the upstream change and the proposal by @jorisvandenbossche at data-apis/dataframe-api#125 was that it can be removed because no one is using has implemented support for it - and hence that that is quick/safe/silent. A warning now would be really disruptive - so if you insist that the change cannot be made without one, it'd be better to close this PR and revisit it once all other dataframe libraries have stopped passing in this keyword in say 6-9 months.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
Yeah I think that might be best to be honest - remove it from implementations and the official spec, and then do a proper deprecation cycle in pandas. There's a general desire in pandas to be stricter about the process around breaking changes now that there's a yearly release cycle, and I don't think that making an exception for __dataframe__
would be seen too well
it'd be better to close this PR
I don't think it needs closing - I think we can already note that the argument is deprecated and will be removed, and then, once the above has been completed, we follow the pandas deprecation process. So as this lives in pandas, then for better or for worse, that is what we're bound by
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phofl gentle ping - OK to get this docs change in now and then do the usual deprecation once other implementations have been updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, we do deviate from our standard backwards compatibility / deprecation policies for experimental features. While we didn't explicitly label this one as such (unfortunately), I think we could decide to treat this method as experimental, though.
(we also already noted in our public docstring that this keyword has no effect)
To be clear, I also don't mind doing a longer deprecation cycle if we really want to, on our side. In the end it's just keeping some code that does nothing around for a bit longer, so it's not that this takes much effort to do. But I think we can also just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the actual diff of the PR now: I think this is indeed fine to merge as is (maybe just remove the explicit date when it will be removed, for now). All it does is update the documentation to be clearer that this keyword doesn't do anything and is deprecated, and clean up the internals to already remove it internally (which has no user visible impact).
pandas/core/frame.py
Outdated
Whether to tell the DataFrame to overwrite null values in the data | ||
with ``NaN`` (or ``NaT``). | ||
`nan_as_null` is DEPRECATED and has no effect. Please avoid using | ||
it; it will be removed in a future release (after Aug 2024). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey all
I've looked at this a bit more - it's true that technically it may be meant to only be used between dataframe libraries in from_dataframe
, but:
- as Patrick noted, it's part of the pandas API docs (and was added there as part of the PR which introduced the interchange protocol)
- there's a blog post which uses
DataFrame.__dataframe__
directy: https://www.thomasjpfan.com/2023/05/accessing-data-from-pythons-dataframe-interchange-protocol/
So, I'd suggest treating this as any other public-facing pandas method, and going through the usual deprecation process
In any case, I've had a look at https://github.com/search?q=__dataframe__%28&type=code&p=1, and it doesn't look like there's many users of __dataframe__
anyway, and nobody seems to be passing anything non-default values to nan_as_null
anyway. If we want to remove it, I'd suggest doing it earlier rather than later, before anyone gets the idea to try using it
This is not a good idea.
generic communication comment - could we not make such statements about maintainers' reviews please? thanks 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks all for the discussion - going back to the 'approval' state then (conditional on addressing @lithomas1 's comment)
@phofl you OK with merging this as-is, with the understanding that formal deprecation cycle will be done later before the argument is actually removed?
Regarding the date mentioned - to be honest I'm OK with it, it just says it'll happen at some point after that date, not that it'll happen on that date
pandas/core/frame.py
Outdated
Whether to tell the DataFrame to overwrite null values in the data | ||
with ``NaN`` (or ``NaT``). | ||
`nan_as_null` is DEPRECATED and has no effect. Please avoid using | ||
it; it will be removed in a future release (after Aug 2024). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
Yeah I think that might be best to be honest - remove it from implementations and the official spec, and then do a proper deprecation cycle in pandas. There's a general desire in pandas to be stricter about the process around breaking changes now that there's a yearly release cycle, and I don't think that making an exception for __dataframe__
would be seen too well
it'd be better to close this PR
I don't think it needs closing - I think we can already note that the argument is deprecated and will be removed, and then, once the above has been completed, we follow the pandas deprecation process. So as this lives in pandas, then for better or for worse, that is what we're bound by
Co-authored-by: Joris Van den Bossche <[email protected]>
@phofl gentle ping (your review is blocking the pr) let's get the text in, and if/when we start a formal deprecation cycle we can add the |
Fine by me if we deprecated properly in the future (I don't care when we start the process) |
Thanks - let's get this in then if/when we start the deprecation process we can add adding either now might result in something accidentally enforcing the deprecation in 3.0 |
No one was using it yet and it seemed easier to clean it up from libraries than to ask everyone to add support for it - see data-apis/dataframe-api#125.