Skip to content

DEPR: remove use of nan_as_null from callers of __dataframe__ #54846

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 17, 2023

Conversation

rgommers
Copy link
Contributor

No one was using it yet and it seemed easier to clean it up from libraries than to ask everyone to add support for it - see data-apis/dataframe-api#125.

No one was using it yet and it seemed easier to clean it up
from libraries than to ask everyone to add support for it - see
data-apis/dataframe-api#125.
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Aug 29, 2023

pre-commit.ci autofix

@MarcoGorelli MarcoGorelli added the Interchange Dataframe Interchange Protocol label Aug 29, 2023
@MarcoGorelli MarcoGorelli added this to the 2.2 milestone Aug 29, 2023
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @rgommers

EDIT: please disregard my approval, I'll come back to this next week - sorry for having been too fast on this one

Whether to tell the DataFrame to overwrite null values in the data
with ``NaN`` (or ``NaT``).
`nan_as_null` is DEPRECATED and has no effect. Please avoid using
it; it will be removed in a future release (after Aug 2024).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
it; it will be removed in a future release (after Aug 2024).
it; it will be removed in a future release (after Aug 2024).
.. deprecated:: 2.2.0

Whether to tell the DataFrame to overwrite null values in the data
with ``NaN`` (or ``NaT``).
`nan_as_null` is DEPRECATED and has no effect. Please avoid using
it; it will be removed in a future release (after Aug 2024).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The date is wrong, this will be removed in 3.0 which will happen before that.

Also: This needs a FutureWarning if a value is passed by the user and we need tests for that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent here was to keep it around for a year in all dataframe libraries, to avoid unnecessary breakage. This works best if we remove usage of the keyword from all callers of __dataframe__ now, and then from the signature a year later.

Is there a reason to not do this? Removing it sooner seems to not have any upsides, only more disruption. I can remove the word "deprecated" from the PR if it bothers you?

This needs a FutureWarning

This is not a good idea. The end user can do nothing about this. This is a between-dataframes communication mechanism, so emitting user-facing warnings has no purpose.

It's also not a behavior change but a removal of the keyword without any observable changes to how __dataframe__ behavior, so if a warning were needed it should be a deprecation warning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't do API breaking changes in minor versions, so we can remove it in 3.0 or 4.0 (I don't have a preference), but not in a minor version in between (the date and lack of a warning bothers me, deprecated itself is fine).

You can start with a DeprecationWarning, that is fine, but it needs a FutureWarning at some point before the keyword is actually removed.

It is part of our API docs (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.__dataframe__.html), so there is no guarantee that it is only used by library developers and we have to consider this a public method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey all

I've looked at this a bit more - it's true that technically it may be meant to only be used between dataframe libraries in from_dataframe, but:

So, I'd suggest treating this as any other public-facing pandas method, and going through the usual deprecation process

In any case, I've had a look at https://github.com/search?q=__dataframe__%28&type=code&p=1, and it doesn't look like there's many users of __dataframe__ anyway, and nobody seems to be passing anything non-default values to nan_as_null anyway. If we want to remove it, I'd suggest doing it earlier rather than later, before anyone gets the idea to try using it

This is not a good idea.

generic communication comment - could we not make such statements about maintainers' reviews please? thanks 🙏

Copy link
Contributor Author

@rgommers rgommers Sep 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the too brief and too certain statement, I should have written "I don't think it is a good idea to emit ...".

I will note that the whole premise of the upstream change and the proposal by @jorisvandenbossche at data-apis/dataframe-api#125 was that it can be removed because no one is using has implemented support for it - and hence that that is quick/safe/silent. A warning now would be really disruptive - so if you insist that the change cannot be made without one, it'd be better to close this PR and revisit it once all other dataframe libraries have stopped passing in this keyword in say 6-9 months.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Yeah I think that might be best to be honest - remove it from implementations and the official spec, and then do a proper deprecation cycle in pandas. There's a general desire in pandas to be stricter about the process around breaking changes now that there's a yearly release cycle, and I don't think that making an exception for __dataframe__ would be seen too well

it'd be better to close this PR

I don't think it needs closing - I think we can already note that the argument is deprecated and will be removed, and then, once the above has been completed, we follow the pandas deprecation process. So as this lives in pandas, then for better or for worse, that is what we're bound by

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl gentle ping - OK to get this docs change in now and then do the usual deprecation once other implementations have been updated?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we do deviate from our standard backwards compatibility / deprecation policies for experimental features. While we didn't explicitly label this one as such (unfortunately), I think we could decide to treat this method as experimental, though.

(we also already noted in our public docstring that this keyword has no effect)

To be clear, I also don't mind doing a longer deprecation cycle if we really want to, on our side. In the end it's just keeping some code that does nothing around for a bit longer, so it's not that this takes much effort to do. But I think we can also just remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the actual diff of the PR now: I think this is indeed fine to merge as is (maybe just remove the explicit date when it will be removed, for now). All it does is update the documentation to be clearer that this keyword doesn't do anything and is deprecated, and clean up the internals to already remove it internally (which has no user visible impact).

Whether to tell the DataFrame to overwrite null values in the data
with ``NaN`` (or ``NaT``).
`nan_as_null` is DEPRECATED and has no effect. Please avoid using
it; it will be removed in a future release (after Aug 2024).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey all

I've looked at this a bit more - it's true that technically it may be meant to only be used between dataframe libraries in from_dataframe, but:

So, I'd suggest treating this as any other public-facing pandas method, and going through the usual deprecation process

In any case, I've had a look at https://github.com/search?q=__dataframe__%28&type=code&p=1, and it doesn't look like there's many users of __dataframe__ anyway, and nobody seems to be passing anything non-default values to nan_as_null anyway. If we want to remove it, I'd suggest doing it earlier rather than later, before anyone gets the idea to try using it

This is not a good idea.

generic communication comment - could we not make such statements about maintainers' reviews please? thanks 🙏

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all for the discussion - going back to the 'approval' state then (conditional on addressing @lithomas1 's comment)

@phofl you OK with merging this as-is, with the understanding that formal deprecation cycle will be done later before the argument is actually removed?

Regarding the date mentioned - to be honest I'm OK with it, it just says it'll happen at some point after that date, not that it'll happen on that date

Whether to tell the DataFrame to overwrite null values in the data
with ``NaN`` (or ``NaT``).
`nan_as_null` is DEPRECATED and has no effect. Please avoid using
it; it will be removed in a future release (after Aug 2024).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Yeah I think that might be best to be honest - remove it from implementations and the official spec, and then do a proper deprecation cycle in pandas. There's a general desire in pandas to be stricter about the process around breaking changes now that there's a yearly release cycle, and I don't think that making an exception for __dataframe__ would be seen too well

it'd be better to close this PR

I don't think it needs closing - I think we can already note that the argument is deprecated and will be removed, and then, once the above has been completed, we follow the pandas deprecation process. So as this lives in pandas, then for better or for worse, that is what we're bound by

Co-authored-by: Joris Van den Bossche <[email protected]>
@MarcoGorelli
Copy link
Member

@phofl gentle ping (your review is blocking the pr)

let's get the text in, and if/when we start a formal deprecation cycle we can add the .. deprecated:: directive and issue a warning

@phofl
Copy link
Member

phofl commented Nov 17, 2023

Fine by me if we deprecated properly in the future (I don't care when we start the process)

@MarcoGorelli
Copy link
Member

Thanks - let's get this in then

if/when we start the deprecation process we can add .. deprecated:: and raise a warning, but for now noting that it's deprecated in the spec and has no effect seems like a net positive

adding either now might result in something accidentally enforcing the deprecation in 3.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interchange Dataframe Interchange Protocol
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants