DEPR: remove use of nan_as_null from callers of __dataframe__ #54846

Merged
merged 3 commits into from
Nov 17, 2023
Changes from 2 commits
9 changes: 3 additions & 6 deletions pandas/core/frame.py
@@ -893,8 +893,8 @@ def __dataframe__(
Parameters
----------
nan_as_null : bool, default False
-Whether to tell the DataFrame to overwrite null values in the data
-with ``NaN`` (or ``NaT``).
+`nan_as_null` is DEPRECATED and has no effect. Please avoid using
+it; it will be removed in a future release (after Aug 2024).
Member

Suggested change
-it; it will be removed in a future release (after Aug 2024).
+it; it will be removed in a future release (after Aug 2024).
+.. deprecated:: 2.2.0

Member

The date is wrong: this will be removed in 3.0, which will happen before that.

Also: this needs a FutureWarning if a value is passed by the user, and we need tests for that.

Contributor Author

The intent here was to keep it around for a year in all dataframe libraries, to avoid unnecessary breakage. This works best if we remove usage of the keyword from all callers of __dataframe__ now, and then from the signature a year later.

Is there a reason to not do this? Removing it sooner seems to not have any upsides, only more disruption. I can remove the word "deprecated" from the PR if it bothers you?

> This needs a FutureWarning

This is not a good idea. The end user can do nothing about this. This is a between-dataframes communication mechanism, so emitting user-facing warnings has no purpose.

It's also not a behavior change but a removal of the keyword without any observable change to how __dataframe__ behaves, so if a warning were needed it should be a deprecation warning.
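The migration order proposed in this comment (callers stop passing the keyword first, producers drop it from the signature later) can be illustrated with a minimal sketch. All class and function names below are hypothetical stand-ins, not pandas or any other library's code; the returned dicts only mimic an interchange object.

```python
class OldProducer:
    """Hypothetical producer that still accepts the deprecated keyword."""

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        # nan_as_null is accepted but ignored, as in the PR under review.
        return {"producer": "old", "allow_copy": allow_copy}


class NewProducer:
    """Hypothetical producer after the keyword has left the signature."""

    def __dataframe__(self, allow_copy: bool = True):
        return {"producer": "new", "allow_copy": allow_copy}


def consume(obj, allow_copy: bool = True):
    """Updated consumer: never passes nan_as_null."""
    return obj.__dataframe__(allow_copy=allow_copy)


def consume_legacy(obj, allow_copy: bool = True):
    """Legacy consumer that still forwards nan_as_null."""
    return obj.__dataframe__(nan_as_null=False, allow_copy=allow_copy)


# The updated consumer works against both producers.
results = [consume(OldProducer()), consume(NewProducer(), allow_copy=False)]

# The legacy consumer breaks once the keyword is gone from the signature.
try:
    consume_legacy(NewProducer())
    legacy_ok = True
except TypeError:
    legacy_ok = False
```

Under these assumptions, a consumer that has already dropped the keyword is unaffected by when each producer removes it, which is the argument for updating callers first.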

Member

We don't do API breaking changes in minor versions, so we can remove it in 3.0 or 4.0 (I don't have a preference), but not in a minor version in between (the date and the lack of a warning bother me; the deprecation itself is fine).

You can start with a DeprecationWarning, that is fine, but it needs a FutureWarning at some point before the keyword is actually removed.

It is part of our API docs (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.__dataframe__.html), so there is no guarantee that it is only used by library developers and we have to consider this a public method.
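The warning discussed in this comment would presumably fire only when a caller passes `nan_as_null` explicitly. A minimal sketch of that pattern follows, assuming a sentinel default; `ToyFrame`, `_NO_DEFAULT`, and `WARNING_CLASS` are hypothetical names for illustration and are not taken from the PR or from pandas.

```python
import warnings

# Hypothetical sentinel meaning "the caller did not pass this keyword".
_NO_DEFAULT = object()

# Assumption: the warning class is a policy choice. The thread suggests
# DeprecationWarning first and FutureWarning before the actual removal.
WARNING_CLASS = DeprecationWarning


class ToyFrame:
    """Hypothetical stand-in for a dataframe class; not pandas code."""

    def __dataframe__(self, nan_as_null=_NO_DEFAULT, allow_copy: bool = True):
        if nan_as_null is not _NO_DEFAULT:
            # Warn only when a value was passed explicitly.
            warnings.warn(
                "nan_as_null is deprecated and has no effect",
                WARNING_CLASS,
                stacklevel=2,
            )
        return ("interchange-object", allow_copy)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ToyFrame().__dataframe__()                   # default call: silent
    silent_count = len(caught)
    ToyFrame().__dataframe__(nan_as_null=False)  # explicit value: warns
    warned_count = len(caught)
```

With Python's default warning filters, a `DeprecationWarning` is shown only when triggered by code in `__main__`, while a `FutureWarning` is shown to everyone, which is why the choice of class matters for whether end users see it.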

Member

Hey all

I've looked at this a bit more - it's true that technically it may be meant to only be used between dataframe libraries in from_dataframe, but:

So, I'd suggest treating this as any other public-facing pandas method, and going through the usual deprecation process

In any case, I've had a look at https://github.com/search?q=__dataframe__%28&type=code&p=1, and it doesn't look like there are many users of __dataframe__, and nobody seems to be passing non-default values to nan_as_null. If we want to remove it, I'd suggest doing it earlier rather than later, before anyone gets the idea to try using it.

> This is not a good idea.

generic communication comment - could we not make such statements about maintainers' reviews please? thanks 🙏

Contributor Author

@rgommers, Sep 4, 2023

Apologies for the too brief and too certain statement, I should have written "I don't think it is a good idea to emit ...".

I will note that the whole premise of the upstream change and the proposal by @jorisvandenbossche at data-apis/dataframe-api#125 was that the keyword can be removed because no one has implemented support for it, and hence that the removal is quick/safe/silent. A warning now would be really disruptive, so if you insist that the change cannot be made without one, it'd be better to close this PR and revisit it once all other dataframe libraries have stopped passing in this keyword in, say, 6-9 months.

Member

Thanks

Yeah I think that might be best to be honest - remove it from implementations and the official spec, and then do a proper deprecation cycle in pandas. There's a general desire in pandas to be stricter about the process around breaking changes now that there's a yearly release cycle, and I don't think that making an exception for __dataframe__ would be seen too well

> it'd be better to close this PR

I don't think it needs closing - I think we can already note that the argument is deprecated and will be removed, and then, once the above has been completed, we follow the pandas deprecation process. So as this lives in pandas, then for better or for worse, that is what we're bound by

Member

@phofl gentle ping - OK to get this docs change in now and then do the usual deprecation once other implementations have been updated?

Member

FWIW, we do deviate from our standard backwards compatibility / deprecation policies for experimental features. While we didn't explicitly label this one as such (unfortunately), I think we could decide to treat this method as experimental, though.

(we also already noted in our public docstring that this keyword has no effect)

To be clear, I also don't mind doing a longer deprecation cycle if we really want to, on our side. In the end it's just keeping some code that does nothing around for a bit longer, so it's not that this takes much effort to do. But I think we can also just remove it.

Member

Looking at the actual diff of the PR now: I think this is indeed fine to merge as is (maybe just remove the explicit date when it will be removed, for now). All it does is update the documentation to be clearer that this keyword doesn't do anything and is deprecated, and clean up the internals to already remove it internally (which has no user visible impact).

allow_copy : bool, default True
Whether to allow memory copying when exporting. If set to False
it would cause non-zero-copy exports to fail.
@@ -909,9 +909,6 @@ def __dataframe__(
Details on the interchange protocol:
https://data-apis.org/dataframe-protocol/latest/index.html

-`nan_as_null` currently has no effect; once support for nullable extension
-dtypes is added, this value should be propagated to columns.
-
Examples
--------
>>> df_not_necessarily_pandas = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
@@ -931,7 +928,7 @@ def __dataframe__(

from pandas.core.interchange.dataframe import PandasDataFrameXchg

-return PandasDataFrameXchg(self, nan_as_null, allow_copy)
+return PandasDataFrameXchg(self, allow_copy=allow_copy)

def __dataframe_consortium_standard__(
self, *, api_version: str | None = None
22 changes: 7 additions & 15 deletions pandas/core/interchange/dataframe.py
@@ -27,25 +27,20 @@ class PandasDataFrameXchg(DataFrameXchg):
attributes defined on this class.
"""

-def __init__(
-self, df: DataFrame, nan_as_null: bool = False, allow_copy: bool = True
-) -> None:
+def __init__(self, df: DataFrame, allow_copy: bool = True) -> None:
"""
Constructor - an instance of this (private) class is returned from
`pd.DataFrame.__dataframe__`.
"""
self._df = df
-# ``nan_as_null`` is a keyword intended for the consumer to tell the
-# producer to overwrite null values in the data with ``NaN`` (or ``NaT``).
-# This currently has no effect; once support for nullable extension
-# dtypes is added, this value should be propagated to columns.
-self._nan_as_null = nan_as_null
self._allow_copy = allow_copy

def __dataframe__(
self, nan_as_null: bool = False, allow_copy: bool = True
) -> PandasDataFrameXchg:
-return PandasDataFrameXchg(self._df, nan_as_null, allow_copy)
+# `nan_as_null` can be removed here once it's removed from
+# Dataframe.__dataframe__
+return PandasDataFrameXchg(self._df, allow_copy)

@property
def metadata(self) -> dict[str, Index]:
@@ -84,7 +79,7 @@ def select_columns(self, indices: Sequence[int]) -> PandasDataFrameXchg:
indices = list(indices)

return PandasDataFrameXchg(
-self._df.iloc[:, indices], self._nan_as_null, self._allow_copy
+self._df.iloc[:, indices], allow_copy=self._allow_copy
)

def select_columns_by_name(self, names: list[str]) -> PandasDataFrameXchg: # type: ignore[override] # noqa: E501
@@ -93,9 +88,7 @@ def select_columns_by_name(self, names: list[str]) -> PandasDataFrameXchg: # ty
if not isinstance(names, list):
names = list(names)

-return PandasDataFrameXchg(
-self._df.loc[:, names], self._nan_as_null, self._allow_copy
-)
+return PandasDataFrameXchg(self._df.loc[:, names], allow_copy=self._allow_copy)

def get_chunks(self, n_chunks: int | None = None) -> Iterable[PandasDataFrameXchg]:
"""
@@ -109,8 +102,7 @@ def get_chunks(self, n_chunks: int | None = None) -> Iterable[PandasDataFrameXch
for start in range(0, step * n_chunks, step):
yield PandasDataFrameXchg(
self._df.iloc[start : start + step, :],
-self._nan_as_null,
-self._allow_copy,
+allow_copy=self._allow_copy,
)
else:
yield self
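Several of the internal calls in this diff now pass `allow_copy` by keyword. The PR does not state a reason, but one plausible benefit (an assumption) is that keyword calls stay correct when a positional parameter is removed from the middle of a signature. A small sketch with hypothetical classes `XchgOld` and `XchgNew` standing in for the constructor before and after the change:

```python
class XchgOld:
    """Hypothetical stand-in for the constructor before the change."""

    def __init__(self, df, nan_as_null=False, allow_copy=True):
        self.allow_copy = allow_copy


class XchgNew:
    """Hypothetical stand-in for the constructor after the change."""

    def __init__(self, df, allow_copy=True):
        self.allow_copy = allow_copy


# A two-argument positional call silently changes meaning between versions:
old = XchgOld("df", False)  # False binds to nan_as_null; allow_copy stays True
new = XchgNew("df", False)  # the same call now binds False to allow_copy

# Passing by keyword is unaffected by the removed parameter.
safe = XchgNew("df", allow_copy=True)
```

A three-argument positional call against the new signature would raise `TypeError` instead, which is at least loud; the two-argument case is the one that fails silently.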