Skip to content

DOC: clarify pitfalls of NA vs NaN in nullable floats #51383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

a-reich
Copy link

@a-reich a-reich commented Feb 14, 2023

Following up on my comment: since the current behavior easily leads to mistakes when trying to detect missing/NaN data and does not match the docs' described semantics, this PR is trying to update the docs to clarify this edge case so users can avoid the pitfalls.

Notes/questions on PR work:

  • There may be other places to update, but I figure once the language is approved it's easy to add it elsewhere
  • Should we link to the issue API: distinguish NA vs NaN in floating dtypes #32265 discussing what the behavior should be? That would enable users to weigh in with their preferences but might be too detailed/confusing.
  • Should the docstring describe this scenario in terms of Arrays or the corresponding dtypes?

Thanks!

@a-reich
Copy link
Author

a-reich commented Feb 14, 2023

Asking @jorisvandenbossche if interested in reviewing, as suggested

@mroeschke mroeschke added Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Feb 14, 2023
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@a-reich thanks for taking a look at this trying to clarify the docstring!
Personally, I would limit the addition more to the "facts", i.e. simply that for nullable floating dtype, only pd.NA is recognized.

@a-reich
Copy link
Author

a-reich commented Feb 22, 2023

Hi @jorisvandenbossche, I've tried to slim down the language to address your feedback - thanks.

However, I did realize a slight issue - I wanted to give guidance on using both Float<> and float<>[pyarrow] dtypes, since pandas is starting to support both as "nullable dtype backends". But while users can detect NaN with np.isnan for the former, it seems like there isn't a documented way for the latter (pyarrow has pc.is_nan/is_null but getting the underlying arrow array seems to require private attributes). Wanted to bring that up in case it makes sense to have pandas offer something more there. For this PR, I guess we can just only cover the Float case?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 25, 2023
@a-reich
Copy link
Author

a-reich commented Apr 1, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

I am still interested in completing this! I think it needs a little bit of pandas maintainer feedback - at least, to clarify whether there’s a supported way to detect NaN values in an arrow-dtype array.

@mroeschke
Copy link
Member

I think the topic of #32265 will be discussed and decided on soon which should address this so closing for now

@mroeschke mroeschke closed this Aug 1, 2023
Comment on lines +7879 to +7880
stored but will **not** get mapped to True (these values can be detected
by :func:`numpy.isnan`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe s.map(pd.isna) or df.map(pd.isna) is a better suggestion?

@avm19
Copy link
Contributor

avm19 commented Jun 6, 2024

I think the topic of #32265 will be discussed and decided on soon which should address this so closing for now

@mroeschke 10 months later it is still being decided. How about re-opening this PR?

I am under impression that some sort of consensus has started to crystallise around options 2, 3, or 4 with the inclination towards the current behaviour (s.isna() returns False for NaN).

I propose to also add a clarification to "User Guide". Nothing in the documentation says explicitly that NaN is no longer treated as missing in nullable float arrays. On the contrary, examples with df.fillna(0), df.ffill() and df.bfill() clearly suggest that NaN is equivalent to NA in float64[pyarrow].

Placing the link to the aforementioned discussion in documentation seems like a good idea to me as a user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants