DOC: Add notes to nullable types documentation about pd.NA column type #57050
Thanks for the contribution @FilipRazek. Changes look reasonable given the discussion in the issue, but I'm not so sure what's proposed in the issue makes sense. @killerrex do you mind expanding on why it is useful to create a column with all missing values? I'm personally unsure about how useful this will be to users; it seems to add more noise and confusion than value in my opinion. CC: @rhshadrach |
I've created columns by starting with all NA values, filling in certain indices from one or more sources, and then used Whether or not it may be particularly useful, it can be done, and I think it can be surprising that the dtype results in object.
Noise I can see, but I'm not sure what is meant by confusion here. But no strong opinion here if there is any opposition. |
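For readers following the thread, the dtype surprise mentioned above can be reproduced with a minimal sketch. The frame and the column names `untyped` and `typed` are made up for illustration, and the exact dtype of the scalar-assigned column may depend on the pandas version in use.

```python
import pandas as pd

# Hypothetical three-row frame, only for illustration.
df = pd.DataFrame({"a": [1, 2, 3]})

# Assigning the bare scalar gives pandas no type information,
# so the column is not created as a nullable integer column.
df["untyped"] = pd.NA

# Passing an explicit nullable dtype keeps the column typed
# even though every value is missing.
df["typed"] = pd.array([pd.NA] * len(df), dtype="Int64")

print(df.dtypes)
```

Printing `df.dtypes` makes the difference between the two all-NA columns visible.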
By confusion I meant that users may wonder why they'd like to create a column with just What I think makes more sense is to show the full use case here. Instead of saying Maybe I'm just overcomplicating things, feel free to go ahead with this if you think it's better. But I think, in particular for beginners, it'd be better to show a more realistic example. |
I used it when I fill a column progressively, like:

df: pd.DataFrame = load_the_big_data()
df['value'] = pd.Series(dtype='Int64')
if calc_x:
    mask = df['Col_A'] == 'x'  # <== Example row selection, may be really complex
    df.loc[mask, 'value'] = complex_function(df.loc[mask, :])
if calc_y:
    mask = df['Col_B'] == 'y'
    df.loc[mask, 'value'] = other_complex_function(df.loc[mask, :])
# etc etc

It is especially useful if I later need to plot that column (as NA values are not plotted), but many times the fact that a value has not been calculated is also meaningful in itself. Sometimes you can do the same by filling the column with a canary value, but especially when the function is part of a library used in other projects, it is not easy or even possible to find an appropriate value. The other main use I found is when I need to use a library that expects certain columns in the input dataframe and, in my use case, the concept of that column is not applicable (for example, in satellite data analysis, not all the satellites of the same constellation carry the same instruments). |
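A self-contained sketch of the progressive-fill pattern described in this comment is given below. The sample data, the helper column `n`, and the arithmetic standing in for `complex_function` and `other_complex_function` are all assumptions made for illustration. The all-NA column is built with `pd.array` here rather than the empty `pd.Series(dtype='Int64')` used in the comment.

```python
import pandas as pd

# Hypothetical stand-in for load_the_big_data().
df = pd.DataFrame(
    {
        "Col_A": ["x", "q", "x", "q"],
        "Col_B": ["p", "y", "p", "p"],
        "n": [1, 2, 3, 4],
    }
)

# Start with a typed, all-missing column.
df["value"] = pd.array([pd.NA] * len(df), dtype="Int64")

# First partial fill (placeholder for complex_function).
mask = df["Col_A"] == "x"
df.loc[mask, "value"] = df.loc[mask, "n"] * 10

# Second partial fill (placeholder for other_complex_function).
mask = df["Col_B"] == "y"
df.loc[mask, "value"] = df.loc[mask, "n"] + 100

# Rows matched by neither mask stay missing, and the dtype stays Int64.
print(df)
print(df["value"].dtype)
```

Rows that no step touches remain `<NA>`, which is what lets "not calculated" stay distinguishable from any real value.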
Thanks @datapythonista. My issue with your wording, albeit quite a minor one:
is that this still holds even if you don't fill in the values. As a result, I find this a bit misleading. In more generality, we don't really know why users might create a column of all NA values - I gave a use case I've come across but there might be others. What do you think of wording like the following?
|
How about this; |
I think that looks good @FilipRazek. Some minor tweaks
|
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |