
DOC: Add notes to nullable types documentation about pd.NA column type #57050


Closed
wants to merge 1 commit into from


@FilipRazek FilipRazek commented Jan 24, 2024

@datapythonista datapythonista changed the title doc: Add notes to nullable types documentation about pd.NA column type DOC: Add notes to nullable types documentation about pd.NA column type Jan 31, 2024
@datapythonista
Member

Thanks for the contribution @FilipRazek. Changes look reasonable given the discussion in the issue, but I'm not so sure what's proposed in the issue makes sense.

@killerrex do you mind expanding on why it is useful to create a column with all missing values?

I'm personally unsure about how useful this will be to users; it seems to add more noise and confusion than value, in my opinion.

CC: @rhshadrach

@rhshadrach
Member

rhshadrach commented Feb 1, 2024

do you mind expanding on why it is useful to create a column with all missing values?

I've created columns by starting with all NA values, filling in certain indices from one or more sources, and then used ffill and/or bfill.

Whether or not it may be particularly useful, it can be done, and I think it can be surprising that the dtype results in object.
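A minimal sketch of the fill-then-propagate workflow described above, assuming a nullable integer column; the frame, the column names, and the values are invented for illustration and do not come from the PR:

```python
import pandas as pd

# Hypothetical frame; "key" and "filled" are illustrative names only.
df = pd.DataFrame({"key": ["a", "b", "c", "d", "e"]})

# Start from an all-NA column that already carries a nullable integer dtype.
df["filled"] = pd.Series(pd.NA, index=df.index, dtype="Int64")

# Fill a few known positions, as if they came from one or more sources.
df.loc[1, "filled"] = 10
df.loc[3, "filled"] = 30

# Propagate known values forward, then backward to cover the leading gap.
result = df["filled"].ffill().bfill()
print(result.tolist())
print(result.dtype)
```

Because the column is typed from the start, the filled result should stay in the nullable integer dtype rather than falling back to object.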

seems to add more noise and confusion

Noise I can see, but I'm not sure what is meant by confusion here.

But no strong opinion here if there is any opposition.

@datapythonista
Member

By confusion I meant that users may wonder why they'd like to create a column with just NA. I guess the addition will be useful and clear to people who tried to do it before, but may leave more questions than answers for someone who didn't have the need.

What I think makes more sense is to show the full use case here. Instead of saying "when creating a column with NA values", we could show the whole example: when creating a new column starting with NA values and then filling the present values, the type would be set to object by df['new_col'] = pd.NA, and performance on this column will be worse than with the appropriate type. It's better to use df['new_col'] = pd.Series(pd.NA, dtype="Int64"). Surely it can be phrased better, but I think this shows what I'd do.

Maybe I'm just overcomplicating things, feel free to go ahead with this if you think it's better. But I think in particular for beginners it'd be better to show a more real example.

@killerrex

@killerrex do you mind expanding on why it is useful to create a column with all missing values?

I used it when I fill a column progressively, like:

df: pd.DataFrame = load_the_big_data()

df['value'] = pd.Series(dtype='Int64')

if calc_x:
    mask = df['Col_A'] == 'x'   # <== Example row selection, may be really complex
    df.loc[mask, 'value'] = complex_function(df.loc[mask, :])

if calc_y:
    mask = df['Col_B'] == 'y'
    df.loc[mask, 'value'] = other_complex_function(df.loc[mask, :])

# etc etc

It is especially useful if I later need to plot that column (as the NA values are not plotted), but often the fact that a column has not been calculated is also meaningful in itself.
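The progressive-fill pattern above can be sketched in runnable form as follows. This is only an illustration under stated assumptions: the small frame stands in for load_the_big_data(), and double_reading and offset_reading are hypothetical stand-ins for complex_function and other_complex_function.

```python
import pandas as pd

# Hypothetical stand-in for load_the_big_data().
df = pd.DataFrame(
    {
        "Col_A": ["x", "q", "x", "q"],
        "Col_B": ["n", "y", "n", "n"],
        "reading": [1, 2, 3, 4],
    }
)

# Typed all-NA column; the index is passed explicitly so the length matches.
df["value"] = pd.Series(pd.NA, index=df.index, dtype="Int64")


def double_reading(rows: pd.DataFrame) -> pd.Series:
    """Hypothetical replacement for complex_function."""
    return rows["reading"] * 2


def offset_reading(rows: pd.DataFrame) -> pd.Series:
    """Hypothetical replacement for other_complex_function."""
    return rows["reading"] + 100


# Fill only the rows selected by each mask; the rest stay NA.
mask = df["Col_A"] == "x"
df.loc[mask, "value"] = double_reading(df.loc[mask, :])

mask = df["Col_B"] == "y"
df.loc[mask, "value"] = offset_reading(df.loc[mask, :])

print(df["value"])
```

Rows that no mask selects remain NA, which preserves the "not calculated" meaning mentioned above.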

Sometimes you can do the same by filling the column with a canary value, but, especially when the function is part of a library used in other projects, it is not easy or even possible to find an appropriate value.

The other main use I found is when I need to use a library that expects certain columns in the input dataframe, and in my case the concept of that column is not applicable (for example, in satellite data analysis, not all the satellites of the same constellation have the same instruments).
Initialising those sparse columns to NA guarantees that if the data is used inadvertently, the outputs are going to be NA and not some wrong value.
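A small sketch of the NA propagation described above, assuming element-wise arithmetic on a nullable integer column; the satellite frame and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical data: the instrument column does not apply to these satellites.
df = pd.DataFrame({"satellite": ["s1", "s2"], "altitude_km": [500, 520]})
df["instrument_reading"] = pd.Series(pd.NA, index=df.index, dtype="Int64")

# Element-wise arithmetic involving the NA column yields NA, not a wrong number.
derived = df["instrument_reading"] * 2 + df["altitude_km"]
print(derived)
```

Note that this covers element-wise operations only; reductions such as sum() skip missing values by default, so they are a separate consideration.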

@rhshadrach
Member

Thanks @datapythonista. My issue with your wording, albeit quite a minor one:

when creating a new column starting with NA values and then filling the present values

is that this still holds even if you don't fill in the values. As a result, I find this a bit misleading. More generally, we don't really know why users might create a column of all NA values: I gave a use case I've come across, but there might be others.

What do you think of wording like the following?

If you desire to create a column of all NA values...

@FilipRazek
Author

How about this:
If you decide to create a column of NA values (for example to fill them later), the type would be set to object in the new column. The performance on this column will be worse than with the appropriate type. It's better to use df['new_col'] = pd.Series(pd.NA, dtype="Int64") than df['new_col'] = pd.NA.
Would this correspond to your needs?

@rhshadrach
Member

I think that looks good @FilipRazek. Some minor tweaks:

If you create a column of NA values (for example to fill them later) with df['new_col'] = pd.NA, the dtype would be set to object in the new column. The performance on this column will be worse than with the appropriate type. It's better to use df['new_col'] = pd.Series(pd.NA, dtype="Int64") (or another dtype as you desire).
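To make the proposed wording concrete, here is a hedged sketch comparing the two assignments, plus a couple of other nullable dtypes. The column names are illustrative, and the object result reflects the behaviour discussed in this thread rather than a guarantee for every pandas version:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Scalar pd.NA carries no type information, so the column becomes object.
df["untyped"] = pd.NA

# Passing an explicit nullable dtype keeps the column typed while all-NA.
df["typed_int"] = pd.Series(pd.NA, index=df.index, dtype="Int64")
df["typed_float"] = pd.Series(pd.NA, index=df.index, dtype="Float64")
df["typed_bool"] = pd.Series(pd.NA, index=df.index, dtype="boolean")

print(df.dtypes)
```

Passing index=df.index is a choice made in this sketch so the new Series has the same length as the frame; the wording in the thread omits it and relies on index alignment during assignment.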

Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 25, 2024
@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Mar 26, 2024
Development

Successfully merging this pull request may close these issues.

DOC: Adding new columns in a DataFrame of Nullable type might be better explained
5 participants