DOC: Add notes to nullable types documentation about pd.NA column type #57050

FilipRazek · 2024-01-24T11:13:32Z

closes DOC: Adding new columns in a DataFrame of Nullable type might be better explained #49201
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

datapythonista · 2024-01-31T05:52:08Z

Thanks for the contribution @FilipRazek. Changes look reasonable given the discussion in the issue, but I'm not so sure what's proposed in the issue makes sense.

@killerrex do you mind expanding on why it is useful to create a column with all missing values?

I'm personally unsure about how useful this will be to users; it seems to add more noise and confusion than value, in my opinion.

CC: @rhshadrach

rhshadrach · 2024-02-01T03:29:56Z

> do you mind expanding on why it is useful to create a column with all missing values?

I've created columns by starting with all NA values, filling in certain indices from one or more sources, and then used ffill and/or bfill.
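For illustration, a minimal sketch of that workflow. The frame, the column names, and the filled-in values are made up for the example, and passing `index=df.index` is just one way to make the all-NA column line up with the frame explicitly.

```python
import pandas as pd

# Hypothetical data; only the shape of the workflow matters here.
df = pd.DataFrame({"key": ["a", "b", "c", "d", "e"]})

# Start with a column of all-NA values, typed as nullable integer.
df["value"] = pd.Series(pd.NA, index=df.index, dtype="Int64")

# Fill in certain rows from some source (hard-coded for illustration).
df.loc[[1, 3], "value"] = [10, 30]

# Propagate known values forward, then backward to cover the leading gap.
filled = df["value"].ffill().bfill()
print(filled)
```

Because the column was typed up front, the filled result should keep the nullable integer dtype rather than falling back to object.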

Whether or not it is particularly useful, it can be done, and I think it can be surprising that the resulting dtype is object.

> seems to add more noise and confusion

Noise I can see, but I'm not sure what is meant by confusion here.

But no strong opinion here if there is any opposition.

datapythonista · 2024-02-01T04:39:50Z

By confusion I meant that users may wonder why they'd want to create a column with just NA. I guess the addition will be useful and clear to people who have tried to do it before, but it may leave more questions than answers for someone who hasn't had the need.

What I think makes more sense is to show the full use case here. Instead of saying "when creating a column with NA values", we could show the whole example: when creating a new column starting with NA values and then filling in the present values, the type would be set to object by df['new_col'] = pd.NA, and performance on this column will be worse than with the appropriate type. It's better to use df['new_col'] = pd.Series(pd.NA, dtype="Int64"). Surely it can be phrased better, but I think this shows what I'd do.

Maybe I'm just overcomplicating things, feel free to go ahead with this if you think it's better. But I think in particular for beginners it'd be better to show a more real example.

killerrex · 2024-02-01T12:20:54Z

> @killerrex do you mind expanding on why it is useful to create a column with all missing values?

I use it when I fill a column progressively, like:

df: pd.DataFrame = load_the_big_data()

df['value'] = pd.Series(dtype='Int64')

if calc_x:
    mask = df['Col_A'] == 'x'   # <== Example row selection, may be really complex
    df.loc[mask, 'value'] = complex_function(df.loc[mask, :])

if calc_y:
    mask = df['Col_B'] == 'y'
    df.loc[mask, 'value'] = other_complex_function(df.loc[mask, :])

# etc etc

It is especially useful if I later need to plot that column (as NA values are not plotted), but often the fact that a column has not been calculated is also meaningful.

Sometimes you can do the same by filling the column with a canary value, but especially when the function is part of a library used in other projects, it is not easy or even possible to find an appropriate value.

The other main use I found is when I need to use a library that expects certain columns in the input dataframe, and for my case the concept of that column is not applicable (for example, in satellite data analysis, not all the satellites in the same constellation have the same instruments).
Initialising those sparse columns to NA guarantees that if the data is used inadvertently, the outputs are going to be NA and not some wrong value.
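
A self-contained sketch of the progressive-fill pattern above, for readers who want to run it. The data and the two helper functions (`double_raw`, `negate_raw`) are hypothetical stand-ins for `load_the_big_data`, `complex_function`, and `other_complex_function`, which are not defined in this thread.

```python
import pandas as pd

# Hypothetical stand-in for load_the_big_data().
df = pd.DataFrame(
    {
        "Col_A": ["x", "q", "x", "q"],
        "Col_B": ["q", "y", "q", "q"],
        "raw": [1, 2, 3, 4],
    }
)


def double_raw(rows: pd.DataFrame) -> pd.Series:
    """Hypothetical stand-in for complex_function."""
    return rows["raw"] * 2


def negate_raw(rows: pd.DataFrame) -> pd.Series:
    """Hypothetical stand-in for other_complex_function."""
    return -rows["raw"]


# All-NA nullable integer column, aligned explicitly to the frame's index.
df["value"] = pd.Series(pd.NA, index=df.index, dtype="Int64")

# Fill the rows selected by each mask; unselected rows stay NA.
mask_x = df["Col_A"] == "x"
df.loc[mask_x, "value"] = double_raw(df.loc[mask_x, :])

mask_y = df["Col_B"] == "y"
df.loc[mask_y, "value"] = negate_raw(df.loc[mask_y, :])

# Rows that were never calculated propagate NA instead of a wrong number.
shifted = df["value"] + 1
print(df)
print(shifted)
```

The last row matches neither mask, so it stays NA in both `value` and `shifted`, which is the "not calculated" signal described above.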

rhshadrach · 2024-02-07T03:57:14Z

Thanks @datapythonista. My issue with your wording, albeit quite a minor one:

> when creating a new column starting with NA values and then filling in the present values

is that this still holds even if you don't fill in the values. As a result, I find this a bit misleading. More generally, we don't really know why users might create a column of all NA values. I gave a use case I've come across, but there might be others.

What do you think of wording like the following?

If you desire to create a column of all NA values...

FilipRazek · 2024-02-22T20:57:31Z

How about this:
If you decide to create a column of NA values (for example to fill them later), the type would be set to object in the new column. The performance on this column will be worse than with the appropriate type. It's better to use df['new_col'] = pd.Series(pd.NA, dtype="Int64") than df['new_col'] = pd.NA.
Would this meet your needs?

rhshadrach · 2024-02-23T19:54:07Z

I think that looks good @FilipRazek. Some minor tweaks:

If you create a column of NA values (for example to fill them later) with df['new_col'] = pd.NA, the dtype would be set to object in the new column. The performance on this column will be worse than with the appropriate type. It's better to use df['new_col'] = pd.Series(pd.NA, dtype="Int64") (or another dtype as you desire).
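
A short sketch of the difference this wording describes, assuming a recent pandas release (the exact behaviour may differ between versions). The frame and the column names `untyped` and `typed` are illustrative only, and `index=df.index` is added here to make the alignment explicit.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Assigning the bare scalar leaves pandas with no dtype information.
df["untyped"] = pd.NA

# Assigning a typed Series of NA values keeps the nullable integer dtype.
df["typed"] = pd.Series(pd.NA, index=df.index, dtype="Int64")

print(df.dtypes)

# Filling in a value later does not change either column's dtype.
df.loc[0, "untyped"] = 5
df.loc[0, "typed"] = 5
print(df.dtypes)
```

The first column is expected to report `object` and the second `Int64`, both before and after a value is filled in.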

github-actions · 2024-03-25T00:06:18Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2024-03-26T17:39:14Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

doc: Add notes to nullable types documentation about pd.NA column type

2ca9ae7

datapythonista changed the title ~~doc: Add notes to nullable types documentation about pd.NA column type~~ DOC: Add notes to nullable types documentation about pd.NA column type Jan 31, 2024

datapythonista added the Docs label Jan 31, 2024

mroeschke mentioned this pull request Feb 12, 2024

update intiger_na.rst document file #57374

Closed

5 tasks

github-actions bot added the Stale label Mar 25, 2024

mroeschke closed this Mar 26, 2024
