DOC: Adding new columns in a DataFrame of Nullable type might be better explained #49201

Closed
1 task done
killerrex opened this issue Oct 20, 2022 · 6 comments · Fixed by #58163
Labels: Docs, good first issue, NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

@killerrex

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

Documentation problem

Adding a new column to an existing DataFrame using pd.NA is tricky.

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2]})
>>> df.dtypes
A    int64
dtype: object
>>> df
   A
0  1
1  2

Adding a new column of type datetime64[ns] works as expected using pd.NaT:

>>> df['T'] = pd.NaT
>>> df.dtypes
A             int64
T    datetime64[ns]
dtype: object
>>> df
   A   T
0  1 NaT
1  2 NaT

However, adding a column using pd.NA produces a column of type object:

>>> df['N'] = pd.NA
>>> df.dtypes
A             int64
T    datetime64[ns]
N            object
dtype: object
>>> df
   A   T     N
0  1 NaT  <NA>
1  2 NaT  <NA>

That is exactly what was asked for, but not what I naively expected (a column of type Int64).
This is logical, since pd.NA is also used for the string and boolean dtypes, and possibly others.

The best way I found to create a new, empty column of a desired nullable type is the following:

>>> df['N'] = pd.Series(dtype='Int64')
>>> df['S'] = pd.Series(dtype='string')
>>> df['B'] = pd.Series(dtype='boolean')
>>> df.dtypes
A             int64
T    datetime64[ns]
N             Int64
S            string
B           boolean
dtype: object
>>> df
   A   T     N     S     B
0  1 NaT  <NA>  <NA>  <NA>
1  2 NaT  <NA>  <NA>  <NA>
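
An alternative spelling, offered only as a minimal sketch and assuming a pandas version with the nullable extension dtypes: build an all-missing array of the wanted dtype with pd.array, one entry per row. This does not rely on an empty Series being aligned to the DataFrame index. The column names mirror the ones above.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]})

# One pd.NA per row, with the nullable dtype stated explicitly.
df["N"] = pd.array([pd.NA] * len(df), dtype="Int64")
df["S"] = pd.array([pd.NA] * len(df), dtype="string")
df["B"] = pd.array([pd.NA] * len(df), dtype="boolean")

print(df.dtypes)
```

Either form should leave every new column entirely missing while keeping the requested dtype.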

Suggested fix for documentation

I think it would be enough to mention this as an additional entry on the nullable data types page, showing that creating a column initialised with pd.NA is no different from creating a column with any other non-numeric Python object.
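
A sketch of what such a documentation entry might demonstrate (the column names from_na, from_none and typed are hypothetical, chosen only for illustration): a bare pd.NA scalar carries no dtype information, so the new column falls back to object, much like a column created from None. Stating the dtype explicitly keeps the nullable type.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]})

# Scalars without dtype information produce object columns.
df["from_na"] = pd.NA
df["from_none"] = None

# Giving the dtype explicitly yields a nullable integer column instead.
df["typed"] = pd.Series(pd.NA, index=df.index, dtype="Int64")

print(df.dtypes)
```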

@killerrex killerrex added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 20, 2022
@rhshadrach rhshadrach added good first issue NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 16, 2024
@rhshadrach
Member

Thanks for the report, I think this makes sense to add.

@venkata-pavani

I am looking into this issue

@codeoxygen

It makes sense. It should be clearly mentioned in the documentation. I am looking into this issue.

@giormala
Contributor

giormala commented Apr 4, 2024

take

@giormala
Contributor

Hi, I have created a pull request with the requested changes. Could I have some help reviewing it? Thank you!

@giormala
Contributor

giormala commented Jun 5, 2024

Should I change something else for my PR to be merged? Thanks in advance!

@rhshadrach rhshadrach added this to the 3.0 milestone Jun 6, 2024