
DataFrame._init_dict handles columns with nan incorrectly if columns passed separately #16894


Closed
kernc opened this issue Jul 12, 2017 · 12 comments · Fixed by #31171
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@kernc
Contributor

kernc commented Jul 12, 2017

Code Sample, a copy-pastable example if possible

>>> df = pd.DataFrame({np.nan: [1, 2]})
>>> df[np.nan]   # Arguably expectedly, nan matches nan
0    1
1    2
Name: nan, dtype: int64

>>> df = pd.DataFrame({np.nan: [1, 2], 2: [2, 3]}, columns=[np.nan, 2])
>>> df   # nan from dict didn't match nan from ensured Float64Index
  NaN    2.0
0  NaN     2
1  NaN     3

Problem description

When DataFrame is initialized from dict, if columns are passed, nan isn't recognized and retrieved from dict correctly. The problem is in loops like:

columns = _ensure_index(columns)  # Float64Index
for c in columns:  # c = np.float64(np.nan)  (is not np.nan)
    if c in data_dict:  # c is not in dict
        ....

If columns aren't passed separately, initialization works as expected.

>>> pd.DataFrame({np.nan: [1, 2], 2: [2, 3]})
   NaN    2.0
0     1     2
1     2     3

Consistency would be nice.
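One possible way to make the lookup in the loop above NaN-aware is sketched below. This is only an illustration, not the pandas implementation: the helper name nan_aware_get is hypothetical, and plain Python floats stand in for np.nan (which is itself a Python float) and for the float64 values produced by iterating over the index.

```python
import math

NAN = float("nan")  # stands in for the np.nan object used as the dict key
data = {NAN: [1, 2], 2: [2, 3]}
# Stands in for iterating over the float index: the NaN here is a *different* object.
columns = [float("nan"), 2.0]


def _is_nan(key):
    """True only for float NaN keys; other key types are never treated as NaN."""
    return isinstance(key, float) and math.isnan(key)


def nan_aware_get(mapping, key, default=None):
    """Hypothetical helper: dict lookup that treats every NaN key as the same key."""
    if key in mapping:
        return mapping[key]
    if _is_nan(key):
        # Fall back to a linear scan for any NaN key already in the mapping.
        for existing_key, value in mapping.items():
            if _is_nan(existing_key):
                return value
    return default


# Plain lookup misses the NaN column, the NaN-aware lookup finds it.
naive = [data.get(c) for c in columns]
fixed = [nan_aware_get(data, c) for c in columns]
```

The linear scan only runs when the key is NaN and the direct lookup has already failed, so ordinary keys keep their constant-time lookup.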

Expected Output

>>> df = pd.DataFrame({np.nan: [1, 2], 2: [2, 3]}, columns=[np.nan, 2])
>>> df   # nan from dict matches nan from Float64Index
  NaN    2.0
0  1     2
1  2     3

Output of pd.show_versions()

pandas 0.21.0.dev+225.gb55b1a2fe
@gfyoung added the Missing-data and Bug labels Jul 14, 2017
@gfyoung
Member

gfyoung commented Jul 14, 2017

You can actually replicate this using dict!

>>> import numpy as np
>>> s = {np.nan : 2}
>>> np.nan in s
True
>>>
>>> s = {np.float64(np.nan) : 2}
>>> np.nan in s
False

Honestly, I blame numpy for this 😄, as this is annoying to patch. On the one hand, ensuring Index makes sense, but then you can't check np.nan anymore.

The obvious candidate workaround is to write columns=[np.float64(np.nan), 2]. Perhaps what we could do is call ensure_index (or some form of it) on the columns to perform the same casting? Not really sure at this point, but let us know if the workaround works for you at least.

@kernc
Contributor Author

kernc commented Jul 14, 2017

np.float64(np.nan) wouldn't work, because each call creates a new NaN instance, and NaNs don't compare equal to one another. dict probably uses (or falls back to) identity comparison (the is operator), which succeeds for a singleton like np.nan (pd.NaT, ...). Compare:

>>> np.nan in {np.nan}
True

>>> np.float64(np.nan) in {np.float64(np.nan)}
False

>>> float('nan') in {float('nan')}
False

>>> np.nan is np.nan
True

>>> np.float64(np.nan) is np.float64(np.nan)
False

This bug came up in #16883 (comment). But there are legitimate cases for a (catch-all) nan in index (e.g. #3729).
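The identity-versus-equality behaviour described above can be sketched in plain Python. For built-in containers, the language reference describes x in y as equivalent to any(x is e or x == e for e in y); hash-based containers such as dict and set also compare hashes first. The function name contains below is only illustrative, and Python floats stand in for the NumPy objects.

```python
def contains(items, key):
    """Mimic `key in items` for built-in containers: identity first, then equality."""
    return any(item is key or item == key for item in items)


nan = float("nan")    # one object, reused: behaves like the np.nan singleton
other = float("nan")  # a second, distinct NaN object

same_object = contains([nan], nan)  # found through the identity check
distinct = contains([nan], other)   # neither identical nor equal, so not found

# The built-in set shows the same asymmetry.
builtin_same = nan in {nan}
builtin_distinct = other in {nan}
```

This is why the np.nan singleton can be found as a dict key while a freshly created NaN cannot: only the identity check can succeed, since NaN never compares equal to anything.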

@gfyoung
Member

gfyoung commented Jul 14, 2017

@kernc : Hmmm...I suspect we have the support for this indexing somewhere in the code-base, as this code works below:

>>> df = pd.DataFrame({np.float64(np.nan): [1, 2]})
>>> df[np.nan]
0    1
1    2
Name: nan, dtype: int64
>>> type(df.columns[0])
<class 'numpy.float64'>

I can't search the code-base from my phone, but I suspect if you walk through the __getitem__ logic for DataFrame, you can find the place where we reconcile the np.nan-handling, which we can then apply to initialization in your example.

@kernc
Contributor Author

kernc commented Jul 14, 2017

Yes, Index and subclasses support nan. Initialization from dict doesn't.

@gfyoung
Member

gfyoung commented Jul 14, 2017

Correct, but I'm saying that the logic for __getitem__ (not the link that you are pointing to, which is for something else) should be located, as that would tell you how the bug can be patched I imagine.

@kernc
Contributor Author

kernc commented Jul 14, 2017

Not unexpectedly, Indexes use isnull() or isnan() checks internally when constructing their indexers.
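As a rough illustration of what an explicit NaN check buys during label lookup, here is a minimal sketch. It is not how pandas builds its indexers; the function name find_label is hypothetical and a plain list stands in for the Index.

```python
import math


def _is_nan(value):
    """True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)


def find_label(labels, key):
    """Return the position of key in labels, treating all NaNs as the same label."""
    for pos, label in enumerate(labels):
        if label == key or (_is_nan(label) and _is_nan(key)):
            return pos
    raise KeyError(key)


labels = [float("nan"), 2.0]
pos_nan = find_label(labels, float("nan"))  # matched by the explicit NaN check
pos_two = find_label(labels, 2)             # matched by ordinary equality (2 == 2.0)
```

Because the NaN test is explicit, the lookup no longer depends on which NaN object the caller happens to pass.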

@gfyoung
Member

gfyoung commented Jul 14, 2017

Okay, might it be possible to use similar logic when constructing from dict? I'm trying to see (without being able to see the code-base ATM) whether whatever logic is being used to handle np.nan properly for indexing can be used internally when constructing a DataFrame from dict.

@jreback
Contributor

jreback commented Jul 14, 2017

in DataFrame._init_dict this passes thru Index, e.g.

In [1]: Index([float('nan')])
Out[1]: Float64Index([nan], dtype='float64')

In [2]: Index([np.nan])
Out[2]: Float64Index([nan], dtype='float64')

so must be missing this step somewhere.

@gfyoung
Member

gfyoung commented Jul 14, 2017

> so must be missing this step somewhere.

@jreback : confused by what you were just demonstrating.

@jreback
Contributor

jreback commented Jul 14, 2017

> @jreback : confused by what you were just demonstrating.

This problem is already solved for several cases (but not the one illustrated above) and SDF. So the existing solutions should just propagate.

@gfyoung
Member

gfyoung commented Jul 14, 2017

Ah, gotcha, you were referring to what I was saying above

@jbrockmendel added the Constructors label Jul 23, 2019
@mroeschke
Member

Looks to work on master. Could use a test:

In [51]: >>> df = pd.DataFrame({np.nan: [1, 2], 2: [2, 3]}, columns=[np.nan, 2])
    ...: >>> df   # nan from dict matches nan from Float64Index
Out[51]:
   NaN  2.0
0    1    2
1    2    3

In [52]: pd.__version__
Out[52]: '0.26.0.dev0+593.g9d45934af'
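A regression test along these lines might look like the sketch below. The function name is hypothetical and this is not necessarily the test that was eventually merged; it only checks that the dict-plus-columns construction matches the same frame built from rows.

```python
import numpy as np
import pandas as pd


def test_constructor_dict_nan_key_and_columns():
    # GH 16894: a NaN key in the dict should match a NaN entry in `columns`.
    result = pd.DataFrame({np.nan: [1, 2], 2: [2, 3]}, columns=[np.nan, 2])
    # The same data built row by row, so no dict lookup of NaN is involved.
    expected = pd.DataFrame([[1, 2], [2, 3]], columns=[np.nan, 2])
    pd.testing.assert_frame_equal(result, expected)
```

Building the expected frame from a list of rows keeps the comparison independent of the dict code path under test.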

@mroeschke added the good first issue and Needs Tests labels and removed the Bug, Constructors, and Missing-data labels Oct 21, 2019
@jreback added this to the 1.1 milestone Jan 21, 2020

5 participants