assert_frame_equal not differentiating NaN and None within object dtype #18463

WillAyd · 2017-11-24T03:01:52Z

Code Sample, a copy-pastable example if possible

df1 = pd.DataFrame([['foo', 'bar', 'baz'], [None, None, None]])
df2 = pd.DataFrame([['foo', 'bar', 'baz'], [np.nan, np.nan, np.nan]])
tm.assert_frame_equal(df1, df2)

Problem description

No AssertionError gets raised in this case, even though I would not expect None and np.nan to be considered equal values (I found this while testing #18450)

Note that if you simply compared a DataFrame with only np.nan with another containing only None you would get an AttributeError that the dtypes are different (float64 vs object) but because the missing values are mixed into object dtypes here I think that differentiation gets lost

Expected Output

AssertionError: DataFrame.iloc[1, :] are different

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

commit: d64995a
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0.dev0+205.gd64995a4f
pytest: 3.2.5
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.26
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2017-11-24T09:31:38Z

Generally None and np.nan are considered equivalent in pandas, both are seen as missing value. For other types of dtypes the None will on construction be converted to NaN, so you don't have this issue in such a case:

In [19]: pd.Series([1, 2, None])
Out[19]: 
0    1.0
1    2.0
2    NaN
dtype: float64

and for example the equals method also sees both frames in your example as equal:

In [18]: df1.equals(df2)
Out[18]: True

That said, to be able to test we are actually retuning NaN and not None in the issue you referenced, I agree it is annoying that you don't have a way to discriminate between both in assert_frame_equal.

For the PR as a workaround now, you can get the values of that column and assert that those are equal (eg np.testing.assert_array_equal(df1[0].values, df2[0].values) differentiates between None and np.nan)

jreback · 2017-11-24T15:59:58Z

I would add a kwarg to assert_*, maybe check_null='strict' or something. we almost always want 'strict' meaning not only is isnull(a) == isnull(b), but that np.isnan(a) & np.isnan(b).

I don't think working around this is a good idea, let's actually fix the testing.

WillAyd · 2017-11-24T19:36:04Z

@jreback do you know why assert_frame_equal doesn't use the NDFrame.equals method? There's a comment preceding that method that mentions it should.

The reason I'm asking is because while we could add a kwarg to the assert_frame_equal method to control the null checking behavior and do it in a strict manner by default, as @jorisvandenbossche pointed out that would be inconsistent with how the equals method of the NDFrame works. I would think we would either want to make that change in both comparison methods, or attempt to refactor assert_frame_equal to match NDFrame as the comment suggests

FWIW if we wanted to modify the behavior of the NDFrame.equals we'd probably want to add strict_nan=True to the following line:

pandas/pandas/core/internals.py

Line 1532 in aec3347

return array_equivalent(self.values, other.values)

The array_equivalent method might also need some tweaks, as providing that argument is raising a ValueError for me, but nonetheless it seems like the intention is already in place there to handle this

jreback · 2017-11-24T20:12:23Z

assert_frame_equal is NOT the same, it gives excruciating detail on where the error is (this is the entire point of this). .equals is just a user-facing shortcut. Under the hood these both use array_equivalent as the comparisons. I don't see a problem with adding things to the test methods, these are meant to be very fine grained.

so for example, .equals on the example (which compares None/np.nan) should fail. These are not exactly equal. We should not add things to .equals (as this opens a big can of worms).

jreback · 2017-11-24T20:14:08Z

from above, this is actually wrong and we should strictly check this.

In [18]: df1.equals(df2)
Out[18]: True

TomAugspurger · 2020-10-09T15:50:36Z

I think this also affects pd.NA.

In [45]: a = pd.Series([1, np.nan], dtype=object)

In [46]: b = pd.Series([1, pd.NA], dtype=object)

In [47]: pd.testing.assert_series_equal(a, b)

A keyword to control this would be great. Based on the recent experience with check_freq, we should perhaps not make things strict by default. But we could do non-strict + a warning by default, with the option to override in the function or via a global option.

def assert_series_equal(..., check_na=None):
    if check_na is None:
        check_na = pandas.config.get("testing.check_na")
    if check_na is None;
        warnings.warn("default to false, changing to strict in the future")
        check_na = False

jbrockmendel · 2023-03-16T19:38:04Z

In _libs.testing.assert_almost_equal we have

    if checknull(a) and checknull(b):
        # TODO: Should require same-dtype NA?
        # nan / None comparison
        return True

Changing that to

if checknull(a):
    # nan / None comparison
    if is_matching_na(a, b):
        return True
    raise AssertionError(f"{a} != {b}")
elif checknull(b):
    raise AssertionError(f"{a} != {b}")

breaks 431 tests. 224 of those are in tests/strings/, 76 in tests/arithmetic/

is_matching_na has a nan_matches_none keyword we could use to be less strict specifically in the None vs np.nan case while still catching e.g. np.timedelta64("nat") vs pd.NaT (relevant in the tests/arithmetic/ cases). Passing that gets us down to 308 failing tests.

I definitely think we should become stricter here. As Tom mentioned when we made check_freq stricter, doing this probably needs a deprecation cycle.

WillAyd changed the title ~~assert_frame_equal not differentiating NaN and None with object dtype~~ assert_frame_equal not differentiating NaN and None within object dtype Nov 24, 2017

jorisvandenbossche added the Testing pandas testing functions or related to the test suite label Nov 24, 2017

WillAyd mentioned this issue Jul 26, 2019

assert_series_equal doesn't distinguish between different NA values #27613

Closed

mroeschke added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels May 22, 2020

TomAugspurger mentioned this issue Oct 9, 2020

Fix test_unstack #36987

Merged

5 tasks

ghost mentioned this issue Aug 11, 2021

BUG GH#40498 Fillna other missing values not modified #42981

Closed

4 tasks

jbrockmendel mentioned this issue Mar 20, 2023

DEPR: be stricter in assert_almost_equal #52081

Merged

5 tasks

yifanwu-kolena mentioned this issue Jan 23, 2024

chore: preserve None in the system kolenaIO/kolena#472

Merged

3 tasks

This was referenced Mar 29, 2024

TST: compat with pandas 3.0 for strict null check in assert_frame_equal geopandas/geopandas#3234

Merged

Add strict_na keyword to the assert_.._equal methods for object dtype to help with deprecation #58072

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assert_frame_equal not differentiating NaN and None within object dtype #18463

assert_frame_equal not differentiating NaN and None within object dtype #18463

WillAyd commented Nov 24, 2017 •

edited

Loading

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

jorisvandenbossche commented Nov 24, 2017

jreback commented Nov 24, 2017

WillAyd commented Nov 24, 2017

jreback commented Nov 24, 2017 •

edited

Loading

jreback commented Nov 24, 2017

TomAugspurger commented Oct 9, 2020

jbrockmendel commented Mar 16, 2023

assert_frame_equal not differentiating NaN and None within object dtype #18463

assert_frame_equal not differentiating NaN and None within object dtype #18463

Comments

WillAyd commented Nov 24, 2017 • edited Loading

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line] INSTALLED VERSIONS

jorisvandenbossche commented Nov 24, 2017

jreback commented Nov 24, 2017

WillAyd commented Nov 24, 2017

jreback commented Nov 24, 2017 • edited Loading

jreback commented Nov 24, 2017

TomAugspurger commented Oct 9, 2020

jbrockmendel commented Mar 16, 2023

WillAyd commented Nov 24, 2017 •

edited

Loading

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

jreback commented Nov 24, 2017 •

edited

Loading