Skip to content

assert_frame_equal not differentiating NaN and None within object dtype #18463

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
WillAyd opened this issue Nov 24, 2017 · 7 comments
Open
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Testing pandas testing functions or related to the test suite

Comments

@WillAyd
Copy link
Member

WillAyd commented Nov 24, 2017

Code Sample, a copy-pastable example if possible

df1 = pd.DataFrame([['foo', 'bar', 'baz'], [None, None, None]])
df2 = pd.DataFrame([['foo', 'bar', 'baz'], [np.nan, np.nan, np.nan]])
tm.assert_frame_equal(df1, df2)

Problem description

No AssertionError gets raised in this case, even though I would not expect None and np.nan to be considered equal values (I found this while testing #18450)

Note that if you simply compared a DataFrame with only np.nan with another containing only None you would get an AttributeError that the dtypes are different (float64 vs object) but because the missing values are mixed into object dtypes here I think that differentiation gets lost

Expected Output

AssertionError: DataFrame.iloc[1, :] are different

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: d64995a
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0.dev0+205.gd64995a4f
pytest: 3.2.5
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.26
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd WillAyd changed the title assert_frame_equal not differentiating NaN and None with object dtype assert_frame_equal not differentiating NaN and None within object dtype Nov 24, 2017
@jorisvandenbossche
Copy link
Member

Generally None and np.nan are considered equivalent in pandas, both are seen as missing value. For other types of dtypes the None will on construction be converted to NaN, so you don't have this issue in such a case:

In [19]: pd.Series([1, 2, None])
Out[19]: 
0    1.0
1    2.0
2    NaN
dtype: float64

and for example the equals method also sees both frames in your example as equal:

In [18]: df1.equals(df2)
Out[18]: True

That said, to be able to test we are actually retuning NaN and not None in the issue you referenced, I agree it is annoying that you don't have a way to discriminate between both in assert_frame_equal.

For the PR as a workaround now, you can get the values of that column and assert that those are equal (eg np.testing.assert_array_equal(df1[0].values, df2[0].values) differentiates between None and np.nan)

@jorisvandenbossche jorisvandenbossche added the Testing pandas testing functions or related to the test suite label Nov 24, 2017
@jreback
Copy link
Contributor

jreback commented Nov 24, 2017

I would add a kwarg to assert_*, maybe check_null='strict' or something. we almost always want 'strict' meaning not only is isnull(a) == isnull(b), but that np.isnan(a) & np.isnan(b).

I don't think working around this is a good idea, let's actually fix the testing.

@WillAyd
Copy link
Member Author

WillAyd commented Nov 24, 2017

@jreback do you know why assert_frame_equal doesn't use the NDFrame.equals method? There's a comment preceding that method that mentions it should.

The reason I'm asking is because while we could add a kwarg to the assert_frame_equal method to control the null checking behavior and do it in a strict manner by default, as @jorisvandenbossche pointed out that would be inconsistent with how the equals method of the NDFrame works. I would think we would either want to make that change in both comparison methods, or attempt to refactor assert_frame_equal to match NDFrame as the comment suggests

FWIW if we wanted to modify the behavior of the NDFrame.equals we'd probably want to add strict_nan=True to the following line:

return array_equivalent(self.values, other.values)

The array_equivalent method might also need some tweaks, as providing that argument is raising a ValueError for me, but nonetheless it seems like the intention is already in place there to handle this

@jreback
Copy link
Contributor

jreback commented Nov 24, 2017

assert_frame_equal is NOT the same, it gives excruciating detail on where the error is (this is the entire point of this). .equals is just a user-facing shortcut. Under the hood these both use array_equivalent as the comparisons. I don't see a problem with adding things to the test methods, these are meant to be very fine grained.

so for example, .equals on the example (which compares None/np.nan) should fail. These are not exactly equal. We should not add things to .equals (as this opens a big can of worms).

@jreback
Copy link
Contributor

jreback commented Nov 24, 2017

from above, this is actually wrong and we should strictly check this.

In [18]: df1.equals(df2)
Out[18]: True

@mroeschke mroeschke added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels May 22, 2020
@TomAugspurger TomAugspurger mentioned this issue Oct 9, 2020
5 tasks
@TomAugspurger
Copy link
Contributor

I think this also affects pd.NA.

In [45]: a = pd.Series([1, np.nan], dtype=object)

In [46]: b = pd.Series([1, pd.NA], dtype=object)

In [47]: pd.testing.assert_series_equal(a, b)

A keyword to control this would be great. Based on the recent experience with check_freq, we should perhaps not make things strict by default. But we could do non-strict + a warning by default, with the option to override in the function or via a global option.

def assert_series_equal(..., check_na=None):
    if check_na is None:
        check_na = pandas.config.get("testing.check_na")
    if check_na is None;
        warnings.warn("default to false, changing to strict in the future")
        check_na = False

@jbrockmendel
Copy link
Member

In _libs.testing.assert_almost_equal we have

    if checknull(a) and checknull(b):
        # TODO: Should require same-dtype NA?
        # nan / None comparison
        return True

Changing that to

if checknull(a):
    # nan / None comparison
    if is_matching_na(a, b):
        return True
    raise AssertionError(f"{a} != {b}")
elif checknull(b):
    raise AssertionError(f"{a} != {b}")

breaks 431 tests. 224 of those are in tests/strings/, 76 in tests/arithmetic/

is_matching_na has a nan_matches_none keyword we could use to be less strict specifically in the None vs np.nan case while still catching e.g. np.timedelta64("nat") vs pd.NaT (relevant in the tests/arithmetic/ cases). Passing that gets us down to 308 failing tests.

I definitely think we should become stricter here. As Tom mentioned when we made check_freq stricter, doing this probably needs a deprecation cycle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Testing pandas testing functions or related to the test suite
Projects
None yet
Development

No branches or pull requests

6 participants