pd.testing.assert_frame_equal check_like not working like expected #22052

Closed
lassebenni opened this issue Jul 25, 2018 · 3 comments · Fixed by #22106
Comments

@lassebenni

lassebenni commented Jul 25, 2018

Code Sample

import pandas as pd

pd.testing.assert_frame_equal(
    pd.DataFrame([{'filename': 'a'}, {'filename': 'b'}]),
    pd.DataFrame([{'filename': 'b'}, {'filename': 'a'}]),
    check_like=True)

AssertionError: DataFrame.iloc[:, 0] are different

DataFrame.iloc[:, 0] values are different (100.0 %)
[left]:  [a, b]
[right]: [b, a]

Problem description

According to the documentation (version 0.23.3), pandas.testing.assert_frame_equal takes a "check_like" parameter, which can be set to True if the function should ignore the order of columns and rows:

check_like : bool, default False
If true, ignore the order of rows & columns

This does not work as I expect it to. When I create two DataFrames from lists of dicts that hold the same rows in a different order, the assertion fails because of the row order.

Expected Output

I would expect the test to pass.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.93-linuxkit-aufs
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.3
pytest: 3.6.2
pip: 9.0.1
setuptools: 33.1.1
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

The order doesn't matter, but the same labels need to be with the same data. e.g.

In [31]: a = pd.DataFrame([{'filename':'a'}, {'filename': 'b'}], index=['a', 'b'])

In [32]: b = pd.DataFrame([{'filename':'b'}, {'filename': 'a'}], index=['b', 'a'])

In [33]: pd.util.testing.assert_frame_equal(a, b)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-33-302d0eeab6d2> in <module>()
----> 1 pd.util.testing.assert_frame_equal(a, b)

~/Envs/dask-dev/lib/python3.6/site-packages/pandas/util/testing.py in assert_frame_equal(left, right, check_dtype, check_index_type, check_column_type, check_frame_type, check_less_precise, check_names, by_blocks, check_exact, check_datetimelike_compat, check_categorical, check_like, obj)
   1330                        check_exact=check_exact,
   1331                        check_categorical=check_categorical,
-> 1332                        obj='{obj}.index'.format(obj=obj))
   1333
   1334     # column comparison

~/Envs/dask-dev/lib/python3.6/site-packages/pandas/util/testing.py in assert_index_equal(left, right, exact, check_names, check_less_precise, check_exact, check_categorical, obj)
    858                                      check_less_precise=check_less_precise,
    859                                      check_dtype=exact,
--> 860                                      obj=obj, lobj=left, robj=right)
    861
    862     # metadata comparison

pandas/_libs/testing.pyx in pandas._libs.testing.assert_almost_equal()

pandas/_libs/testing.pyx in pandas._libs.testing.assert_almost_equal()

~/Envs/dask-dev/lib/python3.6/site-packages/pandas/util/testing.py in raise_assert_detail(obj, message, left, right, diff)
   1033         msg += "\n[diff]: {diff}".format(diff=diff)
   1034
-> 1035     raise AssertionError(msg)
   1036
   1037

AssertionError: DataFrame.index are different

DataFrame.index values are different (100.0 %)
[left]:  Index(['a', 'b'], dtype='object')
[right]: Index(['b', 'a'], dtype='object')

But this passes

In [34]: pd.util.testing.assert_frame_equal(a, b, check_like=True)
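
The distinction above can be reproduced in a short script. This is a minimal sketch, assuming a pandas version that exposes pd.testing.assert_frame_equal; the names c, d and raised are introduced here for illustration only. With explicit labels, check_like=True aligns the rows by label before comparing. With the default integer index, label 0 points at 'a' in one frame and at 'b' in the other, so the values really do differ.

```python
import pandas as pd

# Same label -> value pairs in both frames, only the row order differs.
a = pd.DataFrame({'filename': ['a', 'b']}, index=['a', 'b'])
b = pd.DataFrame({'filename': ['b', 'a']}, index=['b', 'a'])

# Passes: rows are matched up by index label before values are compared.
pd.testing.assert_frame_equal(a, b, check_like=True)

# Frames from the original report: both get the default index 0, 1,
# so label 0 holds 'a' in c but 'b' in d.
c = pd.DataFrame({'filename': ['a', 'b']})
d = pd.DataFrame({'filename': ['b', 'a']})

try:
    pd.testing.assert_frame_equal(c, d, check_like=True)
    raised = False
except AssertionError:
    # Expected: aligning by label changes nothing, the values differ.
    raised = True
```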

@TomAugspurger
Contributor

@lassebenni could you make a PR updating the docstring of assert_frame_equal to clarify this?

@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Jul 26, 2018
@lassebenni
Author

lassebenni commented Jul 26, 2018

@TomAugspurger

Thank you for the quick response!

I am not very familiar with pandas DataFrames, so I missed the part about explicitly defining an index for the data.

My use case is as follows: I have some code that applies transformations to Spark DataFrames. For testing purposes I want to compare the expected result to the actual result. To do this, I convert the Spark DataFrames to pandas DataFrames with df.toPandas() and then compare the two: pd.testing.assert_frame_equal(expected, actual, check_like=True).

Since the transformations may produce rows and columns in a different order than expected, I was hoping that check_like=True would handle the differences without me having to sort the resulting DataFrame by column(s).

But it seems that I will have to sort the DataFrame either way, since the index labels attached to the values have to match:

In [31]: a = pd.DataFrame([{'filename':'a'}, {'filename': 'b'}], index=['a', 'b'])

In [32]: b = pd.DataFrame([{'filename':'b'}, {'filename': 'a'}], index=['b', 'a'])

In short, I hoped that:

a = pd.DataFrame([{'filename':'a'}, {'filename': 'b'}], index=['a', 'b'])
b = create_dataframe_transformation('filename', ['b', 'a'])

pd.testing.assert_frame_equal(a, b, check_like=True)

would pass, without having to do:

a = a.sort_values('filename')
b = b.sort_values('filename')

TL;DR: for check_like to work, each value must carry the same index label in both frames; only the order of the labels may differ.
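
For frames whose index carries no meaning (for example, the default integer index produced by df.toPandas()), one possible workaround is to sort both frames by their columns and discard the old index before comparing. The following is a rough sketch, not part of pandas: the helper name assert_frame_equal_ignore_row_order is hypothetical, and it assumes both frames have the same column labels and that the column values are sortable.

```python
import pandas as pd


def assert_frame_equal_ignore_row_order(left, right, **kwargs):
    """Hypothetical helper: compare two frames while ignoring row order.

    Assumes both frames share the same column labels and that the
    values in those columns can be sorted.
    """
    sort_cols = list(left.columns)
    # Sort rows by all columns, then drop the old index so that the
    # labels 0..n-1 line up with the sorted positions in both frames.
    left_sorted = left.sort_values(sort_cols).reset_index(drop=True)
    right_sorted = right.sort_values(sort_cols).reset_index(drop=True)
    # check_like=True additionally tolerates a different column order.
    pd.testing.assert_frame_equal(
        left_sorted, right_sorted, check_like=True, **kwargs)


expected = pd.DataFrame([{'filename': 'a'}, {'filename': 'b'}])
actual = pd.DataFrame([{'filename': 'b'}, {'filename': 'a'}])

# Passes: both frames contain the same rows once sorted.
assert_frame_equal_ignore_row_order(expected, actual)
```

Note that sort_values on its own is not enough here, because each row keeps its original index label after sorting; reset_index(drop=True) is what makes the labels agree.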

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Jul 31, 2018
5 participants