Skip to content

ENH: raise_assert_detail only shows the difference between the columns #48033

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

ENH: raise_assert_detail only shows the difference between the columns #48033

wants to merge 3 commits into from

Conversation

savente93
Copy link

@savente93 savente93 commented Aug 11, 2022

I didn't receive feedback on whether or not it was okay to make this the default behavior but doing so minimizes the code impact and is a reasonable default in my opinion so I'm submitting it like this. If this needs to e changed feel free to let me know.

@savente93
Copy link
Author

I'll be honest aside from the linting one I have no idea why these checks failed. A lot of them seem to fail on parts that I haven't even touched. Could anyone help me understand what's going wrong here?

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi - I think the failures are related, e.g.

2022-08-11T07:40:59.3001588Z ____________________ TestPandasContainer.test_misc_example _____________________
2022-08-11T07:40:59.3002059Z [gw0] linux -- Python 3.8.13 /home/runner/micromamba/envs/test/bin/python
2022-08-11T07:40:59.3002267Z 
2022-08-11T07:40:59.3002488Z self = <pandas.tests.io.json.test_pandas.TestPandasContainer object at 0x7fc11b5f91c0>
2022-08-11T07:40:59.3002742Z 
2022-08-11T07:40:59.3002854Z         def test_misc_example(self):
2022-08-11T07:40:59.3003085Z     
2022-08-11T07:40:59.3003321Z             # parsing unordered input fails
2022-08-11T07:40:59.3003717Z             result = read_json('[{"a": 1, "b": 2}, {"b":2, "a" :1}]', numpy=True)
2022-08-11T07:40:59.3004050Z             expected = DataFrame([[1, 2], [1, 2]], columns=["a", "b"])
2022-08-11T07:40:59.3004304Z     
2022-08-11T07:40:59.3004687Z             error_msg = """DataFrame\\.index are different
2022-08-11T07:40:59.3004937Z     
2022-08-11T07:40:59.3005192Z     DataFrame\\.index values are different \\(100\\.0 %\\)
2022-08-11T07:40:59.3005602Z     \\[left\\]:  Index\\(\\['a', 'b'\\], dtype='object'\\)
2022-08-11T07:40:59.3005942Z     \\[right\\]: RangeIndex\\(start=0, stop=2, step=1\\)"""
2022-08-11T07:40:59.3006290Z             with pytest.raises(AssertionError, match=error_msg):
2022-08-11T07:40:59.3006656Z >               tm.assert_frame_equal(result, expected, check_index_type=False)
2022-08-11T07:40:59.3006865Z 
2022-08-11T07:40:59.3006994Z pandas/tests/io/json/test_pandas.py:978: 
2022-08-11T07:40:59.3007285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2022-08-11T07:40:59.3007619Z pandas/_libs/testing.pyx:52: in pandas._libs.testing.assert_almost_equal
2022-08-11T07:40:59.3007943Z     cpdef assert_almost_equal(a, b,
2022-08-11T07:40:59.3008282Z pandas/_libs/testing.pyx:167: in pandas._libs.testing.assert_almost_equal
2022-08-11T07:40:59.3008660Z     raise_assert_detail(obj, msg, lobj, robj, index_values=index_values)
2022-08-11T07:40:59.3008984Z pandas/util/_decorators.py:317: in wrapper
2022-08-11T07:40:59.3009258Z     return func(*args, **kwargs)
2022-08-11T07:40:59.3009531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2022-08-11T07:40:59.3009691Z 
2022-08-11T07:40:59.3009782Z self =   left  right
2022-08-11T07:40:59.3009987Z 0    a      0
2022-08-11T07:40:59.3010223Z 1    b      1, keys = [None], drop = True
2022-08-11T07:40:59.3010523Z append = False, inplace = False, verify_integrity = False
2022-08-11T07:40:59.3010714Z 
2022-08-11T07:40:59.6463857Z     @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "keys"])
2022-08-11T07:40:59.6464256Z     def set_index(
2022-08-11T07:40:59.6464502Z         self,
2022-08-11T07:40:59.6464734Z         keys,
2022-08-11T07:40:59.6464981Z         drop: bool = True,
2022-08-11T07:40:59.6465259Z         append: bool = False,
2022-08-11T07:40:59.6465897Z         inplace: bool = False,
2022-08-11T07:40:59.6466177Z         verify_integrity: bool = False,
2022-08-11T07:40:59.6466730Z     ) -> DataFrame | None:
2022-08-11T07:40:59.6466994Z         """
2022-08-11T07:40:59.6467404Z         Set the DataFrame index using existing columns.
2022-08-11T07:40:59.6467703Z     
2022-08-11T07:40:59.6468018Z         Set the DataFrame index (row labels) using one or more existing
2022-08-11T07:40:59.6468441Z         columns or arrays (of the correct length). The index can replace the
2022-08-11T07:40:59.6468807Z         existing index or expand on it.
2022-08-11T07:40:59.6469071Z     
2022-08-11T07:40:59.6469299Z         Parameters
2022-08-11T07:40:59.6469604Z         ----------
2022-08-11T07:40:59.6470000Z         keys : label or array-like or list of labels/arrays
2022-08-11T07:40:59.6470401Z             This parameter can be either a single column key, a single array of
2022-08-11T07:40:59.6470830Z             the same length as the calling DataFrame, or a list containing an
2022-08-11T07:40:59.6471255Z             arbitrary combination of column keys and arrays. Here, "array"
2022-08-11T07:40:59.6471671Z             encompasses :class:`Series`, :class:`Index`, ``np.ndarray``, and
2022-08-11T07:40:59.6472068Z             instances of :class:`~collections.abc.Iterator`.
2022-08-11T07:40:59.6472407Z         drop : bool, default True
2022-08-11T07:40:59.6472728Z             Delete columns to be used as the new index.
2022-08-11T07:40:59.6473047Z         append : bool, default False
2022-08-11T07:40:59.6473385Z             Whether to append columns to existing index.
2022-08-11T07:40:59.6473711Z         inplace : bool, default False
2022-08-11T07:40:59.6474075Z             Whether to modify the DataFrame rather than creating a new one.
2022-08-11T07:40:59.6474445Z         verify_integrity : bool, default False
2022-08-11T07:40:59.6474829Z             Check the new index for duplicates. Otherwise defer the check until
2022-08-11T07:40:59.6475269Z             necessary. Setting to False will improve the performance of this
2022-08-11T07:40:59.6475598Z             method.
2022-08-11T07:40:59.6475823Z     
2022-08-11T07:40:59.6476041Z         Returns
2022-08-11T07:40:59.6476314Z         -------
2022-08-11T07:40:59.6476566Z         DataFrame or None
2022-08-11T07:40:59.6476908Z             Changed row labels or None if ``inplace=True``.
2022-08-11T07:40:59.6477179Z     
2022-08-11T07:40:59.6477396Z         See Also
2022-08-11T07:40:59.6477669Z         --------
2022-08-11T07:40:59.6477978Z         DataFrame.reset_index : Opposite of set_index.
2022-08-11T07:40:59.6478375Z         DataFrame.reindex : Change to new indices or expand indices.
2022-08-11T07:40:59.6478801Z         DataFrame.reindex_like : Change to same indices as other DataFrame.
2022-08-11T07:40:59.6479122Z     
2022-08-11T07:40:59.6479341Z         Examples
2022-08-11T07:40:59.6479618Z         --------
2022-08-11T07:40:59.6480002Z         >>> df = pd.DataFrame({'month': [1, 4, 7, 10],
2022-08-11T07:40:59.6480431Z         ...                    'year': [2012, 2014, 2013, 2014],
2022-08-11T07:40:59.6480820Z         ...                    'sale': [55, 40, 84, 31]})
2022-08-11T07:40:59.6481081Z         >>> df
2022-08-11T07:40:59.6481326Z            month  year  sale
2022-08-11T07:40:59.6481572Z         0      1  2012    55
2022-08-11T07:40:59.6481814Z         1      4  2014    40
2022-08-11T07:40:59.6482054Z         2      7  2013    84
2022-08-11T07:40:59.6482296Z         3     10  2014    31
2022-08-11T07:40:59.6482522Z     
2022-08-11T07:40:59.6482878Z         Set the index to become the 'month' column:
2022-08-11T07:40:59.6483160Z     
2022-08-11T07:40:59.6483462Z         >>> df.set_index('month')
2022-08-11T07:40:59.6483727Z                year  sale
2022-08-11T07:40:59.6483968Z         month
2022-08-11T07:40:59.6484198Z         1      2012    55
2022-08-11T07:40:59.6484980Z         4      2014    40
2022-08-11T07:40:59.6485228Z         7      2013    84
2022-08-11T07:40:59.6485473Z         10     2014    31
2022-08-11T07:40:59.6485910Z     
2022-08-11T07:40:59.6486314Z         Create a MultiIndex using columns 'year' and 'month':
2022-08-11T07:40:59.6486615Z     
2022-08-11T07:40:59.6486943Z         >>> df.set_index(['year', 'month'])
2022-08-11T07:40:59.6487295Z                     sale
2022-08-11T07:40:59.6487548Z         year  month
2022-08-11T07:40:59.6487795Z         2012  1     55
2022-08-11T07:40:59.6488034Z         2014  4     40
2022-08-11T07:40:59.6488274Z         2013  7     84
2022-08-11T07:40:59.6488506Z         2014  10    31
2022-08-11T07:40:59.6488732Z     
2022-08-11T07:40:59.6489023Z         Create a MultiIndex using an Index and a column:
2022-08-11T07:40:59.6489310Z     
2022-08-11T07:40:59.6489666Z         >>> df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
2022-08-11T07:40:59.6489961Z                  month  sale
2022-08-11T07:40:59.6490212Z            year
2022-08-11T07:40:59.6490449Z         1  2012  1      55
2022-08-11T07:40:59.6490694Z         2  2014  4      40
2022-08-11T07:40:59.6490940Z         3  2013  7      84
2022-08-11T07:40:59.6491180Z         4  2014  10     31
2022-08-11T07:40:59.6491404Z     
2022-08-11T07:40:59.6491672Z         Create a MultiIndex using two Series:
2022-08-11T07:40:59.6491943Z     
2022-08-11T07:40:59.6492191Z         >>> s = pd.Series([1, 2, 3, 4])
2022-08-11T07:40:59.6492480Z         >>> df.set_index([s, s**2])
2022-08-11T07:40:59.6492754Z               month  year  sale
2022-08-11T07:40:59.6492997Z         1 1       1  2012    55
2022-08-11T07:40:59.6493253Z         2 4       4  2014    40
2022-08-11T07:40:59.6493497Z         3 9       7  2013    84
2022-08-11T07:40:59.6493744Z         4 16     10  2014    31
2022-08-11T07:40:59.6493984Z         """
2022-08-11T07:40:59.6494288Z         inplace = validate_bool_kwarg(inplace, "inplace")
2022-08-11T07:40:59.6494673Z         self._check_inplace_and_allows_duplicate_labels(inplace)
2022-08-11T07:40:59.6495024Z         if not isinstance(keys, list):
2022-08-11T07:40:59.6495304Z             keys = [keys]
2022-08-11T07:40:59.6495546Z     
2022-08-11T07:40:59.6495769Z         err_msg = (
2022-08-11T07:40:59.6496206Z             'The parameter "keys" may be a column key, one-dimensional '
2022-08-11T07:40:59.6496621Z             "array, or a list containing only valid column keys and "
2022-08-11T07:40:59.6497033Z             "one-dimensional arrays."
2022-08-11T07:40:59.6497300Z         )
2022-08-11T07:40:59.6497500Z     
2022-08-11T07:40:59.6497752Z         missing: list[Hashable] = []
2022-08-11T07:40:59.6498032Z         for col in keys:
2022-08-11T07:40:59.6498389Z             if isinstance(col, (Index, Series, np.ndarray, list, abc.Iterator)):
2022-08-11T07:40:59.6498900Z                 # arrays are fine as long as they are one-dimensional
2022-08-11T07:40:59.6499267Z                 # iterators get converted to list below
2022-08-11T07:40:59.6499597Z                 if getattr(col, "ndim", 1) != 1:
2022-08-11T07:40:59.6499909Z                     raise ValueError(err_msg)
2022-08-11T07:40:59.6500183Z             else:
2022-08-11T07:40:59.6500493Z                 # everything else gets tried as a key; see GH 24969
2022-08-11T07:40:59.6500794Z                 try:
2022-08-11T07:40:59.6501073Z                     found = col in self.columns
2022-08-11T07:40:59.6501387Z                 except TypeError as err:
2022-08-11T07:40:59.6501680Z                     raise TypeError(
2022-08-11T07:40:59.6502018Z                         f"{err_msg}. Received column of type {type(col)}"
2022-08-11T07:40:59.6502333Z                     ) from err
2022-08-11T07:40:59.6502568Z                 else:
2022-08-11T07:40:59.6502822Z                     if not found:
2022-08-11T07:40:59.6503106Z                         missing.append(col)
2022-08-11T07:40:59.6503365Z     
2022-08-11T07:40:59.6503585Z         if missing:
2022-08-11T07:40:59.6503905Z >           raise KeyError(f"None of {missing} are in the columns")
2022-08-11T07:40:59.6504360Z E           KeyError: 'None of [None] are in the columns'
2022-08-11T07:40:59.6504673Z 
2022-08-11T07:40:59.6504808Z pandas/core/frame.py:5968: KeyError

looks related to changes in raise_assert_detail.

The current behaviour for passing tests shouldn't change

@mroeschke mroeschke added the Testing pandas testing functions or related to the test suite label Aug 11, 2022
@pep8speaks
Copy link

pep8speaks commented Aug 12, 2022

Hello @savente93! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-08-14 07:56:35 UTC

@savente93
Copy link
Author

@MarcoGorelli Thanks for your direction earlier. I've reverted the original behavior and added a conditional flag for this and added a test. Most of the CIs pass now but a lot of them also fail because of this error:

 Error: The runner has received a shutdown signal...' in GHA. GH 45651

I'll be honest I have no idea what this is and have also been unable to find any information on it. Have you seen this before perhaps? Mind giving me another hint? Thanks!

@MarcoGorelli
Copy link
Member

Hey - I haven't had a chance to review the PR yet, but that error looks unrelated, if you fetch and merge upstream/main and push again then I presume it'll go away

@savente93
Copy link
Author

it did indeed, thanks so much!

@savente93 savente93 requested a review from MarcoGorelli August 14, 2022 09:31
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR, just had a closer look

Regarding the API, I'd suggest not adding an extra argument, but rather an extra line with the first diff, like pytest does, e.g.
if we do

    l1 = [1,2,3, 5, 6]
    l2 = [1,2,4, 5, 7]
    assert l1 == l2

then the output from pytest will show

At index 2 diff: 3 != 4

Also, the comparison shouldn't use !=, but consider atol and rtol.

I think the simplest way to do this might be, in

for i in range(len(a)):
try:
assert_almost_equal(a[i], b[i], rtol=rtol, atol=atol)
except AssertionError:
is_unequal = True
diff += 1

to capture the first index (and first elements) where a and b differ, pass that to raise_assert_detail, and append it to msg

@savente93
Copy link
Author

I agree that atol and rtol should be used, I'll take a look at that later. However, I would really strongly prefer the current solution if at all possible for a couple of reasons:

  1. The solution you propose doesn't actually solve the problem in my use case. It's not just that the statement doesn't print the actual difference but also that it prints so a massive amount of redundant information that the output becomes unreadable. The current solution would solve both of these problems
  2. the other problem is that as far as I can tell adding these things to the default behavior would break all the current tests which you instructed me to not do previously.
  3. this is a subjective point I'll fully admit but I find that pytest notation very intuitive to read and it often trips me up. I find the way it currently displays things much clearer so I'd like to keep the current way if at all possible.

I don't mean to be difficult, but for the reasons above I'd like to keep the current solution if possible. Is there a reason you're opposed to adding an argument?

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Aug 14, 2022

You can already hide the output by setting max_seq_items to something small, like 1 e.g.

with pd.option_context('display.max_seq_items', 1):
    assert_frame_equal(df1,df2)

(as an aside - it looks like setting 0 doesn't work, so that's a separate bug, I'll open an issue for this)

Then, with the solution I'm suggesting, you'd get

DataFrame.iloc[:, 0] (column name="a") values are different (0.27397 %)
[index]: [2022-01-01T00:00:00.000000000, ...]
[left]:  [1.0, ...]
[right]: [1.0, ...]
At index 2022-12-31T00:00:00.000000000 diff: 1.0 != 0.0

As for adding an extra argument - it adds complexity, you'd need extra testing to check how it interacts with all the other arguments already there (e.g. currently, it wouldn't interact well with rtol and atol).


I'm open to making the current output more readable, especially in your case when the elements themselves are quite long, but I do think it's a separate issue to only showing elements where the frames differ

@savente93
Copy link
Author

If that's your position then I don't think this PR is actually the right way to go about it, so I'll be closing it. Thanks for your consideration and help along the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Allowing more control over the asserion printing format
4 participants