-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: raise_assert_detail only shows the difference between the columns #48033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'll be honest aside from the linting one I have no idea why these checks failed. A lot of them seem to fail on parts that I haven't even touched. Could anyone help me understand what's going wrong here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi - I think the failures are related, e.g.
2022-08-11T07:40:59.3001588Z ____________________ TestPandasContainer.test_misc_example _____________________
2022-08-11T07:40:59.3002059Z [gw0] linux -- Python 3.8.13 /home/runner/micromamba/envs/test/bin/python
2022-08-11T07:40:59.3002267Z
2022-08-11T07:40:59.3002488Z self = <pandas.tests.io.json.test_pandas.TestPandasContainer object at 0x7fc11b5f91c0>
2022-08-11T07:40:59.3002742Z
2022-08-11T07:40:59.3002854Z def test_misc_example(self):
2022-08-11T07:40:59.3003085Z
2022-08-11T07:40:59.3003321Z # parsing unordered input fails
2022-08-11T07:40:59.3003717Z result = read_json('[{"a": 1, "b": 2}, {"b":2, "a" :1}]', numpy=True)
2022-08-11T07:40:59.3004050Z expected = DataFrame([[1, 2], [1, 2]], columns=["a", "b"])
2022-08-11T07:40:59.3004304Z
2022-08-11T07:40:59.3004687Z error_msg = """DataFrame\\.index are different
2022-08-11T07:40:59.3004937Z
2022-08-11T07:40:59.3005192Z DataFrame\\.index values are different \\(100\\.0 %\\)
2022-08-11T07:40:59.3005602Z \\[left\\]: Index\\(\\['a', 'b'\\], dtype='object'\\)
2022-08-11T07:40:59.3005942Z \\[right\\]: RangeIndex\\(start=0, stop=2, step=1\\)"""
2022-08-11T07:40:59.3006290Z with pytest.raises(AssertionError, match=error_msg):
2022-08-11T07:40:59.3006656Z > tm.assert_frame_equal(result, expected, check_index_type=False)
2022-08-11T07:40:59.3006865Z
2022-08-11T07:40:59.3006994Z pandas/tests/io/json/test_pandas.py:978:
2022-08-11T07:40:59.3007285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2022-08-11T07:40:59.3007619Z pandas/_libs/testing.pyx:52: in pandas._libs.testing.assert_almost_equal
2022-08-11T07:40:59.3007943Z cpdef assert_almost_equal(a, b,
2022-08-11T07:40:59.3008282Z pandas/_libs/testing.pyx:167: in pandas._libs.testing.assert_almost_equal
2022-08-11T07:40:59.3008660Z raise_assert_detail(obj, msg, lobj, robj, index_values=index_values)
2022-08-11T07:40:59.3008984Z pandas/util/_decorators.py:317: in wrapper
2022-08-11T07:40:59.3009258Z return func(*args, **kwargs)
2022-08-11T07:40:59.3009531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2022-08-11T07:40:59.3009691Z
2022-08-11T07:40:59.3009782Z self = left right
2022-08-11T07:40:59.3009987Z 0 a 0
2022-08-11T07:40:59.3010223Z 1 b 1, keys = [None], drop = True
2022-08-11T07:40:59.3010523Z append = False, inplace = False, verify_integrity = False
2022-08-11T07:40:59.3010714Z
2022-08-11T07:40:59.6463857Z @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "keys"])
2022-08-11T07:40:59.6464256Z def set_index(
2022-08-11T07:40:59.6464502Z self,
2022-08-11T07:40:59.6464734Z keys,
2022-08-11T07:40:59.6464981Z drop: bool = True,
2022-08-11T07:40:59.6465259Z append: bool = False,
2022-08-11T07:40:59.6465897Z inplace: bool = False,
2022-08-11T07:40:59.6466177Z verify_integrity: bool = False,
2022-08-11T07:40:59.6466730Z ) -> DataFrame | None:
2022-08-11T07:40:59.6466994Z """
2022-08-11T07:40:59.6467404Z Set the DataFrame index using existing columns.
2022-08-11T07:40:59.6467703Z
2022-08-11T07:40:59.6468018Z Set the DataFrame index (row labels) using one or more existing
2022-08-11T07:40:59.6468441Z columns or arrays (of the correct length). The index can replace the
2022-08-11T07:40:59.6468807Z existing index or expand on it.
2022-08-11T07:40:59.6469071Z
2022-08-11T07:40:59.6469299Z Parameters
2022-08-11T07:40:59.6469604Z ----------
2022-08-11T07:40:59.6470000Z keys : label or array-like or list of labels/arrays
2022-08-11T07:40:59.6470401Z This parameter can be either a single column key, a single array of
2022-08-11T07:40:59.6470830Z the same length as the calling DataFrame, or a list containing an
2022-08-11T07:40:59.6471255Z arbitrary combination of column keys and arrays. Here, "array"
2022-08-11T07:40:59.6471671Z encompasses :class:`Series`, :class:`Index`, ``np.ndarray``, and
2022-08-11T07:40:59.6472068Z instances of :class:`~collections.abc.Iterator`.
2022-08-11T07:40:59.6472407Z drop : bool, default True
2022-08-11T07:40:59.6472728Z Delete columns to be used as the new index.
2022-08-11T07:40:59.6473047Z append : bool, default False
2022-08-11T07:40:59.6473385Z Whether to append columns to existing index.
2022-08-11T07:40:59.6473711Z inplace : bool, default False
2022-08-11T07:40:59.6474075Z Whether to modify the DataFrame rather than creating a new one.
2022-08-11T07:40:59.6474445Z verify_integrity : bool, default False
2022-08-11T07:40:59.6474829Z Check the new index for duplicates. Otherwise defer the check until
2022-08-11T07:40:59.6475269Z necessary. Setting to False will improve the performance of this
2022-08-11T07:40:59.6475598Z method.
2022-08-11T07:40:59.6475823Z
2022-08-11T07:40:59.6476041Z Returns
2022-08-11T07:40:59.6476314Z -------
2022-08-11T07:40:59.6476566Z DataFrame or None
2022-08-11T07:40:59.6476908Z Changed row labels or None if ``inplace=True``.
2022-08-11T07:40:59.6477179Z
2022-08-11T07:40:59.6477396Z See Also
2022-08-11T07:40:59.6477669Z --------
2022-08-11T07:40:59.6477978Z DataFrame.reset_index : Opposite of set_index.
2022-08-11T07:40:59.6478375Z DataFrame.reindex : Change to new indices or expand indices.
2022-08-11T07:40:59.6478801Z DataFrame.reindex_like : Change to same indices as other DataFrame.
2022-08-11T07:40:59.6479122Z
2022-08-11T07:40:59.6479341Z Examples
2022-08-11T07:40:59.6479618Z --------
2022-08-11T07:40:59.6480002Z >>> df = pd.DataFrame({'month': [1, 4, 7, 10],
2022-08-11T07:40:59.6480431Z ... 'year': [2012, 2014, 2013, 2014],
2022-08-11T07:40:59.6480820Z ... 'sale': [55, 40, 84, 31]})
2022-08-11T07:40:59.6481081Z >>> df
2022-08-11T07:40:59.6481326Z month year sale
2022-08-11T07:40:59.6481572Z 0 1 2012 55
2022-08-11T07:40:59.6481814Z 1 4 2014 40
2022-08-11T07:40:59.6482054Z 2 7 2013 84
2022-08-11T07:40:59.6482296Z 3 10 2014 31
2022-08-11T07:40:59.6482522Z
2022-08-11T07:40:59.6482878Z Set the index to become the 'month' column:
2022-08-11T07:40:59.6483160Z
2022-08-11T07:40:59.6483462Z >>> df.set_index('month')
2022-08-11T07:40:59.6483727Z year sale
2022-08-11T07:40:59.6483968Z month
2022-08-11T07:40:59.6484198Z 1 2012 55
2022-08-11T07:40:59.6484980Z 4 2014 40
2022-08-11T07:40:59.6485228Z 7 2013 84
2022-08-11T07:40:59.6485473Z 10 2014 31
2022-08-11T07:40:59.6485910Z
2022-08-11T07:40:59.6486314Z Create a MultiIndex using columns 'year' and 'month':
2022-08-11T07:40:59.6486615Z
2022-08-11T07:40:59.6486943Z >>> df.set_index(['year', 'month'])
2022-08-11T07:40:59.6487295Z sale
2022-08-11T07:40:59.6487548Z year month
2022-08-11T07:40:59.6487795Z 2012 1 55
2022-08-11T07:40:59.6488034Z 2014 4 40
2022-08-11T07:40:59.6488274Z 2013 7 84
2022-08-11T07:40:59.6488506Z 2014 10 31
2022-08-11T07:40:59.6488732Z
2022-08-11T07:40:59.6489023Z Create a MultiIndex using an Index and a column:
2022-08-11T07:40:59.6489310Z
2022-08-11T07:40:59.6489666Z >>> df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
2022-08-11T07:40:59.6489961Z month sale
2022-08-11T07:40:59.6490212Z year
2022-08-11T07:40:59.6490449Z 1 2012 1 55
2022-08-11T07:40:59.6490694Z 2 2014 4 40
2022-08-11T07:40:59.6490940Z 3 2013 7 84
2022-08-11T07:40:59.6491180Z 4 2014 10 31
2022-08-11T07:40:59.6491404Z
2022-08-11T07:40:59.6491672Z Create a MultiIndex using two Series:
2022-08-11T07:40:59.6491943Z
2022-08-11T07:40:59.6492191Z >>> s = pd.Series([1, 2, 3, 4])
2022-08-11T07:40:59.6492480Z >>> df.set_index([s, s**2])
2022-08-11T07:40:59.6492754Z month year sale
2022-08-11T07:40:59.6492997Z 1 1 1 2012 55
2022-08-11T07:40:59.6493253Z 2 4 4 2014 40
2022-08-11T07:40:59.6493497Z 3 9 7 2013 84
2022-08-11T07:40:59.6493744Z 4 16 10 2014 31
2022-08-11T07:40:59.6493984Z """
2022-08-11T07:40:59.6494288Z inplace = validate_bool_kwarg(inplace, "inplace")
2022-08-11T07:40:59.6494673Z self._check_inplace_and_allows_duplicate_labels(inplace)
2022-08-11T07:40:59.6495024Z if not isinstance(keys, list):
2022-08-11T07:40:59.6495304Z keys = [keys]
2022-08-11T07:40:59.6495546Z
2022-08-11T07:40:59.6495769Z err_msg = (
2022-08-11T07:40:59.6496206Z 'The parameter "keys" may be a column key, one-dimensional '
2022-08-11T07:40:59.6496621Z "array, or a list containing only valid column keys and "
2022-08-11T07:40:59.6497033Z "one-dimensional arrays."
2022-08-11T07:40:59.6497300Z )
2022-08-11T07:40:59.6497500Z
2022-08-11T07:40:59.6497752Z missing: list[Hashable] = []
2022-08-11T07:40:59.6498032Z for col in keys:
2022-08-11T07:40:59.6498389Z if isinstance(col, (Index, Series, np.ndarray, list, abc.Iterator)):
2022-08-11T07:40:59.6498900Z # arrays are fine as long as they are one-dimensional
2022-08-11T07:40:59.6499267Z # iterators get converted to list below
2022-08-11T07:40:59.6499597Z if getattr(col, "ndim", 1) != 1:
2022-08-11T07:40:59.6499909Z raise ValueError(err_msg)
2022-08-11T07:40:59.6500183Z else:
2022-08-11T07:40:59.6500493Z # everything else gets tried as a key; see GH 24969
2022-08-11T07:40:59.6500794Z try:
2022-08-11T07:40:59.6501073Z found = col in self.columns
2022-08-11T07:40:59.6501387Z except TypeError as err:
2022-08-11T07:40:59.6501680Z raise TypeError(
2022-08-11T07:40:59.6502018Z f"{err_msg}. Received column of type {type(col)}"
2022-08-11T07:40:59.6502333Z ) from err
2022-08-11T07:40:59.6502568Z else:
2022-08-11T07:40:59.6502822Z if not found:
2022-08-11T07:40:59.6503106Z missing.append(col)
2022-08-11T07:40:59.6503365Z
2022-08-11T07:40:59.6503585Z if missing:
2022-08-11T07:40:59.6503905Z > raise KeyError(f"None of {missing} are in the columns")
2022-08-11T07:40:59.6504360Z E KeyError: 'None of [None] are in the columns'
2022-08-11T07:40:59.6504673Z
2022-08-11T07:40:59.6504808Z pandas/core/frame.py:5968: KeyError
looks related to changes in raise_assert_detail
.
The current behaviour for passing tests shouldn't change
Hello @savente93! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-08-14 07:56:35 UTC |
@MarcoGorelli Thanks for your direction earlier. I've reverted the original behavior and added a conditional flag for this and added a test. Most of the CIs pass now but a lot of them also fail because of this error:
I'll be honest I have no idea what this is and have also been unable to find any information on it. Have you seen this before perhaps? Mind giving me another hint? Thanks! |
Hey - I haven't had a chance to review the PR yet, but that error looks unrelated, if you fetch and merge upstream/main and push again then I presume it'll go away |
it did indeed, thanks so much! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR, just had a closer look
Regarding the API, I'd suggest not adding an extra argument, but rather an extra line with the first diff, like pytest does, e.g.
if we do
l1 = [1,2,3, 5, 6]
l2 = [1,2,4, 5, 7]
assert l1 == l2
then the output from pytest
will show
At index 2 diff: 3 != 4
Also, the comparison shouldn't use !=
, but consider atol
and rtol
.
I think the simplest way to do this might be, in
pandas/pandas/_libs/testing.pyx
Lines 156 to 161 in 96fe059
for i in range(len(a)): | |
try: | |
assert_almost_equal(a[i], b[i], rtol=rtol, atol=atol) | |
except AssertionError: | |
is_unequal = True | |
diff += 1 |
to capture the first index (and first elements) where a
and b
differ, pass that to raise_assert_detail
, and append it to msg
I agree that
I don't mean to be difficult, but for the reasons above I'd like to keep the current solution if possible. Is there a reason you're opposed to adding an argument? |
You can already hide the output by setting
(as an aside - it looks like setting Then, with the solution I'm suggesting, you'd get
As for adding an extra argument - it adds complexity, you'd need extra testing to check how it interacts with all the other arguments already there (e.g. currently, it wouldn't interact well with I'm open to making the current output more readable, especially in your case when the elements themselves are quite long, but I do think it's a separate issue to only showing elements where the frames differ |
If that's your position then I don't think this PR is actually the right way to go about it, so I'll be closing it. Thanks for your consideration and help along the way. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature. The change is super small so I figured it didn't really make sense to add it to thewhatsnew
section but if I should still do this please let me knowI didn't receive feedback on whether or not it was okay to make this the default behavior but doing so minimizes the code impact and is a reasonable default in my opinion so I'm submitting it like this. If this needs to e changed feel free to let me know.