PERF: assert_frame_equal / assert_series_equal #55971

Merged
merged 9 commits into from
Nov 17, 2023
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v2.2.0.rst
@@ -305,7 +305,7 @@ Other Deprecations

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~
-- Performance improvement in :func:`.testing.assert_frame_equal` and :func:`.testing.assert_series_equal` for objects indexed by a :class:`MultiIndex` (:issue:`55949`)
+- Performance improvement in :func:`.testing.assert_frame_equal` and :func:`.testing.assert_series_equal` (:issue:`55949`, :issue:`55971`)
- Performance improvement in :func:`concat` with ``axis=1`` and objects with unaligned indexes (:issue:`55084`)
- Performance improvement in :func:`merge_asof` when ``by`` is not ``None`` (:issue:`55580`, :issue:`55678`)
- Performance improvement in :func:`read_stata` for files with many variables (:issue:`55515`)
17 changes: 4 additions & 13 deletions pandas/_testing/asserters.py
@@ -43,7 +43,6 @@
Series,
TimedeltaIndex,
)
-from pandas.core.algorithms import take_nd
from pandas.core.arrays import (
DatetimeArray,
ExtensionArray,
@@ -246,13 +245,6 @@ def _check_types(left, right, obj: str = "Index") -> None:

assert_attr_equal("dtype", left, right, obj=obj)

-def _get_ilevel_values(index, level):
-# accept level number only
-unique = index.levels[level]
-level_codes = index.codes[level]
-filled = take_nd(unique._values, level_codes, fill_value=unique._na_value)
-return unique._shallow_copy(filled, name=index.names[level])
-
# instance validation
_check_isinstance(left, right, Index)

@@ -299,9 +291,8 @@ def _get_ilevel_values(index, level):
)
assert_numpy_array_equal(left.codes[level], right.codes[level])
except AssertionError:
-# cannot use get_level_values here because it can change dtype
-llevel = _get_ilevel_values(left, level)
-rlevel = _get_ilevel_values(right, level)
+llevel = left.get_level_values(level)
+rlevel = right.get_level_values(level)

assert_index_equal(
llevel,
@@ -328,7 +319,7 @@ def _get_ilevel_values(index, level):
diff = np.sum(mismatch.astype(int)) * 100.0 / len(left)
msg = f"{obj} values are different ({np.round(diff, 5)} %)"
raise_assert_detail(obj, msg, left, right)
-else:
+elif not left.equals(right):
Member

do we know that .equals is the correct amount of strict?
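
For context on the strictness question, here is a minimal sketch of how `Index.equals` differs from the default `assert_index_equal` checks. It assumes a recent pandas; the index values and the names `"a"` and `"b"` are illustrative only and do not come from the PR.

```python
import pandas as pd

# Same element values, but different dtype (int64 vs float64) and name.
left = pd.Index([1, 2, 3], name="a")
right = pd.Index([1.0, 2.0, 3.0], name="b")

# Index.equals compares elements only; dtype and name are not compared.
values_match = left.equals(right)

# assert_index_equal also checks dtype and names by default.
try:
    pd.testing.assert_index_equal(left, right)
    strict_match = True
except AssertionError:
    strict_match = False

print(f"equals: {values_match}, assert_index_equal passes: {strict_match}")
```

Because `.equals` is the looser of the two, it can only serve as a fast path after the stricter attribute checks have already run.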

Member Author

I think so, but after reviewing again I've reverted this line in favor of a lower-level change to _array_equivalent_object.

The lower-level change also speeds up more cases: not just the index values but, as in this example, the Series values:

import pandas as pd

N = 100_000

values = pd._testing.makeStringIndex(N).values

ser1 = pd.Series(values.copy())
ser2 = pd.Series(values.copy())

%timeit pd.testing.assert_series_equal(ser1, ser2)

1.92 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- before this change
22.3 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- with this change
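
The benchmark above relies on IPython's `%timeit` and on `pd._testing.makeStringIndex`, a private helper that may not exist in every pandas version. A rough equivalent using only the standard-library `timeit` module might look like the sketch below; the `key_{i}` strings are an assumed stand-in for the generated string index, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 100_000

# Object-dtype array of distinct strings (assumed stand-in for makeStringIndex).
values = np.array([f"key_{i}" for i in range(N)], dtype=object)

# Two separate arrays holding equal values, kept as object dtype.
ser1 = pd.Series(values.copy(), dtype=object)
ser2 = pd.Series(values.copy(), dtype=object)

runs = 3
elapsed = timeit.timeit(
    lambda: pd.testing.assert_series_equal(ser1, ser2), number=runs
)
print(f"{elapsed / runs * 1e3:.1f} ms per call")
```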

# if we have "equiv", this becomes True
exact_bool = bool(exact)
_testing.assert_almost_equal(
@@ -592,7 +583,7 @@ def raise_assert_detail(
{message}"""

if isinstance(index_values, Index):
-index_values = np.array(index_values)
+index_values = np.asarray(index_values)

if isinstance(index_values, np.ndarray):
msg += f"\n[index]: {pprint_thing(index_values)}"