Commit c9d183d

ENH: Added DataFrame.compare and Series.compare (GH30429) (#30852)
1 parent 5f26c34 commit c9d183d

10 files changed

+698
-11
lines changed


doc/source/reference/frame.rst

+3-2
@@ -240,13 +240,14 @@ Reshaping, sorting, transposing
    DataFrame.T
    DataFrame.transpose
 
-Combining / joining / merging
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Combining / comparing / joining / merging
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autosummary::
    :toctree: api/
 
    DataFrame.append
    DataFrame.assign
+   DataFrame.compare
    DataFrame.join
    DataFrame.merge
    DataFrame.update

doc/source/reference/series.rst

+3-2
@@ -240,12 +240,13 @@ Reshaping, sorting
    Series.squeeze
    Series.view
 
-Combining / joining / merging
------------------------------
+Combining / comparing / joining / merging
+-----------------------------------------
 .. autosummary::
    :toctree: api/
 
    Series.append
+   Series.compare
    Series.replace
    Series.update
 

doc/source/user_guide/merging.rst

+64-3
@@ -10,15 +10,18 @@
    p = doctools.TablePlotter()
 
 
-****************************
-Merge, join, and concatenate
-****************************
+************************************
+Merge, join, concatenate and compare
+************************************
 
 pandas provides various facilities for easily combining together Series or
 DataFrame with various kinds of set logic for the indexes
 and relational algebra functionality in the case of join / merge-type
 operations.
 
+In addition, pandas provides utilities to compare two Series or DataFrame
+and summarize their differences.
+
 .. _merging.concat:
 
 Concatenating objects
@@ -1477,3 +1480,61 @@ exclude exact matches on time. Note that though we exclude the exact matches
                  by='ticker',
                  tolerance=pd.Timedelta('10ms'),
                  allow_exact_matches=False)
+
+.. _merging.compare:
+
+Comparing objects
+-----------------
+
+The :meth:`~Series.compare` and :meth:`~DataFrame.compare` methods allow you to
+compare two Series or DataFrame, respectively, and summarize their differences.
+
+This feature was added in :ref:`V1.1.0 <whatsnew_110.dataframe_or_series_comparing>`.
+
+For example, you might want to compare two `DataFrame` and stack their differences
+side by side.
+
+.. ipython:: python
+
+   df = pd.DataFrame(
+       {
+           "col1": ["a", "a", "b", "b", "a"],
+           "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
+           "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
+       },
+       columns=["col1", "col2", "col3"],
+   )
+   df
+
+.. ipython:: python
+
+   df2 = df.copy()
+   df2.loc[0, 'col1'] = 'c'
+   df2.loc[2, 'col3'] = 4.0
+   df2
+
+.. ipython:: python
+
+   df.compare(df2)
+
+By default, if two corresponding values are equal, they will be shown as ``NaN``.
+Furthermore, if all values in an entire row / column are equal, the row / column
+will be omitted from the result. The remaining differences will be aligned on columns.
+
+If you wish, you may choose to stack the differences on rows.
+
+.. ipython:: python
+
+   df.compare(df2, align_axis=0)
+
+If you wish to keep all original rows and columns, set the `keep_shape` argument
+to ``True``.
+
+.. ipython:: python
+
+   df.compare(df2, keep_shape=True)
+
+You may also keep all the original values even if they are equal.
+
+.. ipython:: python
+   df.compare(df2, keep_shape=True, keep_equal=True)
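Reviewer note: the user guide text above only exercises `DataFrame.compare`. As a minimal sketch of the `Series` counterpart added in the same commit (the names `s1` and `s2` are illustrative and do not appear in the diff), the default alignment should yield a two-column frame, while `align_axis=0` should yield a Series with `self` / `other` on the inner index level:

```python
import pandas as pd

# Two illustrative Series that differ at positions 1 and 3.
s1 = pd.Series(["a", "b", "c", "d", "e"])
s2 = pd.Series(["a", "a", "c", "b", "e"])

# Default (align_axis=1): differences side by side in "self" / "other" columns.
side_by_side = s1.compare(s2)

# align_axis=0: differences stacked on the index, alternating self and other.
stacked = s1.compare(s2, align_axis=0)

print(side_by_side)
print(stacked)
```

Only the differing positions are kept because `keep_shape` defaults to `False`.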

doc/source/whatsnew/v1.1.0.rst

+33
@@ -55,6 +55,39 @@ For example:
    ser.loc["May 2015"]
 
 
+.. _whatsnew_110.dataframe_or_series_comparing:
+
+Comparing two `DataFrame` or two `Series` and summarizing the differences
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We've added :meth:`DataFrame.compare` and :meth:`Series.compare` for comparing two `DataFrame` or two `Series` (:issue:`30429`).
+
+.. ipython:: python
+
+   df = pd.DataFrame(
+       {
+           "col1": ["a", "a", "b", "b", "a"],
+           "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
+           "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
+       },
+       columns=["col1", "col2", "col3"],
+   )
+   df
+
+.. ipython:: python
+
+   df2 = df.copy()
+   df2.loc[0, 'col1'] = 'c'
+   df2.loc[2, 'col3'] = 4.0
+   df2
+
+.. ipython:: python
+
+   df.compare(df2)
+
+See :ref:`User Guide <merging.compare>` for more details.
+
+
 .. _whatsnew_110.groupby_key:
 
 Allow NA in groupby key
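Reviewer note: the release note speaks of comparing two `DataFrame` or two `Series`, and the implementation in this commit rejects mixed operands with a `TypeError` before doing any work. A small sketch of that behaviour, using made-up objects `df` and `ser`:

```python
import pandas as pd

# Illustrative operands of different types.
df = pd.DataFrame({"a": [1, 2]})
ser = pd.Series([1, 2], name="a")

try:
    df.compare(ser)
except TypeError as exc:
    # The message names both types involved in the failed comparison.
    message = str(exc)
else:
    message = None

print(message)
```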

pandas/core/frame.py

+110
@@ -5759,6 +5759,116 @@ def _construct_result(self, result) -> "DataFrame":
         out.index = self.index
         return out
 
+    @Appender(
+        """
+Returns
+-------
+DataFrame
+    DataFrame that shows the differences stacked side by side.
+
+    The resulting index will be a MultiIndex with 'self' and 'other'
+    stacked alternately at the inner level.
+
+See Also
+--------
+Series.compare : Compare with another Series and show differences.
+
+Notes
+-----
+Matching NaNs will not appear as a difference.
+
+Examples
+--------
+>>> df = pd.DataFrame(
+...     {
+...         "col1": ["a", "a", "b", "b", "a"],
+...         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
+...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
+...     },
+...     columns=["col1", "col2", "col3"],
+... )
+>>> df
+  col1  col2  col3
+0    a   1.0   1.0
+1    a   2.0   2.0
+2    b   3.0   3.0
+3    b   NaN   4.0
+4    a   5.0   5.0
+
+>>> df2 = df.copy()
+>>> df2.loc[0, 'col1'] = 'c'
+>>> df2.loc[2, 'col3'] = 4.0
+>>> df2
+  col1  col2  col3
+0    c   1.0   1.0
+1    a   2.0   2.0
+2    b   3.0   4.0
+3    b   NaN   4.0
+4    a   5.0   5.0
+
+Align the differences on columns
+
+>>> df.compare(df2)
+  col1       col3
+  self other self other
+0    a     c  NaN   NaN
+2  NaN   NaN  3.0   4.0
+
+Stack the differences on rows
+
+>>> df.compare(df2, align_axis=0)
+        col1  col3
+0 self     a   NaN
+  other    c   NaN
+2 self   NaN   3.0
+  other  NaN   4.0
+
+Keep the equal values
+
+>>> df.compare(df2, keep_equal=True)
+  col1       col3
+  self other self other
+0    a     c  1.0   1.0
+2    b     b  3.0   4.0
+
+Keep all original rows and columns
+
+>>> df.compare(df2, keep_shape=True)
+  col1       col2       col3
+  self other self other self other
+0    a     c  NaN   NaN  NaN   NaN
+1  NaN   NaN  NaN   NaN  NaN   NaN
+2  NaN   NaN  NaN   NaN  3.0   4.0
+3  NaN   NaN  NaN   NaN  NaN   NaN
+4  NaN   NaN  NaN   NaN  NaN   NaN
+
+Keep all original rows and columns and also all original values
+
+>>> df.compare(df2, keep_shape=True, keep_equal=True)
+  col1       col2       col3
+  self other self other self other
+0    a     c  1.0   1.0  1.0   1.0
+1    a     a  2.0   2.0  2.0   2.0
+2    b     b  3.0   3.0  3.0   4.0
+3    b     b  NaN   NaN  4.0   4.0
+4    a     a  5.0   5.0  5.0   5.0
+"""
+    )
+    @Appender(_shared_docs["compare"] % _shared_doc_kwargs)
+    def compare(
+        self,
+        other: "DataFrame",
+        align_axis: Axis = 1,
+        keep_shape: bool = False,
+        keep_equal: bool = False,
+    ) -> "DataFrame":
+        return super().compare(
+            other=other,
+            align_axis=align_axis,
+            keep_shape=keep_shape,
+            keep_equal=keep_equal,
+        )
+
     def combine(
         self, other: "DataFrame", func, fill_value=None, overwrite=True
     ) -> "DataFrame":
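Reviewer note: the docstring's remark that "Matching NaNs will not appear as a difference" follows from how the difference mask is built in `pandas/core/generic.py`. The sketch below reproduces that mask expression on a small illustrative frame (the data and the name `naive_mask` are made up for this note) and contrasts it with a plain `!=`, which would flag `NaN` against `NaN`:

```python
import numpy as np
import pandas as pd

# Illustrative data: a NaN shared by both frames, plus one real difference.
df = pd.DataFrame({"col2": [1.0, np.nan, 3.0], "col3": [1.0, 2.0, 3.0]})
df2 = df.copy()
df2.loc[2, "col3"] = 4.0

# Element-wise inequality treats NaN != NaN as a difference.
naive_mask = df != df2

# Mask expression used by compare: equal values and matching NaNs are both
# treated as "no difference".
mask = ~((df == df2) | (df.isna() & df2.isna()))

result = df.compare(df2)
print(mask)
print(result)
```

With this mask only the `col3` change at row 2 survives, so the shared `NaN` in `col2` does not show up in `result`.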

pandas/core/generic.py

+98
@@ -8403,6 +8403,104 @@ def ranker(data):
 
         return ranker(data)
 
+    _shared_docs[
+        "compare"
+    ] = """
+        Compare to another %(klass)s and show the differences.
+
+        .. versionadded:: 1.1.0
+
+        Parameters
+        ----------
+        other : %(klass)s
+            Object to compare with.
+
+        align_axis : {0 or 'index', 1 or 'columns'}, default 1
+            Determine which axis to align the comparison on.
+
+            * 0, or 'index' : Resulting differences are stacked vertically
+                with rows drawn alternately from self and other.
+            * 1, or 'columns' : Resulting differences are aligned horizontally
+                with columns drawn alternately from self and other.
+
+        keep_shape : bool, default False
+            If true, all rows and columns are kept.
+            Otherwise, only the ones with different values are kept.
+
+        keep_equal : bool, default False
+            If true, the result keeps values that are equal.
+            Otherwise, equal values are shown as NaNs.
+        """
+
+    @Appender(_shared_docs["compare"] % _shared_doc_kwargs)
+    def compare(
+        self,
+        other,
+        align_axis: Axis = 1,
+        keep_shape: bool_t = False,
+        keep_equal: bool_t = False,
+    ):
+        from pandas.core.reshape.concat import concat
+
+        if type(self) is not type(other):
+            cls_self, cls_other = type(self).__name__, type(other).__name__
+            raise TypeError(
+                f"can only compare '{cls_self}' (not '{cls_other}') with '{cls_self}'"
+            )
+
+        mask = ~((self == other) | (self.isna() & other.isna()))
+        keys = ["self", "other"]
+
+        if not keep_equal:
+            self = self.where(mask)
+            other = other.where(mask)
+
+        if not keep_shape:
+            if isinstance(self, ABCDataFrame):
+                cmask = mask.any()
+                rmask = mask.any(axis=1)
+                self = self.loc[rmask, cmask]
+                other = other.loc[rmask, cmask]
+            else:
+                self = self[mask]
+                other = other[mask]
+
+        if align_axis in (1, "columns"):  # This is needed for Series
+            axis = 1
+        else:
+            axis = self._get_axis_number(align_axis)
+
+        diff = concat([self, other], axis=axis, keys=keys)
+
+        if axis >= self.ndim:
+            # No need to reorganize data if stacking on new axis
+            # This currently applies for stacking two Series on columns
+            return diff
+
+        ax = diff._get_axis(axis)
+        ax_names = np.array(ax.names)
+
+        # set index names to positions to avoid confusion
+        ax.names = np.arange(len(ax_names))
+
+        # bring self-other to inner level
+        order = list(range(1, ax.nlevels)) + [0]
+        if isinstance(diff, ABCDataFrame):
+            diff = diff.reorder_levels(order, axis=axis)
+        else:
+            diff = diff.reorder_levels(order)
+
+        # restore the index names in order
+        diff._get_axis(axis=axis).names = ax_names[order]
+
+        # reorder axis to keep things organized
+        indices = (
+            np.arange(diff.shape[axis]).reshape([2, diff.shape[axis] // 2]).T.flatten()
+        )
+        diff = diff.take(indices, axis=axis)
+
+        return diff
+
     @doc(**_shared_doc_kwargs)
     def align(
         self,
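Reviewer note: the final "reorder axis" step in `compare` is easy to misread. After `concat`, all `self` entries come first and all `other` entries second; the `reshape([2, n // 2]).T.flatten()` expression builds a permutation that interleaves them. A standalone sketch with six made-up labels (`n` and `labels` are illustrative, not names from the diff):

```python
import numpy as np

# Pretend the concatenated axis holds 3 "self" entries followed by 3 "other" entries.
labels = np.array(["self0", "self1", "self2", "other0", "other1", "other2"])
n = len(labels)

# Same expression as in compare, with diff.shape[axis] replaced by n:
# rows of the 2 x (n // 2) grid are [0, 1, 2] and [3, 4, 5]; transposing and
# flattening walks the grid column by column.
indices = np.arange(n).reshape([2, n // 2]).T.flatten()

print(indices)          # [0 3 1 4 2 5]
print(labels[indices])  # self0, other0, self1, other1, self2, other2
```

Passing such indices to `take` is what places each `self` value directly next to its `other` counterpart in the result.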
