ENH: Added DataFrame.compare and Series.compare (GH30429) #30852

Merged 64 commits on May 28, 2020
Changes from 60 commits
Commits
c13af19
ENH: Added DataFrame.differences and Series.differences (GH30429)
fujiaxiang Jan 9, 2020
8f5d0fb
CLN: reformatted docstring (GH30429)
fujiaxiang Jan 9, 2020
c5b793a
ENH: Extracted differences() from DataFrame and Series into NDFrame
fujiaxiang Jan 10, 2020
5eff415
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Jan 16, 2020
0bc8529
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Jan 18, 2020
d22e21a
ENH: organized docstring using _shared_doc and reduced duplicates (GH…
fujiaxiang Jan 18, 2020
83f31df
ENH: added argument type indication (GH30429)
fujiaxiang Jan 18, 2020
488c8a8
ENH: reordered imports (GH30429)
fujiaxiang Jan 18, 2020
322ff20
ENH: removed inconsistent type indication (GH30429)
fujiaxiang Jan 18, 2020
71f5eef
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Jan 30, 2020
e50172c
ENH: Added whatsnew entry (GH30429)
fujiaxiang Jan 30, 2020
4a82bec
ENH: Minor correction in whatsnew entry (GH30429)
fujiaxiang Jan 30, 2020
b2849ed
ENH: Minor correction in whatsnew entry (GH30429)
fujiaxiang Jan 30, 2020
8e0e441
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Jan 31, 2020
ff7a572
ENH: Correction in whatsnew entry (GH30429)
fujiaxiang Jan 31, 2020
bc969e8
ENH: updated whatsnew (GH31200)
fujiaxiang Feb 10, 2020
26c6ca6
ENH: added doc references (GH31200)
fujiaxiang Feb 10, 2020
de2195b
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Feb 10, 2020
5fb2edc
DOC: fixed formatting issue in doc references
fujiaxiang Feb 10, 2020
dcc2d71
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Feb 25, 2020
35ccb5f
updated parameter names, docstring, and relevant tests (GH30429)
fujiaxiang Feb 25, 2020
586e37c
added doc-string tests (GH30429)
fujiaxiang Feb 25, 2020
d13db2f
fixed some PEP8 issues in doc-strings (GH30429)
fujiaxiang Feb 25, 2020
5342208
removed trailing spaces in doc-strings (GH30429)
fujiaxiang Feb 26, 2020
35e9be6
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Feb 26, 2020
77b1c9e
fixed sphinx identation issues in doc-strings (GH30429)
fujiaxiang Feb 26, 2020
51ffe0e
sphinx identation issues in doc-strings (GH30429)
fujiaxiang Feb 26, 2020
71b0332
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Feb 26, 2020
827b69c
sphinx identation issues in doc-strings (GH30429)
fujiaxiang Feb 26, 2020
53918a5
Update pandas/core/frame.py
fujiaxiang Feb 26, 2020
110f138
Update pandas/core/series.py
fujiaxiang Feb 26, 2020
acd51e0
attempt to fix sphinx identation issues in doc-strings (GH30429)
fujiaxiang Feb 27, 2020
1ef31c9
removed trailing spaces in doc-strings (GH30429)
fujiaxiang Feb 27, 2020
a898b87
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Mar 3, 2020
3bc7485
removed unintended changes in ci/code_checks (GH30429)
fujiaxiang Mar 10, 2020
36024d5
Merge remote-tracking branch 'origin/dataframe_and_series_differences…
fujiaxiang Mar 10, 2020
06ed216
corrected errors in docstring (GH30429)
fujiaxiang Mar 10, 2020
e4729ca
Update pandas/core/frame.py: slight semantic cleanup in docstring
fujiaxiang Mar 14, 2020
b6c0f78
renamed parameter axis to align_axis; added tests (GH30429)
fujiaxiang Mar 14, 2020
c352ee2
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Mar 14, 2020
0850420
minor correction in docstring
fujiaxiang Mar 14, 2020
a709db7
some semantic cleanup in docstrings
fujiaxiang Mar 14, 2020
9509604
added type indicator for method arguments
fujiaxiang Mar 17, 2020
e1a1c49
updated type hints
fujiaxiang Mar 20, 2020
a8caa53
added NDFrame in FrameOrSeriesUnion type
fujiaxiang Mar 20, 2020
4056f90
fixed type hints of concat function
fujiaxiang Mar 21, 2020
e50772d
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Mar 21, 2020
39f857e
renamed `differences` method to `compare`
fujiaxiang Apr 9, 2020
d0226d8
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Apr 9, 2020
6c62b0e
correction of method name in doc/source/reference/series.rst
fujiaxiang Apr 9, 2020
098d40c
added type checking in `compare` method and reformatted whatsnew a bit
fujiaxiang Apr 11, 2020
131ea95
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Apr 11, 2020
4223eb4
removed unintended line break
fujiaxiang Apr 11, 2020
c5246d6
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Apr 29, 2020
91758c8
resolved a linting issue
fujiaxiang Apr 29, 2020
7dde706
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Apr 30, 2020
774ff5d
updated whatsnew entry
fujiaxiang Apr 30, 2020
eb6d33d
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Apr 30, 2020
5d34fc4
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang May 1, 2020
c358e3d
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang May 11, 2020
0189623
added doc in user guide merging.rst and more tests
fujiaxiang May 15, 2020
b0b3e24
removed trailing space in docstring and blackified code
fujiaxiang May 15, 2020
cdb03b2
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang May 27, 2020
007eeb7
added one more example in docstring of DataFrame.compare
fujiaxiang May 27, 2020
5 changes: 3 additions & 2 deletions doc/source/reference/frame.rst
@@ -240,13 +240,14 @@ Reshaping, sorting, transposing
   DataFrame.T
   DataFrame.transpose

-Combining / joining / merging
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Combining / comparing / joining / merging
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
   :toctree: api/

   DataFrame.append
   DataFrame.assign
+   DataFrame.compare
   DataFrame.join
   DataFrame.merge
   DataFrame.update
5 changes: 3 additions & 2 deletions doc/source/reference/series.rst
@@ -240,12 +240,13 @@ Reshaping, sorting
   Series.squeeze
   Series.view

-Combining / joining / merging
------------------------------
+Combining / comparing / joining / merging
+-----------------------------------------
.. autosummary::
   :toctree: api/

   Series.append
+   Series.compare
   Series.replace
   Series.update

31 changes: 31 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -37,6 +37,37 @@ For example:
ser.loc["May 2015"]


.. _whatsnew_110.dataframe_or_series_comparing:

Comparing two `DataFrame` or two `Series` and summarizing the differences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We've added :meth:`DataFrame.compare` and :meth:`Series.compare` for comparing two `DataFrame` or two `Series` (:issue:`30429`)

.. ipython:: python

    df = pd.DataFrame(
        {
            "col1": ["a", "a", "b", "b", "a"],
            "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
            "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
        },
        columns=["col1", "col2", "col3"],
    )
    df

.. ipython:: python

    df2 = df.copy()
    df2.loc[0, 'col1'] = 'c'
    df2.loc[2, 'col3'] = 4.0
    df2

.. ipython:: python

    df.compare(df2)


.. _whatsnew_110.groupby_key:

Allow NA in groupby key
102 changes: 102 additions & 0 deletions pandas/core/frame.py
@@ -5689,6 +5689,108 @@ def _construct_result(self, result) -> "DataFrame":
        out.columns = self.columns
        return out

    @Appender(
        """
Returns
-------
DataFrame
    DataFrame that shows the differences stacked side by side.

    The resulting index will be a MultiIndex with 'self' and 'other'
    stacked alternately at the inner level.

See Also
--------
Series.compare : Compare with another Series and show differences.

Notes
-----
Matching NaNs will not appear as a difference.

Examples
--------
>>> df = pd.DataFrame(
...     {
...         "col1": ["a", "a", "b", "b", "a"],
...         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
...     },
...     columns=["col1", "col2", "col3"],
... )
>>> df
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

Align the differences on columns

>>> df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

Stack the differences on rows

>>> df.compare(df2, align_axis=0)
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0

Keep all original rows and columns

>>> df.compare(df2, keep_shape=True)
  col1       col2       col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN

Keep all original rows and columns and also all original values

>>> df.compare(df2, keep_shape=True, keep_equal=True)
Contributor

does keep_equal make sense w/o keep_shape==True? IOW does it stand on its own? can you add an example of just using it

Member Author

Yes I would say so. It helps identify which one could be the "anomaly". I myself have come across such use cases.

For example, say I'm looking at the dates on which a company publishes two types of its quarterly report.

>>> import pandas as pd
>>> df1 = pd.DataFrame(columns=['Q1_filing', 'Q2_filing', 'Q3_filing', 'Q4_filing'], index=list(range(2010, 2020)))
>>> df1['Q1_filing'] = 'Mar'
>>> df1['Q2_filing'] = 'Jun'
>>> df1['Q3_filing'] = 'Sep'
>>> df1['Q4_filing'] = 'Dec'
>>> df1
     Q1_filing Q2_filing Q3_filing Q4_filing
2010       Mar       Jun       Sep       Dec
2011       Mar       Jun       Sep       Dec
2012       Mar       Jun       Sep       Dec
2013       Mar       Jun       Sep       Dec
2014       Mar       Jun       Sep       Dec
2015       Mar       Jun       Sep       Dec
2016       Mar       Jun       Sep       Dec
2017       Mar       Jun       Sep       Dec
2018       Mar       Jun       Sep       Dec
2019       Mar       Jun       Sep       Dec

>>> df2 = df1.copy()
>>> df2.loc[2015, 'Q1_filing'] = 'Apr'
>>> df2.loc[2016, 'Q2_filing'] = 'Jul'

By comparing the two, I can see that the discrepancy is in 2015 and 2016, but I don't know which one deviated from the norm.

>>> df1.compare(df2)
     Q1_filing       Q2_filing
          self other      self other
2015       Mar   Apr       NaN   NaN
2016       NaN   NaN       Jun   Jul

The natural thing for me to do now is to look at 2015 Q2_filing and 2016 Q1_filing, where the two agree with each other. (You can of course look at the whole thing, but sometimes the data is too big and I just want to look at the relevant parts first.)

>>> df1.compare(df2, keep_equal=True)
     Q1_filing       Q2_filing
          self other      self other
2015       Mar   Apr       Jun   Jun
2016       Mar   Mar       Jun   Jul

With this result I know that something is probably off in the second type of report.

Member Author

Have added the example in frame.py. keep_equal does not stand on its own for Series so I did not add anything there.
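
The Series case mentioned above can be checked with a small sketch. It assumes pandas 1.1 or later (where `Series.compare` exists), and the sample values are made up for illustration. With the default `keep_shape=False` only the differing rows survive, so `keep_equal=True` has nothing extra to show; it only matters once `keep_shape=True` brings the equal rows back.

```python
import pandas as pd

# Hypothetical sample data: positions 1 and 3 differ.
s1 = pd.Series(["a", "b", "c", "d"])
s2 = pd.Series(["a", "x", "c", "y"])

# Default behaviour: only the differing rows are returned.
plain = s1.compare(s2)

# keep_equal alone: the surviving rows are all differences already,
# so the result is the same as the default call.
with_equal = s1.compare(s2, keep_equal=True)
print(plain.equals(with_equal))

# Combined with keep_shape, the equal rows are kept and shown with their values.
full = s1.compare(s2, keep_shape=True, keep_equal=True)
print(full)
```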

  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0
"""
    )
    @Appender(_shared_docs["compare"] % _shared_doc_kwargs)
    def compare(
        self,
        other: "DataFrame",
        align_axis: Axis = 1,
        keep_shape: bool = False,
        keep_equal: bool = False,
    ) -> "DataFrame":
        return super().compare(
            other=other,
            align_axis=align_axis,
            keep_shape=keep_shape,
            keep_equal=keep_equal,
        )

    def combine(
        self, other: "DataFrame", func, fill_value=None, overwrite=True
    ) -> "DataFrame":
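
The docstring in this hunk says the result carries a MultiIndex with 'self' and 'other' alternating at the inner level. A short sketch of what that looks like, reusing the docstring's own `df` and `df2` and assuming pandas 1.1 or later: with the default `align_axis=1` the MultiIndex is on the columns, and with `align_axis=0` it is on the rows.

```python
import numpy as np
import pandas as pd

# Same frames as in the docstring example.
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

# align_axis=1 (default): 'self'/'other' alternate under each original column.
wide = df.compare(df2)
wide_cols = list(wide.columns)
print(wide_cols)

# align_axis=0: 'self'/'other' alternate under each original row label.
tall = df.compare(df2, align_axis=0)
tall_idx = list(tall.index)
print(tall_idx)
```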
96 changes: 96 additions & 0 deletions pandas/core/generic.py
@@ -8406,6 +8406,102 @@ def ranker(data):

        return ranker(data)

    _shared_docs[
        "compare"
    ] = """
Compare to another %(klass)s and show the differences.

Parameters
----------
other : %(klass)s
    Object to compare with.

align_axis : {0 or 'index', 1 or 'columns'}, default 1
    Determine which axis to align the comparison on.

    * 0, or 'index' : Resulting differences are stacked vertically
        with rows drawn alternately from self and other.
    * 1, or 'columns' : Resulting differences are aligned horizontally
        with columns drawn alternately from self and other.

keep_shape : bool, default False
    If true, all rows and columns are kept.
    Otherwise, only the ones with different values are kept.

keep_equal : bool, default False
    If true, the result keeps values that are equal.
    Otherwise, equal values are shown as NaNs.
"""

    @Appender(_shared_docs["compare"] % _shared_doc_kwargs)
    def compare(
        self,
        other,
        align_axis: Axis = 1,
        keep_shape: bool_t = False,
        keep_equal: bool_t = False,
    ):
        from pandas.core.reshape.concat import concat

        if type(self) is not type(other):
            cls_self, cls_other = type(self).__name__, type(other).__name__
            raise TypeError(
                f"can only compare '{cls_self}' (not '{cls_other}') with '{cls_self}'"
            )

        mask = ~((self == other) | (self.isna() & other.isna()))
        keys = ["self", "other"]

        if not keep_equal:
            self = self.where(mask)
            other = other.where(mask)

        if not keep_shape:
            if isinstance(self, ABCDataFrame):
                cmask = mask.any()
                rmask = mask.any(axis=1)
                self = self.loc[rmask, cmask]
                other = other.loc[rmask, cmask]
            else:
                self = self[mask]
                other = other[mask]

        if align_axis in (1, "columns"):  # This is needed for Series
            axis = 1
        else:
            axis = self._get_axis_number(align_axis)

        diff = concat([self, other], axis=axis, keys=keys)

        if axis >= self.ndim:
            # No need to reorganize data if stacking on new axis
            # This currently applies for stacking two Series on columns
            return diff

        ax = diff._get_axis(axis)
        ax_names = np.array(ax.names)

        # set index names to positions to avoid confusion
        ax.names = np.arange(len(ax_names))

        # bring self-other to inner level
        order = list(range(1, ax.nlevels)) + [0]
        if isinstance(diff, ABCDataFrame):
            diff = diff.reorder_levels(order, axis=axis)
        else:
            diff = diff.reorder_levels(order)

        # restore the index names in order
        diff._get_axis(axis=axis).names = ax_names[order]

        # reorder axis to keep things organized
        indices = (
            np.arange(diff.shape[axis]).reshape([2, diff.shape[axis] // 2]).T.flatten()
        )
        diff = diff.take(indices, axis=axis)

        return diff

    @doc(**_shared_doc_kwargs)
    def align(
        self,
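
Two steps of the `compare` implementation in this hunk can be checked in isolation. The sketch below is an illustration on made-up data, not part of the PR: it reproduces the NaN-aware mask expression and the `reshape(...).T.flatten()` index trick that interleaves the 'self' and 'other' halves after `concat`. The variable names `a`, `b`, `naive` and `n` are introduced here for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: position 1 is NaN on both sides, position 3 on one side only.
a = pd.Series([1.0, np.nan, 3.0, np.nan])
b = pd.Series([1.0, np.nan, 4.0, 2.0])

# A plain inequality flags position 1, because NaN never compares equal to NaN.
naive = ~(a == b)
print(naive.tolist())  # [False, True, True, True]

# The mask used in compare() treats matching NaNs as equal.
mask = ~((a == b) | (a.isna() & b.isna()))
print(mask.tolist())  # [False, False, True, True]

# After concat, the axis holds all 'self' entries followed by all 'other' entries.
# For n = 6 positions (3 'self', then 3 'other'), this pairs them up again.
n = 6
indices = np.arange(n).reshape([2, n // 2]).T.flatten()
print(indices.tolist())  # [0, 3, 1, 4, 2, 5]
```

Passing `indices` to `take` is what puts each 'self' entry directly next to its 'other' counterpart in the final result.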
6 changes: 3 additions & 3 deletions pandas/core/reshape/concat.py
@@ -7,7 +7,7 @@

import numpy as np

-from pandas._typing import FrameOrSeriesUnion, Label
+from pandas._typing import FrameOrSeries, FrameOrSeriesUnion, Label

from pandas.core.dtypes.concat import concat_compat
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries
@@ -50,7 +50,7 @@ def concat(

@overload
def concat(
-    objs: Union[Iterable[FrameOrSeriesUnion], Mapping[Label, FrameOrSeriesUnion]],
+    objs: Union[Iterable[FrameOrSeries], Mapping[Label, FrameOrSeries]],
    axis=0,
    join: str = "outer",
    ignore_index: bool = False,
@@ -65,7 +65,7 @@ def concat(


def concat(
-    objs: Union[Iterable[FrameOrSeriesUnion], Mapping[Label, FrameOrSeriesUnion]],
+    objs: Union[Iterable[FrameOrSeries], Mapping[Label, FrameOrSeries]],
    axis=0,
    join="outer",
    ignore_index: bool = False,