
ENH: Added DataFrame.compare and Series.compare (GH30429) #30852


Merged 64 commits on May 28, 2020
Changes from 9 commits
Commits
c13af19
ENH: Added DataFrame.differences and Series.differences (GH30429)
fujiaxiang Jan 9, 2020
8f5d0fb
CLN: reformatted docstring (GH30429)
fujiaxiang Jan 9, 2020
c5b793a
ENH: Extracted differences() from DataFrame and Series into NDFrame
fujiaxiang Jan 10, 2020
5eff415
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Jan 16, 2020
0bc8529
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Jan 18, 2020
d22e21a
ENH: organized docstring using _shared_doc and reduced duplicates (GH…
fujiaxiang Jan 18, 2020
83f31df
ENH: added argument type indication (GH30429)
fujiaxiang Jan 18, 2020
488c8a8
ENH: reordered imports (GH30429)
fujiaxiang Jan 18, 2020
322ff20
ENH: removed inconsistent type indication (GH30429)
fujiaxiang Jan 18, 2020
71f5eef
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Jan 30, 2020
e50172c
ENH: Added whatsnew entry (GH30429)
fujiaxiang Jan 30, 2020
4a82bec
ENH: Minor correction in whatsnew entry (GH30429)
fujiaxiang Jan 30, 2020
b2849ed
ENH: Minor correction in whatsnew entry (GH30429)
fujiaxiang Jan 30, 2020
8e0e441
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Jan 31, 2020
ff7a572
ENH: Correction in whatsnew entry (GH30429)
fujiaxiang Jan 31, 2020
bc969e8
ENH: updated whatsnew (GH31200)
fujiaxiang Feb 10, 2020
26c6ca6
ENH: added doc references (GH31200)
fujiaxiang Feb 10, 2020
de2195b
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Feb 10, 2020
5fb2edc
DOC: fixed formatting issue in doc references
fujiaxiang Feb 10, 2020
dcc2d71
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Feb 25, 2020
35ccb5f
updated parameter names, docstring, and relevant tests (GH30429)
fujiaxiang Feb 25, 2020
586e37c
added doc-string tests (GH30429)
fujiaxiang Feb 25, 2020
d13db2f
fixed some PEP8 issues in doc-strings (GH30429)
fujiaxiang Feb 25, 2020
5342208
removed trailing spaces in doc-strings (GH30429)
fujiaxiang Feb 26, 2020
35e9be6
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Feb 26, 2020
77b1c9e
fixed sphinx identation issues in doc-strings (GH30429)
fujiaxiang Feb 26, 2020
51ffe0e
sphinx identation issues in doc-strings (GH30429)
fujiaxiang Feb 26, 2020
71b0332
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Feb 26, 2020
827b69c
sphinx identation issues in doc-strings (GH30429)
fujiaxiang Feb 26, 2020
53918a5
Update pandas/core/frame.py
fujiaxiang Feb 26, 2020
110f138
Update pandas/core/series.py
fujiaxiang Feb 26, 2020
acd51e0
attempt to fix sphinx identation issues in doc-strings (GH30429)
fujiaxiang Feb 27, 2020
1ef31c9
removed trailing spaces in doc-strings (GH30429)
fujiaxiang Feb 27, 2020
a898b87
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Mar 3, 2020
3bc7485
removed unintended changes in ci/code_checks (GH30429)
fujiaxiang Mar 10, 2020
36024d5
Merge remote-tracking branch 'origin/dataframe_and_series_differences…
fujiaxiang Mar 10, 2020
06ed216
corrected errors in docstring (GH30429)
fujiaxiang Mar 10, 2020
e4729ca
Update pandas/core/frame.py: slight semantic cleanup in docstring
fujiaxiang Mar 14, 2020
b6c0f78
renamed parameter axis to align_axis; added tests (GH30429)
fujiaxiang Mar 14, 2020
c352ee2
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Mar 14, 2020
0850420
minor correction in docstring
fujiaxiang Mar 14, 2020
a709db7
some semantic cleanup in docstrings
fujiaxiang Mar 14, 2020
9509604
added type indicator for method arguments
fujiaxiang Mar 17, 2020
e1a1c49
updated type hints
fujiaxiang Mar 20, 2020
a8caa53
added NDFrame in FrameOrSeriesUnion type
fujiaxiang Mar 20, 2020
4056f90
fixed type hints of concat function
fujiaxiang Mar 21, 2020
e50772d
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Mar 21, 2020
39f857e
renamed `differences` method to `compare`
fujiaxiang Apr 9, 2020
d0226d8
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Apr 9, 2020
6c62b0e
correction of method name in doc/source/reference/series.rst
fujiaxiang Apr 9, 2020
098d40c
added type checking in `compare` method and reformatted whatsnew a bit
fujiaxiang Apr 11, 2020
131ea95
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Apr 11, 2020
4223eb4
removed unintended line break
fujiaxiang Apr 11, 2020
c5246d6
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang Apr 29, 2020
91758c8
resolved a linting issue
fujiaxiang Apr 29, 2020
7dde706
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Apr 30, 2020
774ff5d
updated whatsnew entry
fujiaxiang Apr 30, 2020
eb6d33d
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang Apr 30, 2020
5d34fc4
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang May 1, 2020
c358e3d
Merge branch 'master' into dataframe_and_series_differences
fujiaxiang May 11, 2020
0189623
added doc in user guide merging.rst and more tests
fujiaxiang May 15, 2020
b0b3e24
removed trailing space in docstring and blackified code
fujiaxiang May 15, 2020
cdb03b2
Merge remote-tracking branch 'upstream/master' into dataframe_and_ser…
fujiaxiang May 27, 2020
007eeb7
added one more example in docstring of DataFrame.compare
fujiaxiang May 27, 2020
88 changes: 88 additions & 0 deletions pandas/core/frame.py
@@ -5365,6 +5365,94 @@ def _construct_result(self, result) -> "DataFrame":
out.columns = self.columns
return out

    @Appender(
        """
Returns
-------
DataFrame
    DataFrame that shows the differences stacked side by side.

See Also
--------
Series.differences: Show differences.

Examples
--------
>>> df = pd.DataFrame(
...     {
...         "col1": ["a", "a", "b", "b", "a"],
...         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
...     },
...     columns=["col1", "col2", "col3"],
... )
>>> df
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

Stack the differences on columns

>>> df.differences(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

Stack the differences on rows

>>> df.differences(df2, axis=0)
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0

Keep all the original indices (rows and columns)

>>> df.differences(df2, keep_indices=True)
  col1       col2       col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN

Keep all original indices and data

>>> df.differences(df2, keep_indices=True, keep_values=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0
"""
    )
    @Appender(_shared_docs["differences"] % _shared_doc_kwargs)
    def differences(
        self, other: "DataFrame", axis=1, keep_indices=False, keep_values=False
    ) -> "DataFrame":
        return super().differences(
            other=other, axis=axis, keep_indices=keep_indices, keep_values=keep_values
        )

def combine(
self, other: "DataFrame", func, fill_value=None, overwrite=True
) -> "DataFrame":
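
Note for readers comparing this early diff with the released feature: the commit list above records that `differences` was later renamed to `compare` and `axis` to `align_axis`. The sketch below reruns the docstring example against the released method. It assumes pandas 1.1 or later, where the two boolean options are named `keep_shape` and `keep_equal`; those two names come from the released API, not from the nine commits shown here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

# Default: only differing rows/columns, "self"/"other" side by side.
side_by_side = df.compare(df2)

# Stack "self"/"other" on neighbouring rows instead of columns.
stacked_rows = df.compare(df2, align_axis=0)

# Keep every row and column; equal cells are shown as NaN.
full_shape = df.compare(df2, keep_shape=True)

# Keep every row and column, and keep the equal values too.
full_values = df.compare(df2, keep_shape=True, keep_equal=True)

print(side_by_side)
```

In this mapping, `keep_shape` plays the role of `keep_indices` in the diff above and `keep_equal` the role of `keep_values`.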
84 changes: 84 additions & 0 deletions pandas/core/generic.py
@@ -8103,6 +8103,90 @@ def ranker(data):

return ranker(data)

    _shared_docs[
        "differences"
    ] = """
        Compare to another %(klass)s and show the differences.

        The axis on which to stack results and how much information to
        preserve can be customized.

        Note that NaNs are considered not different from other NaNs.

        Parameters
        ----------
        other : %(klass)s
            Object to compare with.

        axis : {0 or 'index', 1 or 'columns'}, default 1
            Determine how the differences are stacked.
            * 0, or 'index' : Stack differences on neighbouring rows.
            * 1, or 'columns' : Stack differences on neighbouring columns.

        keep_indices : bool, default False
            Whether to keep the rows and columns that are equal, or drop them.

        keep_values : bool, default False
            Whether to keep the values that are equal, or show as NaNs.
        """

    @Appender(_shared_docs["differences"] % _shared_doc_kwargs)
    def differences(self, other, axis=1, keep_indices=False, keep_values=False):
        from pandas.core.reshape.concat import concat

        mask = ~((self == other) | (self.isna() & other.isna()))
        keys = ["self", "other"]

        if not keep_values:
            self = self.where(mask)
            other = other.where(mask)

        if not keep_indices:
            if isinstance(self, ABCDataFrame):
                cmask = mask.any()
                rmask = mask.any(axis=1)
                self = self.loc[rmask, cmask]
                other = other.loc[rmask, cmask]
            else:
                self = self[mask]
                other = other[mask]

        if axis in (1, "columns"):  # This is needed for Series
            axis = 1
        else:
            axis = self._get_axis_number(axis)

        diff = concat([self, other], axis=axis, keys=keys)

        if axis >= self.ndim:
            # No need to reorganize data if stacking on new axis
            # This currently applies for stacking two Series on columns
            return diff

        ax = diff._get_axis(axis)
        ax_names = np.array(ax.names)

        # set index names to positions to avoid confusion
        ax.names = np.arange(len(ax_names))

        # bring self-other to inner level
        order = list(range(1, ax.nlevels)) + [0]
        if isinstance(diff, ABCDataFrame):
            diff = diff.reorder_levels(order, axis=axis)
        else:
            diff = diff.reorder_levels(order)

        # restore the index names in order
        diff._get_axis(axis=axis).names = ax_names[order]

        # reorder axis to keep things organized
        indices = (
            np.arange(diff.shape[axis]).reshape([2, diff.shape[axis] // 2]).T.flatten()
        )
        diff = diff.take(indices, axis=axis)

        return diff

_shared_docs[
"align"
] = """
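
The core of the `differences` implementation in this hunk is a NaN-aware inequality mask, followed by `where`, a row/column filter, and a keyed `concat` whose result is then interleaved. As a rough illustration using only public pandas calls, the same steps can be written as a standalone function. `show_differences` is a hypothetical name; the sketch handles only the default case (two identically labeled DataFrames, stacked on columns) and uses `swaplevel` where the diff uses `reorder_levels`.

```python
import numpy as np
import pandas as pd


def show_differences(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: side-by-side view of the cells that differ."""
    # A cell differs unless the values are equal or both are missing.
    mask = ~((left == right) | (left.isna() & right.isna()))

    # Blank out equal cells, then drop rows and columns with no difference.
    rows, cols = mask.any(axis=1), mask.any(axis=0)
    left = left.where(mask).loc[rows, cols]
    right = right.where(mask).loc[rows, cols]

    # Concatenate under "self"/"other" keys, then move the keys to the inner level.
    out = pd.concat([left, right], axis=1, keys=["self", "other"])
    out = out.swaplevel(0, 1, axis=1)

    # Interleave so each original column's self/other pair sits together:
    # positions [0..n) are "self", [n..2n) are "other" -> 0, n, 1, n+1, ...
    n = left.shape[1]
    order = np.arange(2 * n).reshape(2, n).T.flatten()
    return out.take(order, axis=1)


df = pd.DataFrame({"col1": ["a", "b", "c"], "col3": [1.0, 2.0, 3.0]})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

result = show_differences(df, df2)
print(result)
```

The final `take` is what turns the column order `self, self, other, other` into `self, other, self, other`; without it the pairs would not be adjacent.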
65 changes: 64 additions & 1 deletion pandas/core/series.py
@@ -23,7 +23,7 @@
from pandas._config import get_option

from pandas._libs import index as libindex, lib, reshape, tslibs
-from pandas._typing import Label
+from pandas._typing import FrameOrSeries, Label
from pandas.compat.numpy import function as nv
from pandas.util._decorators import Appender, Substitution
from pandas.util._validators import validate_bool_kwarg, validate_percentile
@@ -2555,7 +2555,70 @@ def _binop(self, other, func, level=None, fill_value=None):
ret = ops._construct_result(self, result, new_index, name)
return ret

    @Appender(
        """
Returns
-------
Series or DataFrame
    If axis is 0 or 'index' the result will be a Series.
    If axis is 1 or 'columns' the result will be a DataFrame.

See Also
--------
DataFrame.differences: Show differences.

Examples
--------
>>> s1 = pd.Series(["a", "b", "c", "d", "e"])
>>> s2 = pd.Series(["a", "a", "c", "b", "e"])

Stack the differences on columns

>>> s1.differences(s2)
  self other
1    b     a
3    d     b

Stack the differences on indices

>>> s1.differences(s2, axis=0)
1  self     b
   other    a
3  self     d
   other    b
dtype: object

Keep all the original indices

>>> s1.differences(s2, keep_indices=True)
  self other
0  NaN   NaN
1    b     a
2  NaN   NaN
3    d     b
4  NaN   NaN

Keep all original indices and data

>>> s1.differences(s2, keep_indices=True, keep_values=True)
  self other
0    a     a
1    b     a
2    c     c
3    d     b
4    e     e
"""
    )
    @Appender(generic._shared_docs["differences"] % _shared_doc_kwargs)
    def differences(
        self, other: "Series", axis=1, keep_indices=False, keep_values=False
    ) -> FrameOrSeries:
        return super().differences(
            other=other, axis=axis, keep_indices=keep_indices, keep_values=keep_values
        )

def combine(self, other, func, fill_value=None) -> "Series":

"""
Combine the Series with a Series or scalar according to `func`.
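
The Series docstring in this hunk says the return type depends on the stacking axis: a DataFrame when stacking on columns, a Series when stacking on the index. A small check of that behaviour is sketched below against the released method, assuming pandas 1.1 or later, where it is called `compare` and the axis parameter is `align_axis` (both renames appear in the commit list above).

```python
import pandas as pd

s1 = pd.Series(["a", "b", "c", "d", "e"])
s2 = pd.Series(["a", "a", "c", "b", "e"])

# Stacking on columns adds a new axis, so the result is a DataFrame
# with "self" and "other" columns.
as_frame = s1.compare(s2)

# Stacking on the index keeps one dimension, so the result is a Series
# whose index has an inner "self"/"other" level.
as_series = s1.compare(s2, align_axis=0)

print(type(as_frame).__name__, type(as_series).__name__)
```

This is the case the `axis >= self.ndim` early return in `generic.py` covers: two Series concatenated along columns need no level reordering.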

130 changes: 130 additions & 0 deletions pandas/tests/frame/methods/test_differences.py
@@ -0,0 +1,130 @@
import numpy as np
import pytest

import pandas as pd
import pandas._testing as tm


@pytest.mark.parametrize("axis", [0, 1, "index", "columns"])
def test_differences_axis(axis):
    df = pd.DataFrame(
        {"col1": ["a", "b", "c"], "col2": [1.0, 2.0, np.nan], "col3": [1.0, 2.0, 3.0]},
        columns=["col1", "col2", "col3"],
    )
    df2 = df.copy()
    df2.loc[0, "col1"] = "c"
    df2.loc[2, "col3"] = 4.0

    result = df.differences(df2, axis=axis)

    if axis in (1, "columns"):
        indices = pd.Index([0, 2])
        columns = pd.MultiIndex.from_product([["col1", "col3"], ["self", "other"]])
        expected = pd.DataFrame(
            [["a", "c", np.nan, np.nan], [np.nan, np.nan, 3.0, 4.0]],
            index=indices,
            columns=columns,
        )
    else:
        indices = pd.MultiIndex.from_product([[0, 2], ["self", "other"]])
        columns = pd.Index(["col1", "col3"])
        expected = pd.DataFrame(
            [["a", np.nan], ["c", np.nan], [np.nan, 3.0], [np.nan, 4.0]],
            index=indices,
            columns=columns,
        )
    tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
    "keep_indices, keep_values",
    [
        (True, False),
        (False, True),
        (True, True),
        # False, False case is already covered in test_differences_axis
    ],
)
def test_differences_various_formats(keep_indices, keep_values):
    df = pd.DataFrame(
        {"col1": ["a", "b", "c"], "col2": [1.0, 2.0, np.nan], "col3": [1.0, 2.0, 3.0]},
        columns=["col1", "col2", "col3"],
    )
    df2 = df.copy()
    df2.loc[0, "col1"] = "c"
    df2.loc[2, "col3"] = 4.0

    result = df.differences(df2, keep_indices=keep_indices, keep_values=keep_values)

    if keep_indices:
        indices = pd.Index([0, 1, 2])
        columns = pd.MultiIndex.from_product(
            [["col1", "col2", "col3"], ["self", "other"]]
        )
        if keep_values:
            expected = pd.DataFrame(
                [
                    ["a", "c", 1.0, 1.0, 1.0, 1.0],
                    ["b", "b", 2.0, 2.0, 2.0, 2.0],
                    ["c", "c", np.nan, np.nan, 3.0, 4.0],
                ],
                index=indices,
                columns=columns,
            )
        else:
            expected = pd.DataFrame(
                [
                    ["a", "c", np.nan, np.nan, np.nan, np.nan],
                    [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                    [np.nan, np.nan, np.nan, np.nan, 3.0, 4.0],
                ],
                index=indices,
                columns=columns,
            )
    else:
        indices = pd.Index([0, 2])
        columns = pd.MultiIndex.from_product([["col1", "col3"], ["self", "other"]])
        expected = pd.DataFrame(
            [["a", "c", 1.0, 1.0], ["c", "c", 3.0, 4.0]], index=indices, columns=columns
        )
    tm.assert_frame_equal(result, expected)


def test_differences_with_equal_nulls():
    # We want to make sure two NaNs are considered the same
    # and dropped where applicable
    df = pd.DataFrame(
        {"col1": ["a", "b", "c"], "col2": [1.0, 2.0, np.nan], "col3": [1.0, 2.0, 3.0]},
        columns=["col1", "col2", "col3"],
    )
    df2 = df.copy()
    df2.loc[0, "col1"] = "c"

    result = df.differences(df2)
    indices = pd.Index([0])
    columns = pd.MultiIndex.from_product([["col1"], ["self", "other"]])
    expected = pd.DataFrame([["a", "c"]], index=indices, columns=columns)
    tm.assert_frame_equal(result, expected)


def test_differences_with_non_equal_nulls():
    # We want to make sure the relevant NaNs do not get dropped
    # even if the entire row or column are NaNs
    df = pd.DataFrame(
        {"col1": ["a", "b", "c"], "col2": [1.0, 2.0, np.nan], "col3": [1.0, 2.0, 3.0]},
        columns=["col1", "col2", "col3"],
    )
    df2 = df.copy()
    df2.loc[0, "col1"] = "c"
    df2.loc[2, "col3"] = np.nan

    result = df.differences(df2)

    indices = pd.Index([0, 2])
    columns = pd.MultiIndex.from_product([["col1", "col3"], ["self", "other"]])
    expected = pd.DataFrame(
        [["a", "c", np.nan, np.nan], [np.nan, np.nan, 3.0, np.nan]],
        index=indices,
        columns=columns,
    )
    tm.assert_frame_equal(result, expected)
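
The two null-handling tests above depend on the mask treating a NaN/NaN pair as equal while still flagging a value/NaN pair as different. A minimal sketch of why a plain `!=` is not enough, using the same mask expression as the `generic.py` hunk (the Series `a` and `b` are made-up illustration data):

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, np.nan])

# NaN != NaN is True under IEEE 754, so a plain comparison
# reports position 1 as a difference.
naive = a != b

# Treat "both missing" as equal, as the mask in the diff does.
nan_aware = ~((a == b) | (a.isna() & b.isna()))

print(naive.tolist())      # [False, True, True]
print(nan_aware.tolist())  # [False, False, True]
```

Position 2 (3.0 versus NaN) stays flagged in both cases, which is the behaviour `test_differences_with_non_equal_nulls` asserts.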