Make DataFrame arithmetic ops with 2D arrays behave like numpy analogues #23000

Merged
merged 14 commits on Oct 7, 2018
32 changes: 32 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -488,6 +488,38 @@ Previous Behavior:
0 NaT


.. _whatsnew_0240.api.dataframe_arithmetic_broadcasting:

DataFrame Arithmetic Operations Broadcasting Changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Arithmetic operations between a :class:`DataFrame` and a 2-dimensional
``np.ndarray`` now broadcast in the same way as operations between two
``np.ndarray`` objects. (:issue:`23000`)

Previous Behavior:

.. code-block:: ipython

   In [3]: arr = np.arange(6).reshape(3, 2)
   In [4]: df = pd.DataFrame(arr)
   In [5]: df + arr[[0], :]   # 1 row, 2 columns
   ...
   ValueError: Unable to coerce to DataFrame, shape must be (3, 2): given (1, 2)
   In [6]: df + arr[:, [1]]   # 1 column, 3 rows
   ...
   ValueError: Unable to coerce to DataFrame, shape must be (3, 2): given (3, 1)

*Current Behavior*:

.. ipython:: python

   arr = np.arange(6).reshape(3, 2)
   df = pd.DataFrame(arr)
   df

Contributor:

Add another ``.. ipython:: python`` directive here. As written, the blank line gets removed and everything renders as a single block; a second directive gives you a separate cell.

   df + arr[[0], :]   # 1 row, 2 columns
   df + arr[:, [1]]   # 1 column, 3 rows
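
Editorial sketch (not part of the diff): the snippet below illustrates the behavior the whatsnew entry describes, assuming a pandas version that includes this change. It compares the DataFrame results against plain numpy broadcasting on the same arrays; the names `row`, `col`, `row_result`, and `col_result` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)
df = pd.DataFrame(arr)

row = arr[[0], :]  # shape (1, 2): one row, two columns
col = arr[:, [1]]  # shape (3, 1): one column, three rows

row_result = df + row
col_result = df + col

# The DataFrame results should match what numpy broadcasting gives
# for the underlying arrays.
assert np.array_equal(row_result.values, arr + row)
assert np.array_equal(col_result.values, arr + col)
```

The index and columns of `df` carry over to both results; only the values are broadcast.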


.. _whatsnew_0240.api.extension:

ExtensionType Changes
25 changes: 22 additions & 3 deletions pandas/core/ops.py
@@ -1799,14 +1799,33 @@ def to_series(right):
             right = to_series(right)
 
         elif right.ndim == 2:
-            if left.shape != right.shape:
+            if right.shape == left.shape:
+                right = left._constructor(right, index=left.index,
+                                          columns=left.columns)
+
+            elif right.shape[0] == left.shape[0] and right.shape[1] == 1:
+                # Broadcast across columns
+                try:
+                    right = np.broadcast_to(right, left.shape)
+                except AttributeError:
+                    # numpy < 1.10.0
+                    right = np.tile(right, (1, left.shape[1]))
+
+                right = left._constructor(right,
+                                          index=left.index,
+                                          columns=left.columns)
+                # TODO: Double-check this doesn't make copies
Contributor:

Is this relevant?

Member Author:

Yes, for performance reasons, if it turns out that it does make copies. At least with a sufficiently new numpy, we're passing a view to left._constructor.

a = np.arange(3)
b = a.reshape(3, 1)
c = np.broadcast_to(b, (3, 2))
d = c.copy()

df = pd.DataFrame(c)
df2 = pd.DataFrame(d)

>>> df.values.base is a  # <-- the concern is that this comes back False
True

>>> df2.values.base is d
True

In this example it's OK. I left the comment as a reminder to do a more thorough check. Are you confident this is always OK?
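
Editorial sketch (not part of the thread): the view-versus-copy question above can be checked at the numpy level, independently of pandas. The snippet assumes a numpy new enough to provide `np.broadcast_to` and `np.shares_memory`; the names `view` and `tiled` are illustrative. It says nothing about whether the DataFrame constructor itself copies, which may depend on the pandas version.

```python
import numpy as np

b = np.arange(3).reshape(3, 1)

# broadcast_to returns a read-only view onto b's memory.
view = np.broadcast_to(b, (3, 2))

# tile (the fallback path in the diff) materializes a new array.
tiled = np.tile(b, (1, 2))

assert np.shares_memory(view, b)
assert not np.shares_memory(tiled, b)

# The broadcast axis has stride 0: every column reads the same bytes.
assert view.strides[1] == 0
assert not view.flags.writeable

# Both paths produce the same values.
assert np.array_equal(view, tiled)
```

So the two branches agree on values but differ in memory cost, which is the performance concern raised in the TODO.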


+            elif right.shape[1] == left.shape[1] and right.shape[0] == 1:
+                # Broadcast along rows
+                right = to_series(right[0, :])
+
+            else:
                 raise ValueError("Unable to coerce to DataFrame, shape "
                                  "must be {req_shape}: given {given_shape}"
                                  .format(req_shape=left.shape,
                                          given_shape=right.shape))
 
-            right = left._constructor(right, index=left.index,
-                                      columns=left.columns)
         elif right.ndim > 2:
             raise ValueError('Unable to coerce to Series/DataFrame, dim '
                              'must be <= 2: {dim}'.format(dim=right.shape))
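
Editorial sketch (not part of the diff): the shape checks in this hunk can be read as a small decision table. The function below, `broadcast_kind`, is a hypothetical helper written for illustration only; it mirrors the order of the conditions as they appear in the hunk and is not a pandas API.

```python
import numpy as np


def broadcast_kind(left_shape, right):
    """Hypothetical helper: name the branch a 2D right operand would take."""
    if right.ndim != 2:
        raise ValueError("this sketch only covers 2D operands")
    if right.shape == left_shape:
        return "same-shape"
    if right.shape[0] == left_shape[0] and right.shape[1] == 1:
        return "column-like"  # broadcast across columns
    if right.shape[1] == left_shape[1] and right.shape[0] == 1:
        return "row-like"  # broadcast along rows
    raise ValueError(
        "Unable to coerce to DataFrame, shape must be {req}: given {given}"
        .format(req=left_shape, given=right.shape)
    )


arr = np.arange(6).reshape(3, 2)
kinds = [
    broadcast_kind((3, 2), arr),
    broadcast_kind((3, 2), arr[:, [1]]),
    broadcast_kind((3, 2), arr[[0], :]),
]
```

Read this way, a `(1, 1)` operand against a `(3, 2)` frame matches none of the three conditions and ends up in the error branch, at least as the hunk is written here.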
43 changes: 43 additions & 0 deletions pandas/tests/frame/test_arithmetic.py
@@ -99,6 +99,7 @@ def test_df_flex_cmp_constant_return_types_empty(self, opname):
# Arithmetic

class TestFrameFlexArithmetic(object):

Contributor:

Do we have sufficient coverage for a broadcast op with a non-homogeneous frame?

Member Author:

It's pretty scattered. Within this module specifically, it's pretty bare.
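
Editorial sketch (not part of the thread): a broadcast op on a non-homogeneous frame, as asked about above, might be exercised along these lines. The column names `ints` and `floats` and the expected frames are illustrative assumptions, and `check_dtype=False` is used because the exact result dtypes may vary by platform and pandas version.

```python
import numpy as np
import pandas as pd

# A frame with one integer column and one float column.
df = pd.DataFrame({"ints": [1, 2, 3], "floats": [0.5, 1.5, 2.5]})

collike = np.array([[10], [20], [30]])  # shape (nrows, 1)
rowlike = np.array([[10, 100]])         # shape (1, ncols)

col_result = df + collike
row_result = df + rowlike

col_expected = pd.DataFrame({"ints": [11, 22, 33],
                             "floats": [10.5, 21.5, 32.5]})
row_expected = pd.DataFrame({"ints": [11, 12, 13],
                             "floats": [100.5, 101.5, 102.5]})

# Compare values only; dtypes are deliberately not asserted here.
pd.testing.assert_frame_equal(col_result, col_expected, check_dtype=False)
pd.testing.assert_frame_equal(row_result, row_expected, check_dtype=False)
```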

    def test_df_add_td64_columnwise(self):
        # GH#22534 Check that column-wise addition broadcasts correctly
        dti = pd.date_range('2016-01-01', periods=10)
@@ -252,6 +253,48 @@ def test_arith_flex_zero_len_raises(self):


class TestFrameArithmetic(object):
    # TODO: tests for other arithmetic ops
    def test_df_add_2d_array_rowlike_broadcasts(self):
Contributor:

Can you expand this to use the all_arithmetic_ops fixture (or at least some of the ops)?

Member Author:

Sure. I'll keep these tests as-is because their expected values are written out explicitly, and add another pair of tests using the fixtures.
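
Editorial sketch (not part of the thread): one way to cover more operators without the pandas-internal `all_arithmetic_ops` fixture is a plain loop over functions from the standard `operator` module, using numpy broadcasting as the reference. The operator list, the float data, and the `checked` bookkeeping list are assumptions made for this illustration; the tests actually added in the PR may look different.

```python
import operator

import numpy as np
import pandas as pd

# Floats starting at 1 keep division and floor division away from zero.
arr = np.arange(1, 7, dtype="float64").reshape(3, 2)
df = pd.DataFrame(arr, columns=["x", "y"], index=["A", "B", "C"])

rowlike = arr[[1], :]  # shape (1, ncols)
collike = arr[:, [1]]  # shape (nrows, 1)

ops = [operator.add, operator.sub, operator.mul,
       operator.truediv, operator.floordiv, operator.pow]

checked = []
for op in ops:
    for other in (rowlike, collike):
        # DataFrame on the left: numpy broadcasting is the reference.
        expected = pd.DataFrame(op(arr, other),
                                index=df.index, columns=df.columns)
        pd.testing.assert_frame_equal(op(df, other), expected)

        # DataFrame on the right (reflected operation).
        expected = pd.DataFrame(op(other, arr),
                                index=df.index, columns=df.columns)
        pd.testing.assert_frame_equal(op(other, df), expected)

        checked.append((op.__name__, other.shape))
```

Using numpy as the oracle keeps the loop short, at the cost of the explicitly written-out expected frames that the hand-written tests provide.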

        # GH#
        arr = np.arange(6).reshape(3, 2)
        df = pd.DataFrame(arr, columns=[True, False], index=['A', 'B', 'C'])

        rowlike = arr[[1], :]  # shape --> (1, ncols)
        assert rowlike.shape == (1, df.shape[1])

        expected = pd.DataFrame([[2, 4],
                                 [4, 6],
                                 [6, 8]],
                                columns=df.columns, index=df.index,
                                # specify dtype explicitly to avoid failing
                                # on 32bit builds
                                dtype=arr.dtype)
        result = df + rowlike
        tm.assert_frame_equal(result, expected)
        result = rowlike + df
        tm.assert_frame_equal(result, expected)

    # TODO: tests for other arithmetic ops
    def test_df_add_2d_array_collike_broadcasts(self):
        # GH#
        arr = np.arange(6).reshape(3, 2)
        df = pd.DataFrame(arr, columns=[True, False], index=['A', 'B', 'C'])

        collike = arr[:, [1]]  # shape --> (nrows, 1)
        assert collike.shape == (df.shape[0], 1)

        expected = pd.DataFrame([[1, 2],
                                 [5, 6],
                                 [9, 10]],
                                columns=df.columns, index=df.index,
                                # specify dtype explicitly to avoid failing
                                # on 32bit builds
                                dtype=arr.dtype)
        result = df + collike
        tm.assert_frame_equal(result, expected)
        result = collike + df
        tm.assert_frame_equal(result, expected)

    def test_df_bool_mul_int(self):
        # GH#22047, GH#22163 multiplication by 1 should result in int dtype,
        # not object dtype