Skip to content

[CLN] Dispatch (some) Frame ops to Series, avoiding _data.eval #22019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
Oct 3, 2018
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
5195dad
avoid casting to object dtype in mixed-type frames
jbrockmendel Jul 22, 2018
1a66906
Dispatch to Series ops in _combine_match_columns
jbrockmendel Jul 22, 2018
a45abec
comment
jbrockmendel Jul 22, 2018
8594c48
docstring
jbrockmendel Jul 22, 2018
108550e
flake8 fixup
jbrockmendel Jul 22, 2018
07fb477
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Jul 24, 2018
946e54a
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Jul 24, 2018
323f45e
dont bother with try_cast_result
jbrockmendel Jul 25, 2018
b3cedf1
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Jul 25, 2018
a0708d1
revert non-central change
jbrockmendel Jul 25, 2018
5d3db89
simplify
jbrockmendel Jul 25, 2018
aa41de3
revert try_cast_results
jbrockmendel Jul 25, 2018
01c3720
revert non-central changes
jbrockmendel Jul 25, 2018
bcb6735
Fixup typo syntaxerror
jbrockmendel Jul 25, 2018
b3ef417
simplify assertion
jbrockmendel Jul 25, 2018
4b2e21a
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Jul 26, 2018
330a94a
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Jul 27, 2018
4c4f626
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Aug 10, 2018
757e2ae
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 8, 2018
ff96c0d
use dispatch_to_series in combine_match_columns
jbrockmendel Sep 8, 2018
17f33b6
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 12, 2018
c30f898
Pass unwrapped op where appropriate
jbrockmendel Sep 12, 2018
82a7928
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 12, 2018
52b7102
catch correct error
jbrockmendel Sep 12, 2018
453ae8e
whatsnew note
jbrockmendel Sep 12, 2018
93887cf
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 13, 2018
dd115c8
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 18, 2018
265ec78
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 18, 2018
db5ca89
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 27, 2018
f574c24
comment
jbrockmendel Sep 27, 2018
e6821a2
whatsnew section
jbrockmendel Sep 27, 2018
da2e32c
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 28, 2018
a6a7f58
remove unnecessary tester
jbrockmendel Sep 29, 2018
b21a8a2
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Sep 29, 2018
858165f
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Oct 2, 2018
31d4089
Merge branch 'master' of https://github.com/pandas-dev/pandas into mops2
jbrockmendel Oct 2, 2018
5832c2b
doc fixup
jbrockmendel Oct 2, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 53 additions & 9 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,8 @@
check_bool_indexer)
from pandas.core.internals import (BlockManager,
create_block_manager_from_arrays,
create_block_manager_from_blocks)
create_block_manager_from_blocks,
try_cast_result)
from pandas.core.series import Series
from pandas.core.arrays import Categorical, ExtensionArray
import pandas.core.algorithms as algorithms
Expand Down Expand Up @@ -4921,21 +4922,64 @@ def _arith_op(left, right):
copy=False)

def _combine_match_index(self, other, func, level=None):
assert isinstance(other, Series)
left, right = self.align(other, join='outer', axis=0, level=level,
copy=False)
new_data = func(left.values.T, right.values).T
return self._constructor(new_data,
index=left.index, columns=self.columns,
copy=False)
assert left.index.equals(right.index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there cases where this might no be true? If not, then let's remove it, as it could be somewhat expensive (though maybe not, since the output of align shares indices.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always holds. I added it largely so I wouldn't have to keep going back to core.ops to double-check what was being passed.


if left._is_mixed_type or right._is_mixed_type:
if self.columns.is_unique:
new_data = {col: func(left[col], right)
for col in left.columns}
result = self._constructor(new_data,
index=left.index,
columns=left.columns,
copy=False)
return result
else:
new_data = [func(left.iloc[:, idx], right)
for idx in range(len(left.columns))]
result = self._constructor(new_data,
index=left.index,
copy=False)
result.columns = left.columns
return result

else:
# easy case, operate directly on values
result = func(left.values.T, right.values).T
return self._constructor(result,
index=left.index, columns=self.columns,
copy=False)

def _combine_match_columns(self, other, func, level=None, try_cast=True):
# TODO: `func` passed here is wrapped in core.ops; if we are
# dispatching to Series implementation, should we pass unwrapped?
assert isinstance(other, Series)
left, right = self.align(other, join='outer', axis=1, level=level,
copy=False)
assert left.columns.equals(right.index), (left.columns, right.index)

new_data = left._data.eval(func=func, other=right,
axes=[left.columns, self.index],
try_cast=try_cast)
return self._constructor(new_data)
new_data = [func(left.iloc[:, n], right.iloc[n])
for n in range(len(left.columns))]

if try_cast:
new_data = [try_cast_result(left.iloc[:, n], new_data[n])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls don't repeat code for the unique / non-unique case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be trimmed down a bit, but is based on the existing _combine_frame code. I'm anticipating that after combine_const is "fixed" most of this will move over to core.ops and we'll put some effort into being less verbose.

for n in range(len(left.columns))]

if left.columns.is_unique:
new_data = {left.columns[n]: new_data[n]
for n in range(len(left.columns))}
result = self._constructor(new_data,
index=left.index, columns=left.columns,
copy=False)
return result

else:
new_data = {i: new_data[i] for i in range(len(new_data))}
result = self._constructor(new_data, index=left.index, copy=False)
result.columns = left.columns
return result

def _combine_const(self, other, func, errors='raise', try_cast=True):
new_data = self._data.eval(func=func, other=other,
Expand Down
78 changes: 50 additions & 28 deletions pandas/core/internals/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,55 @@
from pandas.compat import range, map, zip, u


def try_cast_result(left, result, dtype=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IF you need this function, it should be in panda/core/dtypes/cast.py. I say IF because this looks like lots of other functions there. please see if you can simplify. we want to remove things from internals not add them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be in panda/core/dtypes/cast.py

Sure, though I'm hoping we can get rid of it before long.

I say IF because this looks like lots of other functions there

This is based on the existing Block._try_cast_result, which is changed to just call this function.

we want to remove things from internals not add them.

Very much so. The point of this PR (along with another branch in progress) is to remove Block.eval and BlockManager.eval. (and in doing so ensure that Series/DataFrame ops all agree)

"""
Try to cast the result to the original dtype for `left`; we may have
roundtripped thru object in the mean-time.

Parameters
----------
left : array-like
result : array-like
dtype : np.dtype, pd.dtype, or None (default None)

Returns
-------
maybe_casted : same type as `result`
"""
if dtype is None:
dtype = left.dtype

if (is_integer_dtype(left) or is_bool_dtype(left) or
is_datetime64_dtype(left) or is_datetime64tz_dtype(left)):
pass
elif is_object_dtype(left) and lib.infer_dtype(left) == 'boolean':
# disguised is_bool_dtype
pass
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nearly cut/paste from the existing Block._try_cast_result implementation. The difference is this uses e.g. is_integer_dtype(left) instead of left.is_integer. The is_foo attributes map directly onto the is_foo_dtype checks with the exception of is_bool, which requires this extra object-dtype check (this is the same way Index.is_boolean works).

elif is_float_dtype(left) and result.dtype == left.dtype:

# protect against a bool/object showing up here
if isinstance(dtype, compat.string_types) and dtype == 'infer':
return result
if not isinstance(dtype, type):
dtype = dtype.type
if issubclass(dtype, (np.bool_, np.object_)):
if issubclass(dtype, np.bool_):
if isna(result).all():
return result.astype(np.bool_)
else:
result = result.astype(np.object_)
result[result == 1] = True
result[result == 0] = False
return result
else:
return result.astype(np.object_)

return result

# may need to change the dtype here
return maybe_downcast_to_dtype(result, dtype)


class Block(PandasObject):
"""
Canonical n-dimensional unit of homogeneous dtype contained in a pandas
Expand Down Expand Up @@ -711,34 +760,7 @@ def _try_cast_result(self, result, dtype=None):
""" try to cast the result to our original type, we may have
roundtripped thru object in the mean-time
"""
if dtype is None:
dtype = self.dtype

if self.is_integer or self.is_bool or self.is_datetime:
pass
elif self.is_float and result.dtype == self.dtype:

# protect against a bool/object showing up here
if isinstance(dtype, compat.string_types) and dtype == 'infer':
return result
if not isinstance(dtype, type):
dtype = dtype.type
if issubclass(dtype, (np.bool_, np.object_)):
if issubclass(dtype, np.bool_):
if isna(result).all():
return result.astype(np.bool_)
else:
result = result.astype(np.object_)
result[result == 1] = True
result[result == 0] = False
return result
else:
return result.astype(np.object_)

return result

# may need to change the dtype here
return maybe_downcast_to_dtype(result, dtype)
return try_cast_result(self, result, dtype)

def _try_coerce_args(self, values, other):
""" provide coercion to our input arguments """
Expand Down
28 changes: 10 additions & 18 deletions pandas/core/sparse/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -585,27 +585,22 @@ def _combine_match_index(self, other, func, level=None):
if level is not None:
raise NotImplementedError("'level' argument is not supported")

new_index = self.index.union(other.index)
this = self
if self.index is not new_index:
this = self.reindex(new_index)

if other.index is not new_index:
other = other.reindex(new_index)
this, other = self.align(other, join='outer', axis=0,
level=level, copy=False)

new_data = {}
for col, series in compat.iteritems(this):
new_data[col] = func(series.values, other.values)

# fill_value is a function of our operator
fill_value = None
if isna(other.fill_value) or isna(self.default_fill_value):
fill_value = np.nan
else:
fill_value = func(np.float64(self.default_fill_value),
np.float64(other.fill_value))

return self._constructor(
new_data, index=new_index, columns=self.columns,
new_data, index=this.index, columns=this.columns,
default_fill_value=fill_value).__finalize__(self)

def _combine_match_columns(self, other, func, level=None, try_cast=True):
Expand All @@ -617,19 +612,16 @@ def _combine_match_columns(self, other, func, level=None, try_cast=True):
if level is not None:
raise NotImplementedError("'level' argument is not supported")

new_data = {}

union = intersection = self.columns

if not union.equals(other.index):
union = other.index.union(self.columns)
intersection = other.index.intersection(self.columns)
left, right = self.align(other, join='outer', axis=1,
level=level, copy=False)

for col in intersection:
new_data = {}
for col in other.index.intersection(self.columns):
new_data[col] = func(self[col], float(other[col]))
# TODO: Why are we casting other[col] to float?

return self._constructor(
new_data, index=self.index, columns=union,
new_data, index=left.index, columns=left.columns,
default_fill_value=self.default_fill_value).__finalize__(self)

def _combine_const(self, other, func, errors='raise', try_cast=True):
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/frame/test_axis_select_reindex.py
Original file line number Diff line number Diff line change
Expand Up @@ -738,7 +738,7 @@ def test_align_int_fill_bug(self):

result = df1 - df1.mean()
expected = df2 - df2.mean()
assert_frame_equal(result, expected)
assert_frame_equal(result.astype('f8'), expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR needs a note in the what's new describing this change (as its effectievly an API change)


def test_align_multiindex(self):
# GH 10665
Expand Down
8 changes: 5 additions & 3 deletions pandas/tests/reshape/test_pivot.py
Original file line number Diff line number Diff line change
Expand Up @@ -1565,8 +1565,9 @@ def test_crosstab_normalize(self):
full_normal)
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize='index'),
row_normal)
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize='columns'),
col_normal)
tm.assert_frame_equal(
pd.crosstab(df.a, df.b, normalize='columns').astype('f8'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this changing is not great, the point of these ops is that the dtypes won't change on the pivot

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is happening because ATM the casting is done at the Block-level, while with the PR casting is done at the column-level. pd.crosstab(df.a, df.b, normalize='columns') ATM returns

b    3    4
a          
1  0.5  0.0
2  0.5  1.0

and under the PR gives:

b    3    4
a          
1  0.5  0
2  0.5  1

i.e. the 4 column is int64. Note that the original inputs df.a and df.b are both int64 to begin with. I don't know crosstab well enough to comment on the ideal output, but it's hard to call this less-correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the correct way to answer this is ask "was the upcast from int to float the result of coercing to a single dtype for multiple columns?" If so, then we should preserve integer dtype where we can, i.e.

For an all-integer crosstab, we preserve integer dtype:

In [27]: a = pd.Series([1, 2])

In [28]: pd.crosstab(a, a)
Out[28]:
col_0  1  2
row_0
1      1  0
2      0  1

so this change seems good (and maybe worth explicitly testing rather than astyping the output, to ensure that we don't regress)?

col_normal)
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize=1),
pd.crosstab(df.a, df.b, normalize='columns'))
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize=0),
Expand Down Expand Up @@ -1599,7 +1600,8 @@ def test_crosstab_normalize(self):
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize='index',
margins=True), row_normal_margins)
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize='columns',
margins=True), col_normal_margins)
margins=True).astype('f8'),
col_normal_margins)
tm.assert_frame_equal(pd.crosstab(df.a, df.b, normalize=True,
margins=True), all_normal_margins)

Expand Down
7 changes: 5 additions & 2 deletions pandas/tests/series/test_operators.py
Original file line number Diff line number Diff line change
Expand Up @@ -1547,8 +1547,11 @@ def tester(a, b):
# this is an alignment issue; these are equivalent
# https://github.com/pandas-dev/pandas/issues/5284

pytest.raises(ValueError, lambda: d.__and__(s, axis='columns'))
pytest.raises(ValueError, tester, s, d)
with pytest.raises(TypeError):
d.__and__(s, axis='columns')

with pytest.raises(TypeError):
tester(s, d)

# this is wrong as its not a boolean result
# result = d.__and__(s,axis='index')
Expand Down