Operators between DataFrame and Series fail on large dataframes #27636

ccharlesgb · 2019-07-29T08:54:57Z

Code Sample

import pandas as pd

ind = list(range(0, 100))
cols = list(range(0, 300))
df = pd.DataFrame(index=ind, columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)
print(df.multiply(series, axis=1).head())  # Works fine
ind = list(range(0, 100000))
cols = list(range(0, 300))
df = pd.DataFrame(index=ind, columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)
print(df.add(series,axis=1).head())

Code Output:

   0    1    2    3    4    5    ...    294    295    296    297    298    299
0  0.0  1.0  2.0  3.0  4.0  5.0  ...  294.0  295.0  296.0  297.0  298.0  299.0
1  0.0  1.0  2.0  3.0  4.0  5.0  ...  294.0  295.0  296.0  297.0  298.0  299.0
2  0.0  1.0  2.0  3.0  4.0  5.0  ...  294.0  295.0  296.0  297.0  298.0  299.0
3  0.0  1.0  2.0  3.0  4.0  5.0  ...  294.0  295.0  296.0  297.0  298.0  299.0
4  0.0  1.0  2.0  3.0  4.0  5.0  ...  294.0  295.0  296.0  297.0  298.0  299.0
[5 rows x 300 columns]
Traceback (most recent call last):
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-25-4d9165e5df4a>", line 15, in <module>
    print(df.add(series,axis=1).head())
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 1499, in f
    self, other, pass_op, fill_value=fill_value, axis=axis, level=level
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 1388, in _combine_series_frame
    return self._combine_match_columns(other, func, level=level)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\frame.py", line 5392, in _combine_match_columns
    return ops.dispatch_to_series(left, right, func, axis="columns")
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 596, in dispatch_to_series
    new_data = expressions.evaluate(column_op, str_rep, left, right)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 220, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 126, in _evaluate_numexpr
    result = _evaluate_standard(op, op_str, a, b)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 70, in _evaluate_standard
    return op(a, b)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 584, in column_op
    return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))}
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 584, in <dictcomp>
    return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))}
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 1473, in na_op
    result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 220, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 101, in _evaluate_numexpr
    if _can_use_numexpr(op, op_str, a, b, "evaluate"):
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 84, in _can_use_numexpr
    s = o.dtypes.value_counts()
AttributeError: 'numpy.dtype' object has no attribute 'value_counts'

Problem description

I think this is a regression introduced somewhere between pandas 0.19.2 and 0.25. If you multiply a DataFrame by a Series with axis=1 (or use any other operator function such as add/divide), pandas crashes in the _can_use_numexpr function once the DataFrame/Series becomes large enough. Presumably the size check on the operands does not pass for small datasets, so the failing line is only reached for larger ones.

#pandas/core/computation/expressions.py : 73
def _can_use_numexpr(op, op_str, a, b, dtype_check):
    """ return a boolean if we WILL be using numexpr """
    if op_str is not None:

        # required min elements (otherwise we are adding overhead)
        if np.prod(a.shape) > _MIN_ELEMENTS:

            # check for dtype compatibility
            dtypes = set()
            for o in [a, b]:
                if hasattr(o, "dtypes"):
                    s = o.dtypes.value_counts()  # Fails here
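
A minimal sketch of the size gate described above, to show why only the larger frame reaches the failing line. The traceback shows the operation being applied column by column, so the shape that matters is that of a single column. `MIN_ELEMENTS` and `passes_size_gate` are hypothetical names, and the value 10_000 is an assumption rather than something quoted from the pandas source.

```python
import numpy as np

# Hypothetical stand-in for pandas' private _MIN_ELEMENTS.
# The value 10_000 is an assumption made for illustration only.
MIN_ELEMENTS = 10_000


def passes_size_gate(shape, min_elements=MIN_ELEMENTS):
    """Return True when an operand is large enough to reach the dtype check."""
    return bool(np.prod(shape) > min_elements)


small_column = (100,)      # one column of the 100-row frame
large_column = (100_000,)  # one column of the 100,000-row frame

print(passes_size_gate(small_column))
print(passes_size_gate(large_column))
```

Under that assumption the 100-row column skips the dtype inspection entirely, while the 100,000-row column goes on to call `o.dtypes.value_counts()`.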

In pandas 0.19.2 the function uses the get_dtype_counts() method instead to inspect if the dtype is uniform in the object:

def _can_use_numexpr(op, op_str, a, b, dtype_check):
    """ return a boolean if we WILL be using numexpr """
    if op_str is not None:

        # required min elements (otherwise we are adding overhead)
        if np.prod(a.shape) > _MIN_ELEMENTS:

            # check for dtype compatiblity
            dtypes = set()
            for o in [a, b]:
                if hasattr(o, 'get_dtype_counts'):
                    s = o.get_dtype_counts()

I have a workaround which is to transpose the dataframe and use axis=0:

df.T.add(series,axis=0).T.head()
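
As a rough check that the transpose workaround computes the same values as the axis=1 call, here is a small sketch on a frame tiny enough that the axis=1 path does not hit the bug. The frame size and variable names are chosen for illustration only.

```python
import pandas as pd

cols = list(range(5))
df = pd.DataFrame(index=range(3), columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)

# Direct form: align the Series with the columns.
direct = df.add(series, axis=1)

# Workaround: transpose, align the Series with the index, transpose back.
workaround = df.T.add(series, axis=0).T

print((direct.to_numpy() == workaround.to_numpy()).all())
```

Note that the two transposes copy the data, so on a very large frame the workaround may cost noticeably more memory and time than the direct call.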

I noticed get_dtype_counts() is deprecated ( #27145 ), and that appears to be the PR that caused this regression: Series.dtypes returns a single numpy dtype, which does not have a value_counts() method.
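
The difference between the two `dtypes` attributes can be seen directly. This is only an illustrative sketch using a two-column float frame; the variable names are not from the pandas source.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
column = frame["a"]

# DataFrame.dtypes is a Series of per-column dtypes, so value_counts() exists.
assert isinstance(frame.dtypes, pd.Series)
frame_counts = frame.dtypes.value_counts()

# Series.dtypes is a single dtype object, which has no value_counts().
assert isinstance(column.dtypes, np.dtype)
has_value_counts = hasattr(column.dtypes, "value_counts")
print(has_value_counts)
```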

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.5.final.0
python-bits : 64
OS : Windows
OS-release : 7
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 0.25.0
numpy : 1.16.4
pytz : 2018.4
dateutil : 2.7.3
pip : 10.0.1
setuptools : 39.1.0
Cython : None
pytest : 3.5.1
hypothesis : None
sphinx : 1.8.2
blosc : None
feather : None
xlsxwriter : 1.0.4
lxml.etree : 4.1.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 6.4.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.1.1
matplotlib : 2.2.2
numexpr : 2.6.5
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.8
tables : 3.5.2
xarray : None
xlrd : 1.1.0
xlwt : None
xlsxwriter : 1.0.4


TomAugspurger · 2019-07-29T11:17:31Z

cc @jbrockmendel.

jbrockmendel · 2019-07-30T01:48:48Z

Looks like this was changed from obj.get_dtype_counts, which returns a Series for either a Series or a DataFrame, to obj.dtypes.value_counts. But Series.dtypes returns a scalar dtype, which is why value_counts raises AttributeError.

ccharlesgb · 2019-08-02T11:25:27Z

I can raise a PR to do an extra hasattr check on the dtypes. Would that fix it?

TomAugspurger · 2019-08-02T13:45:45Z

maybe change

if hasattr(o, "dtypes"):

to

if hasattr(o, "dtypes") and o.ndim > 1:
    ...
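
One way the suggested guard could look in isolation is sketched below. `collect_dtype_names` is a hypothetical helper written for this thread, not the actual body of `_can_use_numexpr`; it only shows that the `ndim > 1` check keeps `value_counts()` away from 1-D objects.

```python
import pandas as pd


def collect_dtype_names(*operands):
    """Gather dtype names, treating 2-D and 1-D operands separately."""
    dtypes = set()
    for o in operands:
        if hasattr(o, "dtypes") and o.ndim > 1:
            # DataFrame: dtypes is a Series, so value_counts() is available.
            counts = o.dtypes.value_counts()
            dtypes |= {str(dtype) for dtype in counts.index}
        elif hasattr(o, "dtype"):
            # Series, ndarray or numpy scalar: a single dtype.
            dtypes.add(str(o.dtype))
    return dtypes


frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
series = pd.Series([1, 2], dtype="int64")
print(collect_dtype_names(frame, series))
```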

TomAugspurger · 2019-08-02T13:46:02Z

But yes, a PR with tests and a release note in 0.25.1.rst would be very welcome.

ccharlesgb · 2019-08-02T15:33:23Z

Your suggestion worked as well and removed the need for the extra if statement. I added tests to test_expression.py, which uncovered some more issues with operators on DataFrames and Series with axis=1. Will update this issue once I know the cause.

jbrockmendel · 2019-08-02T15:36:05Z

> Will update this issue once I know the cause.

If it's feasible, it would be easier if you made a small PR specific to the bug here, then addressed the newly found bugs in separate steps.

ccharlesgb · 2019-08-05T09:20:20Z

It is feasible, but it would require a very narrow test. The issue I am having now is that numexpr fails on floordiv when operating on a DataFrame by a Series with axis=1. This was never caught because the test suite doesn't currently cover this case.

If we apply the fix suggested by @TomAugspurger and modify the example code snippet to:

import pandas as pd

ind = list(range(0, 100))
cols = list(range(0, 300))
df = pd.DataFrame(index=ind, columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)
print(df.floordiv(series, axis=1).head())  # Works fine
ind = list(range(0, 100000))
cols = list(range(0, 300))
df = pd.DataFrame(index=ind, columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)
print(df.floordiv(series,axis=1).head())

We get the following traceback:

Traceback (most recent call last):
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 1473, in na_op
    result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\computation\expressions.py", line 220, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\computation\expressions.py", line 116, in _evaluate_numexpr
    **eval_kwargs
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\numexpr\necompiler.py", line 802, in evaluate
    * 'no' means the data types should not be cast at all.
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\numexpr\necompiler.py", line 709, in getExprNames
    input_order = getInputOrder(ast, None)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\numexpr\necompiler.py", line 299, in stringToExpression
    ex = eval(c, names)
  File "<expr>", line 1, in <module>
TypeError: unsupported operand type(s) for //: 'VariableNode' and 'VariableNode'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<input>", line 12, in <module>
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 1499, in f
    self, other, pass_op, fill_value=fill_value, axis=axis, level=level
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 1388, in _combine_series_frame
    return self._combine_match_columns(other, func, level=level)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\frame.py", line 5392, in _combine_match_columns
    return ops.dispatch_to_series(left, right, func, axis="columns")
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 596, in dispatch_to_series
    new_data = expressions.evaluate(column_op, str_rep, left, right)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\computation\expressions.py", line 220, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\computation\expressions.py", line 126, in _evaluate_numexpr
    result = _evaluate_standard(op, op_str, a, b)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\computation\expressions.py", line 70, in _evaluate_standard
    return op(a, b)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 584, in column_op
    return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))}
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 584, in <dictcomp>
    return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))}
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 1475, in na_op
    result = masked_arith_op(x, y, op)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 451, in masked_arith_op
    assert isinstance(x, np.ndarray), type(x)
AssertionError: <class 'pandas.core.series.Series'>

masked_arith_op expects its params x and y to be ndarrays, but in this specific case x is a Series:

    # pandas/core/ops/__init__.py : 423
    # For Series `x` is 1D so ravel() is a no-op; calling it anyway makes
    # the logic valid for both Series and DataFrame ops.
    xrav = x.ravel()
    assert isinstance(x, np.ndarray), type(x)

Modifying this function to use xrav instead of just x does fix the issue, and all unit tests still pass, but I am not sure whether that is the true intention of the inline comment here.

Happy to restrict the tests to try every operator BUT floordiv if that is better to reduce the scope of the PR.

TomAugspurger · 2019-08-05T16:15:04Z

> Happy to restrict the tests to try every operator BUT floordiv if that is better to reduce the scope of the PR.

Let's do that for now. You can open another issue for the floordiv problem I think.

jbrockmendel added the Numeric Operations Arithmetic, Comparison, and Logical operations label Jul 29, 2019

TomAugspurger added the Regression Functionality that used to work in a prior pandas version label Aug 2, 2019

TomAugspurger added this to the 0.25.1 milestone Aug 2, 2019

ccharlesgb pushed a commit to ccharlesgb/pandas that referenced this issue Aug 6, 2019

BUG: _can_use_numexpr did not handle Series case correctly

dfda415

Fixes: pandas-dev#27636

ccharlesgb mentioned this issue Aug 6, 2019

BUG: _can_use_numexpr fails when passed large Series #27773

Merged

5 tasks

ccharlesgb pushed a commit to ccharlesgb/pandas that referenced this issue Aug 7, 2019

BUG: _can_use_numexpr did not handle Series case correctly

400c550

Fixes: pandas-dev#27636

jbrockmendel mentioned this issue Aug 12, 2019

Error dividing Dataframe by Series but only when size is large enough #27853

Closed

TomAugspurger closed this as completed in #27773 Aug 19, 2019

TomAugspurger mentioned this issue Aug 27, 2019

DataFrame.multiply(Series, axis=1) with 10k+ rows raises AttributeError #28171

Closed
