Operators between DataFrame and Series fail on large dataframes #27636

Closed
ccharlesgb opened this issue Jul 29, 2019 · 9 comments · Fixed by #27773
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@ccharlesgb
Contributor

Code Sample

import pandas as pd

ind = list(range(0, 100))
cols = list(range(0, 300))
df = pd.DataFrame(index=ind, columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)
print(df.multiply(series, axis=1).head())  # Works fine
ind = list(range(0, 100000))
cols = list(range(0, 300))
df = pd.DataFrame(index=ind, columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)
print(df.add(series, axis=1).head())  # Raises AttributeError on pandas 0.25.0

Code Output:

   0    1    2    3    4    5    ...    294    295    296    297    298    299
0  0.0  1.0  2.0  3.0  4.0  5.0  ...  294.0  295.0  296.0  297.0  298.0  299.0
1  0.0  1.0  2.0  3.0  4.0  5.0  ...  294.0  295.0  296.0  297.0  298.0  299.0
2  0.0  1.0  2.0  3.0  4.0  5.0  ...  294.0  295.0  296.0  297.0  298.0  299.0
3  0.0  1.0  2.0  3.0  4.0  5.0  ...  294.0  295.0  296.0  297.0  298.0  299.0
4  0.0  1.0  2.0  3.0  4.0  5.0  ...  294.0  295.0  296.0  297.0  298.0  299.0
[5 rows x 300 columns]
Traceback (most recent call last):
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-25-4d9165e5df4a>", line 15, in <module>
    print(df.add(series,axis=1).head())
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 1499, in f
    self, other, pass_op, fill_value=fill_value, axis=axis, level=level
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 1388, in _combine_series_frame
    return self._combine_match_columns(other, func, level=level)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\frame.py", line 5392, in _combine_match_columns
    return ops.dispatch_to_series(left, right, func, axis="columns")
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 596, in dispatch_to_series
    new_data = expressions.evaluate(column_op, str_rep, left, right)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 220, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 126, in _evaluate_numexpr
    result = _evaluate_standard(op, op_str, a, b)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 70, in _evaluate_standard
    return op(a, b)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 584, in column_op
    return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))}
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 584, in <dictcomp>
    return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))}
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\ops\__init__.py", line 1473, in na_op
    result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 220, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 101, in _evaluate_numexpr
    if _can_use_numexpr(op, op_str, a, b, "evaluate"):
  File "C:\dev\bin\anaconda\envs\py36\lib\site-packages\pandas\core\computation\expressions.py", line 84, in _can_use_numexpr
    s = o.dtypes.value_counts()
AttributeError: 'numpy.dtype' object has no attribute 'value_counts'

Problem description

I think this is a regression introduced somewhere between pandas 0.19.2 and 0.25. If you call an operator method such as multiply, add, or divide on a DataFrame with a Series and axis=1, pandas raises an AttributeError in the _can_use_numexpr function once the DataFrame becomes very large. Presumably the size check on the operands does not pass for small datasets, so the function returns early, whereas for larger ones it passes and execution reaches the failing line.

#pandas/core/computation/expressions.py : 73
def _can_use_numexpr(op, op_str, a, b, dtype_check):
    """ return a boolean if we WILL be using numexpr """
    if op_str is not None:

        # required min elements (otherwise we are adding overhead)
        if np.prod(a.shape) > _MIN_ELEMENTS:

            # check for dtype compatibility
            dtypes = set()
            for o in [a, b]:
                if hasattr(o, "dtypes"):
                    s = o.dtypes.value_counts()  # Fails here
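
To make the size gate concrete, here is a minimal sketch of the element-count check. The traceback shows the check running on one column at a time (column_op passes a.iloc[:, i]), so the relevant shape is (n_rows,). The helper name passes_size_gate and the threshold value 10_000 are assumptions for illustration, not copied from pandas; check expressions.py in your installed version for the real value.

```python
import numpy as np

# Assumed threshold, for illustration only.
MIN_ELEMENTS = 10_000


def passes_size_gate(shape):
    """Return True if an operand of this shape is large enough to reach the dtype check."""
    return bool(np.prod(shape) > MIN_ELEMENTS)


# One column of the 100-row frame: too small, so the dtype check is never reached.
print(passes_size_gate((100,)))
# One column of the 100000-row frame: large enough, so the failing line runs.
print(passes_size_gate((100_000,)))
```

Under these assumptions, only the second frame gets as far as o.dtypes.value_counts(), which would explain why the first call in the code sample succeeds.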

In pandas 0.19.2 the function uses the get_dtype_counts() method instead to inspect if the dtype is uniform in the object:

def _can_use_numexpr(op, op_str, a, b, dtype_check):
    """ return a boolean if we WILL be using numexpr """
    if op_str is not None:

        # required min elements (otherwise we are adding overhead)
        if np.prod(a.shape) > _MIN_ELEMENTS:

            # check for dtype compatiblity
            dtypes = set()
            for o in [a, b]:
                if hasattr(o, 'get_dtype_counts'):
                    s = o.get_dtype_counts()

I have a workaround which is to transpose the dataframe and use axis=0:

df.T.add(series,axis=0).T.head()
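
As a sanity check on the workaround, the following sketch compares the transposed form against the direct axis=1 call and against plain NumPy broadcasting. The frame is deliberately tiny so the snippet should also run on an affected pandas version; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(1.0, index=range(4), columns=range(3))
series = pd.Series([10.0, 20.0, 30.0], index=range(3))

# Direct form: add series[j] to every value in column j.
direct = df.add(series, axis=1)

# Workaround: transpose so the columns become the index, add along axis=0,
# then transpose back.
workaround = df.T.add(series, axis=0).T

# NumPy broadcasting of a (4, 3) array with a (3,) array gives the reference result.
expected = df.to_numpy() + series.to_numpy()

print(np.array_equal(direct.to_numpy(), expected))
print(np.array_equal(workaround.to_numpy(), expected))
```

Note that the double transpose copies the data, so on a very large frame the workaround costs extra memory and time.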

I noticed that get_dtype_counts() is deprecated (#27145). That PR appears to have caused this regression: for a Series, dtypes returns a single numpy dtype, which does not have a value_counts() method.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.5.final.0
python-bits : 64
OS : Windows
OS-release : 7
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 0.25.0
numpy : 1.16.4
pytz : 2018.4
dateutil : 2.7.3
pip : 10.0.1
setuptools : 39.1.0
Cython : None
pytest : 3.5.1
hypothesis : None
sphinx : 1.8.2
blosc : None
feather : None
xlsxwriter : 1.0.4
lxml.etree : 4.1.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 6.4.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.1.1
matplotlib : 2.2.2
numexpr : 2.6.5
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.8
tables : 3.5.2
xarray : None
xlrd : 1.1.0
xlwt : None
xlsxwriter : 1.0.4

@TomAugspurger
Contributor

cc @jbrockmendel.

@jbrockmendel jbrockmendel added the Numeric Operations Arithmetic, Comparison, and Logical operations label Jul 29, 2019
@jbrockmendel
Member

Looks like this was changed from obj.get_dtype_counts, which returns a Series for either a Series or a DataFrame, to obj.dtypes.value_counts. But Series.dtypes returns a single dtype rather than a Series, which is why value_counts raises AttributeError.
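
The difference between the two dtypes attributes can be seen in isolation with a short sketch. The frame and variable names are made up for illustration:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
column = frame["a"]

# DataFrame.dtypes is a Series (one entry per column), so value_counts() exists.
frame_dtypes = frame.dtypes
print(type(frame_dtypes).__name__)
print(hasattr(frame_dtypes, "value_counts"))

# Series.dtypes is a single dtype object, which has no value_counts().
column_dtypes = column.dtypes
print(type(column_dtypes).__name__)
print(hasattr(column_dtypes, "value_counts"))
```

So a hasattr(o, "dtypes") test alone cannot distinguish the two cases; both objects have the attribute, but only the DataFrame's value supports value_counts().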

@ccharlesgb
Contributor Author

I can raise a PR to do an extra hasattr on the dtypes. That should fix it?

@TomAugspurger
Contributor

maybe change

if hasattr(o, "dtypes"):

to

if hasattr(o, "dtypes") and o.ndim > 1:
    ...
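
A rough sketch of how the ndim guard could separate the two cases. collect_dtype_names is a hypothetical helper written for this illustration; it is not the pandas function and not necessarily the shape of the eventual fix.

```python
import pandas as pd


def collect_dtype_names(objs):
    """Hypothetical helper: gather dtype names from DataFrames, Series, and scalars."""
    names = set()
    for o in objs:
        if hasattr(o, "dtypes") and o.ndim > 1:
            # DataFrame: dtypes is a Series with one dtype per column.
            names |= {str(dt) for dt in o.dtypes}
        elif hasattr(o, "dtype"):
            # Series or ndarray: a single dtype.
            names.add(str(o.dtype))
        # Plain Python scalars have neither attribute and are skipped.
    return names


frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
ints = pd.Series([1, 2], dtype="int64")

print(collect_dtype_names([frame, frame["a"]]))
print(collect_dtype_names([frame["a"], ints, 2.0]))
```

With the guard in place, a Series operand falls through to the single-dtype branch instead of calling value_counts() on a dtype object.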

@TomAugspurger
Contributor

But yes, a PR with tests and a release note in 0.25.1.rst would be very welcome.

@TomAugspurger TomAugspurger added the Regression Functionality that used to work in a prior pandas version label Aug 2, 2019
@TomAugspurger TomAugspurger added this to the 0.25.1 milestone Aug 2, 2019
@ccharlesgb
Contributor Author

Your suggestion worked as well and removed the extra if statement. I added to the test suite in test_expression.py which has uncovered some more issues with operators on DataFrames and Series, with axis=1. Will update this issue once I know the cause.

@jbrockmendel
Member

Will update this issue once I know the cause.

If it's feasible, it would be easier if you made a small PR specific to the bug here, then addressed the newly found bugs in separate steps.

@ccharlesgb
Contributor Author

It is feasible, but it would require a very narrow test. The issue I am having now is that numexpr fails on floordiv when operating on a DataFrame by a Series with axis=1. This was never caught because the test suite currently doesn't cover this case.

With the fix suggested by @TomAugspurger applied, if we change the example code snippet to use floordiv:

import pandas as pd

ind = list(range(0, 100))
cols = list(range(0, 300))
df = pd.DataFrame(index=ind, columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)
print(df.floordiv(series, axis=1).head())  # Works fine
ind = list(range(0, 100000))
cols = list(range(0, 300))
df = pd.DataFrame(index=ind, columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)
print(df.floordiv(series, axis=1).head())  # Raises AssertionError

We get the following traceback:

Traceback (most recent call last):
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 1473, in na_op
    result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\computation\expressions.py", line 220, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\computation\expressions.py", line 116, in _evaluate_numexpr
    **eval_kwargs
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\numexpr\necompiler.py", line 802, in evaluate
    * 'no' means the data types should not be cast at all.
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\numexpr\necompiler.py", line 709, in getExprNames
    input_order = getInputOrder(ast, None)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\numexpr\necompiler.py", line 299, in stringToExpression
    ex = eval(c, names)
  File "<expr>", line 1, in <module>
TypeError: unsupported operand type(s) for //: 'VariableNode' and 'VariableNode'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<input>", line 12, in <module>
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 1499, in f
    self, other, pass_op, fill_value=fill_value, axis=axis, level=level
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 1388, in _combine_series_frame
    return self._combine_match_columns(other, func, level=level)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\frame.py", line 5392, in _combine_match_columns
    return ops.dispatch_to_series(left, right, func, axis="columns")
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 596, in dispatch_to_series
    new_data = expressions.evaluate(column_op, str_rep, left, right)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\computation\expressions.py", line 220, in evaluate
    return _evaluate(op, op_str, a, b, **eval_kwargs)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\computation\expressions.py", line 126, in _evaluate_numexpr
    result = _evaluate_standard(op, op_str, a, b)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\computation\expressions.py", line 70, in _evaluate_standard
    return op(a, b)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 584, in column_op
    return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))}
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 584, in <dictcomp>
    return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))}
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 1475, in na_op
    result = masked_arith_op(x, y, op)
  File "C:\dev\bin\anaconda\envs\py36pd25\lib\site-packages\pandas\core\ops\__init__.py", line 451, in masked_arith_op
    assert isinstance(x, np.ndarray), type(x)
AssertionError: <class 'pandas.core.series.Series'>

masked_arith_op expects its parameters x and y to be ndarrays, but in this specific case x is a Series:

    # pandas/core/ops/__init__.py : 423
    # For Series `x` is 1D so ravel() is a no-op; calling it anyway makes
    # the logic valid for both Series and DataFrame ops.
    xrav = x.ravel()
    assert isinstance(x, np.ndarray), type(x)

Modifying this function to use xrav instead of just x does fix the issue, and all unit tests still pass, but I am not sure whether this matches the intention of the inline comment here.

Happy to restrict the tests to try every operator BUT floordiv if that is better to reduce the scope of the PR.

@TomAugspurger
Contributor

Happy to restrict the tests to try every operator BUT floordiv if that is better to reduce the scope of the PR.

Let's do that for now. You can open another issue for the floordiv problem I think.
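
A narrow regression check along the lines discussed could look like the sketch below. It covers add, sub, mul, and truediv with axis=1 and leaves floordiv out. The function name check_frame_series_ops and the row count are assumptions for illustration; the row count is only meant to exceed whatever size threshold the installed pandas uses. On an affected version (0.25.0) the check is expected to raise the AttributeError from this issue rather than pass.

```python
import operator

import numpy as np
import pandas as pd

# Assumed to be above the numexpr size threshold; adjust for your pandas version.
N_ROWS = 20_001


def check_frame_series_ops(n_rows=N_ROWS, n_cols=3):
    """Compare DataFrame-by-Series ops along axis=1 against NumPy broadcasting."""
    df = pd.DataFrame(np.ones((n_rows, n_cols)), columns=range(n_cols))
    # Non-zero values so that truediv is well defined.
    series = pd.Series(np.arange(1.0, n_cols + 1), index=range(n_cols))

    checked = []
    for name in ["add", "sub", "mul", "truediv"]:  # floordiv deliberately excluded
        result = getattr(df, name)(series, axis=1)
        expected = getattr(operator, name)(df.to_numpy(), series.to_numpy())
        assert np.allclose(result.to_numpy(), expected), name
        checked.append(name)
    return checked


print(check_frame_series_ops())
```

Keeping floordiv out of the operator list keeps the check specific to the dtypes bug, so the masked_arith_op problem can be tracked in its own issue.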

3 participants