Operators between DataFrame and Series fail on large dataframes #27636
Comments
cc @jbrockmendel. |
Looks like this was changed from obj.get_dtype_counts, which returns Series for either Series or DataFrame, to obj.dtypes.value_counts, but Series.dtypes returns a Scalar, which is why value_counts raises AttributeError. |
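The difference described above can be checked directly. This is a minimal sketch, assuming only the public `dtypes` attribute; the variable names are illustrative and not taken from the pandas source.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
ser = pd.Series([1.0, 2.0])

# DataFrame.dtypes is a Series of per-column dtypes, so value_counts() is available.
df_has_value_counts = hasattr(df.dtypes, "value_counts")

# Series.dtypes is a single dtype object, which has no value_counts() method.
ser_has_value_counts = hasattr(ser.dtypes, "value_counts")

print(df_has_value_counts, ser_has_value_counts)
```

Calling `ser.dtypes.value_counts()` is therefore what raises the AttributeError mentioned in this comment.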
I can raise a PR to do an extra hasattr on the dtypes. That should fix it? |
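One way such a hasattr guard might look is sketched below. The helper name `distinct_dtypes` is hypothetical and this is not the change that was merged; it only illustrates handling both object types through `dtypes`.

```python
import pandas as pd


def distinct_dtypes(obj):
    """Return the set of dtype names in a Series or DataFrame.

    Hypothetical helper for illustration only, not pandas internals.
    """
    dtypes = obj.dtypes
    if hasattr(dtypes, "value_counts"):
        # DataFrame: dtypes is a Series with one entry per column.
        return {str(dt) for dt in dtypes}
    # Series: dtypes is a single dtype object.
    return {str(dtypes)}


frame = pd.DataFrame({"a": pd.Series([1, 2], dtype="int64"), "b": [1.0, 2.0]})
print(distinct_dtypes(frame))
print(distinct_dtypes(pd.Series([1.0, 2.0])))
```

A uniform-dtype check then reduces to testing whether the returned set has exactly one element.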
maybe change
to
|
But yes, a PR with tests and a release note in 0.25.1.rst would be very welcome. |
Your suggestion worked as well and removed the extra if statement. I added to the test suite in test_expression.py which has uncovered some more issues with operators on DataFrames and Series, with axis=1. Will update this issue once I know the cause. |
If it's feasible, it would be easier if you made a small PR specific to the bug here, then addressed the newly found bugs in separate steps. |
It is feasible, but it would require a very narrow test. The issue I am having now is that numexpr fails on floordiv when operating on a DataFrame by a Series with axis=1. This was never caught because the test suite doesn't currently cover this case. If we modify the example code snippet, with the fix suggested by @TomAugspurger, to:
import pandas as pd
ind = list(range(0, 100))
cols = list(range(0, 300))
df = pd.DataFrame(index=ind, columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)
print(df.floordiv(series, axis=1).head()) # Works fine
ind = list(range(0, 100000))
cols = list(range(0, 300))
df = pd.DataFrame(index=ind, columns=cols, data=1.0)
series = pd.Series(index=cols, data=cols)
print(df.floordiv(series, axis=1).head())
We get the following traceback:
masked_arith_op expects its params x and y to be ndarray, but in this specific case x is a Series:
# pandas/core/ops/__init__.py : 423
# For Series `x` is 1D so ravel() is a no-op; calling it anyway makes
# the logic valid for both Series and DataFrame ops.
xrav = x.ravel()
assert isinstance(x, np.ndarray), type(x)
Modifying this function to use xrav instead of x fixes the issue, and all unit tests still pass, but I am not sure whether that matches the intention of the inline comment here. Happy to restrict the tests to every operator except floordiv if that is better for reducing the scope of the PR. |
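To make the ravel-then-mask idea concrete, here is a simplified sketch. It is not the pandas implementation of masked_arith_op: the function name `masked_op_sketch` is hypothetical, it assumes both inputs have the same shape, and it coerces inputs with `np.asarray` up front so the ndarray assertion holds.

```python
import operator

import numpy as np


def masked_op_sketch(x, y, op):
    """Apply op elementwise only where both inputs are non-NaN.

    Simplified illustration; assumes x and y have the same shape.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    assert isinstance(x, np.ndarray), type(x)

    # Flatten so the same logic covers 1D and 2D inputs.
    xrav = x.ravel()
    yrav = y.ravel()

    # Only compute on positions where neither operand is missing.
    mask = ~np.isnan(xrav) & ~np.isnan(yrav)
    result = np.full(xrav.shape, np.nan)
    result[mask] = op(xrav[mask], yrav[mask])
    return result.reshape(x.shape)


out = masked_op_sketch([4.0, np.nan, 9.0], [2.0, 1.0, 2.0], operator.floordiv)
print(out)
```

Positions where either operand is NaN stay NaN in the result; everything else goes through the operator unchanged.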
Let's do that for now. You can open another issue for the floordiv problem I think. |
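A restricted check along those lines could look like the sketch below, which loops over every operator except floordiv and compares the axis=1 result against plain NumPy broadcasting. The function name and the small frame size are assumptions for illustration; the actual tests in test_expression.py may be structured differently, and a frame large enough to trigger the numexpr path would be needed to exercise the original bug.

```python
import operator

import numpy as np
import pandas as pd


def check_ops_axis1(n_rows=5, n_cols=4, names=("add", "sub", "mul", "truediv")):
    """Compare DataFrame-by-Series ops along axis=1 with NumPy broadcasting.

    Hypothetical helper; floordiv is deliberately left out of `names`.
    """
    cols = list(range(n_cols))
    df = pd.DataFrame(1.0, index=range(n_rows), columns=cols)
    # Start at 1 to avoid division by zero in truediv.
    series = pd.Series(np.arange(1, n_cols + 1, dtype="float64"), index=cols)

    outcome = {}
    for name in names:
        result = getattr(df, name)(series, axis=1)
        expected = getattr(operator, name)(df.to_numpy(), series.to_numpy())
        outcome[name] = bool(np.allclose(result.to_numpy(), expected))
    return outcome


print(check_ops_axis1())
```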
Code Sample
Code Output:
Problem description
I think this is a regression somewhere between pandas 0.19.2 and 0.25. If you multiply, or use any other operator function such as add/divide, on a DataFrame by a Series with axis=1, pandas will crash in the _can_use_numexpr function when the DataFrame/Series becomes very large. Presumably the size check on the objects being operated on does not pass for small datasets, so the failing line is only reached for larger ones. In pandas 0.19.2 the function instead uses the get_dtype_counts() method to check whether the dtype is uniform in the object:
I have a workaround which is to transpose the dataframe and use axis=0:
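The reporter's own snippet did not survive in this copy of the thread. A sketch of the transpose workaround, using a small illustrative frame and mul as the operator (both are assumptions), might look like this:

```python
import numpy as np
import pandas as pd

cols = list(range(3))
df = pd.DataFrame(1.0, index=range(4), columns=cols)
series = pd.Series([10.0, 20.0, 30.0], index=cols)

# Direct form: align the Series with the columns.
direct = df.mul(series, axis=1)

# Workaround: transpose so the Series aligns with the index, then transpose back.
workaround = df.T.mul(series, axis=0).T

same = bool(np.allclose(direct.to_numpy(), workaround.to_numpy()))
print(same)
```

On a small frame both forms succeed and agree; for a very large frame the two transposes cost additional time and memory.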
I noticed get_dtype_counts() is deprecated (#27145), which appears to be the PR that caused this regression, since a Series only returns a single numpy dtype, which does not have a value_counts() method.
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.5.final.0
python-bits : 64
OS : Windows
OS-release : 7
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 0.25.0
numpy : 1.16.4
pytz : 2018.4
dateutil : 2.7.3
pip : 10.0.1
setuptools : 39.1.0
Cython : None
pytest : 3.5.1
hypothesis : None
sphinx : 1.8.2
blosc : None
feather : None
xlsxwriter : 1.0.4
lxml.etree : 4.1.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 6.4.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.1.1
matplotlib : 2.2.2
numexpr : 2.6.5
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.2.8
tables : 3.5.2
xarray : None
xlrd : 1.1.0
xlwt : None
xlsxwriter : 1.0.4