maybe_cast_to_integer_array fails when the input is all booleans #25211
Comments
Thanks for the report! It does seem strange that NA is required here. Investigation and PRs are always welcome.
I've found the source of the discrepancy between the arrays with and without NA. It's in:
pandas/pandas/core/arrays/integer.py Line 179 in 0eddba8
pandas/pandas/core/arrays/integer.py Lines 190 to 192 in 0eddba8
When np.array is called on an array with NAs it casts the arrays to float:
In [1]: import numpy as np
In [2]: np.array([False, True]).dtype
Out[2]: dtype('bool')
In [3]: np.array([False, True, np.nan]).dtype
Out[3]: dtype('float64')
Though I'm not sure how to handle this properly. Should we cast boolean arrays to numpy floats first?
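To make the float path concrete, here is a minimal sketch of what happens to the values once a NaN is present. It uses only NumPy; the mask-and-placeholder step is an illustrative assumption about how a masked integer conversion could proceed, not a copy of the pandas code at the lines referenced above.

```python
import numpy as np

# With a NaN present, NumPy infers float64 and the booleans become 0.0 / 1.0.
with_na = np.array([False, True, np.nan])

# Illustrative assumption: record the missing positions in a mask, then put a
# placeholder value under the mask so the float data can be cast to an integer.
mask = np.isnan(with_na)
filled = np.where(mask, 1, with_na)
as_int = filled.astype("int64")

print(with_na.dtype, mask, as_int)
```

Without a NaN the array keeps the bool dtype, so it never goes through a float-to-integer cast of this kind.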
I would prefer casting to
Fixes pandas-dev#25211. Cast boolean array to int before casting to EA Integer dtype.
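The idea in that commit message can be sketched roughly as follows. This uses only NumPy. The helper name `bool_to_int_if_needed` is hypothetical, and the check that the target dtype is an integer is an assumption about where such a cast would be safe. It is not the code from the PR.

```python
import numpy as np


def bool_to_int_if_needed(values, target_dtype):
    """Hypothetical helper: cast boolean input to an integer target dtype.

    This is a sketch of the approach, not pandas' implementation.
    """
    arr = np.asarray(values)
    target = np.dtype(target_dtype)
    # Only touch boolean input that is headed for an integer dtype;
    # anything else is returned unchanged.
    if arr.dtype == np.bool_ and np.issubdtype(target, np.integer):
        return arr.astype(target)
    return arr


converted = bool_to_int_if_needed([False, True], "int64")
untouched = bool_to_int_if_needed([False, True], "bool")
print(converted.dtype, untouched.dtype)
```

Restricting the cast to integer targets leaves boolean data alone when the destination is not an integer dtype.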
Well, my PR seems to fix this issue, but the tests are failing with several similar errors:
pandas/tests/extension/base/ops.py:33: AssertionError
___________ TestComparisonOps.test_compare_scalar[__gt__-Int32Dtype] ___________
[gw0] darwin -- Python 3.5.6 /Users/vsts/miniconda3/envs/pandas-dev/bin/python
self = <pandas.tests.extension.test_integer.TestComparisonOps object at 0x1378eb470>
data = <IntegerArray>
[ 1, 2, 3, 4, 5, 6, 7, 8, NaN, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, ...2, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, NaN, 99, 100]
Length: 100, dtype: Int32
all_compare_operators = '__gt__'
def test_compare_scalar(self, data, all_compare_operators):
op_name = all_compare_operators
s = pd.Series(data)
> self._compare_other(s, data, op_name, 0)
pandas/tests/extension/base/ops.py:148:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/tests/extension/test_integer.py:154: in _compare_other
self.check_opname(s, op_name, other)
pandas/tests/extension/test_integer.py:151: in check_opname
other, exc=None)
pandas/tests/extension/base/ops.py:27: in check_opname
self._check_op(s, op, other, op_name, exc)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pandas.tests.extension.test_integer.TestComparisonOps object at 0x1378eb470>
s = 0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 NaN
9 10
10 11
11 1... 91
91 92
92 93
93 94
94 95
95 96
96 97
97 NaN
98 99
99 100
Length: 100, dtype: Int32
op = <built-in function gt>, other = 0, op_name = '__gt__', exc = None
def _check_op(self, s, op, other, op_name, exc=NotImplementedError):
if exc is None:
result = op(s, other)
expected = s.combine(other, op)
> self.assert_series_equal(result, expected)
E AssertionError: Attributes are different
E
E Attribute "dtype" are different
E [left]: bool
E [right]: Int64
It looks like these should not be related, and I'd be happy for any help clarifying it.
@vladserkoff: This is what I was referring to when I said that modifying EA logic is by no means straightforward and has its own landmines. I would try casting before you hit any of the EA logic.
@gfyoung, thanks. I've left the fix where it was and only made sure not to cast to int if the target dtype is not an integer. Tests are now OK, except for an unrelated ImportError in a Windows test:
_____________________________ test_oo_optimizable _____________________________
[gw1] win32 -- Python 2.7.15 C:\Miniconda\envs\pandas-dev\python.exe
def test_oo_optimizable():
# GH 21071
> subprocess.check_call([sys.executable, "-OO", "-c", "import pandas"])
pandas\tests\test_downstream.py:63:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
popenargs = (['C:\\Miniconda\\envs\\pandas-dev\\python.exe', '-OO', '-c', 'import pandas'],)
kwargs = {}, retcode = 1
cmd = ['C:\\Miniconda\\envs\\pandas-dev\\python.exe', '-OO', '-c', 'import pandas']
def check_call(*popenargs, **kwargs):
"""Run command with arguments. Wait for command to complete. If
the exit code was zero then return, otherwise raise
CalledProcessError. The CalledProcessError object will have the
return code in the returncode attribute.
The arguments are the same as for the Popen constructor. Example:
check_call(["ls", "-l"])
"""
retcode = call(*popenargs, **kwargs)
if retcode:
cmd = kwargs.get("args")
if cmd is None:
cmd = popenargs[0]
> raise CalledProcessError(retcode, cmd)
E CalledProcessError: Command '['C:\\Miniconda\\envs\\pandas-dev\\python.exe', '-OO', '-c', 'import pandas']' returned non-zero exit status 1
C:\Miniconda\envs\pandas-dev\lib\subprocess.py:190: CalledProcessError
---------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "pandas\__init__.py", line 35, in <module>
"the C extensions first.".format(module))
ImportError: C extension: DLL load failed: The parameter is incorrect. not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.
Unfortunately Codecov is down by 50%, and I'm afraid I can't fix this as I'm new to it.
Code Sample, a copy-pastable example if possible
Problem description
pandas is unable to convert an array of bools to IntegerDtype, while conversion to int is supported. Interestingly, if the array contains NaNs, the conversion works as expected.
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
pandas: 0.24.1
pytest: 4.2.0
pip: 19.0.1
setuptools: 40.7.3
Cython: 0.29.4
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.4
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.14
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml.etree: 4.3.0
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.2.17
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None