BUG: interchange bitmasks not supported in interchange/from_dataframe.py #49888
Comments
Thanks for the report @AlenkaF! As an aside related to converting a |
I think that is something that is fully controlled on the pandas side? PyArrow will just expose the buffers, and it's up to pandas to decide how to reassemble those in arrays (except for boolean dtype, where the spec currently requires a byte array, so that's not possible zero-copy for pyarrow. But that is also something that maybe should be discussed on the Data APIs side to include that in the spec?) |
Yeah, agreed. I guess during the consumption of the interchange object by pandas there could be a "mode" that can be configured to say whether to consume the buffers as arrow objects or numpy objects |
@AlenkaF I tried this, but then I get:
In [1]: import pyarrow as pa
...: table = pa.table({"a": [1, 2, 3, None]})
...:
...: exchange_df = table.__dataframe__()
...:
...: from pandas.core.interchange.from_dataframe import from_dataframe
...: from_dataframe(exchange_df)
Out[1]:
a
0 1.0
1 NaN
2 NaN
3 NaN
which doesn't look right. Do you know what else would need changing? |
Hm, first thing where I would look at is this line:
and would just print out if the numpy array is actually what we expect. And if not, maybe the shape needs to be corrected (still needs |
thanks for taking a look! here, (Pdb) arr
array([ True]) |
This needs some custom code, since numpy only supports arrays with size of number of bytes, not bits (minimum element length is 1 byte). So you can't just view this bitmap as a numpy array without some custom processing. We do have that processing in our arrow_utils (for converting pyarrow arrays to masked arrays), however, that assumes that pyarrow is installed, so that would make accepting bitmasks in the dataframe interchange protocol dependent on pyarrow (which is probably fine, though) |
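To illustrate the point about bytes versus bits, here is a minimal sketch in plain NumPy. It assumes an Arrow-style validity bitmap (least-significant bit first); the helper name `bitmask_to_bool_array` is hypothetical and is not part of pandas or pyarrow. It shows that viewing the raw buffer as booleans yields one element per byte, while unpacking the bits recovers one element per row.

```python
import numpy as np


def bitmask_to_bool_array(raw: bytes, length: int, offset: int = 0) -> np.ndarray:
    """Expand a packed bitmap into a boolean array (hypothetical helper).

    Assumes Arrow's bit order: bit 0 of byte 0 is the first element.
    """
    packed = np.frombuffer(raw, dtype=np.uint8)
    # bitorder="little" makes the least-significant bit of each byte come first.
    bits = np.unpackbits(packed, bitorder="little")
    # The bitmap is padded to whole bytes, so trim to the logical slice.
    return bits[offset : offset + length].astype(bool)


# Validity bitmap for [1, 2, 3, None]: bits 0..2 set, bit 3 cleared.
raw = bytes([0b00000111])

# Naive view: one boolean per byte, so only a single element comes back.
naive = np.frombuffer(raw, dtype=np.bool_)

# Bit-level unpacking: one boolean per row.
valid = bitmask_to_bool_array(raw, length=4)
```

This copies the data (one byte per bit), so it is not zero-copy; it only demonstrates why the bitmap cannot be reinterpreted directly as a NumPy boolean array.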
is that |
Yes, that's indeed pandas/pandas/core/arrays/arrow/_arrow_utils.py Lines 50 to 66 in 961f9c4
I am not fully sure to what extent the comment about padding and offset applies here as well. The actual conversion of the bitmap buffer to a numpy boolean array is this part: pandas/pandas/core/arrays/arrow/_arrow_utils.py Lines 59 to 63 in 961f9c4
So what this does is convert the bitmap back into a pyarrow boolean array (this can be done zero-copy, because a boolean array also uses bits for its values, just like the validity bitmap of any other array), and then use pyarrow's implementation to convert that array (using bits) to a numpy array (using bytes). (In theory we could write our own code in cython/c to convert a buffer of bits to bytes, but personally I wouldn't worry about that) |
Would it be an option to just check for pyarrow arrays in
and instead of using buffer_to_ndarray use pyarrow_array_to_numpy_and_mask instead?
|
thanks @jorisvandenbossche ! I was trying that, but get
|
Ignore my comment please - thought it would be a good idea but it would not work for general dataframe library that uses bitmasks. |
In theory, the code for the dataframe interchange protocol doesn't know that the protocol object is backed by pyarrow? (without checking that |
I think |
Yeah, the |
thanks! yeah if I use
arr = pa.BooleanArray.from_buffers(
pa.bool_(),
length,
[None, pa.foreign_buffer(buffer.ptr, length)],
offset=offset,
)
return np.asarray(arr)
then it all seems to work as expected |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Sorry, this is not easily reproducible as the dataframe interchange protocol for pyarrow is still work in progress but I think the error is quite clear:
Issue Description
I am currently working on implementing a dataframe interchange protocol for pyarrow.Table in the Apache Arrow project (apache/arrow#14613). I am using the pandas implementation to test that the produced __dataframe__ object can be correctly consumed.
When consuming a pyarrow.Table with missing values I get a NotImplementedError. The bitmasks, used by PyArrow to represent nulls in a given column, cannot be converted.
But if I look at the code in from_dataframe.py:
pandas/pandas/core/interchange/from_dataframe.py
Lines 405 to 415 in 70121c7
I would think this is not intentional and that the _NP_DTYPES should include {1: bool}
pandas/pandas/core/interchange/from_dataframe.py
Line 393 in 70121c7
pandas/pandas/core/interchange/from_dataframe.py
Lines 23 to 28 in 70121c7
Expected Behavior
The bitmask can be converted to ndarray by the current pandas implementation of the dataframe interchange protocol and the code below could work for missing values also:
Installed Versions
INSTALLED VERSIONS
commit : 87cfe4e
python : 3.9.14.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:46 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.5.0
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.28
pytest : 7.1.3
hypothesis : 6.39.4
sphinx : 4.3.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : 2022.02.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0.dev117+geeca8a4e3.d20221122
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : None
xlwt : None
zstandard : None
tzdata : None