BUG: interchange bitmasks not supported in interchange/from_dataframe.py #49888

AlenkaF · 2022-11-24T13:54:32Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Sorry, this is not easily reproducible, as the dataframe interchange protocol for pyarrow is still a work in progress, but I think the error is quite clear:

import pyarrow as pa
table = pa.table({"a": [1, 2, 3, None]})

exchange_df = table.__dataframe__()

from pandas.core.interchange.from_dataframe import from_dataframe
from_dataframe(exchange_df)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 53, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 74, in _from_dataframe
    pandas_df = protocol_df_chunk_to_pandas(chunk)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 122, in protocol_df_chunk_to_pandas
    columns[name], buf = primitive_column_to_ndarray(col)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 160, in primitive_column_to_ndarray
    data = set_nulls(data, col, buffers["validity"])
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 504, in set_nulls
    null_pos = buffer_to_ndarray(valid_buff, valid_dtype, col.offset, col.size)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 395, in buffer_to_ndarray
    raise NotImplementedError(f"Conversion for {dtype} is not yet supported.")
NotImplementedError: Conversion for (<DtypeKind.BOOL: 20>, 1, 'b', '=') is not yet supported.

Issue Description

I am currently working on implementing the dataframe interchange protocol for pyarrow.Table in the Apache Arrow project (apache/arrow#14613).

I am using the pandas implementation to test that the produced __dataframe__ object can be correctly consumed.

When consuming a pyarrow.Table with missing values, I get a NotImplementedError. The bitmasks that PyArrow uses to represent nulls in a given column cannot be converted.

But if I look at the code in from_dataframe.py:

pandas/pandas/core/interchange/from_dataframe.py

Lines 405 to 415 in 70121c7

    
    if bit_width == 1:
        assert length is not None, "`length` must be specified for a bit-mask buffer."
        arr = np.ctypeslib.as_array(data_pointer, shape=(buffer.bufsize,))
        return bitmask_to_bool_ndarray(arr, length, first_byte_offset=offset % 8)
    else:
        return np.ctypeslib.as_array(
            data_pointer, shape=(buffer.bufsize // (bit_width // 8),)
        )

def bitmask_to_bool_ndarray(

I would think this is not intentional and that the _NP_DTYPES should include {1: bool}

pandas/pandas/core/interchange/from_dataframe.py

Line 393 in 70121c7

column_dtype = _NP_DTYPES.get(kind, {}).get(bit_width, None)

pandas/pandas/core/interchange/from_dataframe.py

Lines 23 to 28 in 70121c7

    
_NP_DTYPES: dict[DtypeKind, dict[int, Any]] = {
    DtypeKind.INT: {8: np.int8, 16: np.int16, 32: np.int32, 64: np.int64},
    DtypeKind.UINT: {8: np.uint8, 16: np.uint16, 32: np.uint32, 64: np.uint64},
    DtypeKind.FLOAT: {32: np.float32, 64: np.float64},
    DtypeKind.BOOL: {8: bool},
}
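
The lookup quoted above can be exercised in isolation. The following is only a minimal sketch: `DtypeKind` here is a hypothetical stand-in enum (only the value `BOOL = 20` is taken from the traceback), and `lookup` is an illustrative helper, not a pandas function. It shows why a 1-bit boolean buffer falls through to the `NotImplementedError` branch.

```python
from enum import IntEnum
from typing import Any


class DtypeKind(IntEnum):
    # Stand-in for the interchange protocol enum; BOOL = 20 matches the traceback.
    BOOL = 20


# Same shape as the mapping quoted above, reduced to the boolean entry.
_NP_DTYPES: dict[DtypeKind, dict[int, Any]] = {
    DtypeKind.BOOL: {8: bool},
}


def lookup(kind: DtypeKind, bit_width: int) -> Any:
    # Mirrors the quoted line: a missing kind or bit width yields None.
    return _NP_DTYPES.get(kind, {}).get(bit_width, None)


byte_mask_dtype = lookup(DtypeKind.BOOL, 8)  # byte mask: a dtype is found
bit_mask_dtype = lookup(DtypeKind.BOOL, 1)   # bit mask: nothing is registered
```

With only `{8: bool}` registered, the 1-bit case returns `None`, which is the condition under which the traceback above reports that the conversion is not yet supported.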

Expected Behavior

The bitmask should be convertible to an ndarray by the current pandas implementation of the dataframe interchange protocol, so that the code below also works when the column contains missing values:

>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3, 4]})

>>> exchange_df = table.__dataframe__()
>>> exchange_df._df
pyarrow.Table
a: int64
----
a: [[1,2,3,4]]

>>> from pandas.core.interchange.from_dataframe import from_dataframe
>>> from_dataframe(exchange_df)
   a
0  1
1  2
2  3
3  4

Installed Versions

INSTALLED VERSIONS

commit : 87cfe4e
python : 3.9.14.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:46 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.5.0
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.28
pytest : 7.1.3
hypothesis : 6.39.4
sphinx : 4.3.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : 2022.02.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0.dev117+geeca8a4e3.d20221122
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : None
xlwt : None
zstandard : None
tzdata : None


AlenkaF · 2022-11-24T13:59:06Z

cc @jorisvandenbossche @mroeschke

mroeschke · 2022-11-29T21:07:11Z

Thanks for the report @AlenkaF!

As an aside related to converting a pa.Table to a pd.DataFrame, it would be very interesting to keep the pyarrow types in a pandas DataFrame, since pandas supports a built-in pd.ArrowExtensionArray. Of course, more would need to be refactored on the pandas exchange side to construct a pd.DataFrame using pd.ArrowExtensionArray instead of numpy arrays.

jorisvandenbossche · 2022-11-29T21:59:36Z

it would be very interesting to keep the pyarrow types in a pandas DataFrame

I think that is something that is fully controlled on the pandas side? PyArrow will just expose the buffers, and it's up to pandas to decide how to reassemble those in arrays (except for boolean dtype, where the spec currently requires a byte array, so that's not possible zero-copy for pyarrow. But that is also something that maybe should be discussed on the Data APIs side to include that in the spec?)

mroeschke · 2022-11-29T22:39:21Z

I think that is something that is fully controlled on the pandas side?

Yeah, agreed. I guess during the consumption of the interchange object by pandas there could be a "mode" that can be configured to say whether to consume the buffers as arrow objects or numpy objects.

MarcoGorelli · 2023-04-20T14:51:04Z

I would think this is not intentional and that the _NP_DTYPES should include {1: bool}

@AlenkaF I tried this, but then I get:

In [1]: import pyarrow as pa
   ...: table = pa.table({"a": [1, 2, 3, None]})
   ...: 
   ...: exchange_df = table.__dataframe__()
   ...: 
   ...: from pandas.core.interchange.from_dataframe import from_dataframe
   ...: from_dataframe(exchange_df)
Out[1]: 
     a
0  1.0
1  NaN
2  NaN
3  NaN

which doesn't look right. Do you know what else would need changing?

AlenkaF · 2023-04-20T15:08:15Z

Hm, the first place I would look is this line:

pandas/pandas/core/interchange/from_dataframe.py

Line 407 in 70121c7

arr = np.ctypeslib.as_array(data_pointer, shape=(buffer.bufsize,))

I would just print it out to check whether the numpy array is actually what we expect. If it is not, maybe the shape needs to be corrected (maybe it still needs // 8? Just guessing, though).

MarcoGorelli · 2023-04-20T15:15:50Z

thanks for taking a look! buffer.bufsize here is 1, // 8 would make it 0

here, arr is just

(Pdb) arr
array([ True])

jorisvandenbossche · 2023-04-20T15:19:49Z

This needs some custom code, since numpy only supports arrays whose elements are a whole number of bytes, not bits (the minimum element size is 1 byte). So you can't just view this bitmap as a numpy array without some custom processing.

We do have that processing in our arrow_utils (for converting pyarrow arrays to masked arrays), however, that assumes that pyarrow is installed, so that would make accepting bitmasks in the dataframe interchange protocol dependent on pyarrow (which is probably fine, though)
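
For illustration, the bits-versus-bytes mismatch can be shown with numpy alone. This is a hedged sketch, not what pandas does: `unpack_validity_bitmask` is a hypothetical helper, and it assumes the Arrow convention that bit 0 of each byte belongs to the first element.

```python
import numpy as np


def unpack_validity_bitmask(bitmask: bytes, length: int, offset: int = 0) -> np.ndarray:
    """Expand a bit-packed mask (least-significant bit first) to one bool per element."""
    packed = np.frombuffer(bitmask, dtype=np.uint8)
    # bitorder="little" yields bit 0 of each byte first.
    bits = np.unpackbits(packed, bitorder="little")
    # Drop the leading offset bits and any trailing padding bits.
    return bits[offset : offset + length].astype(bool)


# Validity bitmask for [1, 2, 3, None]: the first three bits are set.
raw = bytes([0b00000111])

# Viewing the buffer directly gives one element per byte, not per bit.
naive_view = np.frombuffer(raw, dtype=np.bool_)

# Unpacking gives one element per bit, truncated to the column length.
valid = unpack_validity_bitmask(raw, length=4)
```

Here `naive_view` has a single element, while `valid` has four, one per row of the column.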

MarcoGorelli · 2023-04-20T17:14:33Z

is that pyarrow_array_to_numpy_and_mask? what would you pass to it? (sorry, I'm new to this - I'll take a look, but it may take me some time)

jorisvandenbossche · 2023-04-21T08:21:46Z

Yes, that's indeed pyarrow_array_to_numpy_and_mask. You can't use it directly here, but it contains code similar to what is needed (so either copy what you need, or refactor to reuse the shared part). This function gets the buffers of the pyarrow array (arr.buffers()) and then converts those buffers to numpy arrays, whereas for this use case we get the buffers from the dataframe interchange protocol object.

pandas/pandas/core/arrays/arrow/_arrow_utils.py

Lines 50 to 66 in 961f9c4

    
    buflist = arr.buffers()
    # Since Arrow buffers might contain padding and the data might be offset,
    # the buffer gets sliced here before handing it to numpy.
    # See also https://github.com/pandas-dev/pandas/issues/40896
    offset = arr.offset * dtype.itemsize
    length = len(arr) * dtype.itemsize
    data_buf = buflist[1][offset : offset + length]
    data = np.frombuffer(data_buf, dtype=dtype)
    bitmask = buflist[0]
    if bitmask is not None:
        mask = pyarrow.BooleanArray.from_buffers(
            pyarrow.bool_(), len(arr), [None, bitmask], offset=arr.offset
        )
        mask = np.asarray(mask)
    else:
        mask = np.ones(len(arr), dtype=bool)
    return data, mask

I am not fully sure to what extent the comment about padding and offset applies here as well.

The actual conversion of the bitmap buffer to a numpy boolean array is this part:

pandas/pandas/core/arrays/arrow/_arrow_utils.py

Lines 59 to 63 in 961f9c4

    
    if bitmask is not None:
        mask = pyarrow.BooleanArray.from_buffers(
            pyarrow.bool_(), len(arr), [None, bitmask], offset=arr.offset
        )
        mask = np.asarray(mask)

So what this does is convert the bitmap back to a pyarrow boolean array (this can be done zero-copy, because a boolean array also uses bits for its values, just like the validity bitmap of any other array), and then use pyarrow's implementation to convert this array (using bits) to a numpy array (using bytes).

(in theory we could write our own code in cython/c to convert a buffer of bits to bytes, but personally I wouldn't worry about that)

AlenkaF · 2023-04-21T08:24:22Z

Would it be an option to just check for pyarrow arrays in set_nulls (and similarly in string_column_to_ndarray):

pandas/pandas/core/interchange/from_dataframe.py

Line 504 in 70121c7

null_pos = buffer_to_ndarray(valid_buff, valid_dtype, col.offset, col.size)

and instead of using buffer_to_ndarray use pyarrow_array_to_numpy_and_mask instead?

MarcoGorelli · 2023-04-21T08:28:20Z

thanks @jorisvandenbossche !

I was trying that, but get

(Pdb) pyarrow.BooleanArray.from_buffers(pyarrow.bool_(), length, [None, buffer], offset=offset)
*** TypeError: Cannot convert _PyArrowBuffer to pyarrow.lib.Buffer

AlenkaF · 2023-04-21T08:28:54Z

Would it be an option to just check for pyarrow arrays in set_nulls (and similarly in string_column_to_ndarray):

pandas/pandas/core/interchange/from_dataframe.py

Line 504 in 70121c7

null_pos = buffer_to_ndarray(valid_buff, valid_dtype, col.offset, col.size)

and instead of using buffer_to_ndarray use pyarrow_array_to_numpy_and_mask instead?

Please ignore my comment. I thought it would be a good idea, but it would not work for a general dataframe library that uses bitmasks.

jorisvandenbossche · 2023-04-21T08:29:13Z

Would it be an option to just check for pyarrow arrays in set_nulls (and similarly in string_column_to_ndarray):

In theory, the code for the dataframe interchange protocol doesn't know that the protocol object is backed by pyarrow? (without checking that obj._col is a pyarrow.Array, which would be relying on an implementation detail)

AlenkaF · 2023-04-21T08:30:03Z

thanks @jorisvandenbossche !

I was trying that, but get

(Pdb) pyarrow.BooleanArray.from_buffers(pyarrow.bool_(), length, [None, buffer], offset=offset)
*** TypeError: Cannot convert _PyArrowBuffer to pyarrow.lib.Buffer

I think _PyArrowBuffer is a dataframe protocol object and pyarrow doesn't recognise it.

jorisvandenbossche · 2023-04-21T08:31:19Z

I was trying that, but get

(Pdb) pyarrow.BooleanArray.from_buffers(pyarrow.bool_(), length, [None, buffer], offset=offset)
*** TypeError: Cannot convert _PyArrowBuffer to pyarrow.lib.Buffer

Yeah, the _PyArrowBuffer implements the protocol interface, but that is not a generic "buffer-like object". But I suppose you can use pa.foreign_buffer(..) to create a pyarrow.Buffer object from the protocol's buffer (specifying pointer and size).

MarcoGorelli · 2023-04-21T08:37:13Z

thanks! yeah if I use

        arr = pa.BooleanArray.from_buffers(
            pa.bool_(),
            length,
            [None, pa.foreign_buffer(buffer.ptr, length)],
            offset=offset,
        )
        return np.asarray(arr)

then it all seems to work as expected

AlenkaF added the Bug and Needs Triage labels Nov 24, 2022

AlenkaF changed the title ~~BUG: bitmasks not supported in from_dataframe.py~~ BUG: bitmasks not supported in interchange/from_dataframe.py Nov 24, 2022

jorisvandenbossche added the Interchange label and removed the Needs Triage label Nov 24, 2022

AlenkaF changed the title ~~BUG: bitmasks not supported in interchange/from_dataframe.py~~ BUG: interchange bitmasks not supported in interchange/from_dataframe.py Jan 16, 2023

AlenkaF mentioned this issue Apr 21, 2023

[Python] Dataframe interchange protocol updates after pandas bug fixes apache/arrow#35264

Closed

4 tasks

MarcoGorelli mentioned this issue Apr 21, 2023

BUG: interchange bitmasks not supported in interchange/from_dataframe.py #52824

Merged

6 tasks

mroeschke closed this as completed in #52824 Apr 25, 2023
