
BUG: Interchange with Pyarrow types loses validity buffer #56805


Closed
3 tasks done
WillAyd opened this issue Jan 10, 2024 · 4 comments
Labels
Bug Interchange Dataframe Interchange Protocol

Comments

@WillAyd
Member

WillAyd commented Jan 10, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> df = pd.DataFrame([[None]], dtype="int32[pyarrow]")
>>> df.__dataframe__().get_column(0).describe_null
(<ColumnNullType.NON_NULLABLE: 0>, None)
>>> df.__dataframe__().get_column(0).get_buffers()
{'data': (PandasBuffer({'bufsize': 8, 'ptr': 100388886100432, 'device': 'CPU'}), (<DtypeKind.INT: 0>, 32, 'i', '=')), 'validity': None, 'offsets': None}

Issue Description

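The interchange column reports no nullability (`NON_NULLABLE`, with `validity` set to `None`) even though the column is nullable and contains a null.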

Expected Behavior

We should at least send a bitmask or bytemask along through the dataframe interchange API.
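As a rough illustration of what such a bitmask could contain, here is a minimal pure-Python sketch. The helper name `pack_validity` is hypothetical and is not part of pandas or the interchange protocol. It assumes the Arrow convention that bit `i` is 1 when row `i` is valid, with bits stored least-significant-bit first within each byte.

```python
from typing import Optional, Sequence


def pack_validity(values: Sequence[Optional[int]]) -> bytes:
    """Pack a validity bitmask: bit i is 1 when values[i] is not None.

    Hypothetical helper for illustration only. Bits are stored
    least-significant-bit first within each byte (assumed Arrow ordering).
    """
    mask = bytearray((len(values) + 7) // 8)  # one bit per row, rounded up to whole bytes
    for i, value in enumerate(values):
        if value is not None:
            mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)


# The single-row, all-null column from the reproducible example
print(pack_validity([None]))        # b'\x00'
# Rows 0 and 2 valid, row 1 null
print(pack_validity([1, None, 3]))  # b'\x05'
```

For the one-row column in this report, the mask is a single zero byte, so a consumer would see one null row instead of an uninitialized integer.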

Installed Versions

main

@WillAyd added the Bug, Needs Triage, and Interchange (Dataframe Interchange Protocol) labels and removed the Needs Triage label on Jan 10, 2024
@MarcoGorelli
Member

Yeah, this is wrong. The Polars implementation looks correct here:

In [6]: pl.DataFrame({'a': [None]}, schema={'a': pl.Int32}).__dataframe__().get_column(0).get_buffers()
Out[6]:
{'data': (PolarsBuffer(bufsize=4, ptr=139918198210568, device='CPU'),
  (<DtypeKind.INT: 0>, 32, 'i', '=')),
 'validity': (PolarsBuffer(bufsize=1, ptr=139918198210576, device='CPU'),
  (<DtypeKind.BOOL: 20>, 1, 'b', '=')),
 'offsets': None}

In [7]: pd.DataFrame({'a': [None]}, dtype='int32[pyarrow]').__dataframe__().get_column(0).get_buffers()
Out[7]:
{'data': (PandasBuffer({'bufsize': 8, 'ptr': 93950600639216, 'device': 'CPU'}),
  (<DtypeKind.INT: 0>, 32, 'i', '=')),
 'validity': None,
 'offsets': None}

definitely needs fixing

@MarcoGorelli
Member

The bufsize is wrong in the pandas case too, right? The API documents `bufsize` as:

Buffer size in bytes.

A single int32 value is 32 bits, and 32 bits / 8 bits per byte = 4 bytes, not 8.
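The arithmetic above can be sketched as a pair of small helpers. The names `data_bufsize` and `bitmask_bufsize` are hypothetical and only illustrate the expected minimum sizes. They assume fixed-width values whose bit width is a multiple of 8, and they ignore any padding a real allocator might add.

```python
def data_bufsize(n_rows: int, bit_width: int) -> int:
    """Minimum bytes for n_rows fixed-width values of bit_width bits each."""
    return n_rows * bit_width // 8


def bitmask_bufsize(n_rows: int) -> int:
    """Minimum bytes for a validity bitmask holding one bit per row."""
    return (n_rows + 7) // 8


# One int32 row, as in the outputs quoted in this thread
print(data_bufsize(1, 32))   # 4
print(bitmask_bufsize(1))    # 1
```

These values match the `bufsize=4` data buffer and `bufsize=1` validity buffer in the Polars output quoted earlier.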

@WillAyd
Member Author

WillAyd commented Jan 11, 2024

Good catch. Yeah, that is also wrong.

@MarcoGorelli
Member

Fixed by #57764 (thanks for your review there!)

In [3]: df = pd.DataFrame([[None]], dtype="int32[pyarrow]")
   ...: df.__dataframe__().get_column(0).describe_null
Out[3]: (<ColumnNullType.USE_BITMASK: 3>, 0)

In [4]: df.__dataframe__().get_column(0).get_buffers()
Out[4]: 
{'data': (PandasBuffer[pyarrow]({'bufsize': 4, 'ptr': 140157248376896, 'device': 'CPU'}),
  (<DtypeKind.INT: 0>, 32, 'i', '=')),
 'validity': (PandasBuffer[pyarrow]({'bufsize': 1, 'ptr': 140157248376832, 'device': 'CPU'}),
  (<DtypeKind.BOOL: 20>, 1, 'b', '=')),
 'offsets': None}
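To make the `(USE_BITMASK, 0)` result concrete: in the interchange protocol, the second element of `describe_null` is the bit value that marks a null. The sketch below shows how a consumer might read such a mask. The function `null_positions` is a hypothetical helper, not a pandas API, and it assumes least-significant-bit-first ordering within each byte, as in Arrow.

```python
from typing import List


def null_positions(mask: bytes, length: int, null_value: int) -> List[int]:
    """Return the row indices whose mask bit equals null_value.

    Hypothetical helper: `mask` holds one bit per row, least-significant
    bit first within each byte (assumed Arrow ordering).
    """
    return [
        i
        for i in range(length)
        if ((mask[i // 8] >> (i % 8)) & 1) == null_value
    ]


# One row whose validity bit is 0; with null_value=0 that row is null
print(null_positions(b"\x00", 1, 0))  # [0]
# Three rows with bits 1, 0, 1; only row 1 is null
print(null_positions(b"\x05", 3, 0))  # [1]
```

Under these assumptions, a one-byte validity buffer with its first bit cleared is enough for a consumer to recover the null in the example column.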
