BUG: Interchange with Pyarrow types loses validity buffer #56805
Yeah, this is wrong. The Polars implementation looks correct here:
In [6]: pl.DataFrame({'a': [None]}, schema={'a': pl.Int32}).__dataframe__().get_column(0).get_buffers()
Out[6]:
{'data': (PolarsBuffer(bufsize=4, ptr=139918198210568, device='CPU'),
(<DtypeKind.INT: 0>, 32, 'i', '=')),
'validity': (PolarsBuffer(bufsize=1, ptr=139918198210576, device='CPU'),
(<DtypeKind.BOOL: 20>, 1, 'b', '=')),
'offsets': None}
In [7]: pd.DataFrame({'a': [None]}, dtype='int32[pyarrow]').__dataframe__().get_column(0).get_buffers()
Out[7]:
{'data': (PandasBuffer({'bufsize': 8, 'ptr': 93950600639216, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 32, 'i', '=')),
'validity': None,
'offsets': None}
This definitely needs fixing.
The bufsize is wrong in the pandas case too, right? As I read the API, bufsize is given in bytes.
1 byte is 8 bits, so a single 32-bit value needs 32 bits / 8 bits = 4 bytes, not 8.
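The arithmetic above can be checked with a small sketch. The helper name `bufsize_bytes` is hypothetical (it is not part of pandas or the interchange protocol), and it assumes a tightly packed buffer with no allocator padding:

```python
def bufsize_bytes(n_values: int, bit_width: int) -> int:
    """Bytes needed to hold n_values items of bit_width bits each.

    Hypothetical helper for illustration only; rounds up to whole bytes
    and ignores any padding a real allocator might add.
    """
    return (n_values * bit_width + 7) // 8


# One int32 value: 32 bits / 8 bits per byte
print(bufsize_bytes(1, 32))  # 4

# One validity bit for one value still occupies a whole byte
print(bufsize_bytes(1, 1))  # 1
```

These match the `bufsize=4` (data) and `bufsize=1` (validity) that Polars reports for a one-row Int32 column.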
Good catch, yeah, that is also wrong.
Fixed by #57764 (thanks for your review there!)
In [3]: df = pd.DataFrame([[None]], dtype="int32[pyarrow]")
   ...: df.__dataframe__().get_column(0).describe_null
Out[3]: (<ColumnNullType.USE_BITMASK: 3>, 0)
In [4]: df.__dataframe__().get_column(0).get_buffers()
Out[4]:
{'data': (PandasBuffer[pyarrow]({'bufsize': 4, 'ptr': 140157248376896, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 32, 'i', '=')),
'validity': (PandasBuffer[pyarrow]({'bufsize': 1, 'ptr': 140157248376832, 'device': 'CPU'}),
(<DtypeKind.BOOL: 20>, 1, 'b', '=')),
'offsets': None}
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The interchange output makes no mention of nullability even though the column is nullable.
Expected Behavior
We should at least send a bitmask or a byte mask along through the dataframe interchange API.
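To make the two options concrete, here is a minimal pure-Python sketch of a byte mask versus a bitmask for the same values. The function names are hypothetical, and the bitmask assumes Arrow-style least-significant-bit-first packing where a set bit means "valid"; it is not the pandas implementation:

```python
def to_bytemask(values):
    """One byte per element: 1 = valid, 0 = null (illustrative only)."""
    return bytes(0 if v is None else 1 for v in values)


def to_bitmask(values):
    """One bit per element, least-significant bit first; set bit = valid."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


values = [1, None, 3]
print(to_bytemask(values))  # b'\x01\x00\x01'
print(to_bitmask(values))   # b'\x05'
```

For a single null value, both masks are one zero byte, which is consistent with a validity buffer of `bufsize` 1 for a one-row column.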
Installed Versions
main