BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect #55227
@@ -14,6 +14,7 @@
    DtypeKind,
)
from pandas.core.interchange.from_dataframe import from_dataframe
from pandas.core.interchange.utils import ArrowCTypes


@pytest.fixture
@@ -326,3 +327,24 @@ def test_interchange_from_non_pandas_tz_aware():
        dtype="datetime64[us, Asia/Kathmandu]",
    )
    tm.assert_frame_equal(expected, result)


def test_interchange_from_corrected_buffer_dtypes(monkeypatch) -> None:
    # https://github.com/pandas-dev/pandas/issues/54781
    df = pd.DataFrame({"a": ["foo", "bar"]}).__dataframe__()
    interchange = df.__dataframe__()
    column = interchange.get_column_by_name("a")
    buffers = column.get_buffers()
Review comment: Not a blocker for this PR, but I think these tests would be more impactful if we made the PandasBuffer implement the buffer protocol: https://docs.cython.org/en/latest/src/userguide/buffer.html That way we could inspect the bytes for tests.

Reply: Started this in #55671.
    buffers_data = buffers["data"]
    buffer_dtype = buffers_data[1]
Review comment: Feel free to ignore my possibly wrong commentary as I'm new to this, but I think the offsets buffer doesn't have the proper bufsize here either:

(Pdb) buffers["offsets"]
(PandasBuffer({'bufsize': 24, 'ptr': 94440192356160, 'device': 'CPU'}), (<DtypeKind.INT: 0>, 64, 'l', '='))

The standard StringType, which inherits from BinaryType in Arrow, uses a 32-bit offset value, so I think that bufsize should only be 12, unless we are mapping to a LargeString intentionally.

Reply: Thanks for looking into this. It looks like it comes from pandas/pandas/core/interchange/column.py, line 364 in d2f05c2, where the dtype is being set to int64. OK to discuss/address this separately?
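The bufsize arithmetic in the comment above can be checked with a small standalone sketch. It assumes the Arrow variable-size layout, where n strings need n + 1 offsets. The helpers `build_offsets` and `offsets_bufsize` are hypothetical names for illustration, not part of pandas or pyarrow.

```python
import struct
from itertools import accumulate


def build_offsets(values):
    """Return Arrow-style offsets: a leading 0, then the running end
    position (in UTF-8 bytes) of each string."""
    lengths = (len(v.encode("utf-8")) for v in values)
    return [0, *accumulate(lengths)]


def offsets_bufsize(values, fmt):
    """Size in bytes of the offsets buffer when each offset is packed
    with the given struct format ('q' = int64, 'i' = int32)."""
    offsets = build_offsets(values)
    # '=' selects standard sizes with no padding, so 'q' is always 8 bytes
    # and 'i' is always 4 bytes regardless of platform.
    return len(struct.pack(f"={len(offsets)}{fmt}", *offsets))


values = ["foo", "bar"]
size_int64 = offsets_bufsize(values, "q")  # 3 offsets * 8 bytes
size_int32 = offsets_bufsize(values, "i")  # 3 offsets * 4 bytes
print(size_int64, size_int32)
```

With two strings there are three offsets, so int64 offsets occupy 24 bytes (the bufsize reported in the Pdb output) and int32 offsets would occupy 12.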
    buffer_dtype = (
        DtypeKind.UINT,
        8,
        ArrowCTypes.UINT8,
        buffer_dtype[3],
    )
    buffers["data"] = (buffers_data[0], buffer_dtype)
    column.get_buffers = lambda: buffers
    interchange.get_column_by_name = lambda _: column
    monkeypatch.setattr(df, "__dataframe__", lambda allow_copy: interchange)
    pd.api.interchange.from_dataframe(df)
Review comment: This assertion can indeed be deleted, as we can assume bit width 8 if the column dtype is STRING or LARGE_STRING.
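To illustrate why a bit width of 8 can be assumed for string columns, here is a minimal sketch of how a consumer might rebuild strings from a data buffer and an offsets buffer. It assumes the data buffer holds UTF-8 bytes addressed by byte offsets. `decode_strings` is a hypothetical helper and is not how `from_dataframe` is actually implemented.

```python
def decode_strings(data: bytes, offsets: list[int]) -> list[str]:
    """Rebuild strings from a UTF-8 data buffer and Arrow-style offsets."""
    view = memoryview(data)
    # Each element of the data buffer is one byte (8 bits), so the offsets
    # can index it directly, whatever dtype the producer reported.
    if view.itemsize != 1:
        raise ValueError("string data buffer must be byte-addressable")
    return [
        bytes(view[start:end]).decode("utf-8")
        for start, end in zip(offsets, offsets[1:])
    ]


data = "foobar".encode("utf-8")  # concatenated bytes of "foo" and "bar"
offsets = [0, 3, 6]              # n + 1 offsets for n = 2 strings
decoded = decode_strings(data, offsets)
print(decoded)
```

Because the offsets are byte positions, the consumer only ever needs to read the data buffer as unsigned 8-bit values, which is the reasoning behind dropping the assertion.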