BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect #55227

Merged
60 changes: 24 additions & 36 deletions pandas/core/interchange/from_dataframe.py
@@ -266,29 +266,22 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:

     assert buffers["offsets"], "String buffers must contain offsets"
     # Retrieve the data buffer containing the UTF-8 code units
-    data_buff, data_dtype = buffers["data"]
-
-    if (data_dtype[1] == 8) and (
-        data_dtype[2]
-        in (
-            ArrowCTypes.STRING,
-            ArrowCTypes.LARGE_STRING,
-        )
-    ):  # format_str == utf-8
-        # temporary workaround to keep backwards compatibility due to
-        # https://github.com/pandas-dev/pandas/issues/54781
-
-        # We're going to reinterpret the buffer as uint8, so make sure we can do it
-        # safely
-
-        # Convert the buffers to NumPy arrays. In order to go from STRING to
-        # an equivalent ndarray, we claim that the buffer is uint8 (i.e., a byte array)
-        data_dtype = (
-            DtypeKind.UINT,
-            8,
-            ArrowCTypes.UINT8,
-            Endianness.NATIVE,
-        )
+    data_buff, _ = buffers["data"]
+    # We're going to reinterpret the buffer as uint8, so make sure we can do it safely
+    assert col.dtype[1] == 8
+    assert col.dtype[2] in (
+        ArrowCTypes.STRING,
+        ArrowCTypes.LARGE_STRING,
+    )  # format_str == utf-8
+
+    # Convert the buffers to NumPy arrays. In order to go from STRING to
+    # an equivalent ndarray, we claim that the buffer is uint8 (i.e., a byte array)
+    data_dtype = (
+        DtypeKind.UINT,
+        8,
+        ArrowCTypes.UINT8,
+        Endianness.NATIVE,
+    )
     # Specify zero offset as we don't want to chunk the string data
     data = buffer_to_ndarray(data_buff, data_dtype, offset=0, length=data_buff.bufsize)
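
For context, a minimal sketch of what "claiming the buffer is uint8" amounts to. It uses plain NumPy instead of pandas' `buffer_to_ndarray`, and the sample strings and the names `raw`, `data`, `offsets`, and `decoded` are assumptions made for illustration, not code from this PR.

```python
import numpy as np

# Hypothetical stand-in for a string column: UTF-8 bytes laid out back to back.
strings = ["pandas", "é", "", "interchange"]
encoded = [s.encode("utf-8") for s in strings]
raw = b"".join(encoded)

# Reinterpret the data buffer as a flat uint8 (byte) array.
data = np.frombuffer(raw, dtype=np.uint8)

# The offsets buffer holds n + 1 positions: string i spans
# data[offsets[i]:offsets[i + 1]].
offsets = np.zeros(len(encoded) + 1, dtype=np.int64)
offsets[1:] = np.cumsum([len(b) for b in encoded])

# Slice the bytes of each string and decode them as UTF-8.
decoded = [
    bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")
    for i in range(len(strings))
]
```

The byte view is only safe when the column really holds 8-bit UTF-8 code units, which is what the two new assertions on `col.dtype` check before the reinterpretation.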

@@ -386,22 +379,17 @@ def datetime_column_to_ndarray(col: Column) -> tuple[np.ndarray | pd.Series, Any
     buffers = col.get_buffers()

     _, _, format_str, _ = col.dtype
-    dbuf, data_dtype = buffers["data"]
-
-    if data_dtype[0] == DtypeKind.DATETIME:
-        # temporary workaround to keep backwards compatibility due to
-        # https://github.com/pandas-dev/pandas/issues/54781
-        # Consider dtype being `int` to get number of units passed since 1970-01-01
-        data_dtype = (
-            DtypeKind.INT,
-            data_dtype[1],
-            getattr(ArrowCTypes, f"INT{data_dtype[1]}"),
-            Endianness.NATIVE,
-        )
+    dbuf, _ = buffers["data"]
+    # Consider dtype being `uint` to get number of units passed since the 01.01.1970

     data = buffer_to_ndarray(
         dbuf,
-        data_dtype,
+        (
+            DtypeKind.INT,
+            col.dtype[1],

Contributor

We unpack col.dtype on line 381; it'll be slightly more efficient to get the bit width from there!

+            getattr(ArrowCTypes, f"INT{col.dtype[1]}"),
+            Endianness.NATIVE,
+        ),
         offset=col.offset,
         length=col.size(),
     )
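
A rough sketch of the idea behind this hunk and the review comment on it: read the datetime buffer as signed integers of the column's bit width, taking that width from a single unpacking of the dtype tuple. It uses plain NumPy; the `col_dtype` tuple, the sample values, and every variable name are assumptions for illustration, not pandas internals.

```python
import numpy as np

# Hypothetical column dtype tuple: (kind, bit width, format string, endianness).
# "tsn:" is the Arrow C format string for a nanosecond timestamp without timezone;
# the kind is a plain-string stand-in for DtypeKind.DATETIME.
col_dtype = ("datetime", 64, "tsn:", "=")

# Unpack once and reuse the bit width instead of indexing the tuple repeatedly.
_, bit_width, format_str, _ = col_dtype

# Hypothetical data buffer: nanoseconds since 1970-01-01 as signed 64-bit integers.
raw = np.array([0, 86_400_000_000_000, -1], dtype=np.int64).tobytes()

# Interpret the buffer as signed integers of the column's bit width.
ints = np.frombuffer(raw, dtype=f"int{bit_width}")

# Map the Arrow unit letter (s, m, u, n) to a NumPy unit (s, ms, us, ns).
unit = format_str[2]
np_unit = unit if unit == "s" else unit + "s"

# View the integers as datetime64 without copying.
values = ints.view(f"datetime64[{np_unit}]")
```

Signed integers matter here because timestamps before the epoch are negative counts, as the `-1` sample shows.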