-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect #55227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
9876c64
0b936b3
00ad9c7
d54f950
adeceb9
3557b4a
975c87c
0ef179a
df996ac
0bea19b
d04ac92
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -266,21 +266,29 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]: | |||||
|
||||||
assert buffers["offsets"], "String buffers must contain offsets" | ||||||
# Retrieve the data buffer containing the UTF-8 code units | ||||||
data_buff, protocol_data_dtype = buffers["data"] | ||||||
# We're going to reinterpret the buffer as uint8, so make sure we can do it safely | ||||||
assert protocol_data_dtype[1] == 8 | ||||||
assert protocol_data_dtype[2] in ( | ||||||
ArrowCTypes.STRING, | ||||||
ArrowCTypes.LARGE_STRING, | ||||||
) # format_str == utf-8 | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This assertion is valid, but it should be on |
||||||
# Convert the buffers to NumPy arrays. In order to go from STRING to | ||||||
# an equivalent ndarray, we claim that the buffer is uint8 (i.e., a byte array) | ||||||
data_dtype = ( | ||||||
DtypeKind.UINT, | ||||||
8, | ||||||
ArrowCTypes.UINT8, | ||||||
Endianness.NATIVE, | ||||||
) | ||||||
data_buff, data_dtype = buffers["data"] | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can simply ignore the data dtype here, as we know what it needs to be (we set it later).
Suggested change
|
||||||
|
||||||
if (data_dtype[1] == 8) and ( | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Out of curiosity what is the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hey - from https://data-apis.org/dataframe-protocol/latest/API.html, it's number of bits:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are string dtypes supposed to have an 8 bit association? That is kind of confusing for variable length types, granted I know very little of how this interchange is supposed to work There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the idea is that strings are meant to be utf-8, and so each character can be represented with 8bits There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm interesting. Well keep in mind that UTF-8 doesn't mean a character is 8 bits; it is still 1-4 bytes In arrow-adbc I've seen this assigned the value of 0 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The UFT-8 array, consisting of the actual strings and offsets array, so here the buffer representing all string data (so which is typically much longer as the logical length of the array) can be seen as simply a buffer of bytes ("bytearray"), so so in numpy / buffer interface terms you can view such an array as one bitwidth 8
That's something postgres specific, I think |
||||||
data_dtype[2] | ||||||
in ( | ||||||
ArrowCTypes.STRING, | ||||||
ArrowCTypes.LARGE_STRING, | ||||||
) | ||||||
): # format_str == utf-8 | ||||||
# temporary workaround to keep backwards compatibility due to | ||||||
# https://github.com/pandas-dev/pandas/issues/54781 | ||||||
|
||||||
# We're going to reinterpret the buffer as uint8, so make sure we can do it | ||||||
# safely | ||||||
|
||||||
# Convert the buffers to NumPy arrays. In order to go from STRING to | ||||||
# an equivalent ndarray, we claim that the buffer is uint8 (i.e., a byte array) | ||||||
data_dtype = ( | ||||||
DtypeKind.UINT, | ||||||
8, | ||||||
ArrowCTypes.UINT8, | ||||||
Endianness.NATIVE, | ||||||
) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This does not need to be in an |
||||||
# Specify zero offset as we don't want to chunk the string data | ||||||
data = buffer_to_ndarray(data_buff, data_dtype, offset=0, length=data_buff.bufsize) | ||||||
|
||||||
|
@@ -378,16 +386,22 @@ def datetime_column_to_ndarray(col: Column) -> tuple[np.ndarray | pd.Series, Any | |||||
buffers = col.get_buffers() | ||||||
|
||||||
_, _, format_str, _ = col.dtype | ||||||
dbuf, dtype = buffers["data"] | ||||||
# Consider dtype being `uint` to get number of units passed since the 01.01.1970 | ||||||
data = buffer_to_ndarray( | ||||||
dbuf, | ||||||
( | ||||||
dbuf, data_dtype = buffers["data"] | ||||||
|
||||||
if data_dtype[0] == DtypeKind.DATETIME: | ||||||
# temporary workaround to keep backwards compatibility due to | ||||||
# https://github.com/pandas-dev/pandas/issues/54781 | ||||||
# Consider dtype being `uint` to get number of units passed since the 01.01.1970 | ||||||
data_dtype = ( | ||||||
DtypeKind.UINT, | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't this be an INT? Timestamps are backed by 64 bit signed integers in arrow There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, AFAIK that should be signed INT |
||||||
dtype[1], | ||||||
getattr(ArrowCTypes, f"UINT{dtype[1]}"), | ||||||
data_dtype[1], | ||||||
getattr(ArrowCTypes, f"UINT{data_dtype[1]}"), | ||||||
Endianness.NATIVE, | ||||||
), | ||||||
) | ||||||
|
||||||
data = buffer_to_ndarray( | ||||||
dbuf, | ||||||
data_dtype, | ||||||
offset=col.offset, | ||||||
length=col.size(), | ||||||
) | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion can indeed be deleted, as we can assume bit width 8 if the column dtype is STRING or LARGE_STRING.