Interchange protocol - large-string? #150
Comments
For onlookers, relevant spec tidbit dataframe-api/protocol/dataframe_protocol.py Lines 44 to 45 in baa605d
So as I read it, large strings aren't supported by the interchange protocol. IMO for now, |
sure but isn't
https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings |
Ahh, no idea then if we need another dtype, have existing ways to introspect string columns for interchange, etc. |
Yes, the only difference is that the offsets are represented in |
So I suspect that this is an implementation issue rather than a spec issue. We have a "string" dtype identified by an enum value of 21 for the "kind" and a format string which says:
so that should cover any Arrow string type I'd think. Then for the data representation of variable-length strings, we have
For |
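To make the offsets point concrete, here is a minimal sketch of how a variable-length string column can be decoded from a data buffer plus an offsets buffer. It assumes the Arrow C data interface convention that "u" means utf-8 with 32-bit offsets and "U" means large utf-8 with 64-bit offsets; the helper names (`encode_strings`, `decode_strings`) and the little-endian layout are illustrative assumptions, not part of the protocol or of pandas.

```python
import struct

# Arrow C data interface format strings:
#   "u" = utf-8 string, 32-bit offsets
#   "U" = large utf-8 string, 64-bit offsets
# The struct codes below ("i" = int32, "q" = int64) are this sketch's mapping.
OFFSET_STRUCT_CODE = {"u": "i", "U": "q"}


def encode_strings(values, format_str):
    """Hypothetical helper: build (data, offsets) byte buffers, no nulls."""
    code = OFFSET_STRUCT_CODE[format_str]
    data = bytearray()
    offsets = [0]
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))  # end of this string == start of the next
    # Little-endian is assumed here purely for illustration.
    return bytes(data), struct.pack(f"<{len(offsets)}{code}", *offsets)


def decode_strings(data, offsets_buf, format_str):
    """Hypothetical helper: slice the shared data buffer using the offsets."""
    code = OFFSET_STRUCT_CODE[format_str]
    width = struct.calcsize("<" + code)  # 4 bytes for "u", 8 bytes for "U"
    count = len(offsets_buf) // width
    offsets = struct.unpack(f"<{count}{code}", offsets_buf)
    return [
        data[start:end].decode("utf-8")
        for start, end in zip(offsets, offsets[1:])
    ]
```

In this sketch the data buffer is byte-for-byte identical for both format strings; only the width of each offset entry changes, which is why a consumer that already handles "u" needs little extra work to handle "U".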
Seems plausible - should it be fine to just make this change in pandas?

diff --git a/pandas/core/interchange/from_dataframe.py b/pandas/core/interchange/from_dataframe.py
index 2bbb678516..998f3bc374 100644
--- a/pandas/core/interchange/from_dataframe.py
+++ b/pandas/core/interchange/from_dataframe.py
@@ -238,8 +238,11 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:
     # Retrieve the data buffer containing the UTF-8 code units
     data_buff, protocol_data_dtype = buffers["data"]
     # We're going to reinterpret the buffer as uint8, so make sure we can do it safely
     assert protocol_data_dtype[1] == 8  # bitwidth == 8
-    assert protocol_data_dtype[2] == ArrowCTypes.STRING  # format_str == utf-8
+    assert protocol_data_dtype[2] in (
+        ArrowCTypes.STRING,
+        ArrowCTypes.LARGE_STRING,
+    )  # format_str == utf-8
     # Convert the buffers to NumPy arrays. In order to go from STRING to
     # an equivalent ndarray, we claim that the buffer is uint8 (i.e., a byte array)
     data_dtype = (
diff --git a/pandas/core/interchange/utils.py b/pandas/core/interchange/utils.py
index 89599818d6..69c0367238 100644
--- a/pandas/core/interchange/utils.py
+++ b/pandas/core/interchange/utils.py
@@ -39,6 +39,7 @@ class ArrowCTypes:
     FLOAT32 = "f"
     FLOAT64 = "g"
     STRING = "u"  # utf-8
+    LARGE_STRING = "U"  # utf-8
     DATE32 = "tdD"
     DATE64 = "tdm"
     # Resoulution:

At least, if I do this, it all just works |
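For readers who want to try the relaxed check outside pandas, here is a standalone sketch. `ArrowCTypesSketch` and `check_string_data_dtype` are hypothetical stand-ins written for this illustration (they are not pandas internals), and the `(kind, bitwidth, format_str, endianness)` tuple layout is assumed from the indices used in the diff, where index 1 is the bit width and index 2 is the format string.

```python
class ArrowCTypesSketch:
    """Hypothetical stand-in for the two string entries of ArrowCTypes."""

    STRING = "u"  # utf-8, 32-bit offsets
    LARGE_STRING = "U"  # utf-8, 64-bit offsets


def check_string_data_dtype(protocol_data_dtype):
    """Validate the data-buffer dtype of a string column (sketch only).

    Assumed tuple layout: (kind, bitwidth, format_str, endianness).
    """
    _kind, bitwidth, format_str, _endianness = protocol_data_dtype
    # The data buffer is reinterpreted as uint8, so it must be 8 bits wide.
    if bitwidth != 8:
        raise ValueError(f"expected bitwidth 8, got {bitwidth}")
    # Accept both the regular and the large utf-8 format strings.
    if format_str not in (
        ArrowCTypesSketch.STRING,
        ArrowCTypesSketch.LARGE_STRING,
    ):
        raise NotImplementedError(f"unsupported format string: {format_str!r}")
    return format_str


# 21 is the "string" kind mentioned earlier in the thread; "=" is assumed
# here to denote native endianness.
print(check_string_data_dtype((21, 8, "u", "=")))
print(check_string_data_dtype((21, 8, "U", "=")))
```

Raising explicit exceptions instead of using `assert` is a design choice of this sketch, so the check still runs when Python is started with `-O`.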
If that does the job and |
thanks all this is resolved, so closing |
Currently, the interchange fails with the large-string type:
I get
This is an issue when interchanging from polars, which uses large-string: pola-rs/polars#8377
What should be done in this case, where do we go from here?
Note that if I try adding large-string to the ArrowCTypes in pandas, then it "just works", but that's probably not the solution?