
Interchange protocol - large-string? #150


Closed
MarcoGorelli opened this issue Apr 20, 2023 · 8 comments


@MarcoGorelli
Contributor

Currently, the interchange protocol fails with the large-string type:

import pyarrow as pa

arr = ["foo", "bar"]
table = pa.table(
    {"arr": pa.array(arr, 'large_string')}
)
exchange_df = table.__dataframe__()

from pandas.core.interchange.from_dataframe import from_dataframe
from_dataframe(exchange_df)

I get

Traceback (most recent call last):
  File "t.py", line 30, in <module>
    from_dataframe(exchange_df)
  File "/home/marcogorelli/pandas-dev/pandas/core/interchange/from_dataframe.py", line 52, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/home/marcogorelli/pandas-dev/pandas/core/interchange/from_dataframe.py", line 73, in _from_dataframe
    pandas_df = protocol_df_chunk_to_pandas(chunk)
  File "/home/marcogorelli/pandas-dev/pandas/core/interchange/from_dataframe.py", line 125, in protocol_df_chunk_to_pandas
    columns[name], buf = string_column_to_ndarray(col)
  File "/home/marcogorelli/pandas-dev/pandas/core/interchange/from_dataframe.py", line 242, in string_column_to_ndarray
    assert protocol_data_dtype[1] == 8  # bitwidth == 8
AssertionError

This is an issue when interchanging from polars, which uses large-string: pola-rs/polars#8377

What should be done in this case? Where do we go from here?

Note that if I try adding large-string to the ArrowCTypes in pandas, then it "just works", but that's probably not the solution?

@honno
Member

honno commented Apr 20, 2023

For onlookers, relevant spec tidbit

STRING : int
Matches to string data type (UTF-8 encoded).

So as I read it, large strings aren't supported by the interchange protocol. IMO for now, __dataframe__() should error out, or try to coerce large-string columns into UTF-8 string.
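
For the coercion option, a minimal sketch of what a producer could do with pyarrow (illustrative only, not something any library does today; it is only safe when the column's string data fits in 32-bit offsets, i.e. under ~2 GB):

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["foo", "bar"], type=pa.large_string())
# downcast the offsets from int64 to int32 before handing the data to the
# protocol; a safe cast should fail if the data doesn't fit in 32-bit offsets
utf8_arr = pc.cast(arr, pa.string())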

@MarcoGorelli
Contributor Author

Sure, but isn't U still UTF-8 encoded?

U : large utf-8 string

https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings

@honno
Member

honno commented Apr 20, 2023

Ah, no idea then whether we need another dtype, whether there are existing ways to introspect string columns for interchange, etc.

@ritchie46

Sure, but isn't U still UTF-8 encoded?

Yes, the only difference is that the offsets are represented as i64 integers instead of i32, so a single buffer can hold more than 2^31 bytes (~2 GB) of data.
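
A quick way to see that difference with plain pyarrow (a sketch; for variable-length strings, .buffers() returns [validity, offsets, data]):

import pyarrow as pa

small = pa.array(["foo", "bar"], type=pa.string())
large = pa.array(["foo", "bar"], type=pa.large_string())

# 3 offsets for 2 values: 4 bytes each for string, 8 bytes each for large_string
print(small.buffers()[1].size)  # 12
print(large.buffers()[1].size)  # 24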

@rgommers
Member

So I suspect that this is an implementation rather than a spec issue. We have a "string" dtype identified by the enum value 21 for the "kind", and a format string which says:

        Format string : data type description format string in Apache Arrow C
                        Data Interface format.

so that should cover any Arrow string type I'd think.

Then for the data representation of variable-length strings, we have data, validity and offset elements; the relevant one here is offsets:

    # first element is a buffer containing the offset values for
    # variable-size binary data (e.g., variable-length strings);
    # second element is the offsets buffer's associated dtype.
    # None if the data buffer does not have an associated offsets buffer
    offsets: Optional[Tuple["Buffer", Dtype]]

For large-string, it seems like the Dtype part should use int64 rather than int32, and that will work. So protocol producers/consumers probably don't handle this simply because it hasn't come up before?
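
A sketch of what that looks like from the consumer side, assuming pyarrow's protocol implementation (per the spec, get_buffers() returns a dict of (Buffer, Dtype) pairs):

import pyarrow as pa

table = pa.table({"arr": pa.array(["foo", "bar"], "large_string")})
col = table.__dataframe__().get_column_by_name("arr")

# for a large-string column, the offsets Dtype should report 64 bits,
# e.g. (DtypeKind.INT, 64, "l", "=") instead of (DtypeKind.INT, 32, "i", "=")
_, offsets_dtype = col.get_buffers()["offsets"]
print(offsets_dtype)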

@MarcoGorelli
Contributor Author

So protocol producers/consumers probably don't handle this simply because it hasn't come up before?

Seems plausible - should it be fine to just make this change in pandas

diff --git a/pandas/core/interchange/from_dataframe.py b/pandas/core/interchange/from_dataframe.py
index 2bbb678516..998f3bc374 100644
--- a/pandas/core/interchange/from_dataframe.py
+++ b/pandas/core/interchange/from_dataframe.py
@@ -238,8 +238,11 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:
     # Retrieve the data buffer containing the UTF-8 code units
     data_buff, protocol_data_dtype = buffers["data"]
     # We're going to reinterpret the buffer as uint8, so make sure we can do it safely
     assert protocol_data_dtype[1] == 8  # bitwidth == 8
-    assert protocol_data_dtype[2] == ArrowCTypes.STRING  # format_str == utf-8
+    assert protocol_data_dtype[2] in (
+        ArrowCTypes.STRING,
+        ArrowCTypes.LARGE_STRING,
+    )  # format_str == utf-8
     # Convert the buffers to NumPy arrays. In order to go from STRING to
     # an equivalent ndarray, we claim that the buffer is uint8 (i.e., a byte array)
     data_dtype = (
diff --git a/pandas/core/interchange/utils.py b/pandas/core/interchange/utils.py
index 89599818d6..69c0367238 100644
--- a/pandas/core/interchange/utils.py
+++ b/pandas/core/interchange/utils.py
@@ -39,6 +39,7 @@ class ArrowCTypes:
     FLOAT32 = "f"
     FLOAT64 = "g"
     STRING = "u"  # utf-8
+    LARGE_STRING = "U"  # utf-8
     DATE32 = "tdD"
     DATE64 = "tdm"
     # Resoulution:

?

At least, if I do this, it all just works.
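
For reference, a round-trip check that should pass with the diff above applied (a sketch, reusing the reproducer from the top of the issue):

import pyarrow as pa
from pandas.core.interchange.from_dataframe import from_dataframe

table = pa.table({"arr": pa.array(["foo", "bar"], "large_string")})
df = from_dataframe(table.__dataframe__())
assert df["arr"].tolist() == ["foo", "bar"]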

@rgommers
Member

If that does the job and LARGE_STRING = "U" is already handled internally by pandas (must be if it works and there is test coverage), then sure - that seems fine to me.

@MarcoGorelli
Contributor Author

thanks all

this is resolved, so closing
