
Interchange protocol - large-string? #150


Closed
MarcoGorelli opened this issue Apr 20, 2023 · 8 comments


@MarcoGorelli
Contributor

Currently, the interchange protocol fails with the large-string type:

import pyarrow as pa

arr = ["foo", "bar"]
table = pa.table(
    {"arr": pa.array(arr, 'large_string')}
)
exchange_df = table.__dataframe__()

from pandas.core.interchange.from_dataframe import from_dataframe
from_dataframe(exchange_df)

I get

Traceback (most recent call last):
  File "t.py", line 30, in <module>
    from_dataframe(exchange_df)
  File "/home/marcogorelli/pandas-dev/pandas/core/interchange/from_dataframe.py", line 52, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/home/marcogorelli/pandas-dev/pandas/core/interchange/from_dataframe.py", line 73, in _from_dataframe
    pandas_df = protocol_df_chunk_to_pandas(chunk)
  File "/home/marcogorelli/pandas-dev/pandas/core/interchange/from_dataframe.py", line 125, in protocol_df_chunk_to_pandas
    columns[name], buf = string_column_to_ndarray(col)
  File "/home/marcogorelli/pandas-dev/pandas/core/interchange/from_dataframe.py", line 242, in string_column_to_ndarray
    assert protocol_data_dtype[1] == 8  # bitwidth == 8
AssertionError

This is an issue when interchanging from polars, which uses large-string: pola-rs/polars#8377

What should be done in this case? Where do we go from here?

Note that if I try adding large-string to the ArrowCTypes in pandas, then it "just works", but that's probably not the solution?

@honno
Member

honno commented Apr 20, 2023

For onlookers, relevant spec tidbit

STRING : int
Matches to string data type (UTF-8 encoded).

So as I read it, large strings aren't supported by the interchange protocol. IMO for now, __dataframe__() should error out, or try to coerce large-string columns into UTF-8 string.
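
For the coercion option, a minimal sketch of what a producer could do with pyarrow (illustrative only, not something any library does today; it is only safe when the column's string data fits in 32-bit offsets, i.e. under ~2 GB):

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["foo", "bar"], type=pa.large_string())
# downcast the offsets from int64 to int32 before handing the data to the
# protocol; a safe cast should fail if the data doesn't fit in 32-bit offsets
utf8_arr = pc.cast(arr, pa.string())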

@MarcoGorelli
Contributor Author

Sure, but isn't U still UTF-8 encoded?

U : large utf-8 string

https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings

@honno
Member

honno commented Apr 20, 2023

Ah, no idea then whether we need another dtype, whether there are existing ways to introspect string columns for interchange, etc.

@ritchie46

Sure, but isn't U still UTF-8 encoded?

Yes, the only difference is that the offsets are represented as i64 integers instead of i32, so a single buffer can hold more than 2^31 bytes (~2 GB) of data.
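
A quick way to see that difference with plain pyarrow (a sketch; for variable-length strings, .buffers() returns [validity, offsets, data]):

import pyarrow as pa

small = pa.array(["foo", "bar"], type=pa.string())
large = pa.array(["foo", "bar"], type=pa.large_string())

# 3 offsets for 2 values: 4 bytes each for string, 8 bytes each for large_string
print(small.buffers()[1].size)  # 12
print(large.buffers()[1].size)  # 24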

@rgommers
Member

So I suspect that this is an implementation rather than a spec issue. We have a "string" dtype identified by the enum value 21 for the "kind", and a format string which says:

        Format string : data type description format string in Apache Arrow C
                        Data Interface format.

so that should cover any Arrow string type I'd think.

Then for the data representation of variable-length strings, we have data, validity and offset elements; the relevant one here is offsets:

    # first element is a buffer containing the offset values for
    # variable-size binary data (e.g., variable-length strings);
    # second element is the offsets buffer's associated dtype.
    # None if the data buffer does not have an associated offsets buffer
    offsets: Optional[Tuple["Buffer", Dtype]]

For large-string, it seems like the Dtype part should use int64 rather than int32, and that will work. So protocol producers/consumers probably don't handle this simply because it hasn't come up before?
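
A sketch of what that looks like from the consumer side, assuming pyarrow's protocol implementation (per the spec, get_buffers() returns a dict of (Buffer, Dtype) pairs):

import pyarrow as pa

table = pa.table({"arr": pa.array(["foo", "bar"], "large_string")})
col = table.__dataframe__().get_column_by_name("arr")

# for a large-string column, the offsets Dtype should report 64 bits,
# e.g. (DtypeKind.INT, 64, "l", "=") instead of (DtypeKind.INT, 32, "i", "=")
_, offsets_dtype = col.get_buffers()["offsets"]
print(offsets_dtype)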

@MarcoGorelli
Contributor Author

So protocol producers/consumers probably don't handle this simply because it hasn't come up before?

Seems plausible - should it be fine to just make this change in pandas

diff --git a/pandas/core/interchange/from_dataframe.py b/pandas/core/interchange/from_dataframe.py
index 2bbb678516..998f3bc374 100644
--- a/pandas/core/interchange/from_dataframe.py
+++ b/pandas/core/interchange/from_dataframe.py
@@ -238,8 +238,11 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:
     # Retrieve the data buffer containing the UTF-8 code units
     data_buff, protocol_data_dtype = buffers["data"]
     # We're going to reinterpret the buffer as uint8, so make sure we can do it safely
     assert protocol_data_dtype[1] == 8  # bitwidth == 8
-    assert protocol_data_dtype[2] == ArrowCTypes.STRING  # format_str == utf-8
+    assert protocol_data_dtype[2] in (
+        ArrowCTypes.STRING,
+        ArrowCTypes.LARGE_STRING,
+    )  # format_str == utf-8
     # Convert the buffers to NumPy arrays. In order to go from STRING to
     # an equivalent ndarray, we claim that the buffer is uint8 (i.e., a byte array)
     data_dtype = (
diff --git a/pandas/core/interchange/utils.py b/pandas/core/interchange/utils.py
index 89599818d6..69c0367238 100644
--- a/pandas/core/interchange/utils.py
+++ b/pandas/core/interchange/utils.py
@@ -39,6 +39,7 @@ class ArrowCTypes:
     FLOAT32 = "f"
     FLOAT64 = "g"
     STRING = "u"  # utf-8
+    LARGE_STRING = "U"  # utf-8
     DATE32 = "tdD"
     DATE64 = "tdm"
     # Resoulution:

?

At least, if I do this, it all just works.
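
For reference, a round-trip check that should pass with the diff above applied (a sketch, reusing the reproducer from the top of the issue):

import pyarrow as pa
from pandas.core.interchange.from_dataframe import from_dataframe

table = pa.table({"arr": pa.array(["foo", "bar"], "large_string")})
df = from_dataframe(table.__dataframe__())
assert df["arr"].tolist() == ["foo", "bar"]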

@rgommers
Member

If that does the job and LARGE_STRING = "U" is already handled internally by pandas (must be if it works and there is test coverage), then sure - that seems fine to me.

@MarcoGorelli
Contributor Author

thanks all

this is resolved, so closing
