Commit 6903f80

Add support for arrow large_string in cudf (#15093)
This PR adds support for Arrow `large_string` arrays in `cudf`. The `cudf` strings column does not yet support 64-bit offsets (work in progress: #13733), so `large_string` input is cast to `string` as a workaround. The workaround is essential because `pandas-2.2+` now defaults to `large_string` instead of `string` for Arrow-backed strings: pandas-dev/pandas#56220

This PR fixes all 25 `dask-cudf` failures.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)
- Ashwin Srinath (https://github.com/shwina)

URL: #15093
1 parent 44686ca commit 6903f80
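
For context on the offset limitation mentioned in the commit message: in the Arrow columnar format, a string array stores one contiguous byte buffer plus an offsets buffer, where `string` uses 32-bit offsets and `large_string` uses 64-bit offsets. The following is a minimal pure-Python sketch of that layout, not cudf or pyarrow code; `build_offsets` and `fits_int32_offsets` are hypothetical names introduced only for illustration.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def build_offsets(strings):
    """Hypothetical helper: lay out strings as (offsets, data),
    mimicking Arrow's variable-width layout."""
    encoded = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for chunk in encoded:
        # Each offset marks where the next string starts in the data buffer.
        offsets.append(offsets[-1] + len(chunk))
    return offsets, b"".join(encoded)


def fits_int32_offsets(offsets):
    """Hypothetical check: True if the final offset fits in 32 bits."""
    return offsets[-1] <= INT32_MAX


offsets, data = build_offsets(["a", "bc", "déf"])

# The same offsets packed as 32-bit (`string`) and 64-bit (`large_string`).
packed32 = struct.pack(f"<{len(offsets)}i", *offsets)
packed64 = struct.pack(f"<{len(offsets)}q", *offsets)
```

Under this layout, the downcast to `string` used in this commit is only representable while the total byte length stays within the 32-bit offset range.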

3 files changed: +17 -0 lines changed

python/cudf/cudf/core/column/column.py

Lines changed: 7 additions & 0 deletions
@@ -1920,6 +1920,13 @@ def as_column(
         return col
 
     elif isinstance(arbitrary, (pa.Array, pa.ChunkedArray)):
+        if pa.types.is_large_string(arbitrary.type):
+            # Pandas-2.2+: Pandas defaults to `large_string` type
+            # instead of `string` without data-introspection.
+            # Temporary workaround until cudf has native
+            # support for `LARGE_STRING` i.e., 64 bit offsets
+            arbitrary = arbitrary.cast(pa.string())
+
         if pa.types.is_float16(arbitrary.type):
             raise NotImplementedError(
                 "Type casting from `float16` to `float32` is not "

python/cudf/cudf/tests/test_series.py

Lines changed: 8 additions & 0 deletions
@@ -2700,3 +2700,11 @@ def test_series_dtype_astypes(data):
     result = cudf.Series(data, dtype="float64")
     expected = cudf.Series([1.0, 2.0, 3.0])
     assert_eq(result, expected)
+
+
+def test_series_from_large_string():
+    pa_large_string_array = pa.array(["a", "b", "c"]).cast(pa.large_string())
+    got = cudf.Series(pa_large_string_array)
+    expected = pd.Series(pa_large_string_array)
+
+    assert_eq(expected, got)

python/cudf/cudf/utils/dtypes.py

Lines changed: 2 additions & 0 deletions
@@ -213,6 +213,8 @@ def cudf_dtype_from_pa_type(typ):
         return cudf.core.dtypes.StructDtype.from_arrow(typ)
     elif pa.types.is_decimal(typ):
         return cudf.core.dtypes.Decimal128Dtype.from_arrow(typ)
+    elif pa.types.is_large_string(typ):
+        return cudf.dtype("str")
     else:
         return cudf.api.types.pandas_dtype(typ.to_pandas_dtype())
