ENH: Implement arrow string option for various I/O methods #54431

phofl · 2023-08-05T17:46:05Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jbrockmendel · 2023-08-07T15:29:05Z

pandas/io/orc.py

@@ -132,7 +135,12 @@ def read_orc(
            df = pa_table.to_pandas(types_mapper=mapping.get)
        return df
    else:
-        return pa_table.to_pandas()
+        print("Ts")


lithomas1

Thanks for pushing this forward. I left some comments.

lithomas1 · 2023-08-07T16:57:11Z

pandas/_libs/lib.pyx

+    elif seen.str_:
+        if is_string_array(objects):
+            from pandas._config import get_option
+            opt = get_option("future.infer_string")


Can we pass this in as a kwarg to maybe_convert_objects instead?

I'd rather only get the option if actually needed
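
To make the trade-off in this exchange concrete, here is a minimal sketch in plain Python. It is not the pandas implementation: `_OPTIONS`, `get_option`, `convert_with_kwarg`, and `convert_lazy` are hypothetical stand-ins, and the return values are just labels. It only illustrates that a keyword argument makes the caller look the option up on every call, while the lazy form looks it up only when the input is all strings.

```python
# Stand-in for an option registry (hypothetical); every lookup is recorded.
_OPTIONS = {"future.infer_string": True}
calls = []


def get_option(name):
    calls.append(name)
    return _OPTIONS[name]


def _all_strings(objects):
    return len(objects) > 0 and all(isinstance(obj, str) for obj in objects)


def convert_with_kwarg(objects, *, infer_string):
    """Reviewer's suggestion: the caller resolves the option and passes it in."""
    if _all_strings(objects) and infer_string:
        return "arrow-string"
    return "object"


def convert_lazy(objects):
    """Shape of the diff: consult the option only once an all-string input is seen."""
    if _all_strings(objects) and get_option("future.infer_string"):
        return "arrow-string"
    return "object"


# Keyword variant: the caller pays for a lookup even for non-string input.
convert_with_kwarg([1, 2], infer_string=get_option("future.infer_string"))
print(len(calls))  # 1

# Lazy variant: no lookup happens for non-string input.
calls.clear()
convert_lazy([1, 2])
print(len(calls))  # 0
```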

lithomas1 · 2023-08-07T16:58:37Z

pandas/core/dtypes/cast.py

@@ -796,6 +798,12 @@ def infer_dtype_from_scalar(val) -> tuple[DtypeObj, Any]:
        # coming out as np.str_!

        dtype = _dtype_obj
+        opt = get_option("future.infer_string")


using_pyarrow_string_dtype?

Follow-up. `using_pyarrow_string_dtype` is introduced in the other PR (a little bit confusing, sorry).

lithomas1 · 2023-08-07T17:01:30Z

pandas/io/_util.py

+def arrow_string_types_mapper() -> Callable:
+    pa = import_optional_dependency("pyarrow")
+
+    return {pa.string(): pd.ArrowDtype(pa.string())}.get


Thinking about this a little, is there a situation where you would want to mix pyarrow and numpy dtypes?

(I'm thinking maybe we should force users to pick the pyarrow dtype backend if they are using the pyarrow string type)

Yes there are a lot of situations.

NumPy numerics combined with Arrow strings is still the fastest setup, and NumPy numeric data is 2D. Forcing users onto the pyarrow backend right now is not a good idea.

lithomas1 · 2023-08-07T17:02:59Z

pandas/io/pytables.py

@@ -3219,7 +3221,12 @@ def read(
        self.validate_read(columns, where)
        index = self.read_index("index", start=start, stop=stop)
        values = self.read_array("values", start=start, stop=stop)
-        return Series(values, index=index, name=self.name, copy=False)
+        result = Series(values, index=index, name=self.name, copy=False)
+        if result.dtype.kind == "O" and using_pyarrow_string_dtype():


Not too familiar with this code, but do we need to check whether `result` is a string array first if doing this?

Yeah that makes sense
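
A sketch of the kind of check being asked for, using only public pandas API. The PR itself may use an internal helper instead, and `should_convert_to_arrow_string` is a hypothetical name. The point is that object dtype alone does not guarantee strings, so the values are inspected before any conversion.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def should_convert_to_arrow_string(result: pd.Series) -> bool:
    """Return True only for object-dtype data that holds strings (nulls ignored)."""
    if result.dtype.kind != "O":
        return False
    # infer_dtype reports "string" only when every non-null value is a str.
    return infer_dtype(result, skipna=True) == "string"


print(should_convert_to_arrow_string(pd.Series(["a", "b"], dtype=object)))  # True
print(should_convert_to_arrow_string(pd.Series(["a", 1], dtype=object)))    # False
```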

lithomas1 · 2023-08-07T17:04:59Z

pandas/tests/io/parser/dtypes/test_dtypes_basic.py

+    dtype = pd.ArrowDtype(pa.string())
+
+    data = """a,b
+x,1


Can you add a test case with null/nan/None like in your other PR?

I can add a missing field; actually having these values doesn't make much sense.
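
For reference, a "missing field" in CSV terms looks like the sketch below: the second data row leaves column `a` empty, and the parser reads it as a missing value. The sample data is made up for illustration and is not the test from the PR. The string-inference option is deliberately left off, since the resulting dtype name differs between pandas versions.

```python
from io import StringIO

import pandas as pd

# The second data row has an empty field for column "a".
data = "a,b\nx,1\n,2\n"
df = pd.read_csv(StringIO(data))

print(df["a"].isna().tolist())  # [False, True]
```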

lithomas1 · 2023-08-07T17:07:21Z

pandas/_libs/lib.pyx

@@ -2669,6 +2678,20 @@ def maybe_convert_objects(ndarray[object] objects,
            return pi._data
        seen.object_ = True

+    elif seen.str_:
+        if is_string_array(objects):


I know everywhere else does this, but is there a way to avoid this double parsing?

(Maybe we check the other flags are all false?)

No, you exit the first loop as soon as you find one string
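
The reply can be sketched in plain Python, under the assumption that the first loop behaves as described (it stops at the first string). `scan_until_string` and `is_all_strings` are hypothetical names, not the Cython code. Because the scan stops early, the flags say nothing about later elements, so a second full pass is still needed to confirm that everything is a string.

```python
def scan_until_string(objects):
    """First pass: record what was seen, but stop at the first string."""
    seen_int = False
    for position, val in enumerate(objects):
        if isinstance(val, str):
            return {"str": True, "int": seen_int, "stopped_at": position}
        if isinstance(val, int):
            seen_int = True
    return {"str": False, "int": seen_int, "stopped_at": len(objects)}


def is_all_strings(objects):
    """Second pass: inspect every element."""
    return all(isinstance(val, str) for val in objects)


mixed = ["a", 1, 2]
flags = scan_until_string(mixed)
# The scan stopped at index 0, so the integers after "a" were never inspected
# and the "int" flag is still False even though the input is mixed.
print(flags["int"], is_all_strings(mixed))  # False False
```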

mroeschke · 2023-08-10T20:52:31Z

Nice thanks @phofl

…v#54431)

* ENH: Implement arrow string option for various I/O methods
* ENH: allow opt-in to inferring pyarrow strings
* Remove comments and add tests
* Add string option to arrow parsers
* Update
* Update
* Adjust csv
* Update
* Update
* Add test
* Fix mypy

---------

Co-authored-by: Brock <[email protected]>

phofl and others added 6 commits July 30, 2023 16:06

ENH: Implement arrow string option for various I/O methods

3904245

ENH: allow opt-in to inferring pyarrow strings

ebe0bd5

Remove comments and add tests

0889028

Merge remote-tracking branch 'upstream/main' into arrow_string_option

28ace4b

Merge branch 'arrow_strings' into arrow_string_option

f2b5992

Add string option to arrow parsers

35a8240

phofl requested a review from WillAyd as a code owner August 5, 2023 17:46

phofl added 2 commits August 5, 2023 23:38

Update

b677a89

Update

11b267e

jbrockmendel reviewed Aug 7, 2023

mroeschke added the IO Data and Arrow labels Aug 7, 2023

phofl mentioned this pull request Aug 7, 2023

ENH: allow opt-in to inferring pyarrow strings #54430

Merged

5 tasks

lithomas1 reviewed Aug 7, 2023

phofl added 6 commits August 9, 2023 21:42

Merge remote-tracking branch 'upstream/main' into arrow_string_option

0f79a2f

# Conflicts: # pandas/_libs/lib.pyx # pandas/tests/io/parser/dtypes/test_dtypes_basic.py # pandas/tests/series/test_constructors.py

Adjust csv

8072a86

Update

bed3124

Update

efb6f4a

Add test

0ac28a1

Fix mypy

ff38a29

phofl added this to the 2.1 milestone Aug 10, 2023

mroeschke approved these changes Aug 10, 2023

mroeschke merged commit 57c7943 into pandas-dev:main Aug 10, 2023

phofl deleted the arrow_string_option branch August 10, 2023 20:55

rhshadrach mentioned this pull request Feb 23, 2025

ENH(string dtype): fallback for HDF5 with UTF-8 surrogates #60993

Merged

5 tasks
