Skip to content

ENH: Implement arrow string option for various I/O methods #54431

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Aug 10, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Aug 5, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@phofl phofl requested a review from WillAyd as a code owner August 5, 2023 17:46
pandas/io/orc.py Outdated
@@ -132,7 +135,12 @@ def read_orc(
df = pa_table.to_pandas(types_mapper=mapping.get)
return df
else:
return pa_table.to_pandas()
print("Ts")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

@mroeschke mroeschke added IO Data IO issues that don't fit into a more specific label Arrow pyarrow functionality labels Aug 7, 2023
Copy link
Member

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this forward. I left some comments.

elif seen.str_:
if is_string_array(objects):
from pandas._config import get_option
opt = get_option("future.infer_string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pass this in as a kwarg to maybe_convert_objects instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather only get the option if actually needed

@@ -796,6 +798,12 @@ def infer_dtype_from_scalar(val) -> tuple[DtypeObj, Any]:
# coming out as np.str_!

dtype = _dtype_obj
opt = get_option("future.infer_string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using_pyarrow_string_dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow-up. This is introduced in the other pr (little bit confusing, sorry)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

def arrow_string_types_mapper() -> Callable:
pa = import_optional_dependency("pyarrow")

return {pa.string(): pd.ArrowDtype(pa.string())}.get
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a little, is there a situation where you would want to mix pyarrow and numpy dtypes?

(I'm thinking maybe we should force users to pick the pyarrow dtype backend if you are using the pyarrow string type)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes there are a lot of situations.

NumPy numeric and Arrow strings is still the fastest, numpy numeric is 2D. Forcing them right now is not a good idea

@@ -3219,7 +3221,12 @@ def read(
self.validate_read(columns, where)
index = self.read_index("index", start=start, stop=stop)
values = self.read_array("values", start=start, stop=stop)
return Series(values, index=index, name=self.name, copy=False)
result = Series(values, index=index, name=self.name, copy=False)
if result.dtype.kind == "O" and using_pyarrow_string_dtype():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too familiar with this code, but do we need to check if results is a string array first if doing this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense

dtype = pd.ArrowDtype(pa.string())

data = """a,b
x,1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test case with null/nan/None like in your other PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a missing field, actually having these values doesn't make much sense

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@@ -2669,6 +2678,20 @@ def maybe_convert_objects(ndarray[object] objects,
return pi._data
seen.object_ = True

elif seen.str_:
if is_string_array(objects):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know everywhere else does this, but is there a way to avoid this double parsing?

(Maybe we check the other flags are all false?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you exit the first loop as soon as you find one string

phofl added 6 commits August 9, 2023 21:42
# Conflicts:
#	pandas/_libs/lib.pyx
#	pandas/tests/io/parser/dtypes/test_dtypes_basic.py
#	pandas/tests/series/test_constructors.py
@phofl phofl added this to the 2.1 milestone Aug 10, 2023
@mroeschke mroeschke merged commit 57c7943 into pandas-dev:main Aug 10, 2023
@mroeschke
Copy link
Member

Nice thanks @phofl

@phofl phofl deleted the arrow_string_option branch August 10, 2023 20:55
mroeschke pushed a commit to mroeschke/pandas that referenced this pull request Aug 18, 2023
…v#54431)

* ENH: Implement arrow string option for various I/O methods

* ENH: allow opt-in to inferring pyarrow strings

* Remove comments and add tests

* Add string option to arrow parsers

* Update

* Update

* Adjust csv

* Update

* Update

* Add test

* Fix mypy

---------

Co-authored-by: Brock <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants