-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Implement arrow string option for various I/O methods #54431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/orc.py
Outdated
@@ -132,7 +135,12 @@ def read_orc( | |||
df = pa_table.to_pandas(types_mapper=mapping.get) | |||
return df | |||
else: | |||
return pa_table.to_pandas() | |||
print("Ts") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pushing this forward. I left some comments.
pandas/_libs/lib.pyx
Outdated
elif seen.str_: | ||
if is_string_array(objects): | ||
from pandas._config import get_option | ||
opt = get_option("future.infer_string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pass this in as a kwarg to maybe_convert_objects
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather only get the option if actually needed
pandas/core/dtypes/cast.py
Outdated
@@ -796,6 +798,12 @@ def infer_dtype_from_scalar(val) -> tuple[DtypeObj, Any]: | |||
# coming out as np.str_! | |||
|
|||
dtype = _dtype_obj | |||
opt = get_option("future.infer_string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using_pyarrow_string_dtype
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow-up. This is introduced in the other pr (little bit confusing, sorry)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
def arrow_string_types_mapper() -> Callable: | ||
pa = import_optional_dependency("pyarrow") | ||
|
||
return {pa.string(): pd.ArrowDtype(pa.string())}.get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this a little, is there a situation where you would want to mix pyarrow and numpy dtypes?
(I'm thinking maybe we should force users to pick the pyarrow dtype backend if you are using the pyarrow string type)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes there are a lot of situations.
NumPy numeric and Arrow strings is still the fastest, numpy numeric is 2D. Forcing them right now is not a good idea
pandas/io/pytables.py
Outdated
@@ -3219,7 +3221,12 @@ def read( | |||
self.validate_read(columns, where) | |||
index = self.read_index("index", start=start, stop=stop) | |||
values = self.read_array("values", start=start, stop=stop) | |||
return Series(values, index=index, name=self.name, copy=False) | |||
result = Series(values, index=index, name=self.name, copy=False) | |||
if result.dtype.kind == "O" and using_pyarrow_string_dtype(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not too familiar with this code, but do we need to check if results is a string array first if doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that makes sense
dtype = pd.ArrowDtype(pa.string()) | ||
|
||
data = """a,b | ||
x,1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case with null/nan/None like in your other PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add a missing field, actually having these values doesn't make much sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
pandas/_libs/lib.pyx
Outdated
@@ -2669,6 +2678,20 @@ def maybe_convert_objects(ndarray[object] objects, | |||
return pi._data | |||
seen.object_ = True | |||
|
|||
elif seen.str_: | |||
if is_string_array(objects): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know everywhere else does this, but is there a way to avoid this double parsing?
(Maybe we check the other flags are all false?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you exit the first loop as soon as you find one string
Nice thanks @phofl |
…v#54431) * ENH: Implement arrow string option for various I/O methods * ENH: allow opt-in to inferring pyarrow strings * Remove comments and add tests * Add string option to arrow parsers * Update * Update * Adjust csv * Update * Update * Add test * Fix mypy --------- Co-authored-by: Brock <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.