Skip to content

Default string dtype should not raise fallback performance warnings #58581

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #54792
jorisvandenbossche opened this issue May 5, 2024 · 5 comments · Fixed by #59760
Closed
Tracked by #54792

Default string dtype should not raise fallback performance warnings #58581

jorisvandenbossche opened this issue May 5, 2024 · 5 comments · Fixed by #59760
Labels
Arrow pyarrow functionality Strings String extension data type and string data
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Issue related to the new default "string" dtype (#54792)

As long as the pyarrow-backed string dtype was opt-in, so users could get the better performance, it might have made sense to warn for certain cases where it was not actually using pyarrow but was falling back to the slower object-dtype implementation.

However, when we make the pyarrow-backed dtype the default string dtype for pandas 3.0, I don't think the current warnings make sense:

>>> pd.options.future.infer_string = True
>>> s = pd.Series(["a", "b", "c"])
>>> import re
>>> s.str.contains("a", flags=re.IGNORECASE)
PerformanceWarning: Falling back on a non-pyarrow code path which may decrease performance.
0     True
1    False
2    False
dtype: bool

Of course, it's still taking a slower code path, but 1) the user cannot do anything about this, and 2) the user also didn't choose this explicitly, it just got this dtype by default, so we don't need to warn the user that their choice for the faster dtype didn't have any effect.

So I think we should remove the warning by default (maybe still give users a way to get them optionally, now we have an option for this, xref #56921), and better document in the docstring if a certain option will cause a slowdown.

Or if we keep a warning, it should at least be an actionable warning, if that is possible. For example in the specific case above, a user could pass case=False instead of flags=re.IGNORECASE for the same result and without fallback. But then 1) the warning should actually point users to that better option, and 2) I don't know if such alternative will always exist.

cc @mroeschke @phofl

@mroeschke
Copy link
Member

I'd be OK removing this warning. (And +1 to eventually remove this fallback behavior)

@jorisvandenbossche
Copy link
Member Author

List of methods where we fallback (currently without warning): #58551 (comment)

@WillAyd
Copy link
Member

WillAyd commented Aug 1, 2024

I think we should keep the warning. In general I think the vast majority of users would benefit from using pyarrow strings, so the more it gets promoted the better. For the people that cannot use pyarrow, they can either filter out the warning or ignore it pretty easily

@jorisvandenbossche
Copy link
Member Author

@WillAyd I think you are misunderstanding the issue.

In general I think the vast majority of users would benefit from using pyarrow strings, so the more it gets promoted the better.

This issue is about warnings you get when already using the pyarrow strings, so there is nothing to promote to use pyarrow.

There are just certain methods, also for our pyarrow-backed string arrays, where we don't actually use pyarrow because pyarrow is missing some features. So it is not about the user not using pyarrow, but our implementation not being able to do so.

@WillAyd
Copy link
Member

WillAyd commented Aug 1, 2024

Ah yes I did misunderstand. Ok yea I agree - this is not something that should be exposed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Strings String extension data type and string data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants