-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
API: Setting Arrow-backed dtypes by default #51433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Yeah it was discussed previously that this setting should only apply when I think your first example should eventually return pyarrow backed types in the future for any method, but that would require a lot more changes. |
Additionally, neither our own nullables nor pyarrow dtypes are yet implemented in our constructors |
I see that for the constructor it's just that it's not implemented, I thought it was related to the IO readers decision. Sorry for mixing both things, I guess we agree on that. For the readers, I've been reading the discussions, but I still couldn't find the reason why we want to ignore the user preference unless Personally, I'd make the But regardless of that, I don't see a use case when a user says (with the option) "I want pyarrow dtypes", and we still give them numpy dtypes, because a mostly unrelated parameter Am I missing something? Is there an advantage on forcing the user to set the |
True, the coupling the I think @phofl expressed this before too, but an ideal end state is decoupling these two and only needing to set |
I see that parquet works as expected (as expected by me :) >>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
... print(pandas.read_parquet('test.parquet').dtypes)
Date string[pyarrow]
Open double[pyarrow]
High double[pyarrow]
Low double[pyarrow]
Close double[pyarrow]
Volume int64[pyarrow]
OpenInt int64[pyarrow]
dtype: object But I test also ORC, and seems it's behaving like CSV. Is there any objection to make all readers return parquet types if |
I‘d prefer implementing this everywhere before creating an option that is not tied to a keyword. Right now it works in half or 2/3 of our methods/functions and it‘s kind of vodoo when it works and when it does not. I think we used this approach to avoid having two keywords, e.g. use_nullable_dtypes and something like use_pyarrow_dtypes |
Thanks for the feedback @phofl. I don't fully understand what you propose. What exactly works in half or two thirds of methods? My understanding is that the Or am I missing something? |
If we don't tie the The alternative we considered was adding a |
I see your point, thanks for clarifying. Personally I think users will still expect that data is loaded with the Arrow backend, even if we create this (in my opinion arbitrary) link with a keyword. And it'd be better to give them as much of that even if we can't immediately use Arrow types for any data loaded into pandas. I don't know what regular users would expect, but I don't think as a user I'd expect that the option will get me Arrow data "everywhere" in the sense that If you really think the current approach to make that option only relevant when |
@datapythonista these are great longer term points but we need a transition and complete implementation before every single last thing works documented behavior even if only half working is better than nothing |
Fully agree with you @jreback, and I think there is agreement that for the constructors we will have this half working behavior. But if I'm not wrong, for the readers we're literally talking about removing the https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/base_parser.py#L754 So, it's not about being half working, it's about making the decision to not give the user Arrow types when they explicitly asked for them, unless they also specify |
This is fair. For the "why", it's just categorizing the numpy-nulllable-backed vs pyarrow-backed types as "nullable" answering "if you specified Overall this is just semantics though and maybe not significantly meaningful to users. I am not opposed to |
I still think we should change this before the release. I understand where the current behavior is coming from, but to me it's obvious that every user will expect that setting the backend option to arrow will make pandas use arrow for the constructors and readers. I understand that for the constructors is more work and maybe worth waiting for 2.1 (I'm surely +1 in making it a blocker for the release). But for the readers, the change is trivial, and feels like the current behavior only makes sense for the people involved in the development of the parameter, but for everyone else this will be misleading and frustrating. For reference, I just read this: https://mobile.twitter.com/rasbt/status/1632095663532957700 Also, for parquet we're not requiring Any objection if I open a PR to make the behavior consistent to use arrow dtypes if the backend option is set to arrow, regardless of the nullable dtypes parameter, which would only apply to the numpy backend? |
"io_dtype_backend"? Changing this may require an rc2 |
This would only solve the problem partly, since we don’t have a implementation for all io methods (but probably not really relevant because we covered the common ones) there is another reason why I don’t like the option only approach. Arrow is still new and experimental, we have lots of places where arrow support doesn’t exist or has probably a bunch of bugs, using dtype_backend option and nothing else removes local control over turning the option on or off. I think a more likely approach is that you’ll try it out in a couple of functions but not everywhere. I agree that the coupling isn’t ideal but having a global option only is worse imo. If you really want to opt in globally you can just set pd.options.mode.nullable_dtypes = True as well. I don’t like making this behavior Option only, which removes control that is quite helpful in the beginning without adding much. And yes we should change parquet to be consistent |
I understand those concerns. The option only applying in some cases where it's been already implemented, and not having control over a single operation after setting the option are surely not ideal. But I think the current approach just makes things worse. We're going to confuse every single users, and get a cumbersome API that we won't be able to fix immediately after the release for backward compatibility. I'm ok eplacing the Of course this would delay a bit the release and require another RC as Brock says. But totally worth doing in my opinion. What do you think? PS: Here you have another example of someone using the option in the obvious but incorrect way: https://github.com/pola-rs/tpch/pull/36/files |
Would an acceptable option just be for |
Yeah renaming sounds like a solution, not great on such short notice though :( I am ok with a global option, this makes definitely sense, but I don't think that it should be the only way of switching yet. Automatically switching use_nullable_dtypes to True might create more confusion |
Agree, it'd been much better to have this conversation earlier, and surely before the RC. But IMO better now than never. And I personally think the cost of postponing the pandas 2.0 a bit is much less than the cost of releasing things as they are now. First because of how much more effort will be to change things in a backward compatible way once they are released. And second, because I think the current API is confusing and will make users interested in Arrow feel frustrated, and the reputational damage will be quite significant (and pandas doesn't have a great reputation for a consistent API already. This is something I saw on twitter the other day, I guess we all can more or less relate... ;) |
Wouldn't a user's local control of testing this option on a subset of code can be done through use of the I can see how tying a global option to a method parameter can be confusing to users and would be more work in the future if we walk that back, so I do think it's a good idea to decouple the Should the steps forward be:
|
I agree with @mroeschke here. I think if a user sets the To the proposed steps I think it's missing adding a Also, since one of the concerns is that pyarrow is experimental and after selecting @phofl @pandas-dev/pandas-core is everybody happy with this approach, and to make the change before pandas 2.0 (which would require another RC)? |
I didn't really consider the context manager but I am not sure this is practical in general. I think we have quite different expectations how an arrow backend adaption would go. We have a big chunk of users with existing code bases. I can't imagine that they would consider turning the option on globally. Couple of reasons why not:
A reasonable migration approach would look similar to this (imo):
I think this is how most adaptions would go. I don't think that a relevant part of our users will just do
You could of course argue that all of this can be done via a context manager as well, but imo this is significantly more cumbersome than just setting a keyword. I'd like to put this on the agenda tomorrow and happy to provide a pr afterwards no matter what we decide. This way we should be able to push he rc end of this week? |
I think there are two independent conversations we are discussing: Q1) What is the way to define the backend to use globally:Option 1:
Option 2:
To me it's obvious that option 2 is harder to understand, confusing, and doesn't provide any value. Unless the goal here is to overcomplicate things so users don't use any backend other than the numpy one. Q2) Which API to provide users to overwrite the default backend:Option 1: with pandas.option_context('mode.dtype_backend', 'pyarrow'):
df = pandas.read_csv(fname) Option 2: df = pandas.read_csv(fname, dtype_backend='pyarrow') Option 2 is clearly nicer, but other than to migrate the dtype backend, it will be rarely used, and option 1 has couple of advantages IMO:
I'll be teaching tomorrow at the time of the meeting, I can't join, but my opinion: Strong -1 to the `is_nullable_dtypes (option 2) to the first question. Preference for the context manager to the second as it simplifies our codebase significantly and makes things simpler, but adding a parameter to all readers is also fine if that's what others prefer. Also, as I said, I'd add a warning when settings the |
I am on board with option 1 for question 1, should probably have stated this a bit more explicit before |
@datapythonista during today's dev call, there was a sense that 2.0 should not introduce a global option What do you think? |
Sounds good to me, not sure if for the long term we may still want to deprecate those parameters in favor of the global option, but for pandas 2.0 seems the most reasonable thing to do. 100% onboard with the idea. I assume we're still getting rid of the @phofl I think you said you'll work on this. Let me know if you need help. Is the plan still to do a second RC once this is done, and try to release the final 2.0 around two weeks after? |
I think we agreed to deprecate
Yup, sounds good to me. I can help out with this too. |
+1 on these decisions |
I opened #51853 Feedback very welcome |
#51853 has been merged, is this closable? |
Yes, thanks @jbrockmendel |
It seems that it was once possible to use Is |
Even |
This option was removed, it was only available for a short time in a release candidate, never in a proper release |
@phofl so if I use |
Is there any plan on adding back the option to choose |
I've been using the new Arrow backed dtypes, and I'm a bit confused on how it is decided which backend is used. One example:
Why is setting the
dtype_backend
topyarrow
not enough to use Arrow in theSeries
constructor when no dtype is specified?Also, when using for example
read_csv
:Why again is not enough that the user set the backend to
pyarrow
to use Arrow dtypes, and needs to calluse_nullable_dtypes
? This s what we returned, which doesn't make sense to me:What I would expect:
Sorry if I missed the discussion, maybe I'm just missing something. But I don't see what's the use case for a user to explicitly say they want Arrow types with the option, but still giving them NumPy backed series and dataframes... Is this something it was agreed, or we just didn't make the changes to have a more intuitive behavior?
CC: @mroeschke
The text was updated successfully, but these errors were encountered: