
API: Setting Arrow-backed dtypes by default #51433


Closed
datapythonista opened this issue Feb 16, 2023 · 36 comments
Labels: API - Consistency, API Design, Arrow, Needs Discussion, Typing

Comments

@datapythonista (Member)

I've been using the new Arrow-backed dtypes, and I'm a bit confused about how it is decided which backend is used. One example:

>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
...     pandas.Series([1, 2, 3, 4])
... 
0    1
1    2
2    3
3    4
dtype: int64

Why is setting the dtype_backend to pyarrow not enough to use Arrow in the Series constructor when no dtype is specified?
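
For reference, requesting an ArrowDtype explicitly already works in the constructor (a sketch, assuming pyarrow is installed), so the question is only about the option-driven default:

>>> pandas.Series([1, 2, 3, 4], dtype="int64[pyarrow]")
0    1
1    2
2    3
3    4
dtype: int64[pyarrow]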

Also, when using for example read_csv:

>>> import pandas
>>> pandas.read_csv('test.csv').dtypes
name    object
age      int64
dtype: object
>>> pandas.read_csv('test.csv', use_nullable_dtypes=True).dtypes
name    string[python]
age              Int64
dtype: object
>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
...     pandas.read_csv('test.csv').dtypes
... 
name    object
age      int64
dtype: object
>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
...     pandas.read_csv('test.csv', use_nullable_dtypes=True).dtypes
... 
name    string[pyarrow]
age      int64[pyarrow]
dtype: object

Why, again, is setting the backend to pyarrow not enough to get Arrow dtypes, and the user also needs to pass use_nullable_dtypes? This is what we return, which doesn't make sense to me:

                           dtype_backend=None      dtype_backend=pyarrow
use_nullable_dtypes=False  NumPy                   NumPy ???
use_nullable_dtypes=True   Arrow+NumPy nullables   Arrow

What I would expect:

                           dtype_backend=None                                dtype_backend=pyarrow
use_nullable_dtypes=False  NumPy                                             Arrow
use_nullable_dtypes=True   Arrow eventually (Arrow+NumPy nullables for now)  Arrow

Sorry if I missed the discussion; maybe I'm just missing something. But I don't see the use case for a user explicitly saying they want Arrow types with the option and still getting NumPy-backed series and dataframes... Is this something that was agreed on, or did we just not make the changes needed for a more intuitive behavior?

CC: @mroeschke

@mroeschke (Member)

Yeah, it was discussed previously that this setting should only apply when use_nullable_dtypes=True was set or df.convert_dtypes() was called. This was chosen instead of, for example, changing use_nullable_dtypes to accept "pandas"|"pyarrow".

I think your first example should eventually return pyarrow-backed types for any method, but that would require a lot more changes.
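
For illustration, the path where the option did take effect, as described above, is convert_dtypes(); a sketch of that RC-era behavior (assuming the option and method interact as stated):

>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
...     pandas.Series([1, 2, 3, 4]).convert_dtypes()
... 
0    1
1    2
2    3
3    4
dtype: int64[pyarrow]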

@phofl (Member) commented Feb 16, 2023

Additionally, neither our own nullable dtypes nor pyarrow dtypes are implemented in our constructors yet.

@datapythonista (Member, Author)

I see that for the constructor it's simply not implemented yet; I thought it was related to the IO readers decision. Sorry for mixing the two things; I guess we agree on that.

For the readers, I've been reading the discussions, but I still couldn't find the reason why we want to ignore the user's preference unless use_nullable_dtypes is specified. The API seems quite cumbersome, and I see in the discussion in #50291 that @phofl seems to agree, and emphasizes that the behavior should be well documented, since it's not intuitive. From what I understand, it seems it was motivated by the option originally being named nullable_backend before being renamed to dtype_backend. Before the rename it made sense to consult it only if use_nullable_dtypes was True. I understand that decision in its historical context, but I think the final API after the rename is counter-intuitive for no reason.

Personally, I'd make the nullable_dtypes option and the use_nullable_dtypes parameter just values of dtype_backend (which could be a parameter too). I'm not proposing to change that now, but it feels like we could have an option for numpy (now named pandas), another for custom-nullable (the mix of numpy nullables and the Arrow string type currently used with use_nullable_dtypes=True), and a last one for pyarrow (which should be respected). That would probably simplify things IMO.

But regardless of that, I don't see a use case where a user says (with the option) "I want pyarrow dtypes" and we still give them numpy dtypes because a mostly unrelated parameter, use_nullable_dtypes, is not set. pyarrow dtypes can be nullable, but that parameter is still mostly unrelated, and having it set to False would make users expect pyarrow dtypes without the NA mask, not numpy types.

Am I missing something? Is there an advantage to forcing the user to set use_nullable_dtypes even if they set the pyarrow backend option?

@mroeschke (Member)

True, the coupling of the use_nullable_dtypes keyword with the nullable_backend global option is heavily tied to the historical renaming during development, inspired by the shift from "pyarrow is an alternative nullable type implementation" to "pyarrow is generically a whole different type implementation".

I think @phofl expressed this before too, but an ideal end state is decoupling these two: only needing to set dtype_backend to 'numpy', 'pandas', or 'pyarrow' (or similar) and getting numpy, pandas-nullable, and pyarrow-backed dtypes everywhere, respectively, without needing use_nullable_dtypes at all.
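
For illustration, that end state would look roughly like this (hypothetical: neither the option values nor the global behavior shipped in this form):

>>> pandas.set_option("mode.dtype_backend", "pyarrow")  # hypothetical end state
>>> pandas.Series([1, 2, 3]).dtype                      # would be Arrow-backed
int64[pyarrow]
>>> pandas.read_csv("test.csv").dtypes                  # readers too, no extra keyword
name    string[pyarrow]
age      int64[pyarrow]
dtype: object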

@datapythonista (Member, Author)

I see that parquet works as expected (as expected by me :)

>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
...     print(pandas.read_parquet('test.parquet').dtypes)
Date       string[pyarrow]
Open       double[pyarrow]
High       double[pyarrow]
Low        double[pyarrow]
Close      double[pyarrow]
Volume      int64[pyarrow]
OpenInt     int64[pyarrow]
dtype: object

But I also tested ORC, and it seems to behave like CSV.

Is there any objection to making all readers return pyarrow types if mode.dtype_backend=pyarrow (regardless of the use_nullable_dtypes parameter)? @pandas-dev/pandas-core

@phofl (Member) commented Feb 16, 2023

I'd prefer implementing this everywhere before creating an option that is not tied to a keyword. Right now it works in half or two thirds of our methods/functions, and it's kind of voodoo when it works and when it does not.

I think we used this approach to avoid having two keywords, e.g. use_nullable_dtypes and something like use_pyarrow_dtypes.

@datapythonista (Member, Author)

Thanks for the feedback @phofl. I don't fully understand what you propose. What exactly works in half or two thirds of methods?

My understanding is that the dtype_backend option should mostly take effect when data is created or loaded. For the constructors, implementing it doesn't seem immediate. But for the readers I think we can change the behavior to what read_parquet does relatively fast, if there is agreement, no? And that would help avoid this magic of not understanding when it works one way or another, and things would be better than the current state, no?

Or am I missing something?

@phofl (Member) commented Feb 16, 2023

If we don't tie the dtype_backend to a keyword, I would expect (as a user) that I'll get arrow-backed data everywhere. This is not the case right now, since constructors, among others, aren't implemented yet.

The alternative we considered was adding a dtype_backend keyword to every method where the option is supported, which is something I would prefer over the option magically working for some functions but not for others

@datapythonista (Member, Author)

I see your point, thanks for clarifying. Personally I think users will still expect data to be loaded with the Arrow backend, even if we create this (in my opinion arbitrary) link with a keyword. And it'd be better to give them as much of that as possible, even if we can't immediately use Arrow types for all data loaded into pandas.

I don't know what regular users would expect, but I don't think as a user I'd expect the option to get me Arrow data "everywhere" in the sense that df['numpy_backed_column'] + 1 would return an Arrow-backed column. I see the scope here being data loaded/imported into pandas, via I/O or from Python structures via constructors.

If you really prefer the current approach of making that option only relevant when use_nullable_dtypes is set to True (which, as I said, I think is a very bad idea from the user experience point of view), would you then make read_parquet use NumPy types when dtype_backend='pyarrow' and use_nullable_dtypes=False?

@jreback (Contributor) commented Feb 16, 2023

@datapythonista these are great longer-term points, but we need a transition and a complete implementation before every single last thing works.

Documented behavior, even if only half working, is better than nothing.

@datapythonista (Member, Author)

Fully agree with you @jreback, and I think there is agreement that for the constructors we will have this half-working behavior.

But if I'm not wrong, for the readers we're literally talking about removing the use_nullable_dtypes from this condition:

https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/base_parser.py#L754

So it's not about being half working; it's about deciding not to give the user Arrow types when they explicitly asked for them, unless they also specify use_nullable_dtypes=True in the call. That kind of makes sense to us, since nullable dtypes are the origin of the dtype_backend option, but for the user it's just an arbitrary parameter that we ask them in the documentation to pass, without being able to explain why. At least I can't explain it in the post I'm writing about pandas 2.0 and Arrow.
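
For illustration, the guard under discussion has roughly this shape (simplified pseudocode; the helper names are made up, and the real logic lives in pandas/io/parsers/base_parser.py):

# Simplified, illustrative sketch of the RC-era reader logic; not the actual internals.
use_nullable_dtypes = self.use_nullable_dtypes        # keyword passed by the user
dtype_backend = get_option("mode.dtype_backend")      # the global option

if use_nullable_dtypes:            # the option is only consulted behind this flag
    if dtype_backend == "pyarrow":
        result = arrow_backed_arrays(data)            # hypothetical helper
    else:
        result = numpy_nullable_arrays(data)          # hypothetical helper
else:                              # otherwise plain NumPy dtypes, option ignored
    result = numpy_arrays(data)                       # hypothetical helper

# The proposal: drop the use_nullable_dtypes guard, so that setting
# mode.dtype_backend = "pyarrow" alone yields Arrow-backed results.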

@mroeschke (Member)

> user is just an arbitrary parameter that we're asking in the documentation to pass, without being able to explain why

This is fair. For the "why", it's just categorizing the numpy-nullable-backed vs pyarrow-backed types as "nullable", answering "if you specified use_nullable_dtypes=True, which nullable implementation would you like?". With this categorization, use_nullable_dtypes=False and dtype_backend="pyarrow" would seem to clash: the pyarrow types are nullable, so should we still return Arrow? (I know you highlighted in your OP that this should be Arrow.)

Overall this is just semantics though, and maybe not significantly meaningful to users. I am not opposed to dtype_backend taking effect whether use_nullable_dtypes is True or False for the IO methods. Implementation-wise, IIRC some readers "piggy-back" off of the use_nullable_dtypes=True code paths, so some untangling will be needed.

@datapythonista (Member, Author)

I still think we should change this before the release. I understand where the current behavior is coming from, but to me it's obvious that every user will expect that setting the backend option to Arrow makes pandas use Arrow for the constructors and readers. I understand that for the constructors it's more work and maybe worth waiting for 2.1 (I'm surely +1 on making it a blocker for the release). But for the readers the change is trivial, and it feels like the current behavior only makes sense for the people involved in the development of the parameter; for everyone else it will be misleading and frustrating. For reference, I just read this: https://mobile.twitter.com/rasbt/status/1632095663532957700

Also, for parquet we're not requiring use_nullable_dtypes, so there is no consistency among formats.

Any objection if I open a PR to make the behavior consistent, using Arrow dtypes whenever the backend option is set to Arrow, regardless of the nullable-dtypes parameter (which would then only apply to the numpy backend)?

@jbrockmendel (Member)

> every user will expect that setting the backend option to arrow will make pandas use arrow for the constructors

"io_dtype_backend"?

Changing this may require an rc2

@phofl (Member) commented Mar 5, 2023

This would only solve the problem partly, since we don't have an implementation for all IO methods (but that's probably not really relevant, because we covered the common ones).

There is another reason why I don't like the option-only approach. Arrow is still new and experimental; we have lots of places where Arrow support doesn't exist or probably has a bunch of bugs. Using the dtype_backend option and nothing else removes local control over turning the option on or off. I think a more likely approach is that you'll try it out in a couple of functions, but not everywhere. I agree that the coupling isn't ideal, but having only a global option is worse imo.

If you really want to opt in globally you can just set

pd.options.mode.nullable_dtypes = True

as well.

I don't like making this behavior option-only; it removes control that is quite helpful in the beginning, without adding much.

And yes, we should change parquet to be consistent.
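
For illustration, combining the two RC-era options mentioned above would look like this (both existed only in the release candidate and were later removed):

import pandas as pd

pd.options.mode.nullable_dtypes = True     # opt in to nullable dtypes everywhere
pd.options.mode.dtype_backend = "pyarrow"  # and pick the pyarrow implementation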

@datapythonista (Member, Author)

I understand those concerns. The option only applying in some cases where it's been already implemented, and not having control over a single operation after setting the option are surely not ideal.

But I think the current approach just makes things worse. We're going to confuse every single user and end up with a cumbersome API that, for backward compatibility, we won't be able to fix immediately after the release.

I'm ok with replacing the use_nullable_dtypes parameter by a dtype_backend one that accepts "numpy", "pandas" and "pyarrow". That addresses your concerns, and maybe it's worth removing the option for now, or making it generate a warning saying it is experimental and only supported by some operations.

Of course this would delay the release a bit and require another RC, as Brock says. But totally worth doing in my opinion. What do you think?

PS: Here you have another example of someone using the option in the obvious but incorrect way: https://github.com/pola-rs/tpch/pull/36/files

@lithomas1 (Member)

Would an acceptable option just be for dtype_backend to set use_nullable_dtypes to True when it's set to pyarrow?

@phofl (Member) commented Mar 5, 2023

Yeah renaming sounds like a solution, not great on such short notice though :(

I am ok with a global option, that definitely makes sense, but I don't think it should be the only way of switching yet.

Automatically switching use_nullable_dtypes to True might create more confusion

@datapythonista (Member, Author)

> Yeah renaming sounds like a solution, not great on such short notice though :(

Agree, it'd have been much better to have this conversation earlier, and surely before the RC. But IMO better now than never. And I personally think the cost of postponing pandas 2.0 a bit is much less than the cost of releasing things as they are now. First, because of how much more effort it will be to change things in a backward-compatible way once they are released. And second, because I think the current API is confusing and will make users interested in Arrow feel frustrated, and the reputational damage will be quite significant (and pandas doesn't have a great reputation for a consistent API already).

This is something I saw on twitter the other day, I guess we all can more or less relate... ;)

[image attachment]

@mroeschke (Member)

> using dtype_backend option and nothing else removes local control over turning the option on or off.

Couldn't a user's local control (testing this option on a subset of code) be achieved through the option_context context manager?

I can see how tying a global option to a method parameter can be confusing to users, and it would be more work in the future to walk that back, so I do think it's a good idea to decouple the dtype_backend option from the use_nullable_dtypes parameter before 2.0. I would be okay with moving forward with just having the dtype_backend option dictate whether numpy, pandas-nullable, or pyarrow types are returned. I think eventually we want to get to a state where the global option can dictate what dtype implementation is used without the use of method parameters.

Should the steps forward be:

  1. Remove the newly added use_nullable_dtypes parameter from methods (everything except read_parquet IIRC)
  2. Ensure the dtype_backend option affects the dtype implementation of the result regardless (see the sketch after this list)
  3. (Optional) deprecate use_nullable_dtypes in read_parquet?
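
Under those steps, usage would reduce to something like this (a sketch of the proposal at this point in the thread; the keyword-based approach was ultimately adopted instead, as decided below):

import pandas as pd

# Step 2's intended effect: the global option alone drives the result dtypes.
with pd.option_context("mode.dtype_backend", "pyarrow"):
    df = pd.read_csv("test.csv")  # pyarrow-backed result, no use_nullable_dtypes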

@datapythonista (Member, Author)

I agree with @mroeschke here. I think that if a user sets dtype_backend to one backend and prefers a different one for a particular call, it is reasonable for them to use the context manager, instead of having an argument in each I/O call.

To the proposed steps, I think what's missing is adding a numpy (default) option to dtype_backend, which doesn't seem to exist, as I guess use_nullable_dtypes=False is the current way to tell pandas that we want pure-numpy dtypes.

Also, since one of the concerns is that pyarrow is experimental and, after selecting dtype_backend=pyarrow, it won't always be honored, I think we should add a warning when the option is set to pyarrow to let users know. Besides what we can say in the documentation, I think the warning will help manage expectations much better.

@phofl @pandas-dev/pandas-core is everybody happy with this approach, and with making the change before pandas 2.0 (which would require another RC)?

@phofl (Member) commented Mar 7, 2023

I didn't really consider the context manager, but I am not sure it is practical in general.

I think we have quite different expectations of how an Arrow backend adoption would go.

We have a big chunk of users with existing code bases. I can't imagine that they would consider turning the option on globally. A couple of reasons why not:

  • You would probably be busy fixing tests for weeks
  • You'll probably run into a bunch of bugs
  • Performance bottlenecks in some areas (try a groupby with pyarrow vs NumPy dtypes for example)

A reasonable migration approach would look similar to this (imo):

  • Add the Arrow backend one step at a time, to keep the failing tests limited, and investigate performance where relevant.
  • Move on to another area.

I think this is how most adoptions would go. I don't think that a relevant part of our users will just do

pd.options.mode.dtype_backend = "pyarrow"

You could of course argue that all of this can be done via a context manager as well, but imo this is significantly more cumbersome than just setting a keyword.

I'd like to put this on the agenda tomorrow, and I'm happy to provide a PR afterwards no matter what we decide. That way we should be able to push the RC out by the end of this week.

@datapythonista (Member, Author)

I think there are two independent questions we are discussing:

Q1) What is the way to define the backend to use globally:

Option 1:

dtype_backend
      |
      |---> numpy
      |---> pandas
      |---> pyarrow

Option 2:

                         yes                           pandas
is_nullable_dtype ----------------> dtype_backend --------------> (pandas)
        |                                   |
        | no                                | pyarrow
        V                                   V
  (numpy backend)                       (pyarrow)

To me it's obvious that option 2 is harder to understand, confusing, and doesn't provide any value, unless the goal here is to overcomplicate things so users don't use any backend other than the numpy one.

Q2) Which API to provide users to override the default backend:

Option 1:

with pandas.option_context('mode.dtype_backend', 'pyarrow'):
    df = pandas.read_csv(fname)

Option 2:

df = pandas.read_csv(fname, dtype_backend='pyarrow')

Option 2 is clearly nicer, but other than for migrating the dtype backend it will rarely be used, and option 1 has a couple of advantages IMO:

  • Simpler: literally no change to any of the read functions is needed (once the global option is working)
  • Avoids duplication, and possibly inconsistency, not only in the signature but also in the documentation of the parameter, which would need to be duplicated or use one of the template hacks
  • Some of the readers have a huge number of parameters, like read_csv; not adding more things there would be nice

I'll be teaching tomorrow at the time of the meeting, so I can't join, but my opinion: strong -1 to is_nullable_dtypes (option 2) for the first question. Preference for the context manager for the second, as it simplifies our codebase significantly and keeps things simpler, but adding a parameter to all readers is also fine if that's what others prefer.

Also, as I said, I'd add a warning when setting dtype_backend to any backend we consider experimental (surely pyarrow; not sure about the pandas/nullable backend).
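
For illustration, the suggested experimental-backend warning might look like this (a plain sketch; the names and the option-setting mechanics here are assumptions, not pandas' actual option machinery):

import warnings

import pandas as pd

EXPERIMENTAL_BACKENDS = {"pyarrow"}  # hypothetical set of experimental backends

def set_dtype_backend(backend: str) -> None:
    """Hypothetical setter that warns before enabling an experimental backend."""
    if backend in EXPERIMENTAL_BACKENDS:
        warnings.warn(
            f"dtype_backend={backend!r} is experimental and is not yet honored "
            "by all pandas operations.",
            UserWarning,
            stacklevel=2,
        )
    pd.set_option("mode.dtype_backend", backend)  # RC-era option name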

@phofl (Member) commented Mar 7, 2023

I am on board with option 1 for question 1; I should probably have stated this a bit more explicitly before.

@mroeschke (Member)

@datapythonista during today's dev call, there was a sense that 2.0 should not introduce a global dtype_backend option and should stick to introducing a dtype_backend=lib.no_default | "numpy_nullable" | "pyarrow" keyword in IO readers, convert_dtypes and to_numeric. The reasoning was that a global option might create the expectation that it works everywhere, which in reality it will not yet for 2.0, and a keyword would be an easier migration/testing path for users. But a global option can be introduced in a later version when we have better Arrow support (like in the constructors).

What do you think?
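
Concretely, the keyword-based approach looks like this (this is the shape that ended up shipping in pandas 2.0):

import pandas as pd

# dtype_backend keyword on IO readers, convert_dtypes and to_numeric:
df = pd.read_csv("test.csv", dtype_backend="pyarrow")
s = pd.to_numeric(pd.Series(["1", "2"]), dtype_backend="numpy_nullable")
df2 = df.convert_dtypes(dtype_backend="pyarrow")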

@datapythonista (Member, Author)

Sounds good to me. I'm not sure whether in the long term we may still want to deprecate those parameters in favor of the global option, but for pandas 2.0 this seems the most reasonable thing to do. 100% on board with the idea.

I assume we're still getting rid of the use_nullable_dtypes for everything except read_parquet, right?

@phofl I think you said you'd work on this. Let me know if you need help. Is the plan still to do a second RC once this is done, and try to release the final 2.0 around two weeks after?

@mroeschke (Member)

> I assume we're still getting rid of the use_nullable_dtypes for everything except read_parquet, right?

I think we agreed to deprecate use_nullable_dtypes in favor of a dtype_backend keyword too

> Is the plan still to do a second RC once this is done, and try to release the final 2.0 around two weeks after?

Yup, sounds good to me. I can help out with this too.

@jreback (Contributor) commented Mar 8, 2023

> I assume we're still getting rid of the use_nullable_dtypes for everything except read_parquet, right?

> I think we agreed to deprecate use_nullable_dtypes in favor of a dtype_backend keyword too

> Is the plan still to do a second RC once this is done, and try to release the final 2.0 around two weeks after?

> Yup, sounds good to me. I can help out with this too.

+1 on these decisions

@phofl (Member) commented Mar 9, 2023

I opened #51853

Feedback very welcome

@jbrockmendel (Member)

#51853 has been merged; is this closable?

@datapythonista (Member, Author)

Yes, thanks @jbrockmendel

@czcindy426

It seems that it was once possible to use pd.options.mode.dtype_backend = "pyarrow" to do the global setting, as I saw it in at least two different code snippets. Yet when I tried it myself today, I got an error: OptionError: 'You can only set the value of existing options'.

Is with pandas.option_context('mode.dtype_backend', 'pyarrow'): the recommended way of switching the backend now?

@OSuwaidi

> It seems that it was once possible to use pd.options.mode.dtype_backend = "pyarrow" to do the global setting, as I saw it in at least two different code snippets. Yet when I tried it myself today, I got an error: OptionError: 'You can only set the value of existing options'.

> Is with pandas.option_context('mode.dtype_backend', 'pyarrow'): the recommended way of switching the backend now?

Even with pandas.option_context('mode.dtype_backend', 'pyarrow'): it doesn't work now; it spits out: OptionError: No such keys(s): 'mode.dtype_backend'

@phofl (Member) commented Aug 24, 2023

This option was removed; it was only available for a short time in a release candidate, never in a proper release.

@CGarces commented Nov 1, 2023

@phofl so if I use
df = pd.DataFrame.from_dict(prices)
the only way to use Arrow is to set the data types after creating the DataFrame?
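
For readers landing here with pandas >= 2.0, a couple of ways to get Arrow-backed data out of a constructor without any global option (a sketch; prices stands in for the data above):

import pandas as pd

prices = {"open": [1.0, 2.0], "close": [1.5, 2.5]}  # stand-in example data

# Convert right after construction via the dtype_backend keyword:
df = pd.DataFrame.from_dict(prices).convert_dtypes(dtype_backend="pyarrow")

# Or request an ArrowDtype explicitly wherever a dtype is accepted:
s = pd.Series([1.0, 2.0], dtype="float64[pyarrow]")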

@mthiboust (Contributor)

Is there any plan to add back the option to choose pyarrow as the default dtype backend?
