API: how to check for "logical" equality of dtypes? #60305

Open
jorisvandenbossche opened this issue Nov 13, 2024 · 11 comments

@jorisvandenbossche
Member

Assume you have a series, which has a certain dtype. In the case that this dtype is an instance of potentially multiple variants of a logical dtype (for example, string backed by python or backed by pyarrow), how do you check for the "logical" equality of such dtypes?

For checking the logical equality for one series, you have the option to compare it with the generic string alias (which will return True for any variant of it) or checking the dtype with isinstance or some is_..._dtype (although we have deprecated some of those). Using string dtype as the example:

ser.dtype == "string"
# or
isinstance(ser.dtype, pd.StringDtype)
pd.api.types.is_string_dtype(ser.dtype)

When you want to check whether two Series have the same dtype, == checks for exact equality (in the string dtype example, the below can evaluate to False even if both are a StringDtype but have different storage):

ser1.dtype == ser2.dtype

So how do you check this logical equality for two dtypes? In the example, how do you know that both dtypes represent the same logical dtype (i.e. both are a StringDtype instance) without necessarily checking the exact type (i.e. the user doesn't necessarily know they are string dtypes, and just wants to check whether they are logically the same)?

# this might work?
type(ser1.dtype) == type(ser2.dtype)

Do we want some other API here that is a bit more user friendly? (just brainstorming, something like dtype1.is_same_type(dtype2), or a function, ..)
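As a rough illustration of the function form of this idea, the type() comparison above could be wrapped in a helper. The name is_same_dtype_class is hypothetical and not a pandas API. This sketch only targets pandas extension dtypes; for NumPy dtypes the outcome of a class comparison depends on how the installed NumPy organizes its dtype classes.

```python
import pandas as pd


def is_same_dtype_class(dtype1, dtype2) -> bool:
    """Hypothetical helper (not a pandas API): compare only the dtype classes."""
    return type(dtype1) is type(dtype2)


# Two string Series: one with explicit python storage, one using whatever
# storage the "string" alias resolves to in the current environment.
str_python = pd.Series(["a", "b"], dtype=pd.StringDtype("python"))
str_default = pd.Series(["c"], dtype="string")
ints = pd.Series([1, 2], dtype="Int64")

print(is_same_dtype_class(str_python.dtype, str_default.dtype))  # both StringDtype
print(is_same_dtype_class(str_python.dtype, ints.dtype))         # StringDtype vs Int64Dtype
```

Note that this check is coarse: it says nothing about storage or missing-value semantics, only that both dtypes are instances of the same class.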


This is important in the discussion around logical dtypes (#58455), but it is already an issue for the new string dtype in pandas 3.0 as well.

cc @WillAyd @Dr-Irv @jbrockmendel (tagging some people that were most active in the logical dtypes discussion)

@jorisvandenbossche
Member Author

And maybe an additional question is how to propagate that notion of "exact equality" vs "logical equality" into methods like .equals() or assert_frame_equal() (those could get a keyword about it?)
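For context, here is a minimal sketch of how both entry points treat dtype today, using nullable versus NumPy integers because that pair does not need pyarrow. It only shows current strict behavior; it does not propose what a new keyword would look like.

```python
import pandas as pd

left = pd.Series([1, 2, 3], dtype="Int64")   # nullable (masked) integers
right = pd.Series([1, 2, 3], dtype="int64")  # plain NumPy integers

# Series.equals requires the dtypes to match, so equal values are not enough.
strict_equal = left.equals(right)
print(strict_equal)

# assert_series_equal checks the dtype by default (check_dtype=True),
# so a dtype mismatch surfaces as an AssertionError.
try:
    pd.testing.assert_series_equal(left, right)
    dtype_mismatch_raised = False
except AssertionError:
    dtype_mismatch_raised = True
print(dtype_mismatch_raised)
```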

@Dr-Irv
Contributor

Dr-Irv commented Nov 13, 2024

One idea is to make .equals() mean "is it the same logical type", but that won't work because sometimes the dtype of a Series is a numpy dtype.

IMHO, let == mean "exactly the same dtype", and introduce a function to mean "is same logical dtype"

@jorisvandenbossche
Member Author

Something like pd.api.types.is_same_dtype(dtype1, dtype2) ?

@WillAyd
Member

WillAyd commented Nov 13, 2024

When you want to check whether two Series have the same dtype, == checks for exact equality (in the string dtype example, the below can evaluate to False even if both are a StringDtype but have different storage):

I think these should compare as equal if they are logically equivalent; otherwise we are back to the issue of exposing the implementation detail to end users.

So I think the reverse of what is being proposed is what we should have. By default, equality comparisons use the logical semantics, and if you want a more granular physical comparison you should use a dedicated function. I think we already have prior art for that when considering .equals and testing.assert_frame_equal.

@Dr-Irv
Contributor

Dr-Irv commented Nov 13, 2024

I think these should compare as equal if they are logically equivalent; otherwise we are back to the issue of exposing the implementation detail to end users.

@WillAyd that would create a change of behavior. See this example:

>>> s = pd.Series([1,2,3], dtype="Int64")
>>> s2 = pd.Series([1,2,3])
>>> s
0    1
1    2
2    3
dtype: Int64
>>> s2
0    1
1    2
2    3
dtype: int64
>>> s.dtype == s2.dtype
False

So you're proposing that == reports True here, and that's a possible big change for users.

@WillAyd
Member

WillAyd commented Nov 13, 2024

In the long run yes...although we definitely need to be careful about the steps that we take to get there.

I'm also coming from the perspective that we've discussed in PDEP-13, where dtype="Int64" and dtype=np.int64 lose their physical nature and only expose their logical behavior to end users.

@simonjayhawkins
Member

There seems to be a concern about changing the behavior of equality checks, as it could affect users who rely on the current exact equality checks.

in the string dtype example, the below can evaluate to False even if both are a StringDtype, but have a different storage

As @WillAyd mentions, this exposes an implementation detail to end users. So would it be reasonable to argue that this is a bug, and that a change would be a bugfix rather than a change in behavior, for just this dtype in isolation, which is currently still considered experimental?

@WillAyd
Member

WillAyd commented Nov 14, 2024

Yea that's an interesting point that @simonjayhawkins brings up. Do we think there's a huge risk to changing that behavior for strings today?

@arnaudlegout
Contributor

I am very confused by your discussion around logical equality of dtypes, considering that you never defined what logical means. Do you consider np.int32==np.int64? I don't think so, because you might say that the represented values are not the same. But in that case what do you do with "str"=="string"? They do not represent the same thing at all (one is fixed size, the other one is not). Or with np.int32=="Int32"? "Int32" is nullable and np.int32 is not, so "Int32" has one more possible value (pd.NA). I would certainly not like to see np.int32=="Int32" return True.

We can even consider "string[python]"=="string[pyarrow]": one replaces in place, the other does not. This is indeed an implementation detail, but I would not call it a detail if you work with large data structures.

Personally, I would expect to see == return True for the same dtype, and a specific method to test for logical equality, with extensive documentation explaining all the corner cases (because there will be a good number of them, in particular with strings).

Using == for logical equality would be very confusing for me.

@WillAyd
Member

WillAyd commented Nov 19, 2024

@arnaudlegout thanks for the feedback.

I am very confused by your discussion around logical equality of dtypes, considering that you never defined what logical means.

There is a larger (although incomplete) discussion of this in #58455 that is good for context

Do you consider np.int32==np.int64

Nope :-) These types have different front-end requirements for the user, i.e. they store different widths.

In the integral space, we have three different implementations of a 64 bit integer. The default implementation uses NumPy storage, the second iteration we did uses pandas custom code on top of NumPy storage, and the third int64 implementation uses PyArrow. The logical type system would consider all of those equal.
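A small sketch of the first two of those implementations side by side may make this concrete. The pyarrow-backed variant ("int64[pyarrow]") is left out because it needs the optional pyarrow dependency. The comparison through numpy_dtype is only an illustration of the shared 64-bit integer type, not the logical type system discussed in the PDEP.

```python
import pandas as pd

numpy_backed = pd.Series([1, 2, 3], dtype="int64")  # NumPy storage
masked = pd.Series([1, 2, 3], dtype="Int64")        # pandas masked array on top of NumPy

# Exact dtype equality distinguishes the two implementations.
strictly_equal = masked.dtype == numpy_backed.dtype
print(strictly_equal)

# The masked dtype exposes the NumPy dtype it wraps, which is the same int64.
same_numpy_type = masked.dtype.numpy_dtype == numpy_backed.dtype
print(same_numpy_type)

# The stored values are identical as well.
values_match = bool((masked.to_numpy(dtype="int64") == numpy_backed.to_numpy()).all())
print(values_match)
```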

That is still a ways off though and needs further discussion in that PDEP. The only real logical implementation we've tried today is our string data type that is getting implemented in 3.0

But in that case what do you do with "str"=="string"? They do not represent the same thing at all (one is fixed size, the other one is not).

Neither of these is fixed size, they just use a different null sentinel (str uses np.nan, string uses pd.NA)
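The difference in null sentinel can be sketched as follows. The first half uses the long-standing pd.NA-based StringDtype. The second half assumes a pandas version whose StringDtype constructor accepts an na_value keyword (an assumption about the installed version), and is skipped otherwise.

```python
import numpy as np
import pandas as pd

# "string": missing values are represented by pd.NA.
na_variant = pd.Series(["a", None], dtype=pd.StringDtype("python"))
print(na_variant.dtype.na_value)
print(na_variant[1] is pd.NA)

# "str"-style variant: missing values are represented by np.nan.
# Assumption: the na_value keyword exists only in recent pandas versions,
# so fall back gracefully when it is not accepted.
try:
    nan_dtype = pd.StringDtype("python", na_value=np.nan)
except TypeError:
    nan_dtype = None

if nan_dtype is not None:
    nan_variant = pd.Series(["a", None], dtype=nan_dtype)
    print(nan_variant[1])
    # Exact equality between the two string variants.
    print(nan_dtype == na_variant.dtype)
```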

We can even consider "string[python]"=="string[pyarrow]" one is replacing in-place, not the other one.

While there are theoretically some differences in how assignment can work with either of these storage types, we don't make any guarantees about that through our API. In-place modification is something we have been moving away from for quite some time; in 3.0 copy-on-write becomes the default behavior for all types.

That speaks to the general issue though in that by exposing type implementation details, we are allowing users to make assumptions about how we manage types that we ourselves do not actually adhere to. We are not developing string[python] as "the type that is more suitable for modification" versus string[pyarrow] as "the type that is read-only optimized". string[python] is only supposed to allow for the same logical interface for handling strings in the absence of pyarrow being installed

@jbrockmendel
Member

== should remain strict on the dtype object. It is often used for e.g. “can I do a binary operation between these two without a cast” (e.g. get_indexer).

Maybe a dtype.logical_dtype attribute, which can be compared with ==?
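One way to picture this suggestion is a free function standing in for the attribute. Everything here is hypothetical: logical_dtype is not a pandas API, and the mapping it uses (any StringDtype collapses to "string", masked dtypes collapse to the name of the NumPy dtype they wrap) is an assumption that follows the int64 description earlier in the thread. In particular it ignores nullability, which is one of the points under debate.

```python
import numpy as np
import pandas as pd


def logical_dtype(dtype):
    """Hypothetical stand-in for the suggested ``dtype.logical_dtype`` attribute."""
    if isinstance(dtype, pd.StringDtype):
        # Any storage variant of the string dtype maps to one logical name.
        return "string"
    # Masked (nullable) dtypes expose the NumPy dtype they wrap.
    wrapped = getattr(dtype, "numpy_dtype", None)
    if wrapped is not None:
        return np.dtype(wrapped).name
    if isinstance(dtype, np.dtype):
        return dtype.name
    # Fallback for dtypes this sketch does not model.
    return type(dtype).__name__


masked = pd.Int64Dtype()
plain = np.dtype("int64")

print(masked == plain)                                # strict comparison stays strict
print(logical_dtype(masked) == logical_dtype(plain))  # logical comparison via ==
print(logical_dtype(pd.Int32Dtype()) == logical_dtype(plain))
```

With this shape, == on the dtype objects keeps its current meaning, and callers opt in to the looser comparison explicitly.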
