ENH: Consistent naming conventions for string dtype aliases #58141
Comments
Meant to tag @jorisvandenbossche.
I like this idea, though as I mentioned at the sprint I think we should avoid "backend". Maybe dtype "family"?
Maybe "type provider"? |
Thinking through some more what I suggested above, I am wondering now if it is even worth trying to support string aliases, or if we should push users towards a more explicit dtype construction. This would be a change from where we are today but could be better in the long run (?)
As an exercise I tried to map out all of the types that pandas supports today (or reasonably could in the near term) and place them in a hierarchy. Here is what I was able to come up with. Tagging @pandas-dev/pandas-core in case this is of use to the larger team. The graphviz used to build this:

```dot
digraph type_graph {
    node [shape=box];
    "type" -> "scalar"
    "scalar" -> "numeric"
    "numeric" -> "integral"
    "integral" -> "unsigned"
    "numeric" -> "floating point"
    "numeric" -> "fixed point"
    "scalar" -> "boolean"
    "scalar" -> "temporal"
    "temporal" -> "datetime"
    "temporal" -> "duration"
    "temporal" -> "interval"
    "scalar" -> "binary"
    "binary" -> "string"
    "scalar" -> "categorical"
    "scalar" -> "sparse"
    "type" -> "aggregate"
    "aggregate" -> "struct"
    "aggregate" -> "dictionary"
}
```
From a typing perspective, supporting all the different string versions of valid types is a burden. Having said that, if we are to deprecate the strings, we'd probably need a PDEP for that....
I would be supportive of this as well. Especially for dtypes-as-strings that take parameters (timezone types, decimal types), it would be great to replace string parsing with explicit dtype object construction.
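To make the parsing-vs-construction point concrete, here is the same parametrized dtype specified both ways, using only APIs that exist in pandas today:

```python
import pandas as pd

idx = pd.to_datetime(["2024-01-01", "2024-01-02"], utc=True)

# Parametrized dtype as a string: pandas must parse the unit and
# timezone out of the alias.
ser1 = pd.Series(idx, dtype="datetime64[ns, UTC]")

# The same dtype built explicitly as an object: no string parsing involved.
ser2 = pd.Series(idx, dtype=pd.DatetimeTZDtype(unit="ns", tz="UTC"))

assert ser1.dtype == ser2.dtype
```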
To your original point, I very much agree with this (at least for the physical storage; not necessarily for nullability semantics, because I personally think we should move to just having one nullability semantic, but that's the topic for another PDEP).

This is a topic that I brought up last summer during the sprint but never got around to writing up publicly. The summary is that I would like to see us move to just having "pandas" dtypes, at least for the majority of users that don't need to know the lower-level details. The current string aliases for non-default dtypes are, I think, mostly a band-aid to let people more easily specify those dtypes, and I fully agree they aren't very pretty. I do think it will be hard (and maybe not even desirable) to fully do away with string aliases, though, at least for the default data types, because their use is so widespread.
So maybe then for each category in the type hierarchy above we have wrappers with signatures like:

```python
class pd.int8(dtype_backend="pyarrow"): ...
class pd.string(dtype_backend="pyarrow", nullability="numpy"): ...
class pd.datetime(dtype_backend="pyarrow", unit="us", tz=None): ...
class pd.list(value_type, dtype_backend="pyarrow"): ...
class pd.categorical(key_type="infer", value_type="infer", dtype_backend="pandas"): ...
```

I know @jbrockmendel prefers something besides "backend" for the keyword name.
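As a rough sketch of how such a wrapper could resolve to today's concrete dtype objects: the `string` factory below is hypothetical, but the `pd.StringDtype` storage values it dispatches to are real (the `"pyarrow"` storage requires pyarrow to be installed):

```python
import pandas as pd

def string(backend: str = "numpy") -> pd.api.extensions.ExtensionDtype:
    """Hypothetical logical-dtype factory: map a 'string' request plus a
    backend hint onto one of the concrete dtypes pandas already ships."""
    if backend == "numpy":
        return pd.StringDtype(storage="python")   # object-array backed
    if backend == "pyarrow":
        return pd.StringDtype(storage="pyarrow")  # Arrow backed
    raise ValueError(f"unknown backend: {backend!r}")

ser = pd.Series(["a", "b", None], dtype=string("pyarrow"))
```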
I was thinking this as well.
Yea, this would be a long process. I think what's hard about the string alias is that it only works for very basic types. It has been and would continue to be relatively easy for users to just say "int64" and get a 64-bit integer irrespective of what that is backed by, but if the user wants to create a list column they can't just say "list". I think users will end up with a frankenstein of string aliases alongside arguments like `dtype_backend` for anything more complex.
I agree. One possibility to consider is to limit the number of string aliases to simple types.
I found the notebook that I presented at the sprint last summer. It's a little bit off topic for the discussion just about string aliases, but I think it is relevant for the bigger picture (which we need to look at anyway if we are considering moving away from string aliases), so I'm just dumping the content here (updated a little bit). I would like to have "pandas data types" with a consistent interface:
For example, for datetime-like data, we currently have:

```python
# current options
ser.astype("datetime64[ms]")
# vs
ser.astype("timestamp[us][pyarrow]")

# How to specify a "datetime" dtype while being agnostic about the exact
# backend you are using?
# -> should use a single name and pick the default backend based on your settings
ser.astype(pd.datetime("ns"))
# or
ser.astype(pd.timestamp("ns"))

# for users that want to be explicit
ser.astype(pd.datetime("ns", backend=".."))
```

Logical data types vs physical data types:
For pandas, I think most users should care about logical data types, and not too much about the physical data type (we can choose the best default, and advanced users can give hints about which to use for performance optimizations).

Assuming we want a single pandas interface to all dtypes, we need to decide: either we use "backend-parametrized" classes:

```python
pd.StringDtype(), pd.StringDtype(backend="arrow"), pd.StringDtype(backend="numpy")
isinstance(dtype, pd.StringDtype)
```

-> but that means committing to the approach of the current classes. Or we could have different classes, but then we definitely need the functional interface and dtype-checking helpers (because `isinstance` then doesn't work):

```python
pd.string(), pd.string(backend="arrow"), pd.string(backend="numpy")
pd.api.types.is_string(..)
```

(and maybe more). In this case we are more free to keep whatever class structure we want under the hood.
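A sketch of what such a dtype-checking helper could look like against today's dtypes; `is_string` here is the hypothetical helper from the comment above, not an existing pandas function, and the Arrow-backed dtypes require pyarrow:

```python
import pandas as pd
import pyarrow as pa

def is_string(dtype) -> bool:
    """Hypothetical pd.api.types.is_string: recognize a logical string
    dtype regardless of which backend implements it."""
    if isinstance(dtype, pd.StringDtype):   # "string", "string[pyarrow]", ...
        return True
    if isinstance(dtype, pd.ArrowDtype):    # pd.ArrowDtype(pa.string())
        return dtype.pyarrow_dtype == pa.string()
    return False

assert is_string(pd.StringDtype(storage="pyarrow"))
assert is_string(pd.ArrowDtype(pa.string()))
```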
I forget the details, but I remember finding Joris's presentation at the sprint compelling.
This is an interesting example, but do we even need to support the pyarrow date64? I'm not really clear what advantages that has over date32. Per the hierarchy above I would just abstract this as a single logical "date" type. Outside of date types, I do see that issue with strings, where the aliases already conflate the backend and the nullability semantics.
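For reference, the only way to get a date dtype in pandas today is via the Arrow backend, which is what a logical "date" type (hypothetical) could abstract over:

```python
from datetime import date

import pandas as pd
import pyarrow as pa

# date32 vs date64 is an Arrow physical-layout choice that a logical
# "date" dtype could hide from the user.
ser = pd.Series([date(2024, 1, 1)], dtype=pd.ArrowDtype(pa.date32()))
print(ser.dtype)  # date32[day][pyarrow]
```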
In an ideal world I would be indifferent, but the problem with the class constructors is that they already exist (pd.StringDtype, pd.Int64Dtype, etc...). Repurposing them might only add to the confusion. Overall, though, I agree with your sentiment of starting from a place where we think in terms of logical data types foremost, which should cover the majority of use cases, and then giving some control over the physical data types via keyword arguments or options.
Is this in reference to how nulls are stored, or how they are expressed to the end user? Storage-wise, I feel like it would be a mistake to stray from the Arrow implementation.
I think we already have both types somewhat, so we will need to clean this up to a certain extent whichever choice we make. So while those class constructors indeed already exist, I think we have to repurpose or change the existing ones (and add new ones) to some extent anyway. And the fact that we have those classes right now doesn't mean we can't decide to hide them more from the user by providing an alternative. I don't think there are that many users yet that use those constructors directly.
In the first place, in reference to how they are expressed to the end user, because IMO that's the most important aspect (since we are talking about the user interface for how dtypes are specified / presented). Personally I would also prefer a consistent implementation storage-wise, but that's more of an implementation detail that we could discuss/compromise on per dtype.
Feature Type

- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Right now the string aliases for our types are inconsistent. For example, "int8" gives a NumPy-backed integer, "Int8" gives the pandas nullable (masked) integer, and "int8[pyarrow]" gives the Arrow-backed one.
Strings have a similar inconsistency with "string", "string[pyarrow]" and "string[pyarrow_numpy]".
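The inconsistency in concrete terms (all of these work in pandas today; the pyarrow-backed variants require pyarrow to be installed):

```python
import pandas as pd

pd.Series([1, 2], dtype="int8")            # NumPy-backed, NaN-style missing values
pd.Series([1, 2], dtype="Int8")            # pandas nullable (masked), pd.NA
pd.Series([1, 2], dtype="int8[pyarrow]")   # Arrow-backed, pd.NA

pd.Series(["a"], dtype="string")           # pandas StringDtype, pd.NA
pd.Series(["a"], dtype="string[pyarrow]")  # Arrow storage, pd.NA semantics
```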
Feature Description
I think we should create "int8[numpy]" and "int8[pandas]" aliases to stay consistent with pyarrow. This also has the advantage of decoupling "int8" from NumPy, so perhaps in the future a backend setting could determine whether NumPy or pyarrow types are returned.

The pattern thus becomes "data_type[backend]", with the exception of "string[pyarrow_numpy]", which combines the backend and the nullability semantics. I am less sure what to do in that case; maybe even that should be called "string[pyarrow, numpy]", where the second argument is the nullability?

In any case, I am just hoping we can start to detach the logical type from the physical storage / nullability semantics with a well-defined pattern.
@phofl
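A toy sketch of how the proposed "data_type[backend]" pattern, including the optional ", nullability" extension floated above, could be parsed. Everything here is hypothetical illustration, not pandas API, and the defaults are assumptions:

```python
import re

# Pattern: data_type[backend] or data_type[backend, nullability]
ALIAS_RE = re.compile(
    r"^(?P<type>\w+)"
    r"(\[(?P<backend>\w+)(,\s*(?P<nullability>\w+))?\])?$"
)

def parse_alias(alias: str) -> tuple[str, str, str]:
    m = ALIAS_RE.match(alias)
    if m is None:
        raise ValueError(f"invalid dtype alias: {alias!r}")
    # Defaults chosen for illustration only.
    return (m["type"], m["backend"] or "numpy", m["nullability"] or "default")

assert parse_alias("int8") == ("int8", "numpy", "default")
assert parse_alias("int8[pandas]") == ("int8", "pandas", "default")
assert parse_alias("string[pyarrow, numpy]") == ("string", "pyarrow", "numpy")
```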
Alternative Solutions
n/a
Additional Context
No response