How to specify a dtype with a dtype= keyword #155

Closed
rgommers opened this issue Apr 27, 2023 · 12 comments

Comments

@rgommers
Member

DataFrame.from_sequence introduces a dtype= keyword and specifies that it's a string, without going into details. This is a potentially tricky topic. There are a number of ways this could go:

  • Indeed use strings, e.g. dtype='float64'
  • Use canonical names like in the array API, e.g. dtype=float64
  • Use the same as done in the interchange protocol, e.g. Dtype = Tuple[DtypeKind, int, str, str]

My first instinct would be to use canonical names rather than strings, but it depends a bit on how often we'll use dtypes I think.
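To make the three options concrete, here is a minimal sketch of how the same float64 dtype could be spelled in each style. The `DtypeKind` enum below is a local copy of the values used by the interchange protocol, and `_Float64`/`float64` are hypothetical stand-ins for a canonical-name object; none of this is a proposed API.

```python
from enum import IntEnum
from typing import Tuple


class DtypeKind(IntEnum):
    # Local copy of the interchange protocol's dtype kinds, for illustration.
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


class _Float64:
    """Hypothetical canonical-name dtype object (a plain named object)."""

    def __repr__(self) -> str:
        return "float64"


float64 = _Float64()

# Interchange-protocol style: (kind, bit width, Arrow format string, endianness)
Dtype = Tuple[DtypeKind, int, str, str]

as_string = "float64"                               # option 1: plain string
as_canonical = float64                              # option 2: canonical name
as_tuple: Dtype = (DtypeKind.FLOAT, 64, "g", "=")   # option 3: interchange tuple
```

The tuple form carries the most information but is the least pleasant to type, which is the trade-off the rest of the thread is about.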

@kkraus14
Collaborator

If we use strings, I would propose we use the Arrow format strings to be consistent with where we use strings elsewhere.

How would canonical names work with things like nullability and categoricals, where presumably we want to capture the category type and index type?

We could also consider doing something like having a Data Type class specification that everyone then duck types.

@rgommers
Member Author

How would canonical names work with things like nullability

All types are able to be nullable by design, right? And if it's not needed for your data, why would you need to explicitly express that?

and categoricals, where presumably we want to capture the category type and index type?

yeah, that's a more difficult one. You'd need an ABC or duck type class as you said.

I have a feeling that we kinda already have a ready-made solution - or better, two. Arrow strings, or the interchange protocol work we did. Neither are particularly user-friendly though, so not quite sure yet what I'd like best here.

@jorisvandenbossche
Member

The Arrow C interface format strings (https://arrow.apache.org/docs/dev/format/CDataInterface.html#data-type-description-format-strings) are not particularly meant to be human-friendly (descriptive) strings, so I would consider that a big drawback to using them in actual Python code.

They also don't cover the categorical data type (in the Arrow C interface, the format string is the index type (e.g. int32), and the categories' type can be inferred separately from the dictionary array).
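As an illustration of why these strings are terse, here is a small sketch that decodes a handful of Arrow C data interface format strings into descriptive names. The lookup table covers only a subset of the format, and the `describe` helper and its output wording are hypothetical, written for this example only.

```python
# Subset of the Arrow C data interface format strings (primitive types only).
_SIMPLE = {
    "b": "bool",
    "c": "int8", "C": "uint8",
    "s": "int16", "S": "uint16",
    "i": "int32", "I": "uint32",
    "l": "int64", "L": "uint64",
    "e": "float16", "f": "float32", "g": "float64",
    "u": "string (utf-8)",
}

# Timestamp formats look like "tsu:UTC": a unit prefix, a colon, then a timezone.
_TIMESTAMP_UNITS = {"tss": "s", "tsm": "ms", "tsu": "us", "tsn": "ns"}


def describe(fmt: str) -> str:
    """Return a human-readable description of an Arrow format string."""
    if fmt in _SIMPLE:
        return _SIMPLE[fmt]
    prefix, sep, tz = fmt.partition(":")
    if sep and prefix in _TIMESTAMP_UNITS:
        unit = _TIMESTAMP_UNITS[prefix]
        return f"timestamp[{unit}, tz={tz}]" if tz else f"timestamp[{unit}]"
    raise ValueError(f"format string not covered by this sketch: {fmt!r}")


print(describe("g"))        # float64
print(describe("tsu:UTC"))  # timestamp[us, tz=UTC]
```

Note that a dictionary-encoded (categorical) column has no entry of its own here: its format string is just that of the index type, as described above.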

@rgommers
Member Author

Okay, here is a proposal:

  • For dtypes supported by the array API standard, we use the same design, so this gives us a bool, uintX, intX and floatX (in the top-level namespace, or in a dtypes sub-namespace if preferred). These are basically plain objects with given names, not much more than that.
    • Do we want/need complex64/complex128 too, or not? My assumption is not.
  • No string aliases allowed, to stay with a single way of doing things.
  • For strings we can do the same, because I don't think there's anything we need beyond a name. So string. That'd be a variable-length string with support for nulls.
  • Categorical and datetime dtypes are the harder ones. These seem to require classes that can be instantiated:
# Similar to pandas.CategoricalDtype
class CategoricalDtype:
    # "Column" is the standard's column type; quoted so the sketch parses on its own
    def __init__(self, categories: "Column | None" = None, *, ordered: bool = False):
        """
        If categories is None: values-only categorical
        If categories is a Column: dictionary-style categorical
        """

    @property
    def categories(self): ...

    @property
    def is_ordered(self): ...

Datetime dtypes are the trickiest: they may require both a datetime/timestamp and a period/duration dtype, and there are multiple ways to go about precision and timezone specifications.

class DateTime:
    def __init__(self, precision: str, /, *, timezone: str | None = None):
        """
        precision: {'days', 's', 'ms', 'us', 'ns'}
        timezone: string in same format as Arrow accepts
        """

    @property
    def precision(self): ...

    @property
    def timezone(self): ...

class TimeDelta:
    def __init__(self, precision: str, /):
        """
        precision: {'days', 's', 'ms', 'us', 'ns'}
        """

    @property
    def precision(self): ...
@jorisvandenbossche
Member

These seem to require classes that can be instantiated:

It's not necessarily that they require a class, it could also be a function with parameters? (namespace.categorical(categories, ordered))

The class also defines an interface for the return value, but similarly we could only describe that the returned object should have certain attributes present?
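A minimal sketch of the function-based alternative: a factory function returns some object, and the only requirement is that the result exposes the agreed attributes. The `_Categorical` dataclass is a hypothetical implementation detail of one library; another library could return a completely different type with the same `categories` and `is_ordered` attributes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class _Categorical:
    # Private to this (hypothetical) implementation; users never name this class.
    categories: Optional[Tuple[object, ...]]
    is_ordered: bool


def categorical(categories: Optional[Iterable[object]] = None, ordered: bool = False):
    """Factory for a categorical dtype; callers rely on attributes, not the type."""
    cats = None if categories is None else tuple(categories)
    return _Categorical(categories=cats, is_ordered=ordered)


values_only = categorical()
with_categories = categorical(["low", "mid", "high"], ordered=True)
print(with_categories.categories, with_categories.is_ordered)
```

With this shape, the specification would describe the function signature plus the attributes of the returned object, rather than a class to be instantiated.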

@rgommers
Member Author

Yes, fair enough, it could be a function as well. Given that it returns a dtype object of some sort with attributes, it seemed equivalent and a bit more natural to model it as a class. But I'm not attached to that choice.

@jorisvandenbossche
Member

Also for the other (numeric) dtypes we might want to specify some specific behaviours / methods / attributes? (without this being specified as a class)
Although the array API seems to only specify a requirement for __eq__?

@rgommers
Member Author

Although the array API seems to only specify a requirement for __eq__?

Pretty much, there's not much else that is needed beyond the objects existing under the given names, and then an isdtype function to deal with checking their properties (guess we may want that here too).
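A rough sketch of that idea, assuming plain dtype objects that only support `__eq__`, plus an `isdtype` helper modeled on the array API function of the same name. The `DType` class, the `kind` attribute, and the kind groupings are assumptions for this example; "numeric" omits complex dtypes here, following the assumption earlier in the thread that they are not needed.

```python
class DType:
    """Hypothetical plain dtype object: a name, a kind, and equality."""

    def __init__(self, name: str, kind: str) -> None:
        self.name = name
        self.kind = kind

    def __eq__(self, other: object) -> bool:
        return isinstance(other, DType) and self.name == other.name

    def __hash__(self) -> int:
        return hash(self.name)

    def __repr__(self) -> str:
        return self.name


bool_ = DType("bool", "bool")
int64 = DType("int64", "signed integer")
uint8 = DType("uint8", "unsigned integer")
float64 = DType("float64", "real floating")

# Kind names that stand for a group of more specific kinds.
_KIND_GROUPS = {
    "integral": {"signed integer", "unsigned integer"},
    "numeric": {"signed integer", "unsigned integer", "real floating"},
}


def isdtype(dtype: DType, kind) -> bool:
    """Check a dtype against a kind string, a dtype object, or a tuple of either."""
    if isinstance(kind, tuple):
        return any(isdtype(dtype, k) for k in kind)
    if isinstance(kind, str):
        return dtype.kind in _KIND_GROUPS.get(kind, {kind})
    return dtype == kind


print(isdtype(int64, "integral"))            # True
print(isdtype(float64, ("bool", int64)))     # False
```

User code then never needs to inspect dtype internals: it compares with `==` or asks `isdtype`.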

@rgommers
Member Author

rgommers commented May 11, 2023

A quick comment on this after the discussion today: it looks like bool/numerical/string/categorical as sketched above are okay, but the datetime dtypes are not. The combo of days + a timezone is going to cause issues, and it seems like the DateTime class should instead be separated into TimeStamp and Date.

@MarcoGorelli
Contributor

separated into TimeStamp and Date.

should this be Date and Datetime? With the former just being Date, and the second one having precision and time_zone. I presume Date doesn't need a precision?

pandas doesn't technically have a Date type, but in its Standard implementation I think we could wrap its Datetime and have it behave as a date, possibly with some extra validation steps necessary

How do we specify the precision, just with strings (like, "ns", "us", ...)? Which precisions should be accepted? Do all of them need to be there? For example pandas has ['s', 'ms', 'us', 'ns'] whereas polars has ['ms', 'us', 'ns']. Should 's' be part of the Standard, and polars just raises NotImplementedError if that's requested?
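One way the questions above could play out, as a sketch only: the Standard accepts a fixed set of precision strings, and each implementation raises NotImplementedError for the ones its backend lacks. The `Datetime` class, the `STANDARD_PRECISIONS` tuple, and the `resolve_precision` helper are all hypothetical names for this example.

```python
from typing import Optional, Sequence

# Assumed set of precisions accepted by the Standard (an open question above).
STANDARD_PRECISIONS = ("s", "ms", "us", "ns")


class Datetime:
    """Hypothetical datetime dtype with a precision and an optional time zone."""

    def __init__(self, precision: str, /, *, time_zone: Optional[str] = None) -> None:
        if precision not in STANDARD_PRECISIONS:
            raise ValueError(f"unknown precision: {precision!r}")
        self.precision = precision
        self.time_zone = time_zone


def resolve_precision(dtype: Datetime, supported: Sequence[str]) -> str:
    """Return the precision if the backend supports it, else raise."""
    if dtype.precision not in supported:
        raise NotImplementedError(
            f"precision {dtype.precision!r} is not supported by this backend"
        )
    return dtype.precision


# A backend that, like the polars example above, lacks second precision.
backend_precisions = ("ms", "us", "ns")
print(resolve_precision(Datetime("us"), backend_precisions))  # us
```

Here an invalid string is a ValueError for every implementation, while a valid but unsupported precision is a NotImplementedError only for the backend that lacks it.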

Can discuss in the call, just wanted to get some points down

period/duration dtype

This is the timedelta kind of thing we want, right? I'd be pretty keen to avoid pandas' Period dtype here 😄

@jorisvandenbossche
Member

We could also punt on dates for now, AFAIK that's also not supported in the interchange protocol (only datetime).

(although timedelta is also missing in the interchange protocol, and I assume we want that, since operations on datetimes can result in timedeltas; that should probably be added to the interchange protocol, though)
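The point that datetime operations produce timedeltas can be seen with Python's standard library alone; this is only an illustration of the general behaviour, not of any dataframe API.

```python
from datetime import datetime, timedelta, timezone

# Two timezone-aware datetimes.
opened = datetime(2023, 4, 27, tzinfo=timezone.utc)
commented = datetime(2023, 5, 11, tzinfo=timezone.utc)

# Subtracting datetimes yields a timedelta, not a datetime.
elapsed = commented - opened
print(type(elapsed).__name__, elapsed.days)  # timedelta 14
```

A dataframe library that supports datetime subtraction therefore needs some dtype to hold the result, which is the argument for including a timedelta/duration dtype.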

@MarcoGorelli
Contributor

we do now have dtypes in the API, so I think this can be closed (can always revisit if the current solution proves problematic in any way)

4 participants