How to specify a dtype with a `dtype=` keyword #155

rgommers · 2023-04-27T14:03:11Z

DataFrame.from_sequence introduces a dtype= keyword, and specifies it's a string without going into details. This is a potentially tricky topic. There are a number of ways this could go:

- Indeed use strings, e.g. dtype='float64'
- Use canonical names like in the array API, e.g. dtype=float64
- Use the same as done in the interchange protocol, e.g. Dtype = Tuple[DtypeKind, int, str, str]

My first instinct would be to use canonical names rather than strings, but it depends a bit on how often we'll use dtypes I think.
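
To make the three options concrete, here is a minimal sketch of how a 64-bit float could be spelled under each one. The `DtypeKind` enum and the `float64` object are defined locally as stand-ins for illustration; they are assumptions, not an API that the standard defines.

```python
from enum import IntEnum
from typing import Tuple


class DtypeKind(IntEnum):
    # Stand-in that mirrors a few kinds from the interchange protocol.
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20


class _Float64:
    """Hypothetical plain dtype object exposed under a canonical name."""

    def __repr__(self) -> str:
        return "float64"


float64 = _Float64()

# Option 1: a string.
dtype_as_string = "float64"
# Option 2: a canonical name object.
dtype_as_name = float64
# Option 3: (kind, bit width, Arrow format string, endianness).
dtype_as_tuple: Tuple[DtypeKind, int, str, str] = (DtypeKind.FLOAT, 64, "g", "=")
```

The tuple form carries the most information but is the least pleasant to type, which is the trade-off raised above.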

kkraus14 · 2023-04-27T16:49:01Z

If we use strings, I would propose we use the Arrow format strings to be consistent with where we use strings elsewhere.

How would canonical names work with things like nullability and categoricals, where presumably we want to capture the category type and index type?

We could also consider doing something like having a Data Type class specification that everyone then duck types.

rgommers · 2023-04-27T17:02:32Z

> How would canonical names work with things like nullability

All types are able to be nullable by design, right? And if it's not needed for your data, why would you need to explicitly express that?

> and categoricals, where presumably we want to capture the category type and index type?

yeah, that's a more difficult one. You'd need an ABC or duck type class as you said.

I have a feeling that we kinda already have a ready-made solution - or better, two. Arrow strings, or the interchange protocol work we did. Neither is particularly user-friendly though, so not quite sure yet what I'd like best here.

jorisvandenbossche · 2023-04-28T14:41:01Z

The Arrow C interface format strings (https://arrow.apache.org/docs/dev/format/CDataInterface.html#data-type-description-format-strings) are not particularly meant to be human-friendly (descriptive) strings, so I would consider that a big drawback of using them in actual Python code.

They also don't cover the categorical data type (in the Arrow C interface, the format string is that of the index type (e.g. int32), and the categories' type can be inferred separately from the dictionary array).
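
To illustrate the readability concern, here is a small sketch that translates a handful of Arrow C data interface format strings into descriptive names. The `describe` helper and the output spelling are made up for this example; only the format strings themselves come from the Arrow documentation.

```python
# A few primitive format strings from the Arrow C data interface.
ARROW_FORMAT_TO_NAME = {
    "b": "bool",
    "i": "int32",
    "l": "int64",
    "f": "float32",
    "g": "float64",
    "u": "string",
}

# Timestamp format strings look like "ts<unit>:<timezone>".
ARROW_TIMESTAMP_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


def describe(format_string: str) -> str:
    """Hypothetical helper: turn an Arrow format string into a readable name."""
    if format_string.startswith("ts"):
        unit, _, timezone = format_string[2:].partition(":")
        precision = ARROW_TIMESTAMP_UNITS[unit]
        if timezone:
            return f"timestamp[{precision}, tz={timezone}]"
        return f"timestamp[{precision}]"
    return ARROW_FORMAT_TO_NAME[format_string]
```

A user would have to remember that `"g"` means float64 and `"tsn:UTC"` means a nanosecond timestamp in UTC, which is the drawback described above.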

rgommers · 2023-05-10T20:46:27Z

Okay, here is a proposal:

- For dtypes supported by the array API standard, we use the same design, so this gives us a bool, uintX, intX and floatX (in the top-level namespace, or in a dtypes sub-namespace if preferred). These are basically plain objects with given names, not much more than that.
  - Do we want/need complex64/complex128 too, or not? My assumption is not.
- No string aliases allowed, to stay with a single way of doing things.
- For strings we can do the same, because I don't think there's anything we need beyond a name. So string. That'd be a variable-length string with support for nulls.
- Categorical and datetime dtypes are the harder ones. These seem to require classes that can be instantiated:

from __future__ import annotations  # keeps the Column annotation unevaluated

# Similar to pandas.CategoricalDtype
class CategoricalDtype:
    def __init__(self, categories: Column | None = None, *, ordered: bool = False):
        """
        If categories is None: values-only categorical
        If categories is a Column: dictionary-style categorical
        """

    @property
    def categories(self): ...

    @property
    def is_ordered(self): ...

Datetime dtypes are the trickiest: they may require a datetime/timestamp and a period/duration dtype, and there are multiple ways to go about precision and timezone specifications.

class DateTime:
    def __init__(self, precision: str, /, *, timezone: str | None = None):
        """
        precision: {'days', 's', 'ms', 'us', 'ns'}
        timezone: string in same format as Arrow accepts
        """

    @property
    def precision(self): ...

    @property
    def timezone(self): ...

class TimeDelta:
    def __init__(self, precision: str, /):
        """
        precision: {'days', 's', 'ms', 'us', 'ns'}
        """

    @property
    def precision(self): ...

jorisvandenbossche · 2023-05-11T15:29:30Z

> These seem to require classes that can be instantiated:

It's not necessarily that they require a class, it could also be a function with parameters? (namespace.categorical(categories, ordered))

The class also defines an interface for the return value, but similarly we could only describe that the returned object should have certain attributes present?

rgommers · 2023-05-11T15:39:29Z

Yes, fair enough, it could be a function as well. Given that it returns a dtype object of some sort with attributes, it seemed equivalent and a bit more natural to model it as a class. But I'm not attached to that choice.
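
For comparison, the function-based variant could look roughly like the sketch below: the spec would describe only the attributes of the returned object, and each library would return whatever object it likes. `CategoricalLike`, `_Categorical` and `categorical` are hypothetical names, and a plain sequence stands in for a Column.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence, runtime_checkable


@runtime_checkable
class CategoricalLike(Protocol):
    """Attributes a categorical dtype object would need to expose."""

    @property
    def categories(self) -> Optional[Sequence]: ...

    @property
    def is_ordered(self) -> bool: ...


@dataclass(frozen=True)
class _Categorical:
    # One possible library-specific object that satisfies the protocol.
    categories: Optional[Sequence] = None
    is_ordered: bool = False


def categorical(
    categories: Optional[Sequence] = None, *, ordered: bool = False
) -> CategoricalLike:
    """Hypothetical factory function; the return type is duck-typed."""
    return _Categorical(categories=categories, is_ordered=ordered)
```

Either way the user-facing call is nearly identical, so the choice mostly affects how the spec is written rather than how it is used.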

jorisvandenbossche · 2023-05-11T15:43:21Z

Also for the other (numeric) dtypes we might want to specify some specific behaviours / methods / attributes? (without this being specified as a class)
Although the array API seems to only specify a requirement for __eq__?

rgommers · 2023-05-11T16:09:29Z

> Although the array API seems to only specify a requirement for __eq__?

Pretty much, there's not much else that is needed beyond the objects existing under the given names, and then an isdtype function to deal with checking their properties (guess we may want that here too).

rgommers · 2023-05-11T18:05:21Z

A quick comment on this after the discussion today: it looks like bool/numerical/string/categorical as sketched above are okay, but the datetime dtypes are not. The combo of days + a timezone is going to cause issues, and it seems like the DateTime class should instead be separated into TimeStamp and Date.

MarcoGorelli · 2023-05-25T15:13:40Z

> separated into TimeStamp and Date.

should this be Date and Datetime? With the former just being Date, and the second one having precision and time_zone. I presume Date doesn't need a precision?

pandas doesn't technically have a Date type, but in its Standard implementation I think we could wrap its Datetime and have it behave as a date, possibly with some extra validation steps necessary

How do we specify the precision, just with strings (like, "ns", "us", ...)? Which precisions should be accepted? Do all of them need to be there? For example pandas has ['s', 'ms', 'us', 'ns'] whereas polars has ['ms', 'us', 'ns']. Should 's' be part of the Standard, and polars just raises NotImplementedError if that's requested?
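
One way this could play out, as a sketch only: the Standard fixes a set of precision strings, and an implementation that lacks one of them raises `NotImplementedError`. The set `STANDARD_PRECISIONS`, the `supported_precisions` attribute and the `time_zone` spelling are assumptions for illustration, since none of them is settled in this thread.

```python
from typing import Optional

# Assumed set of precisions; the actual list is still an open question.
STANDARD_PRECISIONS = ("s", "ms", "us", "ns")


class Date:
    """Calendar date; carries no precision or time zone."""


class Datetime:
    # Precisions this particular (hypothetical) implementation supports.
    supported_precisions = ("ms", "us", "ns")

    def __init__(self, precision: str, /, *, time_zone: Optional[str] = None) -> None:
        if precision not in STANDARD_PRECISIONS:
            raise ValueError(f"unknown precision: {precision!r}")
        if precision not in self.supported_precisions:
            raise NotImplementedError(f"precision {precision!r} is not supported")
        self.precision = precision
        self.time_zone = time_zone
```

Keeping `Date` free of parameters sidesteps the days-plus-timezone combination that was flagged as problematic.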

Can discuss in the call, just wanted to get some points down

> period/duration dtype

This is the timedelta kind of thing we want, right? I'd be pretty keen to avoid pandas' Period dtype here 😄

jorisvandenbossche · 2023-05-25T17:57:17Z

We could also punt on dates for now, AFAIK that's also not supported in the interchange protocol (only datetime).

(although timedelta is also missing in the interchange protocol, and I assume we want that, since operations on datetimes can result in timedeltas; that should probably be added to the interchange protocol, though)
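
The point that datetime operations produce timedeltas can be seen with Python's standard library alone. This is only an analogy for the column-level behaviour, not a statement about the Standard's API.

```python
from datetime import datetime, timedelta

start = datetime(2023, 5, 25, 15, 0, 0)
end = datetime(2023, 5, 25, 17, 30, 0)

# Subtracting two datetimes yields a timedelta, so a datetime dtype
# without a matching timedelta dtype leaves this result untyped.
elapsed = end - start
```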

MarcoGorelli · 2023-10-27T14:58:42Z

we do now have dtypes in the API, so I think this can be closed (can always revisit if the current solution proves problematic in any way)

rgommers added the API design label Apr 27, 2023

rgommers mentioned this issue Apr 27, 2023

add Column.from_sequence #148

Merged

MarcoGorelli mentioned this issue Apr 28, 2023

How to type return value of Column reductions #158

Closed

rgommers mentioned this issue May 10, 2023

move from_sequence to namespace #164

Merged

rgommers mentioned this issue May 11, 2023

add Column.dtype #166

Merged

MarcoGorelli closed this as completed Oct 27, 2023
