How to specify a dtype with a dtype= keyword #155
Comments
If we use strings, I would propose we use the Arrow format strings to be consistent with where we use strings elsewhere. How would canonical names work with things like nullability and categoricals, where presumably we want to capture the category type and index type? We could also consider doing something like having a Data Type class specification that everyone then duck types. |
All types are able to be nullable by design, right? And if it's not needed for your data, why would you need to explicitly express that?
Yeah, that's a more difficult one. You'd need an ABC or duck-typed class, as you said. I have a feeling that we kind of already have a ready-made solution, or better, two: Arrow strings, or the interchange protocol work we did. Neither is particularly user-friendly though, so I'm not quite sure yet what I'd like best here.
The Arrow C interface format strings (https://arrow.apache.org/docs/dev/format/CDataInterface.html#data-type-description-format-strings) are not particularly meant to be human-friendly (descriptive) strings, so I would consider that a big drawback to using them in actual Python code. They also don't cover the categorical data type (in the Arrow C interface, the format string is the index type (e.g. int32), and the categories' type can be inferred separately from the dictionary array).
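To illustrate the readability concern, here is a small sketch that maps a few Arrow C data interface format strings to descriptive names. The format strings themselves come from the Arrow specification; the descriptive names, the two lookup tables, and the `describe` helper are hypothetical and exist only for this illustration.

```python
# A few format strings from the Arrow C data interface specification,
# paired with descriptive names (the names are illustrative, not a standard).
ARROW_FORMAT_TO_NAME = {
    "b": "bool",
    "c": "int8",
    "i": "int32",
    "l": "int64",
    "f": "float32",
    "g": "float64",
    "u": "string",  # UTF-8 string
}

# Timestamp format strings look like "ts" + unit letter + ":" + optional timezone.
ARROW_TIMESTAMP_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


def describe(fmt: str) -> str:
    """Hypothetical helper: turn an Arrow format string into a readable name."""
    if fmt.startswith("ts") and len(fmt) >= 4 and fmt[3] == ":":
        unit = ARROW_TIMESTAMP_UNITS[fmt[2]]
        timezone = fmt[4:] or None  # an empty suffix means "no timezone"
        return f"timestamp[{unit}, tz={timezone}]"
    return ARROW_FORMAT_TO_NAME[fmt]


print(describe("g"))        # float64
print(describe("tsn:UTC"))  # timestamp[ns, tz=UTC]
```

A reader has to know the specification to see that "g" means float64, which is the drawback raised above. As noted, a categorical would only expose its index type (for example "i") through this string.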
Okay, here is a proposal:
    # Similar to pandas.CategoricalDtype
    class CategoricalDtype:
        def __init__(self, categories: Column | None = None, *, ordered: bool = False):
            """
            If categories is None: values-only categorical
            If categories is a Column: dictionary-style categorical
            """

        @property
        def categories(self): ...

        @property
        def is_ordered(self): ...

Datetime dtypes are the trickiest: they may require a datetime/timestamp dtype and a period/duration dtype, and there are multiple ways to go about precision and timezone specifications.

    class DateTime:
        def __init__(self, precision: str, /, *, timezone: str | None = None):
            """
            precision: {'days', 's', 'ms', 'us', 'ns'}
            timezone: string in same format as Arrow accepts
            """

        @property
        def precision(self): ...

        @property
        def timezone(self): ...

    class TimeDelta:
        def __init__(self, precision: str, /):
            """
            precision: {'days', 's', 'ms', 'us', 'ns'}
            """

        @property
        def precision(self): ...
It's not necessarily that they require a class; it could also be a function with parameters? (The class also defines an interface for the return value, but similarly we could only describe that the returned object should have certain attributes present?)
Yes, fair enough, it could be a function as well. Given that it returns a dtype object of some sort with attributes, it seemed equivalent and a bit more natural to model it as a class. But I'm not attached to that choice. |
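The two spellings discussed here can be compared in a short sketch. Everything below is hypothetical: `DateTime` follows the signature proposed earlier in this thread, while `datetime_dtype`, `_DateTimeResult`, and `describe` are invented names used to show the function-based alternative and a consumer that relies only on attributes.

```python
from dataclasses import dataclass
from typing import Optional

_PRECISIONS = {"days", "s", "ms", "us", "ns"}


class DateTime:
    """Class-based spelling, following the signature proposed in this thread."""

    def __init__(self, precision: str, /, *, timezone: Optional[str] = None):
        if precision not in _PRECISIONS:
            raise ValueError(f"unsupported precision: {precision!r}")
        self._precision = precision
        self._timezone = timezone

    @property
    def precision(self) -> str:
        return self._precision

    @property
    def timezone(self) -> Optional[str]:
        return self._timezone


@dataclass(frozen=True)
class _DateTimeResult:
    # Hypothetical library-private type: a spec would only describe its attributes.
    precision: str
    timezone: Optional[str]


def datetime_dtype(precision: str, /, *, timezone: Optional[str] = None):
    """Function-based spelling (hypothetical name)."""
    if precision not in _PRECISIONS:
        raise ValueError(f"unsupported precision: {precision!r}")
    return _DateTimeResult(precision, timezone)


def describe(dtype) -> str:
    # Consumer code duck types: it needs only .precision and .timezone.
    return f"{dtype.precision}/{dtype.timezone}"


print(describe(DateTime("ns", timezone="UTC")))        # ns/UTC
print(describe(datetime_dtype("ns", timezone="UTC")))  # ns/UTC
```

From the consumer's side the two are interchangeable, which matches the point that the choice is mostly about what reads more naturally.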
Also, for the other (numeric) dtypes, we might want to specify some specific behaviours / methods / attributes? (Without specifying this as a class.)
Pretty much, there's not much else that is needed beyond the objects existing under the given names, and then an |
A quick comment on this after the discussion today: it looks like bool/numerical/string/categorical as sketched above are okay, but the datetime dtypes are not. The combo of days + a timezone is going to cause issues, and it seems like the |
should this be pandas doesn't technically have a How do we specify the precision, just with strings (like, Can discuss in the call, just wanted to get some points down
Is this the |
We could also punt on dates for now, AFAIK that's also not supported in the interchange protocol (only datetime). (although timedelta is also missing in the interchange protocol, and I assume we want that, since operations on datetimes can result in timedeltas; that should probably be added to the interchange protocol, though) |
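As a minimal illustration of why a timedelta dtype is hard to avoid once datetimes exist, here is the same effect in Python's standard library (not in any dataframe library; the variable names are arbitrary):

```python
from datetime import datetime, timedelta

start = datetime(2023, 1, 1, 12, 0)
end = datetime(2023, 1, 3, 18, 30)

# Subtracting two datetimes does not give a datetime back; it gives a duration.
elapsed = end - start

print(type(elapsed).__name__)  # timedelta
print(elapsed)                 # 2 days, 6:30:00
```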
we do now have dtypes in the API, so I think this can be closed (can always revisit if the current solution proves problematic in any way) |
DataFrame.from_sequence introduces a dtype= keyword, and specifies it's a string without going into details. This is a potentially tricky topic. There's a number of ways this could go:
dtype='float64'
dtype=float64
Dtype = Tuple[DtypeKind, int, str, str]
My first instinct would be to use canonical names rather than strings, but it depends a bit on how often we'll use dtypes I think.
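The three spellings can be put side by side in a small sketch. The `DtypeKind` values and the tuple layout (kind, bit width, Arrow C format string, endianness) follow the dataframe interchange protocol; the `Float64` class, the `float64` object, and the `normalize` helper are hypothetical stand-ins, not part of any specification.

```python
from enum import IntEnum
from typing import Tuple


class DtypeKind(IntEnum):
    # Values as defined in the dataframe interchange protocol.
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


# (kind, bit width, Arrow C format string, endianness)
Dtype = Tuple[DtypeKind, int, str, str]


class Float64:
    """Hypothetical canonical dtype object exposed under the name `float64`."""


float64 = Float64()

as_string = "float64"                                  # dtype='float64'
as_object = float64                                    # dtype=float64
as_tuple: Dtype = (DtypeKind.FLOAT, 64, "g", "=")      # interchange-style tuple


def normalize(dtype) -> str:
    """Hypothetical helper: map any of the three spellings to one name."""
    if isinstance(dtype, str):
        return dtype
    if isinstance(dtype, Float64):
        return "float64"
    if isinstance(dtype, tuple):
        kind, bitwidth, _fmt, _endianness = dtype
        if kind == DtypeKind.FLOAT:
            return f"float{bitwidth}"
    raise TypeError(f"unrecognized dtype spelling: {dtype!r}")


print(normalize(as_string), normalize(as_object), normalize(as_tuple))
```

The string and the canonical name read almost identically at the call site; the tuple carries the most detail but is the least pleasant to type by hand.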