
BUG: Cannot create third-party ExtensionArrays for datetime types (xfail) #34987


Merged · 2 commits · Jan 14, 2021
54 changes: 54 additions & 0 deletions pandas/tests/extension/arrow/test_timestamp.py
@@ -0,0 +1,54 @@
import datetime
from typing import Type

import pytest

import pandas as pd
from pandas.api.extensions import ExtensionDtype, register_extension_dtype

pytest.importorskip("pyarrow", minversion="0.13.0")

Contributor:

this could technically be later, but ok for now


import pyarrow as pa # isort:skip

from .arrays import ArrowExtensionArray # isort:skip


@register_extension_dtype
class ArrowTimestampUSDtype(ExtensionDtype):

    type = datetime.datetime
    kind = "M"
    name = "arrow_timestamp_us"
    na_value = pa.NULL

    @classmethod
    def construct_array_type(cls) -> Type["ArrowTimestampUSArray"]:
        """
        Return the array type associated with this dtype.

        Returns
        -------
        type
        """
        return ArrowTimestampUSArray


class ArrowTimestampUSArray(ArrowExtensionArray):
    def __init__(self, values):
        if not isinstance(values, pa.ChunkedArray):
            raise ValueError

        assert values.type == pa.timestamp("us")
        self._data = values
        self._dtype = ArrowTimestampUSDtype()


def test_constructor_extensionblock():

Member:

should be xfailed?

Contributor Author:

I can xfail this, so this can be merged. I would prefer to fix this myself though.
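
For reference, a minimal sketch of how the test could be marked as an expected failure with pytest; the reason string here is illustrative, not taken from the PR:

@pytest.mark.xfail(
    reason="GH 34986: block creation does not yet support third-party datetime ExtensionArrays"
)
def test_constructor_extensionblock():
    ...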

Contributor Author:

I just need a pointer to which code section I should apply a fix. Should I change the order in pandas/pandas/core/internals/blocks.py so that we only create a DatetimeTZBlock for pandas-provided datetime-based ExtensionArrays, or shouldn't is_datetime64tz_dtype return True for my ExtensionDtype?

Member:

> I would prefer to fix this myself though.

Sounds good. Is this a use case you have a need to get working near-term, or more of a Principle Of The Thing? I ask because...

> I just need a pointer to which code section I should apply a fix.

This is pretty daunting, as I expect this is scattered across the code. There are lots of places where we either a) implicitly assume nanoseconds or b) check dtype.kind in ["M", "m"] (much more performant than the is_foo_dtype checks).

> Should I change the order in pandas/pandas/core/internals/blocks.py so that we only create a DatetimeTZBlock for pandas-provided datetime-based ExtensionArrays

That will probably be part of a solution.

> or shouldn't is_datetime64tz_dtype return True for my ExtensionDtype?

I'd be very reticent to make that change, since I think a lot of code expects that to imply it's getting our Datetime64TZDtype. Maybe an is_3rd_party_ea_dtype that we would check for before checking for any 1st-party dtypes? That runs into the "ideally we should treat 3rd party EAs symmetrically with 1st-party" problems.

So getting back to the motivation: how high a priority is this?

One thing I can unambiguously encourage is more tests, even if xfailed:

  • what happens if you pass one of these to the DatetimeIndex constructor? vice-versa?
  • what happens if I do DatetimeIndex.astype(this_new_ea_dtype)?
  • addition/subtraction with the gamut of datetime/timedelta scalars/arrays we already support?
  • How does this behave if you stuff it inside a Categorical/CategoricalIndex?
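
As an illustration only (test names, bodies, and reasons here are hypothetical, not part of this PR), the first two bullets above could translate into additional xfailed tests roughly like this:

# Hypothetical sketch of the first bullet: pass the third-party EA to the
# DatetimeIndex constructor and check that the dtype round-trips.
@pytest.mark.xfail(reason="constructing a DatetimeIndex from a third-party datetime EA")
def test_to_datetimeindex():
    arr = ArrowTimestampUSArray.from_scalars([datetime.datetime(2010, 9, 8)])
    result = pd.DatetimeIndex(arr)
    assert result.dtype == arr.dtype


# Hypothetical sketch of the second bullet: cast an existing DatetimeIndex to the
# registered third-party dtype by its registered name.
@pytest.mark.xfail(reason="casting a DatetimeIndex to a third-party datetime EA dtype")
def test_from_datetimeindex():
    dti = pd.date_range("2010-09-08", periods=3)
    result = dti.astype("arrow_timestamp_us")
    assert isinstance(result.array, ArrowTimestampUSArray)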

Contributor Author:

> > I would prefer to fix this myself though.
>
> Sounds good. Is this a use case you have a need to get working near-term, or more of a Principle Of The Thing? I ask because...

More in the next-6-months range, so I'm definitely going to add an xfail here, as the points below indicate that we should think this through rather than "fix quick".

I would love to have a nullable, non-nanosecond timestamp (actually I desperately need it, but e.g. having a performant string is more important to me), but there are several other places that either assume that all timestamps are nanoseconds or backed by a numpy-array, so this is going to be a major effort.

> > or shouldn't is_datetime64tz_dtype return True for my ExtensionDtype?
>
> I'd be very reticent to make that change, since I think a lot of code expects that to imply it's getting our Datetime64TZDtype. Maybe an is_3rd_party_ea_dtype that we would check for before checking for any 1st-party dtypes? That runs into the "ideally we should treat 3rd party EAs symmetrically with 1st-party" problems.

> So getting back to the motivation: how high a priority is this?

As already pointed out: less than other things I want to contribute to pandas, so xfailing and adding more (possibly) xfailing tests is the way to go.

Member:

> so xfailing and adding more (possibly) xfailing tests is the way to go.

Sounds good.

> actually I desperately need [...] that either assume that all timestamps are nanoseconds or backed by a numpy-array

Would your need be solved if we get numpy-backed non-nano in place? There's a reasonable chance of that happening in the next 6 months.

Contributor Author:

> Would your need be solved if we get numpy-backed non-nano in place? There's a reasonable chance of that happening in the next 6 months.

For now: Yes.

Member:

> For now: Yes.

I'm slowly tackling this from the Cython side of the code. The parallelizable step is to comb through the rest of the code to find all the places where we implicitly/explicitly assume nanos. I'd start with pandas/plotting and pandas/io.

Member:

Let's see if we can at least get this one working.

I think we'll need to edit the dtype.kind check in is_datetime64tz_dtype, and possibly the issubclass(vtype, np.datetime64) check in internals.blocks.get_block_type.

    # GH 34986
    pd.DataFrame(
        {
            "timestamp": ArrowTimestampUSArray.from_scalars(
                [None, datetime.datetime(2010, 9, 8, 7, 6, 5, 4)]
            )
        }
    )
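
As a small, self-contained illustration of why the dtype.kind check mentioned in the last review comment also matches the third-party dtype defined in this diff (hypothetical snippet, not part of the PR):

# Hypothetical illustration, assuming the ArrowTimestampUSDtype defined above.
dtype = ArrowTimestampUSDtype()

print(dtype.kind == "M")                      # True: by kind alone it looks like a datetime dtype
print(isinstance(dtype, pd.DatetimeTZDtype))  # False: it is not pandas' own tz-aware dtype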