Skip to content

[WIP]: API ExtensionDtype for DTA & TDA #24674

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

TomAugspurger
Copy link
Contributor

Closes #24662

TODO:

  • Dedicated tests for new is_ types
  • Run the extension dtype tests for new dtypes
  • (maybe?) see if the new EA tests pass for the new arrays. Tricky, since they aren't allowed internally. g
  • failing dtype validation tests

This is going to fail. Going offline for a bit, but putting this up right now.

I'm going to clean up the changes in internals and the series constructor.

@TomAugspurger
Copy link
Contributor Author

Still a WIP, but probably not going to have time to update this any further through the end of the week.

return DatetimeArray

@classmethod
def construct_from_string(cls, string):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to move this to the sub-class as DatetimeDtype would actually accept this but raise on a construction error (as you can't pass a tz). I would maybe instead just have a unified constructor that accepts tz=None for naive.

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels Jan 8, 2019
@jorisvandenbossche
Copy link
Member

I would maybe instead just have a unified constructor that accepts tz=None for naive.

@jreback do you also mean just having a single class internally? DatetimeDtype(tz=None) is the naive one and DatetimeDtype(tz=tz) is the aware one (and DatetimeTzDtype can be kept as an alias for now for backwards compatibility)

I didn't really think about any possible technical implications of that, but it seems logical to me to do that, since we also have a single DatetimeArray.

@TomAugspurger
Copy link
Contributor Author

Sorry I put this up in haste and didn't explain things well. Initially, I tried that, but ran into some difficulties. It turns out that having pandas_dtype("M8[ns]") return a non NumPy dtype breaks a lot of stuff. So I split them apart, and don't register the DatetimeDtype.

I'm not sure if it's possible to both

  1. Have a DatetimeDtype that registers datetime64[ns, tz], but not datetime64[ns]
  2. Have construct_from_string always work correctly

but I can try (later, not today).

@jreback
Copy link
Contributor

jreback commented Jan 9, 2019

yeah I think this is a great idea @TomAugspurger but also possibly to involve changing some tricky code; would prefer to push this to 0.25

@TomAugspurger
Copy link
Contributor Author

The main downside with pushing to 0.25 is it'll be an API breaking change for DatetimeArray.dtype, and TimedeltaArray.dtype with no path for deprecation.


just pushed an update, but still a WIP.

@codecov
Copy link

codecov bot commented Jan 15, 2019

Codecov Report

Merging #24674 into master will decrease coverage by 0.02%.
The diff coverage is 88.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24674      +/-   ##
==========================================
- Coverage   92.38%   92.36%   -0.03%     
==========================================
  Files         166      166              
  Lines       52379    52474      +95     
==========================================
+ Hits        48392    48467      +75     
- Misses       3987     4007      +20
Flag Coverage Δ
#multiple 90.78% <88.23%> (-0.03%) ⬇️
#single 42.98% <72.94%> (+0.07%) ⬆️
Impacted Files Coverage Δ
pandas/core/nanops.py 94.36% <ø> (ø) ⬆️
pandas/core/api.py 100% <ø> (ø) ⬆️
pandas/core/series.py 93.7% <100%> (+0.01%) ⬆️
pandas/core/dtypes/common.py 96.86% <100%> (+0.08%) ⬆️
pandas/core/frame.py 96.93% <100%> (ø) ⬆️
pandas/core/arrays/datetimes.py 97.78% <100%> (ø) ⬆️
pandas/core/indexes/timedeltas.py 90.3% <100%> (+0.08%) ⬆️
pandas/core/indexes/datetimes.py 96.28% <100%> (+0.01%) ⬆️
pandas/core/indexes/base.py 96.3% <100%> (ø) ⬆️
pandas/core/internals/construction.py 94% <60.86%> (-1.94%) ⬇️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 512830b...2abda43. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 15, 2019

Codecov Report

Merging #24674 into master will decrease coverage by 49.41%.
The diff coverage is 73.65%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24674       +/-   ##
===========================================
- Coverage   92.38%   42.97%   -49.42%     
===========================================
  Files         166      166               
  Lines       52366    52469      +103     
===========================================
- Hits        48380    22550    -25830     
- Misses       3986    29919    +25933
Flag Coverage Δ
#multiple ?
#single 42.97% <73.65%> (+0.06%) ⬆️
Impacted Files Coverage Δ
pandas/core/nanops.py 31.27% <ø> (-63.1%) ⬇️
pandas/core/api.py 86.66% <ø> (-13.34%) ⬇️
pandas/core/series.py 49.03% <100%> (-44.66%) ⬇️
pandas/core/dtypes/common.py 71.15% <100%> (-25.63%) ⬇️
pandas/core/frame.py 35.82% <100%> (-61.11%) ⬇️
pandas/core/arrays/datetimes.py 66.13% <100%> (-31.65%) ⬇️
pandas/core/indexes/timedeltas.py 44.87% <100%> (-45.38%) ⬇️
pandas/core/indexes/datetimes.py 48.62% <100%> (-47.65%) ⬇️
pandas/core/indexes/base.py 56.49% <100%> (-39.82%) ⬇️
pandas/core/internals/construction.py 62.11% <60.86%> (-33.83%) ⬇️
... and 136 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0bd454c...f1ecb8b. Read the comment docs.

@TomAugspurger
Copy link
Contributor Author

Somewhat awkward API corner: what's the expected types for

  1. DatetimeArray.astype("M8[ns]")
  2. DatetimeArray.astype(np.dtype("M8[ns]"))
  3. DatetimeArray.astype(pd.DatetimeDtype())

Previously, we'd use .astype('M8[ns]') to convert to an ndarray. But now that we have a dtype roughly equivalent to M8[ns], should we keep doing that?

Another one, what should we do about pd.DatetimeDtype() == "M8[ns]" or == np.dtpye("M8[ns]").

NumPy doesn't make this easy, since they would raise if the operands were flipped, rather than returning NotImplemented.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 17, 2019

For the expected types, I also have been wondering about this in general. So eg also for DTA.astype("int64").
Because we now have PandasArray and PandasDtype, so we could also return those? Similarly as pd.array(DTA, dtype='int64') does return that, we could do the same for EA.astype. It could be a specification of EA.astype that the return value is always another EA.

@TomAugspurger
Copy link
Contributor Author

Yeah, it's not obvious to me what's best here.

On the one hand, you could say that .astype(np.dtype(.)) should always return an ndarray.

On the other hand, is it weird for .astype(np.dtype("int64")) to not match .astype("int64")? I suppose we'll need to sort all this out.

I'm really not sure what to do here :/

@TomAugspurger TomAugspurger mentioned this pull request Jan 20, 2019
@h-vetinari
Copy link
Contributor

@TomAugspurger: Yeah, it's not obvious to me what's best here.
On the one hand, you could say that .astype(np.dtype(.)) should always return an ndarray.
On the other hand, is it weird for .astype(np.dtype("int64")) to not match .astype("int64")? I suppose we'll need to sort all this out.
I'm really not sure what to do here :/

How about .astype and pd.array always returning PandasDType (yes, even for an np.dtype)? If the user really wants an ndarray, they can then call to_numpy. That would in many ways be more consistent - and could later be extended to the point that the only way to get an ndarray from pandas is to call to_numpy.

@jbrockmendel
Copy link
Member

This has been inactive for a while. Still a goal?

@TomAugspurger
Copy link
Contributor Author

I'm not sure. The API changes are tricky to get right.

Regardless, I don't think things will be done on this branch. I'll post an update on the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dtype for tz-naive DatetimeArray and TimedeltaArray
5 participants