Implement _most_ of the EA interface for DTA/TDA #23643
Conversation
Hello @jbrockmendel! Thanks for updating the PR.
Comment last updated on November 12, 2018 at 20:25 UTC
pandas/core/arrays/datetimelike.py (outdated)

@classmethod
def _concat_same_type(cls, to_concat):
    # for TimedeltaArray and PeriodArray; DatetimeArray overrides
The problem being that DatetimeArray needs to pass through tz info?
For PeriodArray at least (haven't checked TimedeltaArray) you should be able to implement _concat_same_type just in terms of .dtype. It's hashable and can be passed to PeriodArray.__init__.
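To make the suggestion concrete, here is a minimal, self-contained sketch of a _concat_same_type written purely in terms of a hashable dtype object. The class names are hypothetical stand-ins, not pandas internals:

```python
import numpy as np

class PeriodDtypeLike:
    """Stand-in for PeriodDtype: hashable, carries the freq."""
    def __init__(self, freq):
        self.freq = freq
    def __eq__(self, other):
        return isinstance(other, PeriodDtypeLike) and self.freq == other.freq
    def __hash__(self):
        return hash(("period", self.freq))

class PeriodArrayLike:
    """Stand-in for PeriodArray: integer data plus a dtype."""
    def __init__(self, values, dtype):
        self.asi8 = np.asarray(values, dtype="i8")
        self.dtype = dtype

    @classmethod
    def _concat_same_type(cls, to_concat):
        # The set comprehension works because the dtype is hashable;
        # all inputs must share a single dtype.
        dtypes = {x.dtype for x in to_concat}
        assert len(dtypes) == 1
        values = np.concatenate([x.asi8 for x in to_concat])
        return cls(values, dtype=dtypes.pop())
```

Because the dtype carries the freq, no separate freq check is needed in this version.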
Good call
As opposed to PeriodArray, freq is not part of the dtype for DatetimeArray/TimedeltaArray, so I am not sure this check for freq should be done here.
IIRC, the PeriodArray constructor allows passing both freq and dtype, as long as they match, so PeriodArray(data, freq='H', dtype=PeriodDtype("H")) should be fine.
Yah, we added dtype to the PeriodArray constructor specifically so that type(self)(values, freq=self.freq, dtype=self.dtype) would be valid for all three TDA/DTA/PA classes
Sorry, that's not what I meant. The difference is the meaning of freq for PeriodArray vs DatetimeArray. For PeriodArray it is part of the dtype and defines how the stored integers are interpreted (and thus needs to match to just concatenate those integers), but for DatetimeArray it is simply an informative attribute telling you about the regularity of the array, not essential to describe it. So I would assume that _concat_same_type needs to handle arrays with different freqs for DatetimeArray.
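The point can be illustrated with plain DatetimeIndex operations (assuming a reasonably recent pandas; the particular freqs are arbitrary):

```python
import pandas as pd

# Two regular ranges with different freqs.
a = pd.date_range("2012-01-01", periods=3, freq="D")
b = pd.date_range("2012-02-01", periods=3, freq="2D")

# Concatenating them is perfectly valid datetime data, but no single
# freq can describe the result, so the combined index has freq None.
combined = a.append(b)
```

This is exactly the case a shared freq check would wrongly reject.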
Ported the last of the relevant tests from #23415
class SharedTests(object):
    index_cls = None

    def test_take(self):
We should not need to add such basic tests I think, as those are covered by the base Extension tests (we should of course test datetime specific aspects).
Is there anything in this test not tested by the base tests?
The class-specific tests have tests for invalid fill_values
arr.take([0, 1], allow_fill=True,
         fill_value=pd.Timestamp.now().time)
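As a sketch of what such an invalid-fill_value test exercises: a datetime.time is not a valid fill value for datetime64 data, so take should refuse it. The exact exception type has varied across pandas versions, so this hedges by catching both:

```python
import datetime
import pandas as pd

dti = pd.date_range("2012-01-01", periods=3)
arr = dti.array  # a DatetimeArray

try:
    # Validation of fill_value happens up front when allow_fill=True,
    # even if no index is -1.
    arr.take([0, 1], allow_fill=True, fill_value=datetime.time(12, 0))
    raised = False
except (TypeError, ValueError):
    raised = True
```

The base extension tests cover take generally; this is the datetime-specific aspect.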
def test_concat_same_type(self):
As I think I also commented on one of the previous PRs that started doing this, I don't think we should test _concat_same_type here directly. It is already tested by the base extension tests and by all the tests that actually use it under the hood.
Yah, most of these tests were salvaged from one of those older PRs. I don't see much downside to having the tests, but am pretty happy to pawn this decision/PR off on Tom
We still want the fail cases (different dtypes) here.
At this point, we should be able to simplify core/dtypes/concat.py::concat_datetimetz and DatetimeIndex._concat_same_dtype, right? I'll take a look.
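A toy sketch of the simplification being proposed (all names here are illustrative stand-ins, not the actual pandas classes): the Index-level concat just delegates to the array's _concat_same_type, making a dtype-specific helper like concat_datetimetz unnecessary:

```python
import numpy as np

class ToyDatetimeArray:
    """Stand-in for DatetimeArray: integer (i8) backing data."""
    def __init__(self, asi8):
        self.asi8 = np.asarray(asi8, dtype="i8")

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([x.asi8 for x in to_concat]))

class ToyDatetimeIndex:
    """Stand-in for DatetimeIndex, wrapping an array."""
    def __init__(self, array):
        self._data = array

    def _concat_same_dtype(self, to_concat):
        # Delegate straight to the array implementation instead of a
        # dtype-specific helper.
        arrays = [idx._data for idx in to_concat]
        return ToyDatetimeIndex(ToyDatetimeArray._concat_same_type(arrays))
```

The real chain discussed below (_concat._concat_datetimetz -> DatetimeIndex._concat_same_dtype -> DatetimeArray._concat_same_type) collapses to the last step.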
pandas/core/arrays/datetimelike.py (outdated)

@classmethod
def _from_factorized(cls, values, original):
    return cls(values, dtype=original.dtype, freq=original.freq)
Does it make sense to pass original.freq here?
It seems a bit strange, as this is creating a new array which does not necessarily have the same order as the original one.
Although in practice, if you have a freq, that means you have a regular and unique array to start with, so the factorization is kind of a no-op and the result will still have the same freq? (But that might be missing corner cases.)
Thinking of a possible corner case, which is currently actually broken: a sorted factorize of a DatetimeIndex with a negative freq:
In [57]: idx = pd.date_range("2012-01-01", periods=3)
In [58]: idx
Out[58]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq='D')
In [59]: pd.factorize(idx)
Out[59]:
(array([0, 1, 2]),
DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq='D'))
In [60]: pd.factorize(idx[::-1])
Out[60]:
(array([0, 1, 2]),
DatetimeIndex(['2012-01-03', '2012-01-02', '2012-01-01'], dtype='datetime64[ns]', freq='-1D'))
In [61]: pd.factorize(idx[::-1], sort=True)
Out[61]:
(array([2, 1, 0]),
DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq='-1D'))
@TomAugspurger are you still good with taking this PR over? (BTW, the Travis failures look like unrelated timeouts)
Sure, will push changes here.
Ah, I see what you mean now. Yes, you're right. And I suppose we can't even assume that concatenating multiple DatetimeArrays with the same freq will end up with a DatetimeIndex with the same freq.
Indeed, as that would only be the case if they are nicely consecutive ranges.
_concat._concat_datetimetz -> DatetimeIndex._concat_same_dtype -> DatetimeArray._concat_same_type
Pushed some changes.
Codecov Report
@@ Coverage Diff @@
## master #23643 +/- ##
==========================================
+ Coverage 92.24% 92.24% +<.01%
==========================================
Files 161 161
Lines 51318 51340 +22
==========================================
+ Hits 47339 47361 +22
Misses 3979 3979
Continue to review full report at Codecov.
    return _concat._concat_datetimetz(to_concat, name)
# TODO(DatetimeArray)
# - remove the .asi8 here
# - remove the _maybe_box_as_values
looks like you can remove part of the comment
Just a really minor comment, ok to merge.
Going to leave the stale comment for now, if that's OK. That whole if/else should be going away soon.
ok! thanks @jbrockmendel and @TomAugspurger |
* upstream/master: (25 commits)
  DOC: Delete trailing blank lines in docstrings. (pandas-dev#23651)
  DOC: Change release and whatsnew (pandas-dev#21599)
  DOC: Fix format of the See Also descriptions (pandas-dev#23654)
  DOC: update pandas.core.groupby.DataFrameGroupBy.resample docstring. (pandas-dev#20374)
  ENH: Allow export of mixed columns to Stata strl (pandas-dev#23692)
  CLN: Remove unnecessary code (pandas-dev#23696)
  Pin flake8-rst version (pandas-dev#23699)
  Implement _most_ of the EA interface for DTA/TDA (pandas-dev#23643)
  CI: raise clone depth limit on CI
  BUG: Fix Series/DataFrame.rank(pct=True) with more than 2**24 rows (pandas-dev#23688)
  REF: Move Excel names parameter handling to CSV (pandas-dev#23690)
  DOC: Accessing files from a S3 bucket. (pandas-dev#23639)
  Fix errorbar visualization (pandas-dev#23674)
  DOC: Surface / doc mangle_dupe_cols in read_excel (pandas-dev#23678)
  DOC: Update is_sparse docstring (pandas-dev#19983)
  BUG: Fix read_excel w/parse_cols & empty dataset (pandas-dev#23661)
  Add to_flat_index method to MultiIndex (pandas-dev#22866)
  CLN: Move to_excel to generic.py (pandas-dev#23656)
  TST: IntervalTree.get_loc_interval should return platform int (pandas-dev#23660)
  CI: Allow to compile docs with ipython 7.11 pandas-dev#22990 (pandas-dev#23655)
  ...
@TomAugspurger as promised, this implements a handful of tests, and a few more can be ported from #23415, but the EA-specific tests haven't been started.
The big missing piece is DatetimeArray._from_sequence, which I'm getting started on now in a new branch.
Closes #23586