
REF/TST: Fix remaining DatetimeArray with DateOffset arithmetic ops #23789


Merged
25 commits merged into pandas-dev:master on Nov 28, 2018

Conversation

jbrockmendel
Member

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@pep8speaks

Hello @jbrockmendel! Thanks for submitting the PR.

@jbrockmendel
Member Author

The Travis failure is the test_rank segfault.

@codecov

codecov bot commented Nov 20, 2018

Codecov Report

Merging #23789 into master will decrease coverage by <.01%.
The diff coverage is 94%.


@@            Coverage Diff             @@
##           master   #23789      +/-   ##
==========================================
- Coverage   92.31%   92.31%   -0.01%     
==========================================
  Files         161      161              
  Lines       51471    51501      +30     
==========================================
+ Hits        47515    47542      +27     
- Misses       3956     3959       +3
Flag Coverage Δ
#multiple 90.7% <94%> (-0.01%) ⬇️
#single 42.43% <36%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/frequencies.py 97.08% <100%> (ø) ⬆️
pandas/core/arrays/timedeltas.py 96.25% <100%> (-0.19%) ⬇️
pandas/core/indexes/datetimes.py 96.48% <100%> (+0.01%) ⬆️
pandas/core/indexes/timedeltas.py 89.28% <100%> (+0.09%) ⬆️
pandas/tseries/offsets.py 96.84% <86.36%> (-0.14%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee283fa...c3d775e.

@@ -1328,20 +1346,24 @@ def _end_apply_index(self, dtindex):

base, mult = libfrequencies.get_freq_code(self.freqstr)
base_period = dtindex.to_period(base)
if not isinstance(base_period._data, np.ndarray):
Contributor

You seem to be doing this unwrapping a lot; wouldn't it be better to have a ._to_unwrapped_period(...) that just does this (basically to_period() with unwrapping to the base data)?

Member Author

The issue here is that we want to unwrap a PeriodIndex, but leave a PeriodArray unchanged. Or are you suggesting that _to_unwrapped_period() would return self for the PeriodArray? I'm kind of expecting that down the road we'll have some general version of this to allow us to treat EAs and EA-backed Indexes symmetrically.

Contributor

Just a thought. Happy to defer, but I think it does make sense to have methods that unwrap on Index but return self for Array.
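
For illustration, a minimal sketch of the unwrap-only-if-Index pattern being discussed; the _unwrap_period name is hypothetical and this is not the PR's code, just the idea behind the isinstance(..., np.ndarray) check in the hunk above (a PeriodIndex stores a PeriodArray in ._data, while a PeriodArray stores a plain ndarray there):

import pandas as pd

def _unwrap_period(obj):
    # Hypothetical helper: return the underlying PeriodArray for a
    # PeriodIndex, but pass a PeriodArray through unchanged.
    if isinstance(obj, pd.PeriodIndex):
        return obj._data
    return obj

pidx = pd.period_range("2018-01", periods=3, freq="M")
parr = _unwrap_period(pidx)            # PeriodArray
parr is _unwrap_period(parr)           # True: already unwrapped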

@@ -227,6 +242,18 @@ def _validate_fill_value(self, fill_value):
"Got '{got}'.".format(got=fill_value))
return fill_value

@property
def is_monotonic_increasing(self):
Member

Where is this used in this PR?

Member Author

When calling the constructor with freq="infer" we end up calling tseries.frequencies.infer_freq, which accesses this property.

Member

I think I already expressed this previously, but I would prefer that we have this discussion for all the Arrays, not just for TimedeltaArray (or DatetimeArray), as this is not a datetime-specific attribute. If you don't want to have that discussion first, you can always make it a private attribute here, and check that in infer_freq (or call the algos methods there if the passed values don't have such an attribute)

Member Author

If this is a discussion you'd like to have for EAs more generally, go ahead and open an issue for it.

I would be +1 on putting these attributes in the mixin class so that they are available on all three of DTA/TDA/PA, but for now they are only needed on TDA.

Special-casing inside infer_freq (or more specifically FrequencyInferer) is a pretty ugly hack that I'd rather avoid. That said, if you're going to insist on it, I might as well get it over with.

Member

@jbrockmendel I would really prefer if you leave this out of this PR (I mean adding the attributes, so which means special casing this in infer_freq).

Looking at the code again, I think the easiest thing to do is to ensure that what is passed to _FrequencyInferrer is an actual Index. So if we have an array, simply convert it to an Index without copy.

Member Author

Converting to an Index would break the "Array should be ignorant of Index" rule discussed elsewhere.

I'll special-case this to get this over with, but I maintain this is introducing a code smell.
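
For reference, the no-copy Index wrapping suggested above would look roughly like this (the _ensure_index helper is made up for illustration, not the merged code):

import pandas as pd

def _ensure_index(values):
    # Hypothetical helper: leave an Index alone, otherwise wrap the
    # array-like in an Index without copying, so Index-only cached
    # properties (is_monotonic_increasing, is_unique, ...) are available.
    if isinstance(values, pd.Index):
        return values
    return pd.Index(values, copy=False)

tdi = _ensure_index(pd.timedelta_range("1D", periods=5, freq="D"))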

.format(inferred=inferred_freq,
passed=freq.freqstr))
elif freq is None:
# TODO: should this be the stronger condition `if freq_infer`?
Member

Can you clarify this TODO comment?

Member Author

Sure. We might also be able to resolve the answer as "no" in this thread.

ATM if we get an inferred_freq here, we are setting freq = inferred_freq regardless of whether the user passed freq="infer". In particular, we are ignoring the possibility that the user passed freq=None specifically wanting the result to have no freq.
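
A paraphrase of that branch as a standalone sketch (hypothetical helper, not the constructor itself), assuming, per the freq_infer mention in the TODO, that freq="infer" has already been normalized to freq=None plus a separate freq_infer flag by this point:

def _resolve_freq(freq, inferred_freq):
    # Once a freq has been inferred from the values, freq=None and
    # freq="infer" are indistinguishable here, so an explicit request
    # for "no freq" cannot be expressed.
    if inferred_freq is not None:
        if freq is not None and freq != inferred_freq:
            raise ValueError("Inferred frequency does not conform "
                             "to passed frequency")
        if freq is None:
            return inferred_freq
    return freq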


result = cls._simple_new(values, freq=freq)
# check that we are matching freqs
Contributor

I think you could simplify a lot of these checks if you also pass inferred_freq (optionally) to ._validate_frequency, and for example handle the fact that result could be a scalar (just skip the validation) or a sequence (do the validation).

Member Author

I'll take a look at this.

Member Author

I think you're right that we can change this from a 3-4 liner into a 1-2 liner. Since this pattern shows up in all four of the TDA/DTA/TDI/DTI constructors (actually, future tense for DTA), I'd like to change them all at once in a dedicated follow-up.
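
A rough sketch of what that consolidation might look like; the helper name and the _validate_frequency calling convention are assumptions for illustration, not the follow-up's actual code:

def _maybe_validate_freq(result, freq, inferred_freq):
    # Hypothetical helper folding the repeated 3-4 line constructor
    # pattern into one place: skip validation when the freq was inferred
    # from the values themselves, otherwise check the passed freq
    # against the data.
    if freq is not None and inferred_freq is None:
        type(result)._validate_frequency(result, freq)
    return result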

@jbrockmendel
Member Author

gentle ping

elif freq is None:
# TODO: should this be the stronger condition `if freq_infer`?
# i.e what if the user passed `freq=None` and specifically
# wanted freq=None in the result?
Contributor

I am not sure this is a valid condition. what are you trying to do here?

Member Author

I am not sure this is a valid condition. what are you trying to do here?

To be clear, this is the same as the existing logic in TimedeltaIndex.__new__. The question is whether we're sure this is the desired behavior. ATM:

tdi = pd.timedelta_range('1D', periods=5, freq='D')
tdi2 = pd.TimedeltaIndex(tdi, freq=None)
tdi3 = pd.TimedeltaIndex(tdi, freq="infer")

tdi2 and tdi3 both have daily freq, and there is no extant way for the user to say "no, seriously, I want freq=None" except by setting it manually.

Contributor

I doubt a user setting a freq is actually used anywhere; I'm not even sure of the purpose of it, nor is it likely tested.

This is just adding complexity.

Member Author

OK, I'll remove the comment

@@ -227,6 +227,10 @@ def _format_native_types(self, na_rep=u'NaT', date_format=None, **kwargs):
# -------------------------------------------------------------------
# Wrapping TimedeltaArray

is_monotonic_increasing = Index.is_monotonic_increasing
is_monotonic_decreasing = Index.is_monotonic_decreasing
is_unique = Index.is_unique
Contributor

You need to time this; we have a dedicated routine in the cython engine for this. The main reason for using it is that it's an O(n) op once the hashtable is computed, rather than the non-cached computation you did above.

Member Author

Right. TimedeltaIndex uses the _engine-based implementation that is available because we know TDI is immutable. TimedeltaArray uses the naive implementation, at least for now. There's an Issue to investigate caching on PeriodArray which can be extended to TDA/DTA when the time comes.

Contributor

This is not obvious at all. I would expect a comment on this. Maybe even do this in a separate PR, with testing for this. I am not sure it's relevant to the changes in this PR.

Member Author

I'll add a comment. It is definitely relevant to this PR, since without implementing these on the TDA, calls to infer_freq (via __new__) raise AttributeError.
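
To make the trade-off above concrete, a small sketch of the non-cached check an Array can fall back on; the Index version instead delegates to its cached engine machinery. This is an illustration, not the PR's exact implementation:

import numpy as np

def naive_is_monotonic_increasing(asi8):
    # O(n) and recomputed on every call, which is acceptable for a
    # (mutable) Array, whereas an immutable Index can compute this once
    # in its engine and cache the result.
    asi8 = np.asarray(asi8, dtype="i8")
    return bool(np.all(asi8[1:] >= asi8[:-1]))

naive_is_monotonic_increasing([1, 2, 3])   # True
naive_is_monotonic_increasing([3, 1, 2])   # False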


offset_cls = getattr(pd.offsets, cls_name)

with warnings.catch_warnings(record=True):
Member

I don't think the record=True is needed? (as you are not checking the captured warnings)
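
For context on the record=True remark, a minimal comparison in plain Python (not the test file itself):

import warnings

# record=True returns the caught warnings for later assertions:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("example", FutureWarning)
assert len(caught) == 1

# if the captured warnings are never inspected, plain suppression is enough:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    warnings.warn("example", FutureWarning)   # silently ignored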


weeks = (kwds.get('weeks', 0)) * self.n
if weeks:
# integer addition on PeriodIndex is deprecated,
# so we directly use _time_shift instead
asper = i.to_period('W')
shifted = asper._data._time_shift(weeks)
if not isinstance(asper._data, np.ndarray):
Member

when is this not the case?

(or why was this change needed?)

Member

And something else: this full if weeks block does not seem to be covered (don't know if it is a false positive or not).

Member Author

asper at this point may be either a PeriodIndex or a PeriodArray. If it is a PeriodIndex we want to unwrap it to get a PeriodArray. If it is a PeriodArray, we don't want to unwrap it, since that would give an ndarray.

Member Author

Good note on the if weeks block. Adding coverage is out of scope, but I'll add a note in the offsets TODO list.

@@ -912,7 +919,9 @@ def apply(self, other):
@apply_index_wraps
def apply_index(self, i):
shifted = liboffsets.shift_months(i.asi8, self.n, self._day_opt)
return i._shallow_copy(shifted)
# TODO: going through __new__ raises on call to _validate_frequency;
# are we passing incorrect freq?
Member

Does this TODO need to be discussed / handled now?

Member Author

I'm happy thinking of this as a note to future-me. Really looking forward to getting the current crop of PRs done with and DatetimeArray out the door.

@jbrockmendel
Member Author

@jorisvandenbossche I think your main concern has been addressed. LMK if any others are deal-breakers.

@@ -295,8 +295,19 @@ def __init__(self, index, warn=True):
if len(index) < 3:
raise ValueError('Need at least 3 dates to infer frequency')

self.is_monotonic = (self.index.is_monotonic_increasing or
self.index.is_monotonic_decreasing)
if not hasattr(index, "is_monotonic_increasing"):
Contributor

Why don't you define a method on both TimedeltaArray & TimedeltaIndex which basically does the correct thing here? I find it not very obvious what you are doing, and the difference between TDI and TDA is pretty stark here (in the calling conventions and such).

Member Author

That's what I did originally, but @jorisvandenbossche insisted on not-that.

Contributor

Hmm, this is the wrong place for logic like this. It's a private method, not sure the reason. @jorisvandenbossche

Member

It's a private method, not sure the reason.

Not sure what you mean by this comment. This is a private method, yes, but why would that be a reason that this is not a good place to put it?
For now, it is also the only place where this logic is needed, so I don't think it is necessarily a bad place. Yes, if this started to be used in a lot of places, putting it as a method on the Array class would make sense; for now I don't really care.

What I asked was just to leave the addition of public attributes on the Array classes out of this PR (as that needs a more general discussion). Whether the logic is put here, in a private attribute on the Array, or by doing a self.index = pd.Index(self.index, copy=False) to ensure it has the needed properties (and the rest of the code here can stay the same), I don't care much.

Contributor

Because you are putting logic that clearly has a different implementation outside of the class which defines it, this is breaking all kinds of abstraction here.

The inference engine needs to know some info which is better off done all together for perf reasons, so call a method; that is what they are for.

@jbrockmendel move this to a private method on TDA and TDI

Member Author

Done. Now frequencies.infer_freq is accessing private attrs, but we can revisit that smell later.
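
A sketch of the shape this resolution takes; the private attribute names here are assumptions for illustration, not necessarily the ones in the merged commit:

class FrequencyInfererSketch:
    # Illustration only: the inferer reads monotonicity through private
    # attributes that both TimedeltaIndex and TimedeltaArray expose,
    # instead of the public Index-only properties.
    def __init__(self, index):
        if len(index) < 3:
            raise ValueError("Need at least 3 dates to infer frequency")
        self.index = index
        self.is_monotonic = (index._is_monotonic_increasing or
                             index._is_monotonic_decreasing)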

@jbrockmendel
Member Author

Should be all squared away now.

@jbrockmendel
Member Author

gentle ping. This and #23829 are the main things standing in the way of declaring DTA/TDA "finished" (internally)

@jreback
Contributor

jreback commented Nov 28, 2018

Couple of notes for follow-up, but LGTM.

@jorisvandenbossche

@jreback jreback added this to the 0.24.0 milestone Nov 28, 2018
@jorisvandenbossche jorisvandenbossche merged commit b1ee2df into pandas-dev:master Nov 28, 2018
@jorisvandenbossche
Member

Thanks @jbrockmendel !

@jbrockmendel
Member Author

Booyah. Thanks @jorisvandenbossche, this one is a big win.

Labels: Datetime (Datetime data dtype), Frequency (DateOffsets)