-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix Timestamp rounding #21507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Timestamp rounding #21507
Conversation
Hello @alimcmaster1! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on June 28, 2018 at 20:01 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #21507 +/- ##
=======================================
Coverage 91.9% 91.9%
=======================================
Files 154 154
Lines 49555 49555
=======================================
Hits 45542 45542
Misses 4013 4013
Continue to review full report at Codecov.
|
A few thoughts:
|
Thanks @mroeschke for the comments. I've done (2) and (3) added a few additional test cases to the DateTimeIndex ceil/floor and one that clearly shows how round should behave for this bug.
Best, Alistair |
Rather have
|
dt = Timestamp(test_input) | ||
expected = Timestamp(expected) | ||
|
||
result_ceil = dt.ceil(freq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also parametrize over the rounding methods (ceil
, floor
, and round
)? It would help reduce this duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure makes sense to me
pandas/_libs/tslibs/timestamps.pyx
Outdated
@@ -72,30 +72,50 @@ def round_ns(values, rounder, freq): | |||
------- | |||
int or :obj:`ndarray` | |||
""" | |||
def _round_non_int_multiple(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you just have the int check inside here? this is convoluting the logic a lot
pandas/_libs/tslibs/timestamps.pyx
Outdated
|
||
r = (unit * rounder((values * (divisor / float(unit))) / divisor) | ||
.astype('i8')) | ||
if type(values) is int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not just another part of the if?
So it shows up on the issue properly: #21262 |
Cleaned up my implementation here
Thoguhts @mroeschke ? |
pandas/_libs/tslibs/timestamps.pyx
Outdated
return r | ||
return r | ||
|
||
return np.fromiter((_round_non_int_multiple(item) for item in values), np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you iterating? this doesn’t make any sense to do so with. vectorizes function
pandas/_libs/tslibs/timestamps.pyx
Outdated
""" | ||
Applies rounding function at given frequency | ||
|
||
Parameters | ||
---------- | ||
values : int, :obj:`ndarray` | ||
rounder : function | ||
values : np.array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd leave this as :obj:ndarray
pandas/_libs/tslibs/timestamps.pyx
Outdated
freq : str, obj | ||
|
||
Returns | ||
------- | ||
int or :obj:`ndarray` | ||
np.array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
pandas/_libs/tslibs/timestamps.pyx
Outdated
value = np.array([value], dtype=np.int64) | ||
|
||
# Will only ever contain 1 element for timestamp | ||
r = round_ns(value, rounder, freq).item() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit. I think just indexing into this array (i.e. round_ns(value, rounder, freq)[0]
) is just fine. Looks like item
returns a copy of a Python scalar (and we may want to keep this a numpy scalar just in case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you but one advantage I did see of item()
is that it will throw if the size of the array is > 1. We could do [0] and justify this by asserting len(r) == 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.value
from Timestamp
will always be a scalar, so we implicitly know the result of this is be a one element array.
Thanks for the revision. As @jreback mentions, I dont think it's necessary to iterate, You should be able to perform the adjustments with vectorization. At a high level the logic should look like:
|
That @mroeschke your logic above seems much neater let me refactor! |
pandas/_libs/tslibs/timestamps.pyx
Outdated
values : int, :obj:`ndarray` | ||
rounder : function | ||
values : :obj:`ndarray` | ||
rounder : function, eg. 'Ceil', 'Floor', 'round' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Could you lowercase ceil and floor
pandas/_libs/tslibs/timestamps.pyx
Outdated
|
||
r = (unit * rounder((values * (divisor / float(unit))) / divisor) | ||
.astype('i8')) | ||
values = np.copy(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is a copy needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to handle the case where 'NaT' exists in the DateTimeIndex. Removing it will cause test_ceil_floor_edge
in test_scalar_compact.py
to fail.
I found that datetimelike.py self.hasnans
in _maybe_mask_results
will return False if we don't do the copy, by copying we ensure that we arn't referencing the base of the input array. Think that is the issue here, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jreback.
Thanks for the investigation. That sounds reasonable; but I am not too familiar with nan
/Nat
ops with respect to references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are mutating in place, so you DO need to copy.
that's fine. though pls use values.copy()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i haven't fully reviewd yet
pandas/_libs/tslibs/timestamps.pyx
Outdated
|
||
r = (unit * rounder((values * (divisor / float(unit))) / divisor) | ||
.astype('i8')) | ||
values = np.copy(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are mutating in place, so you DO need to copy.
that's fine. though pls use values.copy()
* Google Cloud Storage support using gcsfs
Removing the semicolon delimiter at the end of the modified line of code allows the line's output to be displayed.
* Add link to Pandas-GBQ 0.5.0 in what's new. * Remove unnecessary sleep in GBQ tests. Closes googleapis/python-bigquery-pandas#177 Closes #21627
thanks @alimcmaster1 |
thanks @jreback and @mroeschke for helping review! |
(cherry picked from commit 76ef7c4)
(cherry picked from commit 76ef7c4)
0.23.0
. #21262This change-set is to avoid rounding a timestamp when the timestamp is a multiple of the frequency string passed in.
"Values" param passed into round_ns can either be a np array or int. So relevant handling added for both.
FYI I havn't used Cython much before so keen to get peoples thoughts/feedback.
Thanks