Fix Timestamp rounding #21507

Merged
merged 114 commits into from
Jun 29, 2018

Conversation

alimcmaster1
Member

@alimcmaster1 alimcmaster1 commented Jun 15, 2018

This change-set is to avoid rounding a timestamp when the timestamp is a multiple of the frequency string passed in.

The "values" param passed into round_ns can be either an np array or an int, so relevant handling has been added for both.

FYI I haven't used Cython much before, so I'm keen to get people's thoughts/feedback.

Thanks
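
For illustration, the idea can be sketched on a single integer nanosecond value. This is a minimal sketch and not the code in this PR: `round_ns_scalar` and its `unit` argument (the frequency expressed in nanoseconds) are hypothetical names.

```python
import math


def round_ns_scalar(value, rounder, unit):
    """Round an integer nanosecond value to a multiple of `unit`.

    Hypothetical helper, for illustration only.
    """
    # An exact multiple needs no rounding; returning it as-is avoids
    # a lossy round trip through floating point.
    if value % unit == 0:
        return value
    # Otherwise fall back to float arithmetic, which cannot represent
    # every integer above 2**53 exactly.
    return unit * int(rounder(value / unit))


unit = 15 * 10**9              # 15 seconds in nanoseconds
on_grid = unit * 309_260_163   # already a multiple of 15s
off_grid = 3 * unit + 7        # slightly past a multiple

unchanged = round_ns_scalar(on_grid, math.floor, unit)
floored = round_ns_scalar(off_grid, math.floor, unit)
ceiled = round_ns_scalar(off_grid, math.ceil, unit)
```

The early return is the whole point: a value that already sits on the frequency grid never passes through a float.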

@pep8speaks

pep8speaks commented Jun 15, 2018

Hello @alimcmaster1! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 28, 2018 at 20:01 Hours UTC

@codecov

codecov bot commented Jun 16, 2018

Codecov Report

Merging #21507 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #21507   +/-   ##
=======================================
  Coverage    91.9%    91.9%           
=======================================
  Files         154      154           
  Lines       49555    49555           
=======================================
  Hits        45542    45542           
  Misses       4013     4013
Flag        Coverage Δ
#multiple   90.27% <ø> (ø) ⬆️
#single     42.03% <ø> (ø) ⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0f978d...2d0fa8b. Read the comment docs.

@mroeschke
Member

mroeschke commented Jun 16, 2018

A few thoughts:

  1. With your solution, I think it's better to make round_ns only take arrays (therefore wrap scalar inputs as array) to avoid checking the type of the input (and moreover cdefing them) and iterating over values. It appears the solution can still work by computing values % unit as a boolean array and operating with that. We do something similar with tz_localize_to_utc.

  2. Include tests for DatetimeIndex.round

  3. Looks like there are some linting errors (your spacing seems to be 2 instead of 4 in some places).
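
Point 1 could look roughly like the following. This is a minimal sketch, assuming `value` is the integer nanosecond value of a timestamp and `unit` is the frequency in nanoseconds; both names are placeholders, not taken from the PR.

```python
import numpy as np

unit = 15 * 10**9   # 15 seconds in nanoseconds (assumed frequency)
value = 3 * unit    # scalar input, e.g. a Timestamp's integer value

# Wrap the scalar so the rounding code only ever sees an int64 array.
values = np.array([value], dtype=np.int64)

# Boolean array: True where the entry is already a multiple of `unit`.
is_multiple = values % unit == 0
```

With this shape, the same code path serves both a single Timestamp and a whole DatetimeIndex, and no type check on the input is needed.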

@mroeschke mroeschke added the Datetime Datetime data dtype label Jun 16, 2018
@alimcmaster1
Member Author

Thanks @mroeschke for the comments. I've done (2) and (3), and added a few additional test cases for DatetimeIndex ceil/floor, plus one that clearly shows how round should behave for this bug.

  1. Yes, this definitely makes sense. Just to clarify: are you thinking that in _round we check the type of value and, if it's an int, wrap it in an np array, so that we can cdef round_ns?

Best,

Alistair

@mroeschke
Member

Rather have round_ns (which we can at best cpdef since it's imported by a python file) only accept numpy arrays. So since this method is used by timestamps, we would have to wrap its self.value input as a numpy array. The relevant rounding code for Timestamps would turn into:

def _round(self, freq, rounder):
    if self.tz is not None:
        value = self.tz_localize(None).value
    else:
        value = self.value
    value = np.array([value], dtype=np.int64)
    r = round_ns(value, rounder, freq)

dt = Timestamp(test_input)
expected = Timestamp(expected)

result_ceil = dt.ceil(freq)
Member

Could you also parametrize over the rounding methods (ceil, floor, and round)? It would help reduce this duplication
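
One possible shape for such a test is sketched below. The test name and the single parameter set are illustrative assumptions, not the test that was added in this PR.

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("rounder", ["ceil", "floor", "round"])
@pytest.mark.parametrize("test_input, freq", [
    ("2117-01-01 00:00:45", "15s"),
])
def test_round_exact_multiple(test_input, freq, rounder):
    dt = pd.Timestamp(test_input)
    # The input is already a multiple of `freq`, so every rounding
    # method should return it unchanged.
    result = getattr(dt, rounder)(freq)
    assert result == dt
```

Stacking the two `parametrize` decorators runs every input against all three rounding methods, so the test body is written only once.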

Member Author

Sure makes sense to me

@@ -72,30 +72,50 @@ def round_ns(values, rounder, freq):
-------
int or :obj:`ndarray`
"""
def _round_non_int_multiple(value):
Contributor

why don't you just have the int check inside here? this is convoluting the logic a lot


r = (unit * rounder((values * (divisor / float(unit))) / divisor)
.astype('i8'))
if type(values) is int:
Contributor

why is this not just another part of the if?

@Safrone

Safrone commented Jun 20, 2018

So it shows up on the issue properly: #21262

@alimcmaster1
Member Author

Cleaned up my implementation here

  1. Params added in test cases
  2. round_ns now only takes np.array (hence the cleaned-up logic in here)

Thoughts, @mroeschke?

return r
return r

return np.fromiter((_round_non_int_multiple(item) for item in values), np.int64)
Contributor

why are you iterating? this doesn’t make any sense to do with a vectorized function

"""
Applies rounding function at given frequency

Parameters
----------
values : int, :obj:`ndarray`
rounder : function
values : np.array
Member

I'd leave this as :obj:`ndarray`

freq : str, obj

Returns
-------
int or :obj:`ndarray`
np.array
Member

Same

value = np.array([value], dtype=np.int64)

# Will only ever contain 1 element for timestamp
r = round_ns(value, rounder, freq).item()
Member

Nit. I think just indexing into this array (i.e. round_ns(value, rounder, freq)[0]) is just fine. Looks like item returns a copy of a Python scalar (and we may want to keep this a numpy scalar just in case)

Member Author

Up to you but one advantage I did see of item() is that it will throw if the size of the array is > 1. We could do [0] and justify this by asserting len(r) == 1.

Member

self.value from Timestamp will always be a scalar, so we implicitly know the result of this will be a one-element array.
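
The trade-off discussed in this thread can be seen in a few lines of NumPy. This is a small sketch with made-up values; the variable names are placeholders.

```python
import numpy as np

r = np.array([45_000_000_000], dtype=np.int64)

as_python_int = r.item()   # plain Python int
as_numpy_scalar = r[0]     # stays a NumPy scalar (np.int64)

# .item() refuses arrays with more than one element ...
try:
    np.array([1, 2], dtype=np.int64).item()
    item_raised = False
except ValueError:
    item_raised = True

# ... whereas [0] silently returns the first element.
first = np.array([1, 2], dtype=np.int64)[0]
```

So `.item()` doubles as a size check, while `[0]` keeps the NumPy type and relies on the caller knowing the array has exactly one element.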

@mroeschke
Member

Thanks for the revision. As @jreback mentions, I don't think it's necessary to iterate. You should be able to perform the adjustments with vectorization. At a high level the logic should look like:

mask = values % unit == 0
if mask.all():
    return values
values[~mask] = _round_non_int_multiple(values[~mask])
return values
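
Filled in as runnable code, the outline above might look like this. It is a minimal sketch, not the merged implementation: `round_ns_sketch` is a hypothetical name, it takes `unit` directly in nanoseconds rather than a `freq` string, and the float-based rounding is inlined in place of `_round_non_int_multiple`.

```python
import numpy as np


def round_ns_sketch(values, rounder, unit):
    """Round int64 nanosecond values to multiples of `unit` (illustrative)."""
    # Work on a copy because the masked assignment below writes in place.
    values = values.copy()
    mask = values % unit == 0
    if mask.all():
        # Everything is already on the grid; nothing to round.
        return values
    # Only the off-grid entries go through float arithmetic.
    values[~mask] = (
        unit * rounder(values[~mask] / float(unit))
    ).astype('i8')
    return values


unit = 15 * 10**9  # 15 seconds in nanoseconds
original = np.array([3 * unit, 3 * unit + 7], dtype=np.int64)

floored = round_ns_sketch(original, np.floor, unit)
ceiled = round_ns_sketch(original, np.ceil, unit)
```

No Python-level loop is involved: the boolean mask selects the entries to round and the rounder is applied to them as one array operation.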

@alimcmaster1
Member Author

Thanks @mroeschke, your logic above seems much neater. Let me refactor!

values : int, :obj:`ndarray`
rounder : function
values : :obj:`ndarray`
rounder : function, eg. 'Ceil', 'Floor', 'round'
Member

Nit: Could you lowercase ceil and floor


r = (unit * rounder((values * (divisor / float(unit))) / divisor)
.astype('i8'))
values = np.copy(values)
Member

@mroeschke mroeschke Jun 25, 2018

Why is a copy needed here?

Member Author

This is to handle the case where 'NaT' exists in the DatetimeIndex. Removing it will cause test_ceil_floor_edge in test_scalar_compat.py to fail.

I found that self.hasnans in _maybe_mask_results (datetimelike.py) will return False if we don't do the copy. By copying, we ensure that we aren't referencing the base of the input array. I think that is the issue here. What do you think?

Member

cc @jreback.

Thanks for the investigation. That sounds reasonable; but I am not too familiar with nan/Nat ops with respect to references.

Contributor

you are mutating in place, so you DO need to copy.

that's fine. though pls use values.copy()
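
The aliasing problem can be reproduced in isolation. In this sketch the two functions are hypothetical and "set off-grid entries to zero" merely stands in for the real rounding step.

```python
import numpy as np


def zero_off_grid_in_place(values, unit):
    # Boolean-mask assignment writes into the array that was passed in.
    values[values % unit != 0] = 0
    return values


def zero_off_grid_copy(values, unit):
    values = values.copy()  # work on a private copy
    values[values % unit != 0] = 0
    return values


unit = 15
caller_a = np.array([15, 16, 30], dtype=np.int64)
caller_b = caller_a.copy()

zero_off_grid_in_place(caller_a, unit)  # caller_a is modified
zero_off_grid_copy(caller_b, unit)      # caller_b is left untouched
```

Without the copy, the caller's data changes as a side effect, which is the kind of surprise that can invalidate anything the caller later derives from the original values.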

Contributor

@jreback jreback left a comment

i haven't fully reviewed yet



jorisvandenbossche and others added 20 commits June 28, 2018 20:39
* Google Cloud Storage support using gcsfs
Removing the semicolon delimiter at the end of the modified line of code allows the line's output to be displayed.
* Add link to Pandas-GBQ 0.5.0 in what's new.
* Remove unnecessary sleep in GBQ tests.

Closes googleapis/python-bigquery-pandas#177

Closes #21627
@jreback jreback merged commit 76ef7c4 into pandas-dev:master Jun 29, 2018
@jreback
Contributor

jreback commented Jun 29, 2018

thanks @alimcmaster1

@alimcmaster1
Member Author

thanks @jreback and @mroeschke for helping review!

@alimcmaster1 alimcmaster1 deleted the timestamp-fixes branch July 1, 2018 13:12
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jul 2, 2018
jorisvandenbossche pushed a commit that referenced this pull request Jul 5, 2018
(cherry picked from commit 76ef7c4)
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Labels
Datetime Datetime data dtype Regression Functionality that used to work in a prior pandas version