Skip to content

implement _tz_convert_dst for de-duplication #21730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 6, 2018
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
141 changes: 74 additions & 67 deletions pandas/_libs/tslibs/conversion.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -608,9 +608,66 @@ cpdef inline datetime localize_pydatetime(datetime dt, object tz):
# ----------------------------------------------------------------------
# Timezone Conversion

cdef inline int64_t[:] _tz_convert_dst(ndarray[int64_t] values, tzinfo tz,
bint to_utc=True):
"""
tz_convert for non-UTC non-tzlocal cases where we have to check
DST transitions pointwise.

Parameters
----------
values : ndarray[int64_t]
tz : tzinfo
to_utc : bool
True if converting _to_ UTC, False if converting _from_ utc

Returns
-------
result : ndarray[int64_t]
"""
cdef:
Py_ssize_t n = len(values)
Py_ssize_t i, j, pos
ndarray[int64_t] result = np.empty(n, dtype=np.int64)
ndarray[int64_t] tt, trans, deltas
ndarray[Py_ssize_t] posn
int64_t v

trans, deltas, typ = get_dst_info(tz)
if not to_utc:
# We add `offset` below instead of subtracting it
deltas = -1 * deltas

tt = values[values != NPY_NAT]
if not len(tt):
# if all NaT, return all NaT
return values

posn = trans.searchsorted(tt, side='right')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other places where get_dst_info is used don't do this tt thing.

_normalize_local, period.localize_dt64_arr_to_period, resolution._reso_local all do:

trans, deltas, typ = get_dst_info(tz)
_pos = trans.searchsorted(stamps, side='right') - 1
 if _pos.dtype != np.int64:
        _pos = _pos.astype(np.int64)
pos = _pos

is_date_array_normalized and tslib.ints_to_pydatetime makes searchsorted calls pointwise

trans, deltas, typ = get_dst_info(tz)
for i in range(n):
    pos = trans.searchsorted(stamps[i]) - 1

I'm checking to see if anything breaks if we change it here to match the others.

Perf Considerations:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the values != NPY_NAT check we could pass values as an int64_t[:] which I think would be more performant.

why? NPY_NAT is i8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why you can search an array of timestamps to searchsorted here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? NPY_NAT is i8

AFAICT int64_t[:] doesn't support __eq__. You can basically iterate through it and little else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why you can search an array of timestamps to searchsorted here?

I don't understand the question. trans.searchsorted(x) assumes (doesn't check) that trans is sorted, doesn't care about x (which can be array or scalar)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no what i mean is that why can you do trans.searchsorted(array_of_time_statmps) rather than iteratiing?


j = 0
for i in range(n):
v = values[i]
if v == NPY_NAT:
result[i] = v
else:
pos = posn[j] - 1
j += 1
if pos < 0:
raise ValueError('First time before start of DST info')
result[i] = v - deltas[pos]

return result


cdef inline int64_t _tz_convert_tzlocal_utc(int64_t val, tzinfo tz,
bint to_utc=True):
"""
Convert the i8 representation of a datetime from a tzlocal timezone to
UTC, or vice-versa.

Private, not intended for use outside of tslibs.conversion

Parameters
----------
val : int64_t
Expand Down Expand Up @@ -672,6 +729,7 @@ cpdef int64_t tz_convert_single(int64_t val, object tz1, object tz2):
Py_ssize_t pos
int64_t v, offset, utc_date
pandas_datetimestruct dts
ndarray[int64_t] arr # TODO: Is there a lighter-weight way to do this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int64_t[:] arr is a memory view, should work

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATM that doesn't work because _tz_convert_dst takes an ndarray[int64_t] and not a memoryview. I'm hopeful that down the road that can be changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, ok to make that change here if you can

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, ok to make that change here if you can

Not an option yet. To do this we need to get rid of the values != NPY_NAT check in _tz_convert_dst, which entails a non-trivial redesign of that function. Need to a) make sure it doesn't break things and b) check perf.

ATM the options are either wrap in a 1-element array or keep duplicated code.


# See GH#17734 We should always be converting either from UTC or to UTC
assert (is_utc(tz1) or tz1 == 'UTC') or (is_utc(tz2) or tz2 == 'UTC')
Expand All @@ -683,29 +741,23 @@ cpdef int64_t tz_convert_single(int64_t val, object tz1, object tz2):
if is_tzlocal(tz1):
utc_date = _tz_convert_tzlocal_utc(val, tz1, to_utc=True)
elif get_timezone(tz1) != 'UTC':
trans, deltas, typ = get_dst_info(tz1)
pos = trans.searchsorted(val, side='right') - 1
if pos < 0:
raise ValueError('First time before start of DST info')
offset = deltas[pos]
utc_date = val - offset
arr = np.array([val])
utc_date = _tz_convert_dst(arr, tz1, to_utc=True)[0]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dispatching by wrapping in an ndarray is not my favorite, but I consider the de-duplication worth it.

I'm hopeful that there's a way to make the argument an int64_t[:] which should have less overhead, but that is not yet an option (see comments above on _tz_convert_dst)

else:
utc_date = val

if get_timezone(tz2) == 'UTC':
return utc_date
elif is_tzlocal(tz2):
return _tz_convert_tzlocal_utc(utc_date, tz2, to_utc=False)

# Convert UTC to other timezone
trans, deltas, typ = get_dst_info(tz2)

pos = trans.searchsorted(utc_date, side='right') - 1
if pos < 0:
raise ValueError('First time before start of DST info')

offset = deltas[pos]
return utc_date + offset
else:
# Convert UTC to other timezone
arr = np.array([utc_date])
# Note: at least with cython 0.28.3, doing a looking `[0]` in the next
# line is sensitive to the declared return type of _tz_convert_dst;
# if it is declared as returning ndarray[int64_t], a compile-time error
# is raised.
return _tz_convert_dst(arr, tz2, to_utc=False)[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could do an assignment here that then specifies the type

liek

arr = _tz_convert.....
cdef int64_t result = arr[0]
return result

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does that help in this context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then the return type is well defined (was reading your comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh gotcha. Fortunately the version here where _tz_convert_dst returns a memoryview is preferred anyway. I could remove the comment, but it did take me a few rounds to figure out that was where the problem was coming from.



@cython.boundscheck(False)
Expand Down Expand Up @@ -746,70 +798,25 @@ def tz_convert(ndarray[int64_t] vals, object tz1, object tz2):
else:
utc_dates[i] = _tz_convert_tzlocal_utc(v, tz1, to_utc=True)
else:
trans, deltas, typ = get_dst_info(tz1)

# all-NaT
tt = vals[vals != NPY_NAT]
if not len(tt):
return vals

posn = trans.searchsorted(tt, side='right')
j = 0
for i in range(n):
v = vals[i]
if v == NPY_NAT:
utc_dates[i] = NPY_NAT
else:
pos = posn[j] - 1
j = j + 1
if pos < 0:
raise ValueError('First time before start of DST info')
offset = deltas[pos]
utc_dates[i] = v - offset
utc_dates = np.array(_tz_convert_dst(vals, tz1, to_utc=True))
else:
utc_dates = vals

if get_timezone(tz2) == 'UTC':
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conditions are in a different order for the to_utc block above vs this from_utc block, but with _tz_convert_dst and _tz_convert_tzlocal_utc, the blocks are now identical up to an argument change to_utc True/False. After this PR, both tz_convert and tz_convert_single can be cut in half.

return utc_dates

result = np.zeros(n, dtype=np.int64)
if is_tzlocal(tz2):
elif is_tzlocal(tz2):
result = np.zeros(n, dtype=np.int64)
for i in range(n):
v = utc_dates[i]
if v == NPY_NAT:
result[i] = NPY_NAT
else:
result[i] = _tz_convert_tzlocal_utc(v, tz2, to_utc=False)
return result

# Convert UTC to other timezone
trans, deltas, typ = get_dst_info(tz2)

# use first non-NaT element
# if all-NaT, return all-NaT
if (result == NPY_NAT).all():
return result

# if all NaT, return all NaT
tt = utc_dates[utc_dates != NPY_NAT]
if not len(tt):
return utc_dates

posn = trans.searchsorted(tt, side='right')

j = 0
for i in range(n):
v = utc_dates[i]
if vals[i] == NPY_NAT:
result[i] = vals[i]
else:
pos = posn[j] - 1
j = j + 1
if pos < 0:
raise ValueError('First time before start of DST info')
offset = deltas[pos]
result[i] = v + offset
return result
else:
# Convert UTC to other timezone
return np.array(_tz_convert_dst(utc_dates, tz2, to_utc=False))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to explicitly wrap in np.array bc _tz_convert_dst returns an int64_t[:].



# TODO: cdef scalar version to call from convert_str_to_tsobject
Expand Down