Skip to content

implement _tz_convert_dst for de-duplication #21730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 6, 2018

Conversation

jbrockmendel
Copy link
Member

Analogous to #21727, this handles the get_dst_info case whereas that handled the tzlocal case.

Fleshes out docstring as discussed in #21727

All the other uses of get_dst_info are handled slightly differently. Need to figure out whether those are intentional to see if more de-duplication (or perf) is available.

offset = deltas[pos]
utc_date = val - offset
arr = np.array([val])
utc_date = _tz_convert_dst(arr, tz1, to_utc=True)[0]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dispatching by wrapping in an ndarray is not my favorite, but I consider the de-duplication worth it.

I'm hopeful that there's a way to make the argument an int64_t[:] which should have less overhead, but that is not yet an option (see comments above on _tz_convert_dst)

raise ValueError('First time before start of DST info')
offset = deltas[pos]
utc_dates[i] = v - offset
utc_dates = np.array(_tz_convert_dst(vals, tz1, to_utc=True))
else:
utc_dates = vals

if get_timezone(tz2) == 'UTC':
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conditions are in a different order for the to_utc block above vs this from_utc block, but with _tz_convert_dst and _tz_convert_tzlocal_utc, the blocks are now identical up to an argument change to_utc True/False. After this PR, both tz_convert and tz_convert_single can be cut in half.

return result
else:
# Convert UTC to other timezone
return np.array(_tz_convert_dst(utc_dates, tz2, to_utc=False))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to explicitly wrap in np.array bc _tz_convert_dst returns an int64_t[:].

@@ -672,6 +729,7 @@ cpdef int64_t tz_convert_single(int64_t val, object tz1, object tz2):
Py_ssize_t pos
int64_t v, offset, utc_date
pandas_datetimestruct dts
ndarray[int64_t] arr # TODO: Is there a lighter-weight way to do this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int64_t[:] arr is a memory view, should work

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATM that doesn't work because _tz_convert_dst takes an ndarray[int64_t] and not a memoryview. I'm hopeful that down the road that can be changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, ok to make that change here if you can

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, ok to make that change here if you can

Not an option yet. To do this we need to get rid of the values != NPY_NAT check in _tz_convert_dst, which entails a non-trivial redesign of that function. Need to a) make sure it doesn't break things and b) check perf.

ATM the options are either wrap in a 1-element array or keep duplicated code.

# line is sensitive to the declared return type of _tz_convert_dst;
# if it is declared as returning ndarray[int64_t], a compile-time error
# is raised.
return _tz_convert_dst(arr, tz2, to_utc=False)[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could do an assignment here that then specifies the type

liek

arr = _tz_convert.....
cdef int64_t result = arr[0]
return result

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does that help in this context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then the return type is well defined (was reading your comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh gotcha. Fortunately the version here where _tz_convert_dst returns a memoryview is preferred anyway. I could remove the comment, but it did take me a few rounds to figure out that was where the problem was coming from.

# if all NaT, return all NaT
return values

posn = trans.searchsorted(tt, side='right')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other places where get_dst_info is used don't do this tt thing.

_normalize_local, period.localize_dt64_arr_to_period, resolution._reso_local all do:

trans, deltas, typ = get_dst_info(tz)
_pos = trans.searchsorted(stamps, side='right') - 1
 if _pos.dtype != np.int64:
        _pos = _pos.astype(np.int64)
pos = _pos

is_date_array_normalized and tslib.ints_to_pydatetime makes searchsorted calls pointwise

trans, deltas, typ = get_dst_info(tz)
for i in range(n):
    pos = trans.searchsorted(stamps[i]) - 1

I'm checking to see if anything breaks if we change it here to match the others.

Perf Considerations:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the values != NPY_NAT check we could pass values as an int64_t[:] which I think would be more performant.

why? NPY_NAT is i8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why you can search an array of timestamps to searchsorted here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? NPY_NAT is i8

AFAICT int64_t[:] doesn't support __eq__. You can basically iterate through it and little else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why you can search an array of timestamps to searchsorted here?

I don't understand the question. trans.searchsorted(x) assumes (doesn't check) that trans is sorted, doesn't care about x (which can be array or scalar)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no what i mean is that why can you do trans.searchsorted(array_of_time_statmps) rather than iteratiing?

@codecov
Copy link

codecov bot commented Jul 4, 2018

Codecov Report

Merging #21730 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21730   +/-   ##
=======================================
  Coverage   91.92%   91.92%           
=======================================
  Files         158      158           
  Lines       49705    49705           
=======================================
  Hits        45693    45693           
  Misses       4012     4012
Flag Coverage Δ
#multiple 90.3% <ø> (ø) ⬆️
#single 41.99% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1070976...522ec03. Read the comment docs.

@gfyoung gfyoung added Timezones Timezone data dtype Clean labels Jul 5, 2018
@jreback jreback added this to the 0.24.0 milestone Jul 6, 2018
@jreback jreback merged commit 3273309 into pandas-dev:master Jul 6, 2018
@jbrockmendel jbrockmendel deleted the tz_convert_dst branch July 6, 2018 14:30
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants