Recognize timezoned labels when accessing dataframes. #17920
Changes from 51 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -557,6 +557,50 @@ We are stopping on the included end-point as it is part of the index | |
dft2 = dft2.swaplevel(0, 1).sort_index() | ||
dft2.loc[idx[:, '2013-01-05'], :] | ||
|
||
.. versionadded:: 0.21.1 | ||
|
||
Review comment: note that this behavior works with Timestamps or strings
Reply: actually also |
||
``DatetimeIndex`` partial string indexing can be used with naive datetime-like labels when the ``DatetimeIndex`` has no timezone set. | ||
If a timezone is provided by the label, the datetime index is assumed to be UTC and a ``UserWarning`` is emitted. | ||
|
||
.. note:: | ||
|
||
This works with both ``pd.Timestamp`` objects and strings. | ||
Review comment: I think this is a bit confusing here. This section is about "partial datetime string indexing", so for me it is confusing to mention Timestamp
Reply: Please talk to @jreback who suggested to mention it. Actually it also works for |
||
|
||
.. ipython:: python | ||
:okwarning: | ||
|
||
first_january_implicit_utc = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59', | ||
Review comment: Can you make this a much shorter index? (you only need the first 10 to show the actual behaviour)
Reply: It can be shortened but I would keep it a bit longer than the first 10 because of the comparison in the end.
Reply: Which comparison?
Reply: I thought (without carefully checking) that maybe in the end I will just compare two empty dataframes which will accidentally happen to be equal. To avoid such a false-positive test I thought having a bit longer df can be helpful.
Reply: These are the docs, not tests. And you perfectly control what you do in the example, so you can just make it a bit longer than needed for the slicing to see the effect. |
||
freq='T') | ||
|
||
df = pd.DataFrame(index=first_january_implicit_utc, | ||
data=np.arange(len(first_january_implicit_utc))) | ||
|
||
df | ||
|
||
four_minute_slice = df["2016-01-01T00:00-02:00":"2016-01-01T02:03"] | ||
Review comment: This is actually not an example of partial datetime string indexing. The dataframe index has a frequency of minutes, and you provide strings with a minute resolution
Reply: Yes, you are right. What is the consequence in your eyes? I just want the timezones to work, that is my only desire.
Reply: There is no consequence for the behaviour, so this PR will fix your use case. But for the example in the docs, we should make a clear one. So either I would make this actual partial slicing, or move this section to somewhere else
Reply: Then better move it because the timezones cannot always be parsed, e.g. for months UTC will still be assumed as it goes through another path.
Reply: No, you can just edit the example a little bit. For example keep the minute resolution, and use strings with only hours (instead of the minutes now, and that still provides the ability to specify a time zone), or change the resolution of the df to seconds, and keep the strings as they are. Note you can do e.g. each 30s to avoid that selecting some minutes results in many rows. |
||
|
||
four_minute_slice | ||
|
||
|
||
``DatetimeIndex`` partial string indexing is always well-defined on a ``DatetimeIndex`` with timezone information. | ||
If a timezone is provided by the label, that timezone is respected. | ||
If no timezone is provided, then the same timezone as used in the ``DatetimeIndex`` is assumed. | ||
|
||
.. ipython:: python | ||
|
||
Review comment: this will show the warning in the docs, so use
Reply: done |
||
first_january_cet = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59', | ||
freq='T', tz="CET") | ||
|
||
df = pd.DataFrame(index=first_january_cet, | ||
data=np.arange(len(first_january_cet))) | ||
|
||
df | ||
|
||
four_minute_slice = df["2016-01-01T00:00-01:00":"2016-01-01T02:03"] | ||
|
||
four_minute_slice | ||
|
||
|
||
.. _timeseries.slice_vs_exact_match: | ||
|
||
Slice vs. Exact Match | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1311,14 +1311,28 @@ def _parsed_string_to_bounds(self, reso, parsed): | |
---------- | ||
reso : Resolution | ||
Resolution provided by parsed string. | ||
parsed : datetime | ||
parsed : datetime or object | ||
Datetime from parsed string. | ||
|
||
Returns | ||
------- | ||
lower, upper: pd.Timestamp | ||
|
||
""" | ||
parsed = Timestamp(parsed) | ||
if self.tz is None: | ||
if parsed.tz is None: # both are naive, nothing to do | ||
pass | ||
else: # naive datetime index but label provides timezone | ||
warnings.warn("Access naive datetime index with a label " | ||
"containing a timezone, assume UTC") | ||
parsed = parsed.tz_convert(utc) | ||
else: | ||
if parsed.tz is None: # treat like in same timezone | ||
parsed = parsed.tz_localize(self.tz) | ||
Review comment: This case already worked before AFAIK, do you know why this is needed? (although the code seems logical)
Reply: The code is not necessarily needed when it is done somewhere else |
||
else: # actual timezone of the label should be considered | ||
parsed = parsed.tz_convert(tz=self.tz) | ||
|
||
if reso == 'year': | ||
return (Timestamp(datetime(parsed.year, 1, 1), tz=self.tz), | ||
Timestamp(datetime(parsed.year, 12, 31, 23, | ||
|
@@ -1364,7 +1378,7 @@ def _parsed_string_to_bounds(self, reso, parsed): | |
st = datetime(parsed.year, parsed.month, parsed.day, | ||
parsed.hour, parsed.minute, parsed.second, | ||
parsed.microsecond) | ||
return (Timestamp(st, tz=self.tz), Timestamp(st, tz=self.tz)) | ||
return Timestamp(st, tz=self.tz), Timestamp(st, tz=self.tz) | ||
Review comment: Since you're already making edits here there's a small bug-like that might be worth fixing. The day, hour, minute, and second cases don't have
Reply: Thanks for pointing that out! By now the timezone is like ignored (or at least not considered well enough) in the later stages but maybe one day that will change. |
||
else: | ||
raise KeyError | ||
|
||
|
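
The four timezone branches added to `_parsed_string_to_bounds` above can be sketched in isolation. This is a minimal illustration using only the standard library, not the pandas implementation: `normalize_label` is a hypothetical helper name, `datetime.astimezone` and `datetime.replace` stand in for `Timestamp.tz_convert` and `Timestamp.tz_localize`, and a fixed UTC+01:00 offset is assumed as a stand-in for CET in January.

```python
import warnings
from datetime import datetime, timedelta, timezone


def normalize_label(parsed, index_tz):
    """Mirror the four branches of the diff (hypothetical helper).

    parsed   -- datetime parsed from the label, naive or timezone-aware
    index_tz -- tzinfo of the index, or None for a naive index
    """
    if index_tz is None:
        if parsed.tzinfo is None:
            # Both naive: nothing to do.
            return parsed
        # Naive index, timezone-aware label: assume the index is UTC and warn.
        warnings.warn("Access naive datetime index with a label "
                      "containing a timezone, assume UTC", UserWarning)
        return parsed.astimezone(timezone.utc)
    if parsed.tzinfo is None:
        # Aware index, naive label: treat the label as index-local time.
        return parsed.replace(tzinfo=index_tz)
    # Both aware: convert the label into the index timezone.
    return parsed.astimezone(index_tz)


# Assumption: fixed offset standing in for CET (UTC+01:00 in winter).
cet = timezone(timedelta(hours=1))

aware_label = datetime.fromisoformat("2016-01-01T00:00-01:00")
naive_label = datetime.fromisoformat("2016-01-01T02:03")

lower = normalize_label(aware_label, cet)  # 01:00 UTC expressed at +01:00
upper = normalize_label(naive_label, cet)  # label taken as index-local time
print(lower.isoformat(), upper.isoformat())
```

With these inputs both bounds end up in the index timezone, at 02:00 and 02:03, which corresponds to the `four_minute_slice` name used in the documentation example.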
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -236,7 +236,8 @@ def test_stringified_slice_with_tz(self): | |
start = datetime.datetime.now() | ||
idx = DatetimeIndex(start=start, freq="1d", periods=10) | ||
df = DataFrame(lrange(10), index=idx) | ||
df["2013-01-14 23:44:34.437768-05:00":] # no exception here | ||
with tm.assert_produces_warning(UserWarning): | ||
df["2013-01-14 23:44:34.437768-05:00":] # no exception here | ||
Review comment: I would remove the
Reply: I am not sure whether that is a good idea. That comment is not mine and it is not related to my code. The proposed refactoring is beyond the scope of this pull request. |
||
|
||
def test_append_join_nondatetimeindex(self): | ||
rng = date_range('1/1/2000', periods=10) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -637,3 +637,45 @@ def test_partial_set_empty_frame_empty_consistencies(self): | |
df.loc[0, 'x'] = 1 | ||
expected = DataFrame(dict(x=[1], y=[np.nan])) | ||
tm.assert_frame_equal(df, expected, check_dtype=False) | ||
|
||
Review comment: parametrize both of these tests with strings & with Timestamp for the indexers
Reply: either I do not understand you correctly or re-read the tests. I already both use string labels and |
||
def test_access_timezoned_datetimeindex_with_timezoned_label(self): | ||
|
||
# GH 6785 | ||
Review comment: I think this is an incorrect issue number
Reply: oh let me have a look how that could happen |
||
# timezone was ignored when string was provided as a label | ||
|
||
first_january = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59', | ||
Review comment: I would really like to parametrize these to avoid the code repetition. So I think you can do it with 2 test functions, one which slices and compares with an expected, and the 2nd function which checks for the warnings (you can actually do it with one if you add some more parameters), something like
Reply: I think this contradicts the idea of having the
Reply: You should be able to write the different strings so that they give the same expected frame I think |
||
freq='T', tz="UTC") | ||
df = pd.DataFrame(index=first_january, data=np.arange(len( | ||
first_january))) | ||
|
||
result = df[ | ||
"2016-01-01T00:00-02:00":"2016-01-01T02:03" | ||
] | ||
Review comment: Can you put this all on a single line? |
||
|
||
expected = df[ | ||
pd.Timestamp("2016-01-01T00:00-02:00"): | ||
pd.Timestamp("2016-01-01T02:03") | ||
] | ||
|
||
tm.assert_frame_equal(result, expected) | ||
Review comment: Can you assert both results (with strings or with Timestamps) with an independently constructed one? (eg
Reply: I am not sure what you mean with
Reply: I mean to create 'expected' with something like df.iloc[120:124] (but then with the correct numbers) |
||
|
||
def test_access_naive_datetimeindex_with_timezoned_label(self): | ||
|
||
# GH 6785 | ||
# timezone was ignored when string was provided as a label | ||
# this test is for completeness | ||
|
||
first_january = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59', | ||
freq='T') | ||
df = pd.DataFrame(index=first_january, data=np.arange(len( | ||
first_january))) | ||
|
||
with tm.assert_produces_warning(UserWarning): | ||
result = df["2016-01-01T00:00-02:00":"2016-01-01T02:03"] | ||
|
||
expected = df[ | ||
pd.Timestamp("2016-01-01T00:00-02:00"): | ||
pd.Timestamp("2016-01-01T02:03") | ||
] | ||
|
||
tm.assert_frame_equal(expected, result) |
Review comment: add a sub-section label here (with a ref), call it something like "slicing with timezones".
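
The `df.iloc[120:124]` suggestion from the review can be checked with plain arithmetic. This is a minimal sketch, assuming the minute-frequency UTC index from `test_access_timezoned_datetimeindex_with_timezoned_label` and assuming, as the documentation text in this PR states, that a naive label takes the timezone of the index; `position` is a hypothetical helper, not a pandas function.

```python
from datetime import datetime, timedelta, timezone

# Start of the index in the test: 2016-01-01T00:00 UTC, one row per minute.
index_start = datetime(2016, 1, 1, 0, 0, tzinfo=timezone.utc)


def position(label, index_tz=timezone.utc):
    """Return the integer row position of an ISO label (hypothetical helper)."""
    parsed = datetime.fromisoformat(label)
    if parsed.tzinfo is None:
        # Naive label: assume it is expressed in the index timezone.
        parsed = parsed.replace(tzinfo=index_tz)
    return int((parsed - index_start) / timedelta(minutes=1))


start = position("2016-01-01T00:00-02:00")  # label with an explicit offset
stop = position("2016-01-01T02:03")         # naive label, taken as UTC here
# Label slices include the end point, so the positional stop is one past it.
print(f"df.iloc[{start}:{stop + 1}]")
```

Under these assumptions the bounds fall at positions 120 and 123, so an independently constructed `expected` would be the four rows of `df.iloc[120:124]`.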