Recognize timezoned labels when accessing dataframes. #17920

Closed
Changes from 51 commits
54 commits
4671aeb
Recognize timezoned labels when accessing dataframes.
1kastner Oct 19, 2017
2297833
Merge branch 'master' of https://github.com/pandas-dev/pandas into er…
Oct 31, 2017
69b517e
Make `test_access_datetimeindex_with_timezoned_label` PEP08 compliant.
Oct 31, 2017
6532e76
add translate function for converting time zones.
Oct 31, 2017
c354271
Move NaT to self-contained module (#18014)
jbrockmendel Nov 1, 2017
a9202fb
Separate out arithmetic tests for datetimelike indexes (#18049)
jbrockmendel Nov 1, 2017
88bf001
Adding skip to test failing because of lxml import (#17747) (#17748)
datapythonista Nov 1, 2017
7d8c9ab
a zillion flakes (#18046)
jbrockmendel Nov 1, 2017
1310680
TST: separate out grouping-type tests (#18057)
jreback Nov 1, 2017
46d9416
BUG: DataFrame.groupby() interprets tuple as list of keys
GuessWhoSamFoo Nov 1, 2017
c8a604e
CLN: some lint issues
jreback Nov 1, 2017
de7a065
read_html(): rewinding [wip] (#18017)
LiamIm Nov 1, 2017
7c0a3be
CI: temp disable scipy on windows 3.6 build (#18078)
jreback Nov 2, 2017
8844b2e
DOC: Remove duplicate 'in' from contributing.rst (#18040) (#18076)
mattayes Nov 2, 2017
62695a2
improve test output for Categoricals (#18069)
topper-123 Nov 2, 2017
7691209
MAINT: Remove np.array_equal calls in tests (#18047)
gfyoung Nov 2, 2017
edad476
Move scalar arithmetic tests to tests.scalars (#18075)
jbrockmendel Nov 2, 2017
bd958a1
Update Contributing Environment section (#18052)
TomAugspurger Nov 2, 2017
ef9a06c
Index tests in the wrong places (#18074)
jbrockmendel Nov 2, 2017
ba279c0
Move comparison utilities to np_datetime; (#18080)
jbrockmendel Nov 2, 2017
2a31f7b
Separate _TSObject into conversion (#18060)
jbrockmendel Nov 2, 2017
aa5ea0f
Port Timedelta implementation to tslibs.timedeltas (#17937)
jbrockmendel Nov 3, 2017
4bfbca9
COMPAT: compare platform return on 32-bit (#18090)
jreback Nov 3, 2017
dd761d3
Fix 18068: Updates merge_asof error, now outputs datatypes (#18082)
manrajgrover Nov 3, 2017
a6353dd
TST: Add regression test for empty DataFrame groupby (#18097)
Licht-T Nov 4, 2017
c440981
BUG: Fix the error when reading the compressed UTF-16 file (#18091)
Licht-T Nov 4, 2017
2c3faad
BUG: Implement PeriodEngine to fix PeriodIndex truncate bug (#17755)
Licht-T Nov 4, 2017
fff48bb
standardize indentation, arrange in allphabetical order (#18104)
jbrockmendel Nov 4, 2017
00f61bb
BLD: Make sure to copy ZIP files for parser tests (#18108)
gfyoung Nov 4, 2017
69a3b06
Revert "CI: temp disable scipy on windows 3.6 build (#18078)" (#18105)
jreback Nov 4, 2017
ffd363b
Masking and overflow checks for datetimeindex and timedeltaindex ops …
jbrockmendel Nov 4, 2017
8587a3d
BUG: Override mi-columns in to_csv if requested (#18110)
gfyoung Nov 5, 2017
763b5f7
fix failing tests.
1kastner Nov 5, 2017
9456b77
Merge branch 'error-on-non-naive-datetime-strings' of https://github.…
1kastner Nov 5, 2017
fd49175
Rewrite naive/timezone matrix condition, Improve test cases
1kastner Nov 5, 2017
d944bfd
adjust as it was before (un-done changes)
1kastner Nov 5, 2017
1641bf2
Add tz keyword.
1kastner Nov 5, 2017
f12caa1
Merge branch 'master' of https://github.com/pandas-dev/pandas into er…
1kastner Nov 13, 2017
1a3ab3b
Apply suggestions of review
Nov 13, 2017
31ef655
refactor: replace _utc() with utc
Nov 13, 2017
edfd895
fix flake8 issues
1kastner Nov 13, 2017
fbf8a1c
Merge branch 'master' of https://github.com/pandas-dev/pandas into er…
1kastner Nov 13, 2017
9f0dc5d
replace datetime.datetime with pd.Timestamp
Nov 14, 2017
817bfef
Merge branch 'master' of https://github.com/pandas-dev/pandas into er…
1kastner Nov 16, 2017
5c11e02
Add whatsnew and documentation.
1kastner Nov 16, 2017
6a218e5
Fix variable name in documentation
1kastner Nov 16, 2017
577d742
Apply review suggestions.
1kastner Nov 17, 2017
931b7f9
Merge branch 'master' of https://github.com/pandas-dev/pandas into er…
1kastner Nov 17, 2017
02aa59f
Merge branch 'master' of https://github.com/pandas-dev/pandas into er…
1kastner Nov 17, 2017
0e4c499
Move change to bug and rename into result and expected
1kastner Nov 23, 2017
16fe3c3
Merge branch 'master' of https://github.com/pandas-dev/pandas into er…
1kastner Nov 23, 2017
5724292
Add CET timezoned datetime index as another test case
1kastner Nov 26, 2017
a4f3a5c
Merge branch 'master' of https://github.com/pandas-dev/pandas into er…
1kastner Nov 26, 2017
8a2176d
Adjust for flake8
Nov 26, 2017
44 changes: 44 additions & 0 deletions doc/source/timeseries.rst
@@ -557,6 +557,50 @@ We are stopping on the included end-point as it is part of the index
dft2 = dft2.swaplevel(0, 1).sort_index()
dft2.loc[idx[:, '2013-01-05'], :]

.. versionadded:: 0.21.1
Contributor

add a sub-section label here (with a ref), call it something like slicing with timezones.


Contributor

note that this behavior works with Timestamps or strings

Author

actually also datetime.datetime as pd.Timestamp can also digest that. Should it be a note like I did now?

``DatetimeIndex`` partial string indexing can be used with naive datetime-like labels when the ``DatetimeIndex`` has no timezone set.
If a timezone is provided by the label, the datetime index is assumed to be UTC and a ``UserWarning`` is emitted.

.. note::

   This works with both ``pd.Timestamp`` and strings.
Member

I think this is a bit confusing here. This section is about "partial datetime string indexing", so for me it is confusing to mention Timestamp

Author

Please talk to @jreback who suggested to mention it. Actually it also works for datetime.datetime.


.. ipython:: python
   :okwarning:

   first_january_implicit_utc = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59',
Member

Can you make this a much shorter index? (you only need the first 10 to show the actual behaviour)
I would also try to use a shorter variable name here (eg idx_naive)

Author

It can be shortened but I would keep it a bit longer than the first 10 because of the comparison in the end.

Member

Which comparison?

Author

I thought (without carefully checking) that in the end I might just compare two empty dataframes, which would accidentally happen to be equal. To avoid such a false-positive test, I thought a slightly longer df could be helpful.

Member

These are the docs, not tests. And you fully control what you do in the example, so you can just make it a bit longer than needed for the slicing to see the effect.

                                              freq='T')

   df = pd.DataFrame(index=first_january_implicit_utc,
                     data=np.arange(len(first_january_implicit_utc)))

   df

   four_minute_slice = df["2016-01-01T00:00-02:00":"2016-01-01T02:03"]
Member

This is actually not an example of partial datetime string indexing. The dataframe index has a frequency of minutes, and you provide strings with a minute resolution

Author

Yes you are right. What is the consequence in your eyes? I just want the timezones to work, that is my only desire.

Member

@jorisvandenbossche, Nov 27, 2017

There is no consequence for the behaviour, so this PR will fix your use case. But for the example in the docs, we should make a clear one. So I would either make this actual partial slicing, or move this section somewhere else.

Author

Then better move it because the timezones can not always be parsed, e.g. for months still UTC will be assumed as it goes through another path.

Member

No, you can just edit the example a little bit. For example keep the minute resolution, and use strings with only hours (instead of the minutes now, and that still provides ability to specify time zone), or change the resolution of the df to seconds, and keep the strings as they are. Note you can do eg each 30s to avoid that selecting some minutes results in many rows.


   four_minute_slice


``DatetimeIndex`` partial string indexing is always well-defined on a ``DatetimeIndex`` with timezone information.
If a timezone is provided by the label, that timezone is respected.
If no timezone is provided, then the same timezone as used in the ``DatetimeIndex`` is assumed.

.. ipython:: python

Contributor

this will show the warning in the docs, so use :okwarning:

Author

done

   first_january_cet = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59',
                                     freq='T', tz="CET")

   df = pd.DataFrame(index=first_january_cet,
                     data=np.arange(len(first_january_cet)))

   df

   four_minute_slice = df["2016-01-01T00:00-01:00":"2016-01-01T02:03"]

   four_minute_slice
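
For readers checking the arithmetic of the CET example above, here is a minimal standard-library sketch (not pandas code) of how the two labels map onto the index timezone. It assumes CET can be modelled as a fixed UTC+1 offset, which holds for January; `CET_WINTER` is a name introduced here for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+1 offset, valid for CET in January (no DST).
CET_WINTER = timezone(timedelta(hours=1))

# A label that carries an offset is converted into the index timezone.
start = datetime.fromisoformat("2016-01-01T00:00-01:00").astimezone(CET_WINTER)

# A label without an offset is read as wall time in the index timezone.
stop = datetime.fromisoformat("2016-01-01T02:03").replace(tzinfo=CET_WINTER)

# start.isoformat() -> '2016-01-01T02:00:00+01:00'
# stop.isoformat()  -> '2016-01-01T02:03:00+01:00'

# Label slicing includes both endpoints, hence the "+ 1".
minutes_selected = int((stop - start) / timedelta(minutes=1)) + 1
```

Both endpoints are inclusive, so the slice covers 02:00 through 02:03 CET, which is where the name `four_minute_slice` comes from.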


.. _timeseries.slice_vs_exact_match:

Slice vs. Exact Match
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v0.21.1.txt
@@ -64,7 +64,8 @@ Conversion
- Bug in :class:`DatetimeIndex` subtracting datetimelike from DatetimeIndex could fail to overflow (:issue:`18020`)
- Bug in :meth:`IntervalIndex.copy` when copying an ``IntervalIndex`` with non-default ``closed`` (:issue:`18339`)
- Bug in :func:`DataFrame.to_dict` where columns of datetime that are tz-aware were not converted to required arrays when used with ``orient='records'``, raising ``TypeError`` (:issue:`18372`)
-
- Bug in :class:`DatetimeIndex` when partial string label indices are actually timezone aware (:issue:`16785`)

-

Indexing
18 changes: 16 additions & 2 deletions pandas/core/indexes/datetimes.py
@@ -1311,14 +1311,28 @@ def _parsed_string_to_bounds(self, reso, parsed):
----------
reso : Resolution
Resolution provided by parsed string.
-        parsed : datetime
+        parsed : datetime or object
Datetime from parsed string.

Returns
-------
lower, upper: pd.Timestamp

"""
        parsed = Timestamp(parsed)
        if self.tz is None:
            if parsed.tz is None:  # both are naive, nothing to do
                pass
            else:  # naive datetime index but label provides timezone
                warnings.warn("Access naive datetime index with a label "
                              "containing a timezone, assume UTC")
                parsed = parsed.tz_convert(utc)
        else:
            if parsed.tz is None:  # treat like in same timezone
                parsed = parsed.tz_localize(self.tz)
Member

This case already worked before AFAIK, do you know why this is needed? (although the code seems logical)

Author

The code is not necessarily needed when it is done somewhere else

            else:  # actual timezone of the label should be considered
                parsed = parsed.tz_convert(tz=self.tz)

        if reso == 'year':
            return (Timestamp(datetime(parsed.year, 1, 1), tz=self.tz),
                    Timestamp(datetime(parsed.year, 12, 31, 23,
@@ -1364,7 +1378,7 @@ def _parsed_string_to_bounds(self, reso, parsed):
st = datetime(parsed.year, parsed.month, parsed.day,
parsed.hour, parsed.minute, parsed.second,
parsed.microsecond)
-            return (Timestamp(st, tz=self.tz), Timestamp(st, tz=self.tz))
+            return Timestamp(st, tz=self.tz), Timestamp(st, tz=self.tz)
Member

Since you're already making edits here, there's a small bug-like issue that might be worth fixing. The day, hour, minute, and second cases don't have tz=self.tz passed to the upper half of the returned tuple.

Author

Thanks for pointing that out! For now the timezone is more or less ignored (or at least not considered well enough) in the later stages, but maybe one day that will change.

else:
raise KeyError
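
The branch logic added in this hunk, together with the review remark about passing the timezone to both halves of the returned tuple, can be sketched with the standard library alone. This is an analogue under stated assumptions, not the pandas implementation: `normalize_label` and `minute_bounds` are hypothetical names, only minute resolution is handled, and the upper bound stops one microsecond short of the next minute because that is the finest step `datetime` offers.

```python
import warnings
from datetime import datetime, timedelta, timezone


def normalize_label(parsed, index_tz):
    """Bring a parsed label into the frame of reference of the index."""
    if index_tz is None:
        if parsed.tzinfo is None:
            # Both are naive: nothing to do.
            return parsed
        # Naive index, aware label: warn and read the label as UTC wall time.
        warnings.warn("Access naive datetime index with a label "
                      "containing a timezone, assume UTC")
        return parsed.astimezone(timezone.utc).replace(tzinfo=None)
    if parsed.tzinfo is None:
        # Aware index, naive label: treat the label as index-local time.
        return parsed.replace(tzinfo=index_tz)
    # Both are aware: convert the label into the index timezone.
    return parsed.astimezone(index_tz)


def minute_bounds(label, index_tz):
    """Return (lower, upper) for a minute-resolution label.

    Both bounds carry the timezone of the index (or none for a naive index).
    """
    parsed = normalize_label(datetime.fromisoformat(label), index_tz)
    lower = parsed.replace(second=0, microsecond=0)
    upper = lower + timedelta(minutes=1) - timedelta(microseconds=1)
    return lower, upper


# Aware (UTC) index, label with a -02:00 offset: no warning is emitted.
lower, upper = minute_bounds("2016-01-01T00:00-02:00", timezone.utc)
```

Here `lower` is 02:00 UTC on 2016-01-01, and `upper` keeps the same `tzinfo`, which is the property the review comment asks for in the day, hour, minute, and second cases.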

3 changes: 2 additions & 1 deletion pandas/tests/indexes/datetimes/test_datetime.py
@@ -236,7 +236,8 @@ def test_stringified_slice_with_tz(self):
start = datetime.datetime.now()
idx = DatetimeIndex(start=start, freq="1d", periods=10)
df = DataFrame(lrange(10), index=idx)
-        df["2013-01-14 23:44:34.437768-05:00":]  # no exception here
+        with tm.assert_produces_warning(UserWarning):
+            df["2013-01-14 23:44:34.437768-05:00":]  # no exception here
Contributor

I would remove the # no exception here and add a comment about the warning that is produced

Author

I am not sure whether that is a good idea. That comment is not mine and it is not related to my code. The proposed refactoring is beyond the scope of this pull request.
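
One detail behind this test change: `warnings.warn` falls back to `UserWarning` when no category is passed, which is why the test can expect `UserWarning` even though the new code in `datetimes.py` does not name one. A minimal standard-library sketch of that behaviour follows; `slice_with_tz_label` is a hypothetical stand-in for the slicing call, not a pandas function.

```python
import warnings


def slice_with_tz_label():
    # Hypothetical stand-in: warns the same way the new branch does,
    # without passing an explicit warning category.
    warnings.warn("Access naive datetime index with a label "
                  "containing a timezone, assume UTC")
    return "sliced"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure nothing is filtered out
    result = slice_with_tz_label()

# Keep only the warnings whose category is UserWarning (or a subclass).
user_warnings = [w for w in caught if issubclass(w.category, UserWarning)]
```

Recording the warnings this way is roughly what a warning-asserting test helper does: the call still returns its result, and the recorded list can be inspected afterwards.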


def test_append_join_nondatetimeindex(self):
rng = date_range('1/1/2000', periods=10)
42 changes: 42 additions & 0 deletions pandas/tests/indexing/test_partial.py
@@ -637,3 +637,45 @@ def test_partial_set_empty_frame_empty_consistencies(self):
df.loc[0, 'x'] = 1
expected = DataFrame(dict(x=[1], y=[np.nan]))
tm.assert_frame_equal(df, expected, check_dtype=False)

Contributor

parametrize both of these tests with strings & with Timestamp for the indexers

Author

Either I do not understand you correctly, or please re-read the tests. I already use both string labels and pd.Timestamp labels there.

    def test_access_timezoned_datetimeindex_with_timezoned_label(self):

        # GH 6785
Member

I think this is an incorrect issue number

Author

oh let me have a look how that could happen

        # timezone was ignored when string was provided as a label

        first_january = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59',
Contributor

I would really like to parametrize these to avoid the code repetition. So I think you can do it with 2 test functions: one which slices and compares with an expected result, and a 2nd function which checks for the warnings (you can actually do it with one if you add some more parameters).

something like

@pytest.mark.parametrize("tz, start, end",......)

Author

I think this contradicts the idea of having the df.iloc check suggested by @jorisvandenbossche because that is rather specific. I would rather delete the non-naive UTC test because the CET test shows much more.

Member

You should be able to write the different strings so that they give the same expected frame I think

                                      freq='T', tz="UTC")
        df = pd.DataFrame(index=first_january, data=np.arange(len(
            first_january)))

        result = df[
            "2016-01-01T00:00-02:00":"2016-01-01T02:03"
        ]
Member

Can you put this all on a single line?


        expected = df[
            pd.Timestamp("2016-01-01T00:00-02:00"):
            pd.Timestamp("2016-01-01T02:03")
        ]

        tm.assert_frame_equal(result, expected)
Member

Can you assert both results (with strings or with Timestamps) against an independently constructed one? (e.g. df.iloc[...])

Author

I am not sure what you mean with df.iloc[...] but generally speaking yes

Member

I mean to create 'expected' with something like df.iloc[120:124] (but then with the correct numbers)


    def test_access_naive_datetimeindex_with_timezoned_label(self):

        # GH 6785
        # timezone was ignored when string was provided as a label
        # this test is for completeness

        first_january = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59',
                                      freq='T')
        df = pd.DataFrame(index=first_january, data=np.arange(len(
            first_january)))

        with tm.assert_produces_warning(UserWarning):
            result = df["2016-01-01T00:00-02:00":"2016-01-01T02:03"]

        expected = df[
            pd.Timestamp("2016-01-01T00:00-02:00"):
            pd.Timestamp("2016-01-01T02:03")
        ]

        tm.assert_frame_equal(expected, result)
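
The review suggestions above (parametrize the cases, write the label strings so they all give the same expected frame, and build the expectation independently, e.g. with `df.iloc[...]`) can be illustrated with a table-driven sketch that uses only the standard library. It is an analogue, not a pandas test: `CASES`, `position`, and `selected_positions` are hypothetical names, CET is modelled as a fixed UTC+1 offset (an assumption that holds for January), and the index is taken to be a one-minute grid starting at midnight in the index timezone.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: fixed UTC+1 offset, valid for CET in January (no DST).
CET_WINTER = timezone(timedelta(hours=1))

# (index timezone, start label, end label)
# Each case is written so that it selects the same rows of the minute grid.
CASES = [
    (UTC, "2016-01-01T00:00-02:00", "2016-01-01T02:03"),
    (UTC, "2016-01-01T02:00", "2016-01-01T02:03+00:00"),
    (CET_WINTER, "2016-01-01T00:00-01:00", "2016-01-01T02:03"),
    (CET_WINTER, "2016-01-01T02:00+01:00", "2016-01-01T01:03+00:00"),
]


def position(label, index_tz):
    """Row number of a label on a minute grid starting at local midnight."""
    parsed = datetime.fromisoformat(label)
    if parsed.tzinfo is None:
        # A label without an offset is read in the index timezone.
        parsed = parsed.replace(tzinfo=index_tz)
    origin = datetime(2016, 1, 1, tzinfo=index_tz)
    return int((parsed - origin) / timedelta(minutes=1))


def selected_positions(index_tz, start, end):
    """Half-open positional range for an inclusive label slice."""
    return position(start, index_tz), position(end, index_tz) + 1


results = [selected_positions(*case) for case in CASES]
```

Every case resolves to the positional range `(120, 124)`, matching the `df.iloc[120:124]` expectation mentioned in the review for an index that starts at midnight with minute frequency.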