Skip to content

BUG: Correctly localize naive datetime strings with Series and datetimetztype (#17415) #17603

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jan 12, 2018

Conversation

mroeschke
Copy link
Member

@pep8speaks
Copy link

pep8speaks commented Sep 20, 2017

Hello @mroeschke! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 12, 2018 at 05:11 Hours UTC

)
is_dt_string = is_string_dtype(value)
value = to_datetime(value, errors=errors)
if is_dt_string:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm adding this incremental logic is error prone. I think this should simply be converted higher up. can you see if this can be simplied that way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is pretty high up the stack already.

The Series constructor calls _sanitize_array which then calls maybe_cast_to_datetime (which I am editing here) fairly early. The line I am editing is the first instance where arrays of numbers (and strings) are converted with to_datetime

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try passing utc=True to to_datetime

this logic is too specific here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it might be that

to_datetime(...., tz=dtype.tz) might be the correct idiom here (another PR implements this)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah #17413 will make this easier.

However I think there will need to be logic to determine if the incoming data is naive or already localized to UTC. The code comments that were already here note that (numeric) data is already UTC while string data (what I am trying to fix here) may be naive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is solving this here is wrong

this is a bug in to_datetime itself

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passing utc=True to to_datetime will make this work (then just converting to the dtype)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try passing utc=True here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utc=True does not work here because the timestamp string should be localized to the specified dtype tz directly intead of UTC in between.

# with utc=True
In [5]: ts_naive = '2013-01-01 00:00:00'
   ...:  result = Series([ts_naive], dtype='datetime64[ns, CET]')
   ...:  expected = Series([pd.Timestamp(ts_naive, tz='CET')])

In [6]: result
Out[6]: 
0   2013-01-01 01:00:00+01:00
dtype: datetime64[ns, CET]

In [7]: expected
Out[7]: 
0   2013-01-01 00:00:00+01:00
dtype: datetime64[ns, CET]

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions Timezones Timezone data dtype labels Sep 20, 2017
@codecov
Copy link

codecov bot commented Sep 21, 2017

Codecov Report

Merging #17603 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17603      +/-   ##
==========================================
- Coverage   91.19%   91.18%   -0.02%     
==========================================
  Files         163      163              
  Lines       49625    49629       +4     
==========================================
- Hits        45257    45252       -5     
- Misses       4368     4377       +9
Flag Coverage Δ
#multiple 88.96% <100%> (ø) ⬆️
#single 40.19% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 87.9% <100%> (+0.08%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fedf922...8910a54. Read the comment docs.

@codecov
Copy link

codecov bot commented Sep 21, 2017

Codecov Report

Merging #17603 into master will increase coverage by 0.02%.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17603      +/-   ##
==========================================
+ Coverage   91.52%   91.54%   +0.02%     
==========================================
  Files         148      147       -1     
  Lines       48778    48777       -1     
==========================================
+ Hits        44642    44655      +13     
+ Misses       4136     4122      -14
Flag Coverage Δ
#multiple 89.92% <87.5%> (+0.02%) ⬆️
#single 41.62% <87.5%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 88.66% <87.5%> (+0.23%) ⬆️
pandas/core/frame.py 97.62% <0%> (ø) ⬆️
pandas/tools/hashing.py
pandas/core/internals.py 94.49% <0%> (+0.06%) ⬆️
pandas/plotting/_converter.py 66.95% <0%> (+1.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 982e112...cee514f. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Oct 28, 2017

pls rebase and move note to 0.21.1

@@ -488,6 +488,7 @@ Conversion
- Bug in :func:`Series.fillna` returns frame when ``inplace=True`` and ``value`` is dict (:issue:`16156`)
- Bug in :attr:`Timestamp.weekday_name` returning a UTC-based weekday name when localized to a timezone (:issue:`17354`)
- Bug in ``Timestamp.replace`` when replacing ``tzinfo`` around DST changes (:issue:`15683`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.21.1

@jreback
Copy link
Contributor

jreback commented Nov 6, 2017

let me have a look

@jreback
Copy link
Contributor

jreback commented Nov 25, 2017

can you rebase

)
is_dt_string = is_string_dtype(value)
value = to_datetime(value, errors=errors)
if is_dt_string:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try passing utc=True here

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I guess this is ok, until we have tz= in pd.to_datetime this must be done here. some comments.

@@ -65,6 +65,7 @@ Conversion
- Bug in :meth:`IntervalIndex.copy` when copying and ``IntervalIndex`` with non-default ``closed`` (:issue:`18339`)
- Bug in :func:`DataFrame.to_dict` where columns of datetime that are tz-aware were not converted to required arrays when used with ``orient='records'``, raising``TypeError` (:issue:`18372`)
- Bug in :class:`DateTimeIndex` and :meth:`date_range` where mismatching tz-aware ``start`` and ``end`` timezones would not raise an err if ``end.tzinfo`` is None (:issue:`18431`)
- Bug in localization of a naive, datetime string in a ``Series`` constructor with a ``datetime[ns, timezone]`` dtype (:issue:`174151`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.22

@@ -574,6 +574,12 @@ def test_constructor_with_datetime_tz(self):
expected = Series(pd.DatetimeIndex(['NaT', 'NaT'], tz='US/Eastern'))
assert_series_equal(s, expected)

# GH 17415: With naive string
ts_naive = '2013-01-01 00:00:00'
result = Series([ts_naive], dtype='datetime64[ns, CET]')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a separate test

parametrize over adding a NaT, nan, None here as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [3]: Series([pd.NaT], dtype='datetime64[ns, CET]') # same for np.nan and None
Out[3]: 
0   NaT
dtype: datetime64[ns, CET]

In [4]: Series(pd.Timestamp(pd.NaT, tz='CET')) # same for np.nan and None
Out[4]: 
0   NaT
dtype: datetime64[ns]

So I figure both of these should be equivalent and In[4] should be the correct behavior?

@jreback
Copy link
Contributor

jreback commented Dec 10, 2017

if you can change this to 0.22 for the whatsnew and rebase can have a look

@mroeschke
Copy link
Member Author

Rebased and moved the whatsnew to v0.22.

Still curious how to handle the following situation:

In [3]: Series([pd.NaT], dtype='datetime64[ns, CET]') # same for np.nan and None
Out[3]: 
0   NaT
dtype: datetime64[ns, CET]

In [4]: Series(pd.Timestamp(pd.NaT, tz='CET')) # same for np.nan and None
Out[4]: 
0   NaT
dtype: datetime64[ns]

Out[4] should be the correct behavior for both I assume.

@jreback
Copy link
Contributor

jreback commented Dec 11, 2017

@mroeschke a way you can handle is like this

In [1]: from pandas.api.types import DatetimeTZDtype

In [2]: DatetimeTZDtype('datetime64[ns, CET]')
Out[2]: datetime64[ns, CET]

In [3]: DatetimeTZDtype('CET')
ValueError: DatetimeTZDtype constructor must have a tz supplied

so in a try/except if you succeed you have a valid dtype, otherwise you can pass it directly to the constructor.

@mroeschke
Copy link
Member Author

mroeschke commented Dec 12, 2017

@jreback After some digging, the reason why:

In [3]: Series([pd.NaT], dtype='datetime64[ns, CET]') # same for np.nan and None
Out[3]: 
0   NaT
dtype: datetime64[ns, CET]

occurs is that this happens up the stack:

In [3]: pd.to_datetime([pd.NaT]).tz_localize('CET')
Out[3]: DatetimeIndex(['NaT'], dtype='datetime64[ns, CET]', freq=None)

Curious to get your opinion if this should raise instead.

@jreback
Copy link
Contributor

jreback commented Dec 12, 2017

In [1]: pd.Timestamp('NaT')
Out[1]: NaT

In [2]: pd.Timestamp('NaT').tz_localize('CET')
Out[2]: NaT
In [4]: pd.to_datetime(['nat']).tz_localize('CET')
Out[4]: DatetimeIndex(['NaT'], dtype='datetime64[ns, CET]', freq=None)

In [5]: pd.to_datetime(['nat', '20170101']).tz_localize('CET')
Out[5]: DatetimeIndex(['NaT', '2017-01-01 00:00:00+01:00'], dtype='datetime64[ns, CET]', freq=None)

In [6]: pd.to_datetime(['nat']).tz_localize('CET')[0]
Out[6]: NaT

@mroeschke this is pretty consistent. NaT can be part of a tz-aware Index/Series. It doesn't have a tz itself though.

@mroeschke
Copy link
Member Author

Makes sense, thanks for confirming.

@mroeschke
Copy link
Member Author

I realized I was concerned about a non-issue since:

In [4]: pd.to_datetime(['nat']).tz_localize('CET')
Out[4]: DatetimeIndex(['NaT'], dtype='datetime64[ns, CET]', freq=None)

makes sense then

In [3]: Series([pd.NaT], dtype='datetime64[ns, CET]') # same for np.nan and None
Out[3]: 
0   NaT
dtype: datetime64[ns, CET]

makes sense as well.

All green.

…type (pandas-dev#17415)

Add whatsnew

Fix lint error

Rebase and move whatsnew note

Move whatsnew and add tests
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments. otherwise lgtm. ping on green.

@@ -380,7 +380,7 @@ Conversion
- Fixed bug where comparing :class:`DatetimeIndex` failed to raise ``TypeError`` when attempting to compare timezone-aware and timezone-naive datetimelike objects (:issue:`18162`)
- Bug in :class:`DatetimeIndex` where the repr was not showing high-precision time values at the end of a day (e.g., 23:59:59.999999999) (:issue:`19030`)
- Bug where dividing a scalar timedelta-like object with :class:`TimedeltaIndex` performed the reciprocal operation (:issue:`19125`)
-
- Bug in localization of a naive, datetime string in a ``Series`` constructor with a ``datetime[ns, timezone]`` dtype (:issue:`174151`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datetime64[ns, tz]

.tz_localize('UTC')
.tz_convert(dtype.tz)
)
# This block can be simplified once PR #17413 is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add to the comment what that PR is doing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add an xref (with this PR) to that issue as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to #13712 here (the issue and not the PR which was closed)

@jreback jreback added this to the 0.23.0 milestone Jan 11, 2018
@jreback jreback merged commit 499802c into pandas-dev:master Jan 12, 2018
@jreback
Copy link
Contributor

jreback commented Jan 12, 2018

thanks!

@mroeschke mroeschke deleted the fix_17415 branch January 12, 2018 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Incorrect localization of naive time string with Series and datetime[ns, tz] dtype
3 participants