Allow choosing the utc timezone class in pd.to_datetime #32619

Closed
vmarkovtsev opened this issue Mar 11, 2020 · 8 comments · Fixed by #45095
Labels
Datetime Datetime data dtype good first issue Needs Tests Unit test(s) needed to prevent regressions Timezones Timezone data dtype

@vmarkovtsev

Code Sample, a copy-pastable example if possible

from datetime import datetime, timezone
import pandas as pd

dt1 = pd.to_datetime(datetime(2020, 3, 11), utc=True)
print(repr(dt1))
print(type(dt1.tz))

dt2 = pd.to_datetime(datetime(2020, 3, 11, tzinfo=timezone.utc))
print(repr(dt2))
print(type(dt2.tz))

# dragons here
print(dt1 - dt2)

outputs

Timestamp('2020-03-11 00:00:00+0000', tz='UTC')
<class 'pytz.UTC'>
Timestamp('2020-03-11 00:00:00+0000', tz='UTC')
<class 'datetime.timezone'>

TypeError: Timestamp subtraction must have the same timezones or no timezones

Problem description

There is no ability to specify which "UTC" the Timestamp should be. I suggest extending the interface of pd.to_datetime() to specify utc_cls=pytz.UTC.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.3.0-40-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.17.4
pytz             : 2019.2
dateutil         : 2.7.3
pip              : 19.3.1
setuptools       : 42.0.1
Cython           : 0.29.14
pytest           : 5.3.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.0
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : 2.10.3
IPython          : 7.10.0
pandas_datareader: None
bs4              : 4.8.1
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
pytest           : 5.3.1
pyxlsb           : None
s3fs             : None
scipy            : 1.2.1
sqlalchemy       : 1.3.12
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None
@TomAugspurger
Contributor

I don't think we need a separate keyword for this. I think we could accept utc=timezone.utc in addition to a boolean.

@TomAugspurger TomAugspurger added API Design Datetime Datetime data dtype Timezones Timezone data dtype labels Mar 11, 2020
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Mar 11, 2020
@se7entyse7en

se7entyse7en commented Mar 11, 2020

I think the odd part is that tz_compare, which is called during Timestamp.__sub__, returns False even though both timezones are UTC. In fact, both tzinfo objects described in the issue pass the is_utc check, but tz_compare still returns False because their representations differ.

In [1]: from datetime import datetime, timezone

In [2]: import pandas as pd

In [3]: from pandas._libs.tslibs.timezones import tz_compare, is_utc

In [4]: dt1 = pd.to_datetime(datetime(2020, 3, 11), utc=True)

In [5]: dt2 = pd.to_datetime(datetime(2020, 3, 11, tzinfo=timezone.utc))

In [6]: is_utc(dt1.tzinfo), is_utc(dt2.tzinfo)
Out[6]: (True, True)

In [7]: tz_compare(dt1.tzinfo, dt2.tzinfo)
Out[7]: False

I think that it would be more correct to fix the behavior of tz_compare.

@TomAugspurger
Contributor

cc @mroeschke. I vaguely recall good reasons for them not comparing the same, but I don't recall them right now.

@mroeschke
Member

I believe that tz_compare has this behavior because we are trying to guard against inadvertently coercing the user's timezone object to a pytz timezone object (which is our "default" internally) after certain operations. An example is #23807

Nonetheless, I think OP's example should work and just return a new Timestamp with a pytz.UTC timezone object (and better document our default timezone object behavior).

I can see modifying tz_compare to add a strict kwarg in which we can internally control cases like this.
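To make the idea of a strict switch concrete, here is a minimal standard-library sketch. It is not pandas' implementation: `tz_compare_sketch`, `looks_like_utc`, and the stand-in class `OtherUTC` are hypothetical names invented for illustration, and the UTC check is deliberately crude (zero offset and the name "UTC" at a single probe date).

```python
from datetime import datetime, timedelta, timezone, tzinfo


class OtherUTC(tzinfo):
    """Hypothetical stand-in for a second UTC implementation (e.g. pytz.UTC)."""

    def utcoffset(self, dt):
        return timedelta(0)

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return "UTC"


def looks_like_utc(tz):
    # Crude assumption: zero offset and the name "UTC" at one probe date.
    probe = datetime(2020, 3, 11)
    return (
        tz is not None
        and tz.utcoffset(probe) == timedelta(0)
        and tz.tzname(probe) == "UTC"
    )


def tz_compare_sketch(left, right, strict=False):
    """Compare two tzinfo objects; strict=True also requires the same class."""
    if strict:
        return type(left) is type(right) and left == right
    if looks_like_utc(left) and looks_like_utc(right):
        # Different UTC implementations are treated as equivalent.
        return True
    return left == right


lenient = tz_compare_sketch(timezone.utc, OtherUTC())
strict = tz_compare_sketch(timezone.utc, OtherUTC(), strict=True)
print(lenient, strict)
```

In this sketch the lenient mode would serve arithmetic such as Timestamp subtraction, while the strict mode would be reserved for internal code paths that must not silently swap the user's timezone object.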

@mroeschke
Member

I'd be -0.5 on allowing utc to accept non-booleans, as IMO it's idiomatic to use tz_convert after to_datetime

In [3]: pd.to_datetime(datetime(2020, 3, 11), utc=True).tz
Out[3]: <UTC>

In [5]: pd.to_datetime(datetime(2020, 3, 11), utc=True).tz_convert(timezone.utc).tz
Out[5]: datetime.timezone.utc
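Applied to the example from the original report, the tz_convert idiom could look like the sketch below. Converting both operands to the same timezone object before subtracting avoids the mismatch regardless of how each Timestamp was constructed; the variable names `dt1_std` and `dt2_std` are only illustrative.

```python
from datetime import datetime, timezone

import pandas as pd

dt1 = pd.to_datetime(datetime(2020, 3, 11), utc=True)
dt2 = pd.to_datetime(datetime(2020, 3, 11, tzinfo=timezone.utc))

# Normalize both sides to the same tzinfo object before doing arithmetic.
dt1_std = dt1.tz_convert(timezone.utc)
dt2_std = dt2.tz_convert(timezone.utc)

delta = dt1_std - dt2_std
print(delta)
```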

@jbrockmendel
Member

The bug here is in tz_compare not considering the two different UTCs as equivalent. I agree with @mroeschke that we shouldn't change the to_datetime API.

I would be +1 on changing our defaults from pytz to the stdlib tzinfo (and zoneinfo going forward)

@raffienficiaud
Contributor

Something is also not right with respect to astype("datetime64[ns, UTC]"). Consider this:

ipdb> p matched_signals.prediction_time
0   2021-06-17 19:22:37.999687+00:00
Name: prediction_time, dtype: datetime64[ns, UTC]

ipdb> p matched_signals.event_time
0   2021-06-18 19:22:37.999687+00:00
Name: event_time, dtype: datetime64[ns, UTC]

ipdb>  matched_signals.prediction_time - matched_signals.event_time
*** TypeError: DatetimeArray subtraction must have the same timezones or no timezones

ipdb> matched_signals.event_time.dt.tz
<UTC>
ipdb> matched_signals.prediction_time.dt.tz
datetime.timezone.utc

The code on column event_time does an explicit

matched_signals.loc[:, "event_time"] = matched_signals["event_time"].astype(
        "datetime64[ns, UTC]"
    )

while prediction_time is constructed as-is from datetime.datetime objects whose time zone is set to datetime.timezone.utc. However, its dtype is printed as datetime64[ns, UTC], the same as in the explicit case.
The workaround is to explicitly transform that column as well

matched_signals.loc[:, "prediction_time"] = matched_signals["prediction_time"].astype("datetime64[ns, UTC]")
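A self-contained sketch of the same situation follows, with made-up data standing in for matched_signals (the frame name `frame` and the values are assumptions, not the commenter's data). It normalizes both columns with `Series.dt.tz_convert("UTC")` rather than astype, which is an alternative way to put both columns on the same timezone object.

```python
from datetime import datetime, timezone

import pandas as pd

# One column built from stdlib datetimes, one localized by pandas.
prediction_time = pd.Series(
    [datetime(2021, 6, 17, 19, 22, 37, tzinfo=timezone.utc)]
)
event_time = pd.Series(pd.to_datetime(["2021-06-18 19:22:37"])).dt.tz_localize("UTC")
frame = pd.DataFrame({"prediction_time": prediction_time, "event_time": event_time})

# Put both columns on the same UTC object before subtracting.
for col in ("prediction_time", "event_time"):
    frame[col] = frame[col].dt.tz_convert("UTC")

delta = frame["prediction_time"] - frame["event_time"]
print(delta)
```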

@mroeschke
Member

Looks like the original bug has been fixed by #39216, but it could use a test to exercise the arithmetic case of different UTC timezones
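A regression check along these lines might look like the sketch below. It is not the test that was eventually merged; the function name is hypothetical, and it assumes a pandas version that already includes the fix, since older versions raise TypeError here. dateutil's `tzutc` is used as the second UTC implementation because dateutil is a required pandas dependency.

```python
from datetime import timezone

import dateutil.tz
import pandas as pd


def subtract_mixed_utc_timestamps():
    """Subtract two Timestamps whose UTC tzinfo objects come from different libraries."""
    left = pd.Timestamp("2020-03-11", tz=timezone.utc)
    right = pd.Timestamp("2020-03-11", tz=dateutil.tz.tzutc())
    return left - right


result = subtract_mixed_utc_timestamps()
print(result)
```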

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed API Design Bug labels Jul 30, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.4 Dec 28, 2021
7 participants