Difference between two Timestamp with different timezone object type fails #31793

Closed
antoine-gallix opened this issue Feb 7, 2020 · 11 comments · Fixed by #37329
Labels
Bug Compat pandas objects compatability with Numpy or Python functions Numeric Operations Arithmetic, Comparison, and Logical operations Timezones Timezone data dtype
Comments

@antoine-gallix

antoine-gallix commented Feb 7, 2020

I encountered an issue while calculating the difference between two timestamps
of different origin. The first timestamp comes from a function using the arrow
package. The timestamp is in the UTC timezone. The other timestamp comes from another
package returning a pandas.DataFrame, also with UTC timestamps. An exception is
raised when trying to get the difference between the two timestamps. Here follows
some code that reproduces the error, with some details about the respective
time zones.

import arrow
import pandas


def info(name, thing):
    print(f"{name} | tzinfo: {thing.tzinfo} | tzinfo type: {type(thing.tzinfo)}")


arrow_now = arrow.utcnow()
info("arrow.utcnow()", arrow_now)
arrow_now_datetime = arrow_now.datetime
info("arrow.utcnow().datetime", arrow_now_datetime)
timestamp_from_datetime_from_arrow_now = pandas.Timestamp(arrow_now_datetime)
info(
    "pandas.Timestamp(arrow.utcnow().datetime)", timestamp_from_datetime_from_arrow_now
)
pandas_now_timestamp = pandas.Timestamp.utcnow()
info("pandas.Timestamp.utcnow()", pandas_now_timestamp)

Running the above code produces:

arrow.utcnow() | tzinfo: tzutc() | tzinfo type: <class 'dateutil.tz.tz.tzutc'>
arrow.utcnow().datetime | tzinfo: tzutc() | tzinfo type: <class 'dateutil.tz.tz.tzutc'>
pandas.Timestamp(arrow.utcnow().datetime) | tzinfo: tzutc() | tzinfo type: <class 'dateutil.tz.tz.tzutc'>
pandas.Timestamp.utcnow() | tzinfo: UTC | tzinfo type: <class 'pytz.UTC'>

Both timestamps are in the same timezone, but the timezone object itself comes from different packages. Now when trying to get the difference between the two timestamps:

timestamp_from_datetime_from_arrow_now - pandas_now_timestamp

I get the following error:

TypeError: Timestamp subtraction must have the same timezones or no timezones

I would expect that if the Timestamp objects are initialized without error,
which is the case, they are valid objects that can work with each other,
regardless of what the timestamps were built with. That is not the case.

System Info

  • system: Ubuntu 19.04 disco
  • python version: 3.7.4
  • arrow version: 0.15.5
  • pandas version: 1.0.0

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.0.0-38-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 6.0.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.13
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@jreback
Contributor

jreback commented Feb 7, 2020

there are a couple of issues about arrow objects not playing nice

Timestamp is a subclass of datetime.datetime so likely this is an issue on the arrow side

you can certainly investigate but pandas does not offer any support here

@jbrockmendel
Member

I can't speak to the arrow side of things, but on the Timestamp side this is an easy fix if we want to support it (which I would be +1 on)

The check in _libs.tslibs.c_timestamp on L295, if not tz_compare(self.tzinfo, other.tzinfo):, would instead become if (self.tzinfo is None and other.tzinfo is not None) or (self.tzinfo is not None and other.tzinfo is None):

@antoine-gallix
Author

I notified the Arrow maintainers of the bug. Maybe they will agree with @jreback and do something on their side. Here is the link: arrow-py/arrow#755

@systemcatch

Hey everyone, arrow uses dateutil everywhere for its tzinfos. We don't have any plans to move to pytz.

@jbrockmendel's solution looks good, I can submit a PR if this is the agreed path.

@antoine-gallix
Author

antoine-gallix commented Feb 28, 2020

Has anything been decided here?
Meanwhile, I'd like to share the workaround I'm using at the moment.

# using variable from my first post
series_using_arrow = pandas.Series(timestamp_from_datetime_from_arrow_now)
series_using_arrow.dtype
> datetime64[ns, tzutc()] # not the tz type we want

series_using_pandas = pandas.Series(pandas_now_timestamp)
series_using_pandas.dtype
> datetime64[ns, UTC] # this is the good one

# again the same error
series_using_arrow - series_using_pandas
> TypeError: DatetimeArray subtraction must have the same timezones or no timezones


# conversion using this pandas type changes the type of tz property
series_using_arrow_converted = series_using_arrow.astype(pandas.DatetimeTZDtype(tz='UTC'))
series_using_arrow_converted.dtype
> datetime64[ns, UTC] # that's what we want

# now the operation works
pandas.Series(pandas_now_timestamp) - pandas.Series(timestamp_from_datetime_from_arrow_now).astype(pandas.DatetimeTZDtype(tz='UTC'))
> 0   00:00:00.000096
 dtype: timedelta64[ns]

@jbrockmendel
Member

Has anything been decided here?

PR would be welcome

@antoine-gallix
Author

@jbrockmendel you mean me? I fear I'd be very inefficient at touching anything in the depth of pandas. From what I understand, the error comes from a check that verifies that the two tzinfo attributes are equal, which fails in our case because the two UTC timezone objects are of different types. What's the plan? You seem to suggest the check would fail only if one of the two tzinfo is missing. What was tz_compare doing? I think it'll be faster if someone who understands the code and the planned change of strategy does it than if I start from scratch here.

@jbrockmendel
Member

you mean me?

@antoine-gallix yes I do.

I fear I'd be very inefficient at touching anything in the depth of pandas [...] I think it'll be faster if someone who understands the code and the planned change of strategy does it than if I start from scratch here.

Everybody is a beginner at first. We're happy to help new contributors get up to speed.

You seem to suggest the check would fail only if one of the two tzinfo is missing. What was tz_compare doing?

Timestamp subtraction is defined in pandas._libs.tslibs.c_timestamp, L267. There is a check, if not tz_compare(self.tzinfo, other.tzinfo), that needs to be changed to something like if (self.tzinfo is None and other.tzinfo is not None) or (self.tzinfo is not None and other.tzinfo is None) (maybe make a helper func since this is pretty verbose).
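
For illustration, the helper mentioned here might look like the minimal sketch below. It works on plain datetime.datetime objects rather than on the Cython Timestamp internals, and the names awareness_mismatch and checked_subtract are hypothetical, not pandas functions. The assumption is that subtraction should be rejected only when exactly one operand carries a tzinfo.

```python
from datetime import datetime, timedelta, timezone


def awareness_mismatch(left: datetime, right: datetime) -> bool:
    """Return True when exactly one of the two datetimes has a tzinfo.

    Hypothetical helper name; it condenses the verbose two-sided check
    `(a is None and b is not None) or (a is not None and b is None)`.
    """
    return (left.tzinfo is None) != (right.tzinfo is None)


def checked_subtract(left: datetime, right: datetime) -> timedelta:
    """Subtract two datetimes, rejecting only naive/aware mixes (illustrative)."""
    if awareness_mismatch(left, right):
        raise TypeError(
            "subtraction must have timezones on both operands or on neither"
        )
    return left - right


naive = datetime(2020, 2, 7, 12, 0, 0)
aware_utc = datetime(2020, 2, 7, 12, 0, 0, tzinfo=timezone.utc)
aware_plus_one = datetime(2020, 2, 7, 13, 0, 30, tzinfo=timezone(timedelta(hours=1)))

# Both operands are aware but use different tzinfo objects: the subtraction
# is allowed and is carried out on the UTC timeline.
delta = checked_subtract(aware_plus_one, aware_utc)
```

With this shape, the question of whether two tzinfo objects are "the same" no longer decides whether subtraction is permitted; only a naive/aware mix is refused.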

@vmarkovtsev

There is also datetime.timezone.utc that is incompatible with pytz.UTC. It is particularly funny to debug because both show up as tz='UTC' and you have to look at the actual type.
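
A standard-library-only sketch of the situation described here: two tzinfo objects that both report "UTC" but are of different types. The OtherUTC class is a hypothetical stand-in for a third-party UTC implementation (such as the pytz or dateutil ones), written out so the example needs no extra packages. It also shows that plain datetime subtraction accepts such a pair, because aware datetimes are compared through their UTC offsets.

```python
from datetime import datetime, timedelta, timezone, tzinfo


class OtherUTC(tzinfo):
    """Hypothetical stand-in for a UTC tzinfo supplied by another library."""

    def utcoffset(self, dt):
        return timedelta(0)

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return "UTC"


stdlib_dt = datetime(2020, 2, 7, 12, 0, 0, tzinfo=timezone.utc)
other_dt = datetime(2020, 2, 7, 12, 0, 1, tzinfo=OtherUTC())

# Both objects label themselves "UTC" ...
same_label = stdlib_dt.tzname() == other_dt.tzname() == "UTC"
# ... yet the tzinfo types differ, which only shows up when inspecting type().
different_type = type(stdlib_dt.tzinfo) is not type(other_dt.tzinfo)

# The standard library still subtracts them, using each operand's UTC offset.
delta = other_dt - stdlib_dt
print(same_label, different_type, delta)
```

Printing type(ts.tzinfo), as the info() function in the original report does, is the quickest way to tell such objects apart.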

@jbrockmendel
Member

There is also datetime.timezone.utc that is incompatible with pytz.UTC. It is particularly funny to debug because both show up as tz='UTC' and you have to look at the actual type.

That looks like a bug in tz_compare

@jbrockmendel jbrockmendel added Numeric Operations Arithmetic, Comparison, and Logical operations Timezones Timezone data dtype labels Jun 5, 2020
AnjoMan added a commit to AnjoMan/pandas that referenced this issue Oct 22, 2020
@jreback
Contributor

jreback commented Oct 23, 2020

There is also datetime.timezone.utc that is incompatible with pytz.UTC. It is particularly funny to debug because both show up as tz='UTC' and you have to look at the actual type.

That looks like a bug in tz_compare

I am ok with changing the UTC handling so it's consistent.

Why would we allow subtraction operations between different timezones? What does this buy except for confusion? We do not allow any operations between timezones now (except for some unfortunate mixing when indexing).

AnjoMan added a commit to AnjoMan/pandas that referenced this issue Oct 23, 2020
AnjoMan added a commit to AnjoMan/pandas that referenced this issue Oct 23, 2020
AnjoMan added a commit to AnjoMan/pandas that referenced this issue Oct 25, 2020
AnjoMan added a commit to AnjoMan/pandas that referenced this issue Feb 11, 2021
@mroeschke mroeschke added Bug Compat pandas objects compatability with Numpy or Python functions labels Jul 28, 2021
@jreback jreback added this to the 1.4 milestone Dec 28, 2021
mroeschke pushed a commit that referenced this issue Dec 28, 2021
* BUG: cannot subtract Timestamp with different timezones (#31793)

* TST: add index type coverage to timedelta subtraction tests (#31793)

* TST: 31739 - add coverage of subtracting datetimes w/differing timezones