Series.searchsorted with different timezones #30086

Closed
kenahoo opened this issue Dec 5, 2019 · 2 comments · Fixed by #33185
Labels: good first issue; Needs Tests (unit test(s) needed to prevent regressions)
kenahoo commented Dec 5, 2019

Code Sample, a copy-pastable example if possible

DatetimeIndex.searchsorted works fine with mixed timezones:

>>> series_dt = pd.date_range('2015-02-19', periods=2, freq='1d', tz='UTC')
>>> dt = pd.to_datetime('2015-02-19T12:34:56').tz_localize('America/New_York')
>>> series_dt.searchsorted(dt)
1

However, Series.searchsorted does not:

>>> from io import StringIO
>>> df = pd.read_csv(StringIO("datetime\n2015-02-19T00:00:00Z\n2015-02-20T00:00:00Z"), parse_dates=['datetime'])
>>> df.datetime.searchsorted(dt)
Traceback (most recent call last):
  File "/Users/kwilliams/Library/Application Support/IntelliJIdea2019.2/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "venv/lib/python3.7/site-packages/pandas/core/series.py", line 2694, in searchsorted
    return algorithms.searchsorted(self._values, value, side=side, sorter=sorter)
  File "venv/lib/python3.7/site-packages/pandas/core/algorithms.py", line 1887, in searchsorted
    result = arr.searchsorted(value, side=side, sorter=sorter)
  File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimelike.py", line 666, in searchsorted
    self._check_compatible_with(value)
  File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 591, in _check_compatible_with
    own=self.tz, other=other.tz
ValueError: Timezones don't match. 'UTC != America/New_York'

Ostensibly, the underlying data has the same dtype in both cases:

>>> series_dt.dtype
datetime64[ns, UTC]
>>> df.datetime.dtype
datetime64[ns, UTC]

Problem description

I believe the DatetimeIndex behavior is correct (or at least more useful): even when the timezones differ, the underlying instants are well-ordered and can be compared. In fact, both objects compare fine with a simple > comparison:

>>> dt > series_dt
array([ True, False])
>>> dt > df.datetime
0     True
1    False
Name: datetime, dtype: bool
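
To illustrate why a search across timezones is well-defined, here is a minimal standard-library sketch (not pandas code). It assumes a fixed UTC-5 offset as a stand-in for America/New_York in February, and uses bisect_left as a rough analogue of searchsorted:

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for America/New_York in February,
# so the sketch does not depend on a timezone database.
EST = timezone(timedelta(hours=-5), "EST")

sorted_utc = [
    datetime(2015, 2, 19, tzinfo=timezone.utc),
    datetime(2015, 2, 20, tzinfo=timezone.utc),
]
needle = datetime(2015, 2, 19, 12, 34, 56, tzinfo=EST)

# Aware datetimes compare by absolute instant, whatever their tzinfo,
# so a binary search over the UTC values is unambiguous.
needle_utc = needle.astimezone(timezone.utc)
position = bisect_left(sorted_utc, needle)

print(needle_utc.isoformat(), position)
```

The needle falls at 17:34:56 UTC on Feb 19, between the two entries, so the insertion point matches the expected searchsorted result of 1.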

Expected Output

>>> df.datetime.searchsorted(dt)
1

A workaround is to wrap the column using pd.DatetimeIndex():

>>> pd.DatetimeIndex(df.datetime).searchsorted(dt)
1
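
Another possible workaround, sketched here under the assumption that the error comes only from the timezone equality check shown in the traceback, is to convert the scalar to the Series' own timezone before searching. The names ser and dt_same_tz are illustrative, and this has not been verified against pandas 0.25.1:

```python
import pandas as pd

# Rebuild the example data directly instead of parsing CSV.
ser = pd.Series(pd.date_range("2015-02-19", periods=2, freq="D", tz="UTC"))
dt = pd.Timestamp("2015-02-19T12:34:56", tz="America/New_York")

# tz_convert keeps the same instant and only changes the timezone it is
# expressed in, so the scalar's tz now matches the Series' tz.
dt_same_tz = dt.tz_convert(ser.dt.tz)
position = ser.searchsorted(dt_same_tz)

print(position)
```

Because tz_convert does not move the instant, the result should agree with the DatetimeIndex version.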

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.0.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : en_US.UTF-8
pandas           : 0.25.1
numpy            : 1.17.2
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 19.3.1
setuptools       : 41.4.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : 0.3.5
scipy            : 1.3.1
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None

jbrockmendel added the Timezones label Dec 18, 2019

jbrockmendel (Member) commented

This now works on master for me. Can you confirm?

mroeschke (Member) commented

Confirmed. Could use a test
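
A regression test along these lines might look like the following sketch. The test name is hypothetical and not taken from the pandas test suite; the expected position of 1 comes from the "Expected Output" section of the report:

```python
import pandas as pd


def test_series_searchsorted_mixed_timezones():
    # Hypothetical test name; data mirrors the example in the report.
    ser = pd.Series(pd.date_range("2015-02-19", periods=2, freq="D", tz="UTC"))
    dt = pd.Timestamp("2015-02-19T12:34:56", tz="America/New_York")

    # Series.searchsorted should accept a scalar in a different timezone...
    result = ser.searchsorted(dt)
    assert result == 1

    # ...and agree with DatetimeIndex.searchsorted on the same values.
    assert pd.DatetimeIndex(ser).searchsorted(dt) == result


test_series_searchsorted_mixed_timezones()
```

On an affected version such as 0.25.1, this would be expected to fail with the ValueError shown in the report rather than an assertion error.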

mroeschke added the good first issue and Needs Tests labels and removed the Timezones label Mar 30, 2020
jreback added this to the 1.1 milestone Mar 31, 2020