Skip to content

Inconsistent parsing for timestamp with timezone with read_sql_query #15119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
thrasibule opened this issue Jan 12, 2017 · 6 comments · Fixed by #24500
Closed

Inconsistent parsing for timestamp with timezone with read_sql_query #15119

thrasibule opened this issue Jan 12, 2017 · 6 comments · Fixed by #24500
Labels
IO SQL to_sql, read_sql, read_sql_query Timezones Timezone data dtype
Milestone

Comments

@thrasibule
Copy link
Contributor

Code Sample, a copy-pastable example if possible

I'm creating a test table in postgresql as follows:

CREATE TABLE test(date timestamptz);

INSERT INTO test VALUES('2014-10-16 07:50:49-04'),('2014-11-06 07:42:06-05')

Notice that the dates are across the dst date, so that they have different offsets to utc.

df1 = pd.read_sql_query("SELECT date FROM test", engine,
                        index_col='date')

df2 = pd.read_sql_query("SELECT date FROM test WHERE date<='2014-10-17'",
                        engine,
                        index_col='date')

df2bis = pd.read_sql_query("SELECT date FROM test WHERE date<='2014-10-17'",
                           engine,
                           index_col='date',
                           parse_dates=['date'])
df3 = pd.read_sql_query("SELECT date FROM test", engine,
                        index_col='date',
                        parse_dates=['date'])

Problem description

In [159]: df1.index
Out[159]: 
Index([2014-10-16 07:50:49-04:00, 2014-11-06 07:42:06-05:00], dtype='object', name='date')

In [160]: df2.index
Out[160]: 
DatetimeIndex(['2014-10-16 11:50:49+00:00'], dtype='datetime64[ns, UTC]', name='date', freq=None)

In [162]: df2bis.index
Out[162]: 
DatetimeIndex(['2014-10-16 11:50:49+00:00'], dtype='datetime64[ns, UTC]', name='date', freq=None)

In [163]: df3.index
Out[163]: 
DatetimeIndex(['2014-10-16 11:50:49', '2014-11-06 12:42:06'], dtype='datetime64[ns]', name='date', freq=None)

Expected Output

The current behavior if I understand correctly is that if there is only one timezone offset: timestamps are parsed automatically to timezone aware timestamps.
If dates are across different offsets, they are not parsed automatically. We can force parsing using parse_dates, but then they are not timezone aware. I can make the conversion using tz_localize, but it's not very consistent. I would expect to have everything parsed automatically to timezone aware timestamps.

This is related to issue #7364. If there is a consensus on what should be done, I can try to make the required changes to make it more regular.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.0.final.0 python-bits: 64 OS: Linux OS-release: 4.8.13-1-ARCH machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_US.utf8 LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 33.0.0
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0rc2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.5
sqlalchemy: 1.1.4
pymysql: 0.7.9.None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
boto: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Jan 12, 2017

can you create a test using the existing test infrastructure (and either push a commit or paste here)

@jreback jreback added IO SQL to_sql, read_sql, read_sql_query Timezones Timezone data dtype labels Jan 12, 2017
@thrasibule
Copy link
Contributor Author

Working on it. I'll try to come up with something today.

@thrasibule
Copy link
Contributor Author

Turns out there is already a test for it, but it's too lenient on purpose: someone obviously thought about this before. Should I try to fix this? I think it should always return datetime64tz, but maybe it's the best that can be done at the moment.

diff --git a/pandas/io/tests/test_sql.py b/pandas/io/tests/test_sql.py
index cb08944e8..d8eff4ccf 100644
--- a/pandas/io/tests/test_sql.py
+++ b/pandas/io/tests/test_sql.py
@@ -1310,9 +1310,7 @@ class _TestSQLAlchemy(SQLAlchemyMixIn, PandasSQLTest):
         # even with the same versions of psycopg2 & sqlalchemy, possibly a
         # Postgrsql server version difference
         col = df.DateColWithTz
-        self.assertTrue(is_object_dtype(col.dtype) or
-                        is_datetime64_dtype(col.dtype) or
-                        is_datetime64tz_dtype(col.dtype),
+        self.assertTrue(is_datetime64tz_dtype(col.dtype),
                         "DateCol loaded with incorrect type -> {0}"
                         .format(col.dtype))

@jreback
Copy link
Contributor

jreback commented Jan 13, 2017

@thrasibule well sometimes we seemed to get object dtypes back and sometimes it was correctly returned as datetime64, so this is lenient on purpose. never could narrow down why sometimes it didn't work..

@jorisvandenbossche
Copy link
Member

@thrasibule See also the discussion in #11216 (comment).

This is a complex topic, so I should dive into it again, but I think we want to coerce all tz aware datetimes to UTC (so to keep the tz-aware part, but to have a consistent output). Based on the discussions in #7364 and #11216, I would also think it is this way, but apparently not (at least not in your case).

Certainly welcome to look into it!

@eszense
Copy link

eszense commented Aug 13, 2017

I encountered similar problem but on the to_sql() side of parsing

import pandas as pd
import sqlite3
import pytz
from datetime import datetime

tz=pytz.timezone('US/Eastern')
# tz=None

conn = sqlite3.connect(':memory:')

# Timezone aware DatetimeIndex raise error
df = pd.DataFrame(
    index=pd.DatetimeIndex(tz=tz, freq='D', start='2017-1-1', periods=0)
    )
df.to_sql('test',conn)

# Timezone aware data in column raise error
df = pd.DataFrame([datetime.now(tz)])
df.to_sql('test',conn)

conn.close()

Both section of the above code will raise exception if timezone-aware index/data is used.

Traceback (most recent call last):
  File "/home/ubuntu/project/playground.py", line 16, in <module>
    df.to_sql('test',conn)
  File "/home/ubuntu/project/venv/lib/python3.5/site-packages/pandas/core/generic.py", line 1362, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/home/ubuntu/project/venv/lib/python3.5/site-packages/pandas/io/sql.py", line 471, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "/home/ubuntu/project/venv/lib/python3.5/site-packages/pandas/io/sql.py", line 1506, in to_sql
    table.insert(chunksize)
  File "/home/ubuntu/project/venv/lib/python3.5/site-packages/pandas/io/sql.py", line 644, in insert
    keys, data_list = self.insert_data()
  File "/home/ubuntu/project/venv/lib/python3.5/site-packages/pandas/io/sql.py", line 625, in insert_data
    d = b.values.astype('M8[us]').astype(object)
  File "/home/ubuntu/project/venv/lib/python3.5/site-packages/pandas/core/indexes/datetimes.py", line 864, in astype
    raise ValueError('Cannot cast DatetimeIndex to dtype %s' % dtype)
ValueError: Cannot cast DatetimeIndex to dtype datetime64[us]

Based on #7364 I also agree the correct behavior is to coerce to UTC.
This is also supported by Sqlite documentation

The date and time functions use UTC or "zulu" time internally, and so the "Z" suffix is a no-op. Any non-zero "HH:MM" suffix is subtracted from the indicated date and time in order to compute zulu time. For example, all of the following time strings are equivalent:

2013-10-07 08:23:19.120
2013-10-07T08:23:19.120Z
2013-10-07 04:23:19.120-04:00
2456572.84952685

@mroeschke mroeschke changed the title Inconsistent parsing for timestamp with timezone Inconsistent parsing for timestamp with timezone with read_sql_query Jul 26, 2018
mroeschke pushed a commit to mroeschke/pandas that referenced this issue Dec 30, 2018
@jreback jreback added this to the 0.24.0 milestone Dec 31, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants