
Issue parsing '%z' in timestamps via pd.to_datetime #23599


Closed
bmironov opened this issue Nov 9, 2018 · 11 comments

Comments

@bmironov

bmironov commented Nov 9, 2018

Code Sample, a copy-pastable example if possible

First, thanks a lot for this great library. It helps a lot in our day-to-day activities.
I came across odd behaviour when processing timestamps from a set of my CSV files.

s="2018-09-10 09:30:00.000894-04:00"
t1=datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f%z")
print ("T1", t1)
t2=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
print ("T2", t2)
t3=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f")
print ("T3", t3)

Output:

T1 2018-09-10 09:30:00.000894-04:00
T2 2018-09-10 09:30:00.000894-04:00
T3 2018-09-10 13:30:00.000894

Problem description

As we can see, pd.to_datetime CAN properly parse the timezone directive '%z'. If I omit the "errors" parameter, I get an exception:
ValueError: 'z' is a bad directive in format '%Y-%m-%d %H:%M:%S.%f%z'

I have a 6.6M-row CSV file with a column that holds timestamps in the above format (up to microsecond precision, with time zone information like -04:00). This column is my index for the dataframe. First, I load it via a from_csv(filename) call [no extra parameters except the filename]. When I use pd.to_datetime I get the following performance results:

  1. df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f")
    takes 5 min to process
  2. df2['datetime'] = df2['datetime'].str.replace('-04:00', '')
    df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f")
    takes 19 sec to complete
  3. df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
    takes 3 sec to complete

All of the above calls produce the expected result in the dataframe.

I see that recently there was a related issue about processing '%z': #13486

Expected Output

No exceptions and no need to ignore errors

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Thanks again and best regards,
Boris

@bmironov
Author

bmironov commented Nov 9, 2018

Another strange performance metric:

%%time
tmp = pd.read_csv("data/my_data.csv",
                 engine='c')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
tmp.set_index('timestamp', inplace=True)
CPU times: user 395 ms, sys: 74.9 ms, total: 469 ms
Wall time: 495 ms

And what "should" work at least as fast:

%%time
tmp = pd.read_csv("data/my_data.csv",
                 parse_dates=['timestamp'],
                 date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S.%f%z', errors='ignore'),
                 engine='c').set_index('timestamp')
CPU times: user 5.47 s, sys: 206 ms, total: 5.67 s
Wall time: 6.47 s

The CSV file has 6.6M rows of data.

@mroeschke
Member

mroeschke commented Nov 9, 2018

%z was introduced recently and will be a part of the v0.24.0 release.

In [6]: pd.__version__
Out[6]: '0.24.0.dev0+961.gefd1844da'

In [7]: t2=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
   ...: print ("T2", t2)
   ...: t3=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z")
   ...: print ("T3", t3)
T2 2018-09-10 09:30:00.000894-04:00
T3 2018-09-10 09:30:00.000894-04:00

The performance hit you're experiencing is probably because we fall back on using dateutil if the initial parsing attempt fails, which tends to be slow.
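For illustration only (this is not the pandas internals, just a rough sketch assuming Python 3.7+, where %z accepts a colon in the offset), the gap between a fixed-format parse and dateutil's inferring parser can be seen with a quick micro-benchmark:

import timeit

setup = """
import datetime
from dateutil import parser as dateutil_parser
s = "2018-09-10 09:30:00.000894-04:00"
fmt = "%Y-%m-%d %H:%M:%S.%f%z"
"""

# strptime is told the exact layout up front...
print(timeit.timeit("datetime.datetime.strptime(s, fmt)", setup=setup, number=10000))
# ...while dateutil has to infer the layout for every string, which costs more per value.
print(timeit.timeit("dateutil_parser.parse(s)", setup=setup, number=10000))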

If you're still experiencing a performance hit when parsing with %z on master, feel free to reopen.

@bmironov
Author

bmironov commented Nov 9, 2018

@mroeschke Thank you for quick reply.

Assuming that pd.to_datetime falls back to dateutil... then why is running pd.to_datetime outside of pd.read_csv way faster than passing it as the "date_parser" parameter? If it is the same code then it should behave similarly, shouldn't it? It seems to me that the performance hit is not just due to the fallback to another piece of code. Could it be that running it inside read_csv ignores its "errors" parameter?
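A quick way to test that hypothesis (a sketch, assuming pandas 0.23.x behaviour, where the '%z' directive makes strptime raise): errors='ignore' hands back the input unchanged on failure, so checking the dtype of the result shows whether anything was actually parsed.

import pandas as pd

s = pd.Series(["2018-09-10 09:30:00.000894-04:00"])
out = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
# On 0.23.4 this would print 'object', i.e. the original strings were returned untouched,
# which would also explain why the call appears so fast.
print(out.dtype)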

Thanks again,
Boris

@mroeschke
Member

I am not intimately familiar with read_csv, but given the documentation for the date_parser function, it tries at least 3 different ways to use the date_parser function to parse the dates:

Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
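A rough sketch of what that fallthrough can mean for performance (an illustration of the documented behaviour, not the actual read_csv code): if the vectorized attempts (1) and (2) raise, date_parser ends up being called once per row, so a lambda wrapping pd.to_datetime pays the full per-call overhead millions of times.

import pandas as pd

values = ["2018-09-10 09:30:00.000894-04:00"] * 1000
date_parser = lambda x: pd.to_datetime(x, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')

# Ways 1/2: a single vectorized call over the whole column -- cheap per element.
vectorized = date_parser(pd.Index(values))

# Way 3: one call per row -- the fixed cost of each pd.to_datetime call dominates
# once there are millions of rows.
per_row = [date_parser(v) for v in values]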

@bmironov
Author

bmironov commented Nov 9, 2018

So maybe this issue (#23599) could be a good test case and we should keep it open until 0.24 development is done?

@mroeschke
Member

We have tests for the %z directive in our test suite.

In terms of performance, could you rerun your read_csv benchmark on master and see if the performance is still slow? I don't think we have performance benchmarks with parse_dates and read_csv IIRC. Might be worth adding if the performance is still slow.
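For reference, a hedged sketch of what such a benchmark might look like under pandas' asv conventions (the class and method names below are made up for illustration, not an existing benchmark):

from io import StringIO

import numpy as np
import pandas as pd


class ReadCSVParseDatesTzOffset:
    # asv runs setup() before timing each time_* method.
    def setup(self):
        idx = pd.date_range('2018-09-10 09:30', periods=10000, freq='S', tz='US/Eastern')
        df = pd.DataFrame({'timestamp': idx, 'value': np.arange(len(idx))})
        self.csv = df.to_csv(index=False)

    def time_read_csv_date_parser_tz(self):
        pd.read_csv(
            StringIO(self.csv),
            parse_dates=['timestamp'],
            date_parser=lambda x: pd.to_datetime(
                x, format='%Y-%m-%d %H:%M:%S%z', errors='ignore'),
        )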

@bmironov
Author

bmironov commented Nov 9, 2018

Sorry, I'm not familiar with this process. How would I install the dev version of pandas, run the test, and then revert back to the official 0.23.4?

@mroeschke
Member

The process would be to create a separate virtual environment and follow these instructions: https://pandas.pydata.org/pandas-docs/stable/contributing.html#working-with-the-code

@bmironov
Author

bmironov commented Nov 9, 2018

Here is a tiny performance test:

import datetime
import pytz

tz = pytz.timezone("US/Eastern")
t = tz.localize(datetime.datetime.now())
t

Output

datetime.datetime(2018, 11, 9, 17, 24, 19, 41172, tzinfo=<DstTzInfo 'US/Eastern' EST-1 day, 19:00:00 STD>)

Create a test CSV file (it is less than 26 MB on my filesystem):

import numpy as np
import pandas as pd

N = 6000000
data = pd.DataFrame({'timestamp': pd.date_range(t, periods=N, freq='1S'), 'value': np.arange(0, N)})
data.set_index('timestamp', inplace=True)
data.to_csv('test.csv.gz', compression='gzip')

Test #1:

%%time
tmp = pd.read_csv("test.csv.gz", compression='gzip',
                 engine='c')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
tmp.set_index('timestamp', inplace=True)

Output

CPU times: user 11.3 s, sys: 1.57 s, total: 12.9 s
Wall time: 13.7 s

Test #2 (runs after the previous test, so we can expect that the file is in the OS cache now)

%%time
tmp = pd.read_csv("test.csv.gz", compression='gzip',
                  parse_dates=['timestamp'],
                  date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S.%f%z', errors='ignore'),
                  engine='c').set_index('timestamp')

Output

CPU times: user 4min 13s, sys: 7.49 s, total: 4min 20s
Wall time: 4min 49s

print(pd.__version__)

Output

0.23.4

@bmironov
Author

bmironov commented Nov 9, 2018

tmp = pd.read_csv("test.csv.gz", compression='gzip', engine='c')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
tmp.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000000 entries, 0 to 5999999
Data columns (total 2 columns):
timestamp    object
value        int64
dtypes: int64(1), object(1)
memory usage: 91.6+ MB

Interesting observation: the timestamp column is NOT datetime64. But

tmp = pd.read_csv("test.csv.gz", compression='gzip', engine='c')
tmp['timestamp'] = tmp['timestamp'].str.replace('-04:00', '')
tmp['timestamp'] = tmp['timestamp'].str.replace('-05:00', '')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f").dt.tz_localize('US/Eastern')
tmp.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000000 entries, 0 to 5999999
Data columns (total 2 columns):
timestamp    datetime64[ns, US/Eastern]
value        int64
dtypes: datetime64[ns, US/Eastern](1), int64(1)
memory usage: 91.6 MB

Conclusion: with a certain set of parameters it is possible to load the CSV file fast enough, but the data is not in the expected datetime64 dtype.
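One possible workaround for getting a proper datetime64 column on 0.23.x (a sketch; it assumes all the offsets in the file belong to the same zone, and timing will vary by pandas version): parse without an explicit format and with utc=True, then convert to the target zone.

import pandas as pd

tmp = pd.read_csv("test.csv.gz", compression='gzip', engine='c')
# utc=True parses the '-04:00'/'-05:00' offsets and yields datetime64[ns, UTC] ...
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], utc=True)
# ... which can then be converted to the wall-clock zone without string surgery.
tmp['timestamp'] = tmp['timestamp'].dt.tz_convert('US/Eastern')
tmp.info()  # timestamp should now show as datetime64[ns, US/Eastern]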

@bmironov
Author

bmironov commented Nov 9, 2018

@mroeschke Sorry, I just noticed your reply about setting up a virtual environment.
Is there any chance you could run the above tests under your setup and spare me from going through the installation of a virtual environment?

Thanks and best regards,
Boris
