
Issue parsing '%z' in timestamps via pd.to_datetime #23599


Closed
bmironov opened this issue Nov 9, 2018 · 11 comments

Comments

@bmironov

bmironov commented Nov 9, 2018

Code Sample, a copy-pastable example if possible

First, thanks a lot for this great library. It helps a lot in our day-to-day activities.
I came across odd behaviour when processing timestamps from a set of my CSV files.

s="2018-09-10 09:30:00.000894-04:00"
t1=datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f%z")
print ("T1", t1)
t2=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
print ("T2", t2)
t3=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f")
print ("T3", t3)

Output:

T1 2018-09-10 09:30:00.000894-04:00
T2 2018-09-10 09:30:00.000894-04:00
T3 2018-09-10 13:30:00.000894

Problem description

As we can see, pd.to_datetime CAN properly parse the timezone directive '%z'. If I omit the "errors" parameter, I get an exception:
ValueError: 'z' is a bad directive in format '%Y-%m-%d %H:%M:%S.%f%z'

I have a 6.6M-row CSV file with a column that holds timestamps in the above format (up to microsecond precision, with time zone information like -04:00). This column is my index for the dataframe. First, I load it via a from_csv(filename) call [no extra parameters except the filename]. When I use pd.to_datetime I get the following performance results:

  1. df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f")
    takes 5 min to process
  2. df2['datetime'] = df2['datetime'].str.replace('-04:00', '')
    df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f")
    takes 19 sec to complete
  3. df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
    takes 3 sec to complete

All of the above calls produce the expected result in the dataframe.

I see that recently there was a related issue about processing '%z': #13486

Expected Output

No exceptions and no need to ignore errors

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Thanks again and best regards,
Boris

@bmironov
Author

bmironov commented Nov 9, 2018

Another strange performance metric:

%%time
tmp = pd.read_csv("data/my_data.csv",
                 engine='c')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
tmp.set_index('timestamp', inplace=True)
CPU times: user 395 ms, sys: 74.9 ms, total: 469 ms
Wall time: 495 ms

And what "should" work at least as fast:

%%time
tmp = pd.read_csv("data/my_data.csv",
                 parse_dates=['timestamp'],
                 date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S.%f%z', errors='ignore'),
                 engine='c').set_index('timestamp')
CPU times: user 5.47 s, sys: 206 ms, total: 5.67 s
Wall time: 6.47 s

The CSV file has 6.6M rows of data.

@mroeschke
Member

mroeschke commented Nov 9, 2018

%z was introduced recently and will be a part of the v0.24.0 release.

In [6]: pd.__version__
Out[6]: '0.24.0.dev0+961.gefd1844da'

In [7]: t2=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
   ...: print ("T2", t2)
   ...: t3=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z")
   ...: print ("T3", t3)
T2 2018-09-10 09:30:00.000894-04:00
T3 2018-09-10 09:30:00.000894-04:00

The performance hit you're experiencing is probably because we fall back on using dateutil if the initial parsing attempt fails, which tends to be slow.
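For illustration only (this is not the pandas internals, just a rough sketch assuming Python 3.7+, where %z accepts a colon in the offset), the gap between a fixed-format parse and dateutil's inferring parser can be seen with a quick micro-benchmark:

import timeit

setup = """
import datetime
from dateutil import parser as dateutil_parser
s = "2018-09-10 09:30:00.000894-04:00"
fmt = "%Y-%m-%d %H:%M:%S.%f%z"
"""

# strptime is told the exact layout up front...
print(timeit.timeit("datetime.datetime.strptime(s, fmt)", setup=setup, number=10000))
# ...while dateutil has to infer the layout for every string, which costs more per value.
print(timeit.timeit("dateutil_parser.parse(s)", setup=setup, number=10000))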

If you're still experiencing a performance hit when parsing with %z on master, feel free to reopen.

@bmironov
Author

bmironov commented Nov 9, 2018

@mroeschke Thank you for quick reply.

Assuming that pd.to_datetime falls back to dateutil... then why is running pd.to_datetime outside of pd.read_csv way faster than passing it as the "date_parser" parameter? If it is the same code then it should behave similarly, shouldn't it? It seems to me that the performance hit is not just due to the fallback to another piece of code. Could it be that running it inside read_csv ignores its "errors" parameter?
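A quick way to test that hypothesis (a sketch, assuming pandas 0.23.x behaviour, where the '%z' directive makes strptime raise): errors='ignore' hands back the input unchanged on failure, so checking the dtype of the result shows whether anything was actually parsed.

import pandas as pd

s = pd.Series(["2018-09-10 09:30:00.000894-04:00"])
out = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
# On 0.23.4 this would print 'object', i.e. the original strings were returned untouched,
# which would also explain why the call appears so fast.
print(out.dtype)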

Thanks again,
Boris

@mroeschke
Member

I am not intimately familiar with read_csv, but given the documentation for the date_parser function, it tries at least 3 different ways to use the date_parser function to parse the dates:

Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
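A rough sketch of what that fallthrough can mean for performance (an illustration of the documented behaviour, not the actual read_csv code): if the vectorized attempts (1) and (2) raise, date_parser ends up being called once per row, so a lambda wrapping pd.to_datetime pays the full per-call overhead millions of times.

import pandas as pd

values = ["2018-09-10 09:30:00.000894-04:00"] * 1000
date_parser = lambda x: pd.to_datetime(x, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')

# Ways 1/2: a single vectorized call over the whole column -- cheap per element.
vectorized = date_parser(pd.Index(values))

# Way 3: one call per row -- the fixed cost of each pd.to_datetime call dominates
# once there are millions of rows.
per_row = [date_parser(v) for v in values]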

@bmironov
Author

bmironov commented Nov 9, 2018

So maybe this issue (#23599) could be a good test case and we should keep it open until 0.24 development is done?

@mroeschke
Member

We have tests for the %z directive in our test suite.

In terms of performance, could you rerun your read_csv benchmark on master and see if the performance is still slow? I don't think we have performance benchmarks with parse_dates and read_csv IIRC. Might be worth adding if the performance is still slow.
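For reference, a hedged sketch of what such a benchmark might look like under pandas' asv conventions (the class and method names below are made up for illustration, not an existing benchmark):

from io import StringIO

import numpy as np
import pandas as pd


class ReadCSVParseDatesTzOffset:
    # asv runs setup() before timing each time_* method.
    def setup(self):
        idx = pd.date_range('2018-09-10 09:30', periods=10000, freq='S', tz='US/Eastern')
        df = pd.DataFrame({'timestamp': idx, 'value': np.arange(len(idx))})
        self.csv = df.to_csv(index=False)

    def time_read_csv_date_parser_tz(self):
        pd.read_csv(
            StringIO(self.csv),
            parse_dates=['timestamp'],
            date_parser=lambda x: pd.to_datetime(
                x, format='%Y-%m-%d %H:%M:%S%z', errors='ignore'),
        )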

@bmironov
Author

bmironov commented Nov 9, 2018

Sorry, I'm not familiar with this process. How would I install the dev version of pandas, run the test, and then revert back to the official 0.23.4?

@mroeschke
Member

The process would be to create a separate virtual environment and follow these instructions: https://pandas.pydata.org/pandas-docs/stable/contributing.html#working-with-the-code

@bmironov
Author

bmironov commented Nov 9, 2018

Here is a tiny performance test:

import datetime
import pytz

tz = pytz.timezone("US/Eastern")
t = tz.localize(datetime.datetime.now())
t

Output

datetime.datetime(2018, 11, 9, 17, 24, 19, 41172, tzinfo=<DstTzInfo 'US/Eastern' EST-1 day, 19:00:00 STD>)

Create a test CSV file (it is less than 26 MB on my filesystem):

import numpy as np
import pandas as pd

N = 6000000
data = pd.DataFrame({'timestamp': pd.date_range(t, periods=N, freq='1S'), 'value': np.arange(0, N)})
data.set_index('timestamp', inplace=True)
data.to_csv('test.csv.gz', compression='gzip')

Test #1:

%%time
tmp = pd.read_csv("test.csv.gz", compression='gzip',
                 engine='c')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
tmp.set_index('timestamp', inplace=True)

Output

CPU times: user 11.3 s, sys: 1.57 s, total: 12.9 s
Wall time: 13.7 s

Test #2 (runs after the previous test, so we can expect that the file is in the OS cache now)

%%time
tmp = pd.read_csv("test.csv.gz", compression='gzip',
                  parse_dates=['timestamp'],
                  date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S.%f%z', errors='ignore'),
                  engine='c').set_index('timestamp')

Output

CPU times: user 4min 13s, sys: 7.49 s, total: 4min 20s
Wall time: 4min 49s

print(pd.__version__)

Output

0.23.4

@bmironov
Author

bmironov commented Nov 9, 2018

tmp = pd.read_csv("test.csv.gz", compression='gzip', engine='c')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
tmp.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000000 entries, 0 to 5999999
Data columns (total 2 columns):
timestamp    object
value        int64
dtypes: int64(1), object(1)
memory usage: 91.6+ MB

Interesting observation: the timestamp column is NOT datetime64. But

tmp = pd.read_csv("test.csv.gz", compression='gzip', engine='c')
tmp['timestamp'] = tmp['timestamp'].str.replace('-04:00', '')
tmp['timestamp'] = tmp['timestamp'].str.replace('-05:00', '')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f").dt.tz_localize('US/Eastern')
tmp.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000000 entries, 0 to 5999999
Data columns (total 2 columns):
timestamp    datetime64[ns, US/Eastern]
value        int64
dtypes: datetime64[ns, US/Eastern](1), int64(1)
memory usage: 91.6 MB

Conclusion: with a certain set of parameters it is possible to load the CSV file fast enough, but the data is not in the expected datetime64 dtype.
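One possible workaround for getting a proper datetime64 column on 0.23.x (a sketch; it assumes all the offsets in the file belong to the same zone, and timing will vary by pandas version): parse without an explicit format and with utc=True, then convert to the target zone.

import pandas as pd

tmp = pd.read_csv("test.csv.gz", compression='gzip', engine='c')
# utc=True parses the '-04:00'/'-05:00' offsets and yields datetime64[ns, UTC] ...
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], utc=True)
# ... which can then be converted to the wall-clock zone without string surgery.
tmp['timestamp'] = tmp['timestamp'].dt.tz_convert('US/Eastern')
tmp.info()  # timestamp should now show as datetime64[ns, US/Eastern]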

@bmironov
Author

bmironov commented Nov 9, 2018

@mroeschke Sorry, I just noticed your reply about setting up a virtual environment.
Is there any chance you could run the above tests under your setup and spare me from going through the installation of a virtual environment?

Thanks and best regards,
Boris
