Issue parsing '%z' in timestamps via pd.to_datetime #23599
Comments
Another strange performance metric:

```python
%%time
tmp = pd.read_csv("data/my_data.csv", engine='c')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
tmp.set_index('timestamp', inplace=True)
```
And what "should" work at least as fast:

```python
%%time
tmp = pd.read_csv("data/my_data.csv",
                  parse_dates=['timestamp'],
                  date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S.%f%z', errors='ignore'),
                  engine='c').set_index('timestamp')
```

The CSV file has 6.6M rows of data.
The performance hit you're experiencing is probably because we fall back on using dateutil when the initial parsing attempts fail, which tends to be slow. If you're still experiencing a performance hit when parsing with %z once 0.24 is out, let us know.
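For a rough sense of that gap, here is a minimal sketch (not from the issue; the sample timestamp and timings are illustrative only) comparing per-element dateutil parsing, i.e. the fallback path, with a fixed-format strptime parse:

```python
import timeit
from datetime import datetime
from dateutil import parser as dateutil_parser

ts = "2018-11-08 10:00:00.123456-05:00"  # hypothetical sample value

# Fallback path: dateutil has to guess the format for every element.
print(timeit.timeit(lambda: dateutil_parser.parse(ts), number=10_000))

# Fixed-format path: strptime with a known format (offset stripped here,
# since %z handling of "-05:00" varies across Python versions).
print(timeit.timeit(lambda: datetime.strptime(ts[:26], "%Y-%m-%d %H:%M:%S.%f"),
                    number=10_000))
```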
@mroeschke Thank you for the quick reply. Assuming that pd.to_datetime falls back to dateutil... then why is running pd.to_datetime outside of pd.read_csv so much faster than passing it as the "date_parser" parameter? If it is the same code, shouldn't it behave similarly? It seems to me that the performance hit is not just due to the fallback to another piece of code. Could it be that running it inside read_csv ignores its "errors" parameter? Thanks again,
I am not intimately familiar with the read_csv date-parsing internals, so I can't say for certain.
So maybe this issue (#23599) could be a good test case, and we should keep it open until 0.24 development is done?
We have tests for the date_parser argument. In terms of performance, could you rerun your example on the development version (master)?
Sorry, I'm not familiar with this process. How would I install the dev version of pandas, run the test, and then revert back to the official 0.23.4?
The process would be to create a separate virtual environment and follow these instructions: https://pandas.pydata.org/pandas-docs/stable/contributing.html#working-with-the-code
Here is a tiny performance test:

```python
import datetime

import numpy as np
import pandas as pd
import pytz

tz = pytz.timezone("US/Eastern")
t = tz.localize(datetime.datetime.now())
t
```

Output
Create the test CSV file (it is less than 26 MB on my filesystem):

```python
N = 6000000
data = pd.DataFrame({'timestamp': pd.date_range(t, periods=N, freq='1S'),
                     'value': np.arange(0, N)})
data.set_index('timestamp', inplace=True)
data.to_csv('test.csv.gz', compression='gzip')
```

Test #1:

```python
%%time
tmp = pd.read_csv("test.csv.gz", compression='gzip', engine='c')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
tmp.set_index('timestamp', inplace=True)
```

Output
Test #2 (runs after the previous test, so we can expect the file to be in the OS cache now):

```python
%%time
tmp = pd.read_csv("test.csv.gz", compression='gzip',
                  parse_dates=['timestamp'],
                  date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S.%f%z', errors='ignore'),
                  engine='c').set_index('timestamp')
```

Output

```python
print(pd.__version__)
```

Output
```python
tmp = pd.read_csv("test.csv.gz", compression='gzip', engine='c')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
tmp.info()
```

Output:
Interesting observation: the timestamp column is NOT datetime64. But:

```python
tmp = pd.read_csv("test.csv.gz", compression='gzip', engine='c')
tmp['timestamp'] = tmp['timestamp'].str.replace('-04:00', '')
tmp['timestamp'] = tmp['timestamp'].str.replace('-05:00', '')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], format="%Y-%m-%d %H:%M:%S.%f").dt.tz_localize('US/Eastern')
tmp.info()
```

Output

Conclusion: with a certain set of parameters it is possible to load the CSV file quickly enough, but the data does not end up in the expected datetime64 dtype.
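A possible alternative worth noting (a sketch, not from the issue): instead of stripping the offsets by hand, pd.to_datetime's documented utc= option can absorb the mixed -04:00/-05:00 offsets into a single tz-aware column, which can then be converted to the desired zone:

```python
import pandas as pd

# Parse the offset-bearing strings into one UTC-based datetime64 column,
# then convert to US/Eastern and use it as the index.
tmp = pd.read_csv("test.csv.gz", compression='gzip', engine='c')
tmp['timestamp'] = pd.to_datetime(tmp['timestamp'], utc=True)
tmp['timestamp'] = tmp['timestamp'].dt.tz_convert('US/Eastern')
tmp.set_index('timestamp', inplace=True)
tmp.info()
```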
@mroeschke Sorry, I just noticed your reply about setting up the virtual environment. Thanks and best regards,
Code Sample, a copy-pastable example if possible
First, thanks a lot for this great library. It helps a lot in our day-to-day activities.
I came across odd behaviour when processing timestamps from a set of my CSV files.
Output:
Problem description
As we can see, pd.to_datetime CAN properly parse the timezone directive '%z'. If I omit the "errors" parameter, I get an exception:
ValueError: 'z' is a bad directive in format '%Y-%m-%d %H:%M:%S.%f%z'
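The original code sample did not survive the copy here; the following is a minimal sketch of the behaviour described above on pandas 0.23.x (the sample timestamp is made up):

```python
import pandas as pd

ts = "2018-11-05 09:30:00.000123-05:00"  # hypothetical sample value

# With errors='ignore' the call returns without raising.
pd.to_datetime(ts, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')

# Without errors='ignore', per the description above, pandas 0.23.x raises:
# ValueError: 'z' is a bad directive in format '%Y-%m-%d %H:%M:%S.%f%z'
pd.to_datetime(ts, format="%Y-%m-%d %H:%M:%S.%f%z")
```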
I have a 6.6M-row CSV file with a column that has information in the above format (up to microsecond precision, with time zone information like -04:00). This column is my index for the dataframe. First, I load it via a from_csv(filename) call [no extra parameters except the filename]. When I use pd.to_datetime I get the following performance results:
- takes 5 min to process
- df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f") takes 19 sec to complete
- takes 3 sec to complete

All of the above calls produce the expected result in the dataframe.
I see that there was recently a related issue about processing '%z': #13486
Expected Output
No exceptions and no need to ignore errors
Output of pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Thanks again and best regards,
Boris