Skip to content

Write Excel Round-Trippable Timedeltas Regression #21502

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nathanielatom opened this issue Jun 15, 2018 · 4 comments
Closed

Write Excel Round-Trippable Timedeltas Regression #21502

nathanielatom opened this issue Jun 15, 2018 · 4 comments
Labels
IO Excel read_excel, to_excel Timedelta Timedelta data type

Comments

@nathanielatom
Copy link

nathanielatom commented Jun 15, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
from io import BytesIO

pd.__version__ # 0.22 or 0.23.1

import openpyxl
openpyxl.__version__ # 2.5.4

import xlsxwriter
xlsxwriter.__version__ # 1.0.5

df = pd.DataFrame({'Times': pd.TimedeltaIndex(['00:01:00', '00:02:00', '00:03:00', np.nan])})
memfile = BytesIO()
df.to_excel(memfile, engine='openpyxl') # 'xlsxwriter' or 'openpyxl'
memfile.seek(0)
reread = pd.read_excel(memfile)
reread.iloc[0] == df.iloc[0]
# raises TypeError (reread has float type) when engine='xlsxwriter' or pd.__version__ == '0.23.1'
# if pd.__version__ == '0.22' and engine='openpyxl' evaluates to False

Problem description

Writing a Dataframe with timedeltas to excel using the openpyxl engine has changed behaviour from pandas version 0.22 to 0.23.1. Previously, the excel file produced could be read in as a Dataframe again with values matching the original Dataframe. If the xlsxwriter engine or pandas 0.23.1 is used, the timedeltas become floats in the re-read Dataframe.

Engine \ Pandas Version 0.22 0.23.1
openpyxl X
xlsxwriter X X

Expected Output

One the one hand, it's nice that pandas now has consistent behaviour between engines, however I think the desired behaviour would be for both engines to match the old openpyxl behaviour rather than the old xlsxwriter behaviour, as this would enable Dataframes written to excel to be used again when read back from excel.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.22.0 or 0.23.1
pytest: 3.6.1
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.14.5
scipy: 1.1.0
pyarrow: None
xarray: 0.10.3
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.4
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.0.5
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.8
pymysql: 0.8.1
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Copy link
Member

WillAyd commented Jun 15, 2018

This would have been a result of #19900

Are you saying you were getting those objects back as Timedelta objects or Datetimes before? I think the main motivator to get rid of this in the first place was that it not only was inconsistent across the engines but would implicitly convert a Timedelta to a Datetime in the round-trip process, neither I would think are desirable

@WillAyd WillAyd added IO Excel read_excel, to_excel Timedelta Timedelta data type labels Jun 15, 2018
@nathanielatom
Copy link
Author

nathanielatom commented Jun 15, 2018

It looks like before they were object columns with datetime.time instances. These just got printed the same as timedeltas.

reread.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
Times    3 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes

reread.iloc[0].values
Out[8]: array([datetime.time(0, 1)], dtype=object)

Is it possible to make the original timedelta recoverable?

Found this to convert back, but it requires manually tracking the timedelta columns:

pd.TimedeltaIndex(reread['Times'], unit='D').round(freq='s')
Out[36]: TimedeltaIndex(['00:01:00', '00:02:00', '00:03:00', NaT], dtype='timedelta64[ns]', freq=None)

@WillAyd
Copy link
Member

WillAyd commented Jun 15, 2018

Yea I think that speaks to the danger that was present with the previous implementation - you could very easily mistake a Timedelta for a Time object and they are not the same.

I'm going to close this under the assumption that a lossless conversion of Timedelta information to/from Excel simply is not possible because Excel does not support Timedeltas natively. If you disagree and can think of a way to do that feel free to reopen - investigation and PRs are always welcome!

To answer your last question, you should be able to construct a Timedelta from a float value directly. For specific usage questions however, you are better asking on Stack Overflow both to get an answer and help others out in the future that may have the same question (we typically reserve this tracker for issues and enhancement requests)

@WillAyd WillAyd closed this as completed Jun 15, 2018
@WillAyd WillAyd added this to the No action milestone Jun 15, 2018
@nathanielatom
Copy link
Author

Ok, thank you. I didn't know if there was a way to use metadata to track which columns were timedeltas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel Timedelta Timedelta data type
Projects
None yet
Development

No branches or pull requests

2 participants