
PyTables dates don't work when you switch to a different time zone #2852


Closed
darindillon opened this issue Feb 12, 2013 · 19 comments · Fixed by #4841
Labels: Bug, IO Data (IO issues that don't fit into a more specific label)
Comments

@darindillon

Create a dataframe with a date (not datetime) column as the index. Save that dataframe to the HDFStore. Now open that same file on a computer in a time zone that's behind yours (i.e., write the file in New York and then read it from Dallas). Result: all the dates are off by one.
Here is a code snippet to demonstrate the problem, along with a proposed fix to the pandas code (though someone more qualified than I am should review it).

To reproduce the problem:

# Write a file in the New York time zone
import pandas
import datetime

def writeToFile(fileName):
    dates = [datetime.date.today(), datetime.date.today(), datetime.date.today()]
    numbers = [1, 2, 3]
    p = pandas.DataFrame({"date": dates, "number": numbers}, columns=['date', 'number'])
    p.set_index('date', inplace=True)
    store = pandas.HDFStore(fileName)
    store['obj1'] = p
    store.close()

Then switch to the Dallas (Central) time zone and read that file:

def readFromFile(fileName):
    store = pandas.HDFStore(fileName)
    return store['obj1']

Notice that the dates on the dataframe now show YESTERDAY when they ought to show today.
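For context, here is a minimal sketch (not part of the original report) showing why the time.mktime encoding is timezone-dependent; it assumes a POSIX system where time.tzset is available:

```python
import datetime
import os
import time

def seconds_for(tz_name):
    # Interpret midnight of a fixed date as local wall-clock time in tz_name.
    os.environ['TZ'] = tz_name
    time.tzset()
    return time.mktime(datetime.date(2013, 2, 12).timetuple())

# Midnight in Dallas (CST6CDT) falls one hour after midnight in New York
# (EST5EDT), so the stored epoch value depends on the writer's time zone.
delta = seconds_for('CST6CDT') - seconds_for('EST5EDT')
print(delta)  # 3600.0
```

Truncating that shifted epoch value back to a local date in a different zone is what produces the off-by-one day.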

Proposed fix in pandas/io/pytables.py:

def _convert_index(index):
    ...
    elif inferred_type == 'date':
        # OLD LINE (change this):
        # converted = np.array([time.mktime(v.timetuple()) for v in values],
        #                      dtype=np.int32)
        # NEW LINE:
        converted = np.array([v.toordinal() for v in values], dtype=np.int32)

and then also change:

def _unconvert_index(data, kind):
    ...
    elif kind == 'date':
        # Try fromordinal first. If the date was saved with the old mktime
        # mechanism, this raises a ValueError because the value is out of
        # bounds; in that case convert using the old method (with the bug!).
        try:
            index = np.array([date.fromordinal(v) for v in data], dtype=object)
        except ValueError:
            index = np.array([date.fromtimestamp(v) for v in data], dtype=object)
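By contrast, a proleptic Gregorian ordinal is a pure day count with no time-of-day component, so the proposed encoding round-trips identically in any time zone. A quick sanity check (illustrative, not part of the patch):

```python
import datetime

d = datetime.date(2013, 2, 12)
encoded = d.toordinal()                       # small int, fits in int32
assert datetime.date.fromordinal(encoded) == d

# The fallback above works because legacy mktime values (epoch seconds,
# ~1.36e9 for 2013) exceed date.max.toordinal() (~3.65e6), so
# fromordinal raises ValueError for them.
try:
    datetime.date.fromordinal(1360627200)
except ValueError:
    print('legacy value rejected')
```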

@jreback
Contributor

jreback commented Feb 12, 2013

this is not well tested,
nor are dates really well supported (datetimes are really the way to store these, as they can be represented as int64 internally)

but will take a look

thanks


@jreback
Contributor

jreback commented Feb 12, 2013

this solution looks reasonable

can you do a PR with this (and some tests?)

not sure how to 'fake' the timezone....prob have to do it explicitly..

Also, if toordinal returns an int, then the column should be Int32.
can the values live in 32 bits?

and then you don't actually need the try: except: on the unconvert; instead you can check
the column type (you already know it's a 'date', so if it's a Time32 use the
old method, if Int32 the new method)
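That suggestion could look something like the following sketch. The function and parameter names here are hypothetical, not the actual pandas internals; the idea is just to dispatch on the stored column's atom kind instead of using try/except:

```python
import numpy as np
from datetime import date

def unconvert_date_index(data, atom_kind):
    # atom_kind is assumed to come from the stored PyTables column:
    # 'int32'  -> new ordinal encoding (timezone-independent)
    # 'time32' -> legacy time.mktime encoding (timezone-dependent)
    if atom_kind == 'int32':
        return np.array([date.fromordinal(int(v)) for v in data], dtype=object)
    return np.array([date.fromtimestamp(v) for v in data], dtype=object)

sample = np.array([date(2013, 2, 12).toordinal()], dtype=np.int32)
idx = unconvert_date_index(sample, 'int32')
print(idx[0])  # 2013-02-12
```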

@darindillon
Author

Not sure what a PR is (I assume that's a Git thing, and I'm new to Git). But yes, we have tested the fix described above extensively in our environment with our real-world dataframes, and it's working fine for us.

@jreback
Copy link
Contributor

jreback commented Feb 12, 2013

a pull request:
you can essentially contribute a patch to the main code-base and see that nothing else breaks
https://github.com/pydata/pandas/blob/master/CONTRIBUTING.md

@wesm
Member

wesm commented Mar 12, 2013

Action here?

@jreback
Contributor

jreback commented Mar 12, 2013

pushing till 0.12; have to store extra data for the column, so need to think about how general this needs to be (see #2391, @scottkidder wanted to work on this)

@jreback
Contributor

jreback commented May 8, 2013

this is closed by #3531

@jreback jreback closed this as completed May 8, 2013
@numpand

numpand commented Sep 13, 2013

This issue is not fixed in pandas 0.12.0. I just reproduced it by following the test case outlined in the description.

@jreback
Contributor

jreback commented Sep 13, 2013

this particular issue is untestable, so it was closed. The store correctly stores/retrieves timezone-aware datetimes in 0.12. What exactly are you reproducing?

@numpand

numpand commented Sep 13, 2013

I am using dates (not datetimes). I stored dates in a dataframe index, saved it to an HDF store in the Eastern time zone, and then retrieved it in the Central time zone and got (date - 1) in the index. What exactly do you mean by this issue being untestable?

@jreback
Contributor

jreback commented Sep 13, 2013

datetime.dates are not supported for this issue, only datetime.datetimes. And in general, datetime.dates are not really supported by pandas, for many reasons.

@jreback
Contributor

jreback commented Sep 13, 2013

what's untestable is resetting the default time zone on the computer (well, it IS testable, but the datetimes work fine), so it's not an issue. you are having a different issue, in that datetime.date is only really supported somewhat for historical reasons

@numpand

numpand commented Sep 13, 2013

Well, at the end of the day, datetime.date is "supported" because there is an "if" statement checking for it and doing something based on that (something that's causing a problem when using the HDF store in two different time zones). Is there any harm in making the change suggested by the original submitter?

@jreback
Contributor

jreback commented Sep 13, 2013

can you create a self-contained example for testing? (if so, then no problem)

@numpand

numpand commented Sep 14, 2013

https://www.wakari.io/sharing/bundle/johnv/pandas_issue_2852 - shared an IPython notebook on Wakari. Let me know if you have problems accessing it (the code won't work on Windows because time.tzset is missing there). I'll repeat the code below just in case:

import datetime
import os
import time
import pandas as pd

def setTZ(tz):
    os.environ['TZ']=tz
    time.tzset()

setTZ('EST5EDT')
today = datetime.date(2013,9,10)
df = pd.DataFrame([1,2,3], index = [today, today, today])
print df
filename = 'test.hdf5'
store = pd.HDFStore(filename)
store['obj1'] = df
store.close()

setTZ('CST6CDT')
store = pd.HDFStore(filename)
read_df = store['obj1']
store.close()
print read_df

df.index[0]==read_df.index[0]

@jreback
Contributor

jreback commented Sep 14, 2013

@numpand thanks...see PR #4841

just note that storing a datetime.date in HDF, or using it in pandas at all, is quite inefficient, as the data is pickled.

Even manipulating these in pandas is very inefficient, as they are treated as object dtype. You are basically using python for everything, rather than efficient numpy dtypes.

You should ALWAYS use Timestamp objects instead, which give a dtype of datetime64[ns] and a DatetimeIndex. Storing plain dates doesn't give you any of the features of date-like indexes.

Is there a reason you and @tavistmorph store things this way?
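For readers landing on this thread, the recommended pattern looks like this sketch: construct a DatetimeIndex (datetime64[ns]) rather than an object index of datetime.date values.

```python
import datetime
import pandas as pd

# DatetimeIndex converts plain dates to Timestamps (midnight, naive)
dates = [datetime.date(2013, 9, 10)] * 3
df = pd.DataFrame({'number': [1, 2, 3]}, index=pd.DatetimeIndex(dates))
print(df.index.dtype)  # datetime64[ns]
```

With this dtype the index is stored as int64 nanoseconds internally, so writing and reading the HDF file is timezone-independent for naive values.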

@darindillon
Author

We want to be able to use datetime.date because it's a standard, standard, standard python type that's built in to the language and is needed to interact with hundreds of other non-pandas libraries. No question that datetime64 is more efficient, but for data with less than a million rows, the efficiency gains are outweighed by the convenience of not having to convert datetime64 to datetime.date every time we want to interact with some other library.
In my case, the dataframe is built from dates provided by another library. datetime.date is an official python standard, so it seems odd not to support it.
At the very least, if we are not supporting it, we should raise a NotSupportedException or something rather than giving the wrong behavior described above. We either support it or we don't; there is no middle ground.
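One possible middle ground, sketched here as an illustration rather than anything from the thread: keep datetime64[ns] inside pandas for storage and math, and convert at the boundary when another library needs plain datetime.date objects.

```python
import datetime
import pandas as pd

idx = pd.DatetimeIndex([datetime.date(2013, 9, 10),
                        datetime.date(2013, 9, 11)])
# Timestamp -> datetime.date only at the edge, when calling other libraries
as_dates = [ts.date() for ts in idx]
print(as_dates[0])  # 2013-09-10
```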


@jreback
Contributor

jreback commented Sep 14, 2013

PR was just merged. you are welcome.

@numpand

numpand commented Sep 14, 2013

@jreback thank you! I'll keep in mind the performance penalty and revisit the usage of datetime.date in DataFrames in existing code.
