Skip to content

API/BUG: inconsistent return in Timestamp/to_datetime for current year #3944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jreback opened this issue Jun 18, 2013 · 13 comments
Closed

API/BUG: inconsistent return in Timestamp/to_datetime for current year #3944

jreback opened this issue Jun 18, 2013 · 13 comments
Labels
API Design Bug Datetime Datetime data dtype
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Jun 18, 2013

after #3890, still unresolved / inconsistent
non-trivial to fix, so pushing to 0.12

In [14]: Timestamp('2012')
Out[14]: Timestamp('2012-06-18 00:00:00', tz=None)

In [15]: to_datetime('2012')
Out[15]: Timestamp('2012-01-01 00:00:00', tz=None)
In [17]: pd.date_range('2014', '2015', freq='M')
Out[17]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-30, ..., 2015-03-31]
Length: 12, Freq: M, Timezone: None
@cpcloud
Copy link
Member

cpcloud commented Jun 19, 2013

In [4]: pd.to_datetime([1, '1'])
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[1970-01-01 00:00:00.000000001, 2013-06-01 00:00:00]
Length: 2, Freq: None, Timezone: None

@hayd
Copy link
Contributor

hayd commented Jun 19, 2013

...having Timestamp(x) dependent on the current time seems a little strange to me.

@jreback
Copy link
Contributor Author

jreback commented Jun 19, 2013

that's the bug

it shouldn't but has to do with the parse order and fix a bit non trivial

@danbirken
Copy link
Contributor

I tried some small-ish fixes for this to just make both of them use the same set of parsing functions, but it turns out there is another wrinkle with this: Timestamp will parse out and use timezone strings, and to_datetime wont:

In [2]: pd.Timestamp('01-01-2012 00:00:00-08:00')
Out[2]: Timestamp('2012-01-01 00:00:00-0800', tz='tzoffset(None, -28800)')

In [3]: pd.Timestamp('01-01-2012 00:00:00-08:00').value
Out[3]: 1325404800000000000

In [4]: pd.to_datetime('01-01-2012 00:00:00-08:00')
Out[4]: Timestamp('2012-01-01 00:00:00', tz=None)

In [5]: pd.to_datetime('01-01-2012 00:00:00-08:00').value
Out[5]: 1325376000000000000

Since multiple tests count on the fact that Timestamp does this properly, I'm assuming Timestamp is right and to_datetime (and by proxy array_to_datetime) should handle the timezone offset. to_datetime in fact does actually properly deal with timezones if passed via a datetime:

In [11]: sample_dt
Out[11]: datetime.datetime(2012, 1, 1, 0, 0, tzinfo=tzoffset(None, -28800))

In [12]: pd.Timestamp(sample_dt).value
Out[12]: 1325404800000000000

In [13]: pd.to_datetime(sample_dt).value
False
Out[13]: 1325404800000000000

This behavior works by accident (you are supposed to have to pass in utc=True to to_datetime for this to happen), but either way it works.

So basically, if the two should have the same interface, then to_datetime needs to handle timezone offset's when given a datetime-like string. And this probably should require utc=True. Sound reasonable?

@jreback
Copy link
Contributor Author

jreback commented Jan 14, 2014

I think array_to_datetime when parsing a string does

_string_to_dts (impl in datetime.pxd)

it appears that this ignores the parsed timezone (the islocal).

I think you could grab that, then do a convert_to_tsobject(val, tz, unit) I believe

@danbirken
Copy link
Contributor

Alright, I have looked into this more and I don't think this issue is worth fixing.

The problem is that array_to_datetime doesn't care about timezones. All it cares about is turning a datetime-ish string into a datetime64 value. It does this very well. _string_to_dts actually does work properly with offsets. But... it just bakes the timezone information into the datetimestruct (which is agnostic to timezone) and calls it a day. So you can't tell the difference between 01-01-2012 08:00:00+08:00 and 01-01-2012 00:00:00-00:00.

However, dateutil.parser.parse_date nicely extracts the timezone information into a python datetime. This allows Timestamp to properly extract the timezone information.

So, if you wanted to unify the two, you'd have to do one of the following:

  1. Update _string_to_dts (and all of the sub-functions) to pass back the parsed timezone information somehow. Pros: Solves the problems, Cons: This is a lot of work and complexity for a very minor problem.

  2. Ditch _string_to_dts and just use the slower dateutil.parser.parse_date function in array_to_datetime. Pros: Solves the problem, Cons: Drastically slows down a common operation (importing a bunch of iso8601 compliant datetime strings).

I don't think either is worth it. However if somebody put a gun to my head and made me choose, I think adding tzoffset to datetimestruct and then updating a bunch of stuff to work with that would be my favorite solution. Honestly it seems to me like datetimestruct should have the tzoffset in there. However, my guess is that would be a fairly large update and certainly would deserve its own issue.

@danbirken
Copy link
Contributor

I am attaching a very minor pull request (#5958) to deal with an inconsistency in how to_datetime parsed timezone offsets. With certain formats (ones that used _string_to_dts) the offsets were used, with all other formats the tz offset was ignored. With this change the tz offset is used properly for all inputs.

@cancan101
Copy link
Contributor

I am not sure if this is a related issue.
This works:

In [400]: 
pd.to_datetime([pd.Timestamp("2013-1-1", tz=pytz.timezone('US/Eastern'))])

Out[400]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00]
Length: 1, Freq: None, Timezone: US/Eastern

but this (adding an NaT) does not. At the very least, the error message is misleading:

In [401]:
pd.to_datetime([pd.Timestamp("2013-1-1", tz=pytz.timezone('US/Eastern')), pd.NaT])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-401-3b5b8a67d2e1> in <module>()
----> 1 pd.to_datetime([pd.Timestamp("2013-1-1", tz=pytz.timezone('US/Eastern')), pd.NaT])

/usr/local/lib/python2.7/dist-packages/pandas-0.13.0rc1_78_g142ca62-py2.7-linux-x86_64.egg/pandas/tseries/tools.pyc in to_datetime(arg, errors, dayfirst, utc, box, format, coerce, unit)
    137         return Series(values, index=arg.index, name=arg.name)
    138     elif com.is_list_like(arg):
--> 139         return _convert_listlike(arg, box=box)
    140 
    141     return _convert_listlike(np.array([ arg ]), box=box)[0]

/usr/local/lib/python2.7/dist-packages/pandas-0.13.0rc1_78_g142ca62-py2.7-linux-x86_64.egg/pandas/tseries/tools.pyc in _convert_listlike(arg, box)
    127                 return DatetimeIndex._simple_new(values, None, tz=tz)
    128             except (ValueError, TypeError):
--> 129                 raise e
    130 
    131     if arg is None:

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

@cancan101
Copy link
Contributor

Okay. I tracked down this issue to a bug in datetime_to_datetime64

When iterating over the elements in datetime_to_datetime64 the check for nullness is util._checknull(val) which is False for NaT.

The correct null check is to use checknull from lib

@cancan101
Copy link
Contributor

I can submit a PR for this if there are no issues.

@cancan101
Copy link
Contributor

cancan101@d28a3fd

@cancan101
Copy link
Contributor

See: #5961

danbirken added a commit to danbirken/pandas that referenced this issue Jan 19, 2014
Currently for certain formats of datetime strings, the tz offset will
just be ignored.
ghost pushed a commit that referenced this issue Feb 4, 2014
BUG: Fix to_datetime to properly deal with tz offsets #3944
@ghost
Copy link

ghost commented Feb 4, 2014

closed by #5958

@ghost ghost closed this as completed Feb 4, 2014
jreback pushed a commit to jreback/pandas that referenced this issue Feb 4, 2014
Currently for certain formats of datetime strings, the tz offset will
just be ignored.
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Bug Datetime Datetime data dtype
Projects
None yet
Development

No branches or pull requests

5 participants