ENH: Adding origin parameter in pd.to_datetime #11470

10 changes: 10 additions & 0 deletions doc/source/whatsnew/v0.19.0.txt
@@ -24,6 +24,16 @@ New features

.. _whatsnew_0190.dev_api:

to_datetime can be used with Offset
Contributor

to_datetime has gained an origin kwarg.

Don't call this Offset, which has a very specific meaning.

Contributor

in the doc-string this is a 'reference date', so I would use that here.

Author

Done.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``pd.to_datetime`` has a new parameter, ``origin``, to define an offset for ``DatetimeIndex``.
Contributor

Say it's a starting offset.

Author

Done.


.. ipython:: python

   pd.to_datetime([1,2,3], unit='D', origin=pd.Timestamp('1960-01-01'))

The above code returns the dates that lie 1, 2 and 3 days after the reference date given by ``origin``.
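
As a rough illustration of the semantics described here, the sketch below reproduces the arithmetic with the standard library only: each number is treated as a count of `unit` steps from the reference date. The helper name `from_origin` and its unit table are assumptions made for this example and are not part of pandas.

```python
from datetime import datetime, timedelta

# Hypothetical unit table for this sketch; pandas supports more units.
_UNIT_TO_TIMEDELTA = {
    "D": timedelta(days=1),
    "s": timedelta(seconds=1),
    "ms": timedelta(milliseconds=1),
    "us": timedelta(microseconds=1),
}


def from_origin(values, unit="D", origin=datetime(1970, 1, 1)):
    """Interpret each value as a number of `unit` steps since `origin`."""
    step = _UNIT_TO_TIMEDELTA[unit]
    return [origin + value * step for value in values]


days = from_origin([1, 2, 3], unit="D", origin=datetime(1960, 1, 1))
print([d.date().isoformat() for d in days])
# ['1960-01-02', '1960-01-03', '1960-01-04']
```

With the default origin the helper falls back to plain Unix-epoch arithmetic, which mirrors the default described for the new parameter.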

pandas development API
^^^^^^^^^^^^^^^^^^^^^^

1 change: 0 additions & 1 deletion doc/source/whatsnew/v0.20.0.txt
@@ -24,7 +24,6 @@ New features




.. _whatsnew_0200.enhancements.other:

Other enhancements
42 changes: 42 additions & 0 deletions pandas/tseries/tests/test_timeseries.py
@@ -772,6 +772,48 @@ def test_to_datetime_unit(self):
        result = to_datetime([1, 2, 111111111], unit='D', errors='coerce')
        tm.assert_index_equal(result, expected)

    def test_to_datetime_origin(self):
        units = ['D', 's', 'ms', 'us', 'ns']
        # Addresses Issue Number 11276, 11745
        # for origin as julian
Contributor

add the issue number as a comment

        julian_dates = pd.date_range(
            '2014-1-1', periods=10).to_julian_date().values
        result = Series(pd.to_datetime(
            julian_dates, unit='D', origin='julian'))
Member

You can omit the Series(...) here and in the 'expected' below as well. It will then return indexes (so you need to use assert_index_equal instead of assert_series_equal), but it will make the test a little easier to follow.

Author

using assert_index_equal raises AssertionError.

AssertionError: [index] Expected type <class 'pandas.indexes.base.Index'>, found

Member

ah, yes, sorry. It then probably returns a numpy array and not an Index. Then you can leave it as is.

        expected = Series(pd.to_datetime(
            julian_dates - pd.Timestamp(0).to_julian_date(), unit='D'))
        assert_series_equal(result, expected)
Contributor

Add tests for invalid input when origin='julian', i.e. non-integer and out-of-range Julian dates.


        # checking for invalid combination of origin='julian' and unit != D
        for unit in units:
            if unit == 'D':
                continue
            with self.assertRaises(ValueError):
                pd.to_datetime(julian_dates, unit=unit, origin='julian')

        # for origin as 1960-01-01
        epoch_1960 = pd.Timestamp('1960-01-01')
        epoch_timestamp_convertible = [epoch_1960, epoch_1960.to_datetime(),
                                       epoch_1960.to_datetime64(),
                                       str(epoch_1960)]
        invalid_origins = ['random_string', '13-24-1990']
        units_from_epoch = [0, 1, 2, 3, 4]

        for unit in units:
            for epoch in epoch_timestamp_convertible:
                expected = Series(
                    [pd.Timedelta(x, unit=unit) +
                     epoch_1960 for x in units_from_epoch])
                result = Series(pd.to_datetime(
                    units_from_epoch, unit=unit, origin=epoch))
                assert_series_equal(result, expected)

            # check for invalid origins
            for origin in invalid_origins:
                with self.assertRaises(ValueError):
                    pd.to_datetime(units_from_epoch, unit=unit,
                                   origin=origin)

    def test_series_ctor_datetime64(self):
        rng = date_range('1/1/2000 00:00:00', '1/1/2000 1:59:50', freq='10s')
        dates = np.asarray(rng)
70 changes: 53 additions & 17 deletions pandas/tseries/tools.py
@@ -179,7 +179,7 @@ def _guess_datetime_format_for_array(arr, **kwargs):
mapping={True: 'coerce', False: 'raise'})
def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,
utc=None, box=True, format=None, exact=True, coerce=None,
unit=None, infer_datetime_format=False):
unit=None, infer_datetime_format=False, origin='epoch'):
"""
Convert argument to datetime.

@@ -238,6 +238,19 @@ def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,
datetime strings, and if it can be inferred, switch to a faster
method of parsing them. In some cases this can increase the parsing
speed by ~5-10x.
    origin : scalar convertible to Timestamp / string ('julian', 'epoch'),
        default 'epoch'.
        Define reference date. The numeric values would be parsed as number
        of units (defined by `unit`) since this reference date.

        - If 'epoch', origin is set to 1970-01-01.
Contributor

You could consider other standard offsets: Excel (Jan 0, 1900), Stata (Jan 0, 1960), MATLAB (Jan 0, 0000)? Just an idea.

        - If 'julian', unit must be 'D', and origin is set to beginning of
          Julian Calendar. Julian day number 0 is assigned to the day starting
          at noon on January 1, 4713 BC.
        - If Timestamp convertible, origin is set to Timestamp identified by
          origin.

        .. versionadded:: 0.19.0
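
Because `origin` is documented as accepting any Timestamp-convertible reference date, day counts exported from other tools could in principle be handled by passing that tool's day zero. The sketch below illustrates the idea with the standard library only. The `NAMED_ORIGINS` table and the `days_since` helper are hypothetical, and the reference dates in the table are assumptions for illustration that should be checked against each tool's documentation.

```python
from datetime import date, timedelta

# Hypothetical lookup table; these reference dates are assumptions of this
# sketch, not values defined by pandas.
NAMED_ORIGINS = {
    "unix": date(1970, 1, 1),
    "excel": date(1899, 12, 30),  # commonly used for Excel serial dates
    "stata": date(1960, 1, 1),
}


def days_since(values, origin):
    """Convert day counts to dates relative to a named or explicit origin."""
    if isinstance(origin, str):
        if origin not in NAMED_ORIGINS:
            raise ValueError("unknown origin name: %r" % (origin,))
        origin = NAMED_ORIGINS[origin]
    return [origin + timedelta(days=value) for value in values]


print([d.isoformat() for d in days_since([25569], "excel")])
# ['1970-01-01']
print([d.isoformat() for d in days_since([0, 1], "stata")])
# ['1960-01-01', '1960-01-02']
```

Keeping the names in a plain dictionary means a new convention is a one-line addition, while an explicit date still works for anything not in the table.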

Returns
Contributor

State what origin means if it's not 'julian'.

Author

Rectified.

-------
@@ -294,8 +307,14 @@ def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,
    >>> %timeit pd.to_datetime(s,infer_datetime_format=False)
    1 loop, best of 3: 471 ms per loop

    """
    Using non-epoch origins to parse date

    >>> pd.to_datetime([1,2,3], unit='D', origin=pd.Timestamp('1960-01-01'))
    0 1960-01-02
    1 1960-01-03
    2 1960-01-04

    """
    from pandas.tseries.index import DatetimeIndex

    tz = 'utc' if utc else None
@@ -406,22 +425,39 @@ def _convert_listlike(arg, box, format, name=None, tz=tz):
except (ValueError, TypeError):
raise e

    if arg is None:
        return arg
    elif isinstance(arg, tslib.Timestamp):
        return arg
    elif isinstance(arg, ABCSeries):
        from pandas import Series
        values = _convert_listlike(arg._values, False, format)
        return Series(values, index=arg.index, name=arg.name)
    elif isinstance(arg, (ABCDataFrame, MutableMapping)):
        return _assemble_from_unit_mappings(arg, errors=errors)
    elif isinstance(arg, ABCIndexClass):
        return _convert_listlike(arg, box, format, name=arg.name)
    elif is_list_like(arg):
        return _convert_listlike(arg, box, format)
    def result_without_offset(arg):
Contributor

offset -> origin

Author

Done.

        if origin == 'julian':
            if unit != 'D':
                raise ValueError("unit must be 'D' for origin='julian'")
            arg = arg - tslib.Timestamp(0).to_julian_date()
Contributor

@jreback, Aug 19, 2016

arg needs validation here: it MUST be an integer. Actually, it must be a valid Julian date, though I think we check that elsewhere.

Author

not sure how to proceed with the aforementioned check.
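
One possible way to approach the check discussed in this thread is sketched below with the standard library only. It assumes that timestamps are stored as signed 64-bit nanosecond counts from the Unix epoch and that the Julian date of 1970-01-01 00:00 is 2440587.5. The function name `validate_julian_dates` is hypothetical. The sketch accepts any finite real number rather than integers only, since the values produced by `to_julian_date()` in the test are fractional.

```python
import math
from numbers import Real

UNIX_EPOCH_JULIAN_DAY = 2440587.5  # Julian date of 1970-01-01 00:00
# Span of a signed 64-bit nanosecond counter, expressed in days (assumption).
MAX_DAYS_FROM_EPOCH = (2 ** 63 - 1) / (86400 * 10 ** 9)


def validate_julian_dates(values):
    """Return days since the Unix epoch, raising ValueError on bad input."""
    days = []
    for value in values:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            raise ValueError("Julian date must be numeric, got %r" % (value,))
        if not math.isfinite(value):
            raise ValueError("Julian date must be finite, got %r" % (value,))
        offset = value - UNIX_EPOCH_JULIAN_DAY
        if abs(offset) > MAX_DAYS_FROM_EPOCH:
            raise ValueError("Julian date %r is out of bounds" % (value,))
        days.append(offset)
    return days


print(validate_julian_dates([2440587.5, 2456658.5]))
# [0.0, 16071.0]
```

Raising `ValueError` for every failure keeps the behaviour in line with the other origin checks in this change, which also raise `ValueError`.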

        if arg is None:
            return arg
        elif isinstance(arg, tslib.Timestamp):
            return arg
        elif isinstance(arg, ABCSeries):
            from pandas import Series
            values = _convert_listlike(arg._values, False, format)
            return Series(values, index=arg.index, name=arg.name)
        elif isinstance(arg, (ABCDataFrame, MutableMapping)):
            return _assemble_from_unit_mappings(arg, errors=errors)
        elif isinstance(arg, ABCIndexClass):
            return _convert_listlike(arg, box, format, name=arg.name)
        elif is_list_like(arg):
            return _convert_listlike(arg, box, format)
        return _convert_listlike(np.array([arg]), box, format)[0]

    result = result_without_offset(arg)

    offset = None
    if origin != 'epoch' and origin != 'julian':
Contributor

if origin not in ['epoch', 'julian']:

Author

Done.

        try:
            offset = tslib.Timestamp(origin) - tslib.Timestamp(0)
        except ValueError:
            raise ValueError("Invalid 'origin' or 'origin' Out of Bound")

    return _convert_listlike(np.array([arg]), box, format)[0]
    if offset is not None:
        result = result + offset
    return result
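
The hunk above appears to parse the input relative to the Unix epoch first and then shift the result by the distance between `origin` and the epoch, re-basing Julian dates beforehand. A minimal standard-library sketch of that two-step idea follows. It handles day units only, `parse_days` is a hypothetical name, and the Julian date of the Unix epoch (2440587.5) is hard-coded as an assumption rather than taken from pandas.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
EPOCH_JULIAN_DAY = 2440587.5  # Julian date of 1970-01-01 00:00


def parse_days(values, origin="epoch"):
    """Parse day counts in two steps: relative to the epoch, then shifted."""
    if origin == "julian":
        # Re-base Julian dates onto the Unix epoch before parsing.
        values = [value - EPOCH_JULIAN_DAY for value in values]
    result = [EPOCH + timedelta(days=value) for value in values]
    if origin not in ("epoch", "julian"):
        offset = origin - EPOCH  # timedelta between the two reference dates
        result = [moment + offset for moment in result]
    return result


print([d.isoformat() for d in parse_days([2456658.5], origin="julian")])
# ['2014-01-01T00:00:00']
print([d.isoformat() for d in parse_days([1, 2], origin=datetime(1960, 1, 1))])
# ['1960-01-02T00:00:00', '1960-01-03T00:00:00']
```

Shifting after parsing gives the same result as counting directly from `origin`, because adding the two offsets commutes.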

# mappings for assembling units
_unit_map = {'year': 'year',