ENH: allow construction of datetimes from columns in a DataFrame #12967

jreback · 2016-04-23T14:38:45Z

See the referenced SO questions in the issue. This allows highly performant construction of a datetime series from specified DataFrame columns with minimal syntax.

In [12]: pd.options.display.max_rows=10

In [13]: year = np.arange(2010, 2020)

In [14]: months = np.arange(1, 13)

In [15]: days = np.arange(1, 29)

In [16]: y, m, d = map(np.ravel, np.broadcast_arrays(*np.ix_(year, months, days)))

In [17]: df = DataFrame({'year' : y, 'month' : m, 'day' : d})

In [18]: df
Out[18]: 
      day  month  year
0       1      1  2010
1       2      1  2010
2       3      1  2010
3       4      1  2010
4       5      1  2010
...   ...    ...   ...
3355   24     12  2019
3356   25     12  2019
3357   26     12  2019
3358   27     12  2019
3359   28     12  2019

[3360 rows x 3 columns]

In [19]: pd.to_datetime(df, unit={ c:c for c in df.columns })
Out[19]: 
0      2010-01-01
1      2010-01-02
2      2010-01-03
3      2010-01-04
4      2010-01-05
          ...    
3355   2019-12-24
3356   2019-12-25
3357   2019-12-26
3358   2019-12-27
3359   2019-12-28
dtype: datetime64[ns]

In [20]: %timeit pd.to_datetime(df, unit={ c:c for c in df.columns })
100 loops, best of 3: 2.33 ms per loop

# We pass a dict mapping the df columns to their units.
# This would also cover hours, minutes, seconds, ms, etc., as well as aliases for
# these (e.g. H for 'hours'). I wrote them out to avoid confusion over ``M``: is it Month or Minute?
# Could also accept ``%Y`` for the strptime mappings.
In [21]: { c:c for c in df.columns }
Out[21]: {'day': 'day', 'month': 'month', 'year': 'year'}

jreback · 2016-04-23T14:40:51Z

cc @unutbu
cc @jakevdp

@shoyer @jorisvandenbossche

max-sixty · 2016-04-23T19:01:10Z

pandas/tseries/tools.py

+    Series
+    """
+
+    if not isinstance(unit, dict):


Small issue but this could be is_dict_like or Mapping
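
To illustrate the difference (a sketch only, not the code in this PR): an `isinstance(unit, dict)` check rejects mapping types that are not `dict` subclasses, while a check against `collections.abc.Mapping` accepts them. The helper name `is_mapping` is hypothetical.

```python
from collections import OrderedDict
from collections.abc import Mapping
from types import MappingProxyType


def is_mapping(obj):
    """Hypothetical helper: True for any mapping, not only dict subclasses."""
    return isinstance(obj, Mapping)


plain = {'year': 'year'}
ordered = OrderedDict(year='year')
read_only = MappingProxyType(plain)  # a mapping that is not a dict subclass

candidates = (plain, ordered, read_only)

# The dict check misses the read-only proxy; the Mapping check does not.
dict_checks = [isinstance(x, dict) for x in candidates]
mapping_checks = [is_mapping(x) for x in candidates]
```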

shoyer · 2016-04-25T16:31:57Z

Rather than the pd.to_datetime(df, unit={'year': 'year', 'month': 'month', 'day': 'day'}), why not spell this pd.to_datetime({'year': df.year, 'month': df.month, 'day': df.day}) (pd.to_datetime(df[['year', 'month', 'day']]) also works)?
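
A minimal sketch of the two spellings suggested here, assuming a pandas release that includes this feature. The sample values and the extra `other` column are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5],
                   'other': ['a', 'b']})  # unrelated column, not selected below

# Spelling 1: select the component columns from the frame.
from_frame = pd.to_datetime(df[['year', 'month', 'day']])

# Spelling 2: pass a dict of the component Series.
from_dict = pd.to_datetime({'year': df['year'],
                            'month': df['month'],
                            'day': df['day']})
```

Both spellings name the components explicitly, so no separate `unit` mapping is needed.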

jakevdp · 2016-04-25T17:22:13Z

+1 on @shoyer's suggestion. Seems very clean and intuitive for the common case of converting columns to a date.

jreback · 2016-04-25T17:43:02Z

@shoyer ok that is a nice spelling. My only concern is the ordering of a DataFrame's columns if they are interpreted positionally

IOW,

pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second', 'millisecond', 'microsecond', 'nanosecond']]), but could spell that out in the doc-string.

I suppose for the common case this makes sense.

shoyer · 2016-04-25T17:45:35Z

Why does the order matter? I haven't looked at your implementation yet, but my approach would be to simply try to pull columns out of the DataFrame/dict by name (e.g., hour = arg.get('hour', 0))
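
A rough pure-Python sketch of the name-based lookup described above, under the assumption that year, month and day are required and the remaining components default to 0. The `assemble` helper and the `REQUIRED`/`OPTIONAL` names are hypothetical and are not the implementation in this PR.

```python
from datetime import datetime

REQUIRED = ('year', 'month', 'day')
OPTIONAL = ('hour', 'minute', 'second')  # each defaults to 0 when absent


def assemble(parts):
    """Hypothetical helper: build datetimes from a dict of equal-length lists."""
    missing = [name for name in REQUIRED if name not in parts]
    if missing:
        raise ValueError('missing required components: %s' % missing)
    n = len(parts['year'])
    columns = [parts[name] for name in REQUIRED]
    # Look up optional components by name, filling in zeros if not supplied.
    columns += [parts.get(name, [0] * n) for name in OPTIONAL]
    return [datetime(*row) for row in zip(*columns)]


# Key order is irrelevant because every component is fetched by name.
result = assemble({'day': [4, 5], 'hour': [6, 7],
                   'year': [2015, 2016], 'month': [2, 3]})
```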

jreback · 2016-04-25T17:51:19Z

@shoyer ahh good point, but that requires that they be named exactly right. Ok, can just treat it like a dict then. Last issue is what happens if MORE things (e.g. unrecognized keys) are passed, e.g.

{'year' : ..., 'month': ..., 'day' : ..., 'the_hour' : ...}. I think this should raise as this is a user error.

jakevdp · 2016-04-25T18:05:41Z

Last issue is what happens if MORE things (e.g. unrecognized keys) are passed

I think an error is appropriate in this case. It would force users to be explicit about which columns they want used, which is good.
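
The behaviour agreed on here can be sketched as a simple key check. This is an illustration under assumed names (`KNOWN`, `check_keys`), not the validation code in the PR, and the error text is made up.

```python
KNOWN = {'year', 'month', 'day', 'hour', 'minute', 'second'}


def check_keys(keys):
    """Hypothetical validator: reject any key that is not a known component."""
    extra = sorted(set(keys) - KNOWN)
    if extra:
        raise ValueError('unrecognized datetime components: %s' % extra)


check_keys(['year', 'month', 'day'])  # passes silently

try:
    check_keys(['year', 'month', 'day', 'the_hour'])
except ValueError as exc:
    message = str(exc)  # names the offending key so the user can fix it
```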

jreback · 2016-04-25T18:45:22Z

In [1]: df = DataFrame({'year': [2015, 2016],
                        'month': [2, 3],
                        'day': [4, 5],
                        'hour': [6, 7],
                        'minute': [58, 59],
                        'second': [10, 11],
                        'ms': [1, 1],
                        'us': [2, 2],
                        'ns': [3, 3]})

In [2]: df
Out[2]: 
   day  hour  minute  month  ms  ns  second  us  year
0    4     6      58      2   1   3      10   2  2015
1    5     7      59      3   1   3      11   2  2016

# I know a typo....
In [3]: pd.to_datetime(df.assign(foo=1))
ValueError: extra columns have been passedto the datetime assemblage: [foo]

In [4]: pd.to_datetime(df[['month','year','day']])
Out[4]: 
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

In [5]: pd.to_datetime(df)
Out[5]: 
0   2015-02-04 06:58:10.001002003
1   2016-03-05 07:59:11.001002003
dtype: datetime64[ns]

TomAugspurger · 2016-04-25T20:24:36Z

@jreback haven't looked yet, but what happens on pd.to_datetime(df[['year', 'month', 'day']]) when one (or more) of those columns have duplicates? We should catch that early and raise.
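
A small sketch of how duplicate column names can arise and be detected up front, assuming a pandas release that includes this feature. It only relies on `to_datetime` raising a `ValueError` for such input; the exact message is not assumed.

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5]})

# Selecting a label twice yields a frame with duplicate column names.
dup = df[['year', 'month', 'day', 'day']]
has_unique_columns = dup.columns.is_unique  # cheap check to run up front

try:
    pd.to_datetime(dup)
    raised = False
except ValueError:
    raised = True
```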

jreback · 2016-04-25T21:45:10Z

@TomAugspurger pushed a nicer error msg if dups are passed.

closes pandas-dev#8158

codecov-io · 2016-04-26T05:50:42Z

Current coverage is 83.86%

Merging #12967 into master will increase coverage by +<.01%

@@             master     #12967   diff @@
==========================================
  Files           136        136          
  Lines         49711      49750    +39   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          41679      41720    +41   
+ Misses         8032       8030     -2   
  Partials          0          0

File ...das/tseries/tools.py was modified. more
- Misses -4
- Partials 0
- Hits +4

Powered by Codecov. Last updated by 521b8af

jorisvandenbossche · 2016-04-26T08:53:52Z

doc/source/timeseries.rst

+
+   pd.to_datetime(df[['year', 'month', 'day']])
+
+.. _whatsnew_0181.other:


This line is copied over wrongly from the whatsnew file I think

jreback · 2016-04-26T14:55:40Z

ok, the only thing missing is a nice way to tell the user acceptable units (maybe in the docs). I trivially did it in the doc-string (Example), but should we add a list somewhere?

jorisvandenbossche · 2016-04-26T15:30:56Z

pandas/tseries/tests/test_timeseries.py

+        assert_series_equal(result, expected)
+
+        # coerce back to int
+        result = to_datetime(df.astype(str), unit=d)


Why is unit=d passed here?

hmm, might be some old code, though it ignores unit here.

yep, prob from the initial implementation which used unit

I agree with Joris that we should restrict this method to very specific,
verbose names for each component.

On Tue, Apr 26, 2016 at 8:38 AM, Joris Van den Bossche wrote:

In pandas/tseries/tests/test_timeseries.py
#12967 (comment):

    'month': 'month',
    'day': 'd',
    'hour': 'h',
    'minute': 'm',
    'second': 's',
    'ms': 'ms',
    'us': 'us',
    'ns': 'ns'}
    result = to_datetime(df.rename(columns=d))
    expected = Series([Timestamp('20150204 06:58:10.001002003'),
                       Timestamp('20160305 07:59:11.001002003')])
    assert_series_equal(result, expected)

    # coerce back to int
    result = to_datetime(df.astype(str), unit=d)

yep, prob from the initial implementation which used unit


jorisvandenbossche · 2016-04-26T15:32:22Z

Some further comments after merge: +1, this is a nice enhancement!

But my main comment is that I would restrict the possibilities. I would require users to provide very explicit dict keys/column names (only year, month, day, hour, minute, ... + maybe the plural forms). This makes the method less magical (and solves the documentation issue you noted), and also avoids possible confusion about whether m is disallowed or means month or minute.
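
One way to picture the restriction proposed here, as a sketch only: accept the verbose component names plus their plural forms and reject everything else, including abbreviations. The `ALLOWED` tuple and the `normalize` helper are hypothetical and do not come from the follow-up PR.

```python
ALLOWED = ('year', 'month', 'day', 'hour', 'minute', 'second')


def normalize(name):
    """Hypothetical helper: map a column name to its canonical component."""
    if name in ALLOWED:
        return name
    if name.endswith('s') and name[:-1] in ALLOWED:
        return name[:-1]  # accept plural forms such as 'hours'
    raise ValueError('unrecognized component name: %r' % name)


canonical = [normalize(n) for n in ('year', 'months', 'days')]

try:
    normalize('m')  # ambiguous abbreviation: month or minute?
    rejected = False
except ValueError:
    rejected = True
```

Listing the accepted names in one place also gives the docs a single list to point at.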

jreback · 2016-04-26T15:35:26Z

@jorisvandenbossche ok, let me see with a followup for that.

xref pandas-dev#12967

xref #12967 closes #12996

jreback added the Enhancement and Datetime labels Apr 23, 2016

jreback added this to the 0.18.1 milestone Apr 23, 2016

jreback force-pushed the dates branch from d03b90a to c4b0d14 Compare April 23, 2016 15:25

max-sixty reviewed Apr 23, 2016

jreback force-pushed the dates branch 2 times, most recently from 802420d to 5dcb063 Compare April 23, 2016 20:11

jreback modified the milestones: 0.18.2, 0.18.1 Apr 25, 2016

max-sixty mentioned this pull request Apr 25, 2016

ENH: classmethods for pd.period_range & pd.date_range #12984

Closed

jreback force-pushed the dates branch from 5dcb063 to aef9280 Compare April 25, 2016 18:41

jreback force-pushed the dates branch from aef9280 to 70a4602 Compare April 25, 2016 18:46

jreback modified the milestones: 0.18.1, 0.18.2 Apr 25, 2016

jreback force-pushed the dates branch 2 times, most recently from dce0d18 to 0f4e95a Compare April 25, 2016 19:39

jreback force-pushed the dates branch from 0f4e95a to 960f1e5 Compare April 25, 2016 21:44

ENH: allow construction of datetimes from columns in a DataFrame

7dc9406

closes pandas-dev#8158

jreback force-pushed the dates branch from 2ae6d76 to 9f4c2a5 Compare April 25, 2016 22:14

jorisvandenbossche reviewed Apr 26, 2016

TST: move .to_datetime() tests to new testing class

e18c9cc

jreback force-pushed the dates branch from 9f4c2a5 to e18c9cc Compare April 26, 2016 13:18

jreback closed this in 59082e9 Apr 26, 2016

jorisvandenbossche reviewed Apr 26, 2016

jreback added a commit to jreback/pandas that referenced this pull request Apr 26, 2016

BUG/DOC: restrict possibilities for assembly of dates using a DataFrame

4b80346

xref pandas-dev#12967

jreback mentioned this pull request Apr 26, 2016

BUG/DOC: restrict possibilities for assembly of dates using a DataFrame #12996

Closed

jreback added a commit that referenced this pull request Apr 26, 2016

BUG/DOC: restrict possibilities for assembly of dates using a DataFrame

db6d009

xref #12967 closes #12996


