ENH: allow construction of datetimes from columns in a DataFrame #12967

Closed
wants to merge 2 commits into from

Conversation

jreback
Contributor

@jreback jreback commented Apr 23, 2016

closes #8158

See the referenced SO questions in the issue. This allows highly performant construction of a datetime Series from specified DataFrame columns with minimal syntax.

In [12]: pd.options.display.max_rows=10

In [13]: year = np.arange(2010, 2020)

In [14]: months = np.arange(1, 13)

In [15]: days = np.arange(1, 29)

In [16]: y, m, d = map(np.ravel, np.broadcast_arrays(*np.ix_(year, months, days)))

In [17]: df = DataFrame({'year' : y, 'month' : m, 'day' : d})

In [18]: df
Out[18]: 
      day  month  year
0       1      1  2010
1       2      1  2010
2       3      1  2010
3       4      1  2010
4       5      1  2010
...   ...    ...   ...
3355   24     12  2019
3356   25     12  2019
3357   26     12  2019
3358   27     12  2019
3359   28     12  2019

[3360 rows x 3 columns]

In [19]: pd.to_datetime(df, unit={ c:c for c in df.columns })
Out[19]: 
0      2010-01-01
1      2010-01-02
2      2010-01-03
3      2010-01-04
4      2010-01-05
          ...    
3355   2019-12-24
3356   2019-12-25
3357   2019-12-26
3358   2019-12-27
3359   2019-12-28
dtype: datetime64[ns]

In [20]: %timeit pd.to_datetime(df, unit={ c:c for c in df.columns })
100 loops, best of 3: 2.33 ms per loop

# we are passing a dict mapping the df columns to their units.
# this also covers hours, minutes, seconds, ms, etc., as well as aliases for
# these (e.g. H for 'hours'). I wrote them out to avoid confusion over ``M`` (is it Month or Minute?).
# could also accept ``%Y`` for the strptime mappings.
In [21]: { c:c for c in df.columns }
Out[21]: {'day': 'day', 'month': 'month', 'year': 'year'}
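
For readers unfamiliar with the grid idiom used in In [16], the following is a small sketch of what it produces. The tiny 2 x 2 x 3 input is chosen purely for illustration and is not from the PR. `np.ix_` builds open-mesh arrays, `np.broadcast_arrays` expands them to a common shape, and `np.ravel` flattens each one, so every (year, month, day) combination appears exactly once.

```python
import numpy as np

# Illustrative inputs (much smaller than the ones in the PR description).
year = np.array([2010, 2011])
months = np.array([1, 2])
days = np.array([1, 2, 3])

# np.ix_ yields arrays of shape (2, 1, 1), (1, 2, 1) and (1, 1, 3);
# broadcasting expands each to (2, 2, 3), and ravel flattens to length 12.
y, m, d = map(np.ravel, np.broadcast_arrays(*np.ix_(year, months, days)))

print(y.tolist())
print(m.tolist())
print(d.tolist())
```

The last axis (days) varies fastest, which matches the row order shown in Out[18].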

@jreback jreback added Enhancement Datetime Datetime data dtype labels Apr 23, 2016
@jreback jreback added this to the 0.18.1 milestone Apr 23, 2016
@jreback
Contributor Author

jreback commented Apr 23, 2016

Series
"""

if not isinstance(unit, dict):
Contributor

Small issue but this could be is_dict_like or Mapping

Contributor Author

yep

@jreback jreback force-pushed the dates branch 2 times, most recently from 802420d to 5dcb063 Compare April 23, 2016 20:11
@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 25, 2016
@shoyer
Member

shoyer commented Apr 25, 2016

Rather than the pd.to_datetime(df, unit={'year': 'year', 'month': 'month', 'day': 'day'}), why not spell this pd.to_datetime({'year': df.year, 'month': df.month, 'day': df.day}) (pd.to_datetime(df[['year', 'month', 'day']]) also works)?
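
As a minimal sketch of the two spellings suggested here, assuming a pandas version that includes this feature: both the dict of columns and the DataFrame subset should assemble the same datetimes. The sample values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]})

# Spelling 1: a dict mapping component names to columns.
from_dict = pd.to_datetime({'year': df['year'],
                            'month': df['month'],
                            'day': df['day']})

# Spelling 2: a DataFrame whose column names are the component names.
from_frame = pd.to_datetime(df[['year', 'month', 'day']])

print(from_frame)
```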

@jakevdp
Contributor

jakevdp commented Apr 25, 2016

+1 on @shoyer's suggestion. Seems very clean and intuitive for the common case of converting columns to a date.

@jreback
Contributor Author

jreback commented Apr 25, 2016

@shoyer ok that is a nice spelling. My only concern is the ordering of a DataFrame's columns if they are interpreted positionally.

IOW,

pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second', 'millisecond', 'microsecond', 'nanosecond']]), but could spell that out in the doc-string.

I suppose for the common case this makes sense.

@shoyer
Member

shoyer commented Apr 25, 2016

Why does the order matter? I haven't looked at your implementation yet, but my approach would be to simply try to pull columns out of the DataFrame/dict by name (e.g., hour = arg.get('hour', 0))
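
The by-name approach described above can be sketched roughly as follows. `assemble_by_name` is a hypothetical helper written for illustration, not the implementation in this PR; it assumes integer components and does no validation (an out-of-range day simply rolls over into the next month).

```python
import pandas as pd

def assemble_by_name(arg):
    """Hypothetical helper: build datetimes from named components.

    Columns are looked up by name, so their order does not matter.
    """
    frame = pd.DataFrame(arg)
    y = frame['year'].to_numpy(dtype='int64')
    m = frame['month'].to_numpy(dtype='int64')
    d = frame['day'].to_numpy(dtype='int64')

    # Months since 1970-01, cast to month precision, then to day precision.
    months_since_epoch = (y - 1970) * 12 + (m - 1)
    dates = months_since_epoch.astype('datetime64[M]').astype('datetime64[D]')
    dates = dates + (d - 1).astype('timedelta64[D]')
    result = pd.Series(dates.astype('datetime64[ns]'), index=frame.index)

    # Optional components default to 0 when the column is absent.
    for name, unit in [('hour', 'h'), ('minute', 'min'), ('second', 's')]:
        values = frame.get(name, 0)
        result = result + pd.to_timedelta(values, unit=unit)
    return result

# Columns deliberately out of order, with only one optional component.
df = pd.DataFrame({'day': [4, 5], 'hour': [6, 7],
                   'month': [2, 3], 'year': [2015, 2016]})
out = assemble_by_name(df)
print(out)
```

Because every component is fetched by key, reordering the columns leaves the result unchanged.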

@jreback
Contributor Author

jreback commented Apr 25, 2016

@shoyer ahh good point, but that requires they be named exactly right. Ok, can just treat it like a dict then. Last issue is what if MORE things (e.g. unrecognized keys) are passed, e.g.

{'year' : ..., 'month': ..., 'day' : ..., 'the_hour' : ...}. I think this should raise as this is a user error.

@jakevdp
Contributor

jakevdp commented Apr 25, 2016

Last issue is what if MORE things (e.g. unrecognized are passed)

I think an error is appropriate in this case. It would force users to be explicit about which columns they want used, which is good.
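
A short sketch of what raising on unrecognized keys means for callers, assuming a pandas version that includes this feature. The column name `the_hour` is taken from the example above; the exact error text may differ between versions, so the sketch only relies on the exception type.

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5],
                   'the_hour': [6, 7]})  # not a recognized component name

try:
    pd.to_datetime(df)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Being explicit about the columns to use avoids the error.
result = pd.to_datetime(df[['year', 'month', 'day']])
print(result)
```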

@jreback
Contributor Author

jreback commented Apr 25, 2016

In [1]: df = DataFrame({'year': [2015, 2016],
   ...:                 'month': [2, 3],
   ...:                 'day': [4, 5],
   ...:                 'hour': [6, 7],
   ...:                 'minute': [58, 59],
   ...:                 'second': [10, 11],
   ...:                 'ms': [1, 1],
   ...:                 'us': [2, 2],
   ...:                 'ns': [3, 3]})

In [2]: df
Out[2]: 
   day  hour  minute  month  ms  ns  second  us  year
0    4     6      58      2   1   3      10   2  2015
1    5     7      59      3   1   3      11   2  2016

# I know a typo....
In [3]: pd.to_datetime(df.assign(foo=1))
ValueError: extra columns have been passedto the datetime assemblage: [foo]

In [4]: pd.to_datetime(df[['month','year','day']])
Out[4]: 
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

In [5]: pd.to_datetime(df)
Out[5]: 
0   2015-02-04 06:58:10.001002003
1   2016-03-05 07:59:11.001002003
dtype: datetime64[ns]

@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 25, 2016
@jreback jreback force-pushed the dates branch 2 times, most recently from dce0d18 to 0f4e95a Compare April 25, 2016 19:39
@TomAugspurger
Contributor

@jreback haven't looked yet, but what happens on pd.to_datetime(df[['year', 'month', 'day']]) when one (or more) of those columns have duplicates? We should catch that early and raise.

@jreback
Contributor Author

jreback commented Apr 25, 2016

@TomAugspurger pushed a nicer error msg if dups are passed.
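
To illustrate the duplicate-column case discussed here, the sketch below builds a frame with two columns named `year` and checks that assembly is rejected. It assumes a pandas version that includes this feature and only relies on a `ValueError` being raised, not on the message wording.

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]})

# Concatenating column-wise produces two columns named 'year'.
dup = pd.concat([df, df[['year']]], axis=1)

try:
    pd.to_datetime(dup)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```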

@codecov-io

Current coverage is 83.86%

Merging #12967 into master will increase coverage by +<.01%

@@             master     #12967   diff @@
==========================================
  Files           136        136          
  Lines         49711      49750    +39   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          41679      41720    +41   
+ Misses         8032       8030     -2   
  Partials          0          0          
  1. File ...das/tseries/tools.py was modified. more
    • Misses -4
    • Partials 0
    • Hits +4

Powered by Codecov. Last updated by 521b8af


pd.to_datetime(df[['year', 'month', 'day']])

.. _whatsnew_0181.other:
Member

This line is copied over wrongly from the whatsnew file I think

Contributor Author

fixed

@jreback
Contributor Author

jreback commented Apr 26, 2016

ok, the only thing missing is a nice way to tell the user which units are acceptable (maybe in the docs). I trivially did it in the doc-string (Example), but should we add a list somewhere?

@jreback jreback closed this in 59082e9 Apr 26, 2016
assert_series_equal(result, expected)

# coerce back to int
result = to_datetime(df.astype(str), unit=d)
Member

Why is unit=d passed here?

Contributor Author

hmm, might be some old code, though it ignores unit here.

Member

yep, prob from the initial implementation which used unit

Member

I agree with Joris that we should restrict this method to very specific,
verbose names for each component.

On Tue, Apr 26, 2016 at 8:38 AM, Joris Van den Bossche wrote:

In pandas/tseries/tests/test_timeseries.py
#12967 (comment):

        'month': 'month',
        'day': 'd',
        'hour': 'h',
        'minute': 'm',
        'second': 's',
        'ms': 'ms',
        'us': 'us',
        'ns': 'ns'}
    result = to_datetime(df.rename(columns=d))
    expected = Series([Timestamp('20150204 06:58:10.001002003'),
                       Timestamp('20160305 07:59:11.001002003')])
    assert_series_equal(result, expected)

    # coerce back to int
    result = to_datetime(df.astype(str), unit=d)

yep, prob from the initial implementation which used unit



@jorisvandenbossche
Member

Some further comments after merge: +1, this is a nice enhancement!

But my main comment is that I would restrict the possibilities. I would require users to provide very explicit dict keys/column names (only year, month, day, hour, minute, ... + maybe the plural forms). This makes the method less magical (and solves the documentation issue you noted), and also avoids possible confusion about whether m is not allowed or means month or minute.
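
As a rough illustration of the explicit-names idea, the sketch below uses plural column names and then an ambiguous single-letter name. It reflects the behavior of recent pandas releases as far as I am aware (verbose names and their plurals accepted, a bare `m` rejected); the behavior at the time of this PR may have differed, so treat it as an assumption to verify against your installed version.

```python
import pandas as pd

# Verbose, plural component names.
df = pd.DataFrame({'years': [2015, 2016],
                   'months': [2, 3],
                   'days': [4, 5],
                   'hours': [6, 7],
                   'minutes': [58, 59]})
result = pd.to_datetime(df)
print(result)

# An ambiguous name such as 'm' (month or minute?) is not recognized.
ambiguous = df[['years', 'months', 'days']].rename(columns={'months': 'm'})
try:
    pd.to_datetime(ambiguous)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```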

@jreback
Contributor Author

jreback commented Apr 26, 2016

@jorisvandenbossche ok, let me see with a followup for that.

Labels
Datetime Datetime data dtype Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: convert datetime components (year, month, day, ...) in different columns to datetime
7 participants