-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: allow construction of datetimes from columns in a DataFrame #12967
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Series | ||
""" | ||
|
||
if not isinstance(unit, dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small issue but this could be is_dict_like
or Mapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
802420d
to
5dcb063
Compare
Rather than the |
+1 on @shoyer's suggestion. Seems very clean and intuitive for the common case of converting columns to a date. |
@shoyer ok that is a nice spelling. Only concern is the orderings of a DataFrame columns when they are positionally based IOW,
I suppose for the common case this makes sense. |
Why does the order matter? I haven't looked at your implementation yet, but my approach would be to simply try to pull columns out of the DataFrame/dict by name (e.g., |
@shoyer ahh good point, but that requires they be named exactly right. Ok can just treat it like a dict then. Last issue is what if MORE things (e.g. unrecognized are passed), e.g.
|
I think an error is appropriate in this case. It would force users to be explicit about which columns they want used, which is good. |
|
dce0d18
to
0f4e95a
Compare
@jreback haven't looked yet, but what happens on |
@TomAugspurger pushed a nicer error msg if dups are passed. |
Current coverage is 83.86%@@ master #12967 diff @@
==========================================
Files 136 136
Lines 49711 49750 +39
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 41679 41720 +41
+ Misses 8032 8030 -2
Partials 0 0
|
|
||
pd.to_datetime(df[['year', 'month', 'day']]) | ||
|
||
.. _whatsnew_0181.other: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is copied over wrongly from the whatsnew file I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
ok, the only thing missing is a nice way to tell the user acceptable units (maybe in the docs). I trivially did it in the doc-string (Example), but should we add a list somewhere? |
assert_series_equal(result, expected) | ||
|
||
# coerce back to int | ||
result = to_datetime(df.astype(str), unit=d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is unit=d
passed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, might be some old code, though it ignores unit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, prob from the initial implementation which used unit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Joris that we should restrict this method to very specific,
verbose names for each component.
On Tue, Apr 26, 2016 at 8:38 AM, Joris Van den Bossche <
[email protected]> wrote:
In pandas/tseries/tests/test_timeseries.py
#12967 (comment):
'month': 'month',
'day': 'd',
'hour': 'h',
'minute': 'm',
'second': 's',
'ms': 'ms',
'us': 'us',
'ns': 'ns'}
result = to_datetime(df.rename(columns=d))
expected = Series([Timestamp('20150204 06:58:10.001002003'),
Timestamp('20160305 07:59:11.001002003')])
assert_series_equal(result, expected)
# coerce back to int
result = to_datetime(df.astype(str), unit=d)
yep, prob from the initial implementation which used unit
—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
https://github.com/pydata/pandas/pull/12967/files/e18c9cc7dcbc6e17c843517a05f348e878a68bcc#r61109133
Some further comments after merge: +1, this is a nice enhancement! But, my main comment is that I would restrict the possibilities. I would just let users have to provide very explicit dict keys/column names (only year, month, day, hour, minute, ... + maybe the plural forms). This makes the method less magical (and solves the documentation issue you noted), and also avoids possible confusing about |
@jorisvandenbossche ok, let me see with a followup for that. |
closes #8158
See the references SO questions in the issue. but allows highly performant construction of datetime series from specified DataFrame columns with a minimal syntax