ENH? to_datetime: ability to infer unit for numeric values #17860
Comments
@cailiang9 Can you explain a bit more what this issue is about?
pd.to_datetime(infer=True) does not work smartly; e.g. it cannot automatically infer the unit, so we have to try unit='s' for 1442315569.315 or unit='ms' for 1442315569315.
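For concreteness, this is the behavior being described, using only the standard pd.to_datetime(..., unit=...) API (values taken from the comment above; sub-millisecond rounding may vary slightly with pandas version):

```python
import pandas as pd

# The same instant expressed at two resolutions; the caller has to know the unit.
pd.to_datetime(1442315569.315, unit="s")   # Timestamp('2015-09-15 11:12:49.315000')
pd.to_datetime(1442315569315, unit="ms")   # Timestamp('2015-09-15 11:12:49.315000')

# With the wrong unit the call can still "succeed", just with a wildly wrong date:
pd.to_datetime(1442315569, unit="ms")      # Timestamp('1970-01-17 16:38:35.569000')
```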
I suppose you refer to the infer_datetime_format keyword?
And it deliberately does not handle inferring the unit for numeric values.
The above results are all valid.
You can in general only do this without failure when the input overflows, so you can start with a lower precision unit (s) and proceed to a higher one (ns); we do something like this in read_json. But as a general rule I would be -1 on this.
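A minimal sketch of the escalation idea described above, assuming a hypothetical helper (to_datetime_escalate is not a real pandas API, and read_json's actual logic lives in pandas internals): try the coarser unit first and fall back to a finer one only when conversion overflows the datetime64[ns] bounds.

```python
import pandas as pd
from pandas.errors import OutOfBoundsDatetime

def to_datetime_escalate(values, units=("s", "ms", "us", "ns")):
    """Hypothetical helper: try coarse units first, escalate on overflow."""
    last_err = None
    for unit in units:
        try:
            return pd.to_datetime(values, unit=unit)
        except (OutOfBoundsDatetime, OverflowError, ValueError) as err:
            last_err = err
    raise last_err

# A seconds-resolution value converts on the first attempt; a milliseconds
# value overflows at unit='s' and is only accepted at unit='ms'.
to_datetime_escalate([1442315569])     # DatetimeIndex(['2015-09-15 11:12:49'], ...)
to_datetime_escalate([1442315569315])  # DatetimeIndex(['2015-09-15 11:12:49.315000'], ...)
```

The catch, and presumably part of the -1, is that the guess is only forced when a coarser unit overflows: a small nanosecond value (say 1442315569 intended as ns) fits comfortably as seconds and would be silently misinterpreted.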
@jorisvandenbossche The infer_datetime_format keyword is just an aside. Alternatively, we could print out the selected unit along with a warning telling the user to specify the unit explicitly in the call.
I know this is an old ticket, but since it's still open, I would like to add something here. I recently ran into a problem where a dataframe contains a column of epoch timestamp strings. Most of them are 13-digit integers (1536507914000, unit in ms), but there are a few cases of 16-digit integers (1536507914123000, unit in us). In other words, no single unit I can pass explicitly is correct for both groups of values.
The data I'm working with has many millions of records, so it's hard to spot a few hundred records with 16-digit epoch timestamps, and I only found the bug by accident today. So I agree with @cailiang9 that auto-inferring the unit of timestamps would be very helpful, especially for epoch timestamps. @jorisvandenbossche Could we add some sort of length check when parsing epoch timestamps, so that the unit can be inferred (or at least validated) from the number of digits?
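To make the failure mode concrete (the two values come from the comment above; whether the mismatch raises or coerces depends on the errors= setting and the pandas version):

```python
import pandas as pd

mixed = [1536507914000, 1536507914123000]   # 13 digits (ms) and 16 digits (us)

# Each value is fine when parsed with its own unit:
pd.to_datetime(mixed[0], unit="ms")   # Timestamp('2018-09-09 15:45:14')
pd.to_datetime(mixed[1], unit="us")   # Timestamp('2018-09-09 15:45:14.123000')

# Parsing the whole column as ms pushes the 16-digit rows far outside the
# datetime64[ns] range, so they either raise or are coerced to NaT:
pd.to_datetime(mixed, unit="ms", errors="coerce")
# DatetimeIndex(['2018-09-09 15:45:14', 'NaT'], dtype='datetime64[ns]', freq=None)
```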
xref #15836 looks very similar.
This was discussed in yesterday's dev call; there was consensus that we don't want to be guessing here. Closing.
I implemented a simple function for reference:
Expected Output
infer_as_datetime(dtarr)
DatetimeIndex(['1970-01-01', '1971-01-01', '1971-01-02'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['1971-12-30 12:34:01.314000', '2015-01-30 22:00:00'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2015-09-15 11:12:49.315000'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2015-09-15 11:12:49'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2015-09-15 11:12:49'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2015-09-15 11:12:49'], dtype='datetime64[ns]', freq=None)
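The function body itself is not preserved above, so purely as an illustrative sketch (not the commenter's actual code): one way to realize the idea is to guess the unit from the magnitude of the values, with the cut-offs below being my own assumptions.

```python
import numpy as np
import pandas as pd

def infer_unit(values):
    """Guess an epoch unit from the largest absolute value.
    Illustrative only: the thresholds are assumptions, and the guess is
    inherently ambiguous (a small nanosecond value looks like seconds)."""
    m = np.max(np.abs(np.asarray(values, dtype="float64")))
    if m < 1e11:      # seconds up to roughly year 5000
        return "s"
    elif m < 1e14:
        return "ms"
    elif m < 1e17:
        return "us"
    return "ns"

def infer_as_datetime(values):
    return pd.to_datetime(values, unit=infer_unit(values))

infer_as_datetime([1442315569.315])  # DatetimeIndex(['2015-09-15 11:12:49.315000'], ...)
infer_as_datetime([1442315569315])   # DatetimeIndex(['2015-09-15 11:12:49.315000'], ...)
```

This kind of heuristic is exactly the guessing the dev call above decided against making a default, since it can silently misread values near the unit boundaries.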