Skip to content

API: to_datetime with ordinals and no unit #15836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
chris-b1 opened this issue Mar 29, 2017 · 8 comments
Open

API: to_datetime with ordinals and no unit #15836

chris-b1 opened this issue Mar 29, 2017 · 8 comments
Labels
Datetime Datetime data dtype Enhancement Error Reporting Incorrect or improved errors from pandas Needs Discussion Requires discussion from core team before further action

Comments

@chris-b1
Copy link
Contributor

xref #15828

Currently, if ordinals are passed to to_datetime without a unit, a ns resolution is assumed. Given relative rarity of ns ordinals in the wild, I've been tripped up by this, e.g., below. I wonder if it would be better to raise in the case of no unit, and force ordinal parsing to always be explicit?

ordinals = pd.Series([1483228800000, 1483315200000, 1483401600000])

pd.to_datetime(ordinals)  # wrong
Out[49]: 
0   1970-01-01 00:24:43.228800
1   1970-01-01 00:24:43.315200
2   1970-01-01 00:24:43.401600
dtype: datetime64[ns]

pd.to_datetime(ordinals, unit='ms')
Out[51]: 
0   2017-01-01
1   2017-01-02
2   2017-01-03
dtype: datetime64[ns]
@chris-b1 chris-b1 added API Design Needs Discussion Requires discussion from core team before further action Datetime Datetime data dtype labels Mar 29, 2017
@jreback
Copy link
Contributor

jreback commented Mar 29, 2017

I do almost exactly that (but only if origin is specfied) in #15828

unit=None is actually the default (though the docs say ns). would need something like this: f5361ec#L464

so +1 from me to require a unit if the arg is numeric

@jreback jreback added this to the 0.20.0 milestone Mar 29, 2017
@atbd
Copy link
Contributor

atbd commented Apr 4, 2017

Hi, i'm working on it and have few questions.
Do I have to take long numbers into account (and so, python2/3 compatibility) ? I think there are too big to be supported.
Plus, with my added tests I found a bug

In [3]: pd.to_datetime(True, unit='ms')
Out[3]: Timestamp('1970-01-01 00:00:00.001000')

In [4]: pd.to_datetime(False, unit='ms')
Out[4]: Timestamp('1970-01-01 00:00:00')

Does somebody confirms it ?

@chris-b1
Copy link
Contributor Author

chris-b1 commented Apr 4, 2017

Numbers than don't fit into int64 space should already raise, I wouldn't expect this change to hit that, though maybe you've found a corner case.

pd.to_datetime([np.iinfo('i8').max + 1])
OverflowError: int too big to convert

And I agree that's a bug - the no unit case seem to be caught ok.

pd.to_datetime(True)
TypeError: <class 'bool'> is not convertible to datetime

@atbd
Copy link
Contributor

atbd commented Apr 4, 2017

Okay, thanks.

The bug comes from array_with_unit_to_datetime which calls array_to_datetime with unit=None | ns but not in the other unit cases.
For now, I check if the arg type is either bool (must be done because bool inherits from int) or numerical and raise the right error. But this makes conflicts with errors='coerce' for example.

My actual check:

if arg is None:                                                                       
    return None                                                                       
elif isinstance(arg, (bool,)):                                                        
    raise TypeError("{0} is not convertible to datetime"
                    .format(type(arg)))                                               
elif isinstance(arg, (int, float)) and unit is None:                                  
    raise ValueError("a unit is required in case of numerical arg")

Do I keep this behaviour or skip the bool case and raise the same numerical error ?

@chris-b1
Copy link
Contributor Author

chris-b1 commented Apr 4, 2017

I'd suggest opening a PR with your changes, even if work-in-progress, that will be easier to review.

I'm not sure I completely understand your question, but with errors='coerce', I suppose the right answer is to return an array full of NaTs , although it admittedly looks strange.

In [98]: pd.to_datetime([1, 2], errors='coerce') 
Out[98]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)



@jreback
Copy link
Contributor

jreback commented Apr 4, 2017

pd.to_datetime([1, 2], errors='coerce')

this would raise, because you haven't specified a unit and you have a numeric entry (which requires a unit).

The check for unit needs to done before any conversions. Its kind of like the origin= check (which also requires a unit to be specified and a numeric type (int/float)).

atbd pushed a commit to atbd/pandas that referenced this issue Apr 4, 2017
* add test_to_datetime_numerical_input
* check arg for numerical type
atbd pushed a commit to atbd/pandas that referenced this issue Apr 4, 2017
* add test_to_datetime_numerical_input
* check arg for numerical type
atbd pushed a commit to atbd/pandas that referenced this issue Apr 5, 2017
* add test_to_datetime_numerical_input
* check arg for numerical type
atbd added a commit to atbd/pandas that referenced this issue Apr 6, 2017
* Tests
  * add test_to_datetime_numerical_input
  * add fixtures
  * update bool tests
* Check argument
  * for numerical type
  * DataFrame with unit
@atbd
Copy link
Contributor

atbd commented Apr 6, 2017

I made the changes to take into account all cases and updated the tests.
Nevertheless, I still have issues with boolean type when mixed with other types (numpy.object dtype) because the errors raising is non-consistent between ns and the other units.
I thought about making something like this:

def check_bool(arg):
    arg = np.asarray(arg)
    return ((arg == True) | (arg == False)).sum()

In case of > 0 raising a ValueError but it seems a little overkill to me

@jreback jreback modified the milestones: Next Major Release, 0.20.0 Apr 12, 2017
atbd added a commit to atbd/pandas that referenced this issue Apr 25, 2017
* Tests
  * add test_to_datetime_numerical_input
  * add fixtures
  * update bool tests
* Check argument
  * for numerical type
  * DataFrame with unit
atbd added a commit to atbd/pandas that referenced this issue Apr 25, 2017
* Tests
  * add test_to_datetime_numerical_input
  * add fixtures
  * update bool tests
* Check argument
  * for numerical type
  * DataFrame with unit
atbd added a commit to atbd/pandas that referenced this issue Apr 26, 2017
* Tests
  * add test_to_datetime_numerical_input
  * add fixtures
  * update bool tests
* Check argument
  * for numerical type
  * DataFrame with unit
atbd added a commit to atbd/pandas that referenced this issue May 15, 2017
* Tests
  * add test_to_datetime_numerical_input
  * add fixtures
  * update bool tests
* Check argument
  * for numerical type
  * DataFrame with unit
@jorisvandenbossche
Copy link
Member

Related issue: #14350

atbd added a commit to atbd/pandas that referenced this issue Jul 11, 2017
All numerical types now need a unit in to_datetime call.

* Update tests
* Add bool verification in tslib.pyx
* Bug correction with bool and unit in ['ms', 'us', 's', 'D']
  * to_datetime(True, unit='ms') did run without error
  * to_datetime(True, unit='ms', errors='ignore') returned None

Issue: pandas-dev#15836
atbd added a commit to atbd/pandas that referenced this issue Jul 12, 2017
All numerical types now need a unit in to_datetime call.

* Update tests
* Add bool verification in tslib.pyx
* Bug correction with bool and unit in ['ms', 'us', 's', 'D']
  * to_datetime(True, unit='ms') did run without error
  * to_datetime(True, unit='ms', errors='ignore') returned None

Issue: pandas-dev#15836
atbd added a commit to atbd/pandas that referenced this issue Jul 12, 2017
All numerical types now need a unit in to_datetime call.

* Update tests
* Add bool verification in tslib.pyx
* Bug correction with bool and unit in ['ms', 'us', 's', 'D']
  * to_datetime(True, unit='ms') did run without error
  * to_datetime(True, unit='ms', errors='ignore') returned None

Issue: pandas-dev#15836
atbd added a commit to atbd/pandas that referenced this issue Aug 30, 2017
All numerical types now need a unit in to_datetime call.

* Update tests
* Add bool verification in tslib.pyx
* Bug correction with bool and unit in ['ms', 'us', 's', 'D']
  * to_datetime(True, unit='ms') did run without error
  * to_datetime(True, unit='ms', errors='ignore') returned None

Issue: pandas-dev#15836
@mroeschke mroeschke added the Bug label Apr 1, 2020
@mroeschke mroeschke added Enhancement Error Reporting Incorrect or improved errors from pandas and removed API Design Bug good first issue labels May 8, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Enhancement Error Reporting Incorrect or improved errors from pandas Needs Discussion Requires discussion from core team before further action
Projects
None yet
7 participants