Skip to content

Fix/empty string datetimelike conversion/issue 36550 #36834

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

nrebena
Copy link
Contributor

@nrebena nrebena commented Oct 3, 2020

Empty string conversion

The actual comportment of Timedelta, Timestamp and to_datetime is to return NaT.

import pandas as pd
pd.to_datetime("", errors="raise")
# NaT
pd.to_datetime("", errors="coerce")
# NaT
pd.to_datetime("", errors="ignore")
# NaT
pd.Timedelta("")
# NaT
pd.Timestamp("")
# NaT

I would propose to raise a ValueError instead, same as for string that are not datetimelike, as follow:

import pandas as pd
import pytest

with pytest.raises(ValueError):
   pd.to_datetime("", errors="raise")
pd.to_datetime("", errors="coerce")
# NaT
pd.to_datetime("", errors="ignore")
# ''

with pytest.raises(ValueError):
   pd.Timedelta("")
with pytest.raises(ValueError):
   pd.Timestamp("")

Tests

I added the relevant test for the expected comportment.
I still will have to fix some test that rely on the current behaviors.

Benchmark

I ran the following command
asv continuous -E virtualenv -f 1.1 upstream/master HEAD -b "^tslibs.(timestamp|timedelta|tslib)"

       before           after         ratio
     [8f6ec1e8]       [4393cc9e]
     <master>         <fix/empty-string-datetimelike-conversion/issue-36550>
+         173±1ns         202±10ns     1.17  tslibs.timestamp.TimestampProperties.time_is_leap_year(tzlocal(), 'B')
+         162±2ns          188±4ns     1.16  tslibs.timestamp.TimestampProperties.time_is_quarter_start(tzlocal(), None)
+         178±1ns         205±20ns     1.15  tslibs.timestamp.TimestampProperties.time_dayofweek(tzfile('/usr/share/zoneinfo/US/Central'), 'B')
+         163±2ns         186±10ns     1.15  tslibs.timestamp.TimestampProperties.time_is_year_end(tzlocal(), None)
+         162±1ns         185±10ns     1.15  tslibs.timestamp.TimestampProperties.time_is_year_end(tzfile('/usr/share/zoneinfo/US/Central'), None)
+         162±2ns         185±10ns     1.14  tslibs.timestamp.TimestampProperties.time_is_year_end(<UTC>, None)
+         169±2ns         193±10ns     1.14  tslibs.timestamp.TimestampProperties.time_is_quarter_end(tzlocal(), None)
+         164±2ns          186±4ns     1.14  tslibs.timestamp.TimestampProperties.time_is_quarter_start(tzutc(), None)
+     4.60±0.03μs       5.16±0.3μs     1.12  tslibs.timestamp.TimestampProperties.time_is_year_end(datetime.timezone(datetime.timedelta(seconds=3600)), 'B')
+         147±1ns          163±4ns     1.11  tslibs.timestamp.TimestampProperties.time_microsecond(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, 'B')
+         170±2ns          188±7ns     1.11  tslibs.timestamp.TimestampProperties.time_is_quarter_end(tzfile('/usr/share/zoneinfo/US/Central'), None)
+       179±0.7ns         197±10ns     1.10  tslibs.timestamp.TimestampProperties.time_dayofweek(datetime.timezone(datetime.timedelta(seconds=3600)), None)
+       174±0.3ns         191±10ns     1.10  tslibs.timestamp.TimestampProperties.time_is_leap_year(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, 'B')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@jreback
Copy link
Contributor

jreback commented Oct 3, 2020

why this change? likely to break test parsing generally and would have to see a lot of benefit

@nrebena
Copy link
Contributor Author

nrebena commented Oct 3, 2020

Considering empty string equivalent to NaT as a huge influence on all method that try to cast data into datetimelike (happen in groupby, merge, join, equal, probably others methods)
Globaly, I think we should refrain from implicit conversion of string to datetimelike, and this PR goes one step in this direction.
An other issue related to mixing string and datetimelike can be found here #36545

For information the broken test were the following:

pandas/tests/io/parser/test_parse_dates.py::test_parse_dates_empty_string[c_high]
pandas/tests/io/parser/test_parse_dates.py::test_parse_dates_empty_string[c_low]
pandas/tests/io/parser/test_parse_dates.py::test_parse_dates_empty_string[python]
pandas/tests/plotting/test_datetimelike.py::TestTSPlot::test_ts_plot_with_tz['UTC']
pandas/tests/tools/test_to_datetime.py::TestToDatetimeMisc::test_to_datetime_types[True]
pandas/tests/tools/test_to_datetime.py::TestToDatetimeMisc::test_to_datetime_types[False]
pandas/tests/tools/test_to_timedelta.py::TestTimedeltas::test_to_timedelta

If empty string should be always considered as NaT, then I'll close this PR.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this affect read_csv? do we have tests for this?

this needs a fairly big section i the whatsnew as this can break existing code

@@ -104,8 +104,13 @@ def test_identity(klass, value):
@pytest.mark.parametrize("klass", [Timestamp, Timedelta, Period])
@pytest.mark.parametrize("value", ["", "nat", "NAT", None, np.nan])
def test_equality(klass, value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would instead remove the empty and put this in a new test for that case (and assert the error)

these are duplicating the constructor tests so should have either / or (i think they are fine here)

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Oct 16, 2020
@jreback
Copy link
Contributor

jreback commented Oct 16, 2020

does this actually close the referenced issue?

@jreback
Copy link
Contributor

jreback commented Nov 4, 2020

can you merge master

@arw2019
Copy link
Member

arw2019 commented Nov 27, 2020

Closing for now. @nrebena let us know if you'd like to start this up again and we'll reopen

@arw2019 arw2019 closed this Nov 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants