PERF: pd.to_datetime, unit='s' much slower for float64 than for int64 #35027

Merged
merged 59 commits into from
Sep 19, 2020
Changes from 57 commits
bb56553
add values.dtype.kind==f branch to array_with_unit_datetime
arw2019 Jun 27, 2020
de81148
remove unnecessary styling changes
arw2019 Jun 27, 2020
9803670
added cast_from_unit definition for float
arw2019 Jun 27, 2020
a224e19
to_datetime: added astyping for floats
arw2019 Jun 29, 2020
a7bb0d1
revert changes
arw2019 Jun 29, 2020
20162fe
revert changes
arw2019 Jun 29, 2020
a332e37
revert styling change
arw2019 Jun 29, 2020
41f22fa
_libs/tslib.pyx added comments
arw2019 Jun 29, 2020
0617b2a
fixed string quotes
arw2019 Jul 1, 2020
a501aa0
removed xfail tests
arw2019 Jul 8, 2020
9be1567
change _libs/tslib.pyx
arw2019 Jul 8, 2020
1030374
revert merge error
arw2019 Jul 8, 2020
ea932a9
revert merge error
arw2019 Jul 8, 2020
9d47f14
simplified 'if not need_to_iterate' branch
arw2019 Jul 8, 2020
a959535
update whatsnew
arw2019 Jul 8, 2020
efbd6ba
fixed string quotes
arw2019 Jul 8, 2020
859b9a5
removed trailing whitespace
arw2019 Jul 8, 2020
a4606a0
rebase tslib.pyx to master
arw2019 Jul 24, 2020
1597253
clean up + NPY_NAT->np.nan
arw2019 Jul 24, 2020
28397b0
added benchmarks
arw2019 Jul 24, 2020
1888681
revert changes to whatsnew
arw2019 Jul 24, 2020
ba5d3b5
fixes merge conflicts
arw2019 Aug 1, 2020
c6d7746
fixes merge conflicts
arw2019 Aug 1, 2020
eb81beb
Merge remote-tracking branch 'upstream/master' into pd-todatetime-uni…
arw2019 Aug 2, 2020
7f68448
rewrote cast in analogy to precision_from_unit
arw2019 Aug 2, 2020
d9fb88f
use np.isnan for floats
arw2019 Aug 2, 2020
b2119b7
revert to fill in mask in final result
arw2019 Aug 2, 2020
2c39cd3
fix sas tests
arw2019 Aug 2, 2020
64c94fb
rewrite cast, rounding, missing values
arw2019 Aug 3, 2020
dd519da
change json test_date_unit
arw2019 Aug 4, 2020
5e5976d
Merge remote-tracking branch 'upstream/master' into pd-todatetime-uni…
arw2019 Sep 10, 2020
b69df7a
revert changes to tests
arw2019 Sep 10, 2020
d37b45c
more refactoring
arw2019 Sep 10, 2020
05fab52
switch np.float -> np.float_
arw2019 Sep 10, 2020
38a533f
rounding now works
arw2019 Sep 10, 2020
b1d8149
rewrite rounding step in array_with_unit_to_datetime
arw2019 Sep 10, 2020
a6d8d9e
Update pandas/_libs/tslib.pyx
arw2019 Sep 11, 2020
e2e600b
Update pandas/_libs/tslib.pyx
arw2019 Sep 11, 2020
c7a3b08
fix typo
arw2019 Sep 11, 2020
c0c31ca
silence numpy-dev warning
arw2019 Sep 11, 2020
111abb7
Merge remote-tracking branch 'upstream/master' into pd-todatetime-uni…
arw2019 Sep 12, 2020
59290a0
feedback
arw2019 Sep 14, 2020
611dad0
fix handling of iNaT with astype(float)
arw2019 Sep 14, 2020
63fa94b
fix floating point errors in sas datetime test
arw2019 Sep 14, 2020
76cd0eb
round floating point error manually in test
arw2019 Sep 15, 2020
46f25a4
Merge remote-tracking branch 'upstream/master' into pd-todatetime-uni…
arw2019 Sep 15, 2020
b308ba7
add note in whatsnew 1.2
arw2019 Sep 15, 2020
1aa7bb2
remove trailing whitespaces
arw2019 Sep 15, 2020
a3f42df
fix typo in added benchmark
arw2019 Sep 15, 2020
8837ff4
flake8 asv_bench
arw2019 Sep 15, 2020
6f9caeb
Merge remote-tracking branch 'upstream/master' into pd-todatetime-uni…
arw2019 Sep 15, 2020
1ff89d4
reorder imports
arw2019 Sep 15, 2020
8084caf
Merge remote-tracking branch 'upstream/master' into pd-todatetime-uni…
arw2019 Sep 18, 2020
c238cec
styling fixes
arw2019 Sep 18, 2020
416035b
restore/add comments re: floating point errors
arw2019 Sep 18, 2020
47c2b5f
rewrote added benchmark
arw2019 Sep 18, 2020
f216a43
typo
arw2019 Sep 18, 2020
5f76f48
merge with upstream
arw2019 Sep 19, 2020
bb8c35b
feedback
arw2019 Sep 19, 2020
23 changes: 23 additions & 0 deletions asv_bench/benchmarks/timeseries.py
@@ -263,6 +263,29 @@ def time_lookup_and_cleanup(self):
         self.ts.index._cleanup()


+class ToDatetimeFromIntsFloats:
+    def setup(self):
+        self.ts_sec = Series(range(1521080307, 1521685107), dtype="int64")
+        self.ts_sec_float = self.ts_sec.astype("float64")
+
+        self.ts_nanosec = 1_000_000 * self.ts_sec
+        self.ts_nanosec_float = self.ts_nanosec.astype("float64")
+
+    # speed of int64 and float64 paths should be comparable
+
+    def time_nanosec_int64(self):
+        to_datetime(self.ts_nanosec, unit="ns")
+
+    def time_nanosec_float64(self):
+        to_datetime(self.ts_nanosec_float, unit="ns")
+
+    def time_sec_int64(self):
+        to_datetime(self.ts_sec, unit="s")
+
+    def time_sec_float64(self):
+        to_datetime(self.ts_sec_float, unit="s")
+
+
 class ToDatetimeYYYYMMDD:
     def setup(self):
         rng = date_range(start="1/1/2000", periods=10000, freq="D")
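
The comparison this benchmark is meant to track can be reproduced outside asv with a rough sketch like the one below. It is not part of the PR: the smaller input range, the variable names, and the use of `timeit` are assumptions made to keep the run short, and the timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical, smaller input than the asv benchmark, to keep the run short.
ts_sec = pd.Series(range(1521080307, 1521090307), dtype="int64")
ts_sec_float = ts_sec.astype("float64")

# Both inputs hold the same whole-second epoch values.
from_int = pd.to_datetime(ts_sec, unit="s")
from_float = pd.to_datetime(ts_sec_float, unit="s")

# Time each path; the PR's goal is for these to be comparable.
int_time = timeit.timeit(lambda: pd.to_datetime(ts_sec, unit="s"), number=20)
float_time = timeit.timeit(lambda: pd.to_datetime(ts_sec_float, unit="s"), number=20)

print(f"int64 input:   {int_time:.4f} s for 20 runs")
print(f"float64 input: {float_time:.4f} s for 20 runs")
```

If the float64 figure is many times larger than the int64 one, the installed pandas is probably still taking the slow element-wise path that the linked issue describes.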
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -225,6 +225,7 @@ Performance improvements
 - Performance improvement in :meth:`GroupBy.agg` with the ``numba`` engine (:issue:`35759`)
 - Performance improvements when creating :meth:`pd.Series.map` from a huge dictionary (:issue:`34717`)
 - Performance improvement in :meth:`GroupBy.transform` with the ``numba`` engine (:issue:`36240`)
+- Performance improvement in :meth:`pd.to_datetime` with non-`ns` time unit for `float` `dtype` columns (:issue:`20445`)

 .. ---------------------------------------------------------------------------

46 changes: 29 additions & 17 deletions pandas/_libs/tslib.pyx
@@ -41,6 +41,7 @@ from pandas._libs.tslibs.conversion cimport (
     cast_from_unit,
     convert_datetime_to_tsobject,
     get_datetime64_nanos,
+    precision_from_unit,
 )
 from pandas._libs.tslibs.nattype cimport (
     NPY_NAT,
@@ -205,6 +206,7 @@ def array_with_unit_to_datetime(
     cdef:
         Py_ssize_t i, j, n=len(values)
         int64_t m
+        int prec = 0
         ndarray[float64_t] fvalues
         bint is_ignore = errors=='ignore'
         bint is_coerce = errors=='coerce'
@@ -217,38 +219,48 @@ def array_with_unit_to_datetime(

     assert is_ignore or is_coerce or is_raise

-    if unit == 'ns':
-        if issubclass(values.dtype.type, np.integer):
-            result = values.astype('M8[ns]')
+    if unit == "ns":
+        if issubclass(values.dtype.type, (np.integer, np.float_)):
+            result = values.astype("M8[ns]")
Contributor

does astype of floats directly to M8 work?

Contributor

i guess it does (as you use it below), but do we have a test specifically for float with unit='ns'?

also can try .astype(..., copy=False)

Member Author

> does astype of floats directly to M8 work?

would it be better to do, here and below:

ivalues = values.view("i8")
result = ivalues.astype("M8[ns]")
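
As background for the `view` versus `astype` question, here is a small NumPy sketch. It is an illustration only, not code from the PR, and the array contents are made up. It shows that `.view("i8")` on a float64 array reinterprets the stored bytes, while `.astype("i8")` converts the values.

```python
import numpy as np

# Hypothetical input: two whole-number floats.
values = np.array([1.0, 2.0], dtype="f8")

# view() keeps the same 8 bytes per element and reads them as int64,
# so the result is the IEEE 754 bit pattern, not the numeric value.
viewed = values.view("i8")

# astype() performs a numeric conversion.
converted = values.astype("i8")

print(viewed)
print(converted)
```

Under this reading, a `view("i8")` on float input would feed bit patterns rather than timestamps into the later `M8[ns]` cast, so it would only be equivalent for arrays that are already int64.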

Member Author

> i guess it does (as you use it below), but do we have a test specifically for float with unit='ns'?

I'll look some more but I think we don't. Will add unless I find one

         else:
             result, tz = array_to_datetime(values.astype(object), errors=errors)
         return result, tz

-    m = cast_from_unit(None, unit)
+    m, p = precision_from_unit(unit)

     if is_raise:
-
-        # try a quick conversion to i8
+        # try a quick conversion to i8/f8
         # if we have nulls that are not type-compat
         # then need to iterate
-        if values.dtype.kind == "i":
-            # Note: this condition makes the casting="same_kind" redundant
-            iresult = values.astype('i8', casting='same_kind', copy=False)
-            # fill by comparing to NPY_NAT constant
+
+        if values.dtype.kind == "i" or values.dtype.kind == "f":
+            iresult = values.astype("i8", copy=False)
+            # fill missing values by comparing to NPY_NAT
             mask = iresult == NPY_NAT
             iresult[mask] = 0
-            fvalues = iresult.astype('f8') * m
+            fvalues = iresult.astype("f8") * m
             need_to_iterate = False

-        # check the bounds
         if not need_to_iterate:
-
-            if ((fvalues < Timestamp.min.value).any()
-                    or (fvalues > Timestamp.max.value).any()):
+            # check the bounds
+            if (fvalues < Timestamp.min.value).any() or (
+                (fvalues > Timestamp.max.value).any()
+            ):
                 raise OutOfBoundsDatetime(f"cannot convert input with unit '{unit}'")
-            result = (iresult * m).astype('M8[ns]')
-            iresult = result.view('i8')
+
+            if values.dtype.kind == "i":
+                result = (iresult * m).astype("M8[ns]")
+
+            if values.dtype.kind == "f":
+                fresult = (values * m).astype("f8")
+                fresult[mask] = 0
+                if prec:
+                    fresult = round(fresult, prec)
+                result = fresult.astype("M8[ns]", copy=False)
+
+            iresult = result.view("i8")
             iresult[mask] = NPY_NAT
+
             return result, tz

     result = np.empty(n, dtype='M8[ns]')
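
The float branch in this hunk can be approximated in plain NumPy as follows. This is a minimal sketch under stated assumptions, not the Cython implementation: the function name `float_unit_to_datetime64` is hypothetical, missing values are detected with `np.isnan` rather than by comparing against `NPY_NAT`, the bounds check is omitted, and the multiplier `m` is passed in by hand instead of coming from `precision_from_unit`. The default `prec=0` mirrors the `int prec = 0` declaration in the diff, where the result of `precision_from_unit` is unpacked into `p`.

```python
import numpy as np


def float_unit_to_datetime64(values, m, prec=0):
    """Convert float epoch values to datetime64[ns] in one vectorised pass.

    m is the number of nanoseconds per input unit (1_000_000_000 for seconds).
    """
    values = np.asarray(values, dtype="f8")

    # Assumption: treat NaN as the missing-value marker.
    mask = np.isnan(values)

    # Scale to nanoseconds while still in float64.
    fresult = values * m
    fresult[mask] = 0  # placeholder so the integer cast below is well defined

    if prec:
        fresult = np.round(fresult, prec)

    # Go through int64 explicitly before reinterpreting as datetime64[ns].
    result = fresult.astype("i8").astype("M8[ns]")

    # Restore missing values as NaT.
    result[mask] = np.datetime64("NaT")
    return result


out = float_unit_to_datetime64([1.5, float("nan"), 0.0], m=1_000_000_000)
print(out)
```

The point of the design is that every step operates on whole arrays, so float input no longer needs the per-element loop further down in `array_with_unit_to_datetime`.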
1 change: 1 addition & 0 deletions pandas/_libs/tslibs/conversion.pxd
@@ -24,5 +24,6 @@ cdef int64_t get_datetime64_nanos(object val) except? -1

 cpdef datetime localize_pydatetime(datetime dt, object tz)
 cdef int64_t cast_from_unit(object ts, str unit) except? -1
+cpdef (int64_t, int) precision_from_unit(str unit)

 cdef int64_t normalize_i8_stamp(int64_t local_val) nogil
4 changes: 2 additions & 2 deletions pandas/tests/io/sas/data/datetime.csv
@@ -1,5 +1,5 @@
 Date1,Date2,DateTime,DateTimeHi,Taiw
-1677-09-22,1677-09-22,1677-09-21 00:12:44,1677-09-21 00:12:43.145226,1912-01-01
+1677-09-22,1677-09-22,1677-09-21 00:12:44,1677-09-21 00:12:43.145225,1912-01-01
 1960-01-01,1960-01-01,1960-01-01 00:00:00,1960-01-01 00:00:00.000000,1960-01-01
 2016-02-29,2016-02-29,2016-02-29 23:59:59,2016-02-29 23:59:59.123456,2016-02-29
-2262-04-11,2262-04-11,2262-04-11 23:47:16,2262-04-11 23:47:16.854774,2262-04-11
+2262-04-11,2262-04-11,2262-04-11 23:47:16,2262-04-11 23:47:16.854775,2262-04-11
8 changes: 5 additions & 3 deletions pandas/tests/tools/test_to_datetime.py
@@ -1217,10 +1217,10 @@ def test_unit_mixed(self, cache):

     @pytest.mark.parametrize("cache", [True, False])
     def test_unit_rounding(self, cache):
-        # GH 14156: argument will incur floating point errors but no
-        # premature rounding
+        # GH 14156 & GH 20445: argument will incur floating point errors
+        # but no premature rounding
         result = pd.to_datetime(1434743731.8770001, unit="s", cache=cache)
-        expected = pd.Timestamp("2015-06-19 19:55:31.877000093")
+        expected = pd.Timestamp("2015-06-19 19:55:31.877000192")
         assert result == expected

     @pytest.mark.parametrize("cache", [True, False])
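
The change in the expected nanoseconds appears to follow from doing the unit multiplication in float64. The sketch below is an illustration in plain Python, not pandas code, and it assumes Python 3.9 or later for `math.ulp`. It shows how coarse float64 becomes once the value is scaled to nanoseconds.

```python
import math

# The input from test_unit_rounding, in seconds since the epoch.
value = 1434743731.8770001

# Gap between adjacent float64 values near 1.4e9 (seconds).
spacing_seconds = math.ulp(value)

# Scale to nanoseconds while staying in float64, as a vectorised
# float path would.
as_ns = value * 1e9

# Gap between adjacent float64 values near 1.4e18 (nanoseconds).
spacing_ns = math.ulp(as_ns)

print(spacing_seconds)
print(spacing_ns)
print(int(as_ns))
```

Near 1.4e18 the representable float64 values are 256 apart, so the scaled result can only land on a multiple of 256 ns. The updated expected timestamp, which corresponds to 1434743731877000192 ns since the epoch, is such a multiple.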
@@ -1454,6 +1454,8 @@ def test_to_datetime_unit(self):
             ]
             + [NaT]
         )
+        # GH20455 argument will incur floating point errors but no premature rounding
+        result = result.round("ms")
         tm.assert_series_equal(result, expected)

         s = pd.concat(