Fix accidental loss-of-precision for to_datetime(str, unit=...) #57548

QuLogic · 2024-02-21T12:08:47Z

In pandas 1.5.3, the float(val) cast was inlined in the cast_from_unit call in array_with_unit_to_datetime, so the intermediate (unnamed) value was a Python float.

#50301 added a temporary variable to avoid multiple casts, but declared it with the explicit type cdef float, which is a Cython (C) float. That type is 32-bit, so it loses precision and causes a regression in parsing relative to 1.5.3.

Since cast_from_unit takes an object, not a more specific Cython type, remove the explicit type from the temporary fval variable entirely. This will cause it to be a (64-bit) Python float, and thus not lose precision.
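
To make the size of the loss concrete, here is a minimal standard-library sketch that round-trips a value through a 32-bit float and through a 64-bit one. The epoch value is only illustrative and the helper names are hypothetical; this does not go through the pandas code path at all.

```python
import struct


def round_trip_float32(value: float) -> float:
    """Store value as an IEEE 754 single (like a C float) and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def round_trip_float64(value: float) -> float:
    """Store value as an IEEE 754 double (like a Python float) and read it back."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


# Illustrative seconds-since-epoch value parsed from a string (an assumption
# chosen for this sketch, not taken from the pandas test suite).
seconds = float("1704660000")

# Near 1.7e9 a 32-bit float can only represent multiples of 128, so the value
# is rounded to the nearest representable neighbour.
print(seconds - round_trip_float32(seconds))  # 32.0 (seconds of error)
print(seconds - round_trip_float64(seconds))  # 0.0
```

With unit="s", an error of that size would shift the parsed timestamp by about half a minute, which is the kind of regression described above.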

Fixes #57051

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
[n/a] Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

WillAyd · 2024-02-21T12:23:47Z

Can this be declared as double instead? Using a PyObject adds unnecessary overhead.

QuLogic · 2024-02-21T23:48:19Z

As noted, cast_from_unit takes an object, so that doesn't seem to save anything. Here's the diff of the generated code when you add the type (vs having it commented out since that reduces line number changes in the diff):

--- untyped/pandas/_libs/tslib.pyx.c	2024-02-21 18:27:51.686387390 -0500
+++ typed/pandas/_libs/tslib.pyx.c	2024-02-21 18:24:14.824227898 -0500
@@ -25828,9 +25828,9 @@
   int __pyx_v_is_raise;
   PyArrayObject *__pyx_v_iresult = 0;
   PyDateTime_TZInfo *__pyx_v_tz = 0;
+  double __pyx_v_fval;
   PyObject *__pyx_v_result = NULL;
   PyObject *__pyx_v_val = NULL;
-  double __pyx_v_fval;
   PyObject *__pyx_v_err = NULL;
   __Pyx_LocalBuf_ND __pyx_pybuffernd_iresult;
   __Pyx_Buffer __pyx_pybuffer_iresult;
@@ -25919,14 +25919,14 @@
  *         bint is_raise = errors == "raise"
  *         ndarray[int64_t] iresult
  *         tzinfo tz = None             # <<<<<<<<<<<<<<
- *         # double fval
+ *         double fval
  * 
  */
   __Pyx_INCREF(Py_None);
   __pyx_v_tz = ((PyDateTime_TZInfo *)Py_None);
 
   /* "pandas/_libs/tslib.pyx":280
- *         # double fval
+ *         double fval
  * 
  *     assert is_coerce or is_raise             # <<<<<<<<<<<<<<
  * 
@@ -32771,7 +32771,7 @@
  *     ndarray[object] values,
  *     str unit,
  */
-  __pyx_tuple__37 = PyTuple_Pack(13, __pyx_n_s_values, __pyx_n_s_unit, __pyx_n_s_errors, __pyx_n_s_i, __pyx_n_s_n, __pyx_n_s_is_coerce, __pyx_n_s_is_raise, __pyx_n_s_iresult, __pyx_n_s_tz, __pyx_n_s_result, __pyx_n_s_val, __pyx_n_s_fval, __pyx_n_s_err); if (unlikely(!__pyx_tuple__37)) __PYX_ERR(0, 239, __pyx_L1_error)
+  __pyx_tuple__37 = PyTuple_Pack(13, __pyx_n_s_values, __pyx_n_s_unit, __pyx_n_s_errors, __pyx_n_s_i, __pyx_n_s_n, __pyx_n_s_is_coerce, __pyx_n_s_is_raise, __pyx_n_s_iresult, __pyx_n_s_tz, __pyx_n_s_fval, __pyx_n_s_result, __pyx_n_s_val, __pyx_n_s_err); if (unlikely(!__pyx_tuple__37)) __PYX_ERR(0, 239, __pyx_L1_error)
   __Pyx_GOTREF(__pyx_tuple__37);
   __Pyx_GIVEREF(__pyx_tuple__37);
   __pyx_codeobj__38 = (PyObject*)__Pyx_PyCode_New(3, 0, 0, 13, 0, CO_OPTIMIZED|CO_NEWLOCALS, __pyx_empty_bytes, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_tuple__37, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_kp_s_home_elliott_code_pandas_pandas, __pyx_n_s_array_with_unit_to_datetime, 239, __pyx_empty_bytes); if (unlikely(!__pyx_codeobj__38)) __PYX_ERR(0, 239, __pyx_L1_error)

WillAyd · 2024-02-22T00:19:04Z

Fair point on the existing structure of the code. But if it works, we should still add it; generally, the more typing in Cython the better. And if this gets refactored in the future, the next developer won't have to think about it.

WillAyd

lgtm

WillAyd · 2024-02-22T00:56:51Z

If you see us using float anywhere else, it's probably worth changing to double everywhere in Cython (in a follow-up PR).
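
One rough way to look for such declarations is sketched below. The regular expression is a heuristic assumed for this illustration (it only catches declarations at the start of a line, not function arguments), and the function name and sample text are hypothetical rather than part of pandas.

```python
import re

# Heuristic (an assumption of this sketch): a C-level declaration looks like
# "cdef float name", or "float name" on its own line inside a "cdef:" block.
FLOAT_DECL = re.compile(r"^\s*(?:cdef\s+)?float\s+[A-Za-z_]\w*")


def find_float_declarations(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) pairs that look like C float declarations."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if FLOAT_DECL.match(line):
            hits.append((lineno, line.strip()))
    return hits


# Made-up Cython-like text used only to exercise the helper.
sample = """\
cdef:
    Py_ssize_t i
    float fval
    double other
cdef float ratio = 0.5
x = float(val)
"""

print(find_float_declarations(sample))
# [(3, 'float fval'), (5, 'cdef float ratio = 0.5')]
```

The call float(val) on the last line is not reported, since it is a Python-level cast rather than a 32-bit declaration.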

mroeschke · 2024-02-22T17:12:58Z

pandas/tests/tools/test_to_datetime.py

+    def test_unit_str(self, cache):
+        # GH 57051
+        # Test that strs aren't dropping precision to 32-bit accidentally.
+        with pytest.warns(FutureWarning):


Suggested change

-        with pytest.warns(FutureWarning):
+        with tm.assert_produces_warning(FutureWarning):

Done, and rebased.

abhijeetbodas2001 · 2024-02-22T17:16:51Z

Thanks a lot for your investigation and fix for this @QuLogic!

github-actions · 2024-03-25T00:06:11Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2024-03-25T18:21:26Z

doc/source/whatsnew/v2.2.1.rst

@@ -50,6 +50,7 @@ Fixed regressions
 - Fixed regression in :meth:`Series.pct_change` raising a ``ValueError`` for an empty :class:`Series` (:issue:`57056`)
 - Fixed regression in :meth:`Series.to_numpy` when dtype is given as float and the data contains NaNs (:issue:`57121`)
 - Fixed regression in addition or subtraction of :class:`DateOffset` objects with millisecond components to ``datetime64`` :class:`Index`, :class:`Series`, or :class:`DataFrame` (:issue:`57529`)
+- Fixed regression in precision of :func:`to_datetime` with string and ``unit`` input (:issue:`57051`)


Could you move this to v2.2.2.rst?

In Pandas 1.5.3, the `float(val)` cast was inline to the `cast_from_unit` call in `array_with_unit_to_datetime`. This caused the intermediate (unnamed) value to be a Python float.

Since pandas-dev#50301, a temporary variable was added to avoid multiple casts, but with explicit type `cdef float`, which defines a _Cython_ float. This type is 32-bit, and causes a loss of precision, and a regression in parsing from 1.5.3.

So widen the explicit type of the temporary `fval` variable to (64-bit) `double`, which will not lose precision.

Fixes pandas-dev#57051

mroeschke · 2024-03-27T16:52:01Z

Thanks @QuLogic

QuLogic requested a review from WillAyd as a code owner February 21, 2024 12:08

QuLogic force-pushed the datetime-str-precision branch from f56b22f to 6829727 Compare February 22, 2024 00:22

WillAyd approved these changes Feb 22, 2024

mroeschke reviewed Feb 22, 2024

mroeschke added the Datetime (Datetime data dtype) label Feb 22, 2024

mroeschke added this to the 2.2.1 milestone Feb 22, 2024

QuLogic force-pushed the datetime-str-precision branch from 6829727 to 0d36edf Compare February 22, 2024 21:32

lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024

github-actions bot added the Stale label Mar 25, 2024

mroeschke reviewed Mar 25, 2024

mroeschke reviewed Mar 25, 2024

QuLogic force-pushed the datetime-str-precision branch from 3b072e7 to 84572a1 Compare March 27, 2024 06:52

mroeschke approved these changes Mar 27, 2024

mroeschke merged commit a5c003d into pandas-dev:main Mar 27, 2024
46 checks passed

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 27, 2024

Backport PR pandas-dev#57548: Fix accidental loss-of-precision for to_datetime(str, unit=...) (9d350db)

meeseeksmachine mentioned this pull request Mar 27, 2024

Backport PR #57548 on branch 2.2.x (Fix accidental loss-of-precision for to_datetime(str, unit=...)) #58034

Merged

QuLogic deleted the datetime-str-precision branch March 28, 2024 09:28
