Skip to content

Fix accidental loss-of-precision for to_datetime(str, unit=...) #57548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 27, 2024

Conversation

QuLogic
Copy link
Contributor

@QuLogic QuLogic commented Feb 21, 2024

In Pandas 1.5.3, the float(val) cast was inline to the cast_from_unit call in array_with_unit_to_datetime. This caused the intermediate (unnamed) value to be a Python float.

Since #50301, a temporary variable was added to avoid multiple casts, but with explicit type cdef float, which defines a Cython float. This type is 32-bit, and causes a loss of precision, and a regression in parsing from 1.5.3.

Since cast_from_unit takes an object, not a more specific Cython type, remove the explicit type from the temporary fval variable entirely. This will cause it to be a (64-bit) Python float, and thus not lose precision.

Fixes #57051

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • [n/a] Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@QuLogic QuLogic requested a review from WillAyd as a code owner February 21, 2024 12:08
@WillAyd
Copy link
Member

WillAyd commented Feb 21, 2024

Can this be declared as double instead? Using a PyObject is unnecessary overhead

@QuLogic
Copy link
Contributor Author

QuLogic commented Feb 21, 2024

As noted, cast_from_unit takes an object, so that doesn't seem to save anything. Here's the diff of the generated code when you add the type (vs having it commented out since that reduces line number changes in the diff):

--- untyped/pandas/_libs/tslib.pyx.c	2024-02-21 18:27:51.686387390 -0500
+++ typed/pandas/_libs/tslib.pyx.c	2024-02-21 18:24:14.824227898 -0500
@@ -25828,9 +25828,9 @@
   int __pyx_v_is_raise;
   PyArrayObject *__pyx_v_iresult = 0;
   PyDateTime_TZInfo *__pyx_v_tz = 0;
+  double __pyx_v_fval;
   PyObject *__pyx_v_result = NULL;
   PyObject *__pyx_v_val = NULL;
-  double __pyx_v_fval;
   PyObject *__pyx_v_err = NULL;
   __Pyx_LocalBuf_ND __pyx_pybuffernd_iresult;
   __Pyx_Buffer __pyx_pybuffer_iresult;
@@ -25919,14 +25919,14 @@
  *         bint is_raise = errors == "raise"
  *         ndarray[int64_t] iresult
  *         tzinfo tz = None             # <<<<<<<<<<<<<<
- *         # double fval
+ *         double fval
  * 
  */
   __Pyx_INCREF(Py_None);
   __pyx_v_tz = ((PyDateTime_TZInfo *)Py_None);
 
   /* "pandas/_libs/tslib.pyx":280
- *         # double fval
+ *         double fval
  * 
  *     assert is_coerce or is_raise             # <<<<<<<<<<<<<<
  * 
@@ -32771,7 +32771,7 @@
  *     ndarray[object] values,
  *     str unit,
  */
-  __pyx_tuple__37 = PyTuple_Pack(13, __pyx_n_s_values, __pyx_n_s_unit, __pyx_n_s_errors, __pyx_n_s_i, __pyx_n_s_n, __pyx_n_s_is_coerce, __pyx_n_s_is_raise, __pyx_n_s_iresult, __pyx_n_s_tz, __pyx_n_s_result, __pyx_n_s_val, __pyx_n_s_fval, __pyx_n_s_err); if (unlikely(!__pyx_tuple__37)) __PYX_ERR(0, 239, __pyx_L1_error)
+  __pyx_tuple__37 = PyTuple_Pack(13, __pyx_n_s_values, __pyx_n_s_unit, __pyx_n_s_errors, __pyx_n_s_i, __pyx_n_s_n, __pyx_n_s_is_coerce, __pyx_n_s_is_raise, __pyx_n_s_iresult, __pyx_n_s_tz, __pyx_n_s_fval, __pyx_n_s_result, __pyx_n_s_val, __pyx_n_s_err); if (unlikely(!__pyx_tuple__37)) __PYX_ERR(0, 239, __pyx_L1_error)
   __Pyx_GOTREF(__pyx_tuple__37);
   __Pyx_GIVEREF(__pyx_tuple__37);
   __pyx_codeobj__38 = (PyObject*)__Pyx_PyCode_New(3, 0, 0, 13, 0, CO_OPTIMIZED|CO_NEWLOCALS, __pyx_empty_bytes, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_tuple__37, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_kp_s_home_elliott_code_pandas_pandas, __pyx_n_s_array_with_unit_to_datetime, 239, __pyx_empty_bytes); if (unlikely(!__pyx_codeobj__38)) __PYX_ERR(0, 239, __pyx_L1_error)

@WillAyd
Copy link
Member

WillAyd commented Feb 22, 2024

Fair point on the existing structure of the code. But if it works should still add it - generally the more typing in Cython the better. And if this gets refactored in the future the next developer won't have to think about it

@QuLogic QuLogic force-pushed the datetime-str-precision branch from f56b22f to 6829727 Compare February 22, 2024 00:22
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@WillAyd
Copy link
Member

WillAyd commented Feb 22, 2024

If you see us using float anywhere else probably worth changing to double everywhere in Cython (in a follow up PR)

def test_unit_str(self, cache):
# GH 57051
# Test that strs aren't dropping precision to 32-bit accidentally.
with pytest.warns(FutureWarning):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with pytest.warns(FutureWarning):
with tm.assert_produces_warning(FutureWarning):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and rebased.

@mroeschke mroeschke added the Datetime Datetime data dtype label Feb 22, 2024
@mroeschke mroeschke added this to the 2.2.1 milestone Feb 22, 2024
@abhijeetbodas2001
Copy link

Thanks a lot for your investigation and fix for this @QuLogic!

@QuLogic QuLogic force-pushed the datetime-str-precision branch from 6829727 to 0d36edf Compare February 22, 2024 21:32
@lithomas1 lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 25, 2024
@@ -50,6 +50,7 @@ Fixed regressions
- Fixed regression in :meth:`Series.pct_change` raising a ``ValueError`` for an empty :class:`Series` (:issue:`57056`)
- Fixed regression in :meth:`Series.to_numpy` when dtype is given as float and the data contains NaNs (:issue:`57121`)
- Fixed regression in addition or subtraction of :class:`DateOffset` objects with millisecond components to ``datetime64`` :class:`Index`, :class:`Series`, or :class:`DataFrame` (:issue:`57529`)
- Fixed regression in precision of :func:`to_datetime` with string and ``unit`` input (:issue:`57051`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this to v2.2.2.rst?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

In Pandas 1.5.3, the `float(val)` cast was inline to the
`cast_from_unit` call in `array_with_unit_to_datetime`. This caused the
intermediate (unnamed) value to be a Python float.

Since pandas-dev#50301, a temporary variable was added to avoid multiple casts,
but with explicit type `cdef float`, which defines a _Cython_ float.
This type is 32-bit, and causes a loss of precision, and a regression in
parsing from 1.5.3.

So widen the explicit type of the temporary `fval` variable to (64-bit)
`double`, which will not lose precision.

Fixes pandas-dev#57051
@QuLogic QuLogic force-pushed the datetime-str-precision branch from 3b072e7 to 84572a1 Compare March 27, 2024 06:52
@mroeschke mroeschke merged commit a5c003d into pandas-dev:main Mar 27, 2024
46 checks passed
@mroeschke
Copy link
Member

Thanks @QuLogic

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 27, 2024
mroeschke pushed a commit that referenced this pull request Mar 27, 2024
…for to_datetime(str, unit=...)) (#58034)

Backport PR #57548: Fix accidental loss-of-precision for to_datetime(str, unit=...)

Co-authored-by: Elliott Sales de Andrade <[email protected]>
@QuLogic QuLogic deleted the datetime-str-precision branch March 28, 2024 09:28
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…as-dev#57548)

In Pandas 1.5.3, the `float(val)` cast was inline to the
`cast_from_unit` call in `array_with_unit_to_datetime`. This caused the
intermediate (unnamed) value to be a Python float.

Since pandas-dev#50301, a temporary variable was added to avoid multiple casts,
but with explicit type `cdef float`, which defines a _Cython_ float.
This type is 32-bit, and causes a loss of precision, and a regression in
parsing from 1.5.3.

So widen the explicit type of the temporary `fval` variable to (64-bit)
`double`, which will not lose precision.

Fixes pandas-dev#57051
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Inconsistent to_datetime behaviour
5 participants