-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix accidental loss-of-precision for to_datetime(str, unit=...) #57548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Can this be declared as double instead? Using a PyObject is unnecessary overhead |
As noted, --- untyped/pandas/_libs/tslib.pyx.c 2024-02-21 18:27:51.686387390 -0500
+++ typed/pandas/_libs/tslib.pyx.c 2024-02-21 18:24:14.824227898 -0500
@@ -25828,9 +25828,9 @@
int __pyx_v_is_raise;
PyArrayObject *__pyx_v_iresult = 0;
PyDateTime_TZInfo *__pyx_v_tz = 0;
+ double __pyx_v_fval;
PyObject *__pyx_v_result = NULL;
PyObject *__pyx_v_val = NULL;
- double __pyx_v_fval;
PyObject *__pyx_v_err = NULL;
__Pyx_LocalBuf_ND __pyx_pybuffernd_iresult;
__Pyx_Buffer __pyx_pybuffer_iresult;
@@ -25919,14 +25919,14 @@
* bint is_raise = errors == "raise"
* ndarray[int64_t] iresult
* tzinfo tz = None # <<<<<<<<<<<<<<
- * # double fval
+ * double fval
*
*/
__Pyx_INCREF(Py_None);
__pyx_v_tz = ((PyDateTime_TZInfo *)Py_None);
/* "pandas/_libs/tslib.pyx":280
- * # double fval
+ * double fval
*
* assert is_coerce or is_raise # <<<<<<<<<<<<<<
*
@@ -32771,7 +32771,7 @@
* ndarray[object] values,
* str unit,
*/
- __pyx_tuple__37 = PyTuple_Pack(13, __pyx_n_s_values, __pyx_n_s_unit, __pyx_n_s_errors, __pyx_n_s_i, __pyx_n_s_n, __pyx_n_s_is_coerce, __pyx_n_s_is_raise, __pyx_n_s_iresult, __pyx_n_s_tz, __pyx_n_s_result, __pyx_n_s_val, __pyx_n_s_fval, __pyx_n_s_err); if (unlikely(!__pyx_tuple__37)) __PYX_ERR(0, 239, __pyx_L1_error)
+ __pyx_tuple__37 = PyTuple_Pack(13, __pyx_n_s_values, __pyx_n_s_unit, __pyx_n_s_errors, __pyx_n_s_i, __pyx_n_s_n, __pyx_n_s_is_coerce, __pyx_n_s_is_raise, __pyx_n_s_iresult, __pyx_n_s_tz, __pyx_n_s_fval, __pyx_n_s_result, __pyx_n_s_val, __pyx_n_s_err); if (unlikely(!__pyx_tuple__37)) __PYX_ERR(0, 239, __pyx_L1_error)
__Pyx_GOTREF(__pyx_tuple__37);
__Pyx_GIVEREF(__pyx_tuple__37);
__pyx_codeobj__38 = (PyObject*)__Pyx_PyCode_New(3, 0, 0, 13, 0, CO_OPTIMIZED|CO_NEWLOCALS, __pyx_empty_bytes, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_tuple__37, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_kp_s_home_elliott_code_pandas_pandas, __pyx_n_s_array_with_unit_to_datetime, 239, __pyx_empty_bytes); if (unlikely(!__pyx_codeobj__38)) __PYX_ERR(0, 239, __pyx_L1_error) |
Fair point on the existing structure of the code. But if it works should still add it - generally the more typing in Cython the better. And if this gets refactored in the future the next developer won't have to think about it |
f56b22f
to
6829727
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
If you see us using float anywhere else probably worth changing to double everywhere in Cython (in a follow up PR) |
def test_unit_str(self, cache): | ||
# GH 57051 | ||
# Test that strs aren't dropping precision to 32-bit accidentally. | ||
with pytest.warns(FutureWarning): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with pytest.warns(FutureWarning): | |
with tm.assert_produces_warning(FutureWarning): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and rebased.
Thanks a lot for your investigation and fix for this @QuLogic! |
6829727
to
0d36edf
Compare
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
doc/source/whatsnew/v2.2.1.rst
Outdated
@@ -50,6 +50,7 @@ Fixed regressions | |||
- Fixed regression in :meth:`Series.pct_change` raising a ``ValueError`` for an empty :class:`Series` (:issue:`57056`) | |||
- Fixed regression in :meth:`Series.to_numpy` when dtype is given as float and the data contains NaNs (:issue:`57121`) | |||
- Fixed regression in addition or subtraction of :class:`DateOffset` objects with millisecond components to ``datetime64`` :class:`Index`, :class:`Series`, or :class:`DataFrame` (:issue:`57529`) | |||
- Fixed regression in precision of :func:`to_datetime` with string and ``unit`` input (:issue:`57051`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this to v2.2.2.rst
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
In Pandas 1.5.3, the `float(val)` cast was inline to the `cast_from_unit` call in `array_with_unit_to_datetime`. This caused the intermediate (unnamed) value to be a Python float. Since pandas-dev#50301, a temporary variable was added to avoid multiple casts, but with explicit type `cdef float`, which defines a _Cython_ float. This type is 32-bit, and causes a loss of precision, and a regression in parsing from 1.5.3. So widen the explicit type of the temporary `fval` variable to (64-bit) `double`, which will not lose precision. Fixes pandas-dev#57051
3b072e7
to
84572a1
Compare
Thanks @QuLogic |
…_datetime(str, unit=...)
…for to_datetime(str, unit=...)) (#58034) Backport PR #57548: Fix accidental loss-of-precision for to_datetime(str, unit=...) Co-authored-by: Elliott Sales de Andrade <[email protected]>
…as-dev#57548) In Pandas 1.5.3, the `float(val)` cast was inline to the `cast_from_unit` call in `array_with_unit_to_datetime`. This caused the intermediate (unnamed) value to be a Python float. Since pandas-dev#50301, a temporary variable was added to avoid multiple casts, but with explicit type `cdef float`, which defines a _Cython_ float. This type is 32-bit, and causes a loss of precision, and a regression in parsing from 1.5.3. So widen the explicit type of the temporary `fval` variable to (64-bit) `double`, which will not lose precision. Fixes pandas-dev#57051
In Pandas 1.5.3, the
float(val)
cast was inline to thecast_from_unit
call inarray_with_unit_to_datetime
. This caused the intermediate (unnamed) value to be a Python float.Since #50301, a temporary variable was added to avoid multiple casts, but with explicit type
cdef float
, which defines a Cython float. This type is 32-bit, and causes a loss of precision, and a regression in parsing from 1.5.3.Since
cast_from_unit
takes anobject
, not a more specific Cython type, remove the explicit type from the temporaryfval
variable entirely. This will cause it to be a (64-bit) Python float, and thus not lose precision.Fixes #57051
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.