-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Timestamp fails when fold is passed with positional args #52221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Timestamp fails when fold is passed with positional args #52221
Conversation
Now I see what happened. The error check treats an |
As I mentioned in #52117 , we should also fix the fact that when we use positional arguments currently, the year gets put into |
@mroeschke Would you mind taking a look?
Scratch that, this way of fixing the bug is stupid. Let's try fixing the core issue instead, so we don't need two PRs where one will suffice. |
We can also attempt to deal with the overarching problem in this PR. If we make sure that |
pandas/_libs/tslibs/timestamps.pyx
Outdated
if (ts_input is not _no_input and not ( | ||
# GH#52117 note that currently we end up with the year in the ts_input | ||
# variable when using positional arguments. We should fix this. | ||
from_positional = (ts_input is not None and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the ts_input check be for _no_input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually, yes. We have changed the code since pandas 1.5. [Before] (
pandas/pandas/_libs/tslibs/timestamps.pyx
Line 1572 in 778ab82
if tzinfo is not None: |
if tzinfo is not None:
# GH#17690 tzinfo must be a datetime.tzinfo object, ensured
# by the cython annotation.
if tz is not None:
if (is_integer_object(tz)
and is_integer_object(ts_input)
and is_integer_object(freq)
):
# GH#31929 e.g. Timestamp(2019, 3, 4, 5, 6, tzinfo=foo)
# TODO(GH#45307): this will still be fragile to
# mixed-and-matched positional/keyword arguments
ts_input = datetime(
ts_input,
freq,
tz,
unit or 0,
year or 0,
month or 0,
day or 0,
fold=fold or 0,
)
nanosecond = hour
tz = tzinfo
return cls(ts_input, nanosecond=nanosecond, tz=tz)
raise ValueError('Can provide at most one of tz, tzinfo')
# User passed tzinfo instead of tz; avoid silently ignoring
tz, tzinfo = tzinfo, None
This is what it we do now instead:
if tzinfo is not None:
# GH#17690 tzinfo must be a datetime.tzinfo object, ensured
# by the cython annotation.
if tz is not None:
raise ValueError("Can provide at most one of tz, tzinfo")
# User passed tzinfo instead of tz; avoid silently ignoring
tz, tzinfo = tzinfo, None
Let me try to fix this by shifting the args one to the right instead.
pandas/_libs/tslibs/timestamps.pyx
Outdated
# are shifted to the left. Let's shift back | ||
if (isinstance(ts_input, int) and | ||
isinstance(year, int) and isinstance(month, int)): | ||
ts_input, year, month, day, hour, minute, second, microsecond = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be incorrect if a user passed Timestamp(2023, 1, 1, minute=1)
by assigning minute to second?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a good point. We get Timestamp('2023-01-01 00:00:01')
.
I hate it, but should I roll back to the bugfix version where I didn't touch the args, but instead just checked whether the first three args were filled or not and prevented the raise?
# GH#52117 note that currently we end up with the year in the ts_input
# variable when using positional arguments. We should fix this.
from_positional = (ts_input is not None and
year is not None and month is not None)
if (ts_input is not _no_input and not from_positional and not (
PyDateTime_Check(ts_input) and
getattr(ts_input, "tzinfo", None) is None)):
raise ValueError(
"Cannot pass fold with possibly unambiguous input: int, "
"float, numpy.datetime64, str, or timezone-aware "
"datetime-like. Pass naive datetime-like or build "
"Timestamp from components."
)
Another solution would be to make all the args up to second positional-only, since Python 3.8 allows that, but it would deviate from the datetime API and be a breaking change. That would need a separate PR and a deprecation cycle, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, there might be a different (non-breaking, non-horrible) solution, give me a day or so to try something out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I would suggest tackling the init argument complexity in a follow up unless it's a truly a blocker for this fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mroeschke Thanks for giving that example. Turns out, there is another bug caused by the same anomaly in the data. We currently have on main:
pd.Timestamp(2020, 12, 31, second=1)
> Timestamp('2020-12-31 00:00:00.000001')
This looks like a bug to me. In addition, we are testing for this unexpected behavior here:
@pytest.mark.parametrize("kwd", ["nanosecond", "microsecond", "second", "minute"])
def test_constructor_positional_keyword_mixed_with_tzinfo(self, kwd, request):
# TODO: if we passed microsecond with a keyword we would mess up
# xref GH#45307
if kwd != "nanosecond":
# nanosecond is keyword-only as of 2.0, others are not
mark = pytest.mark.xfail(reason="GH#45307")
request.node.add_marker(mark)
kwargs = {kwd: 4}
ts = Timestamp(2020, 12, 31, tzinfo=timezone.utc, **kwargs)
td_kwargs = {kwd + "s": 4}
td = Timedelta(**td_kwargs)
expected = Timestamp("2020-12-31", tz=timezone.utc) + td
assert ts == expected
My proposed fix makes it so that the fails here no longer happen and that:
pd.Timestamp(2020, 12, 31, second=1)
> Timestamp('2020-12-31 00:00:01')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I would suggest tackling the init argument complexity in a follow up unless it's a truly a blocker for this fix
That's one option but please review the current proposed solution. I think it's not terribly costly and makes the constructor easier to maintain and less vulnerable to bugs.
Is the long-term solution here to use |
@jbrockmendel @mroeschke I've pushed a working solution, please take a look. I ran asv benchmarks on it, and there is a performance hit:
If I did something dumb performance-wise, please tell me. I think, we can ignore In my opinion, fixing this by restoring proper |
Great, the code passed the checks on the second try, now ready for a review. |
I would be okay with going with this solution. The decorator method is clever but I think refactoring using args, kwargs is a little more clear |
Long-term I would also go with args and kwargs. The decorator, I think, is okay for now, but it is still a kludge. As much as I try to avoid relying on args and kwargs (because of their propensity toward spaghetti code and all the documentation and sanity checking that's always needed to make sure that everything is safe, clean and readable), I see no clean way to fulfill our promise of supporting all the datetime construction conventions without moving to *args and *kwargs. My suggestion is to accept this fix, and then make a separate ticket to refactor the class to use |
why can't u do this inside new itself this is quite a kludge and the perf hit is not nothing |
Hello, @jreback , happy to talk to you again. We need to detect whether an argument was passed via positional args or kwargs, or we run into the bug that @mroeschke pointed out here: import pandas as pd
ts1 = pd.Timestamp(2023, 3, 30, 10, None, None, 30)
ts2 = pd.Timestamp(2023, 3, 30, 10, second=30)
print(ts1)
print(ts2)
> 2023-03-30 10:00:00.000030
> 2023-03-30 10:00:00.000030 On I was not able to find a way to determine from within a function, whether an argument was passed positionally or via kwargs. The only ways seem to be via a decorator or by refactoring the function to take If I've missed something, please tell me. I'd love to get the right variable contents from within |
not what i mean we have named args here already |
In order to see whether an argument needs to be shifted to avoid bugs, we need to determine if it was passed positionally or by kwarg. That's because the variable state at the start of pd.Timestamp(2023, 3, 30, 10, None, None, 30)
pd.Timestamp(2023, 3, 30, 10, second=30) @jreback , what am I missing? |
def _fix_positional_arguments(cls): | ||
original_new = cls.__new__ | ||
|
||
def updated_new(cls, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a decorator seems strictly worse than just changing __new__
directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to change __new__
directly if anyone has a suggestion how to tell the difference between the two calls above:
pd.Timestamp(2023, 3, 30, 10, None, None, 30)
pd.Timestamp(2023, 3, 30, 10, second=30)
If we conclude that the only acceptable way to fix this bug is to refactor the function to take in *args, **kwargs
right now, I can close this PR, open a new issue with all the bugs currently present because of the broken variable contents, and work on refactoring.
What I am against is doing a patchwork fix inside __new__
that leaves the variable contents with the wrong arguments, because it turns out to bring a bunch of bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to fix this right now with a decorator, then I open a new issue, refactor the function and make this PR obsolete in a few weeks to a month it will take to refactor and discuss the implementation.
I'm fine with skipping this easy, costly performance-wise, short-term bugfix if you guys aren't comfortable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to jump directly to implementing the args, kwargs solution then if it's the strategy being considered for the original issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll close the PR and open the new issue tomorrow then. Please note that this causes bugs on 1.5.x
too, not just on 2.x
.
Closing as a different approach and a broader bug definition is needed. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.