Skip to content

BUG: Timestamp fails when fold is passed with positional args #52221

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

AlexKirko
Copy link
Member

@AlexKirko
Copy link
Member Author

Now I see what happened. The error check treats an int in ts_input as time since epoch, so it raises an error. In 1.5.3 we had a timezone-aware datetime by that point, but now we have an int. Let's see how we can fix it.

@AlexKirko
Copy link
Member Author

As I mentioned in #52117 , we should also fix the fact that when we use positional arguments currently, the year gets put into ts_input and the rest of arguments are shifted by 1. This is very non-intuitive and is a problem for maintaining the code.

@AlexKirko AlexKirko marked this pull request as ready for review March 26, 2023 16:22
@AlexKirko AlexKirko requested a review from mroeschke March 26, 2023 16:22
@AlexKirko
Copy link
Member Author

AlexKirko commented Mar 26, 2023

@mroeschke Would you mind taking a look?

The bugfix works by checking whether we have passed positional args or not. Currently, that means that the year ends up in ts_input. My suggestion would be to merge this safe fix and then figure out the best way to make sure that the mess with variable names and their content gets fixed.

Scratch that, this way of fixing the bug is stupid. Let's try fixing the core issue instead, so we don't need two PRs where one will suffice.

@AlexKirko
Copy link
Member Author

We can also attempt to deal with the overarching problem in this PR. If we make sure that ts_input is set to _no_input for positional args, that would also be a fix. I'd rather do it in a separate PR though, because I'm not sure yet, what the reason for the changes was.

if (ts_input is not _no_input and not (
# GH#52117 note that currently we end up with the year in the ts_input
# variable when using positional arguments. We should fix this.
from_positional = (ts_input is not None and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the ts_input check be for _no_input?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, yes. We have changed the code since pandas 1.5. [Before] (

if tzinfo is not None:
), we were building a datetime where we now only raise:

        if tzinfo is not None:
            # GH#17690 tzinfo must be a datetime.tzinfo object, ensured
            #  by the cython annotation.
            if tz is not None:
                if (is_integer_object(tz)
                    and is_integer_object(ts_input)
                    and is_integer_object(freq)
                ):
                    # GH#31929 e.g. Timestamp(2019, 3, 4, 5, 6, tzinfo=foo)
                    # TODO(GH#45307): this will still be fragile to
                    #  mixed-and-matched positional/keyword arguments
                    ts_input = datetime(
                        ts_input,
                        freq,
                        tz,
                        unit or 0,
                        year or 0,
                        month or 0,
                        day or 0,
                        fold=fold or 0,
                    )
                    nanosecond = hour
                    tz = tzinfo
                    return cls(ts_input, nanosecond=nanosecond, tz=tz)

                raise ValueError('Can provide at most one of tz, tzinfo')

            # User passed tzinfo instead of tz; avoid silently ignoring
            tz, tzinfo = tzinfo, None

This is what it we do now instead:

        if tzinfo is not None:
            # GH#17690 tzinfo must be a datetime.tzinfo object, ensured
            #  by the cython annotation.
            if tz is not None:
                raise ValueError("Can provide at most one of tz, tzinfo")

            # User passed tzinfo instead of tz; avoid silently ignoring
            tz, tzinfo = tzinfo, None

Let me try to fix this by shifting the args one to the right instead.

# are shifted to the left. Let's shift back
if (isinstance(ts_input, int) and
isinstance(year, int) and isinstance(month, int)):
ts_input, year, month, day, hour, minute, second, microsecond = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be incorrect if a user passed Timestamp(2023, 1, 1, minute=1) by assigning minute to second?

Copy link
Member Author

@AlexKirko AlexKirko Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is a good point. We get Timestamp('2023-01-01 00:00:01').

I hate it, but should I roll back to the bugfix version where I didn't touch the args, but instead just checked whether the first three args were filled or not and prevented the raise?

# GH#52117 note that currently we end up with the year in the ts_input
            # variable when using positional arguments. We should fix this.
            from_positional = (ts_input is not None and
                               year is not None and month is not None)

            if (ts_input is not _no_input and not from_positional and not (
                    PyDateTime_Check(ts_input) and
                    getattr(ts_input, "tzinfo", None) is None)):
                raise ValueError(
                    "Cannot pass fold with possibly unambiguous input: int, "
                    "float, numpy.datetime64, str, or timezone-aware "
                    "datetime-like. Pass naive datetime-like or build "
                    "Timestamp from components."
                )

Another solution would be to make all the args up to second positional-only, since Python 3.8 allows that, but it would deviate from the datetime API and be a breaking change. That would need a separate PR and a deprecation cycle, I think.

Copy link
Member Author

@AlexKirko AlexKirko Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, there might be a different (non-breaking, non-horrible) solution, give me a day or so to try something out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would suggest tackling the init argument complexity in a follow up unless it's a truly a blocker for this fix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke Thanks for giving that example. Turns out, there is another bug caused by the same anomaly in the data. We currently have on main:

pd.Timestamp(2020, 12, 31, second=1)

> Timestamp('2020-12-31 00:00:00.000001')

This looks like a bug to me. In addition, we are testing for this unexpected behavior here:

    @pytest.mark.parametrize("kwd", ["nanosecond", "microsecond", "second", "minute"])
    def test_constructor_positional_keyword_mixed_with_tzinfo(self, kwd, request):
        # TODO: if we passed microsecond with a keyword we would mess up
        #  xref GH#45307
        if kwd != "nanosecond":
            # nanosecond is keyword-only as of 2.0, others are not
            mark = pytest.mark.xfail(reason="GH#45307")
            request.node.add_marker(mark)

        kwargs = {kwd: 4}
        ts = Timestamp(2020, 12, 31, tzinfo=timezone.utc, **kwargs)

        td_kwargs = {kwd + "s": 4}
        td = Timedelta(**td_kwargs)
        expected = Timestamp("2020-12-31", tz=timezone.utc) + td
        assert ts == expected

My proposed fix makes it so that the fails here no longer happen and that:

pd.Timestamp(2020, 12, 31, second=1)

> Timestamp('2020-12-31 00:00:01')

Copy link
Member Author

@AlexKirko AlexKirko Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would suggest tackling the init argument complexity in a follow up unless it's a truly a blocker for this fix

That's one option but please review the current proposed solution. I think it's not terribly costly and makes the constructor easier to maintain and less vulnerable to bugs.

@mroeschke mroeschke added Timezones Timezone data dtype Timestamp pd.Timestamp and associated methods labels Mar 27, 2023
@jbrockmendel
Copy link
Member

Is the long-term solution here to use *args, **kwargs so we can disambiguate what was passed positionally vs by keyword?

@AlexKirko
Copy link
Member Author

AlexKirko commented Mar 28, 2023

@jbrockmendel @mroeschke
As a medium-term solution to fix both #52117 and the bug I discovered and outlined here, I propose we wrap the Timestamp class with a decorator to fix the args contents before calling __new__.

I've pushed a working solution, please take a look. I ran asv benchmarks on it, and there is a performance hit:

     <main>           <gh-51117-positional-fold-bug>
+         264±5ns          379±2ns     1.43  tslibs.timestamp.TimestampConstruction.time_from_pd_timestamp
+     1.07±0.02μs      1.26±0.01μs     1.17  tslibs.timestamp.TimestampConstruction.time_fromordinal
+        836±10ns         952±10ns     1.14  tslibs.timestamp.TimestampConstruction.time_from_datetime_aware
+     1.05±0.01μs      1.17±0.01μs     1.12  tslibs.timestamp.TimestampConstruction.time_parse_iso8601_no_tz
+         886±4ns          989±4ns     1.12  tslibs.timestamp.TimestampConstruction.time_from_npdatetime64

If I did something dumb performance-wise, please tell me. I think, we can ignore time_from_pd_timestamp, because that's a super-fast shortcut, so the overhead there is very noticeable for asv but will be negligible for users. For more realistic cases, the decorator looks to slow down construction by 10-20%.

In my opinion, fixing this by restoring proper args is worth the performance hit, but that's mostly me not liking that we have values not matching argument names inside a core class and making the code vulnerable to bugs.

@AlexKirko
Copy link
Member Author

Great, the code passed the checks on the second try, now ready for a review.

@mroeschke
Copy link
Member

Is the long-term solution here to use *args, **kwargs so we can disambiguate what was passed positionally vs by keyword?

I would be okay with going with this solution. The decorator method is clever but I think refactoring using args, kwargs is a little more clear

@AlexKirko
Copy link
Member Author

AlexKirko commented Mar 30, 2023

I would be okay with going with this solution. The decorator method is clever but I think refactoring using args, kwargs is a little more clear

Long-term I would also go with args and kwargs. The decorator, I think, is okay for now, but it is still a kludge.

As much as I try to avoid relying on args and kwargs (because of their propensity toward spaghetti code and all the documentation and sanity checking that's always needed to make sure that everything is safe, clean and readable), I see no clean way to fulfill our promise of supporting all the datetime construction conventions without moving to *args and *kwargs.

My suggestion is to accept this fix, and then make a separate ticket to refactor the class to use *args, **kwargs (I'll be able to take the ticket, but I anticipate it taking a bit to implement and document), but I'll go with what you two think.

@jreback
Copy link
Contributor

jreback commented Mar 30, 2023

why can't u do this inside new itself

this is quite a kludge and the perf hit is not nothing

@AlexKirko
Copy link
Member Author

why can't u do this inside new itself

this is quite a kludge and the perf hit is not nothing

Hello, @jreback , happy to talk to you again.

We need to detect whether an argument was passed via positional args or kwargs, or we run into the bug that @mroeschke pointed out here:
Timestamp(2023, 1, 1, minute=1)
Do we shift here or not? A better illustrative example on main might be this:

import pandas as pd

ts1 = pd.Timestamp(2023, 3, 30, 10, None, None, 30)
ts2 = pd.Timestamp(2023, 3, 30, 10, second=30)

print(ts1)
print(ts2)
> 2023-03-30 10:00:00.000030
> 2023-03-30 10:00:00.000030

On main these give the same answer, because the variable content ends up the same.

I was not able to find a way to determine from within a function, whether an argument was passed positionally or via kwargs. The only ways seem to be via a decorator or by refactoring the function to take *args, **kwargs and then add the logic to parse them into variables.

If I've missed something, please tell me. I'd love to get the right variable contents from within __new__ without kludges.

@jreback
Copy link
Contributor

jreback commented Mar 30, 2023

not what i mean

we have named args here already
all you would need to see if they are shifted right?

@AlexKirko
Copy link
Member Author

AlexKirko commented Mar 31, 2023

In order to see whether an argument needs to be shifted to avoid bugs, we need to determine if it was passed positionally or by kwarg. That's because the variable state at the start of __new__ is the same for the following two cases, but in the first one microsecond ended up in second and needs to be shifted, and in the second case second is where it belongs, because it was passed through kwargs:

pd.Timestamp(2023, 3, 30, 10, None, None, 30)
pd.Timestamp(2023, 3, 30, 10, second=30)

@jreback , what am I missing?

def _fix_positional_arguments(cls):
original_new = cls.__new__

def updated_new(cls, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a decorator seems strictly worse than just changing __new__ directly

Copy link
Member Author

@AlexKirko AlexKirko Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to change __new__ directly if anyone has a suggestion how to tell the difference between the two calls above:

pd.Timestamp(2023, 3, 30, 10, None, None, 30)
pd.Timestamp(2023, 3, 30, 10, second=30)

If we conclude that the only acceptable way to fix this bug is to refactor the function to take in *args, **kwargs right now, I can close this PR, open a new issue with all the bugs currently present because of the broken variable contents, and work on refactoring.

What I am against is doing a patchwork fix inside __new__ that leaves the variable contents with the wrong arguments, because it turns out to bring a bunch of bugs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to fix this right now with a decorator, then I open a new issue, refactor the function and make this PR obsolete in a few weeks to a month it will take to refactor and discuss the implementation.

I'm fine with skipping this easy, costly performance-wise, short-term bugfix if you guys aren't comfortable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to jump directly to implementing the args, kwargs solution then if it's the strategy being considered for the original issue

Copy link
Member Author

@AlexKirko AlexKirko Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll close the PR and open the new issue tomorrow then. Please note that this causes bugs on 1.5.x too, not just on 2.x.

@AlexKirko
Copy link
Member Author

Closing as a different approach and a broader bug definition is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Timestamp pd.Timestamp and associated methods Timezones Timezone data dtype
Projects
None yet
4 participants