Skip to content

BUG: Timestamp fails when fold is passed and building from components as positional args (Pandas 2.0 regression) #52117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 of 3 tasks
lafrech opened this issue Mar 22, 2023 · 10 comments
Assignees
Labels
Bug Timestamp pd.Timestamp and associated methods

Comments

@lafrech
Copy link

lafrech commented Mar 22, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime as dt
import pandas as pd

pd.Timestamp(2020, 1, 1, 0, 0, tzinfo=dt.timezone.utc, fold=0)

Issue Description

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/_libs/tslibs/timestamps.pyx", line 1584, in pandas._libs.tslibs.timestamps.Timestamp.__new__
ValueError: Cannot pass fold with possibly unambiguous input: int, float, numpy.datetime64, str, or timezone-aware datetime-like. Pass naive datetime-like or build Timestamp from components.

Code works with Pandas 1.5. Possible Pandas 2.0 regression?

Search yielded #44378 except I do build from components.

When passing components as kwargs, it works:

pd.Timestamp(year=2020, month=1, day=1, hour=0, minute=0, tzinfo=dt.timezone.utc, fold=0)
Timestamp('2020-01-01 00:00:00+0000', tz='UTC')

I figured maybe I'm not using the constructor correctly but I couldn't find a mention in the docs preventing this usage.

The only test I found regarding this is this one:

@pytest.mark.parametrize("tz", ["dateutil/Europe/London", None])
@pytest.mark.parametrize("fold", [0, 1])
def test_timestamp_constructor_retain_fold(tz, fold):
# Test for GH#25057
# Check that we retain fold
ts = Timestamp(year=2019, month=10, day=27, hour=1, minute=30, tz=tz, fold=fold)
result = ts.fold
expected = fold
assert result == expected

It is this test that led me to the "pass all as kwargs" solution.

Is this a bug or a known limitation?

Expected Behavior

Timestamp('2020-01-01 00:00:00+0000', tz='UTC')

Installed Versions

INSTALLED VERSIONS

commit : c2a7f1a
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-21-amd64
Version : #1 SMP Debian 5.10.162-1 (2023-01-21)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 2.0.0rc1
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.3.0
pip : 23.0.1
Cython : None
pytest : 7.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

@lafrech lafrech added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 22, 2023
@mroeschke
Copy link
Member

From the original implementation in 6e04264, I think the error message could have been clarified as components means from keyword only arguments

@lafrech
Copy link
Author

lafrech commented Mar 23, 2023

Perhaps. This could be a won'tfix, then. Still it is a bit surprising.

# 1/ This works, following calls are equivalent AFAIU
Timestamp(2019, 10, 27, 1, 30, tz=tz)
Timestamp(year=2019, month=10, day=27, hour=1, minute=30, tz=tz)

# 2/ This works
Timestamp(year=2019, month=10, day=27, hour=1, minute=30, tz=tz, fold=fold)

# 3/ This does not work
Timestamp(2019, 10, 27, 1, 30, tz=tz, fold=fold)

Why? Pandas understands the positional args, so how come 3/ is not equivalent to 2/? I may lack time/skills to dig into the code and understand why, but it is not intuitive to me.

Besides, it worked with Pandas 1.5.

In any case, the error message is misleading. And probably the docs too.

@mroeschke
Copy link
Member

@AlexKirko do you happen to remember when adding fold, were other Timestamp components supposed to be added as keyword only arguments when specifying fold?

I'm trying to determine if originally in #31563 if 3/ was never supposed to work even in 1.5

@AlexKirko
Copy link
Member

Hello, this is what I can remember or could dig up:

In the examples that we documented then, we instruct the user to build from a naive datetime or keywords:

   pd.Timestamp(datetime.datetime(2019, 10, 27, 1, 30, 0, 0),
                tz='dateutil/Europe/London', fold=0)
   pd.Timestamp(year=2019, month=10, day=27, hour=1, minute=30,
                tz='dateutil/Europe/London', fold=1)

We never say that the user can't build from positional arguments.

Now, the code responsible for the check is here:

        if fold is not None:
            if fold not in [0, 1]:
                raise ValueError(
                    "Valid values for the fold argument are None, 0, or 1."
                )

            if (ts_input is not _no_input and not (
                    PyDateTime_Check(ts_input) and
                    getattr(ts_input, "tzinfo", None) is None)):
                raise ValueError(
                    "Cannot pass fold with possibly unambiguous input: int, "
                    "float, numpy.datetime64, str, or timezone-aware "
                    "datetime-like. Pass naive datetime-like or build "
                    "Timestamp from components."
                )

From what I remember and see in the code, we intended to raise an exception here only when we are trying to build from a non-ambiguous datetime-like input: one that passes PyDateTime_Check and has tzinfo (or does not pass PyDateTime_Check, I suppose).

I would say, the behavior in 2.0 is a bug. If we support the positional argument convention to mimic datetime, then passing the equivalent of a naive datetime through positional arguments should behave the same as a naive datetime.

My guess would be that something in 2.0 makes us pass a ts_input that is not _no_input in @lafrech 's example? Could try to chase this down during the weekend. @mroeschke , what do you think?

@lafrech
Copy link
Author

lafrech commented Mar 23, 2023

If we support the positional argument convention to mimic datetime, then passing the equivalent of a naive datetime through positional arguments should behave the same as a naive datetime.

Spot on. That would be consistent, indeed.

Could try to chase this down during the weekend.

Thanks @AlexKirko.

@AlexKirko AlexKirko added Timestamp pd.Timestamp and associated methods and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 23, 2023
@AlexKirko AlexKirko self-assigned this Mar 25, 2023
@AlexKirko
Copy link
Member

AlexKirko commented Mar 26, 2023

Hm, on 1.5.3 fold is actually None before this check (on main it's 0):

        # Allow fold only for unambiguous input
        if fold is not None:
            if fold not in [0, 1]:
                raise ValueError(
                    "Valid values for the fold argument are None, 0, or 1."
                )

Also, on 1.5.3 ts_input is a datetime.datetime object, and on main it's an int (just has the year in it). We then try to get a timezone from an int, and this is where we error out.

I'll go through the constructor code and see what changed.

@AlexKirko
Copy link
Member

AlexKirko commented Mar 26, 2023

Somehow we used to go into this if in this situation:

        if tzinfo is not None:
            # GH#17690 tzinfo must be a datetime.tzinfo object, ensured
            #  by the cython annotation.
            if tz is not None:
                if (is_integer_object(tz)
                    and is_integer_object(ts_input)
                    and is_integer_object(freq)
                ):
                    # GH#31929 e.g. Timestamp(2019, 3, 4, 5, 6, tzinfo=foo)
                    # TODO(GH#45307): this will still be fragile to
                    #  mixed-and-matched positional/keyword arguments
                    ts_input = datetime(
                        ts_input,
                        freq,
                        tz,
                        unit or 0,
                        year or 0,
                        month or 0,
                        day or 0,
                        fold=fold or 0,
                    )
                    nanosecond = hour
                    tz = tzinfo
                    print("auxilary constructor")
                    print(ts_input)
                    return cls(ts_input, nanosecond=nanosecond, tz=tz)

                raise ValueError('Can provide at most one of tz, tzinfo')

Now we no longer do, which looks correct, but where before we would pass a timezone-aware datetime without the fold argument, we now keep going and pass positional arguments.

@AlexKirko
Copy link
Member

AlexKirko commented Mar 26, 2023

Looks like the culprit is the predicate. We currently have:

            if (ts_input is not _no_input and not (
                    PyDateTime_Check(ts_input) and
                    getattr(ts_input, "tzinfo", None) is None)):

Which is equivalent to:

            if (ts_input is not _no_input and (
                    not PyDateTime_Check(ts_input) or
                    getattr(ts_input, "tzinfo", None) is not None)):

I think we should have:

            if (ts_input is not _no_input and
                    PyDateTime_Check(ts_input) and
                    getattr(ts_input, "tzinfo", None) is not None):

This fixes the OP's problem, let me see is this passes the test suite.

@AlexKirko
Copy link
Member

Of course not. Anyway, I'll find a predicate variant that does what we want and passes the tests.

@AlexKirko
Copy link
Member

Also, this looks like a kludge to me:

        elif is_integer_object(year):
            # User passed positional arguments:
            # Timestamp(year, month, day[, hour[, minute[, second[,
            # microsecond[, tzinfo]]]]])
            ts_input = datetime(ts_input, year, month, day or 0,
                                hour or 0, minute or 0, second or 0, fold=fold or 0)
            unit = None

In fact, ts_input is the year (which is what the datetime constructor expects).

print(ts_input)
print(_date_attributes)
> 2020
> [1, 1, 0, 0, None, None, None, None]

This should probably be carefully fixed in a separate PR, if it's not just me finding weird that month is assigned to the variable named year.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timestamp pd.Timestamp and associated methods
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants