API: DatetimeIndex union vs join different behavior with mismatched tzs #39328

Closed
jbrockmendel opened this issue Jan 21, 2021 · 3 comments · Fixed by #41458
Labels
API - Consistency (Internal Consistency of API/Behavior), Datetime (Datetime data dtype), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Timezones (Timezone data dtype)

@jbrockmendel
Member

For union, when we have mismatched tzs or mismatched tz-awareness, we cast to object.

For join, when we have mismatched tz-awareness we raise TypeError; with mismatched tzs we cast to UTC.

It would be convenient if these had the same behavior.
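
A minimal sketch for checking the mismatched-tz case described above on whatever pandas version is installed. The variable names are illustrative only. The dtypes it prints depend on the version: per this issue, union gave object and join gave UTC when it was filed, and versions that include the linked fix may report the same dtype for both.

```python
import pandas as pd

# Two tz-aware indexes whose timezones differ (tz-awareness itself matches).
utc = pd.date_range("2021-01-01", periods=3, freq="D", tz="UTC")
eastern = pd.date_range("2021-01-01", periods=3, freq="D", tz="US/Eastern")

# Set operation and join on the same pair of indexes.
union_result = utc.union(eastern)
join_result = utc.join(eastern, how="outer")

# No expected output is given here because it is version dependent.
print("union dtype:", union_result.dtype)
print("join dtype: ", join_result.dtype)
```

The six timestamps are distinct instants, so both results should have six entries whichever dtype is chosen; only the dtype is in question.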

@jbrockmendel added the Bug and Needs Triage labels on Jan 21, 2021
@jorisvandenbossche
Member

This is somewhat related to #37605. If we decide there (for setitem) that matching tz-awareness (but not tz equality) is enough, then I would also expect both union and join to preserve a tz-aware datetime dtype (which means casting to UTC, I suppose).

@jbrockmendel
Member Author

One way to implement this would be to make find_common_type return UTC for mismatched dt64tz dtypes. Doing this would change the setop behavior in a pretty reasonable way. The sticking point is a couple of groupby tests:

    def test_transform_lambda_with_datetimetz():
        # GH 27496
        df = DataFrame(
            {
                "time": [
                    Timestamp("2010-07-15 03:14:45"),
                    Timestamp("2010-11-19 18:47:06"),
                ],
                "timezone": ["Etc/GMT+4", "US/Eastern"],
            }
        )
        result = df.groupby(["timezone"])["time"].transform(
            lambda x: x.dt.tz_localize(x.name)
        )
        expected = Series(
            [
                Timestamp("2010-07-15 03:14:45", tz="Etc/GMT+4"),
                Timestamp("2010-11-19 18:47:06", tz="US/Eastern"),
            ],
            name="time",
        )
>       tm.assert_series_equal(result, expected)
E       AssertionError: Attributes of Series are different
E       
E       Attribute "dtype" are different
E       [left]:  datetime64[ns, UTC]
E       [right]: object

    def test_groupby_multi_timezone(self):
    
        # combining multiple / different timezones yields UTC
    
        dates = [
            "2000-01-28 16:47:00",
            "2000-01-29 16:48:00",
            "2000-01-30 16:49:00",
            "2000-01-31 16:50:00",
            "2000-01-01 16:50:00",
        ]
        tzs = [
            "America/Chicago",
            "America/Chicago",
            "America/Los_Angeles",
            "America/Chicago",
            "America/New_York",
        ]
        df = DataFrame({"value": range(5), "date": dates, "tz": tzs})
    
        result = df.groupby("tz").date.apply(
            lambda x: pd.to_datetime(x).dt.tz_localize(x.name)
        )
    
        expected = Series(
            [
                Timestamp("2000-01-28 16:47:00-0600", tz="America/Chicago"),
                Timestamp("2000-01-29 16:48:00-0600", tz="America/Chicago"),
                Timestamp("2000-01-30 16:49:00-0800", tz="America/Los_Angeles"),
                Timestamp("2000-01-31 16:50:00-0600", tz="America/Chicago"),
                Timestamp("2000-01-01 16:50:00-0500", tz="America/New_York"),
            ],
            name="date",
            dtype=object,
        )
>       tm.assert_series_equal(result, expected)
E       AssertionError: Attributes of Series are different
E       
E       Attribute "dtype" are different
E       [left]:  datetime64[ns, UTC]
E       [right]: object

In both of these examples, it's pretty reasonable to think the user would want the object-dtype result.
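
The find_common_type idea from this comment can be sketched with public pandas API only. The helper `common_dt64tz_dtype` below is hypothetical: it is not pandas' `find_common_type` and not necessarily how the eventual fix works. It only illustrates the proposed rule, assuming that anything other than an all-tz-aware input falls back to object.

```python
import numpy as np
import pandas as pd


def common_dt64tz_dtype(dtypes):
    """Hypothetical helper illustrating the proposed rule (not pandas internals)."""
    dtypes = list(dtypes)
    # Assumption: mixed tz-awareness or non-datetime input falls back to object.
    if not dtypes or not all(isinstance(dt, pd.DatetimeTZDtype) for dt in dtypes):
        return np.dtype(object)
    first = dtypes[0]
    # Identical tz-aware dtypes are returned unchanged.
    if all(dt == first for dt in dtypes[1:]):
        return first
    # Mismatched tzs: the proposal is to return a UTC dtype instead of object.
    # Assumption: reuse the first dtype's unit.
    return pd.DatetimeTZDtype(unit=first.unit, tz="UTC")


eastern = pd.DatetimeTZDtype(tz="US/Eastern")
gmt_plus_4 = pd.DatetimeTZDtype(tz="Etc/GMT+4")
print(common_dt64tz_dtype([eastern, gmt_plus_4]))
```

Under this rule the two groupby results above would come back as a UTC dtype rather than object, which is the change the failing assertions show.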

@simonjayhawkins added the Reshaping, Datetime, Timezones, and API - Consistency labels and removed the Needs Triage and Bug labels on Jan 26, 2021
@jorisvandenbossche
Member

I think for those two cases you bring up, it makes sense to have UTC tz-aware as output (AFAIU, that's also what the original reporter wanted in #27496).
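
To make the trade-off concrete, here is a small sketch using the timestamps from the first test above. It starts from an object-dtype Series with one timezone per row and converts it with `pd.to_datetime(..., utc=True)`; the variable names are illustrative. The UTC result keeps the same instants but no longer records each row's original timezone.

```python
import pandas as pd

# Object-dtype Series where each row keeps its own timezone.
mixed = pd.Series(
    [
        pd.Timestamp("2010-07-15 03:14:45", tz="Etc/GMT+4"),
        pd.Timestamp("2010-11-19 18:47:06", tz="US/Eastern"),
    ],
    dtype=object,
    name="time",
)

# Convert to a single tz-aware dtype in UTC.
as_utc = pd.to_datetime(mixed, utc=True)

print(mixed.dtype)   # object
print(as_utc.dt.tz)  # UTC

# Same instants: tz-aware timestamps compare by absolute time.
print(as_utc.iloc[0] == mixed.iloc[0])  # True
```

Going the other way, from the UTC result back to per-row timezones, would need the timezone names stored separately, as in the "timezone" column of the test.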
