groupby/apply behaves different if a column is timezone-aware or not #27212


Closed
jordansamuels opened this issue Jul 3, 2019 · 5 comments · Fixed by #35225 or #34998
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@jordansamuels

Code Sample, a copy-pastable example if possible

import pandas as pd

dates = ['2001-01-01'] * 2 + ['2001-01-02'] * 2 + ['2001-01-03'] * 2
index_no_tz = pd.DatetimeIndex(dates)
index_tz = pd.DatetimeIndex(dates, tz='UTC')
df1 = pd.DataFrame({'x': list(range(2))*3, 'y': range(6), 't': index_no_tz})
df2 = pd.DataFrame({'x': list(range(2))*3, 'y': range(6), 't': index_tz})

result1 = df1.groupby('x', group_keys=False).apply(lambda df: df[['x','y']].copy())
result2 = df2.groupby('x', group_keys=False).apply(lambda df: df[['x','y']].copy())

try:
    pd.testing.assert_frame_equal(result1, result2)
except AssertionError:
    print("does not work")

Problem description

It appears that groupby/apply sorts the result differently for the two dataframes in the example; the only difference between them is the timezone-awareness of the t column.

(Note: I posted this in the pydata/pandas gitter, and @jorisvandenbossche suggested I open an issue here for it.)

Expected Output

We would expect the assertion to pass, i.e. the does not work message would not appear.
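As a workaround until this is fixed, the row order can be normalized explicitly: sorting both results by index makes the comparison independent of whatever order apply produced. A minimal sketch (it selects only the non-grouping column y inside the lambda, so it also runs on newer pandas versions that exclude grouping columns from apply):

```python
import pandas as pd

dates = ['2001-01-01'] * 2 + ['2001-01-02'] * 2 + ['2001-01-03'] * 2
df1 = pd.DataFrame({'x': [0, 1] * 3, 'y': range(6),
                    't': pd.DatetimeIndex(dates)})
df2 = pd.DataFrame({'x': [0, 1] * 3, 'y': range(6),
                    't': pd.DatetimeIndex(dates, tz='UTC')})

# sort_index() restores the original row order, so the comparison no
# longer depends on the order in which apply concatenated the groups
result1 = (df1.groupby('x', group_keys=False)
              .apply(lambda d: d[['y']].copy())
              .sort_index())
result2 = (df2.groupby('x', group_keys=False)
              .apply(lambda d: d[['y']].copy())
              .sort_index())

pd.testing.assert_frame_equal(result1, result2)
```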

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.6.3
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: 1.3.0
pyarrow: 0.13.0
xarray: None
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.4
bs4: None
html5lib: None
sqlalchemy: 1.3.5
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: 0.2.1
fastparquet: 0.3.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@mroeschke
Member

Looks specific to when copy is called:

In [8]: df1.groupby('x', group_keys=False).apply(lambda df: df[['x','y']].copy())
Out[8]:
   x  y
0  0  0
2  0  2
4  0  4
1  1  1
3  1  3
5  1  5

In [9]: df2.groupby('x', group_keys=False).apply(lambda df: df[['x','y']].copy())
Out[9]:
   x  y
0  0  0
1  1  1
2  0  2
3  1  3
4  0  4
5  1  5

In [10]: df1.groupby('x', group_keys=False).apply(lambda df: df[['x','y']])
Out[10]:
   x  y
0  0  0
1  1  1
2  0  2
3  1  3
4  0  4
5  1  5

In [11]: df2.groupby('x', group_keys=False).apply(lambda df: df[['x','y']])
Out[11]:
   x  y
0  0  0
1  1  1
2  0  2
3  1  3
4  0  4
5  1  5

Might be shades of #14927

@mroeschke mroeschke added Apply Apply, Aggregate, Transform, Map Groupby labels Jul 3, 2019
@jordansamuels
Author

@mroeschke agree, seems related. There is also the observation that group_keys seems to be respected only when the t column doesn't have a timezone:

In [2]: df1.groupby('x', group_keys=True).apply(lambda df: df.copy())
Out[2]:
     x  y          t
x
0 0  0  0 2001-01-01
  2  0  2 2001-01-02
  4  0  4 2001-01-03
1 1  1  1 2001-01-01
  3  1  3 2001-01-02
  5  1  5 2001-01-03

In [3]: df2.groupby('x', group_keys=True).apply(lambda df: df.copy())
Out[3]:
   x  y                         t
0  0  0 2001-01-01 00:00:00+00:00
1  1  1 2001-01-01 00:00:00+00:00
2  0  2 2001-01-02 00:00:00+00:00
3  1  3 2001-01-02 00:00:00+00:00
4  0  4 2001-01-03 00:00:00+00:00
5  1  5 2001-01-03 00:00:00+00:00

Should I mark that as a separate issue?

I'm mentioning this assuming that calling copy is indeed the "right" way. #19175 seems to imply it is.

@mroeschke
Member

Usually there isn't a need to call copy in apply. You can leave it here in this issue.

@mroeschke
Member

This looks fixed on master; it could use a test:

In [155]: import pandas as pd
     ...:
     ...: dates = ['2001-01-01'] * 2 + ['2001-01-02'] * 2 + ['2001-01-03'] * 2
     ...: index_no_tz = pd.DatetimeIndex(dates)
     ...: index_tz = pd.DatetimeIndex(dates, tz='UTC')
     ...: df1 = pd.DataFrame({'x': list(range(2))*3, 'y': range(6), 't': index_no_tz})
     ...: df2 = pd.DataFrame({'x': list(range(2))*3, 'y': range(6), 't': index_tz})
     ...:
     ...: result1 = df1.groupby('x', group_keys=False).apply(lambda df: df[['x','y']].copy())
     ...: result2 = df2.groupby('x', group_keys=False).apply(lambda df: df[['x','y']].copy())

In [156]: result1
Out[156]:
   x  y
0  0  0
1  1  1
2  0  2
3  1  3
4  0  4
5  1  5

In [157]: result2
Out[157]:
   x  y
0  0  0
1  1  1
2  0  2
3  1  3
4  0  4
5  1  5
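The missing regression test could look roughly like this. This is only a sketch: the test name and file location would need to match the pandas test-suite conventions (which use `pandas._testing` rather than `pd.testing`), and the lambda selects only the non-grouping column so it also runs on versions that exclude grouping columns from apply:

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("tz", [None, "UTC"])
def test_groupby_apply_row_order_independent_of_tz(tz):
    # GH 27212: the row order returned by groupby/apply should not
    # depend on whether an unrelated datetime column is tz-aware
    dates = ['2001-01-01'] * 2 + ['2001-01-02'] * 2 + ['2001-01-03'] * 2
    df = pd.DataFrame({'x': [0, 1] * 3, 'y': range(6),
                       't': pd.DatetimeIndex(dates, tz=tz)})
    result = df.groupby('x', group_keys=False).apply(lambda d: d[['y']].copy())
    expected = df[['y']]
    pd.testing.assert_frame_equal(result, expected)
```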

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Apply Apply, Aggregate, Transform, Map Groupby labels Jun 28, 2020
@jreback jreback added this to the 1.1 milestone Jul 11, 2020
@TomAugspurger
Contributor

This is being reverted in #35306 and will be re-fixed in #34998 in pandas 1.2.

@TomAugspurger TomAugspurger reopened this Jul 16, 2020
@TomAugspurger TomAugspurger modified the milestones: 1.1, 1.2 Jul 16, 2020
@jreback jreback modified the milestones: 1.2, 1.3 Nov 29, 2020
@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021
@jreback jreback added this to the 1.5 milestone Mar 30, 2022