Series with NAMED period index raise error on groupby index.month (pandas 1.0 specific) #32108

Closed
daxid opened this issue Feb 19, 2020 · 14 comments · Fixed by #33105
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions Period Period data type
Comments

@daxid
Contributor

daxid commented Feb 19, 2020

edit from @TomAugspurger: this is fixed on master, but the example below needs to be added as a unit test. The test can probably go in groupby/test_groupby.py.

Description

With pandas 1.0.1 (full version details, including dependencies, are at the end), a Series with a NAMED period index raises an error on groupby by index.month.

There is no error if the index is not named.

There was no error with pandas 0.25.3.

Code Sample

import pandas as pd

index = pd.period_range(start='2018-01', periods=24, freq='M')
periodSerie = pd.Series(range(24),index=index)
periodSerie.index.name = 'Month'
periodSerie.groupby(periodSerie.index.month).sum()

Error

It seems to me that pandas tries to interpret the index name as if it were part of the index itself.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.virtualenvs/dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4410             try:
-> 4411                 return libindex.get_value_at(s, key)
   4412             except IndexError:

pandas/_libs/index.pyx in pandas._libs.index.get_value_at()

pandas/_libs/index.pyx in pandas._libs.index.get_value_at()

pandas/_libs/util.pxd in pandas._libs.util.get_value_at()

pandas/_libs/util.pxd in pandas._libs.util.validate_indexer()

TypeError: 'str' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/.virtualenvs/dev/lib/python3.8/site-packages/pandas/core/indexes/period.py in get_value(self, series, key)
    516         try:
--> 517             value = super().get_value(s, key)
    518         except (KeyError, IndexError):

~/.virtualenvs/dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4418                 else:
-> 4419                     raise e1
   4420             except Exception:

~/.virtualenvs/dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4404         try:
-> 4405             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4406         except KeyError as e1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: 'Month'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso()

pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.dateutil_parse()

ValueError: Unknown datetime string format, unable to parse: Month

During handling of the above exception, another exception occurred:

DateParseError                            Traceback (most recent call last)
<ipython-input-6-a3e948d22d88> in <module>
      5 periodSerie = pd.Series(range(24),index=index)
      6 periodSerie.index.name = 'Month'
----> 7 periodSerie.groupby(periodSerie.index.month).sum()

~/.virtualenvs/dev/lib/python3.8/site-packages/pandas/core/series.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
   1676         axis = self._get_axis_number(axis)
   1677 
-> 1678         return groupby_generic.SeriesGroupBy(
   1679             obj=self,
   1680             keys=by,

~/.virtualenvs/dev/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated)
    400             from pandas.core.groupby.grouper import get_grouper
    401 
--> 402             grouper, exclusions, obj = get_grouper(
    403                 obj,
    404                 keys,

~/.virtualenvs/dev/lib/python3.8/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
    583     for i, (gpr, level) in enumerate(zip(keys, levels)):
    584 
--> 585         if is_in_obj(gpr):  # df.groupby(df['name'])
    586             in_axis, name = True, gpr.name
    587             exclusions.append(name)

~/.virtualenvs/dev/lib/python3.8/site-packages/pandas/core/groupby/grouper.py in is_in_obj(gpr)
    577             return False
    578         try:
--> 579             return gpr is obj[gpr.name]
    580         except (KeyError, IndexError):
    581             return False

~/.virtualenvs/dev/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
    869         key = com.apply_if_callable(key, self)
    870         try:
--> 871             result = self.index.get_value(self, key)
    872 
    873             if not is_scalar(result):

~/.virtualenvs/dev/lib/python3.8/site-packages/pandas/core/indexes/period.py in get_value(self, series, key)
    518         except (KeyError, IndexError):
    519             if isinstance(key, str):
--> 520                 asdt, parsed, reso = parse_time_string(key, self.freq)
    521                 grp = resolution.Resolution.get_freq_group(reso)
    522                 freqn = resolution.get_freq_group(self.freq)

pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_time_string()

pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso()

DateParseError: Unknown datetime string format, unable to parse: Month
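
The last frames of the traceback show `is_in_obj` evaluating `obj[gpr.name]`, that is, looking up the grouper's name (`'Month'`) as a label in the Series. The sketch below isolates that lookup. The helper `lookup_index_name` is a hypothetical name introduced here for illustration, and the exception type it reports depends on the pandas version: the traceback above points to `DateParseError` on 1.0.1, whereas a `KeyError` would be caught by `is_in_obj`.

```python
import pandas as pd


def lookup_index_name(series):
    """Try series[<index name>] and report the exception type raised, if any.

    Hypothetical helper, not part of pandas.
    """
    try:
        series[series.index.name]
    except Exception as exc:  # broad on purpose: only the type name is of interest
        return type(exc).__name__
    return None


index = pd.period_range(start="2018-01", periods=24, freq="M")
period_serie = pd.Series(range(24), index=index)
period_serie.index.name = "Month"

# The grouper derived from index.month inherits the index name
grouper = period_serie.index.month
print(grouper.name)

# Name of the exception raised by the label lookup (version dependent)
print(lookup_index_name(period_serie))
```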

Expected Output

With pandas 0.25.3, the following expected output is produced:

Month
1     12
2     14
3     16
4     18
5     20
6     22
7     24
8     26
9     28
10    30
11    32
12    34
dtype: int64
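
For affected installs, one possible workaround is to pass the month numbers as a plain NumPy array. This is offered as an assumption, not something confirmed in this thread: a bare array has no `name` attribute, so it should skip the name lookup seen in the traceback. The result index then comes back unnamed, so the sketch restores the name afterwards.

```python
import pandas as pd

index = pd.period_range(start="2018-01", periods=24, freq="M")
period_serie = pd.Series(range(24), index=index)
period_serie.index.name = "Month"

# Assumption: an ndarray grouper carries no name, so no label lookup is attempted
months = period_serie.index.month.to_numpy()
result = period_serie.groupby(months).sum()

# Restore the index name that the named grouper would otherwise have provided
result.index.name = period_serie.index.name
print(result)
```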

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.18-1-MANJARO
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 1.0.1
numpy : 1.18.0
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None

@daxid daxid changed the title from "test" to "Series with NAMED period index raise error on groupby index.month (pandas 1.0)" Feb 19, 2020
@daxid daxid changed the title from "Series with NAMED period index raise error on groupby index.month (pandas 1.0)" to "Series with NAMED period index raise error on groupby index.month (pandas 1.0 specific)" Feb 19, 2020
@MarcoGorelli
Member

MarcoGorelli commented Feb 19, 2020

Thanks @daxid

Couldn't reproduce this on master, seems it's already been fixed

In [1]: import pandas as pd

In [2]: index = pd.period_range(start='2018-01', periods=24, freq='M')
   ...: periodSerie = pd.Series(range(24),index=index)
   ...: periodSerie.index.name = 'Month'
   ...: periodSerie.groupby(periodSerie.index.month).sum()
Out[2]:
Month
1     12
2     14
3     16
4     18
5     20
6     22
7     24
8     26
9     28
10    30
11    32
12    34
dtype: int64

(please don't close this yet though, I still haven't checked if there is a test for this)

@daxid
Contributor Author

daxid commented Feb 19, 2020

I can confirm it runs OK with the latest code on master (pandas-1.1.0.dev0+516.gac3056f2f).

I'll leave it open until you check for related tests.

Best regards

@jorisvandenbossche jorisvandenbossche added the Needs Tests Unit test(s) needed to prevent regressions label Feb 19, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.0.2 milestone Feb 19, 2020
@jorisvandenbossche
Member

So we need to check if it also already passes on the 1.0.x branch, or if we need to find the commit on master that fixed this to backport.

@daxid
Contributor Author

daxid commented Feb 19, 2020

I installed the 1.0.x branch and the sample code fails!

I guess you have to go for a cherry pick 🍒

I'll give writing the test a shot. I'll report progress here, probably within a week.

@MarcoGorelli
Member

MarcoGorelli commented Feb 20, 2020

Just checking this but I think it was fixed by #31318

EDIT

yup, seems to be the case - @daxid thanks, tests would be welcome!

(pandas-dev) m.gorelli@ws-1808:~/pandas$ git checkout 5402ea57a6f0fe606a3de3731dcb903186b1f4c2
HEAD is now at 5402ea57a REF: make PeriodIndex.get_value wrap PeriodIndex.get_loc (#31318)
(pandas-dev) m.gorelli@ws-1808:~/pandas$ python setup.py build_ext --inplace -j 4
running build_ext
(pandas-dev) m.gorelli@ws-1808:~/pandas$ ipython
Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.11.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...:
   ...: index = pd.period_range(start='2018-01', periods=24, freq='M')
   ...: periodSerie = pd.Series(range(24),index=index)
   ...: periodSerie.index.name = 'Month'
   ...: periodSerie.groupby(periodSerie.index.month).sum()
Out[1]:
Month
1     12
2     14
3     16
4     18
5     20
6     22
7     24
8     26
9     28
10    30
11    32
12    34
dtype: int64

In [2]:                                                                                                                                                                                                          
Do you really want to exit ([y]/n)? y
(pandas-dev) m.gorelli@ws-1808:~/pandas$ git log -n 2
commit 5402ea57a6f0fe606a3de3731dcb903186b1f4c2 (HEAD, refs/bisect/bad)
Author: jbrockmendel <[email protected]>
Date:   Mon Jan 27 18:22:55 2020 -0800

    REF: make PeriodIndex.get_value wrap PeriodIndex.get_loc (#31318)

commit 3c763186b81ac61672f7d2e79473f1763b274a80 (refs/bisect/good-3c763186b81ac61672f7d2e79473f1763b274a80)
Author: jbrockmendel <[email protected]>
Date:   Mon Jan 27 18:22:13 2020 -0800

    CLN: require extracting .values before expressions calls (#31373)
(pandas-dev) m.gorelli@ws-1808:~/pandas$ git checkout 3c763186b81ac61672f7d2e79473f1763b274a80
Previous HEAD position was 5402ea57a REF: make PeriodIndex.get_value wrap PeriodIndex.get_loc (#31318)
HEAD is now at 3c763186b CLN: require extracting .values before expressions calls (#31373)
(pandas-dev) m.gorelli@ws-1808:~/pandas$ python setup.py build_ext --inplace -j 4
running build_ext
(pandas-dev) m.gorelli@ws-1808:~/pandas$ ipython
Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.11.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...:
   ...: index = pd.period_range(start='2018-01', periods=24, freq='M')
   ...: periodSerie = pd.Series(range(24),index=index)
   ...: periodSerie.index.name = 'Month'
   ...: periodSerie.groupby(periodSerie.index.month).sum()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/pandas/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso()
    308     try:
--> 309         parsed, reso = dateutil_parse(date_string, _DEFAULT_DATETIME,
    310                                       dayfirst=dayfirst, yearfirst=yearfirst,

~/pandas/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.dateutil_parse()
    481     if res is None:
--> 482         raise ValueError(f"Unknown datetime string format, unable to parse: {timestr}")
    483 

ValueError: Unknown datetime string format, unable to parse: Month

During handling of the above exception, another exception occurred:

DateParseError                            Traceback (most recent call last)
<ipython-input-1-28f368eeb32c> in <module>
      4 periodSerie = pd.Series(range(24),index=index)
      5 periodSerie.index.name = 'Month'
----> 6 periodSerie.groupby(periodSerie.index.month).sum()

~/pandas/pandas/core/series.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
   1662             group_keys=group_keys,
   1663             squeeze=squeeze,
-> 1664             observed=observed,
   1665         )
   1666 

~/pandas/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated)
    407                 sort=sort,
    408                 observed=observed,
--> 409                 mutated=self.mutated,
    410             )
    411 

~/pandas/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
    584     for i, (gpr, level) in enumerate(zip(keys, levels)):
    585 
--> 586         if is_in_obj(gpr):  # df.groupby(df['name'])
    587             in_axis, name = True, gpr.name
    588             exclusions.append(name)

~/pandas/pandas/core/groupby/grouper.py in is_in_obj(gpr)
    578             return False
    579         try:
--> 580             return gpr is obj[gpr.name]
    581         except (KeyError, IndexError):
    582             return False

~/pandas/pandas/core/series.py in __getitem__(self, key)
    855 
    856         try:
--> 857             result = self.index.get_value(self, key)
    858 
    859             return result

~/pandas/pandas/core/indexes/period.py in get_value(self, series, key)
    495                 pass
    496 
--> 497             asdt, reso = parse_time_string(key, self.freq)
    498             grp = resolution.Resolution.get_freq_group(reso)
    499             freqn = resolution.get_freq_group(self.freq)

~/pandas/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_time_string()
    267             yearfirst = get_option("display.date_yearfirst")
    268 
--> 269     res = parse_datetime_string_with_reso(arg, freq=freq,
    270                                           dayfirst=dayfirst,
    271                                           yearfirst=yearfirst)

~/pandas/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso()
    312     except (ValueError, OverflowError) as err:
    313         # TODO: allow raise of errors within instead
--> 314         raise DateParseError(err)
    315     if parsed is None:
    316         raise DateParseError(f"Could not parse {date_string}")

DateParseError: Unknown datetime string format, unable to parse: Month

@TomAugspurger
Contributor

Backporting #31318 doesn't look straightforward, so this likely won't be fixed for 1.0.2 unless someone invests some time in the next day or so.

I'll repurpose this issue to adding a test for the behavior on master.
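
A regression test for this could look roughly like the sketch below, built directly from the code sample and expected output in the report. The test name `test_groupby_named_period_index_month` is hypothetical, and this is not necessarily the test that was eventually merged.

```python
import pandas as pd


def test_groupby_named_period_index_month():
    # Hypothetical test name; the data comes from the code sample in this issue
    index = pd.period_range(start="2018-01", periods=24, freq="M")
    period_series = pd.Series(range(24), index=index)
    period_series.index.name = "Month"

    result = period_series.groupby(period_series.index.month).sum()

    # Expected values from the "Expected Output" section (pandas 0.25.3)
    assert result.tolist() == list(range(12, 36, 2))
    assert result.index.tolist() == list(range(1, 13))
    assert result.index.name == "Month"
```

It can be run with `pytest`. Per the discussion above, it should pass on master and would be expected to fail on a 1.0.x build that still has the bug.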

@daxid
Contributor Author

daxid commented Mar 10, 2020

Thanks for the feedback.
I am still committed to producing tests for this but have been busier than expected during the last couple of days.

dbeniamine added a commit to Tetras-Libre/pandas that referenced this issue Mar 23, 2020
@sumanau7
Contributor

take

@sumanau7
Contributor

@MarcoGorelli Requesting your review.

@daxid
Contributor Author

daxid commented Mar 28, 2020

Too bad, my colleague and I had a test ready since Monday.
He mentioned this thread in his commit (Tetras-Libre@74e65a8) but I forgot to make the pull request :-(

@MarcoGorelli to review: the test should pass on master but fail on the 1.0.x branch

@sumanau7
Contributor

@daxid I missed it. I have closed my pull request, so please feel free to raise a new one. If the issue is not resolved in a few weeks, I will reopen my pull request.

@sumanau7 sumanau7 removed their assignment Mar 28, 2020
@MarcoGorelli
Member

@daxid can you make the pull request and either link to it here, or write "closes #32108" in its body?

@sumanau7
Contributor

@daxid Please run black pandas and update the PR.

@jreback jreback added Groupby Period Period data type labels Mar 30, 2020
@simonjayhawkins
Member

This is also fixed on the 1.0.x branch by the backport of #34049 (i.e. in 1.0.4)

b3ebcb0 is the first new commit
commit b3ebcb0
Author: Simon Hawkins [email protected]
Date: Tue May 19 12:39:11 2020 +0100

Backport PR #34049 on branch 1.0.x (Bug in Series.groupby would raise ValueError when grouping by PeriodIndex level) (#34247)

@simonjayhawkins simonjayhawkins removed this from the 1.1 milestone Jun 23, 2020
@simonjayhawkins simonjayhawkins added this to the 1.0.4 milestone Jun 23, 2020
@jreback jreback modified the milestones: 1.0.4, 1.1 Jun 24, 2020
fangchenli pushed a commit to fangchenli/pandas that referenced this issue Jun 27, 2020
7 participants