Skip to content

BUG: GH17464 MultiIndex now raises an error when levels aren't unique, tests changed #17971

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 46 commits into from
Dec 3, 2017
Merged
Show file tree
Hide file tree
Changes from 41 commits
Commits
Show all changes
46 commits
Select commit Hold shift + click to select a range
a9a5177
BUG: GH17464 Add error checking for duplicate levels in MultiIndex
cmazzullo Oct 24, 2017
91c8388
Remove duplicate levels from `test_is_`
cmazzullo Oct 25, 2017
392ce8a
Remove duplicate levels from `test_level_setting_resets_attributes`
cmazzullo Oct 25, 2017
3c0812e
Remove duplicate levels from `test_frame_describe_multikey`
cmazzullo Oct 25, 2017
0983453
Add comments about change
cmazzullo Oct 25, 2017
b39421f
Added comments changes w/ bug number
cmazzullo Oct 25, 2017
5771426
PEP8 compliance
cmazzullo Oct 25, 2017
48f9429
whatsnew entry
cmazzullo Oct 25, 2017
9ffc8ad
Remove duplicate levels from `test_is_`
cmazzullo Oct 25, 2017
e70a2be
Remove duplicate levels from `test_level_setting_resets_attributes`
cmazzullo Oct 25, 2017
c28d3df
Remove duplicate levels from `test_frame_describe_multikey`
cmazzullo Oct 25, 2017
03f2da8
Add comments about change
cmazzullo Oct 25, 2017
1d45ab6
Added comments changes w/ bug number
cmazzullo Oct 25, 2017
015af48
PEP8 compliance
cmazzullo Oct 25, 2017
9dc7eb5
whatsnew entry
cmazzullo Oct 25, 2017
9aa2bcd
Whatsnew backticks
cmazzullo Oct 25, 2017
e52460e
whatsnew merging
cmazzullo Oct 25, 2017
48d509d
Removed comments about this issue from other tests
cmazzullo Oct 27, 2017
d75e1de
Added test to make sure a ValueError is thrown
cmazzullo Oct 27, 2017
0943d19
PEP8 compliance
cmazzullo Oct 27, 2017
6f2efc6
move whatsnew to 0.22.0
jreback Oct 28, 2017
840fe56
Remove duplicate levels from `test_is_`
cmazzullo Oct 25, 2017
137bc16
Remove duplicate levels from `test_frame_describe_multikey`
cmazzullo Oct 25, 2017
0129ee7
Add comments about change
cmazzullo Oct 25, 2017
8b400dc
Added comments changes w/ bug number
cmazzullo Oct 25, 2017
0a8e9f2
PEP8 compliance
cmazzullo Oct 25, 2017
cc7ebc7
whatsnew entry
cmazzullo Oct 25, 2017
b02114f
Remove duplicate levels from `test_is_`
cmazzullo Oct 25, 2017
b56eca0
Remove duplicate levels from `test_level_setting_resets_attributes`
cmazzullo Oct 25, 2017
9f179e6
Remove duplicate levels from `test_frame_describe_multikey`
cmazzullo Oct 25, 2017
2af9aba
Add comments about change
cmazzullo Oct 25, 2017
3e56aba
Added comments changes w/ bug number
cmazzullo Oct 25, 2017
85d6379
PEP8 compliance
cmazzullo Oct 25, 2017
49b731d
whatsnew entry
cmazzullo Oct 25, 2017
ec4f971
Whatsnew backticks
cmazzullo Oct 25, 2017
2684855
whatsnew merging
cmazzullo Oct 25, 2017
c36c236
Removed comments about this issue from other tests
cmazzullo Oct 27, 2017
2b3f4d4
Added test to make sure a ValueError is thrown
cmazzullo Oct 27, 2017
386daaf
move whatsnew to 0.22.0
jreback Oct 28, 2017
c169645
Updated `test_frame_describe_multikey` to remove duplicate MultiIndex…
cmazzullo Nov 29, 2017
073e629
Fixed linting issue
cmazzullo Dec 1, 2017
869157d
whatsnew changes
cmazzullo Dec 2, 2017
44e4552
Kwargs in error message
cmazzullo Dec 2, 2017
297216b
Removed unnecessary comment
cmazzullo Dec 2, 2017
fead79f
Added blank line
cmazzullo Dec 2, 2017
703ff1e
Got rid of duplicated tests
cmazzullo Dec 2, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions doc/source/whatsnew/v0.22.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,9 @@ Other API Changes
- Inserting missing values into indexes will work for all types of indexes and automatically insert the correct type of missing value (``NaN``, ``NaT``, etc.) regardless of the type passed in (:issue:`18295`)
- Restricted ``DateOffset`` keyword arguments. Previously, ``DateOffset`` subclasses allowed arbitrary keyword arguments which could lead to unexpected behavior. Now, only valid arguments will be accepted. (:issue:`17176`, :issue:`18226`).
- :func:`DataFrame.from_items` provides a more informative error message when passed scalar values (:issue:`17312`)
- :class:`Timestamp` will no longer silently ignore unused or invalid `tz` or `tzinfo` arguments (:issue:`17690`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like some duplication here

- :class:`CacheableOffset` and :class:`WeekDay` are no longer available in the `tseries.offsets` module (:issue:`17830`)
- When created with duplicate labels, ``MultiIndex`` now raises a ``ValueError``. (:issue:`17464`)

.. _whatsnew_0220.deprecations:

Expand Down
8 changes: 6 additions & 2 deletions pandas/core/indexes/multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -177,8 +177,8 @@ def _verify_integrity(self, labels=None, levels=None):
Raises
------
ValueError
* if length of levels and labels don't match or any label would
exceed level bounds
If length of levels and labels don't match, if any label would
exceed level bounds, or there are any duplicate levels.
"""
# NOTE: Currently does not check, among other things, that cached
# nlevels matches nor that sortorder matches actually sortorder.
Expand All @@ -198,6 +198,10 @@ def _verify_integrity(self, labels=None, levels=None):
" level (%d). NOTE: this index is in an"
" inconsistent state" % (i, label.max(),
len(level)))
if not level.is_unique:
raise ValueError("Level values must be unique: {0}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use kwargs rather than positional here

" on level {1}".format(
[value for value in level], i))

@property
def levels(self):
Expand Down
14 changes: 8 additions & 6 deletions pandas/tests/groupby/test_functional.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,10 +52,10 @@ def test_frame_describe_multikey(self):
desc_groups = []
for col in self.tsframe:
group = grouped[col].describe()
group_col = pd.MultiIndex([[col] * len(group.columns),
group.columns],
[[0] * len(group.columns),
range(len(group.columns))])
# GH 17464 - Remove duplicate MultiIndex levels
group_col = pd.MultiIndex(
levels=[[col], group.columns],
labels=[[0] * len(group.columns), range(len(group.columns))])
group = pd.DataFrame(group.values,
columns=group_col,
index=group.index)
Expand All @@ -67,8 +67,10 @@ def test_frame_describe_multikey(self):
'C': 1, 'D': 1}, axis=1)
result = groupedT.describe()
expected = self.tsframe.describe().T
expected.index = pd.MultiIndex([[0, 0, 1, 1], expected.index],
[range(4), range(len(expected.index))])
# GH 17464 - Remove duplicate MultiIndex levels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need this comment here

expected.index = pd.MultiIndex(
levels=[[0, 1], expected.index],
labels=[[0, 0, 1, 1], range(len(expected.index))])
tm.assert_frame_equal(result, expected)

def test_frame_describe_tupleindex(self):
Expand Down
77 changes: 77 additions & 0 deletions pandas/tests/groupby/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -385,6 +385,83 @@ def test_attr_wrapper(self):
# make sure raises error
pytest.raises(AttributeError, getattr, grouped, 'foo')

def test_series_describe_multikey(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests were moved, revert this file

ts = tm.makeTimeSeries()
grouped = ts.groupby([lambda x: x.year, lambda x: x.month])
result = grouped.describe()
assert_series_equal(result['mean'], grouped.mean(), check_names=False)
assert_series_equal(result['std'], grouped.std(), check_names=False)
assert_series_equal(result['min'], grouped.min(), check_names=False)

def test_series_describe_single(self):
ts = tm.makeTimeSeries()
grouped = ts.groupby(lambda x: x.month)
result = grouped.apply(lambda x: x.describe())
expected = grouped.describe().stack()
assert_series_equal(result, expected)

def test_series_index_name(self):
grouped = self.df.loc[:, ['C']].groupby(self.df['A'])
result = grouped.agg(lambda x: x.mean())
assert result.index.name == 'A'

def test_frame_describe_multikey(self):
grouped = self.tsframe.groupby([lambda x: x.year, lambda x: x.month])
result = grouped.describe()
desc_groups = []
for col in self.tsframe:
group = grouped[col].describe()
# GH 17464 - Remove duplicate MultiIndex levels
group_col = pd.MultiIndex(
levels=[[col], group.columns],
labels=[[0] * len(group.columns), range(len(group.columns))])
group = pd.DataFrame(group.values,
columns=group_col,
index=group.index)
desc_groups.append(group)
expected = pd.concat(desc_groups, axis=1)
tm.assert_frame_equal(result, expected)

groupedT = self.tsframe.groupby({'A': 0, 'B': 0,
'C': 1, 'D': 1}, axis=1)
result = groupedT.describe()
expected = self.tsframe.describe().T
# GH 17464 - Remove duplicate MultiIndex levels
expected.index = pd.MultiIndex(
levels=[[0, 1], expected.index],
labels=[[0, 0, 1, 1], range(len(expected.index))])
tm.assert_frame_equal(result, expected)

def test_frame_describe_tupleindex(self):

# GH 14848 - regression from 0.19.0 to 0.19.1
df1 = DataFrame({'x': [1, 2, 3, 4, 5] * 3,
'y': [10, 20, 30, 40, 50] * 3,
'z': [100, 200, 300, 400, 500] * 3})
df1['k'] = [(0, 0, 1), (0, 1, 0), (1, 0, 0)] * 5
df2 = df1.rename(columns={'k': 'key'})
pytest.raises(ValueError, lambda: df1.groupby('k').describe())
pytest.raises(ValueError, lambda: df2.groupby('key').describe())

def test_frame_describe_unstacked_format(self):
# GH 4792
prices = {pd.Timestamp('2011-01-06 10:59:05', tz=None): 24990,
pd.Timestamp('2011-01-06 12:43:33', tz=None): 25499,
pd.Timestamp('2011-01-06 12:54:09', tz=None): 25499}
volumes = {pd.Timestamp('2011-01-06 10:59:05', tz=None): 1500000000,
pd.Timestamp('2011-01-06 12:43:33', tz=None): 5000000000,
pd.Timestamp('2011-01-06 12:54:09', tz=None): 100000000}
df = pd.DataFrame({'PRICE': prices,
'VOLUME': volumes})
result = df.groupby('PRICE').VOLUME.describe()
data = [df[df.PRICE == 24990].VOLUME.describe().values.tolist(),
df[df.PRICE == 25499].VOLUME.describe().values.tolist()]
expected = pd.DataFrame(data,
index=pd.Index([24990, 25499], name='PRICE'),
columns=['count', 'mean', 'std', 'min',
'25%', '50%', '75%', 'max'])
tm.assert_frame_equal(result, expected)

def test_frame_groupby(self):
grouped = self.tsframe.groupby(lambda x: x.weekday())

Expand Down
22 changes: 17 additions & 5 deletions pandas/tests/indexes/test_multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -1618,7 +1618,9 @@ def test_is_(self):
# shouldn't change
assert mi2.is_(mi)
mi4 = mi3.view()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an explicit test for the issue, e.g that you are raising (use the example from the original issue).

mi4.set_levels([[1 for _ in range(10)], lrange(10)], inplace=True)

# GH 17464 - Remove duplicate MultiIndex levels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need these comments here, rather they go on a new test (see above)

mi4.set_levels([lrange(10), lrange(10)], inplace=True)
assert not mi4.is_(mi3)
mi5 = mi.view()
mi5.set_levels(mi5.levels, inplace=True)
Expand Down Expand Up @@ -2450,13 +2452,11 @@ def test_isna_behavior(self):
pd.isna(self.index)

def test_level_setting_resets_attributes(self):
ind = MultiIndex.from_arrays([
ind = pd.MultiIndex.from_arrays([
['A', 'A', 'B', 'B', 'B'], [1, 2, 1, 2, 3]
])
assert ind.is_monotonic
ind.set_levels([['A', 'B', 'A', 'A', 'B'], [2, 1, 3, -2, 5]],
inplace=True)

ind.set_levels([['A', 'B'], [1, 3, 2]], inplace=True)
# if this fails, probably didn't reset the cache correctly.
assert not ind.is_monotonic

Expand Down Expand Up @@ -3083,3 +3083,15 @@ def test_million_record_attribute_error(self):
with tm.assert_raises_regex(AttributeError,
"'Series' object has no attribute 'foo'"):
df['a'].foo()

def test_duplicate_multiindex_labels(self):
# GH 17464
# Make sure that a MultiIndex with duplicate levels throws a ValueError
with pytest.raises(ValueError):
ind = pd.MultiIndex([['A'] * 10, range(10)], [[0] * 10, range(10)])
# And that using set_levels with duplicate levels fails
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blank line

ind = MultiIndex.from_arrays([['A', 'A', 'B', 'B', 'B'],
[1, 2, 1, 2, 3]])
with pytest.raises(ValueError):
ind.set_levels([['A', 'B', 'A', 'A', 'B'], [2, 1, 3, -2, 5]],
inplace=True)