[WIP] Test (and more fixes) for duplicate indices with concat #38745
Conversation
pandas/core/indexes/api.py (Outdated)

```
@@ -89,6 +89,19 @@ def get_objs_combined_axis(
    Index
    """
    obs_idxes = [obj._get_axis(axis) for obj in objs]
    if all_indexes_same(obs_idxes):
```
this is not the place to do this at all. should be in concat itself.
ideally we want a nice error message if this happens (and the indices are not equal).
I agree I need to add the error message, but I do think this function is where the error should be thrown. Every function which uses `get_objs_combined_axis` throws an error if there are duplicate indices. Here is a demonstration of this from pandas 1.1.5:
```python
import numpy as np
import pandas as pd
import pandas._testing as tm


def list_of_series_constructor(*args):
    return pd.DataFrame(list(args))


def crosstab(*args):
    return pd.crosstab(*args)


def concat_series(*args):
    return pd.concat(list(args), axis=1)


def concat_dfs(*args):
    return pd.concat([pd.DataFrame(x) for x in args], axis=1)


def duplicated_indices_errors(func, index_makers):
    records = []
    for i, index_maker in enumerate(index_makers):
        rec = {"index_maker": index_maker.__name__}
        uniq = index_maker(k=4)
        non_uniq = uniq[[0, 0, 1, 2]]
        s_uniq = pd.Series(np.ones(len(uniq)), index=uniq)
        s_non_uniq = pd.Series(np.ones(len(non_uniq)), index=non_uniq)
        try:
            res = func(s_uniq, s_non_uniq)
        except Exception as e:
            rec["error_type"] = str(type(e))
            rec["error_message"] = str(e)
        records.append(rec)
    return records


funcs = [list_of_series_constructor, crosstab, concat_series, concat_dfs]
index_makers = [
    tm.makeStringIndex,
    tm.makeIntIndex,
    tm.makeUIntIndex,
    tm.makeFloatIndex,
    tm.makeDateIndex,
    tm.makeTimedeltaIndex,
    tm.makePeriodIndex,
    tm.makeMultiIndex,
    tm.makeBoolIndex,
]

pd.concat(
    {
        f.__name__: pd.DataFrame(duplicated_indices_errors(f, index_makers))
        for f in funcs
    }
).to_markdown()
```
| | index_maker | error_type | error_message |
|---|---|---|---|
('list_of_series_constructor', 0) | makeStringIndex | <class 'pandas.errors.InvalidIndexError'> | Reindexing only valid with uniquely valued Index objects |
('list_of_series_constructor', 1) | makeIntIndex | <class 'pandas.errors.InvalidIndexError'> | Reindexing only valid with uniquely valued Index objects |
('list_of_series_constructor', 2) | makeUIntIndex | <class 'pandas.errors.InvalidIndexError'> | Reindexing only valid with uniquely valued Index objects |
('list_of_series_constructor', 3) | makeFloatIndex | <class 'pandas.errors.InvalidIndexError'> | Reindexing only valid with uniquely valued Index objects |
('list_of_series_constructor', 4) | makeDateIndex | <class 'pandas.errors.InvalidIndexError'> | Reindexing only valid with uniquely valued Index objects |
('list_of_series_constructor', 5) | makeTimedeltaIndex | <class 'pandas.errors.InvalidIndexError'> | Reindexing only valid with uniquely valued Index objects |
('list_of_series_constructor', 6) | makePeriodIndex | <class 'pandas.errors.InvalidIndexError'> | Reindexing only valid with uniquely valued Index objects |
('list_of_series_constructor', 7) | makeMultiIndex | <class 'ValueError'> | Reindexing only valid with uniquely valued Index objects |
('list_of_series_constructor', 8) | makeBoolIndex | <class 'pandas.errors.InvalidIndexError'> | Reindexing only valid with uniquely valued Index objects |
('crosstab', 0) | makeStringIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('crosstab', 1) | makeIntIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('crosstab', 2) | makeUIntIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('crosstab', 3) | makeFloatIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('crosstab', 4) | makeDateIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('crosstab', 5) | makeTimedeltaIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('crosstab', 6) | makePeriodIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('crosstab', 7) | makeMultiIndex | <class 'ValueError'> | cannot handle a non-unique multi-index! |
('crosstab', 8) | makeBoolIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('concat_series', 0) | makeStringIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('concat_series', 1) | makeIntIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('concat_series', 2) | makeUIntIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('concat_series', 3) | makeFloatIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('concat_series', 4) | makeDateIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('concat_series', 5) | makeTimedeltaIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('concat_series', 6) | makePeriodIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('concat_series', 7) | makeMultiIndex | <class 'ValueError'> | cannot handle a non-unique multi-index! |
('concat_series', 8) | makeBoolIndex | <class 'ValueError'> | cannot reindex from a duplicate axis |
('concat_dfs', 0) | makeStringIndex | <class 'ValueError'> | Shape of passed values is (5, 2), indices imply (4, 2) |
('concat_dfs', 1) | makeIntIndex | <class 'ValueError'> | Shape of passed values is (7, 2), indices imply (5, 2) |
('concat_dfs', 2) | makeUIntIndex | <class 'ValueError'> | Shape of passed values is (7, 2), indices imply (5, 2) |
('concat_dfs', 3) | makeFloatIndex | <class 'ValueError'> | Shape of passed values is (7, 2), indices imply (5, 2) |
('concat_dfs', 4) | makeDateIndex | <class 'ValueError'> | Shape of passed values is (7, 2), indices imply (5, 2) |
('concat_dfs', 5) | makeTimedeltaIndex | <class 'ValueError'> | Shape of passed values is (7, 2), indices imply (5, 2) |
('concat_dfs', 6) | makePeriodIndex | <class 'ValueError'> | Shape of passed values is (7, 2), indices imply (5, 2) |
('concat_dfs', 7) | makeMultiIndex | <class 'ValueError'> | cannot handle a non-unique multi-index! |
('concat_dfs', 8) | makeBoolIndex | <class 'ValueError'> | Shape of passed values is (4, 2), indices imply (2, 2) |
Instead of just giving concat a better error message, all these methods could get better (and more consistent) error messages. I think it's important for the rules about unions and intersections of indices to be defined in these more internal methods so behaviour isn't defined (and reimplemented) per top-level function.
Additionally, this function essentially returns nonsense for unions when there are duplicated indices, since different `Index` types have different definitions for unions with duplicates, so I think it's appropriate to throw errors here instead of passing on those nonsense results.
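To make that variability concrete, here's a small illustration (the behaviour of `union` with duplicates has differed across `Index` subclasses and pandas versions, so no specific result length is asserted):

```python
import pandas as pd

a = pd.Index([0, 0, 1, 2])  # contains a duplicated label
b = pd.Index([0, 2])

u = a.union(b)
# The set of labels in the union is well defined, but whether (and how
# many) duplicates survive has depended on the Index subclass and version.
print(sorted(set(u)))  # [0, 1, 2]
```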
This has also made me realize there is one case where it's okay for there to be duplicates: when the indices contain duplicates but the intersection would not.
> This has also made me realize there is one case where it's okay for there to be duplicates, which is when indices contain duplicates, but the intersection would not.

This got a bit more complicated. `idx.get_indexer(keys)` throws an error if `idx` is non-unique, regardless of whether `keys` have unique indices in `idx`. Interestingly enough, it does not throw an error if `idx` is non-unique but `keys` is empty (which is the case I had initially seen).
Examples using pandas 1.1.5:
```python
pd.crosstab(
    pd.Series(np.arange(4), index=[0, 1, 1, 2]),
    pd.Series(np.arange(2), index=[0, 2]),
)
# ValueError: cannot reindex from a duplicate axis

pd.crosstab(
    pd.Series(np.arange(4), index=[0, 1, 1, 2]),
    pd.Series(np.arange(2), index=[3, 4]),
)
# Empty DataFrame
# Columns: []
# Index: []
```
Examples with `pd.concat` (this had the same behaviour despite not using `.get_indexer` directly in 1.1.5):
```python
pd.concat(
    [
        pd.Series(np.arange(4), index=[0, 1, 1, 2]),
        pd.Series(np.arange(2), index=[0, 2]),
    ],
    axis=1,
    join="inner",
)
# ValueError: cannot reindex from a duplicate axis

pd.concat(
    [
        pd.Series(np.arange(4), index=[0, 1, 1, 2]),
        pd.Series(np.arange(2), index=[3, 4]),
    ],
    axis=1,
    join="inner",
)
# Empty DataFrame
# Columns: [0, 1]
# Index: []
```
Opened a more general issue for this: #38797
This fixes a previous issue where `get_objs_combined_axis` would not throw an error for duplicated indices in all cases. Now we check to make sure that the duplicated values don't show up in the intersection. This allows safe indexing when there are duplicated values, as long as they don't show up in the intersection.
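A minimal sketch of the kind of check being described (`duplicates_in_intersection` is a hypothetical helper for illustration, not the actual code in this PR): duplicates are tolerated as long as no duplicated label survives into the intersection.

```python
import pandas as pd


def duplicates_in_intersection(indexes):
    # Hypothetical helper: True if any duplicated label also appears in
    # the intersection of all indexes (the case that should error).
    inter = indexes[0].unique()
    for idx in indexes[1:]:
        inter = inter.intersection(idx.unique())
    for idx in indexes:
        dupes = idx[idx.duplicated()].unique()
        if len(dupes.intersection(inter)) > 0:
            return True
    return False


# Duplicated label 0 is dropped by the intersection -> safe
safe = duplicates_in_intersection([pd.Index([0, 0, 1, 2]), pd.Index([1, 2])])
# Duplicated label 0 survives into the intersection -> should error
unsafe = duplicates_in_intersection([pd.Index([0, 0, 1, 2]), pd.Index([0, 2])])
print(safe, unsafe)  # False True
```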
Should the errors here be
pandas/tests/indexes/test_setops.py (Outdated)

```
@@ -463,3 +466,33 @@ def test_setop_with_categorical(index, sort, method):
    result = getattr(index, method)(other[:5], sort=sort)
    expected = getattr(index, method)(index[:5], sort=sort)
    tm.assert_index_equal(result, expected)

@pytest.mark.parametrize("index_maker", tm.index_subclass_makers_generator())
```
use the index fixture instead
How should I handle cases where that fixture gives inappropriate input? Should I define a separate fixture for this? Can I do something like a `hypothesis.assume` to make sure some of the cases don't get through and aren't marked as skipped?
The cases which won't work here are at least:
- "empty"
- "repeats"
it's kind of ugly, but in many of these tests we do something like

```python
def test_foo(index_fixture):
    if whatever(index_fixture):
        pytest.skip()
```
(I'd be psyched to see a hypothesis-based implementation of some of our tests; need a way to generate any-valid-index)
I've switched to the fixture, but it adds a lot of skipped tests. It would be nice if I could filter this before collection, or use a fixture where I can specify some features of the construction (e.g. values are unique and it has a length of `n`).
pandas/tests/indexes/test_setops.py (Outdated)

```
    result = get_objs_combined_axis(series, intersect=True)
    tm.assert_index_equal(full[[0, 2]], result, check_order=False)
```
always use

```
result =
expected =
tm.assert_index(result, expected)
```

why is `check_order=False`?
`check_order` is false because the `MultiIndex` comes back in a different order.
More conceptually, I don't think this function promises any ordering unless you pass `sort`.
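For reference, `check_order=False` (added to `assert_index_equal` in pandas 1.2) compares the two indexes after sorting, so element order is ignored:

```python
import pandas as pd

left = pd.Index([3, 1, 2])
right = pd.Index([1, 2, 3])

# Same elements, different order: passes only with check_order=False
pd.testing.assert_index_equal(left, right, check_order=False)
```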
```
@pytest.mark.parametrize("join", ["inner", "outer"])
def test_concat_duplicates_error(index_maker, join):
    # if index_maker is tm.makeMultiIndex:
    # TODO: This generator only makes Indexs of size 4
```
what is this about?
pandas/pandas/_testing/__init__.py, lines 491 to 492 in 0976c4c:

```python
def makeMultiIndex(k=10, names=None, **kwargs):
    return MultiIndex.from_product((("foo", "bar"), (1, 2)), names=names, **kwargs)
```

This function only makes a MultiIndex with length 4, ignoring the `k` parameter.
```
@pytest.mark.parametrize("index_maker", tm.index_subclass_makers_generator())
@pytest.mark.xfail(reason="Not implemented")
```
is there an issue for this? what is this case?
I was expecting this to be allowed:

```python
import pandas as pd

result = pd.concat(
    [
        pd.Series(0, index=[0, 0, 1, 2]),
        pd.Series(1, index=[1, 2]),
    ],
    join="inner",
)
expected = pd.DataFrame({0: [0, 0], 1: [1, 1]}, index=[1, 2])
pd.testing.assert_frame_equal(result, expected)
```

Because the intersection of those indices is well defined. However, it turns out this does not work, and also doesn't work in 1.1.5. I sort of opened this issue here: #38773, but that was a more low-level issue.
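For what it's worth, a manual workaround under the assumption that aligning both series to the unique intersection is acceptable (my sketch, not part of the PR):

```python
import pandas as pd

s1 = pd.Series(0, index=[0, 0, 1, 2])
s2 = pd.Series(1, index=[1, 2])

# Compute the (unique) intersection by hand, then align both series to it.
# .loc with a list of unique labels works even on a non-unique index.
common = s1.index.unique().intersection(s2.index.unique())
result = pd.concat([s1.loc[common], s2.loc[common]], axis=1)
print(result.shape)  # (2, 2)
```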
```
@pytest.mark.parametrize("index_maker", tm.index_subclass_makers_generator())
@pytest.mark.parametrize("join", ["inner", "outer"])
def test_concat_duplicates_error(index_maker, join):
    # if index_maker is tm.makeMultiIndex:
```
add the issue number as a comment in all of the new tests
```diff
@@ -135,13 +135,20 @@ def _get_combined_index(
     indexes = _get_distinct_objs(indexes)
     if len(indexes) == 0:
         index = Index([])
-    elif len(indexes) == 1:
+    elif len(indexes) == 1 or all_indexes_same(indexes):
```
add some comments here as it is non-obvious what is happening
How about this?

```diff
-    elif len(indexes) == 1 or all_indexes_same(indexes):
+    # if unique by id or unique by value
+    elif len(indexes) == 1 or all_indexes_same(indexes):
```
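For context, `all_indexes_same` just checks that every index `.equals()` the first one; a minimal equivalent sketch (`all_indexes_same_sketch` is my name, not the pandas internal):

```python
import pandas as pd


def all_indexes_same_sketch(indexes):
    # Equivalent in spirit to pandas' all_indexes_same: True when every
    # index compares .equals() to the first (same values, same order).
    it = iter(indexes)
    first = next(it)
    return all(first.equals(rest) for rest in it)


same = all_indexes_same_sketch([pd.Index([0, 0, 1]), pd.Index([0, 0, 1])])
diff = all_indexes_same_sketch([pd.Index([0, 0, 1]), pd.Index([0, 1])])
print(same, diff)  # True False
```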
@phofl has been working on this recently, might shed some light on the intended behavior medium-long term.
In addition to your issue, #38623 is also problematic. Do you have a list of Index classes where this is not the case for union?
Put up #38834 to fix the IntervalIndex.intersection bug.
@phofl, let me know if there is an open issue I should post this to. It seems like it's most of them except

<details><summary>Code used to check</summary>

```python
import pandas as pd
import numpy as np
import pandas._testing as tm

index_makers = [
    tm.makeStringIndex,
    tm.makeIntIndex,
    tm.makeUIntIndex,
    tm.makeFloatIndex,
    tm.makeDateIndex,
    tm.makeTimedeltaIndex,
    tm.makePeriodIndex,
    tm.makeMultiIndex,
    tm.makeBoolIndex,
]

records = []
union = lambda x, y: x.union(y)
intersection = lambda x, y: x.intersection(y)
# union = lambda x, y: pd.core.indexes.api._get_combined_index([x, y], intersect=False)
# intersection = lambda x, y: pd.core.indexes.api._get_combined_index([x, y], intersect=True)
for index_maker in index_makers:
    idx1 = index_maker(k=4)
    idx2 = idx1[np.array([0, 0, 1, 2, 3])]
    assert type(idx1) == type(idx2)
    rec = {"index_type": type(idx1).__name__, "index_generator": index_maker.__name__}
    # idx_intersect = intersection(idx1, idx2)
    # rec["intersect_len"] = len(idx_intersect)
    # rec["intersect_commutative"] = idx_intersect.sort_values().equals(intersection(idx2, idx1).sort_values())
    idx_union = union(idx1, idx2)
    rec["union_len"] = len(idx_union)
    rec["union_commutative"] = idx_union.sort_values().equals(union(idx2, idx1).sort_values())
    rec["unique_len"] = len(idx_union.unique())
    records.append(rec)

print(pd.DataFrame.from_records(records).to_markdown(index=False))
```

</details>

This is using
Sorry, I said something wrong: intersection should be unique, union not.
This is what I am getting on top of my PR. The bool Index case might be fishy, but you get this on master too; if the BoolIndex is sorted it
returns
which is consistent with non-Bool duplicates on both sides like
but not consistent if both are equal.
Probably worth looking into how to avoid this when performing the outer join in union.
Why wouldn't unions have unique values? And related to the
Currently running into issues with intersections of timestamp objects. They seem to be losing their
Among others, the union case is discussed here: #31326. I think a logical interpretation of a union is that the union is the smallest index such that both indexes are part of this index. Currently the union is handled differently based on whether the indexes are monotonic or not. My PR aims to handle both cases the same and resort afterwards, which would leave us with more consistency there. The unwanted duplicates described above can be solved thereafter, I think.
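The "smallest index containing both" reading corresponds to a multiset union, where each label appears max(count in left, count in right) times; a hypothetical sketch (`multiset_union` is my illustration, not a pandas API):

```python
from collections import Counter

import pandas as pd


def multiset_union(a, b):
    # Hypothetical illustration of "the smallest index so that both
    # indexes are part of it": keep each label max(count_a, count_b) times.
    ca, cb = Counter(a), Counter(b)
    out = []
    for label in dict.fromkeys(list(a) + list(b)):  # first-seen order
        out.extend([label] * max(ca[label], cb[label]))
    return pd.Index(out)


u = multiset_union(pd.Index([0, 0, 1]), pd.Index([0, 2]))
print(list(u))  # [0, 0, 1, 2]
```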
This PR closes #36263, which is milestoned 1.2.1, whereas this PR is milestoned 1.3. What's the best way to resolve this inconsistency?
this for 1.3
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
@ivirshup closing as stale. ping if you want to continue.
```shell
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
```
Follow up to #38654, currently WIP.

It's a work in progress because it turns out union and intersection act differently for different index types. `intersection` seems mostly normal, except for `IntervalIndex` (#38743), while `union` is sometimes a set union (for `Index`) and sometimes keeps duplicates (for many `Index` subclasses, though order matters sometimes). This affects `concat` in `get_objs_combined_axis`. Once I figured out where the problems were coming from, I figured I could maybe side-step this by adding checks to `get_objs_combined_axis` for equality and uniqueness. All `reshape` checks pass on my machine, but I'm done working on this for the day, so I'll rely on CI to do the checks on the rest of the codebase.

This test could be expanded with:
- more index types (just an `object` index currently)
- more arguments (`sort`, `axis`?)

TODO:
- cases of `concat` where duplicates don't error
- DuplicateLabelErrors