Skip to content

[WIP] Test (and more fixes) for duplicate indices with concat #38745

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

ivirshup
Copy link
Contributor

@ivirshup ivirshup commented Dec 28, 2020

Follow up to #38654, currently WIP.

It's a work in progress because it turns out union and intersection act differently for different index types. intersection seems mostly normal, except for IntervalIndex #38743), while union is sometimes a set union (for Index) and sometimes keeps duplicates (for many Index subclasses, though order matters sometimes).

This effects concat in get_objs_combined_axis. Once I figured out where the problems were coming from, I figured I could maybe side step this by adding checks to get_objs_combined_axis for equality and uniqueness. All checks pass reshape on my machine, but I'm done working on this for the day, so I'll rely on CI to do the checks on the rest of the codebase.


This test could be expanded with:

  • More types of indexes (there is no object index currently)
  • Checks for commutativity, i.e. results should have set-equal columns regardless of order
  • Interactions with other arguments (sort, axis?)

TODO:

  • Move test for concat where duplicates don't error
  • Decide whether error should be DuplicateLabelErrors
  • Have error report which duplicates are used
  • Check that error reports which duplicates are used

@@ -89,6 +89,19 @@ def get_objs_combined_axis(
Index
"""
obs_idxes = [obj._get_axis(axis) for obj in objs]
if all_indexes_same(obs_idxes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the place to do at all. should be in concat itself.

ideally we want a nice error message if this happens (and the indices are not equal).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree I need to add the error message, but I do think this function is where the error should be thrown.

Every function which uses get_objs_combined_axis throws an error if there are duplicate indices. Here is a demonstration of this from pandas 1.1.5:

import numpy as np
import pandas as pd
import pandas._testing as tm

def list_of_series_constructor(*args):
    return pd.DataFrame(list(args))

def crosstab(*args):
    return pd.crosstab(*args)

def concat_series(*args):
    return pd.concat(list(args), axis=1)
    
def concat_dfs(*args):
    return pd.concat([pd.DataFrame(x) for x in args], axis=1)

def duplicated_indices_errors(func, index_makers):
    records = []
    for i, index_maker in enumerate(index_makers):
        rec = {"index_maker": index_maker.__name__}
        uniq = index_maker(k=4)
        non_uniq = uniq[[0, 0, 1, 2]]

        s_uniq = pd.Series(np.ones(len(uniq)), index=uniq)
        s_non_uniq = pd.Series(np.ones(len(non_uniq)), index=non_uniq)

        try:
            res = func(s_uniq, s_non_uniq)
        except Exception as e:
            rec["error_type"] = str(type(e))
            rec["error_message"] = str(e)
        records.append(rec)
    return records

funcs = [list_of_series_constructor, crosstab, concat_series, concat_dfs]

index_makers = [
    tm.makeStringIndex,
    tm.makeIntIndex,
    tm.makeUIntIndex,
    tm.makeFloatIndex,
    tm.makeDateIndex,
    tm.makeTimedeltaIndex,
    tm.makePeriodIndex,
    tm.makeMultiIndex,
    tm.makeBoolIndex,
]

pd.concat(
    {
        f.__name__: pd.DataFrame(duplicated_indices_errors(f, index_makers)) 
        for f in funcs
    }
).to_markdown()
index_maker error_type error_message
('list_of_series_constructor', 0) makeStringIndex <class 'pandas.errors.InvalidIndexError'> Reindexing only valid with uniquely valued Index objects
('list_of_series_constructor', 1) makeIntIndex <class 'pandas.errors.InvalidIndexError'> Reindexing only valid with uniquely valued Index objects
('list_of_series_constructor', 2) makeUIntIndex <class 'pandas.errors.InvalidIndexError'> Reindexing only valid with uniquely valued Index objects
('list_of_series_constructor', 3) makeFloatIndex <class 'pandas.errors.InvalidIndexError'> Reindexing only valid with uniquely valued Index objects
('list_of_series_constructor', 4) makeDateIndex <class 'pandas.errors.InvalidIndexError'> Reindexing only valid with uniquely valued Index objects
('list_of_series_constructor', 5) makeTimedeltaIndex <class 'pandas.errors.InvalidIndexError'> Reindexing only valid with uniquely valued Index objects
('list_of_series_constructor', 6) makePeriodIndex <class 'pandas.errors.InvalidIndexError'> Reindexing only valid with uniquely valued Index objects
('list_of_series_constructor', 7) makeMultiIndex <class 'ValueError'> Reindexing only valid with uniquely valued Index objects
('list_of_series_constructor', 8) makeBoolIndex <class 'pandas.errors.InvalidIndexError'> Reindexing only valid with uniquely valued Index objects
('crosstab', 0) makeStringIndex <class 'ValueError'> cannot reindex from a duplicate axis
('crosstab', 1) makeIntIndex <class 'ValueError'> cannot reindex from a duplicate axis
('crosstab', 2) makeUIntIndex <class 'ValueError'> cannot reindex from a duplicate axis
('crosstab', 3) makeFloatIndex <class 'ValueError'> cannot reindex from a duplicate axis
('crosstab', 4) makeDateIndex <class 'ValueError'> cannot reindex from a duplicate axis
('crosstab', 5) makeTimedeltaIndex <class 'ValueError'> cannot reindex from a duplicate axis
('crosstab', 6) makePeriodIndex <class 'ValueError'> cannot reindex from a duplicate axis
('crosstab', 7) makeMultiIndex <class 'ValueError'> cannot handle a non-unique multi-index!
('crosstab', 8) makeBoolIndex <class 'ValueError'> cannot reindex from a duplicate axis
('concat_series', 0) makeStringIndex <class 'ValueError'> cannot reindex from a duplicate axis
('concat_series', 1) makeIntIndex <class 'ValueError'> cannot reindex from a duplicate axis
('concat_series', 2) makeUIntIndex <class 'ValueError'> cannot reindex from a duplicate axis
('concat_series', 3) makeFloatIndex <class 'ValueError'> cannot reindex from a duplicate axis
('concat_series', 4) makeDateIndex <class 'ValueError'> cannot reindex from a duplicate axis
('concat_series', 5) makeTimedeltaIndex <class 'ValueError'> cannot reindex from a duplicate axis
('concat_series', 6) makePeriodIndex <class 'ValueError'> cannot reindex from a duplicate axis
('concat_series', 7) makeMultiIndex <class 'ValueError'> cannot handle a non-unique multi-index!
('concat_series', 8) makeBoolIndex <class 'ValueError'> cannot reindex from a duplicate axis
('concat_dfs', 0) makeStringIndex <class 'ValueError'> Shape of passed values is (5, 2), indices imply (4, 2)
('concat_dfs', 1) makeIntIndex <class 'ValueError'> Shape of passed values is (7, 2), indices imply (5, 2)
('concat_dfs', 2) makeUIntIndex <class 'ValueError'> Shape of passed values is (7, 2), indices imply (5, 2)
('concat_dfs', 3) makeFloatIndex <class 'ValueError'> Shape of passed values is (7, 2), indices imply (5, 2)
('concat_dfs', 4) makeDateIndex <class 'ValueError'> Shape of passed values is (7, 2), indices imply (5, 2)
('concat_dfs', 5) makeTimedeltaIndex <class 'ValueError'> Shape of passed values is (7, 2), indices imply (5, 2)
('concat_dfs', 6) makePeriodIndex <class 'ValueError'> Shape of passed values is (7, 2), indices imply (5, 2)
('concat_dfs', 7) makeMultiIndex <class 'ValueError'> cannot handle a non-unique multi-index!
('concat_dfs', 8) makeBoolIndex <class 'ValueError'> Shape of passed values is (4, 2), indices imply (2, 2)

Instead of just giving concat a better error message, all these methods could get better (and more consistent) error messages. I think it's important for the rules about unions and intersections of indices to be defined in these more internal methods so behaviour isn't defined (and reimplemented) per top level function.

Additionally, this function essentially returns nonsense for unions when there are duplicated indices since different Index types have different definitions for unions with duplicates, so I think it's appropriate to throw errors here instead of passing on those nonsense results.

This has also made me realize there is one case where it's okay for there to be duplicates, which is when indices contain duplicates, but the intersection would not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has also made me realize there is one case where it's okay for there to be duplicates, which is when indices contain duplicates, but the intersection would not.

This got a bit more complicated. idx.get_indexer(keys) throws an error if idx is non-unique, regardless of whether keys have unique indices in idx.

Interestingly enough, it does not throw an error if the idx is non-unique, but keys is empty (which is the case I had initially seen).

Examples using pandas 1.1.5:

pd.crosstab(
    pd.Series(np.arange(4), index=[0, 1, 1, 2]),
    pd.Series(np.arange(2), index=[0, 2]),
)
# ValueError: cannot reindex from a duplicate axis

pd.crosstab(
    pd.Series(np.arange(4), index=[0, 1, 1, 2]),
    pd.Series(np.arange(2), index=[3, 4]),
)
# Empty DataFrame
# Columns: []
# Index: []
Examples with pd.concat

This had the same behaviour despite not using .get_indexer directly in 1.1.5.

pd.concat(
    [
        pd.Series(np.arange(4), index=[0, 1, 1, 2]),
        pd.Series(np.arange(2), index=[0, 2]),
    ],
    axis=1,
    join="inner"
)
# ValueError: cannot reindex from a duplicate axis

pd.concat(
    [
        pd.Series(np.arange(4), index=[0, 1, 1, 2]),
        pd.Series(np.arange(2), index=[3, 4]),
    ],
    axis=1,
    join="inner"
)
# Empty DataFrame
# Columns: [0, 1]
# Index: []

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened a more general issue for this: #38797

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 28, 2020
@jreback jreback added this to the 1.3 milestone Dec 28, 2020
@pep8speaks
Copy link

pep8speaks commented Dec 29, 2020

Hello @ivirshup! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-30 03:49:30 UTC

This fixes a previous issue where `get_objs_combined_axis` would not throw and error for duplicated indices in all cases.
Now we check to make sure that the duplicated values don't show up in the intersection.
This allows safe indexing when there are duplicated values, but they don't show up in the intersection.
@ivirshup
Copy link
Contributor Author

Should the errors here be DuplicateLabelErrors?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -463,3 +466,33 @@ def test_setop_with_categorical(index, sort, method):
result = getattr(index, method)(other[:5], sort=sort)
expected = getattr(index, method)(index[:5], sort=sort)
tm.assert_index_equal(result, expected)


@pytest.mark.parametrize("index_maker", tm.index_subclass_makers_generator())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the index fixture instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should I handle cases where that fixture gives inappropriate input?

Should I define a separate fixture for this? Can I do something like a hypothesis.assume to make sure some of the cases don't get through and aren't marked as skipped?

The cases which won't work here are at least:

  • "empty"
  • "repeats"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its kind of ugly, but in many of these tests we do something like

def test_foo(index_fixture):
    if whatever(index_fixture):
        pytest.skip()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(id be psyched to see a hypothesis-based implementation of some of our tests, need a way to generate any-valid-index)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched to the fixture, but it adds a lot of skipped tests. I would be nice if I could filter this before collection, or use a fixture where I can specify some features of the construction (e.g. values are unique and it has a length of n)


result = get_objs_combined_axis(series, intersect=True)

tm.assert_index_equal(full[[0, 2]], result, check_order=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always use
result =
expected =
tm.assert_index(result, expected)

why is check_order=False?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_order is false because the MultiIndex comes back in a different order.

More conceptually, I don't think this function promises any ordering unless you pass sort.

@pytest.mark.parametrize("join", ["inner", "outer"])
def test_concat_duplicates_error(index_maker, join):
# if index_maker is tm.makeMultiIndex:
# TODO: This generator only makes Indexs of size 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def makeMultiIndex(k=10, names=None, **kwargs):
return MultiIndex.from_product((("foo", "bar"), (1, 2)), names=names, **kwargs)

This function only makes a MultiIndex with length 4, ignoring the k parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



@pytest.mark.parametrize("index_maker", tm.index_subclass_makers_generator())
@pytest.mark.xfail(reason="Not implemented")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an issue for this? what is this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting this to be allowed:

import pandas as pd

result = pd.concat(
    [
        pd.Series(0, index=[0, 0, 1, 2]),
        pd.Series(1, index=[1, 2]),
    ],
    join="inner",
)
expected = pd.DataFrame({0: [0, 0], 1: [1, 1]}, index=[1, 2])

pd.testing.assert_frame_equal(result, expected)

Because the intersection of those indices is well defined. However, it turns out this does not work, and also doesn't work in 1.1.5. I sort of opened this issue here: #38773, but that was a more low-level issue.

@pytest.mark.parametrize("index_maker", tm.index_subclass_makers_generator())
@pytest.mark.parametrize("join", ["inner", "outer"])
def test_concat_duplicates_error(index_maker, join):
# if index_maker is tm.makeMultiIndex:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment in all of the new tests

@@ -135,13 +135,20 @@ def _get_combined_index(
indexes = _get_distinct_objs(indexes)
if len(indexes) == 0:
index = Index([])
elif len(indexes) == 1:
elif len(indexes) == 1 or all_indexes_same(indexes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some comments here as it is non-obvious what is happening

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

Suggested change
elif len(indexes) == 1 or all_indexes_same(indexes):
# if unique by id or unique by value
elif len(indexes) == 1 or all_indexes_same(indexes):

@jbrockmendel
Copy link
Member

because it turns out union and intersection act differently for different index types

@phofl has been working on this recently, might shed some light on the intended behavior medium-long term

@phofl
Copy link
Member

phofl commented Dec 30, 2020

union and intersection should both be unique, but as you mentioned this may not be the case for all Index types right now.

Additionally to your issue #38623 also is problematic. Do you have a list of Index classes where this is not the case for union?

@phofl
Copy link
Member

phofl commented Dec 30, 2020

Put up #38834 to fix the IntervalIndex.intersection bug

@ivirshup
Copy link
Contributor Author

ivirshup commented Jan 4, 2021

Do you have a list of Index classes where this is not the case for union?

@phofl, let me know if there is an open issue I should post this to. It seems like it's most of them except MultiIndex (but don't know if this depends on the "inner types"). For Index it depends on the order. RangeIndex is a bit of a different case since it should always be unique – I think.

Code used to check
import pandas as pd
import numpy as np

import pandas._testing as tm

index_makers = [
    tm.makeStringIndex,
    tm.makeIntIndex,
    tm.makeUIntIndex,
    tm.makeFloatIndex,
    tm.makeDateIndex,
    tm.makeTimedeltaIndex,
    tm.makePeriodIndex,
    tm.makeMultiIndex,
    tm.makeBoolIndex,
]

records = []

union = lambda x, y: x.union(y)
intersection = lambda x, y: x.intersection(y)

# union = lambda x, y: pd.core.indexes.api._get_combined_index([x, y], intersect=False)
# intersection = lambda x, y: pd.core.indexes.api._get_combined_index([x, y], intersect=True)

for index_maker in index_makers:
    idx1 = index_maker(k=4)
    idx2 = idx1[np.array([0, 0, 1, 2, 3])]
    assert type(idx1) == type(idx2)
    rec = {"index_type": type(idx1).__name__, "index_generator": index_maker.__name__}
    
    # idx_intersect = intersection(idx1, idx2)
    # rec["intersect_len"] = len(idx_intersect)
    # rec["intersect_commutative"] = idx_intersect.sort_values().equals(intersection(idx2, idx1).sort_values())

    idx_union = union(idx1, idx2)
    rec["union_len"] = len(idx_union)
    rec["union_commutative"] = idx_union.sort_values().equals(union(idx2, idx1).sort_values())

    rec["unique_len"] = len(idx_union.unique())
    records.append(rec)

print(pd.DataFrame.from_records(records).to_markdown(index=False))
index_type index_generator union_len union_commutative unique_len
Index makeStringIndex 4 False 4
Int64Index makeIntIndex 5 True 4
UInt64Index makeUIntIndex 5 True 4
Float64Index makeFloatIndex 5 True 4
DatetimeIndex makeDateIndex 5 True 4
TimedeltaIndex makeTimedeltaIndex 5 True 4
PeriodIndex makePeriodIndex 5 True 4
MultiIndex makeMultiIndex 4 True 4
Index makeBoolIndex 4 False 2

This is using 1.3.0.dev0+218.g3a066feb08

@phofl
Copy link
Member

phofl commented Jan 4, 2021

Sorry I said something wrong, intersection should be unique, union not.
#36299 will make this consistent for every class except MultiIndex. Will look into this

| index_type     | index_generator    |   union_len | union_commutative   |   unique_len |
|:---------------|:-------------------|------------:|:--------------------|-------------:|
| Index          | makeStringIndex    |           5 | True                |            4 |
| Int64Index     | makeIntIndex       |           5 | True                |            4 |
| UInt64Index    | makeUIntIndex      |           5 | True                |            4 |
| Float64Index   | makeFloatIndex     |           5 | True                |            4 |
| DatetimeIndex  | makeDateIndex      |           5 | True                |            4 |
| TimedeltaIndex | makeTimedeltaIndex |           5 | True                |            4 |
| PeriodIndex    | makePeriodIndex    |           5 | True                |            4 |
| MultiIndex     | makeMultiIndex     |           4 | True                |            4 |
| Index          | makeBoolIndex      |           7 | True                |            2 |

This is what I am getting on top of my pr.

The bool Index case might be fishy, but you get this on master too, if the BoolIndex is sorted

idx1 = Index([False, False, False, True])
idx2 = Index([False, False, False, False, True])

idx_union = idx1.union(idx2)
idx_union

returns

Index([False, False, False, False, False, False, True], dtype='object')

which is consistent with non-Bool duplictaes on both sides like

idx1 = Index([0, 1, 1])
idx2 = Index([0, 1, 1, 2])

idx_union = idx1.union(idx2)
Int64Index([0, 1, 1, 1, 2], dtype='int64')

But not consistent if both are equal

idx1 = Index([0, 1, 1])
idx2 = Index([0, 1, 1])

idx_union = idx1.union(idx2)
Int64Index([0, 1, 1], dtype='int64')

Probably worth looking into how to avoid this when performing the outer join in union.

@ivirshup
Copy link
Contributor Author

ivirshup commented Jan 5, 2021

Why wouldn't unions have unique values?

And related to the bools: I think – as you note – most of the issues here are from bugs dealing with the order of the values. I think it would be good to add a testing strategy which permutes how duplicates occur (e.g. at the start, at the end, in the middle, next to each other, combinations of these cases) to check this.

@ivirshup
Copy link
Contributor Author

ivirshup commented Jan 5, 2021

Currently running into issues with intersections of timestamp objects. They seem to be losing their offset field sometimes, not sure what's up with this yet.

@phofl
Copy link
Member

phofl commented Jan 5, 2021

Among others the union case is discussed here: #31326

I think a logical interpretation of a union is, that the union is the smallest index, so that both indexes are part of this index.

Currently the union is handled differently based on if the indexes are monotonic or not. My pr aims to handle both cases the same and resort afterwards, this would leave us there with more consistency. The unwanted duplicates described above can be solved thereafter I think.

@simonjayhawkins
Copy link
Member

This PR closes #36263 which is milestoned 1.2.1 whereas this PR is milestoned 1.3. what's the best way to resolve this inconsistency?

@jreback
Copy link
Contributor

jreback commented Jan 6, 2021

this for 1.3

@github-actions
Copy link
Contributor

github-actions bot commented Feb 6, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Feb 6, 2021
@simonjayhawkins
Copy link
Member

@ivirshup closing as stale. ping if you want to continue.

@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone May 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.concat inconsistent with non-unique index
7 participants