Surprising and inconsistent results when adding two Series which both contain duplicate labels. #20831

Open
mdickinson opened this issue Apr 26, 2018 · 2 comments
Labels
Bug; Needs Discussion (requires discussion from core team before further action); Numeric Operations (arithmetic, comparison, and logical operations)

Comments

@mdickinson

If I add two Series objects, both of which contain the same duplicate label, I get a result in which every possible combination for that label appears. In the example below, s1 has the label b twice, s2 has the label b three times, and the result has the label b six times.

Python 3.6.5 (default, Mar 29 2018, 15:37:32) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> s1 = pd.Series(range(3), list('abb'))
>>> s2 = pd.Series(range(4), list('abbb'))
>>> s1 + s2
a    0
b    2
b    3
b    4
b    3
b    4
b    5
dtype: int64

That doesn't seem an unreasonable behaviour, but it's not applied consistently. If instead the second series has the same number of occurrences, then the data entries are added elementwise:

>>> s3 = pd.Series(range(3), list('abb'))
>>> s4 = pd.Series(range(3), list('abb'))
>>> s3 + s4
a    0
b    2
b    4
dtype: int64

That seems to make the "outer product" behaviour in the first example somewhat dangerous, because any code that depends on it is at risk of giving inconsistent results if it happens to get a dataset where the numbers of the various labels match exactly. Should the second behaviour be altered to be consistent with the first? Or should maybe the first behaviour become an error (after a suitable deprecation period)?
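
Until the behaviour is settled, one way to guard against the silent switch between the two modes is to check the indexes before adding. The sketch below is only an illustration: `add_strict` is a hypothetical helper, not part of pandas, and it assumes that refusing the combinatorial case is what the caller wants.

```python
import pandas as pd


def add_strict(left: pd.Series, right: pd.Series) -> pd.Series:
    """Hypothetical helper: add two Series, refusing the 'outer product' case."""
    # Identical indexes are added elementwise, as in the s3 + s4 example.
    if left.index.equals(right.index):
        return left + right

    # Labels that occur more than once on each side.
    dup_left = left.index[left.index.duplicated()].unique()
    dup_right = right.index[right.index.duplicated()].unique()
    shared = dup_left.intersection(dup_right)
    if len(shared) > 0:
        raise ValueError(
            f"labels duplicated in both operands: {list(shared)}"
        )
    return left + right


s1 = pd.Series(range(3), list('abb'))
s2 = pd.Series(range(4), list('abbb'))
s3 = pd.Series(range(3), list('abb'))

same = add_strict(s1, s3)      # identical indexes: elementwise
try:
    add_strict(s1, s2)         # 'b' is duplicated on both sides
except ValueError as exc:
    message = str(exc)
```

With this guard, the s1 + s2 case from the report fails loudly instead of returning six b rows, while the matching-index case is unchanged.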

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.22.0
pytest: None
pip: 10.0.0
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche
Member

@mdickinson Thanks for raising the issue. You're completely right that this is inconsistent, and thus behaviour you cannot rely upon.

Trying to guess the (historical) origin of the issue: in the case of non-identical indexes with duplicates, the full combinatorial re-alignment is indeed reasonable, and is the only thing we can do (other than raising an error). It is also consistent with indexing.
On the other hand, the fact that, when the index is identical, you don't get such a combinatorial explosion makes a DataFrame with duplicate index values somewhat more reasonable to work with.

In general, duplicate index values are not really recommended (given our default mode of aligning on all operations). But if you do have a DataFrame with duplicate values in the index, the fact that you can still, for example, sum two columns (df_with_dupe_index['c'] = df_with_dupe_index['a'] + df_with_dupe_index['b']) makes such a DataFrame somewhat workable.
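
As a small illustration of that workable case, the sketch below builds a frame with a duplicated index and sums two of its columns. The data is made up for the example; only the name `df_with_dupe_index` and the column names come from the comment above.

```python
import pandas as pd

# Made-up data; the index label 'y' appears twice.
df_with_dupe_index = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [10, 20, 30]},
    index=list('xyy'),
)

# Both columns carry the same index, so the sum stays elementwise
# and the result still has one row per original row.
df_with_dupe_index['c'] = df_with_dupe_index['a'] + df_with_dupe_index['b']
```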

So I am not sure what would be the best we can do here. Removing this "feature" to treat identical indexes with duplicate values as if they didn't have duplicate values will for sure break a lot of code.
So if we want to change something, I would rather change the first to error. But I am also not sure that is desirable in all cases.

@jorisvandenbossche added the Indexing, API Design, and Needs Discussion labels on Apr 26, 2018
@corranwebster

A possible partial solution may be to follow something like NumPy's broadcasting semantics and allow alignment on a duplicated label only when either:

  • the number of entries with that label matches exactly in both indexes; or
  • one index has a single entry with that label, which is broadcast against the multiple entries in the other.

Everything else is an error.

So in this case, everything works as it currently does when things match:

>>> s1 = pd.Series(range(5), list('abbbb'))
>>> s2 = pd.Series(range(5), list('abbbb'))
>>> s1 + s2
a    0
b    2
b    4
b    6
b    8
dtype: int64

and this is also allowed (and also matches current behaviour):

>>> s3 = pd.Series(range(5), list('abbbb'))
>>> s4 = pd.Series(range(2), list('ab'))
>>> s3 + s4
a    0
b    2
b    3
b    4
b    5
dtype: int64

as is this:

>>> s4 + s3
a    0
b    2
b    3
b    4
b    5
dtype: int64

but

>>> s5 = pd.Series(range(3), list('abb'))
>>> s6 = pd.Series(range(4), list('abbb'))
>>> s5 + s6

raises an exception, just as if you tried to add a length-3 NumPy array to a length-4 NumPy array.
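
For comparison, the NumPy behaviour referred to here can be sketched as follows. This is only an analogy for the proposal, not a description of how pandas aligns labels.

```python
import numpy as np

# A length-1 array is broadcast against a length-4 array.
broadcast = np.arange(4) + np.array([10])

# Lengths 3 and 4 are incompatible, so NumPy raises ValueError.
try:
    np.arange(3) + np.arange(4)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```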

Doing this would break backwards compatibility but would preserve some useful behaviour of the current implementation. It also has the advantage of following existing conventions for similar situations in related codebases, which allows users to reason about the behaviour.
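
A rough sketch of the rule proposed in this comment, written as a pre-check that could run before an addition. `check_broadcastable` is a hypothetical name, not a pandas API, and the sketch assumes that only labels present in both indexes need to be compatible.

```python
import pandas as pd


def check_broadcastable(left: pd.Series, right: pd.Series) -> None:
    """Hypothetical check for the proposed rule; not a pandas API."""
    left_counts = left.index.value_counts()
    right_counts = right.index.value_counts()
    # Only labels present on both sides need to be compatible.
    for label in left_counts.index.intersection(right_counts.index):
        n_left, n_right = int(left_counts[label]), int(right_counts[label])
        if n_left != n_right and 1 not in (n_left, n_right):
            raise ValueError(
                f"label {label!r} occurs {n_left} and {n_right} times"
            )


s3 = pd.Series(range(5), list('abbbb'))
s4 = pd.Series(range(2), list('ab'))
s5 = pd.Series(range(3), list('abb'))
s6 = pd.Series(range(4), list('abbb'))

check_broadcastable(s3, s3)   # equal counts: accepted
check_broadcastable(s3, s4)   # singleton 'b' on one side: accepted
try:
    check_broadcastable(s5, s6)   # 2 vs 3 occurrences of 'b'
    rejected = False
except ValueError:
    rejected = True
```

The sketch only validates label counts and leaves the addition itself to `+`; it does not change how pandas aligns the two indexes.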

@jbrockmendel added the Numeric Operations label and removed the Indexing label on Feb 13, 2020
@mroeschke added the Bug label and removed the API Design label on Jun 19, 2021

5 participants