Surprising and inconsistent results when adding two `Series` which both contain duplicate labels. #20831

mdickinson · 2018-04-26T12:21:36Z

If I add two `Series` objects that both contain the same duplicate label, I get a result in which every possible combination for that label appears. In the example below, `s1` has the label `b` twice, `s2` has the label `b` three times, and the result has the label `b` six times.

Python 3.6.5 (default, Mar 29 2018, 15:37:32) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> s1 = pd.Series(range(3), list('abb'))
>>> s2 = pd.Series(range(4), list('abbb'))
>>> s1 + s2
a    0
b    2
b    3
b    4
b    3
b    4
b    5
dtype: int64

That doesn't seem an unreasonable behaviour, but it's not applied consistently. If instead the second series has the same number of occurrences, then the data entries are added elementwise:

>>> s3 = pd.Series(range(3), list('abb'))
>>> s4 = pd.Series(range(3), list('abb'))
>>> s3 + s4
a    0
b    2
b    4
dtype: int64

That seems to make the "outer product" behaviour in the first example somewhat dangerous, because any code that depends on it is at risk of giving inconsistent results if it happens to get a dataset where the numbers of the various labels match exactly. Should the second behaviour be altered to be consistent with the first? Or should maybe the first behaviour become an error (after a suitable deprecation period)?
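The two behaviours above can be reproduced side by side with the short sketch below. The helper `might_multiply_rows` is a hypothetical name introduced here for illustration, not part of pandas. It is a deliberately conservative check: it flags any pair of operands whose indexes differ and both contain duplicates, even though rows only multiply when the same label is duplicated on both sides.

```python
import pandas as pd

s1 = pd.Series(range(3), list('abb'))
s2 = pd.Series(range(4), list('abbb'))
s3 = pd.Series(range(3), list('abb'))
s4 = pd.Series(range(3), list('abb'))

# Different indexes, both with duplicates: every pairing of 'b' appears.
combinatorial = s1 + s2

# Identical indexes: values are added position by position.
elementwise = s3 + s4


def might_multiply_rows(left, right):
    """Hypothetical guard: True if adding could take the combinatorial path.

    Conservative by design; it does not check which labels are shared.
    """
    if left.index.equals(right.index):
        return False
    return not left.index.is_unique and not right.index.is_unique


print(len(combinatorial), len(elementwise))
print(might_multiply_rows(s1, s2), might_multiply_rows(s3, s4))
```

Code that cannot tolerate the switch between the two behaviours could call a guard like this before the arithmetic and raise or deduplicate as appropriate.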

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.22.0
pytest: None
pip: 10.0.0
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jorisvandenbossche · 2018-04-26T19:11:31Z

@mdickinson Thanks for raising the issue. You're completely right that this is inconsistent, and thus behaviour you cannot rely upon.

Trying to guess the (historical) origin of the issue: in the case of non-identical index with duplicates, the full combinatorial re-alignment is indeed reasonable, and the only thing we can do (except for raising an error). It is also consistent with indexing.
On the other hand, the fact that, when the index is identical, you don't get such a combinatorial explosion makes a DataFrame with duplicate index values somewhat more reasonable to work with.

In general, duplicate index values are not really recommended (given our default mode of doing alignment on all operations). But if you have a DataFrame with duplicate values in the index, the fact that you can still, for example, sum two columns (df_with_dupe_index['c'] = df_with_dupe_index['a'] + df_with_dupe_index['b']) makes a DataFrame with duplicate index values somewhat workable.
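As a minimal illustration of that point, the sketch below builds a small DataFrame with a duplicated index label and sums two of its columns. The column values and index labels are made up for the example; only the name `df_with_dupe_index` comes from the comment above.

```python
import pandas as pd

# Illustrative data: the label 'y' appears twice in the index.
df_with_dupe_index = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [10, 20, 30]},
    index=list('xyy'),
)

# Both columns share the frame's index, so they are added row by row
# rather than being re-aligned label by label.
df_with_dupe_index['c'] = df_with_dupe_index['a'] + df_with_dupe_index['b']

print(df_with_dupe_index)
```

The frame keeps its three rows, which is the "workable" case described here.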

So I am not sure what would be the best we can do here. Removing this "feature" to treat identical indexes with duplicate values as if they didn't have duplicate values will for sure break a lot of code.
So if we want to change something, I would rather change the first to error. But I am also not sure that is desirable in all cases.

corranwebster · 2018-04-29T08:50:51Z

A possible partial solution may be to follow something like NumPy's broadcasting semantics, and allow alignment of a label that appears multiple times in either of two ways:

an exact match in the number of index entries with that label; or
multiple entries in one index matching a singleton entry in the other, by broadcasting the single value.

Everything else is an error.

So in this case, everything works as it currently does when things match:

>>> s1 = pd.Series(range(5), list('abbbb'))
>>> s2 = pd.Series(range(5), list('abbbb'))
>>> s1 + s2
a    0
b    2
b    4
b    6
b    8
dtype: int64

and this is also allowed (and also matches current behaviour):

>>> s3 = pd.Series(range(5), list('abbbb'))
>>> s4 = pd.Series(range(2), list('ab'))
>>> s3 + s4
a    0
b    2
b    3
b    4
b    5
dtype: int64

as is this:

>>> s4 + s3
a    0
b    2
b    3
b    4
b    5
dtype: int64

but

>>> s5 = pd.Series(range(3), list('abb'))
>>> s6 = pd.Series(range(4), list('abbb'))
>>> s5 + s6

raises an exception, just as if you tried to add a length-3 NumPy array to a length-4 NumPy array.

Doing this would break backwards compatibility, but would preserve some useful behaviour of the current implementation; and has the advantage of following existing conventions for similar situations in related codebases which allows users to reason about the behaviour.
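The rule proposed above could be checked along the following lines. This is only a sketch of the validation step: `labels_rejected_by_proposal` is a hypothetical helper, not a pandas function, and it does not perform the alignment itself. Matching counts alone do not make current pandas add elementwise; as far as I understand, that only happens when the two indexes are identical.

```python
import pandas as pd


def labels_rejected_by_proposal(left, right):
    """Hypothetical check for the broadcasting-style rule sketched above.

    A shared label is accepted if it occurs the same number of times in
    both indexes, or exactly once in at least one of them. Any other
    shared label is returned as rejected.
    """
    left_counts = left.index.value_counts()
    right_counts = right.index.value_counts()
    rejected = []
    for label in left_counts.index.intersection(right_counts.index):
        n_left = left_counts[label]
        n_right = right_counts[label]
        if n_left != n_right and 1 not in (n_left, n_right):
            rejected.append(label)
    return rejected


s3 = pd.Series(range(5), list('abbbb'))
s4 = pd.Series(range(2), list('ab'))
s5 = pd.Series(range(3), list('abb'))
s6 = pd.Series(range(4), list('abbb'))

print(labels_rejected_by_proposal(s3, s4))  # singleton 'b' would broadcast
print(labels_rejected_by_proposal(s5, s6))  # 2 vs 3 occurrences of 'b'
```

Under the proposal, a non-empty result would correspond to the case that raises an exception.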

jorisvandenbossche added the Indexing, API Design, and Needs Discussion labels on Apr 26, 2018

jbrockmendel added the Numeric Operations label and removed the Indexing label on Feb 13, 2020

mroeschke added the Bug label and removed the API Design label on Jun 19, 2021

blazespinnaker mentioned this issue on Nov 17, 2022, in "ENH: ignore_index for Series corr" #49617
