
BUG: Subtracting two series with unordered index and all-nan index produces unexpected result #38439

Closed
3 tasks done
ssche opened this issue Dec 13, 2020 · 7 comments · Fixed by #45900
Labels: good first issue; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); MultiIndex; Needs Tests (unit test(s) needed to prevent regressions); Numeric Operations (arithmetic, comparison, and logical operations)

Comments

@ssche
Contributor

ssche commented Dec 13, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

>>> import pandas as pd
>>> import numpy as np
>>> a_index = pd.MultiIndex.from_tuples([
...   (81.0,  np.nan, '2018-06-01'),
...   (81.0,  np.nan, '2018-07-01'),
...   (82.0,  np.nan, '2018-07-01'),
...   (82.0,  np.nan, '2018-08-01'),
...   (np.nan,np.nan, np.nan)],
...    names=['id', 'sub_ix', 'data']
... )
>>> a_values = [25, 22, 20, 21, np.nan]
>>> b_index = pd.MultiIndex.from_tuples([
...   (81.0, np.nan,  '2018-06-01'),
...   (np.nan, np.nan, np.nan),
...   (81.0, np.nan,  '2018-07-01'),
...   (82.0, np.nan,  '2018-07-01'),
...   (82.0, np.nan,  '2018-08-01')],
...   names=['id', 'sub_ix', 'data']
... )
>>> b_values = [28.28, np.nan, 28.28, 25.25, 25.25]
>>> a = pd.Series(a_values, index=a_index)
>>> b = pd.Series(b_values, index=b_index)


>>> a
id    sub_ix  data      
81.0  NaN     2018-06-01    25.0
              2018-07-01    22.0
82.0  NaN     2018-07-01    20.0
              2018-08-01    21.0
NaN   NaN     NaN            NaN
dtype: float64


>>> b
id    sub_ix  data      
81.0  NaN     2018-06-01    28.28
NaN   NaN     NaN             NaN
81.0  NaN     2018-07-01    28.28
82.0  NaN     2018-07-01    25.25
              2018-08-01    25.25
dtype: float64



>>> a - b
id    sub_ix  data      
81.0  NaN     2018-06-01   -3.28
              2018-07-01     NaN <-- this shouldn't be NaN, the index (81.0, NaN, 2018-07-01) exists in both `a` and `b` (it's just not ordered in `b`)
82.0  NaN     2018-07-01   -8.28 <-- also wrong, expected -5.25
              2018-08-01   -4.25
NaN   NaN     NaN            NaN
dtype: float64
>>> a - b.sort_index()
id    sub_ix  data      
81.0  NaN     2018-06-01   -3.28
              2018-07-01   -6.28 <-- expected value
82.0  NaN     2018-07-01   -5.25
              2018-08-01   -4.25
NaN   NaN     NaN            NaN
dtype: float64

Problem description

When two series share the same index labels but contain an all-NaN index row at different positions, arithmetic operations (+, -, /) produce unexpected results. Sorting both indices (series.sort_index) works around the issue. I also tried a different example with unordered indices but without the all-NaN index row, and the result was as expected, so unsorted indices alone are not the cause.

>>> a
id    sub_ix  data      
81.0  NaN     2018-07-01    25
              2018-06-01    22
82.0  NaN     2018-07-01    20
              2018-08-01    21
dtype: int64
>>> b
id    sub_ix  data      
81.0  NaN     2018-06-01    1
              2018-07-01    2
82.0  NaN     2018-07-01    3
              2018-08-01    4
dtype: int64
>>> a - b
id    sub_ix  data      
81.0  NaN     2018-06-01    21
              2018-07-01    23
82.0  NaN     2018-07-01    17
              2018-08-01    17
dtype: int64
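
The construction of `a` and `b` for this second example is not shown above. A possible reconstruction, assuming the series were built the same way as in the first example (the tuples and values below are read off the printed output, so treat them as an assumption), could look like this:

```python
import numpy as np
import pandas as pd

names = ["id", "sub_ix", "data"]

# Assumed reconstruction: both indices hold the same labels, but `a` lists
# the 2018-07-01 row before the 2018-06-01 row. There is no all-NaN row.
a = pd.Series(
    [25, 22, 20, 21],
    index=pd.MultiIndex.from_tuples(
        [
            (81.0, np.nan, "2018-07-01"),
            (81.0, np.nan, "2018-06-01"),
            (82.0, np.nan, "2018-07-01"),
            (82.0, np.nan, "2018-08-01"),
        ],
        names=names,
    ),
)
b = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_tuples(
        [
            (81.0, np.nan, "2018-06-01"),
            (81.0, np.nan, "2018-07-01"),
            (82.0, np.nan, "2018-07-01"),
            (82.0, np.nan, "2018-08-01"),
        ],
        names=names,
    ),
)

# Arithmetic aligns on index labels, so the differing row order in `a`
# is not expected to affect the values that get paired up.
result = a - b
print(result)
```

The row order of the printed result may differ between pandas versions; the point is only which values are subtracted from which.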

Expected Output

Operands should be aligned by index label, even when the index contains all-NaN rows.
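
Until alignment behaves this way, the sorting workaround from the problem description can be wrapped in a small helper. This is only a sketch: `sub_sorted` is a hypothetical name, not part of pandas, and it assumes that both indices sort into the same order, as they do for the series in this report.

```python
import numpy as np
import pandas as pd


def sub_sorted(left: pd.Series, right: pd.Series) -> pd.Series:
    """Subtract two series after sorting both indices.

    Hypothetical helper illustrating the workaround; not a pandas API.
    """
    return left.sort_index() - right.sort_index()


names = ["id", "sub_ix", "data"]

# Data from the report: same labels, all-NaN row at different positions.
a = pd.Series(
    [25, 22, 20, 21, np.nan],
    index=pd.MultiIndex.from_tuples(
        [
            (81.0, np.nan, "2018-06-01"),
            (81.0, np.nan, "2018-07-01"),
            (82.0, np.nan, "2018-07-01"),
            (82.0, np.nan, "2018-08-01"),
            (np.nan, np.nan, np.nan),
        ],
        names=names,
    ),
)
b = pd.Series(
    [28.28, np.nan, 28.28, 25.25, 25.25],
    index=pd.MultiIndex.from_tuples(
        [
            (81.0, np.nan, "2018-06-01"),
            (np.nan, np.nan, np.nan),
            (81.0, np.nan, "2018-07-01"),
            (82.0, np.nan, "2018-07-01"),
            (82.0, np.nan, "2018-08-01"),
        ],
        names=names,
    ),
)

result = sub_sorted(a, b)
print(result)
```

Sorting both operands, rather than only `b`, avoids relying on `a` already being in sorted order.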

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 65f0463
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.9.11-200.fc33.x86_64
Version : #1 SMP Tue Nov 24 18:18:01 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.2.0.dev0+1441.g65f0463d3
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.2
Cython : 0.29.21
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : 1.8.6
pandas_gbq : None
pyarrow : 1.0.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

@ssche added the Bug and Needs Triage labels Dec 13, 2020
@simonjayhawkins
Member

Thanks @ssche for the report. Tested with 0.25.3, which gives the same output as master.

@simonjayhawkins added the Missing-data label and removed the Needs Triage label Dec 13, 2020
@simonjayhawkins added this to the Contributions Welcome milestone Dec 13, 2020
@simonjayhawkins added the Numeric Operations label Dec 13, 2020
@phofl
Member

phofl commented Dec 15, 2020

This is a bug in MultiIndex.equals, which may hide a bug in Series.align.
The following two snippets behave differently:

import numpy as np
import pandas as pd

a_index = pd.MultiIndex.from_tuples([(81.0, np.nan), (np.nan, np.nan)])
b_index = pd.MultiIndex.from_tuples([(np.nan, np.nan), (82.0, np.nan)])
ser = pd.Series([1, 2], index=a_index)
ser1 = pd.Series([1, 2], index=b_index)
ser.align(ser1)

returns

(81.0  NaN    1
NaN   NaN    2
      NaN    2
82.0  NaN    2
dtype: int64, 81.0  NaN    1
NaN   NaN    1
      NaN    1
82.0  NaN    2
dtype: int64)

while

a_index = pd.MultiIndex.from_tuples([(81.0, np.nan), (np.nan, np.nan)])
b_index = pd.MultiIndex.from_tuples([(np.nan, np.nan), (81.0, np.nan)])
ser = pd.Series([1, 2], index=a_index)
ser1 = pd.Series([1, 2], index=b_index)
ser.align(ser1)

returns

(81.0  NaN    1
NaN   NaN    2
dtype: int64, NaN   NaN    1
81.0  NaN    2
dtype: int64)

i.e. it does not duplicate the NaN rows. I think the second one is the correct one?
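
To look at `MultiIndex.equals` directly rather than through `Series.align`, one could compare the indices from the second snippet and contrast them with a NaN-free control. This is a minimal sketch: the control indices are made up for illustration, and no output is claimed for the NaN case because it may depend on the pandas version.

```python
import numpy as np
import pandas as pd

# NaN-free control (illustrative values): same labels in a different order.
# Index.equals takes element order into account, so this is False.
left = pd.MultiIndex.from_tuples([(1, "a"), (2, "b")])
right = pd.MultiIndex.from_tuples([(2, "b"), (1, "a")])
control = left.equals(right)

# The same check on the NaN-containing indices from the second snippet.
a_index = pd.MultiIndex.from_tuples([(81.0, np.nan), (np.nan, np.nan)])
b_index = pd.MultiIndex.from_tuples([(np.nan, np.nan), (81.0, np.nan)])
with_nan = a_index.equals(b_index)

print("NaN-free, reordered:", control)
print("with NaN, reordered:", with_nan)
```

If `with_nan` came back `True` for these reordered indices, arithmetic would presumably skip alignment, which would be consistent with the mispaired values in the original report.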

@mroeschke
Member

The example above seems to be consistent now. It could use a test.

In [6]: a_index = pd.MultiIndex.from_tuples([(81.0,  np.nan), (np.nan,np.nan)])
   ...: b_index = pd.MultiIndex.from_tuples([(np.nan, np.nan),(82.0, np.nan)])
   ...: ser = Series([1, 2], index=a_index)
   ...: ser1 = Series([1, 2], index=b_index)
   ...: ser.align(ser1)
Out[6]:
(81.0  NaN    1
 NaN   NaN    2
 dtype: int64,
 81.0  NaN    1
 NaN   NaN    1
 dtype: int64)

In [7]: a_index = pd.MultiIndex.from_tuples([(81.0,  np.nan), (np.nan,np.nan)])
   ...: b_index = pd.MultiIndex.from_tuples([(np.nan, np.nan),(81.0, np.nan)])
   ...: ser = Series([1, 2], index=a_index)
   ...: ser1 = Series([1, 2], index=b_index)
   ...: ser.align(ser1)
Out[7]:
(81.0  NaN    1
 NaN   NaN    2
 dtype: int64,
 81.0  NaN    2
 NaN   NaN    1
 dtype: int64)

@mroeschke added the good first issue and Needs Tests labels and removed the Bug label Aug 14, 2021
@trevorkask
Contributor

take

@trevorkask
Contributor

I tested the code, and the assertion passes without errors:

In[1]
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from pandas.testing import assert_series_equal
a_index = pd.MultiIndex.from_tuples([(81.0,  np.nan), (np.nan,np.nan)])
b_index = pd.MultiIndex.from_tuples([(np.nan, np.nan),(82.0, np.nan)])
ser = Series([1, 2], index=a_index)
ser1 = Series([1, 2], index=b_index)
ser.align(ser1)

a_index = pd.MultiIndex.from_tuples([(81.0,  np.nan), (np.nan,np.nan)])
b_index = pd.MultiIndex.from_tuples([(np.nan, np.nan),(81.0, np.nan)])
ser = Series([1, 2], index=a_index)
ser1 = Series([1, 2], index=b_index)
ser.align(ser1)

a = pd.Series([1, 2, 3, 4])
b = pd.Series([1, 2, 3, 4])
diff_index=a_index
diff = Series([0,0], index=diff_index)

print(assert_series_equal(ser-ser1,diff))

Out[2]: None

@ssche
Contributor Author

ssche commented Sep 2, 2021

@trevorkask Great news, and thanks for sharing your results. I think the intent now is to add a test case (your test case above) to lock in the behaviour in pandas. Do you need assistance?

@trevorkask
Contributor

Yes, please
