MultiIndex.from_product() casts float to int when corresponding int is also present #19432

Open
toobaz opened this issue Jan 28, 2018 · 13 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions MultiIndex

Comments

@toobaz
Member

toobaz commented Jan 28, 2018

Code Sample, a copy-pastable example if possible

In [2]: pd.MultiIndex.from_product([pd.Index([1, 1.], dtype=object), range(3)]).values
Out[2]: array([(1, 0), (1, 1), (1, 2), (1, 0), (1, 1), (1, 2)], dtype=object)

In [3]: pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)], labels=[[0,1], [1,1]]
   ...: )
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-a2ad432a7b0f> in <module>()
----> 1 pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)], labels=[[0,1], [1,1]])

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in __new__(cls, levels, labels, sortorder, names, copy, verify_integrity, _set_identity, name, **kwargs)
    238 
    239         if verify_integrity:
--> 240             result._verify_integrity()
    241         if _set_identity:
    242             result._reset_identity()

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _verify_integrity(self, labels, levels)
    281                                  "level {level}".format(
    282                                      values=[value for value in level],
--> 283                                      level=i))
    284 
    285     @property

ValueError: Level values must be unique: [1, 1.0] on level 0

In [4]: pd.Index([1, 1.], dtype=object)
Out[4]: Index([1, 1.0], dtype='object')

Problem

If a flat (object) Index allows us to distinguish 1 and 1., MultiIndex should do the same. From #18913 (comment)

Expected Output

Both the first two examples should return a MultiIndex containing both 1 and 1.0 in its first level.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 8cbee35
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.23.0.dev0+182.g8cbee356d.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@toobaz
Member Author

toobaz commented Jan 30, 2018

In [2]: pd.Index([1, 1.], dtype=object).is_unique
Out[2]: False

... which is correct (1 == 1.), and means that the error message about unicity is correct. So the wrong thing is that the other call is automatically casting to int.
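
To make the point above concrete: in plain Python, 1 and 1.0 compare equal and hash identically, so any hash-based uniqueness check sees a single value. A minimal sketch follows; the `is_unique` helper is hypothetical and only approximates such a check, it is not pandas code.

```python
# Hypothetical helper: a hash-based uniqueness check, for illustration only.
def is_unique(values):
    return len(set(values)) == len(values)


print(1 == 1.0)              # True: int and float compare equal
print(hash(1) == hash(1.0))  # True: equal numbers hash equally
print(is_unique([1, 1.0]))   # False
print(is_unique([1, 2.0]))   # True
```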

@toobaz toobaz changed the title from "Impossible to initialize MultiIndex with int and corresponding float" to "MultiIndex.from_product() casts float to int when corresponding int is also present" Jan 30, 2018
@toobaz toobaz added Dtype Conversions Unexpected or buggy dtype conversions MultiIndex labels Mar 30, 2018
@simonjayhawkins
Member

and means that the error message about unicity is correct

so does this mean that the second example is actually giving the expected output?

and that to initialize the MultiIndex you should add verify_integrity=False:

pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)],
              labels=[[0,1], [1,1]], verify_integrity=False)

which gives a MultiIndex containing both 1 and 1.0 in its first level.

MultiIndex(levels=[[1, 1.0], [0, 1, 2]], labels=[[0, 1], [1, 1]])

@simonjayhawkins
Member

if we, for the moment, ignore the float from the problem description and 'manually' create a MultiIndex using pd.Index([1, 1], dtype=object) for the first level:

>>> pd.MultiIndex(levels=[pd.Index([1, 1], dtype=object), range(3)],
...               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

we get:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "...\pandas\core\indexes\multi.py", line 242, in __new__
    result._verify_integrity()
  File "...\pandas\core\indexes\multi.py", line 285, in _verify_integrity
    level=i))
ValueError: Level values must be unique: [1, 1] on level 0

which makes sense, so we add verify_integrity=False to get the expected output:

>>> pd.MultiIndex(levels=[pd.Index([1, 1], dtype=object), range(3)],
...               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]], verify_integrity=False)
MultiIndex(levels=[[1, 1], [0, 1, 2]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
>>>

if we now try to re-create this using MultiIndex.from_product(), we get:

>>> pd.MultiIndex.from_product([pd.Index([1, 1], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

which is not the same as the output from either of the two previous cases!

since the from_product() method does not have a verify_integrity parameter, the expected output would be the ValueError: Level values must be unique: [1, 1] on level 0 since this is the default for pd.MultiIndex

so going back to the original issue, it appears it is not related to the input containing a float and that the expected output from the first example in the issue description should actually be:

ValueError: Level values must be unique: [1, 1.] on level 0

i think this then raises the question: Should MultiIndex.from_product() have a verify_integrity parameter?
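
As a rough illustration of what such a check could look like from the caller's side, here is a sketch of a pre-check to run before calling from_product(). The function name `check_levels_unique` is hypothetical (nothing like it exists in pandas), the message only mimics the one shown above, and the values are assumed to be hashable.

```python
def check_levels_unique(iterables):
    """Raise ValueError if any iterable holds values that compare equal.

    Hypothetical pre-check, not part of pandas; the message only mimics
    the one raised by the MultiIndex integrity check.
    """
    for i, iterable in enumerate(iterables):
        values = list(iterable)
        # A set collapses values that are equal and hash equally (1 and 1.0).
        if len(set(values)) != len(values):
            raise ValueError(
                "Level values must be unique: {values} on level {level}".format(
                    values=values, level=i))


check_levels_unique([[1, 2], range(3)])  # passes silently

try:
    check_levels_unique([[1, 1.0], range(3)])
except ValueError as err:
    print(err)  # Level values must be unique: [1, 1.0] on level 0
```

Whether from_product() itself should raise, or silently deduplicate as it does now, is the open question here; the sketch only shows that the check is cheap to do up front.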

@toobaz
Member Author

toobaz commented Sep 5, 2018

since the from_product() method does not have a verify_integrity parameter, the expected output would be the ValueError: Level values must be unique: [1, 1] on level 0 since this is the default for pd.MultiIndex

I disagree: the from_product docs (and intuition) just refer to the "cartesian product of iterables", not in any way to the underlying levels.

Vice-versa, when you do pd.MultiIndex(levels=...) you are clearly passing levels, so it is OK to check unicity and raise.

But indeed the problem is more subtle than I thought: ideally, we would want pd.Index([1, 1.], dtype=object).is_unique to return True, but it's maybe too late to change. So assuming that it does return False, and that MultiIndex levels must be unique, we can't have both an int and its float representation in a same MultiIndex level.

The only doubt then is whether we should favour the float, rather than int, representation, given that for instance pd.Index([1, 1.]) gives a Float64Index.

@simonjayhawkins
Member

... and that MultiIndex levels must be unique, we can't have both an int and its float representation in a same MultiIndex level.

...when using from_product()

The only doubt then is whether we should favour the float, rather than int, representation, given that for instance pd.Index([1, 1.]) gives a Float64Index.

swapping the order of the float and int gives a float for the first level:

>>> pd.MultiIndex.from_product([pd.Index([1, 1.], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([pd.Index([1., 1], dtype=object), range(3)])
MultiIndex(levels=[[1.0], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

so it appears not to be a casting issue as the issue title suggests?
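
The order dependence is what plain hash-based deduplication does: of several values that compare equal, the first one seen is kept as the representative and nothing is converted. A small pure-Python sketch of that behaviour follows; `first_seen` is a hypothetical helper and makes no claim about how pandas implements it.

```python
def first_seen(values):
    # dict keys are deduplicated by == and hash; the first key inserted is kept
    return list(dict.fromkeys(values))


print(first_seen([1, 1.0]))   # [1]
print(first_seen([1.0, 1]))   # [1.0]
print(first_seen([True, 1]))  # [True]
```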

@simonjayhawkins
Member

... and that MultiIndex levels must be unique ...

Indeed, according to the documentation for both pandas.MultiIndex and pandas.MultiIndex.from_product

and yet in the non-float example, i passed a non-unique iterable as the first iterable and got a result instead of a value error:

>>> pd.Index([1, 1], dtype=object).is_unique
False
>>> pd.MultiIndex.from_product([pd.Index([1, 1], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

@simonjayhawkins
Member

I think it is also worth noting that:

>>>
>>> import pandas as pd
>>>
>>> pd.MultiIndex.from_product([[1, 1], range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[1, True], range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[1.0, True], range(3)])
MultiIndex(levels=[[1.0], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[True, 1], range(3)])
MultiIndex(levels=[[True], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

are probably not giving the expected output either.

@simonjayhawkins
Member

which could result in:

>>>
>>> a = 19998989890
>>> b = 19998989889 +1
>>> a is b
False
>>> a == b
True
>>> pd.MultiIndex.from_product([[a,b], range(3)])
MultiIndex(levels=[[19998989890], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

@simonjayhawkins
Member

from pandas.core.arrays.categorical import _factorize_from_iterables
from pandas.core.reshape.util import cartesian_product

if not is_list_like(iterables):
    raise TypeError("Input must be a list / sequence of iterables.")
elif is_iterator(iterables):
    iterables = list(iterables)

labels, levels = _factorize_from_iterables(iterables)
labels = cartesian_product(labels)
return MultiIndex(levels, labels, sortorder=sortorder, names=names)
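
The body above has two steps: factorize each iterable into labels and levels, then expand the labels into every combination. The second step can be sketched in pure Python as below. `cartesian_codes` is a hypothetical stand-in for cartesian_product, working on plain lists rather than NumPy arrays.

```python
from itertools import product


def cartesian_codes(codes):
    """Expand per-level codes into one code list per level, covering every combination."""
    combos = list(product(*codes))
    # Transpose the list of combinations back into one list per level.
    return [list(column) for column in zip(*combos)]


codes = [[0, 0], [0, 1, 2]]
print(cartesian_codes(codes))
# [[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]]
```

Note that the deduplication has already happened by this point: both entries of the first level carry code 0, so the product step cannot tell them apart.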

>>>
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
>>>
>>> labels, levels =_factorize_from_iterables([[1, True], range(3)])
>>> labels
[array([0, 0], dtype=int8), array([0, 1, 2], dtype=int8)]
>>> levels
[Int64Index([1], dtype='int64'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>> from pandas.core.reshape.util import cartesian_product
>>>
>>> labels = cartesian_product(labels)
>>> labels
[array([0, 0, 0, 0, 0, 0], dtype=int8), array([0, 1, 2, 0, 1, 2], dtype=int8)]
>>>
>>> pd.MultiIndex(levels, labels)
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

it appears that from_product() would need to use a different implementation of _factorize_from_iterables
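
One possible shape for such an alternative, purely as a sketch: key the lookup on the value together with its type, so that 1, 1.0 and True are no longer merged. `factorize_by_type` is a hypothetical name and this is not a proposal for the actual pandas internals.

```python
def factorize_by_type(values):
    """Return (codes, uniques), treating values of different types as distinct.

    Hypothetical alternative, not pandas code: the lookup key pairs each
    value with its type, so 1, 1.0 and True no longer collapse.
    """
    positions = {}
    uniques = []
    codes = []
    for value in values:
        key = (type(value), value)
        if key not in positions:
            positions[key] = len(uniques)
            uniques.append(value)
        codes.append(positions[key])
    return codes, uniques


print(factorize_by_type([1, 1.0, True, 1]))
# ([0, 1, 2, 0], [1, 1.0, True])
```

The resulting uniques still contain values that compare equal, so any uniqueness check based on == would reject them as levels.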

@simonjayhawkins
Member

simonjayhawkins commented Sep 5, 2018

>>> import pandas as pd
>>> from pandas.core.arrays.categorical import _factorize_from_iterables

use 3 in the first iterable so that the objects do not compare equal

>>> labels, levels =_factorize_from_iterables([[3, True], range(3)])
>>> labels
[array([1, 0], dtype=int8), array([0, 1, 2], dtype=int8)]
>>> levels
[Index([True, 3], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]

change the 3 back to a 1 so that the first iterable has different objects which compare equal

>>> levels = [pd.Index([1, True], dtype='object'), pd.Int64Index([0, 1, 2], dtype='int64')]
>>> levels
[Index([1, True], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>> from pandas.core.reshape.util import cartesian_product
>>> labels = cartesian_product(labels)
>>> labels
[array([1, 1, 1, 0, 0, 0], dtype=int8), array([0, 1, 2, 0, 1, 2], dtype=int8)]
>>> pd.MultiIndex(levels, labels, verify_integrity=False)
MultiIndex(levels=[[1, True], [0, 1, 2]],
           labels=[[1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2]])

which is the expected output?

changing _factorize_from_iterables alone would give a value error unless MultiIndex is called with verify_integrity=False

>>> pd.MultiIndex(levels, labels)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\indexes\multi.py", line 242, in __new__
    result._verify_integrity()
  File "C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\indexes\multi.py", line 285, in _verify_integrity
    level=i))
ValueError: Level values must be unique: [1, True] on level 0

@simonjayhawkins
Member

MultiIndex.from_product() casts float to int when corresponding int is also present

it depends on the ordering:

>>>
>>> import pandas as pd
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
>>> labels, levels =_factorize_from_iterables([pd.Index([1, 1., 2., 2], dtype=object), range(3)])
>>> levels
[Index([1, 2.0], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]

and the index type is unchanged:

>>> labels, levels =_factorize_from_iterables([pd.Index([1, 1.], dtype=object), range(3)])
>>> levels
[Index([1], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>> labels, levels =_factorize_from_iterables([pd.Index([1., 1], dtype=object), range(3)])
>>> levels
[Index([1.0], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>>

if a list is passed as the first iterable instead of an index object, then the int is cast to a float, not the float cast to an int:

>>>
>>> labels, levels =_factorize_from_iterables([[1., 1], range(3)])
>>> levels
[Float64Index([1.0], dtype='float64'), Int64Index([0, 1, 2], dtype='int64')]
>>> labels, levels =_factorize_from_iterables([[1, 1.], range(3)])
>>> levels
[Float64Index([1.0], dtype='float64'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>>
>>> pd.MultiIndex.from_product([[1, 1., 2., 2], range(3)])
MultiIndex(levels=[[1.0, 2.0], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>>
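
A plausible reading of the list-versus-Index difference, offered as an assumption rather than something verified in the source: a plain list goes through dtype inference first, which turns [1, 1.] into floats before any deduplication, whereas an object Index keeps the original Python objects. The public pd.Index constructor shows the same two behaviours:

```python
import pandas as pd

inferred = pd.Index([1, 1.0])                 # dtype inferred from the list
as_object = pd.Index([1, 1.0], dtype=object)  # dtype forced to object

print(inferred.dtype)                # float64
print(as_object.dtype)               # object
print([type(v) for v in as_object])  # [<class 'int'>, <class 'float'>]
```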

@simonjayhawkins
Member

if a list is passed as the first iterable instead of an index object, then the int is cast to a float, not the float cast to an int:

unless the list also contains booleans and then it depends on the ordering again:

>>>
>>>
>>> pd.MultiIndex.from_product([pd.Index([1, 1., 2., 2, True, False], dtype=object), range(3)])
MultiIndex(levels=[[False, 1, 2.0], [0, 1, 2]],
           labels=[[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> pd.MultiIndex.from_product([[1, 1., 2., 2, True, False], range(3)])
MultiIndex(levels=[[False, 1, 2.0], [0, 1, 2]],
           labels=[[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>>
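
The first level in the output above is what you get from keeping the first-seen representative of each group of equal values and then sorting them. The sketch below reproduces it in pure Python; `factorize_sorted` is a hypothetical helper and is not the pandas implementation.

```python
def factorize_sorted(values):
    """Return (codes, levels): first-seen uniques, sorted, with matching codes.

    Illustrative only; this is not the pandas implementation.
    """
    levels = sorted(dict.fromkeys(values))  # first-seen representatives, sorted
    position = {level: i for i, level in enumerate(levels)}
    codes = [position[value] for value in values]  # equal values share a code
    return codes, levels


codes, levels = factorize_sorted([1, 1.0, 2.0, 2, True, False])
print(levels)  # [False, 1, 2.0]
print(codes)   # [1, 1, 2, 2, 1, 0]
```

Here 1 survives because it precedes 1.0 and True, 2.0 survives because it precedes 2, and False sorts first because it equals 0.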

@Paradox456

Paradox456 commented Jul 24, 2024

I believe you should use pd.MultiIndex.from_product, since creating a pd.MultiIndex directly with object-dtype levels treats seemingly duplicate values (e.g., 1 and 1.0) as errors, whereas from_product works.

Out[2]: array([(1, 0), (1, 1), (1, 2), (1, 0), (1, 1), (1, 2)], dtype=object)

In [3]: pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)], labels=[[0,1], [1,1]]
   ...: )


3 participants