MultiIndex.from_product() casts float to int when corresponding int is also present #19432

toobaz · 2018-01-28T16:27:05Z

Code Sample, a copy-pastable example if possible

In [2]: pd.MultiIndex.from_product([pd.Index([1, 1.], dtype=object), range(3)]).values
Out[2]: array([(1, 0), (1, 1), (1, 2), (1, 0), (1, 1), (1, 2)], dtype=object)

In [3]: pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)], labels=[[0,1], [1,1]]
   ...: )
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-a2ad432a7b0f> in <module>()
----> 1 pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)], labels=[[0,1], [1,1]])

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in __new__(cls, levels, labels, sortorder, names, copy, verify_integrity, _set_identity, name, **kwargs)
    238 
    239         if verify_integrity:
--> 240             result._verify_integrity()
    241         if _set_identity:
    242             result._reset_identity()

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _verify_integrity(self, labels, levels)
    281                                  "level {level}".format(
    282                                      values=[value for value in level],
--> 283                                      level=i))
    284 
    285     @property

ValueError: Level values must be unique: [1, 1.0] on level 0

In [4]: pd.Index([1, 1.], dtype=object)
Out[4]: Index([1, 1.0], dtype='object')

Problem

If a flat (object) Index allows us to distinguish 1 and 1., a MultiIndex should do the same. From #18913 (comment)

Expected Output

Both the first two examples should return a MultiIndex containing both 1 and 1.0 in its first level.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: 8cbee35
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.23.0.dev0+182.g8cbee356d.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1


toobaz · 2018-01-30T05:21:15Z

In [2]: pd.Index([1, 1.], dtype=object).is_unique
Out[2]: False

... which is correct (1 == 1.), and means that the error message about unicity is correct. So the wrong thing is that the other call is automatically casting to int.
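The equality behind this can be checked in plain Python, without pandas. A minimal sketch, where `is_unique_by_equality` is a hypothetical helper standing in for an equality-based uniqueness check (it is not the pandas implementation):

```python
# 1 and 1.0 are different objects of different types...
assert type(1) is int and type(1.0) is float

# ...but they compare equal and hash identically, so any hash-based
# uniqueness check sees them as a single value.
assert 1 == 1.0
assert hash(1) == hash(1.0)


def is_unique_by_equality(values):
    """Hypothetical helper: True if no two values compare equal."""
    return len(set(values)) == len(values)


print(is_unique_by_equality([1, 1.0]))  # False
print(is_unique_by_equality([1, 2.0]))  # True
```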

simonjayhawkins · 2018-09-04T22:18:28Z

and means that the error message about unicity is correct

so does this mean that the second example is actually giving the expected output?

and that to initialize the MultiIndex you should add verify_integrity=False:

pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)],
              labels=[[0,1], [1,1]], verify_integrity=False)

which gives a MultiIndex containing both 1 and 1.0 in its first level.

MultiIndex(levels=[[1, 1.0], [0, 1, 2]], labels=[[0, 1], [1, 1]])

simonjayhawkins · 2018-09-05T11:26:15Z

if we, for the moment, ignore the float from the problem description and 'manually' create a MultiIndex using pd.Index([1, 1], dtype=object) for the first level:

>>> pd.MultiIndex(levels=[pd.Index([1, 1], dtype=object), range(3)],
...               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

we get:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "...\pandas\core\indexes\multi.py", line 242, in __new__
    result._verify_integrity()
  File "...\pandas\core\indexes\multi.py", line 285, in _verify_integrity
    level=i))
ValueError: Level values must be unique: [1, 1] on level 0

which makes sense, so we add verify_integrity=False to get the expected output:

>>> pd.MultiIndex(levels=[pd.Index([1, 1], dtype=object), range(3)],
...               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]], verify_integrity=False)
MultiIndex(levels=[[1, 1], [0, 1, 2]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
>>>

if we now try to re-create this using MultiIndex.from_product(), we get:

>>> pd.MultiIndex.from_product([pd.Index([1, 1], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

which is not the same as the output from either of the two previous cases!

since the from_product() method does not have a verify_integrity parameter, the expected output would be the ValueError: Level values must be unique: [1, 1] on level 0 since this is the default for pd.MultiIndex

so going back to the original issue, it appears it is not related to the input containing a float and that the expected output from the first example in the issue description should actually be:

ValueError: Level values must be unique: [1, 1.] on level 0

i think this then raises the question: Should MultiIndex.from_product() have a verify_integrity parameter?

toobaz · 2018-09-05T12:12:47Z

since the from_product() method does not have a verify_integrity parameter, the expected output would be the ValueError: Level values must be unique: [1, 1] on level 0 since this is the default for pd.MultiIndex

I disagree: the from_product docs (and intuition) just refer to the "cartesian product of iterables", not in any way to the underlying levels.

Vice-versa, when you do pd.MultiIndex(levels=...) you are clearly passing levels, so it is OK to check unicity and raise.

But indeed the problem is more subtle than I thought: ideally, we would want pd.Index([1, 1.], dtype=object).is_unique to return True, but it's maybe too late to change. So assuming that it does return False, and that MultiIndex levels must be unique, we can't have both an int and its float representation in the same MultiIndex level.

The only doubt then is whether we should favour the float, rather than int, representation, given that for instance pd.Index([1, 1.]) gives a Float64Index.
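To illustrate the "cartesian product of iterables" reading, here is a small sketch using only the standard library's `itertools.product` (not pandas). It shows what a product that ignores levels altogether would contain; it is an illustration of the intuition, not a claim about what from_product should return:

```python
from itertools import product

# Plain cartesian product: nothing is deduplicated, so both 1 and 1.0 survive.
pairs = list(product([1, 1.0], range(3)))
print(pairs)
# [(1, 0), (1, 1), (1, 2), (1.0, 0), (1.0, 1), (1.0, 2)]

# The first element of each pair keeps its original type.
first_types = [type(a).__name__ for a, _ in pairs]
print(first_types)
# ['int', 'int', 'int', 'float', 'float', 'float']
```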

simonjayhawkins · 2018-09-05T12:38:25Z

... and that MultiIndex levels must be unique, we can't have both an int and its float representation in a same MultiIndex level.

...when using from_product()

The only doubt then is whether we should favour the float, rather than int, representation, given that for instance pd.Index([1, 1.]) gives a Float64Index.

swapping the order of the float and int gives a float for the first level:

>>> pd.MultiIndex.from_product([pd.Index([1, 1.], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([pd.Index([1., 1], dtype=object), range(3)])
MultiIndex(levels=[[1.0], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

so it appears not to be a casting issue as the issue title suggests?
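The order dependence is what one would expect from hash-based deduplication that keeps the first object seen. A minimal pure-Python sketch of that behaviour, assuming nothing about pandas internals (`first_seen_uniques` is a hypothetical name):

```python
def first_seen_uniques(values):
    """Deduplicate by equality/hash, keeping the first object seen."""
    # dict keys are compared by hash and ==; an existing key is never replaced.
    return list(dict.fromkeys(values))


print(first_seen_uniques([1, 1.0]))   # [1]
print(first_seen_uniques([1.0, 1]))   # [1.0]
print(first_seen_uniques([True, 1]))  # [True]
```

Nothing is cast here: the surviving value is simply whichever of the equal objects came first.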

simonjayhawkins · 2018-09-05T13:40:09Z

... and that MultiIndex levels must be unique ...

Indeed, according to the documentation for both pandas.MultiIndex and pandas.MultiIndex.from_product

and yet in the non-float example, i passed a non-unique iterable as the first iterable and got a result instead of a value error:

>>> pd.Index([1, 1], dtype=object).is_unique
False
>>> pd.MultiIndex.from_product([pd.Index([1, 1], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

simonjayhawkins · 2018-09-05T18:51:40Z

I think it is also worth noting that:

>>>
>>> import pandas as pd
>>>
>>> pd.MultiIndex.from_product([[1, 1], range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[1, True], range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[1.0, True], range(3)])
MultiIndex(levels=[[1.0], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[True, 1], range(3)])
MultiIndex(levels=[[True], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

are probably not giving the expected output either.

simonjayhawkins · 2018-09-05T19:50:03Z

which could result in:

>>>
>>> a = 19998989890
>>> b = 19998989889 +1
>>> a is b
False
>>> a == b
True
>>> pd.MultiIndex.from_product([[a,b], range(3)])
MultiIndex(levels=[[19998989890], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

simonjayhawkins · 2018-09-05T21:18:50Z

pandas/pandas/core/indexes/multi.py

Lines 1357 to 1367 in e2e1a10

    
    from pandas.core.arrays.categorical import _factorize_from_iterables
    from pandas.core.reshape.util import cartesian_product

    if not is_list_like(iterables):
        raise TypeError("Input must be a list / sequence of iterables.")
    elif is_iterator(iterables):
        iterables = list(iterables)

    labels, levels = _factorize_from_iterables(iterables)
    labels = cartesian_product(labels)
    return MultiIndex(levels, labels, sortorder=sortorder, names=names)

>>>
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
>>>
>>> labels, levels =_factorize_from_iterables([[1, True], range(3)])
>>> labels
[array([0, 0], dtype=int8), array([0, 1, 2], dtype=int8)]
>>> levels
[Int64Index([1], dtype='int64'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>> from pandas.core.reshape.util import cartesian_product
>>>
>>> labels = cartesian_product(labels)
>>> labels
[array([0, 0, 0, 0, 0, 0], dtype=int8), array([0, 1, 2, 0, 1, 2], dtype=int8)]
>>>
>>> pd.MultiIndex(levels, labels)
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

it appears that from_product() would need to use a different implementation of _factorize_from_iterables
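The two steps above (factorize each iterable, then take the cartesian product of the resulting codes) can be mimicked in pure Python. This is only a sketch under stated assumptions: `factorize_sorted` and `from_product_sketch` are hypothetical stand-ins, not the pandas functions, and the sort assumes the values are mutually comparable (ints, floats, bools):

```python
from itertools import product


def factorize_sorted(values):
    """Return (codes, uniques): dedupe by equality, keep the first object
    seen for each group of equal values, then sort the uniques."""
    representatives = {}
    for v in values:
        representatives.setdefault(v, v)  # first object seen wins
    uniques = sorted(representatives.values())
    position = {u: i for i, u in enumerate(uniques)}
    codes = [position[v] for v in values]
    return codes, uniques


def from_product_sketch(iterables):
    """Factorize each iterable, then take the cartesian product of the codes."""
    factorized = [factorize_sorted(list(it)) for it in iterables]
    levels = [uniques for _, uniques in factorized]
    rows = product(*(codes for codes, _ in factorized))
    codes = [list(column) for column in zip(*rows)]
    return levels, codes


levels, codes = from_product_sketch([[1, True], range(3)])
print(levels)  # [[1], [0, 1, 2]]
print(codes)   # [[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]]
```

In this sketch the duplicate is already gone after the factorize step, so the later product never sees it.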

simonjayhawkins · 2018-09-05T23:17:56Z

>>> import pandas as pd
>>> from pandas.core.arrays.categorical import _factorize_from_iterables

use 3 in the first iterable so that the objects do not compare equal

>>> labels, levels =_factorize_from_iterables([[3, True], range(3)])
>>> labels
[array([1, 0], dtype=int8), array([0, 1, 2], dtype=int8)]
>>> levels
[Index([True, 3], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]

change the 3 back to a 1 so that the first iterable has different objects which compare equal

>>> levels = [pd.Index([1, True], dtype='object'), pd.Int64Index([0, 1, 2], dtype='int64')]
>>> levels
[Index([1, True], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>> from pandas.core.reshape.util import cartesian_product
>>> labels = cartesian_product(labels)
>>> labels
[array([1, 1, 1, 0, 0, 0], dtype=int8), array([0, 1, 2, 0, 1, 2], dtype=int8)]
>>> pd.MultiIndex(levels, labels, verify_integrity=False)
MultiIndex(levels=[[1, True], [0, 1, 2]],
           labels=[[1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2]])

which is the expected output?

changing _factorize_from_iterables alone would give a value error unless MultiIndex is called with verify_integrity=False

>>> pd.MultiIndex(levels, labels)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\indexes\multi.py", line 242, in __new__
    result._verify_integrity()
  File "C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\indexes\multi.py", line 285, in _verify_integrity
    level=i))
ValueError: Level values must be unique: [1, True] on level 0
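As a thought experiment on what a "different implementation" of the factorize step might look like, here is a pure-Python sketch that keys on (type, value) instead of value alone. `factorize_typed` is hypothetical and is not proposed pandas code; it only shows that such a key separates 1, 1.0 and True while still merging equal values of the same type:

```python
def factorize_typed(values):
    """Hypothetical factorize that keys on (type, value) rather than value alone."""
    position = {}
    uniques = []
    codes = []
    for v in values:
        key = (type(v), v)  # 1, 1.0 and True get three different keys
        if key not in position:
            position[key] = len(uniques)
            uniques.append(v)
        codes.append(position[key])
    return codes, uniques


codes, uniques = factorize_typed([1, 1.0, True, 1])
print(codes)    # [0, 1, 2, 0]
print(uniques)  # [1, 1.0, True]

# Equal ints that are distinct objects still collapse to a single level value.
a = 19998989890
b = 19998989889 + 1
print(factorize_typed([a, b]))  # ([0, 0], [19998989890])
```

As the traceback above shows, levels built this way would still be rejected by the equality-based uniqueness check unless verify_integrity=False is passed.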

simonjayhawkins · 2018-09-06T10:28:52Z

MultiIndex.from_product() casts float to int when corresponding int is also present

it depends on the ordering:

>>>
>>> import pandas as pd
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
>>> labels, levels =_factorize_from_iterables([pd.Index([1, 1., 2., 2], dtype=object), range(3)])
>>> levels
[Index([1, 2.0], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]

and the index type is unchanged:

>>> labels, levels =_factorize_from_iterables([pd.Index([1, 1.], dtype=object), range(3)])
>>> levels
[Index([1], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>> labels, levels =_factorize_from_iterables([pd.Index([1., 1], dtype=object), range(3)])
>>> levels
[Index([1.0], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>>

if a list is passed as the first iterable instead of an index object, then the int is cast to a float, not the float cast to an int:

>>>
>>> labels, levels =_factorize_from_iterables([[1., 1], range(3)])
>>> levels
[Float64Index([1.0], dtype='float64'), Int64Index([0, 1, 2], dtype='int64')]
>>> labels, levels =_factorize_from_iterables([[1, 1.], range(3)])
>>> levels
[Float64Index([1.0], dtype='float64'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>>
>>> pd.MultiIndex.from_product([[1, 1., 2., 2], range(3)])
MultiIndex(levels=[[1.0, 2.0], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>>

simonjayhawkins · 2018-09-06T10:50:16Z

if a list is passed as the first iterable instead of an index object, then the int is cast to a float, not the float cast to an int:

unless the list also contains booleans and then it depends on the ordering again:

>>>
>>>
>>> pd.MultiIndex.from_product([pd.Index([1, 1., 2., 2, True, False], dtype=object), range(3)])
MultiIndex(levels=[[False, 1, 2.0], [0, 1, 2]],
           labels=[[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> pd.MultiIndex.from_product([[1, 1., 2., 2, True, False], range(3)])
MultiIndex(levels=[[False, 1, 2.0], [0, 1, 2]],
           labels=[[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>>

Paradox456 · 2024-07-24T16:42:47Z

I believe you should use pd.MultiIndex.from_product, since creating a pd.MultiIndex directly with object-dtype levels treats seemingly duplicate values (e.g., 1 and 1.0) as errors, whereas from_product works.

Out[2]: array([(1, 0), (1, 1), (1, 2), (1, 0), (1, 1), (1, 2)], dtype=object)

In [3]: pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)], labels=[[0,1], [1,1]]
   ...: )

toobaz changed the title ~~Impossible to initialize MultiIndex with int and corresponding float~~ MultiIndex.from_product() casts float to int when corresponding int is also present Jan 30, 2018

toobaz added the Dtype Conversions (Unexpected or buggy dtype conversions) and MultiIndex labels Mar 30, 2018
