MultiIndex.from_product() casts float to int when corresponding int is also present #19432
Comments
In [2]: pd.Index([1, 1.], dtype=object).is_unique
Out[2]: False ... which is correct ( |
So does this mean that the second example is actually giving the expected output, and that to initialize the MultiIndex you should pass verify_integrity=False?
pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)],
              labels=[[0, 1], [1, 1]], verify_integrity=False)
This gives a MultiIndex containing both 1 and 1.0 in its first level.
|
If we, for the moment, ignore the float from the problem description and 'manually' create a MultiIndex using
>>> pd.MultiIndex(levels=[pd.Index([1, 1], dtype=object), range(3)],
...               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
we get:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "...\pandas\core\indexes\multi.py", line 242, in __new__
result._verify_integrity()
File "...\pandas\core\indexes\multi.py", line 285, in _verify_integrity
level=i))
ValueError: Level values must be unique: [1, 1] on level 0
which makes sense, so we add verify_integrity=False:
>>> pd.MultiIndex(levels=[pd.Index([1, 1], dtype=object), range(3)],
...               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]], verify_integrity=False)
MultiIndex(levels=[[1, 1], [0, 1, 2]],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
If we now try to re-create this using
>>> pd.MultiIndex.from_product([pd.Index([1, 1], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
which is not the same as the output from either of the two previous cases! since the
So, going back to the original issue, it appears it is not related to the input containing a float, and the expected output from the first example in the issue description should actually be:
ValueError: Level values must be unique: [1, 1.] on level 0
I think this then raises the question: Should |
I disagree: the Vice-versa, when you do But indeed the problem is more subtle than I thought: ideally, we would want The only doubt then is whether we should favour the |
...when using
Swapping the order of the float and int gives a float for the first level:
>>> pd.MultiIndex.from_product([pd.Index([1, 1.], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([pd.Index([1., 1], dtype=object), range(3)])
MultiIndex(levels=[[1.0], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
So it appears not to be a casting issue, as the issue title suggests? |
Indeed, according to the documentation for both and yet in the non-float example, I passed a non-unique iterable as the first iterable and got a result instead of a ValueError:
>>> pd.Index([1, 1], dtype=object).is_unique
False
>>> pd.MultiIndex.from_product([pd.Index([1, 1], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>> |
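The order dependence seen above (1 survives when it comes first, 1.0 when it comes first) is what one would expect from any deduplication based on hashing and equality, since in Python `1 == 1.0 == True` and all three hash to the same value. The following is a minimal plain-Python sketch of that effect. The `factorize` helper here is hypothetical and is not pandas' implementation (among other differences, it does not sort the unique values).

```python
def factorize(values):
    """Return (codes, uniques), deduplicating by hash and equality.

    Hypothetical stand-in for illustration only; not pandas' code.
    """
    positions = {}  # value -> code; dict lookup relies on hash() and ==
    uniques = []
    codes = []
    for value in values:
        if value not in positions:
            # The first occurrence becomes the representative of its
            # equality class, so input order decides which type survives.
            positions[value] = len(uniques)
            uniques.append(value)
        codes.append(positions[value])
    return codes, uniques


codes_int_first, uniques_int_first = factorize([1, 1.0])
codes_float_first, uniques_float_first = factorize([1.0, 1])
print(uniques_int_first, uniques_float_first)
```

Under this sketch, both inputs collapse to a single unique value, and only the type of the survivor differs, which mirrors the object-dtype outputs shown in this thread.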
I think it is also worth noting that: >>>
>>> import pandas as pd
>>>
>>> pd.MultiIndex.from_product([[1, 1], range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[1, True], range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[1.0, True], range(3)])
MultiIndex(levels=[[1.0], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[True, 1], range(3)])
MultiIndex(levels=[[True], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
These are probably not giving the expected output either. |
which could result in: >>>
>>> a = 19998989890
>>> b = 19998989889 +1
>>> a is b
False
>>> a == b
True
>>> pd.MultiIndex.from_product([[a,b], range(3)])
MultiIndex(levels=[[19998989890], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>> |
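Merging `a` and `b` above is arguably fine, since they are the same type and compare equal; the troublesome cases are values of different types that compare equal. One conceivable way to separate the two situations is to key the deduplication on the pair (type, value). This is only a sketch of that idea under my own assumptions: the helper name `factorize_typed` is hypothetical, and nothing in this thread says pandas does or should work this way.

```python
def factorize_typed(values):
    """Deduplicate by (type, value), so 1, 1.0 and True stay distinct.

    Hypothetical helper for illustration only; not a pandas API.
    """
    positions = {}
    uniques = []
    codes = []
    for value in values:
        key = (type(value), value)  # equal values of different types differ here
        if key not in positions:
            positions[key] = len(uniques)
            uniques.append(value)
        codes.append(positions[key])
    return codes, uniques


a = 19998989890
b = 19998989889 + 1
same_type_codes, same_type_uniques = factorize_typed([a, b])
mixed_codes, mixed_uniques = factorize_typed([1, 1.0, True])
print(same_type_uniques, mixed_uniques)
```

With this key, two distinct int objects holding the same number still collapse to one level value, while `1`, `1.0` and `True` each keep their own code.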
pandas/pandas/core/indexes/multi.py Lines 1357 to 1367 in e2e1a10
>>>
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
>>>
>>> labels, levels =_factorize_from_iterables([[1, True], range(3)])
>>> labels
[array([0, 0], dtype=int8), array([0, 1, 2], dtype=int8)]
>>> levels
[Int64Index([1], dtype='int64'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>> from pandas.core.reshape.util import cartesian_product
>>>
>>> labels = cartesian_product(labels)
>>> labels
[array([0, 0, 0, 0, 0, 0], dtype=int8), array([0, 1, 2, 0, 1, 2], dtype=int8)]
>>>
>>> pd.MultiIndex(levels, labels)
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>> it appears that |
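The steps walked through above (factorize each iterable, take the cartesian product of the resulting codes, then pair levels with labels) can be sketched in plain Python as follows. Both `factorize` and `from_product_sketch` are hypothetical names used for illustration; they only approximate what the pandas internals quoted above do, and they skip the sorting and dtype inference that pandas performs.

```python
from itertools import product


def factorize(values):
    """Hash-and-equality based deduplication (illustrative stand-in)."""
    positions = {}
    uniques = []
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(uniques)
            uniques.append(value)
        codes.append(positions[value])
    return codes, uniques


def from_product_sketch(iterables):
    """Factorize each iterable, then take the cartesian product of the codes."""
    factorized = [factorize(iterable) for iterable in iterables]
    levels = [uniques for _, uniques in factorized]
    # Each row of the product holds one code per level; transpose the rows
    # to get one list of codes per level.
    rows = list(product(*(codes for codes, _ in factorized)))
    labels = [list(column) for column in zip(*rows)]
    return levels, labels


levels, labels = from_product_sketch([[1, True], range(3)])
print(levels)
print(labels)
```

Because the duplicates are already merged in the factorize step, the uniqueness check in the constructor never sees them, which is consistent with from_product returning a result instead of raising.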
>>> import pandas as pd
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
Use 3 in the first iterable so that the objects do not compare equal:
>>> labels, levels = _factorize_from_iterables([[3, True], range(3)])
>>> labels
[array([1, 0], dtype=int8), array([0, 1, 2], dtype=int8)]
>>> levels
[Index([True, 3], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
Change the 3 back to a 1 so that the first iterable has different objects which compare equal:
>>> levels = [pd.Index([1, True], dtype='object'), pd.Int64Index([0, 1, 2], dtype='int64')]
>>> levels
[Index([1, True], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>> from pandas.core.reshape.util import cartesian_product
>>> labels = cartesian_product(labels)
>>> labels
[array([1, 1, 1, 0, 0, 0], dtype=int8), array([0, 1, 2, 0, 1, 2], dtype=int8)]
>>> pd.MultiIndex(levels, labels, verify_integrity=False)
MultiIndex(levels=[[1, True], [0, 1, 2]],
labels=[[1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
which is the expected output? changing
>>> pd.MultiIndex(levels, labels)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\indexes\multi.py", line 242, in __new__
result._verify_integrity()
File "C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\indexes\multi.py", line 285, in _verify_integrity
level=i))
ValueError: Level values must be unique: [1, True] on level 0 |
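The ValueError above comes from the integrity check on the levels. A rough plain-Python approximation of such a check is sketched below, assuming uniqueness is judged by hash and equality (here via `set`). The function name `verify_levels_unique` is hypothetical, and the real check in pandas is implemented differently; this only shows why `[1, True]` is rejected.

```python
def verify_levels_unique(levels):
    """Raise ValueError if any level contains values that compare equal.

    Illustrative approximation only; not the pandas implementation.
    """
    for i, level in enumerate(levels):
        values = list(level)
        # set() merges values that are equal and hash alike, e.g. 1 and True.
        if len(set(values)) != len(values):
            raise ValueError(
                f"Level values must be unique: {values} on level {i}"
            )


try:
    verify_levels_unique([[1, True], [0, 1, 2]])
except ValueError as exc:
    message = str(exc)
else:
    message = None
print(message)
```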
it depends on the ordering: >>>
>>> import pandas as pd
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
>>> labels, levels =_factorize_from_iterables([pd.Index([1, 1., 2., 2], dtype=object), range(3)])
>>> levels
[Index([1, 2.0], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
and the index type is unchanged:
>>> labels, levels = _factorize_from_iterables([pd.Index([1, 1.], dtype=object), range(3)])
>>> levels
[Index([1], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>> labels, levels =_factorize_from_iterables([pd.Index([1., 1], dtype=object), range(3)])
>>> levels
[Index([1.0], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>>
If a list is passed as the first iterable instead of an Index object, then the int is cast to a float, not the float cast to an int:
>>> labels, levels =_factorize_from_iterables([[1., 1], range(3)])
>>> levels
[Float64Index([1.0], dtype='float64'), Int64Index([0, 1, 2], dtype='int64')]
>>> labels, levels =_factorize_from_iterables([[1, 1.], range(3)])
>>> levels
[Float64Index([1.0], dtype='float64'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>>
>>> pd.MultiIndex.from_product([[1, 1., 2., 2], range(3)])
MultiIndex(levels=[[1.0, 2.0], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> |
unless the list also contains booleans and then it depends on the ordering again: >>>
>>>
>>> pd.MultiIndex.from_product([pd.Index([1, 1., 2., 2, True, False], dtype=object), range(3)])
MultiIndex(levels=[[False, 1, 2.0], [0, 1, 2]],
labels=[[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> pd.MultiIndex.from_product([[1, 1., 2., 2, True, False], range(3)])
MultiIndex(levels=[[False, 1, 2.0], [0, 1, 2]],
labels=[[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> |
I believe you should use pd.MultiIndex.from_product, since creating a pd.MultiIndex directly with object-dtype levels treats seemingly duplicate values (e.g., 1 and 1.0) as errors, whereas from_product works.
|
Code Sample, a copy-pastable example if possible
Problem
If a flat (object) Index allows us to distinguish 1 and 1., so should MultiIndex. From #18913 (comment)
Expected Output
Both the first two examples should return a MultiIndex containing both 1 and 1.0 in its first level.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: 8cbee35
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8
pandas: 0.23.0.dev0+182.g8cbee356d.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1