DataFrame.set_index() may not preserve dtype #30517

Open
Dr-Irv opened this issue Dec 27, 2019 · 13 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions expressions pd.eval, query

Comments

@Dr-Irv
Contributor

Dr-Irv commented Dec 27, 2019

xref #19602

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.25.3'

In [3]: df = pd.DataFrame({'mixed' : [1, 2, 'abc', 'def'], 'ints': [100, 200, 300, 400]})

In [4]: df
Out[4]:
  mixed  ints
0     1   100
1     2   200
2   abc   300
3   def   400

In [5]: df.dtypes
Out[5]:
mixed    object
ints      int64
dtype: object

In [6]: df.query('ints < 300').set_index('mixed').index
Out[6]: Int64Index([1, 2], dtype='int64', name='mixed')

In [7]: df.set_index('mixed').query('ints < 300').index
Out[7]: Index([1, 2], dtype='object', name='mixed')

Problem description

In the above, I start with a DataFrame with a column mixed that has both integer and string values.

In statement [6], I do a query on a different column and then set the index to be the column mixed. The resulting index now has an int64 dtype as opposed to having the dtype preserved from the original column.

But in statement [7], I first set the index, and then do the query, and now the index has the object dtype.

This becomes an issue if one does some computation on the queried DataFrame, resets the index to recover the mixed column, and then wants to merge the result back into the original DataFrame. The original will have mixed with dtype 'O', while the derived one has mixed with dtype 'int64'.
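The merge problem described above can be sketched as follows. This is an illustration, not a verbatim reproduction: the `derived` frame stands in for the result of `query(...).set_index('mixed').reset_index()`, whose mixed column can come back as int64 instead of object.

```python
# Sketch of the dtype mismatch described above. Column names follow the
# example in this issue; 'derived' simulates the round-tripped frame.
import pandas as pd

orig = pd.DataFrame({'mixed': pd.Series([1, 2, 'abc', 'def'], dtype=object),
                     'ints': [100, 200, 300, 400]})
derived = pd.DataFrame({'mixed': pd.Series([1, 2], dtype='int64'),
                        'total': [300, 300]})

# The key columns no longer share a dtype, so a merge on 'mixed' would
# compare object values against int64 values (recent pandas versions warn
# or raise for such mixed-dtype merge keys).
print(orig['mixed'].dtype, derived['mixed'].dtype)  # object int64
```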

Expected Output

From statement [6], I would have expected:

Index([1, 2], dtype='object', name='mixed')

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2.post20191203
Cython : 0.29.14
pytest : 5.3.2
hypothesis : 4.54.2
sphinx : 2.3.0
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.11
tables : 3.6.1
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.6

@trevorbye

To me this seems to be working as intended. When you call query() you get back a new, modified DataFrame; in this example it no longer has string values in the "mixed" field, since those rows have been filtered out.

It's possible that this could just be fixed by adding something to the docs for set_index() to explain that it's doing a dynamic type check.

Would be curious to see what others think about this one. Willing to do a patch if anyone thinks it's necessary.

@trevorbye

take

@Dr-Irv
Contributor Author

Dr-Irv commented Jan 3, 2020

To me this seems to be working as intended. When you call query() you're returning a new df that's been modified, in this example it no longer has string types in the "mixed" field since they have been filtered out.

I don't think you are right. If you do

In [9]: df.query('ints < 300').dtypes
Out[9]:
mixed    object
ints      int64
dtype: object

you see that the type of mixed has remained object, but then if you do

In [11]: df.query('ints < 300').set_index('mixed').reset_index().dtypes
Out[11]:
mixed    int64
ints     int64
dtype: object

then the dtype of mixed has changed.

I don't think the dtype should change as a result of the set_index() operation. If you do the following:

In [12]: df.query('ints < 300')
Out[12]:
  mixed  ints
0     1   100
1     2   200

In [13]:  df.query('ints < 300').set_index('mixed').reset_index()
Out[13]:
   mixed  ints
0      1   100
1      2   200

the two DataFrames look the same, but the dtype of mixed is different. (see above).

@jreback
Contributor

jreback commented Jan 4, 2020

there is another issue like this, and a closed PR that wasn’t finished, that was meant to address it

@trevorbye

@jreback can you link the PR? I'm curious to see what change they made. Thanks

@jreback
Contributor

jreback commented Jan 4, 2020

#27370 with 2 linked issues

@trevorbye

trevorbye commented Jan 6, 2020

@jreback there's a comment in that PR from the OP where they say

I didn't have enough knowledge of the code base to know if inferring the type of object indexes is still the intended behavior

This is pretty much my question as well after looking at this. Looking at the example @Dr-Irv is showing here for set_index(), the type inference doesn't seem like a bug. The call stack for set_index() eventually gets to the Index (or MultiIndex) class constructor where it builds the index.

If you look at that constructor, you see all the type inference logic, but there is also a dtype param with the docstring:

If dtype is None, we find the dtype that best fits the data.
If an actual dtype is provided, we coerce to that dtype if it's safe.
Otherwise, an error will be raised.

So, if one were to build their own index manually they could use this param to explicitly coerce to their intended type. But set_index() doesn't expose this param through the call stack. I'm wondering if we can introduce a new optional param to this function and possibly others like it, something along the lines of coerce_dtype=None, which would allow the user to try and force a dtype without having to call the Index constructor themselves.

Let me know what you think, or if I'm misinterpreting this.
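The explicit coercion described above is already reachable without a new parameter, by constructing the Index by hand and handing it to set_index(). A workaround sketch under the current API (no hypothetical coerce_dtype parameter assumed):

```python
# Workaround sketch: build the Index ourselves with an explicit dtype
# instead of letting set_index() infer one. Uses only the existing
# Index constructor and its documented dtype parameter.
import pandas as pd

df = pd.DataFrame({'mixed': pd.Series([1, 2, 'abc', 'def'], dtype=object),
                   'ints': [100, 200, 300, 400]})
sub = df.query('ints < 300')

# dtype=object coerces explicitly, per the docstring quoted above.
idx = pd.Index(sub['mixed'], dtype=object, name='mixed')
sub = sub.drop(columns='mixed').set_index(idx)

print(sub.index.dtype)  # object
```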

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 7, 2020
@jreback
Contributor

jreback commented Jan 7, 2020

So my preferred API here would be that we DO preserve things exactly on round-trip, meaning the identity

(df.set_index(cols).reset_index(cols).dtypes == df.dtypes).all()

would always be True (ignoring column reordering, e.g. if you set the index from a column that isn't last, reset_index appends it as the last column).

Hence (and this would be an API change), we should not coerce object dtypes on set / reset index.

Note that this actually has nothing to do with Index itself coercing non-dtyped inputs (meaning if you pass data to Index without passing a dtype it WILL infer); this IS desired behavior.

So this actually would be a pretty easy change (simply pass dtype=object)
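The proposed round-trip identity can be written as a small check. This is a sketch; round_trips() is a hypothetical helper, not a pandas API.

```python
# Sketch of the round-trip invariant proposed above: set_index followed by
# reset_index should leave every column's dtype unchanged (ignoring column
# order). round_trips() is a hypothetical helper name.
import pandas as pd

def round_trips(df, cols):
    out = df.set_index(cols).reset_index()
    # Align on the original column labels so column order doesn't matter.
    return bool((out.dtypes[df.dtypes.index] == df.dtypes).all())

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})
print(round_trips(df, 'A'))  # True for unambiguous dtypes; under the
                             # behavior reported in this issue it can be
                             # False when an object column holds only ints
```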

@trevorbye

@jreback are you suggesting a change to the set_index() function implementation? set_index() eventually just calls the Index constructor with no dtype, so if your data changed from strings and ints to only ints (like the examples above), it would automatically infer ints.

If we exposed a new param like set_index("col", dtype=object) we could pass that dtype through to the Index constructor and then you could preserve the same type.

Let me know if I'm interpreting this correctly. I'm a little confused because in the examples above, the original issue was that the user expected both dtypes to be object, but the second one got inferred as int because set_index was called after filtering out the strings that were originally in that field. To me this fix seems primarily for situations where you DO want to maintain a dtype of object, rather than automatically inferring a more strict type.

I want to make sure I understand what the change should be, thanks!

@jreback
Contributor

jreback commented Jan 8, 2020

So here's a typical example

In [1]: df = pd.DataFrame({'A': pd.Series([1, 2, 3], dtype=object), 'B': [1, 2, 3]})                                                                                                                                        

In [2]: df                                                                                                                                                                                                                  
Out[2]: 
   A  B
0  1  1
1  2  2
2  3  3

In [3]: df.dtypes                                                                                                                                                                                                           
Out[3]: 
A    object
B     int64
dtype: object

In [4]: df.set_index('A').index                                                                                                                                                                                             
Out[4]: Int64Index([1, 2, 3], dtype='int64', name='A')

the issue is that the dtype in [4] is inferred by the Index constructor; this inference is not needed. Just preserve the object dtype if it is already set: we already know the dtype of the column, and we only need to infer when the dtype cannot be represented directly as an index (e.g. an int32 column)

@trevorbye

@jreback okay I see what you mean now, that's a simpler solution than what I was thinking. I'll look at building a fix

@trevorbye

@jreback opened PR #30870 here if you can take a look

@sickcodes

FWIW uint64 doesn't work with sqlalchemy, so using set_index() before df.to_sql() can cause errors.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L1176-L1177

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "./lib/python3.9/site-packages/pandas/core/generic.py", line 2872, in to_sql
    sql.to_sql(
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 717, in to_sql
    pandas_sql.to_sql(
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 1751, in to_sql
    table = self.prep_table(
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 1640, in prep_table
    table = SQLTable(
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 831, in __init__
    self.table = self._create_table_setup()
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 1094, in _create_table_setup
    column_names_and_types = self._get_column_names_and_types(self._sqlalchemy_type)
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 1077, in _get_column_names_and_types
    idx_type = dtype_mapper(self.frame.index._get_level_values(i))
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 1227, in _sqlalchemy_type
    raise ValueError("Unsigned 64 bit integer datatype is not supported")
ValueError: Unsigned 64 bit integer datatype is not supported

Dirty fix:
df.index = df.index.astype('string')
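An alternative to casting the whole index to strings is to downcast to int64 when the values fit, which keeps the column numeric on the SQL side. A sketch (the to_sql() call itself is omitted):

```python
# Sketch: cast a uint64 index to a type the to_sql() path accepts.
# int64 keeps numeric semantics when the values fit in the signed range;
# astype(str) is the catch-all from the comment above.
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': np.array([1, 2, 3], dtype='uint64'),
                   'v': [10, 20, 30]}).set_index('id')

if df.index.dtype == np.uint64:
    if (df.index <= np.iinfo(np.int64).max).all():
        df.index = df.index.astype('int64')
    else:
        df.index = df.index.astype(str)

print(df.index.dtype)  # int64
```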
