DataFrame.set_index() may not preserve dtype #30517

Open
Dr-Irv opened this issue Dec 27, 2019 · 13 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions expressions pd.eval, query

Comments

@Dr-Irv
Contributor

Dr-Irv commented Dec 27, 2019

xref #19602

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.25.3'

In [3]: df = pd.DataFrame({'mixed' : [1, 2, 'abc', 'def'], 'ints': [100, 200, 300, 400]})

In [4]: df
Out[4]:
  mixed  ints
0     1   100
1     2   200
2   abc   300
3   def   400

In [5]: df.dtypes
Out[5]:
mixed    object
ints      int64
dtype: object

In [6]: df.query('ints < 300').set_index('mixed').index
Out[6]: Int64Index([1, 2], dtype='int64', name='mixed')

In [7]: df.set_index('mixed').query('ints < 300').index
Out[7]: Index([1, 2], dtype='object', name='mixed')

Problem description

In the above, I start with a DataFrame with a column mixed that has both integer and string values.

In statement [6], I do a query on a different column and then set the index to be the column mixed. The resulting index now has an int64 dtype as opposed to having the dtype preserved from the original column.

But in statement [7], I first set the index, and then do the query, and now the index has the object dtype.

This becomes an issue if one does some computation on the queried DataFrame, resets the index to recover the mixed column, and then wants to merge the result back into the original DataFrame. The original will have mixed with dtype 'O', while the derived one has mixed with dtype 'int64'.
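The merge problem described above can be sketched as follows. This is an illustration, not a verbatim reproduction: the `derived` frame stands in for the result of `query(...).set_index('mixed').reset_index()`, whose mixed column can come back as int64 instead of object.

```python
# Sketch of the dtype mismatch described above. Column names follow the
# example in this issue; 'derived' simulates the round-tripped frame.
import pandas as pd

orig = pd.DataFrame({'mixed': pd.Series([1, 2, 'abc', 'def'], dtype=object),
                     'ints': [100, 200, 300, 400]})
derived = pd.DataFrame({'mixed': pd.Series([1, 2], dtype='int64'),
                        'total': [300, 300]})

# The key columns no longer share a dtype, so a merge on 'mixed' would
# compare object values against int64 values (recent pandas versions warn
# or raise for such mixed-dtype merge keys).
print(orig['mixed'].dtype, derived['mixed'].dtype)  # object int64
```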

Expected Output

From statement [6], I would have expected:

Index([1, 2], dtype='object', name='mixed')

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2.post20191203
Cython : 0.29.14
pytest : 5.3.2
hypothesis : 4.54.2
sphinx : 2.3.0
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.11
tables : 3.6.1
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.6

@trevorbye

To me this seems to be working as intended. When you call query() you get back a new, modified DataFrame; in this example it no longer has string values in the "mixed" field, since those rows have been filtered out.

It's possible that this could just be fixed by adding something to the docs for set_index() to explain that it's doing a dynamic type check.

Would be curious to see what others think about this one. Willing to do a patch if anyone thinks it's necessary.

@trevorbye

take

@Dr-Irv
Contributor Author

Dr-Irv commented Jan 3, 2020

To me this seems to be working as intended. When you call query() you're returning a new df that's been modified, in this example it no longer has string types in the "mixed" field since they have been filtered out.

I don't think you are right. If you do

In [9]: df.query('ints < 300').dtypes
Out[9]:
mixed    object
ints      int64
dtype: object

you see that the type of mixed has remained object, but then if you do

In [11]: df.query('ints < 300').set_index('mixed').reset_index().dtypes
Out[11]:
mixed    int64
ints     int64
dtype: object

then the dtype of mixed has changed.

I don't think the dtype should change as a result of the set_index() operation. If you do the following:

In [12]: df.query('ints < 300')
Out[12]:
  mixed  ints
0     1   100
1     2   200

In [13]:  df.query('ints < 300').set_index('mixed').reset_index()
Out[13]:
   mixed  ints
0      1   100
1      2   200

the two DataFrames look the same, but the dtype of mixed is different. (see above).

@jreback
Contributor

jreback commented Jan 4, 2020

there is another issue like this, and a closed PR that wasn’t finished, that was meant to address it

@trevorbye

@jreback can you link the PR? I'm curious to see what change they made. Thanks

@jreback
Contributor

jreback commented Jan 4, 2020

#27370 with 2 linked issues

@trevorbye

trevorbye commented Jan 6, 2020

@jreback there's a comment in that PR from the OP where they say

I didn't have enough knowledge of the code base to know if inferring the type of object indexes is still the intended behavior

This is pretty much my question as well after looking at this. Looking at the example @Dr-Irv is showing here for set_index(), the type inference doesn't seem like a bug. The call stack for set_index() eventually gets to the Index (or MultiIndex) class constructor where it builds the index.

If you look at that constructor, you see all the type inference logic, but there is also a dtype param with the docstring:

If dtype is None, we find the dtype that best fits the data.
If an actual dtype is provided, we coerce to that dtype if it's safe.
Otherwise, an error will be raised.

So, if one were to build their own index manually they could use this param to explicitly coerce to their intended type. But set_index() doesn't expose this param through the call stack. I'm wondering if we can introduce a new optional param to this function and possibly others like it, something along the lines of coerce_dtype=None, which would allow the user to try and force a dtype without having to call the Index constructor themselves.

Let me know what you think, or if I'm misinterpreting this.
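The explicit coercion described above is already reachable without a new parameter, by constructing the Index by hand and handing it to set_index(). A workaround sketch under the current API (no hypothetical coerce_dtype parameter assumed):

```python
# Workaround sketch: build the Index ourselves with an explicit dtype
# instead of letting set_index() infer one. Uses only the existing
# Index constructor and its documented dtype parameter.
import pandas as pd

df = pd.DataFrame({'mixed': pd.Series([1, 2, 'abc', 'def'], dtype=object),
                   'ints': [100, 200, 300, 400]})
sub = df.query('ints < 300')

# dtype=object coerces explicitly, per the docstring quoted above.
idx = pd.Index(sub['mixed'], dtype=object, name='mixed')
sub = sub.drop(columns='mixed').set_index(idx)

print(sub.index.dtype)  # object
```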

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 7, 2020
@jreback
Contributor

jreback commented Jan 7, 2020

So my preferred API here would be that we DO preserve things exactly on round-trip, meaning the identity

(df.set_index(cols).reset_index(cols).dtypes == df.dtypes).all()

would always be True (ignoring column reordering, e.g. if you set the index from a column that isn't last, reset_index appends it as the last column).

Hence (and this would be an API change), we should not coerce object dtypes on set / reset index.

Note that this actually has nothing to do with Index itself coercing non-dtyped inputs (meaning if you pass data to Index without passing a dtype it WILL infer); this IS desired behavior.

So this actually would be a pretty easy change (simply pass dtype=object)
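The proposed round-trip identity can be written as a small check. This is a sketch; round_trips() is a hypothetical helper, not a pandas API.

```python
# Sketch of the round-trip invariant proposed above: set_index followed by
# reset_index should leave every column's dtype unchanged (ignoring column
# order). round_trips() is a hypothetical helper name.
import pandas as pd

def round_trips(df, cols):
    out = df.set_index(cols).reset_index()
    # Align on the original column labels so column order doesn't matter.
    return bool((out.dtypes[df.dtypes.index] == df.dtypes).all())

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})
print(round_trips(df, 'A'))  # True for unambiguous dtypes; under the
                             # behavior reported in this issue it can be
                             # False when an object column holds only ints
```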

@trevorbye

@jreback are you suggesting a change to the set_index() function implementation? set_index() eventually just calls the Index constructor with no dtype, so if your data changed from strings and ints to only ints (like the examples above), it would automatically infer ints.

If we exposed a new param like set_index("col", dtype=object) we could pass that dtype through to the Index constructor and then you could preserve the same type.

Let me know if I'm interpreting this correctly. I'm a little confused because in the examples above, the original issue was that the user expected both dtypes to be object, but the second one got inferred as int because set_index was called after filtering out the strings that were originally in that field. To me this fix seems primarily for situations where you DO want to maintain a dtype of object, rather than automatically inferring a more strict type.

I want to make sure I understand what the change should be, thanks!

@jreback
Contributor

jreback commented Jan 8, 2020

So here's a typical example

In [1]: df = pd.DataFrame({'A': pd.Series([1, 2, 3], dtype=object), 'B': [1, 2, 3]})                                                                                                                                        

In [2]: df                                                                                                                                                                                                                  
Out[2]: 
   A  B
0  1  1
1  2  2
2  3  3

In [3]: df.dtypes                                                                                                                                                                                                           
Out[3]: 
A    object
B     int64
dtype: object

In [4]: df.set_index('A').index                                                                                                                                                                                             
Out[4]: Int64Index([1, 2, 3], dtype='int64', name='A')

the issue is that the dtype in [4] is inferred by the Index constructor; this inference is not needed. Just preserve the object dtype if it is already set: we already know the dtype of the column, and we only need to infer when the dtype cannot be represented directly as an index (e.g. an int32 column)

@trevorbye

@jreback okay I see what you mean now, that's a simpler solution than what I was thinking. I'll look at building a fix

@trevorbye

@jreback opened PR #30870 here if you can take a look

@sickcodes

FWIW uint64 doesn't work with sqlalchemy, so using set_index() before df.to_sql() can cause errors.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L1176-L1177

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "./lib/python3.9/site-packages/pandas/core/generic.py", line 2872, in to_sql
    sql.to_sql(
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 717, in to_sql
    pandas_sql.to_sql(
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 1751, in to_sql
    table = self.prep_table(
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 1640, in prep_table
    table = SQLTable(
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 831, in __init__
    self.table = self._create_table_setup()
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 1094, in _create_table_setup
    column_names_and_types = self._get_column_names_and_types(self._sqlalchemy_type)
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 1077, in _get_column_names_and_types
    idx_type = dtype_mapper(self.frame.index._get_level_values(i))
  File "./lib/python3.9/site-packages/pandas/io/sql.py", line 1227, in _sqlalchemy_type
    raise ValueError("Unsigned 64 bit integer datatype is not supported")
ValueError: Unsigned 64 bit integer datatype is not supported

Dirty fix:
df.index = df.index.astype('string')
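An alternative to casting the whole index to strings is to downcast to int64 when the values fit, which keeps the column numeric on the SQL side. A sketch (the to_sql() call itself is omitted):

```python
# Sketch: cast a uint64 index to a type the to_sql() path accepts.
# int64 keeps numeric semantics when the values fit in the signed range;
# astype(str) is the catch-all from the comment above.
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': np.array([1, 2, 3], dtype='uint64'),
                   'v': [10, 20, 30]}).set_index('id')

if df.index.dtype == np.uint64:
    if (df.index <= np.iinfo(np.int64).max).all():
        df.index = df.index.astype('int64')
    else:
        df.index = df.index.astype(str)

print(df.index.dtype)  # int64
```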
