DataFrame.set_index() may not preserve dtype #30517
Comments
To me this seems to be working as intended. When you call It's possible that this could just be fixed by adding something to the docs for Would be curious to see what others think about this one. Willing to do a patch if anyone thinks it's necessary. |
take |
I don't think you are right. If you do
you see that the type of
then the dtype of I don't think the dtype should change as a result of the
the two DataFrames look the same, but the dtype of |
There is another issue like this, and a closed PR that was meant to address it but wasn't finished. |
@jreback can you link the PR? I'm curious to see what change they made. Thanks |
#27370 with 2 linked issues |
@jreback there's a comment in that PR from the OP where they say
This is pretty much my question as well after looking at this. Looking at the example @Dr-Irv is showing here for If you look at that constructor, you see all the type inference logic, but there is also a
So, if one were to build their own index manually they could use this param to explicitly coerce to their intended type. But Let me know what you think, or if I'm misinterpreting this. |
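For anyone following along, here is a minimal sketch of the "build the index manually" route mentioned above. It assumes the constructor parameter in question is `dtype` on `pd.Index`; the data and the names `sub`, `manual_index` and `result` are made up for illustration, and the boolean mask stands in for the query.

```python
import pandas as pd

# Hypothetical data: the column "mixed" holds both integers and a string.
df = pd.DataFrame({"x": ["a", "a", "b"], "mixed": [1, 2, "q"], "z": [10, 20, 30]})

# Keep only the rows whose "mixed" values happen to be integers.
sub = df[df["x"] == "a"]

# Build the index by hand and state the dtype explicitly, so nothing is inferred.
manual_index = pd.Index(sub["mixed"], dtype=object, name="mixed")

# Drop the column and attach the hand-built index in its place.
result = sub.drop(columns="mixed")
result.index = manual_index

print(result.index.dtype)
```

This keeps the index as object regardless of what the constructor would have inferred, at the cost of doing by hand what `set_index` normally does.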
So my preferred API for this would be that we DO preserve things exactly on round-trip, meaning the identity
would always be True (ignoring the column reordering, e.g. you set from not the last column, but the reset puts it last). Hence (and this would be an API change), we should not coerce Note that this actually has nothing to do with Index itself coercing non-dtyped inputs (meaning if you pass data to Index without passing a dtype it WILL infer); this IS desired behavior. So this actually would be a pretty easy change (simply pass |
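One way to check the round-trip property described here is sketched below. The helper `roundtrip_keeps_dtype` and the sample data are hypothetical, not part of pandas, and the check only compares the dtype of the column, ignoring column order.

```python
import pandas as pd


def roundtrip_keeps_dtype(frame, column):
    """Return True if `column` has the same dtype after set_index + reset_index."""
    restored = frame.set_index(column).reset_index()
    return restored[column].dtype == frame[column].dtype


# Hypothetical data: the column "mixed" holds both integers and a string.
df = pd.DataFrame({"x": ["a", "a", "b"], "mixed": [1, 2, "q"], "z": [10, 20, 30]})

# Full frame: the column really is mixed, so object survives the round trip.
print(roundtrip_keeps_dtype(df, "mixed"))

# Filtered frame: the remaining values are all integers. Whether the object
# dtype survives here depends on the pandas version in use.
print(roundtrip_keeps_dtype(df[df["x"] == "a"], "mixed"))
```

The second call is the case this issue is about, so its result is the thing to watch when trying a fix.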
@jreback are you suggesting to make a change to the If we exposed a new param like Let me know if I'm interpreting this correctly. I'm a little confused because in the examples above, the original issue was that the user expected both dtypes to be I want to make sure I understand what the change should be, thanks! |
So here's a typical example
The issue is that [4] is inferred by the Index constructor. This is not needed: just preserve the object dtype if it's set. We already know the dtype of the column, so we only need to infer if it's not possible to represent it (e.g. an int32 column). |
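To make the distinction concrete, here is a small sketch (the values and variable names are illustrative only) of the difference between letting the `Index` constructor infer a dtype from plain Python values and telling it the dtype up front.

```python
import pandas as pd

values = [1, 2]

# No dtype given: the constructor inspects the values and infers an integer dtype.
inferred = pd.Index(values)

# dtype given: the constructor keeps what it was told.
explicit = pd.Index(values, dtype=object)

# A Series declared as object keeps object, even with integer-only values.
column = pd.Series(values, dtype=object, name="mixed")

print(inferred.dtype, explicit.dtype, column.dtype)
```

The column already carries its dtype, so a fix along these lines would only need to hand that dtype through instead of asking the constructor to guess again.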
@jreback okay I see what you mean now, that's a simpler solution than what I was thinking. I'll look at building a fix |
FWIW https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L1176-L1177
Dirty fix: |
xref #19602
Code Sample, a copy-pastable example if possible
Problem description
In the above, I start with a `DataFrame` with a column `mixed` that has both integer and string values.
In statement [6], I do a query on a different column and then set the index to be the column `mixed`. The resulting index now has an `int64` dtype as opposed to having the dtype preserved from the original column.
But in statement [7], I first set the index, and then do the query, and now the index has the `object` dtype.
This becomes an issue if one does some computation on the queried DataFrame, then creates the index `mixed`, and then wants to merge it back to the original `DataFrame`. Now the original one will have `mixed` as dtype `'O'` and the new one has `mixed` as dtype `'int'`.
Expected Output
From statement [6], I would have expected:
Output of `pd.show_versions()`
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2.post20191203
Cython : 0.29.14
pytest : 5.3.2
hypothesis : 4.54.2
sphinx : 2.3.0
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.11
tables : 3.6.1
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.6