DOC: document the perils of reading mixed dtypes / how to handle #13746
Comments
I could see an argument for the first case raising - although it's been this way forever and the warning is decent and gives explicit corrective action?
Case 2 is a different issue altogether - you have a space in your data and need to pass skipinitialspace=True:
In [26]: from io import StringIO
    ...: import pandas as pd
    ...: df = pd.read_csv(StringIO('1, "1"\n"1",1'), header=None, skipinitialspace=True)
    ...: df
Out[26]:
   0  1
0  1  1
1  1  1

In [27]: df.dtypes
Out[27]:
0    int64
1    int64
dtype: object
@chris-b1 Understood. Perhaps then clarify the docs?
Now The ideal solution would have been to produce an optional log of parse errors. Barring that, perhaps modifying
@pkch For converting the object column to numeric, you need to set
As the docstring says, If you want to find the values that did cause the conversion to fail, you can first drop the NaN values before applying It would maybe also be nice if
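A minimal sketch of the approach this comment describes. The inline code was lost from the comment, so this assumes the setting in question is errors="coerce" on pd.to_numeric; the sample data and the names `present` and `bad_values` are hypothetical.

```python
import pandas as pd

# Hypothetical object column mixing numeric strings, a bad token and a missing value.
s = pd.Series(["1", "2", "oops", None, "4.5"], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising.
converted = pd.to_numeric(s, errors="coerce")

# Drop the values that were already missing first, so that any NaN produced
# by the conversion can only come from a value that failed to parse.
present = s.dropna()
bad_values = present[pd.to_numeric(present, errors="coerce").isna()]

print(converted)
print(bad_values.tolist())
```

Dropping the existing NaN values before converting is what separates "was missing in the file" from "could not be parsed".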
yeah at best this is a doc issue, a nice note section under dtypes in io.rst / cookbook of the perils of having mixed dtypes.
Code Sample, a copy-pastable example if possible
See for example http://stackoverflow.com/questions/18471859/pandas-read-csv-dtype-inference-issue
(the behavior hasn't changed in 3 years, except that a warning is now issued)
Similar examples: the column that contains '1' and 1 is inferred as string; one that contains 1 and '1' is inferred as numeric.
Furthermore,
pd.to_numeric(df[1])
won't actually work in the above case. While this behavior is understandable, it is unexpected to many users and isn't documented (beyond the general warnings that type inference isn't perfect). Given the number of people who use pandas for reading csv files without knowing the history of the library or the mechanics of type inference, this results in a lot of wasted time identifying and fixing a problem that could have been prevented.
I suggest at least adding clear documentation on this, but preferably changing the default behavior to fail with an error stating that type inference failed and that dtypes should be explicitly provided (I don't think it's possible to use two passes, since the input data may be a generator that is exhausted after the first read).
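For illustration, one way to sidestep inference today is to pass dtypes explicitly and convert afterwards. This is only a sketch: the CSV contents and the column names `a`, `b` and `b_num` are made up for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV: column "b" mixes integers with a stray non-numeric token.
data = "a,b\n1,10\n2,x\n3,30\n"

# Declare dtypes up front so column "b" is read as strings, not inferred.
df = pd.read_csv(StringIO(data), dtype={"a": "int64", "b": str})

# Convert deliberately afterwards; unparseable entries become NaN.
df["b_num"] = pd.to_numeric(df["b"], errors="coerce")

print(df)
print(df.dtypes)
```

Reading the suspect column as str keeps the original text available, so the offending rows can still be inspected after the conversion.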
Expected Output
the column
output of
pd.show_versions()