Surprising type conversion when iterating #20791
As I said on Gitter, we have been consistently changing this in several conversion functions to always return native Python types if possible:
To have some context: #13258, #17491, #16048, and many others. I agree this can certainly be surprising, and it's possible this can use better information in the docs, or we could think about ways to give users an option. But as I also said on Gitter, it would be good to see some actual reasons for caring about this (the typing one is good). If performance is your concern, you should try to see how you can avoid iterating over the Series at all.
The issues you linked above mostly talk about explicit conversion functions. But the important thing is that values are upcast to a common dtype, not converted to a Python scalar. Even the documentation shows so:
The snippet above is taken from the documentation. My issue here is that a user can then ask: OK, how can I get things without this conversion? My proposal would be to change the signature of itertuples to:

itertuples(self, index=True, name="Pandas", raw=False)

If raw=True, values would be yielded without conversion to Python scalars.
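A sketch of what the proposed raw mode could look like, implemented outside pandas. Note that `itertuples_raw` is a hypothetical helper, not a pandas API, and the `raw` parameter above is only a proposal:

```python
from collections import namedtuple

import numpy as np
import pandas as pd


def itertuples_raw(df, name="Pandas"):
    """Hypothetical sketch: yield namedtuples of the underlying numpy
    values, skipping the conversion to Python scalars that
    DataFrame.itertuples performs."""
    Row = namedtuple(name, ["Index"] + list(df.columns))
    # Iterating a 1-D ndarray yields numpy scalars (e.g. np.int8),
    # whereas iterating the Series itself would yield Python ints.
    columns = [df[col].to_numpy() for col in df.columns]
    for idx, vals in zip(df.index, zip(*columns)):
        yield Row(idx, *vals)


df = pd.DataFrame({"a": np.array([1, 2], dtype="int8")})
row = next(itertuples_raw(df))
print(type(row.a))  # <class 'numpy.int8'>
```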
I am realizing there is some deeper issue with iterrows as well, and I reported it in its own issue.
Code Sample, a copy-pastable example if possible
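The original snippet was not preserved in this copy; a minimal example reproducing the surprising conversion (dtype and values chosen for illustration) might look like:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2], dtype="int8")

# Positional access preserves the numpy type...
print(type(s.iloc[0]))      # <class 'numpy.int8'>

# ...but iteration converts each value to the closest Python type.
print(type(next(iter(s))))  # <class 'int'>
```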
Problem description
When iterating over a Series, values are converted to the closest Python type instead of being left as their original type. This is problematic because it is a surprising change and leads to data manipulation one does not expect. One would hope to get out of a DataFrame exactly what was stored in it when iterating, not a variation of it.
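One workaround, assuming a pandas version that exposes the backing array via Series.to_numpy (older versions use .values), is to iterate the underlying ndarray directly, which keeps the original dtype and does not materialize an intermediate Python list:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2], dtype="int8")

# Iterating the backing ndarray yields numpy scalars with the
# original dtype, unlike iterating the Series itself.
for v in s.to_numpy():
    print(type(v))  # <class 'numpy.int8'>
```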
The issue is that iterating internally uses numpy's tolist, which has this behavior. This is problematic for two reasons: it constructs the whole list in memory first (#20783), and it converts values to the "closest Python type".

Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: 3.3.0
pip: 9.0.1
setuptools: 38.2.3
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.9.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None