Surprising type conversion when iterating #20791

Open
mitar opened this issue Apr 23, 2018 · 4 comments
Labels: Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement

Comments

@mitar
Contributor

mitar commented Apr 23, 2018

Code Sample, a copy-pastable example if possible

>>> import pandas
>>> df = pandas.DataFrame({'ints': range(5)})
>>> type(df.iloc[:, 0][0])
<class 'numpy.int64'>
>>> type(list(df.iloc[:, 0])[0])
<class 'int'>

Problem description

When iterating over a Series, values are converted to the closest Python type instead of keeping their original type. This is problematic because the change is surprising and leads to data manipulation one does not expect. One would hope that iterating returns exactly what was stored in the DataFrame, not a variation of it.

The cause is that iteration internally uses numpy's tolist, which has this behavior. This is problematic for two reasons: it constructs the whole list in memory first (#20783), and it converts values to the "closest Python type".
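
To illustrate the NumPy behavior described above, here is a minimal sketch comparing tolist with direct iteration over an array. It only demonstrates NumPy itself. Whether a given pandas version routes Series iteration through tolist is taken from the report above, not verified here.

```python
import numpy as np

arr = np.arange(5, dtype=np.int64)

# ndarray.tolist() builds a complete Python list and converts every
# element to the closest built-in Python type.
as_list = arr.tolist()
print(type(as_list[0]))  # <class 'int'>

# Iterating over the ndarray directly yields NumPy scalars instead,
# so the original dtype is kept.
first = next(iter(arr))
print(type(first))  # <class 'numpy.int64'>
```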

Expected Output

>>> df = pandas.DataFrame({'ints': range(5)})
>>> type(df.iloc[:, 0][0])
<class 'numpy.int64'>
>>> type(list(df.iloc[:, 0])[0])
<class 'numpy.int64'>

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.3.0
pip: 9.0.1
setuptools: 38.2.3
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.9.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche
Member

As I said on gitter, we have been consistently changing this in several conversion functions to always return native python types if possible:

In [44]: s = pd.Series([1, 2, 3])

In [45]: type([i for i in s][0])
Out[45]: int

In [46]: type(list(s)[0])
Out[46]: int

In [47]: type(s.to_dict()[0])
Out[47]: int

Since many of list(..), to_dict, etc. rely on iterating, I suppose we decided to let iterating return native Python types.

To have some context: #13258, #17491, #16048, and many others.

I agree this can certainly be surprising, and the docs could probably explain it better. We could also think about ways to give users an option, ...

But as I also said on gitter, it would be good to see some actual reasons for caring about this (the typing one is good). If performance is your concern, you should try to avoid iterating over a Series at all.
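
As a rough sketch of the advice above: a vectorized operation avoids Python-level iteration entirely. If element-wise access is unavoidable, one possible workaround (an assumption on my part, not something proposed in this thread) is to iterate over the underlying ndarray via .values, which yields NumPy scalars for a plain int64 Series.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3], dtype=np.int64))

# Vectorized arithmetic: no Python loop, and the dtype is preserved.
doubled = s * 2
print(doubled.dtype)  # int64

# Iterating over the underlying ndarray (not the Series itself)
# yields NumPy scalars rather than built-in Python ints.
raw_types = {type(v) for v in s.values}
print(raw_types)
```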

@jreback
Contributor

jreback commented Apr 23, 2018

Several other related issues: #13468, #13236, #15385. Closing as a duplicate of #13468.

@jreback jreback closed this as completed Apr 23, 2018
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions labels Apr 23, 2018
@jreback jreback modified the milestones: 0.23.1, No action Apr 23, 2018
@mitar
Contributor Author

mitar commented Apr 23, 2018

The issues you linked above mostly talk about iterrows. To me that one is less problematic, because you are getting a Series anyway, so you know it is a pandas object and that things are being converted; all dtypes have to be upcast anyway to be stored in one Series across columns. This is well documented as well:

Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).

But the important thing is that they are upcast to a common dtype, not to a Python scalar. Even the documentation shows this:

>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64
>>> print(df['int'].dtype)
int64

The snippet above is taken from the documentation for iterrows. As you can see, the values are float64, not Python float. For me this behavior is reasonable, normal, and expected. It is the same as what happens in numpy when you store values of different dtypes in the same array: they have to be upcast to a common dtype. Series has the same semantics, since it can hold only one dtype. And if you call df.values, the same thing happens.

My issue is that the user then asks: OK, how can I get values without this conversion? Oh, there is itertuples, great, let's use that. But then, instead of conversion to a common dtype, you get conversion to Python scalars. That may be fine, but it leaves no way to get rows out whose types match the dtypes stored in the DataFrame.
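
The upcasting described above can be seen in a small sketch: both a mixed NumPy array and df.values on a mixed int/float DataFrame end up as float64, and the elements remain NumPy scalars. The column names and values here are illustrative only.

```python
import numpy as np
import pandas as pd

# NumPy upcasts mixed int/float input to a common dtype.
mixed = np.array([1, 1.5])
print(mixed.dtype)  # float64

# A DataFrame keeps one dtype per column ...
df = pd.DataFrame({
    'int': np.array([1], dtype=np.int64),
    'float': np.array([1.5], dtype=np.float64),
})

# ... but .values has to fit everything into a single array, so the
# int column is upcast to float64. Elements are still NumPy scalars.
values = df.values
print(values.dtype)        # float64
print(type(values[0, 0]))  # <class 'numpy.float64'>
```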

My proposal would be to change the itertuples to:

itertuples(self, index=True, name="Pandas", raw=False):

If raw (or some similarly named parameter) were set to True, it would return values whose types match the dtypes. I think this is a backwards-compatible change, and it lets one really get types that match what dtypes claims.
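
To make the proposal concrete, here is a minimal sketch of what such dtype-preserving row iteration could look like as a standalone helper. The name iter_raw_tuples is hypothetical. It is not part of pandas, and it is not how itertuples is implemented. It assumes string column names and simply zips over each column's underlying ndarray.

```python
from collections import namedtuple

import numpy as np
import pandas as pd


def iter_raw_tuples(df, index=True, name="Pandas"):
    """Hypothetical helper: yield rows as namedtuples of NumPy scalars."""
    # One ndarray per column; positional access also copes with
    # duplicate column names.
    columns = [df.iloc[:, i].values for i in range(df.shape[1])]
    fields = [str(c) for c in df.columns]
    if index:
        columns.insert(0, df.index.values)
        fields.insert(0, "Index")
    # rename=True replaces invalid or duplicate field names.
    row_type = namedtuple(name, fields, rename=True)
    # Iterating over ndarrays yields NumPy scalars, so each value keeps
    # the dtype of its column.
    for row in zip(*columns):
        yield row_type(*row)


df = pd.DataFrame({
    'ints': np.array([1, 2], dtype=np.int64),
    'floats': np.array([1.5, 2.5], dtype=np.float64),
})
first_row = next(iter_raw_tuples(df, index=False))
print(type(first_row.ints), type(first_row.floats))
```

For object columns this sketch returns whatever Python objects are stored, since there is no NumPy scalar type to preserve.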

@jorisvandenbossche jorisvandenbossche removed this from the No action milestone Apr 23, 2018
@mitar
Contributor Author

mitar commented Apr 23, 2018

I am realizing there is a deeper issue with iterrows as well, and I reported it in the issue for iterrows.


4 participants