Indexing into a dataframe re-casts / "forgets" dtypes #16508

makmanalp · 2017-05-25T21:40:03Z

Code Sample, a copy-pastable example if possible

In [62]: df = pd.DataFrame([(float(x) for x in range(0, 10)), (float(x) for x in range(10,20))])

In [63]: df
Out[63]:
      0     1     2     3     4     5     6     7     8     9
0   0.0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0   9.0
1  10.0  11.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0  19.0

In [64]: df[0]
Out[64]:
0     0.0
1    10.0
Name: 0, dtype: float64

In [65]: df[0].astype(int)
Out[65]:
0     0
1    10
Name: 0, dtype: int64

In [66]: df[0] = df[0].astype(int)

In [67]: df
Out[67]:
    0     1     2     3     4     5     6     7     8     9
0   0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0   9.0
1  10  11.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0  19.0

In [68]: df.iloc[0]
Out[68]:
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: 0, dtype: float64

Problem description

After I reassign column 0 as int, I expect it to stay int, and in the dataframe it appears to. But when I select a row with .iloc, the value comes back as a float!

This is the narrowed-down version of a more insidious problem. Instead of doing an iloc[], I was running a .apply(f) on a dataframe, and the dtypes of the resulting dataframe were all messed up even though the function f wasn't doing anything discernible with the types.

Current workaround is to re-cast all the types in f, but that can get frustrating real quick depending on the number of columns.
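A minimal sketch of a variant of that workaround, assuming the function does not introduce NaN or fractional values: rather than re-casting inside f, re-cast the result once from the original frame's dtypes. The frame, the column names, and the identity function `f` are hypothetical stand-ins for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 10], "b": [1.0, 11.0]})


def f(row):
    # Hypothetical stand-in for a function that does not touch the types.
    return row


# Row-wise apply hands each row to f as a single Series, so the int
# column can come back upcast.
result = df.apply(f, axis=1)

# Re-cast once, using the original frame's dtypes, instead of inside f.
restored = result.astype(df.dtypes.to_dict())
print(restored.dtypes)
```

This keeps the cast in one place regardless of the number of columns, but it only works when the result has the same columns as the input.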

Expected Output

I expect the row to be a mixed dtype object, with the dtype of each cell matching that of the column:

In [68]: df.iloc[0]
Out[68]:
0    0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: 0, dtype: object

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: 1.5.1
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.0.9
pandas_gbq: None
pandas_datareader: None

jreback · 2017-05-25T21:55:01Z

duplicate of #12859

This is as expected: mixed int and float values get upcast to float. In particular, a cross section can upcast because it cuts across mixed dtypes (you won't get an upcast if the frame has a single dtype).

I don't think there is an easy way to get around this, though for perf reasons you never want this to be object.

makmanalp · 2017-05-25T22:07:57Z

@jreback I see. More philosophically, if we had a string or datetime or other dtype column in there, wouldn't the dtype of the result of the .iloc[] necessarily have to be object? It seems like that's the more sensible thing to happen. Besides, the implicit upcasting behavior on .apply or .iloc or .to_dict is surprising and not easy to track down.

Or is this upcasting behavior considered normal because .iloc[] returns a series, so in which case we've flipped things sideways and switched from the multi-columnar (and thus multi-dtype) dataframe format to a single "column" (which has one dtype, unless we make it be "object")?

I hit the .to_dict() version almost immediately after :-) I'm trying to clean out float numbers that are polluting some json output, essentially, and these are two things that I ran into back to back.
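For the JSON case, one possible approach (a sketch, not something proposed in this thread) is to build the records column-wise with `itertuples`, so that no row Series is ever created. The frame and column names below are made up for illustration, and the column names need to be valid Python identifiers for `_asdict()` to keep them.

```python
import json

import pandas as pd

df = pd.DataFrame({"a": [0, 10], "b": [1.0, 11.0]})

# itertuples walks the columns in parallel, so each value keeps the
# type of its own column instead of a row-wide common dtype.
records = [row._asdict() for row in df.itertuples(index=False)]

payload = json.dumps(records)
print(payload)
```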

jreback · 2017-05-25T22:30:12Z

if you had any other mixed dtypes it would upcast appropriately (to object if needed)

this comes fundamentally from numpy

In [1]: np.array([1, 1.0])
Out[1]: array([ 1.,  1.])

since holding mixed dtypes in a 1-d array is not generally supported, this is how it is.
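The same behavior can be reproduced on a small frame. This is an illustrative sketch with made-up column names: the int and float columns share a common numeric dtype, so the row comes back as float64, while adding a string column leaves object as the only dtype that can hold the row.

```python
import numpy as np
import pandas as pd

# NumPy's promotion rule for int64 and float64.
print(np.result_type(np.int64, np.float64))

df = pd.DataFrame({"a": [0, 10], "b": [1.0, 11.0]})
print(df.dtypes)         # per-column dtypes are kept in the frame
numeric_row = df.iloc[0]
print(numeric_row.dtype)  # the row is one Series with one dtype

# Adding a string column leaves no common numeric dtype for the row.
df["c"] = ["x", "y"]
mixed_row = df.iloc[0]
print(mixed_row.dtype)
```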

makmanalp · 2017-05-25T22:48:32Z

OK, thank you!

jreback closed this as completed May 25, 2017

jreback added the Dtype Conversions and Indexing labels May 25, 2017

jreback added this to the No action milestone May 25, 2017

jreback added the Duplicate Report label May 25, 2017
