Indexing into a dataframe re-casts / "forgets" dtypes #16508

Closed
makmanalp opened this issue May 25, 2017 · 4 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@makmanalp
Contributor

Code Sample, a copy-pastable example if possible

In [62]: df = pd.DataFrame([(float(x) for x in range(0, 10)), (float(x) for x in range(10,20))])

In [63]: df
Out[63]:
      0     1     2     3     4     5     6     7     8     9
0   0.0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0   9.0
1  10.0  11.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0  19.0

In [64]: df[0]
Out[64]:
0     0.0
1    10.0
Name: 0, dtype: float64

In [65]: df[0].astype(int)
Out[65]:
0     0
1    10
Name: 0, dtype: int64

In [66]: df[0] = df[0].astype(int)

In [67]: df
Out[67]:
    0     1     2     3     4     5     6     7     8     9
0   0   1.0   2.0   3.0   4.0   5.0   6.0   7.0   8.0   9.0
1  10  11.0  12.0  13.0  14.0  15.0  16.0  17.0  18.0  19.0

In [68]: df.iloc[0]
Out[68]:
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: 0, dtype: float64

Problem description

After I reassign the 0th column as int, I expect it to be int, and it appears to be that way. But when I do a .iloc on the dataframe, the value seems to turn back into a float somehow!

This is the narrowed-down version of a more insidious problem: instead of doing an iloc[], I was running .apply(f) on a dataframe, and the dtypes of the resulting dataframe were all messed up even though the function f wasn't doing anything discernible with the types.

Current workaround is to re-cast all the types in f, but that can get frustrating real quick depending on the number of columns.
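For reference, a minimal sketch of the `.apply` variant and one possible way around it, assuming a recent pandas release. The frame and its column names (`a`, `b`) are made up for illustration, and the identity lambda stands in for `f`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame: one int column, one float column.
df = pd.DataFrame({"a": [0, 10], "b": [1.0, 11.0]})

# Row-wise apply hands each row to the function as a Series,
# so an int/float row arrives already upcast to float64.
upcast = df.apply(lambda row: row, axis=1)
print(upcast.dtypes)

# Casting to object first lets each cell keep its own type inside the row;
# infer_objects() then recovers per-column dtypes from the result.
kept = df.astype(object).apply(lambda row: row, axis=1).infer_objects()
print(kept.dtypes)
```

This trades speed for type fidelity, since object columns are slower than native numeric ones, so it probably only makes sense for small frames.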

Expected Output

I expect the row to be a mixed dtype object, with the dtype of each cell matching that of the column:

In [68]: df.iloc[0]
Out[68]:
0    0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: 0, dtype: object
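One way to get a row shaped like this expected output today, sketched under the assumption that the performance cost of object dtype is acceptable, is to cast the frame to object before slicing. The three-column frame below is a shortened stand-in for the one in the report.

```python
import pandas as pd

# Shortened stand-in for the frame above: column 0 is int, the rest are float.
df = pd.DataFrame({0: [0, 10], 1: [1.0, 11.0], 2: [2.0, 12.0]})

# With every column stored as object, the cross section has nothing to upcast.
row = df.astype(object).iloc[0]
print(row.dtype)
print([type(v).__name__ for v in row])
```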

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: 1.5.1
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.0.9
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented May 25, 2017

duplicate of #12859

This is as expected: mixed int-float gets upcast to float. In particular, on a cross section you can get upcasting generally, as you are cutting across mixed dtypes (you won't get upcast if it's a single dtype).
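A small sketch of the distinction described here, using two made-up frames (`mixed` and `ints` are illustrative names, not from the report):

```python
import pandas as pd

# One int column and one float column: a row cuts across both dtypes.
mixed = pd.DataFrame({"i": [0, 10], "f": [1.0, 11.0]})

# Two int columns: a row stays within a single dtype.
ints = pd.DataFrame({"x": [0, 10], "y": [1, 11]})

print(mixed.iloc[0].dtype)  # float64
print(ints.iloc[0].dtype)   # int64
```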

I don't think there is an easy way to get around this, though for perf reasons you never want this to be object.

@jreback jreback closed this as completed May 25, 2017
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves labels May 25, 2017
@jreback jreback added this to the No action milestone May 25, 2017
@jreback jreback added the Duplicate Report Duplicate issue or pull request label May 25, 2017
@makmanalp
Contributor Author

makmanalp commented May 25, 2017

@jreback I see. More philosophically, if we had a string or datetime or other dtype column in there, wouldn't the dtype of the result of the .iloc[] necessarily have to be object? It seems like that's the more sensible thing to happen. Besides, the implicit upcasting behavior on .apply or .iloc or .to_dict is surprising and not easy to track down.

Or is this upcasting behavior considered normal because .iloc[] returns a series, in which case we've flipped things sideways and switched from the multi-columnar (and thus multi-dtype) dataframe format to a single "column" (which has one dtype, unless we make it be "object")?

I hit the .to_dict() version almost immediately after :-) I'm trying to clean out float numbers that are polluting some json output, essentially, and these are two things that I ran into back to back.
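For the JSON use case, a hedged sketch of building per-row dicts without going through a row Series, assuming a recent pandas release. The frame and its column names (`id`, `value`) are hypothetical; `.to_dict()` itself is left out because its behavior may differ between pandas versions.

```python
import pandas as pd

# Hypothetical frame with an int column that should stay int in the output.
df = pd.DataFrame({"id": [0, 10], "value": [1.5, 11.5]})

# iterrows() builds a Series per row, so "id" is upcast with the rest of the row.
via_rows = [row.to_dict() for _, row in df.iterrows()]

# itertuples() walks the columns in parallel, so each value keeps its column's type.
via_tuples = [dict(zip(df.columns, t)) for t in df.itertuples(index=False, name=None)]

print(via_rows[0])
print(via_tuples[0])
```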

@jreback
Contributor

jreback commented May 25, 2017

if you had any other mixed dtypes it would upcast appropriately (to object if needed)

this comes fundamentally from numpy

In [1]: np.array([1, 1.0])
Out[1]: array([ 1.,  1.])

since holding mixed dtypes in a 1-d array is not generally supported, this is how it is.
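The promotion rule can be checked directly. This is a sketch that assumes current NumPy and pandas; the frame with a string column is made up to illustrate the "to object if needed" case.

```python
import numpy as np
import pandas as pd

# NumPy's common type for int64 and float64 is float64.
print(np.result_type(np.int64, np.float64))  # float64

# An explicit object array is the only way to keep both element types in 1-d.
obj = np.array([1, 1.0], dtype=object)
print([type(v).__name__ for v in obj])  # ['int', 'float']

# With a non-numeric column present, the cross section falls back to object.
df = pd.DataFrame({"n": [1, 2], "s": ["a", "b"]})
print(df.iloc[0].dtype)  # object
```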

@makmanalp
Contributor Author

OK, thank you!


2 participants