variable dtype does not update when populating a dataframe #25294

stragu · 2019-02-13T01:47:04Z

I posted a question about this on StackOverflow, but thought it might be something worth reporting here.

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame(columns = ("Name", "Age"))
df.loc[1] = "Jane", 5
df.loc[2] = "Riley", 24
df.dtypes

Problem description

In my test:

with Python 3.5 and Pandas 0.18.1, populating the dataframe does update the "object" dtype of the Age variable to "float64"
with Python 3.7 and Pandas 0.23.4, populating the dataframe does not update the "object" dtype of the Age variable

Why is that? I couldn't find an explanation in the documentation.

Expected Output

The dtype of an empty variable gets updated when it is populated for the first time, similarly to what infer_objects() does.
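For reference, a minimal sketch of the `infer_objects()` behaviour mentioned above, reusing the frame from the code sample. The name `converted` is only illustrative, and whether `Age` is still `object` before the call depends on the pandas version, as described in this report.

```python
import pandas as pd

# Same frame as in the code sample: created empty, then filled row by row.
df = pd.DataFrame(columns=("Name", "Age"))
df.loc[1] = "Jane", 5
df.loc[2] = "Riley", 24

# infer_objects() re-examines object columns and picks a more specific
# dtype where every value allows it; it returns a new DataFrame.
converted = df.infer_objects()
print(converted.dtypes)
```

After the call, `Age` holds an integer dtype, while `Name` stays a text column.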

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-45-lowlatency
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0.23.4
pytest: 4.0.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.11
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


WillAyd · 2019-02-13T01:59:12Z

Hmm, well, this way of populating a DataFrame is not idiomatic, and inferring the intended result is nearly impossible. Understood that this is a toy example, but the construction here should be done in one expression if you want to be explicit about dtypes.

While ambiguous, the previous behavior is in any case not correct; nothing in what you are doing indicates that you want floats, especially since you are inserting int values.

stragu · 2019-02-13T06:32:20Z

What do you mean by "inferring the intended result is nearly impossible"? Not from the names of the columns when creating the DataFrame, of course, but from the values that are added to them; something like that was apparently done in previous versions. The dtype could be updated as the column gets populated (although it would probably be deemed inefficient): add an int, and the dtype is changed to int; add a number with a decimal point, and it is coerced to float; then add a string, and the dtype is changed to str (or object I guess, which is the more general dtype?).

What would then be the recommended way to populate an empty DataFrame in Pandas, for example in a loop? Populating series and constructing the DataFrame at the end, or creating an empty dataframe with columns and a specific dtype for each variable?

df = pd.DataFrame(columns = ("Name", "Age"), dtype = (str, int)) does not seem to work.
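The constructor's `dtype` argument is documented as accepting a single dtype, which is why the tuple form fails. One possible workaround, sketched here as an assumption rather than an official recommendation, is to build the empty frame from one empty Series per column; the name `empty` is illustrative only.

```python
import pandas as pd

# Each column gets its own dtype through an empty Series.
empty = pd.DataFrame({
    "Name": pd.Series(dtype="object"),
    "Age": pd.Series(dtype="int64"),
})

print(empty.dtypes)
```

This only fixes the dtypes of the empty frame itself; rows added later can still change them.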

Thanks for the quick reply! And very sorry for my limited experience with Pandas – and Python in general.

WillAyd · 2019-02-13T08:07:34Z

Well, your ideal state is probably what is described in #4464, so I'm going to close this as a duplicate.

You have a few other things in there, but I'll say that, generally, appending to a DataFrame is very expensive. Your best approach is usually to construct the entire DataFrame from a sequence of values rather than creating an empty DataFrame and continually appending. In lieu of dtype accepting multiple values in the constructor, you can use the .astype method after construction.

For further and future usage questions, we ask that you turn to StackOverflow, as this tracker is for enhancement requests and bugs. SO will be a much better forum for Q&A on usage and will help other users with the same question more than this issue tracker could.
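A minimal sketch of that approach, assuming the rows arrive one at a time in a loop: collect them in a plain Python list, build the DataFrame once, then set per-column dtypes with `.astype`. The names `rows` and `source` are illustrative only.

```python
import pandas as pd

# Hypothetical input; in practice this would come from the loop's own logic.
source = [("Jane", 5), ("Riley", 24)]

rows = []
for name, age in source:
    # Appending to a list is cheap compared with growing a DataFrame.
    rows.append({"Name": name, "Age": age})

# Build the DataFrame in one expression from the collected records.
df = pd.DataFrame(rows, columns=["Name", "Age"])

# .astype accepts a mapping of column name to dtype.
df = df.astype({"Name": str, "Age": "int64"})
print(df.dtypes)
```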

stragu · 2019-02-13T12:29:54Z

No problem, thank you @WillAyd , I appreciate it.

WillAyd added the Usage Question label Feb 13, 2019

WillAyd added the Duplicate Report label Feb 13, 2019

WillAyd closed this as completed Feb 13, 2019
