variable dtype does not update when populating a dataframe #25294

Closed
stragu opened this issue Feb 13, 2019 · 4 comments
Labels
Duplicate Report (Duplicate issue or pull request), Usage Question

Comments

@stragu
Contributor

stragu commented Feb 13, 2019

I posted a question about this on StackOverflow, but thought it might be something worth reporting here.

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame(columns = ("Name", "Age"))
df.loc[1] = "Jane", 5
df.loc[2] = "Riley", 24
df.dtypes

Problem description

In my test:

  • with Python 3.5 and Pandas 0.18.1, populating the dataframe does update the "object" dtype of the Age variable to "float64"
  • with Python 3.7 and Pandas 0.23.4, populating the dataframe does not update the "object" dtype of the Age variable

Why is that? I couldn't find an explanation in the documentation.

Expected Output

The dtype of an empty variable gets updated when populating it for the first time, similarly to what infer_objects() does.
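For illustration, here is a minimal sketch of the comparison described above: the same row-by-row population, followed by an explicit call to infer_objects(). The variable name `inferred` is introduced here for the example only, and the dtypes printed for `df` itself may differ between pandas versions, as the problem description notes.

```python
import pandas as pd

# Populate an empty DataFrame row by row, as in the code sample above.
df = pd.DataFrame(columns=["Name", "Age"])
df.loc[1] = "Jane", 5
df.loc[2] = "Riley", 24

# infer_objects() returns a new DataFrame with object columns converted
# to a more specific dtype where possible; df itself is left unchanged.
inferred = df.infer_objects()

print(df.dtypes)        # dtypes after population (version-dependent)
print(inferred.dtypes)  # dtypes after explicit inference
```

Calling infer_objects() once, after all rows have been added, avoids relying on the version-specific behaviour of the assignment itself.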

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-45-lowlatency
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0.23.4
pytest: 4.0.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.11
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Feb 13, 2019

Hmm well this way of populating a DataFrame is not idiomatic and inferring the intended result is nearly impossible. Understood this is a toy example, but the construction here should be done in one expression if you want to be explicit about dtypes.

While ambiguous, the previous behavior is in any case not correct; nothing in what you are doing indicates that you want floats, especially since you are inserting int values.

@stragu
Contributor Author

stragu commented Feb 13, 2019

What do you mean by "inferring the intended result is nearly impossible"? It is impossible from the column names alone, of course, but not from the values added to them; previous versions apparently did something like that. The dtype could be updated as the column gets populated (although that would probably be deemed inefficient): add an int and the dtype is changed to int; add a number with a decimal point and the column is coerced to float; then add a string and the dtype is changed to str (or object, I guess, which is the more general dtype?).

What would then be the recommended way to populate an empty DataFrame in Pandas, for example in a loop? Populating series and constructing the DataFrame at the end, or creating an empty dataframe with columns and a specific dtype for each variable?

df = pd.DataFrame(columns = ("Name", "Age"), dtype = (str, int)) does not seem to work.
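The dtype argument of the DataFrame constructor takes a single dtype rather than one per column, which is why the call above fails. One possible workaround, sketched here as an assumption rather than as something recommended in this thread, is to pass one empty Series per column, each with its own dtype:

```python
import pandas as pd

# Each column's dtype is declared on an empty Series, because the
# DataFrame constructor's dtype argument applies to the whole frame.
df = pd.DataFrame({
    "Name": pd.Series(dtype="object"),
    "Age": pd.Series(dtype="int64"),
})

print(df.dtypes)
```

Whether rows assigned later with .loc keep these dtypes may depend on the pandas version, so this only fixes the dtypes of the empty frame.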

Thanks for the quick reply! And very sorry for my limited experience with Pandas – and Python in general.

@WillAyd WillAyd added the Duplicate Report Duplicate issue or pull request label Feb 13, 2019
@WillAyd
Member

WillAyd commented Feb 13, 2019

Well, your ideal state is probably what is described in #4464, so I'm going to close this as a duplicate.

You have a few other things in there, but I'll say that, generally, appending to a DataFrame is very expensive. Your best approach is usually to construct the entire DataFrame from a sequence of values rather than creating an empty DataFrame and continually appending. Since the constructor's dtype argument does not accept multiple values, you can use the .astype method after construction.
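A minimal sketch of that approach, assuming the rows are produced in a loop: collect them in a plain Python list, build the DataFrame once, then set per-column dtypes with .astype. The names `records` and `rows` and the choice of float64 for Age are illustrative only.

```python
import pandas as pd

# Hypothetical input; in practice the loop body would compute each row.
records = [("Jane", 5), ("Riley", 24)]

# Appending to a list is cheap; the DataFrame is built once at the end.
rows = []
for name, age in records:
    rows.append({"Name": name, "Age": age})

df = pd.DataFrame(rows, columns=["Name", "Age"])

# astype accepts a mapping from column name to dtype.
df = df.astype({"Age": "float64"})

print(df.dtypes)
```

Building the frame in one step also lets pandas infer a dtype for each column from all of its values at once.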

For future usage questions, we ask that you turn to StackOverflow, as this tracker is for enhancement requests and bugs. SO will be a much better forum for Q&A on usage and will help other users with the same question more than this issue tracker could.

@WillAyd WillAyd closed this as completed Feb 13, 2019
@stragu
Contributor Author

stragu commented Feb 13, 2019

No problem, thank you @WillAyd , I appreciate it.
