Inconsistency on dtype and fillna #49251

borisiskra · 2022-10-22T20:10:12Z

Python 3.9.7
Pandas: 1.3.5

Same result in
Python 3.9.12
Pandas 1.4.2

I use the following dictionary to create a DataFrame. All numbers are integers, 'drop' is a boolean, and 'status' is a string:

dct = {
 'name1': {'grade1': {'drop': True,
   'q1': 955,
   'q2': 7633,
   'q3': 0,
   'q4': 1670,
   'q5': 5963,
   'q6': 53384,
   'e1': 1535065,
   'e2': 0,
   'e3': 432747,
   'e4': 1102318},
  'grade2': {'drop': True,
   'q1': 54,
   'e1': 507,
   'e2': 0,
   'e3': 37,
   'e4': 470,
   'status': 'bad'}},
 'name2': {'grade1': {'drop': False,
   'q1': 70,
   'e1': 21706,
   'q2': 730,
   'e1': 317792,
   'status': 'good'},
  'grade2': {'drop': True,
   'q1': 11,
   'e4': 6414,
   'e1': 0,
   'e2': 605,
   'e3': 5809}}
      }

Using the following code:

import pandas as pd

df = pd.concat({k: pd.DataFrame.from_dict(v, 'index', dtype=str).fillna('EMPTY') for k, v in dct.items()}, axis=0, sort=False).reset_index()

I get the following DataFrame:

	level_0	level_1	drop	q1	q2	q3	q4	q5	q6	e1	e2	e3	e4	status
0	name1	grade1	True	955	7633.0	0.0	1670.0	5963.0	53384.0	1535065	0	432747	1102318	EMPTY
1	name1	grade2	True	54	EMPTY	EMPTY	EMPTY	EMPTY	EMPTY	507	0	37	470	bad
2	name2	grade1	False	70	730.0	NaN	NaN	NaN	NaN	317792	EMPTY	EMPTY	EMPTY	good
3	name2	grade2	True	11	EMPTY	NaN	NaN	NaN	NaN	0	605.0	5809.0	6414.0	EMPTY

In row 0, q3 and e2 were both 0; now one is 0 and the other is 0.0.
In row 0, q1 = 955 is shown as an int but q2 = 7633.0 as a float, even though both were integers before being converted to strings.
In rows 2 and 3, some NaN values were filled with "EMPTY" and others were not.

rhshadrach · 2022-10-22T21:26:50Z

Thanks for the report! If you do

for k, v in dct.items():
    print(pd.DataFrame.from_dict(v, 'index', dtype=str))

you get

        drop   q1      q2   q3      q4  ...       e1 e2      e3       e4 status
grade1  True  955  7633.0  0.0  1670.0  ...  1535065  0  432747  1102318    NaN
grade2  True   54     NaN  NaN     NaN  ...      507  0      37      470    bad

[2 rows x 12 columns]
         drop  q1      e1     q2 status      e4     e2      e3
grade1  False  70  317792  730.0   good     NaN    NaN     NaN
grade2   True  11       0    NaN    NaN  6414.0  605.0  5809.0

The call to from_dict is enforcing the dtype string, but only after DataFrame construction. In other words, the values you specify are first coerced to float because of the null values, and only then converted to strings. This seems appropriate to me: users can convert individual values beforehand if they so choose, leaving dtype=... to act only after the DataFrame is constructed.

Your other source of null values arises because there is, e.g., no column q3 in the 2nd DataFrame. These nulls are introduced by concat, which happens after the call to fillna, so this should be expected.
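
To make the order of operations concrete, here is a minimal sketch using a trimmed-down, hypothetical version of the dictionary from the report (the name `rows` and the reduced set of keys are assumptions made for illustration). It shows that a column with a missing entry is inferred as float during construction, and that converting the values to strings before construction avoids the trailing ".0":

```python
import pandas as pd

# Hypothetical, trimmed-down input: grade2 has no "q2" entry.
rows = {
    "grade1": {"q1": 955, "q2": 7633},
    "grade2": {"q1": 54},
}

# Construction with inferred dtypes: q1 has no missing values and stays
# integer, while q2 is coerced to float to hold the missing entry.
inferred = pd.DataFrame.from_dict(rows, orient="index")
print(inferred.dtypes)

# Converting each value to str *before* construction keeps "7633" as
# written, because no numeric coercion ever takes place.
rows_as_text = {
    name: {col: str(val) for col, val in cols.items()}
    for name, cols in rows.items()
}
pre_converted = pd.DataFrame.from_dict(rows_as_text, orient="index")
print(pre_converted.loc["grade1", "q2"])
```

The pre-conversion step is only one possible workaround; the missing entry for grade2 is still null afterwards and would still need a fillna call.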

rhshadrach · 2022-10-22T21:27:30Z

Assuming current behavior is to remain as is, perhaps this can be better documented as "Data type to force after DataFrame construction, otherwise infer."

borisiskra · 2022-10-22T21:52:41Z

If 'the values you specify are being coerced to float because of the null values', then why, in the first row, are some values integers (955, 1535065, 0, 432747, 1102318) and others floats (7633.0, 0.0, 1670.0)?

I tried the first row without a missing value (added 'status': 'bad') and got the same result.

rhshadrach · 2022-10-22T22:01:09Z

If a value isn't missing in any row of a column, then there is no coercion to float - the values remain integers. If nulls appear because of the concat, the existing values are not coerced to floats because at this point they are strings and not integers.

You're combining three operations here - (a) DataFrame construction via from_dict; (b) fillna; and (c) concat. It would be helpful to take these one at a time rather than altogether. If you think there is an unexpected result in one or more of them, report the input and output to just that step alone and describe what you're finding to be unexpected.
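
As a rough illustration of looking at the concat step alone, the sketch below uses two small, hypothetical frames (`first` and `second` are made-up stand-ins for the per-name frames after dtype=str and fillna have already been applied). It assumes the goal is simply to have no nulls left in the final result, in which case calling fillna once after concat is one option:

```python
import pandas as pd

# Hypothetical stand-ins for the already-filled, string-valued frames.
first = pd.DataFrame(
    {"q1": ["955", "54"], "q3": ["0.0", "EMPTY"]},
    index=["grade1", "grade2"],
)
second = pd.DataFrame({"q1": ["70", "11"]}, index=["grade1", "grade2"])

# concat aligns on columns; `second` has no q3, so its rows get nulls there
# even though each input frame was free of nulls.
combined = pd.concat({"name1": first, "name2": second}, axis=0, sort=False)
missing_after_concat = int(combined["q3"].isna().sum())
print(missing_after_concat)

# Filling once, after concat, also covers the nulls that concat introduced.
filled = combined.fillna("EMPTY").reset_index()
print(filled)
```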

borisiskra · 2022-10-22T22:25:15Z

I see, the "issue" is column-wise not row-wise, now I understand what is going on.

rhshadrach · 2022-10-22T22:28:59Z

Yes - that's correct. Each column in a pandas DataFrame has a single dtype - you can see what the dtype is for each column with DataFrame.dtypes. Object dtype can hold any Python object but will be much slower in computations. You should prefer bool / int / float dtypes when possible.
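
A small sketch of inspecting per-column dtypes, using a hypothetical frame with a few of the column names from the report (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame: one column per kind of value discussed above.
df = pd.DataFrame(
    {
        "drop": [True, False],    # booleans, nothing missing
        "q1": [955, 54],          # integers, nothing missing
        "q2": [7633, None],       # integers with one missing entry
        "status": ["bad", None],  # strings with one missing entry
    }
)

# One dtype per column: q1 stays integer, while q2 is stored as float
# because the missing entry has to be represented as NaN.
print(df.dtypes)
```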

If there is no remaining issue for you here, please feel free to close.

rhshadrach · 2022-10-22T22:29:31Z

On second thought, I think the docs could be improved a little here as noted in #49251 (comment). Leaving this open.

EleekaN · 2022-10-23T09:52:41Z

Hi, how can I work on the documentation related to this issue?
Sorry, I am new here, trying to figure out. Any direction will be appreciated.
Thank you.

rhshadrach · 2022-10-23T11:38:42Z

For general information on contributing, I would recommend: https://pandas.pydata.org/pandas-docs/dev/development/index.html

For this specific issue, #49251 (comment) describes how I think the docs could be improved.

EleekaN · 2022-10-23T14:11:10Z

@rhshadrach Thank you so much. I will explore more from the link. :-)

natmokval · 2022-10-25T14:28:54Z

take

rhshadrach added the Needs Discussion, Constructors, and Dtype Conversions labels Oct 22, 2022

rhshadrach closed this as completed Oct 22, 2022

rhshadrach reopened this Oct 22, 2022

rhshadrach added the Docs and good first issue labels and removed the Needs Discussion label Oct 22, 2022

github-actions bot assigned natmokval Oct 25, 2022

natmokval mentioned this issue Oct 26, 2022

update from_dict docstring #49332

Merged

1 task

mroeschke closed this as completed in #49332 Oct 26, 2022
