
Inconsistency on dtype and fillna #49251


Closed
borisiskra opened this issue Oct 22, 2022 · 11 comments · Fixed by #49332
Labels: Constructors (Series/DataFrame/Index/pd.array Constructors), Docs, Dtype Conversions (Unexpected or buggy dtype conversions), good first issue


@borisiskra

Python 3.9.7
Pandas: 1.3.5

Same result in
Python 3.9.12
Pandas 1.4.2

Using the following dictionary to create a df, all numbers are integers, 'drop' is boolean and 'status' is a string:

dct = {
 'name1': {'grade1': {'drop': True,
   'q1': 955,
   'q2': 7633,
   'q3': 0,
   'q4': 1670,
   'q5': 5963,
   'q6': 53384,
   'e1': 1535065,
   'e2': 0,
   'e3': 432747,
   'e4': 1102318},
  'grade2': {'drop': True,
   'q1': 54,
   'e1': 507,
   'e2': 0,
   'e3': 37,
   'e4': 470,
   'status': 'bad'}},
 'name2': {'grade1': {'drop': False,
   'q1': 70,
   'e1': 21706,
   'q2': 730,
   'e1': 317792,
   'status': 'good'},
  'grade2': {'drop': True,
   'q1': 11,
   'e4': 6414,
   'e1': 0,
   'e2': 605,
   'e3': 5809}}
      }

Using the following code:

import pandas as pd

df = pd.concat({k: pd.DataFrame.from_dict(v, 'index', dtype=str).fillna('EMPTY') for k, v in dct.items()}, axis=0, sort=False).reset_index()

I get the following DataFrame:

	level_0	level_1	drop	q1	q2	q3	q4	q5	q6	e1	e2	e3	e4	status
0	name1	grade1	True	955	7633.0	0.0	1670.0	5963.0	53384.0	1535065	0	432747	1102318	EMPTY
1	name1	grade2	True	54	EMPTY	EMPTY	EMPTY	EMPTY	EMPTY	507	0	37	470	bad
2	name2	grade1	False	70	730.0	NaN	NaN	NaN	NaN	317792	EMPTY	EMPTY	EMPTY	good
3	name2	grade2	True	11	EMPTY	NaN	NaN	NaN	NaN	0	605.0	5809.0	6414.0	EMPTY

In row 0, q3 and e2 were both 0; now one is 0.0 and the other is 0.
In row 0, q1 = 955 stays an integer, but q2 = 7633 becomes the float 7633.0 before being converted to a string.
In rows 2 and 3, some NaN values were filled with "EMPTY" and others were not.

@rhshadrach
Member

Thanks for the report! If you do

for k, v in dct.items():
    print(pd.DataFrame.from_dict(v, 'index', dtype=str))

you get

        drop   q1      q2   q3      q4  ...       e1 e2      e3       e4 status
grade1  True  955  7633.0  0.0  1670.0  ...  1535065  0  432747  1102318    NaN
grade2  True   54     NaN  NaN     NaN  ...      507  0      37      470    bad

[2 rows x 12 columns]
         drop  q1      e1     q2 status      e4     e2      e3
grade1  False  70  317792  730.0   good     NaN    NaN     NaN
grade2   True  11       0    NaN    NaN  6414.0  605.0  5809.0

The call to from_dict does enforce the string dtype, but only after DataFrame construction. In other words, the values you specify are first coerced to float because of the null values, and only then converted to strings. This seems appropriate to me: users can convert individual values themselves if they so choose, leaving dtype=... to act only after the DataFrame is constructed.

Your other source of null values arises because there is, e.g., no column q3 in the 2nd DataFrame, so concat has to fill that column with NaN. This happens after the call to fillna, and should be expected.
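A minimal sketch of the inference step described above, assuming a cut-down version of one entry of the reporter's dictionary (the names `data` and `inferred` are illustrative and not from the report). It builds the frame without `dtype=` so the inferred column dtypes are visible before any conversion to strings:

```python
import pandas as pd

# Hypothetical, cut-down version of one entry of the reporter's dictionary:
# 'q1' is present in both rows, 'q2' is missing for 'grade2'.
data = {
    "grade1": {"q1": 955, "q2": 7633},
    "grade2": {"q1": 54},
}

# Construction with dtype inference only (no dtype= argument).
inferred = pd.DataFrame.from_dict(data, orient="index")
print(inferred.dtypes)

# 'q1' has no missing values, so it stays an integer column.
# 'q2' has a missing value, so the whole column is inferred as float.
q1_text = str(inferred.loc["grade1", "q1"])
q2_text = str(inferred.loc["grade1", "q2"])
print(q1_text, q2_text)
```

If the string conversion is applied only after this point, it would explain why 955 and 7633.0 end up side by side in the same row.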

@rhshadrach
Member

rhshadrach commented Oct 22, 2022

Assuming current behavior is to remain as is, perhaps the dtype argument can be better documented as "Data type to force after DataFrame construction, otherwise infer."

@rhshadrach added the Needs Discussion, Constructors, and Dtype Conversions labels Oct 22, 2022
@borisiskra
Author

borisiskra commented Oct 22, 2022

If 'the values you specify are being coerced to float because of the null values', then why in the first row are some values integers (955, 1535065, 0, 432747, 1102318) and others floats (7633.0, 0.0, 1670.0)?

I tried the first row without missing value (added 'status':'bad') and got the same result

@rhshadrach
Member

If a column has no missing value in any row, then there is no coercion to float and its values remain integers. If nulls appear because of the concat, the existing values are not coerced to floats because at this point they are strings and not integers.

You're combining three operations here - (a) DataFrame construction via from_dict; (b) fillna; and (c) concat. It would be helpful to take these one at a time rather than all together. If you think there is an unexpected result in one or more of them, report the input and output to just that step alone and describe what you're finding to be unexpected.
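One way to look at steps (b) and (c) in isolation is sketched below with two small hypothetical frames (`left` and `right` are illustrative, not from the report). It assumes the frames already hold strings, and shows that concat alone introduces new missing values for columns that exist in only one frame:

```python
import pandas as pd

# Two hypothetical frames of strings with partly different columns.
left = pd.DataFrame({"q1": ["955"], "q3": ["0.0"]}, index=["grade1"])
right = pd.DataFrame({"q1": ["70"], "e4": ["6414.0"]}, index=["grade1"])

# (b) fillna on each frame separately: there is nothing to fill yet.
left_filled = left.fillna("EMPTY")
right_filled = right.fillna("EMPTY")

# (c) concat aligns on the union of columns, so 'e4' is missing for
# name1 and 'q3' is missing for name2.
combined = pd.concat(
    {"name1": left_filled, "name2": right_filled}, axis=0, sort=False
)
missing_after_concat = int(combined.isna().sum().sum())
print(combined)

# Filling once more after concat removes the values introduced by alignment.
final = combined.fillna("EMPTY")
missing_after_final_fill = int(final.isna().sum().sum())
print(final)
```

Under these assumptions, calling fillna after concat rather than before it is one way to avoid leftover NaN values.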

@borisiskra
Author

I see, the "issue" is column-wise not row-wise, now I understand what is going on.

@rhshadrach
Member

Yes - that's correct. Each column in a pandas DataFrame has a single dtype - you can see what the dtype is for each column with DataFrame.dtypes. For object dtype - this can hold any Python object but will be much slower in computations. You should prefer bool / int / float dtypes when possible.
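As a small illustration of the one-dtype-per-column rule, the sketch below builds a hypothetical frame (the column names echo the report, but the values and the `mixed` column are made up) and inspects DataFrame.dtypes:

```python
import pandas as pd

# Hypothetical frame: each column ends up with exactly one dtype.
df = pd.DataFrame(
    {
        "drop": [True, False],   # all booleans
        "q1": [955, 70],         # all integers
        "q2": [7633.0, None],    # a missing value forces float
        "mixed": [1, "bad"],     # int and str together fall back to object
    }
)

print(df.dtypes)
```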

If there is no remaining issue for you here, please feel free to close.

@rhshadrach
Member

rhshadrach commented Oct 22, 2022

On second thought, I think the docs could be improved a little here as noted in #49251 (comment). Leaving this open.

@rhshadrach added the Docs and good first issue labels and removed the Needs Discussion label Oct 22, 2022
@EleekaN

EleekaN commented Oct 23, 2022

Hi, how can I work on the documentation related to this issue?
Sorry, I am new here, trying to figure out. Any direction will be appreciated.
Thank you.

@rhshadrach
Member

For general information on contributing, I would recommend: https://pandas.pydata.org/pandas-docs/dev/development/index.html

For this specific issue, #49251 (comment) describes how I think the docs could be improved.

@EleekaN

EleekaN commented Oct 23, 2022

@rhshadrach Thank you so much. I will explore more from the link. :-)

@natmokval
Contributor

take
