ENH: Infer dtype when using df.explode() #34923

Open
hamzahiqb opened this issue Jun 21, 2020 · 6 comments
Labels
API - Consistency Internal Consistency of API/Behavior Dtype Conversions Unexpected or buggy dtype conversions Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@hamzahiqb
Is your feature request related to a problem?

Yes. Currently, df.explode always returns object dtype for the column being exploded, so the dtype information of the exploded column is lost.

E.g.

import pandas as pd

s = pd.Series([1, 2, 3])  # dtype('int64')
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1})
df.explode("A").dtypes
# A    object
# B     int64
# dtype: object

It would be great if pandas could return the underlying dtype if it was consistent across all rows. (Or return the best dtype (int -> float -> object).)

Describe the solution you'd like

  1. Solution 1: In the best case, pandas would infer the dtype directly whenever it is consistent (ignoring NaNs) across all rows.
s = pd.Series([1, None, 3])  # dtype('float64'); None is stored as NaN
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1})
df.explode("A").dtypes
# desired result:
# A    float64
# B      int64
  2. Solution 2: Provide an argument that forces dtype inference:
s = pd.Series([1, None, 3])  # dtype('float64'); None is stored as NaN
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1})
df.explode("A", infer_type=True).dtypes  # infer_type is the proposed argument
# desired result:
# A    float64
# B      int64

Describe alternatives you've considered

Currently, I use the following workaround:

s = pd.Series([1, None, 3])  # dtype('float64'); None is stored as NaN
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1})

d = df.A[0].dtype
df2 = df.explode("A")
df2.A = df2.A.astype(d)
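
The workaround above can be generalized a little. The following is only a sketch: `explode_keep_dtype` is a hypothetical helper name, not a pandas function, and it assumes every cell in the column is a pd.Series. It restores the dtype only when all cells agree on one.

```python
import pandas as pd


def explode_keep_dtype(df, column):
    """Hypothetical helper: explode `column`, then restore the cells' shared dtype."""
    cells = list(df[column])
    exploded = df.explode(column)
    # Only act when every cell is a Series and all of them share one dtype.
    if cells and all(isinstance(cell, pd.Series) for cell in cells):
        dtypes = {cell.dtype for cell in cells}
        if len(dtypes) == 1:
            return exploded.astype({column: dtypes.pop()})
    return exploded


s = pd.Series([1, None, 3])  # float64, None is stored as NaN
df = pd.DataFrame({"A": [s, s], "B": 1})
result = explode_keep_dtype(df, "A")
print(result.dtypes)
```

If the cells disagree on dtype, the helper returns the plain explode result unchanged, which keeps the current behavior as the fallback.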

API breaking implications

Not sure.

@hamzahiqb hamzahiqb added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 21, 2020
@WillAyd
Member

WillAyd commented Jun 22, 2020

I think this is reasonable. Can probably add a maybe_infer_objects call on the result of the exploded column (you'll find this in pandas._libs.lib)

@WillAyd WillAyd removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jun 22, 2020
@erfannariman
Member

Can you elaborate, @WillAyd? I looked in lib, but there was nothing even starting with maybe_infer. In the whole project I could only find:

  • maybe_infer_freq
  • maybe_infer_tz
  • maybe_infer_dtype_type

@WillAyd
Member

WillAyd commented Jun 23, 2020

My mistake - I meant to say maybe_convert_objects which is in pandas._libs.lib
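
For readers unfamiliar with it, here is a rough illustration of what maybe_convert_objects does to an object array. Note that pandas._libs.lib is a private module, so its signature and behavior may differ between pandas versions. This sketch only passes the array positionally and relies on default options.

```python
import numpy as np
from pandas._libs import lib  # private module, not a stable public API

# An object array that holds only Python ints can be converted to an integer array.
ints = np.array([1, 2, 3], dtype=object)
converted = lib.maybe_convert_objects(ints)
print(converted.dtype)

# A genuinely mixed array has no better dtype and stays object.
mixed = np.array([1, "a"], dtype=object)
unchanged = lib.maybe_convert_objects(mixed)
print(unchanged.dtype)
```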

@jreback
Contributor

jreback commented Nov 26, 2020

ok this is not going to work as-is, as it's a breaking change. would suggest that we add a downcast=None|'infer' option which folks can opt in to (and at some point we could deprecate and make the default).
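
To make the opt-in idea concrete, here is a minimal sketch written as a standalone wrapper. `explode_with_downcast` is a hypothetical name, and the `downcast` keyword does not exist on DataFrame.explode. It only illustrates how the default could stay unchanged while 'infer' enables dtype inference.

```python
import pandas as pd


def explode_with_downcast(df, column, downcast=None):
    """Hypothetical wrapper: explode, optionally inferring a better dtype."""
    if downcast not in (None, "infer"):
        raise ValueError("downcast must be None or 'infer'")
    result = df.explode(column)
    if downcast == "infer":
        # Infer a dtype for the exploded column only, leaving other columns alone.
        inferred = result[column].infer_objects()
        result = result.astype({column: inferred.dtype})
    return result


df = pd.DataFrame({"A": [[1, 2, 3], [4, 5]], "B": 1})
default = explode_with_downcast(df, "A")
opted_in = explode_with_downcast(df, "A", downcast="infer")
print(default["A"].dtype, opted_in["A"].dtype)
```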

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Nov 26, 2020
@jreback jreback added this to the Contributions Welcome milestone Nov 26, 2020
@jreback jreback added API - Consistency Internal Consistency of API/Behavior Dtype Conversions Unexpected or buggy dtype conversions labels Nov 26, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel
Member

if the implementation is just going to be if infer: return result.infer_objects(copy=False), then we should just tell users to do that directly
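
For reference, doing it directly looks roughly like this. The sketch uses list cells and calls infer_objects() without the copy keyword, since that keyword's availability varies across pandas versions. Note that infer_objects() on a DataFrame applies to every object column, not just the exploded one.

```python
import pandas as pd

df = pd.DataFrame({"A": [[1, 2, 3], [4, 5]], "B": 1})

exploded = df.explode("A")
print(exploded.dtypes)  # column A is object at this point

# Let pandas pick a better dtype for the object columns.
fixed = exploded.infer_objects()
print(fixed.dtypes)
```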

@daskol
daskol commented Mar 19, 2023

"then we should just tell users to do that directly"

I think the issue is that pandas does not preserve the element type at the moment. From my perspective, explode should by default create a column with the common dtype of all list elements, because there is no meaningful reason to change the element type from np.int64 to object. So, explode should act like [[a]] -> [a], but this requires that the column elements are actually of a common list type [a].
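
The [[a]] -> [a] behavior described here could look something like the sketch below. `explode_typed` is a hypothetical function, not part of pandas, and it assumes every cell is a non-empty list-like. Unlike explode, it does not turn empty lists into NaN or pass scalars through. It lets NumPy compute the common dtype while concatenating.

```python
import numpy as np
import pandas as pd


def explode_typed(series):
    """Hypothetical sketch: flatten list-like cells, keeping their common dtype."""
    arrays = [np.asarray(cell) for cell in series]
    lengths = [len(arr) for arr in arrays]
    # np.concatenate promotes the inputs to a common dtype (e.g. int + float -> float).
    values = np.concatenate(arrays)
    # Repeat each index label once per element of its cell, as explode does.
    index = series.index.repeat(lengths)
    return pd.Series(values, index=index, name=series.name)


ints = explode_typed(pd.Series([[1, 2], [3]], name="A"))
floats = explode_typed(pd.Series([[1, 2], [3.5]], name="A"))
print(ints.dtype, floats.dtype)
```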
