API: creating DataFrame with no columns: object vs string dtype columns? #60338
Comments
I wonder if it would be less disruptive to have the empty Index default to a string data type and coerce to object as needed (at least when used in columns). |
I was actually wrong about the default empty index being object dtype. While that is the case for directly creating an empty index, for DataFrame/Series we already deviate from that and create an empty range index:
>>> pd.DataFrame().index
RangeIndex(start=0, stop=0, step=1)
>>> pd.DataFrame().columns
RangeIndex(start=0, stop=0, step=1)
>>> pd.Index([])
Index([], dtype='object')
Now, the result is the same because inserting a string label in the integer-like range index also upcasts to object dtype. But yeah, I think it could make sense for the columns to be string by default. |
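The upcast mentioned above can be checked directly at the Index level. The following is a minimal sketch, not taken from the thread: the label "a" is an arbitrary choice, and the printed dtypes depend on the pandas version and on whether string inference is enabled, so no expected output is shown.

```python
import pandas as pd

# Empty columns of a default-constructed frame, and an empty Index created directly.
empty_columns = pd.DataFrame().columns
empty_index = pd.Index([])

# Insert a single string label at position 0 in each of them.
from_frame = empty_columns.insert(0, "a")
from_index = empty_index.insert(0, "a")

# Compare the dtypes of the two results.
print(type(empty_columns).__name__, "->", from_frame.dtype)
print(type(empty_index).__name__, "->", from_index.dtype)
```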
Good point, although it's hard to make any guarantees about what the data type of an empty dataframe is with our current data model. Might be another good motivating factor for PDEP-13 #58455 to implement the Null type and use that as the default. That's of course a ways off; in the meantime I think we just have to make a best effort at this, which I think would be assuming string column labels. |
As an index object is immutable and an empty index has no labels does it actually matter what the dtype is when adding rows/columns? Why do we find a common dtype and not just ignore the dtype of the zero length index when creating the new index? |
Because in general, we don't want to have "value dependent" behaviour, i.e. just based on the input dtypes you should be able to know the resulting dtype, regardless of the actual values in the object (and so also regardless of how many values there are). See for example #40893 or #39122 for some context (that's mostly about concat, and not necessarily Index union, but it relies on the same "common dtype" logic). Now, I also commented on that PR that we could still keep specifically float and object dtype empties as a special case (because those get created by default for empty objects): #52532 (comment)
Focusing on the issue here: I was leaning towards going with the suggestion from above to just default to string dtype columns. However, that also introduces a special case (the default Index object in a Series/DataFrame, if not specified, is always a RangeIndex, except for the len-0 case if it is for the columns), and there are various places (not just
If we instead could solve this by ignoring the dtype of an empty Index if that dtype is object or if it is a RangeIndex (i.e. the equivalent for object/float for Series/columns), that might be a good alternative as well. |
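To see what the "common dtype" logic means for a zero-length operand, one can union an empty object-dtype Index with a non-empty integer Index and look at the resulting dtype. This is only an illustrative sketch with made-up values; the outcome may differ between pandas versions, so the dtype is printed rather than stated.

```python
import pandas as pd

empty_obj = pd.Index([], dtype=object)   # zero-length, object dtype
ints = pd.Index([1, 2], dtype="int64")   # non-empty, integer dtype

result = empty_obj.union(ints)

# Print the dtypes to see whether the empty operand's dtype influenced the result.
print(empty_obj.dtype, "+", ints.dtype, "->", result.dtype)
```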
Do you have an example of where this matters? I'm somewhat on board with what you are saying, although I'm struggling to understand why that isn't problematic with having an object-dtype versus RangeIndex in the first place |
I don't think I have an example of where this inconsistency matters; it's just about having the inconsistency (not having it is always easier to explain). So this is not a very strong (practical) argument. I think I was more wondering if we can avoid the special case for constructing, because we do have various places where we would have to construct this properly (e.g. just for
Given I am already a bit in favor of special casing empty object/float objects for Series (for the concat examples), I was mostly wondering if a similar change for Index unions/inserts would essentially make the exact default less important (and potentially avoid introducing the inconsistency on the construction part). |
At the monthly dev meeting yesterday we discussed the notion of having the Null type for the index of an empty series/frame (as brought up by @jbrockmendel), but generally the team agreed that introducing the Null type won't be feasible for the 2.3 or 3.0 releases. |
@jorisvandenbossche - this can be closed now, right? |
A typical case we encounter in the tests is starting from an empty DataFrame, and then adding some columns.
Simplified example of this pattern:
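The original snippet did not survive in this copy of the issue. A minimal sketch of the kind of pattern described, with hypothetical column names and values, could look like the following; the dtype printed at the end depends on the pandas version and on whether string inference is enabled, so no output is shown.

```python
import pandas as pd

# Start from an empty DataFrame (no rows, no columns).
df = pd.DataFrame()
empty_columns = df.columns

# Add the actual columns one by one; "a" and "b" are illustrative names.
df["a"] = [1, 2, 3]
df["b"] = [4.0, 5.0, 6.0]

# Inspect the columns before and after adding the string labels.
print(type(empty_columns).__name__, empty_columns.dtype)
print(type(df.columns).__name__, df.columns.dtype)
```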
The dataframe starts with an empty Index for the columns, and the default dtype for an empty Index is object dtype. Inserting string labels for the actual columns into that Index object then preserves the object dtype.
As long as we used object dtype for string column names, this was perfectly fine. But now that we will infer str dtype for actual string column names, it gets a bit annoying that the pattern above does not result in str but object columns.
This is not the best pattern, so maybe it's OK that this does not give the ideal result. But at the same time, since we even use it quite regularly in our own tests, I suppose this is not that uncommon.