API: creating DataFrame with no columns: object vs string dtype columns? #60338

Closed
jorisvandenbossche opened this issue Nov 16, 2024 · 9 comments
Labels
API Design Index Related to the Index class or subclasses Strings String extension data type and string data
Comments

@jorisvandenbossche
Member

A typical case we encounter in the tests is starting from an empty DataFrame, and then adding some columns.

Simplified example of this pattern:

df = pd.DataFrame()
df["a"] = values
...

The DataFrame starts with an empty Index for its columns, and the default dtype for an empty Index is object. Inserting string labels for the actual columns into that Index then preserves the object dtype.

As long as we used object dtype for string column names, this was perfectly fine. But now that we will infer str dtype for actual string column names, it gets a bit annoying that the pattern above results in object columns rather than str columns.

This is not the best pattern, so maybe it's OK that it does not give the ideal result. But at the same time, since we use it quite regularly even in our own tests, I suppose it is not that uncommon.
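
To make the pattern concrete, here is a minimal sketch that builds the same one-column frame two ways and prints the dtype of the resulting column labels. The variable names and sample values are assumptions for illustration. No expected output is shown because the printed dtypes depend on the pandas version and on whether string inference is enabled.

```python
import pandas as pd

values = [1, 2, 3]  # illustrative data, not from the issue

# Pattern from the issue: start empty, then add a column by label.
df_incremental = pd.DataFrame()
df_incremental["a"] = values

# Alternative: pass the column label to the constructor up front.
df_direct = pd.DataFrame({"a": values})

# The labels are identical, but the dtype of the columns Index may differ
# between the two construction paths.
print(df_incremental.columns.dtype)
print(df_direct.columns.dtype)
```

Comparing the two printed dtypes on a given installation shows whether the incremental pattern still falls back to object there.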

@jorisvandenbossche jorisvandenbossche added API Design Strings String extension data type and string data Index Related to the Index class or subclasses labels Nov 16, 2024
@WillAyd
Member

WillAyd commented Nov 16, 2024

I wonder if it would be less disruptive to have the empty Index default to a string data type and coerce to object as needed (at least when used in columns).

@jorisvandenbossche
Member Author

I was actually wrong about the default empty index being object dtype. While that is the case for directly creating an empty index, for DataFrame/Series we already deviate from that and create an empty range index:

>>> pd.DataFrame().index
RangeIndex(start=0, stop=0, step=1)
>>> pd.DataFrame().columns
RangeIndex(start=0, stop=0, step=1)
>>> pd.Index([])
Index([], dtype='object')

Now, the result is the same because inserting a string label in the integer-like range index also upcasts to object dtype.

But yeah, I think it could make sense for the columns to be string by default.
This would be a backwards-incompatible change for the case where you start with an empty DataFrame and then insert columns with integer labels (those would then be cast to object dtype instead of preserving the integer dtype).
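
The integer-label case mentioned here can be sketched as follows, with sample values chosen purely for illustration. With the current range-index default for empty columns, inserting integer labels keeps an integer dtype; under a string default the same code would presumably yield object columns instead.

```python
import pandas as pd

# Start from an empty frame and insert columns under integer labels.
df = pd.DataFrame()
df[0] = [10, 20]
df[1] = [30, 40]

# With the range-index default, the column labels stay integer-typed.
# The concrete Index class (RangeIndex or a plain integer Index) may vary
# between pandas versions.
print(type(df.columns).__name__, df.columns.dtype)
```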

@WillAyd
Member

WillAyd commented Nov 16, 2024

Good point, although it's hard to make any guarantees about what the data type of an empty DataFrame is with our current data model.

Might be another good motivating factor for PDEP-13 #58455 to implement the Null type and use that as the default. That's of course a ways off; in the meantime I think we just have to make a best effort at this, which I think means assuming string column labels.

@simonjayhawkins
Member

As an index object is immutable and an empty index has no labels, does it actually matter what the dtype is when adding rows/columns? Why do we find a common dtype and not just ignore the dtype of the zero-length index when creating the new index?

@jorisvandenbossche
Member Author

> Why do we find a common dtype and not just ignore the dtype of the zero length index when creating the new index?

Because in general, we don't want to have "value dependent" behaviour, i.e. just based on input dtypes you should be able to know the resulting dtype, regardless of the actual values in the object (and so also how many values there are).

See for example #40893 or #39122 for some context (that's mostly about concat, and not necessarily Index union, but it relies on the "common dtype" logic).
And we deprecated the ignoring of empties in concat in the 2.x series: #52532

Now, I also commented on that PR that we could still keep specifically float and object dtype empties as special case (because those get created by default for empty objects): #52532 (comment)
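
As a small illustration of the "no value-dependent behaviour" principle, the sketch below concatenates an int64 Series with two different float64 Series. The variable names and values are made up for the example. The result dtype follows from the input dtypes alone, so it is float64 even when every float happens to be a whole number.

```python
import pandas as pd

ints = pd.Series([1, 2], dtype="int64")
# Every value here is a whole number, so the data would fit in int64 ...
whole_floats = pd.Series([1.0, 2.0], dtype="float64")
# ... while these values would not.
fractional_floats = pd.Series([1.5, 2.5], dtype="float64")

result_whole = pd.concat([ints, whole_floats], ignore_index=True)
result_fractional = pd.concat([ints, fractional_floats], ignore_index=True)

# Both results use the common dtype of int64 and float64, regardless of
# the actual values.
print(result_whole.dtype, result_fractional.dtype)
```

Ignoring empty inputs breaks this property, because the result dtype would then depend on how many values an input holds.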


Focusing on the issue here: I was leaning towards going with the suggestion from above to just default to string dtype columns. However, that also introduces a special case (default Index object in a Series/DataFrame if not specified is always a RangeIndex, except for the len-0 case if it is for the columns), and there are various places (not just the pd.DataFrame() constructor) where such a DataFrame could be created.

If we could instead solve this by ignoring the dtype of an empty Index when it has object dtype or is a RangeIndex (i.e. the equivalent of object/float for Series/columns), that might be a good alternative as well.
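
A rough sketch of what that alternative could mean for inserting a label is shown below. The helper `insert_label_ignoring_empty` is hypothetical and exists only for this illustration; it is not a pandas API and does not reflect how pandas implements inserts or unions.

```python
import pandas as pd


def insert_label_ignoring_empty(index: pd.Index, label) -> pd.Index:
    """Hypothetical helper (not a pandas API): append ``label`` to ``index``.

    If ``index`` is empty and is either a RangeIndex or has object dtype,
    its dtype is ignored and the result dtype is inferred from the new
    label alone. Otherwise the regular ``Index.insert`` logic applies.
    """
    is_default_empty = len(index) == 0 and (
        isinstance(index, pd.RangeIndex) or index.dtype == object
    )
    if is_default_empty:
        return pd.Index([label], name=index.name)
    return index.insert(len(index), label)


# Empty "default" indexes do not influence the result dtype.
from_range = insert_label_ignoring_empty(pd.RangeIndex(0), "a")
from_object = insert_label_ignoring_empty(pd.Index([], dtype=object), "a")
from_range_int = insert_label_ignoring_empty(pd.RangeIndex(0), 0)

# A non-empty index goes through the regular insert path.
non_empty = insert_label_ignoring_empty(pd.Index([1, 2]), 3)

print(from_range.dtype, from_object.dtype, from_range_int.dtype, non_empty.dtype)
```

With a rule like this, the string label ends up with whatever dtype `pd.Index(["a"])` infers on the running pandas version, while integer labels keep an integer dtype.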

@WillAyd
Member

WillAyd commented Jan 3, 2025

> However, that also introduces a special case (default Index object in a Series/DataFrame if not specified is always a RangeIndex, except for the len-0 case if it is for the columns), and there are various places (not just pd.DataFrame() constructor) where such DataFrame could be created.

Do you have an example of where this matters? I'm somewhat on board with what you are saying, although I'm struggling to understand why that isn't problematic with having an object-dtype versus RangeIndex in the first place

@jorisvandenbossche
Member Author

I don't think I have an example of where this inconsistency matters; it's just about having the inconsistency (not having it is always easier to explain). So this is not a very strong (practical) argument. I think I was more wondering if we can avoid the special case at construction time.

Because we do have various places where we would have to construct this properly. For example, just for the DataFrame() constructor, default_index(0) is called for creating the columns in the init itself, but also in ndarray_to_mgr and dict_to_mgr. DataFrames with empty columns could also be created in concat, IO operations, and so on, so it might be quite easy to miss some places. And if someone explicitly creates the index with pd.Index(seq) from an empty sequence, it still defaults to object and has similar problems.
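
To give a feel for how many entry points there are, the sketch below builds frames with zero columns through a few public paths and prints what their columns look like. The selection of paths and the dictionary keys are assumptions for illustration; which internal function each path goes through is not verified here, and the printed index types depend on the pandas version.

```python
import numpy as np
import pandas as pd

# A few public ways to end up with a DataFrame that has no columns.
empty_frames = {
    "DataFrame()": pd.DataFrame(),
    "DataFrame({})": pd.DataFrame({}),
    "DataFrame(ndarray)": pd.DataFrame(np.empty((0, 0))),
    "column slice": pd.DataFrame({"a": [1, 2]}).iloc[:, :0],
}

for name, frame in empty_frames.items():
    print(name, type(frame.columns).__name__, frame.columns.dtype)

# An explicitly constructed empty Index is yet another starting point.
print("pd.Index([])", pd.Index([]).dtype)
```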

Given I am already a bit in favor of special casing empty object/float objects for Series (for the concat examples), I was mostly wondering if a similar change for Index unions/inserts would essentially make the exact default less important (and potentially avoid introducing the inconsistency on the construction part).

@WillAyd
Member

WillAyd commented Jan 23, 2025

At the monthly dev meeting yesterday we discussed the notion of having the Null type for the index of an empty Series/DataFrame (as brought up by @jbrockmendel), but generally the team agreed that introducing the Null type won't be feasible for the 2.3 or 3.0 releases.

@rhshadrach
Member

@jorisvandenbossche - this can be closed now, right?
