No way to construct mixed dtype DataFrame without total copy, proposed solution #9216
you can simply create an empty frame with an index and columns; you could create these with np.empty if you wish |
Looks like it's copying to me. Unless the code I wrote isn't what you meant, or the copying that occurred is not the copy you thought I was trying to elide. |
you changed the dtype |
The dtype is object, so it's changed regardless of whether I use a float or an int. |
Constructing w/o a copy on mixed type could be done but is quite tricky. The problem is some types require a copy (e.g. object, to avoid memory contention issues). And the internal structure consolidates different types, so adding a new type will necessitate a copy. Avoiding a copy is pretty difficult in most cases. You should just create what you need, get pointers to the data and then overwrite it. Why is that a problem? |
The problem is that in order to create what I need, I have to copy in stuff of the correct dtype, the data of which I have no intention of using. Even assuming that your suggestion of creating an empty DataFrame uses no significant RAM, this doesn't alleviate the cost of copying. If I want to create a 1 gigabyte DataFrame and populate it somewhere else, I'll have to pay the cost of copying a gigabyte of garbage around in memory, which is completely needless. Do you not see this as a problem? Yes, I understand that the internal structure consolidates different types. I'm not sure exactly what you mean by memory contention issues, but in any case objects are not really what's of interest here. Actually, while avoiding copies in general is a hard problem, avoiding them in the way I suggested is fairly easy because I'm supplying all the necessary information from the get-go. It's identical to constructing from data, except that instead of inferring the dtypes and the # of rows from data and copying the data, you specify the dtypes and # of rows directly, and do everything else exactly as you would have done minus the copy. You need an "empty" constructor for every supported column type. For numpy numeric types this is obvious, it needs non-zero work for Categorical, unsure about DatetimeIndex. |
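For the plain numpy numeric dtypes, a minimal sketch of what such a `from_empty` helper could look like on top of today's public API (the helper below is hypothetical, not part of pandas, and whether the allocated arrays are actually reused without a copy depends on the pandas version):

```python
import numpy as np
import pandas as pd

def from_empty(index, columns, dtypes):
    # Hypothetical helper: allocate uninitialized storage per column, wrap it in
    # Series, then assemble the frame with copy=False so nothing is duplicated.
    data = {
        col: pd.Series(np.empty(len(index), dtype=dt), index=index, copy=False)
        for col, dt in zip(columns, dtypes)
    }
    return pd.DataFrame(data, copy=False)

df = from_empty(np.arange(100), ['x', 'y', 'z'],
                [np.float64, np.int64, np.float32])
print(df.dtypes)  # dtypes are preserved; the contents are whatever garbage np.empty left
```

Categorical and DatetimeIndex columns would need their own allocation path, as noted above.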
passing a dict to the constructor and copy=False should work |
So this will work. But you have to be SURE that the arrays that you are passing are distinct dtypes. And once you do anything to this it could copy the underlying data. So YMMV. You can of course pass in |
Initial attempt deleted since it does not work. Have to use this method, which makes the attempt a little less satisfactory:

```python
arr = np.empty(1, dtype=[('x', np.float), ('y', np.int)])
df = pd.DataFrame.from_records(arr).reindex(np.arange(100))
```

If you are really worried about performance, I'm not sure why one wouldn't just use numpy as much as possible since it is conceptually much simpler. |
jreback, thank you for your solution. This seems to work, even for Categoricals (which surprised me). If I encounter issues I'll let you know. I'm not sure what you mean by: if you do anything to this, it could copy. What do you mean by anything? Unless there are COW semantics, I would think what you see is what you get with regards to deep vs shallow copies, at construction time. I still think a from_empty constructor should be implemented, and I don't think it would be that difficult; while this technique works, it does involve a lot of code overhead. In principle this could be done by specifying a single composite dtype and a number of rows.

bashtage, these solutions still write into the entire DataFrame. Since writing is generally slower than reading, this means at best it saves less than half the overhead in question. Obviously if I haven't gone and used numpy, it's because pandas has many awesome features and capabilities that I love, and I don't want to give those up. Were you really asking, or just implying that I should use numpy if I don't want to take this performance hit? |
Scratch this, please, user error, and my apologies. reindex_axis with copy=False worked perfectly. |
True, but all that you need to a new
It was a bit rhetorical - although also a serious suggestion from a performance point of view since numpy makes it much easier to get close to the data-as-a-blob-of-memory access that is important if you are trying to write very high performance code. You can always convert from numpy to pandas when code simplicity is more important than performance. |
I see what you are saying. I still think it should more cleanly be part of the interface rather than a workaround, but as workarounds go it is a good one and easy to implement. Pandas still emphasizes performance as one of its main objectives. Obviously it has higher level features compared to numpy, and those have to be paid for. What we're talking about has nothing to do with those higher level features, and there's no reason why one should be paying for massive copies in places where you don't need them. Your suggestion would be appropriate if someone was making a stink about the cost of setting up the columns, index, etc, which is completely different from this discussion. |
I think you are overestimating the cost of writing vs. the cost of allocating memory in Python -- the expensive part is the memory allocation. The object creation is also expensive.

Both allocate 1GB of memory, one empty and one zeros.

```python
%timeit np.empty(1, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 2.44 µs per loop

%timeit np.zeros(1, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 2.47 µs per loop

%timeit np.zeros(50000000, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 11.7 µs per loop

%timeit np.empty(50000000, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 11.4 µs per loop
```

3µs for zeroing 150,000,000 values. Now compare these for a trivial DataFrame.

```python
%timeit pd.DataFrame([[0]])
1000 loops, best of 3: 426 µs per loop
```

Around 200 times slower for trivial. But it is far worse for larger arrays.

```python
%timeit pd.DataFrame(np.empty((50000000, 3)), copy=False)
1 loops, best of 3: 275 ms per loop
```

Now it takes 275ms -- note that this is not copying anything. The cost is in setting up the index, etc., which is clearly very slow when the array is nontrivially big.

This feels like a premature optimization to me since the other overheads in pandas are so large that the malloc + filling component is near 0 cost. It seems that if you want to allocate anything in a tight loop it must be a numpy array for performance reasons. |
ok, here's what I think we should do, @quicknir if you'd like to make some improvements. 2 issues.
This is slightly non-trivial but would then allow one to pass in an already created ndarray (could be empty) with mixed types pretty easily. Note that this would likely (in a first pass implementation) handle only (int/float/string), as datetime/timedelta need special sanitizing and would make this slightly more complicated. So @bashtage is right from a perf perspective. It makes a lot of sense to simply construct the frame as you want then modify the ndarrays (but you MUST do this by grabbing the blocks, otherwise you will get copies). What I meant above is this. Pandas groups any like-dtype (e.g. int64, int32 are different) into a 'block' (2-d in a frame). These are a contiguous memory ndarray (that is newly allocated, unless it is simply passed in, which only currently works for a single dtype). If you then do a setitem, e.g. |
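A minimal sketch of the "grab the blocks and modify the ndarrays" approach described above (the `_mgr`/`_data` attributes and `.blocks` are private pandas internals and can change between versions, so treat this as illustrative only):

```python
import numpy as np
import pandas as pd

# Build a mixed-dtype frame however is convenient; the contents will be overwritten.
df = pd.DataFrame({'a': np.zeros(5), 'b': np.zeros(5), 'c': np.zeros(5, dtype=np.int64)})

mgr = df._mgr if hasattr(df, '_mgr') else df._data   # BlockManager (attribute name varies by version)
for blk in mgr.blocks:
    # blk.values is the 2-D ndarray backing all columns of this dtype; writing into
    # it fills the DataFrame in place without allocating or copying anything.
    blk.values[:] = 42

print(df)
```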
Even with an optimized constructor I think this can only be considered a convenience function, and not a performance issue. It could be useful to initialize a mixed type Panel, e.g.

```python
dtype = np.dtype([('GDP', np.float64), ('Population', np.int64)])
pd.Panel(items=['AU','AT'],
         major_axis=['1972','1973'],
         minor_axis=['GDP','Population'],
         dtype=[np.float, np.int64])
```
|
this is only an API / convenience issue; agreed the perf is really an incidental issue (and not the driver) |
```python
%timeit pd.DataFrame(np.empty((100, 1000000)))
%timeit pd.DataFrame(np.empty((100, 1000000)), copy=True)
```

So copying into a DataFrame seems to take 20 times longer than all the other work involved in creating the DataFrame, i.e. the copy (and extra allocation) is 95% of the time. The benchmarks you did do not benchmark the correct thing. Whether the copy itself or the allocation is what's taking time doesn't really matter; the point is that if I could avoid copies for a multiple dtype DataFrame the way I can for a single dtype DataFrame, I could save a huge amount of time.

Your two order of magnitude reasoning is also deceiving. This is not the only operation being performed; there are other operations being performed that take time, like disk reads. Right now, the extra copy I need to do to create the DataFrame is taking about half the time in my simple program that just reads the data off disk and into a DataFrame. If it took 1/20th as much time, then the disk read would be dominant (as it should be) and further improvements would have almost no effect. So I want to again emphasize to both of you: this is a real performance issue.

jreback, given that the concatenation strategy does not work for Categoricals, I don't think the improvements you suggested above will work. I think a better starting point would be reindex. The issue right now is that reindex does lots of extra stuff. But in principle, a DataFrame with zero rows has all the information necessary to allow the creation of a DataFrame with the correct number of rows, without doing any unnecessary work. Btw, this makes me really feel like pandas needs a schema object, but that's a discussion for another day. |
I think we will have to agree to disagree. IMO DataFrames are not extreme performance objects in the numeric ecosystem, as shown by the order of magnitude difference between a basic numpy array and a DataFrame creation.

```python
%timeit np.empty((1000000, 100))
1000 loops, best of 3: 1.61 ms per loop

%timeit pd.DataFrame(np.empty((1000000,100)))
100 loops, best of 3: 15.3 ms per loop
```

I think this is even less reason to care about DataFrame performance -- even if you can make it 100% free, the total program time only declines by 50%. I agree that there is scope for you to do a PR here to resolve this issue, whether you want to think of it as a performance issue or as a convenience issue. From my POV, I see it as the latter, since I will always use a numpy array when I care about performance. Numpy does other things, like not using a block manager, which is relatively efficient for some things (like growing the array by adding columns) but bad from other points of view.

There could be two options. The first, an empty constructor as in the example I gave above. This would not copy anything, but would probably null-fill to be consistent with other things in pandas. Null filling is pretty cheap and is not at the root of the problem IMO. The other would be to have a method

```python
DataFrame.from_blocks([np.empty((100,2)),
                       np.empty((100,3), dtype=np.float32),
                       np.empty((100,1), dtype=np.int8)],
                      columns=['f8_0','f8_1','f4_0','f4_1','f4_2','i1_0'],
                      index=np.arange(100))
```

A method of this type would enforce that the blocks have compatible shape, that all blocks have unique types, as well as the usual checks for the shape of the index and columns. This type of method would do nothing to the data and would use it in the BlockManager. |
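A user-space approximation of such a `from_blocks` method, assembled from the existing API (a sketch only: the method itself does not exist in pandas, and whether the concat really avoids copying the per-dtype arrays depends on the version and on block consolidation):

```python
import numpy as np
import pandas as pd

def from_blocks(blocks, columns, index):
    # Hypothetical helper: wrap each per-dtype 2-D array in its own single-block
    # frame (no copy for a single dtype), then glue the frames together.
    frames, start = [], 0
    for blk in blocks:
        ncols = blk.shape[1]
        frames.append(pd.DataFrame(blk, index=index,
                                   columns=columns[start:start + ncols], copy=False))
        start += ncols
    return pd.concat(frames, axis=1, copy=False)

df = from_blocks([np.empty((100, 2)),
                  np.empty((100, 3), dtype=np.float32),
                  np.empty((100, 1), dtype=np.int8)],
                 columns=['f8_0', 'f8_1', 'f4_0', 'f4_1', 'f4_2', 'i1_0'],
                 index=np.arange(100))
print(df.dtypes)
```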
@quicknir you are trying to combine pretty complicated things. Categoricals don't exist in numpy; rather they are a compound dtype-like construct that is specific to pandas. You have to construct and assign them separately (which is actually quite cheap - these are not combined into blocks like other singular dtypes). @bashtage's solution seems reasonable. This could provide some simple checks and simply pass thru the data (and be called by the other internal routines). Normally the user need not concern themselves with the internal repr. Since you really really want to, then you need to be cognizant of this. All that said, I am still not sure why you don't just create a frame exactly like you want. Then grab the block pointers and change the values. It costs the same memory, and as @bashtage points out it is pretty cheap to create essentially a null frame (that has all of the dtype, index, columns) already set. |
Not sure what you mean by the empty constructor, but if you mean constructing a dataframe with no rows and the desired schema and calling reindex, this is the same amount of time as creating with copy=True. Your second proposal is reasonable, but only if you can figure out how to do Categoricals. On that subject, I was going through the code and I realized that Categoricals are non-consolidatable. So on a hunch, I created an integer array and two categorical Series; I then created three DataFrames and concatenated all three. Sure enough, it did not perform a copy even though two of the DataFrames had the same dtype. I will try to see how to get this to work for DatetimeIndex. @jreback I still do not follow what you mean by create the frame exactly like you want. |
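A rough reconstruction of that experiment (the column names and values here are illustrative, and whether the copy is actually elided depends on the pandas version and on block consolidation):

```python
import numpy as np
import pandas as pd

df_int = pd.DataFrame({'i': np.arange(5)})
df_cat1 = pd.DataFrame({'c1': pd.Categorical(['a', 'b', 'a', 'b', 'a'])})
df_cat2 = pd.DataFrame({'c2': pd.Categorical(['x', 'x', 'y', 'y', 'x'])})

combined = pd.concat([df_int, df_cat1, df_cat2], axis=1, copy=False)

# If the concat did not copy, the integer column of the result still shares
# memory with the integer column of the source frame.
print(np.shares_memory(combined['i'].values, df_int['i'].values))
```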
@quicknir why don't you show a code/pseudo-code sample of what you are actually trying to do. |
The previous code was constructing a dictionary of numpy arrays first, and then constructing a DataFrame from that, which was copying everything. About half the time was being spent on that. So I am trying to change it to this scheme. The thing is that constructing df as above, even when you don't care about the contents, is extremely expensive. |
@quicknir dict of np arrays requires lots of copying. You should simply do this:
if you do this by dtype it will be cheap then.
as these block values are views into numpy arrays |
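To illustrate the "views into numpy arrays" point, a minimal sketch (whether a view or a copy is made can vary by pandas version and settings):

```python
import numpy as np
import pandas as pd

arr = np.empty((100, 3))             # a single dtype -> a single block
df = pd.DataFrame(arr, copy=False)

# When no copy was needed, the frame's block is just a view of arr, so writes
# to arr are visible through the DataFrame (and vice versa).
print(np.shares_memory(df.values, arr))
arr[:] = 7.0
print(df.iloc[0, 0])                 # 7.0 if the block is indeed a view
```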
The composition of types isn't known in advance in general, and in the most common use case there is a healthy mix of floats and ints. I guess I don't follow how this will be cheap: if I have 30 float columns and 10 int columns, then yes, the floats will be very cheap. But when you do the ints, unless there is some way to do them all at once that I'm missing, each time you add one more column of ints it will cause the entire int block to be reallocated. The solution you gave me previously is close to working; I can't seem to make it work out for DatetimeIndex. |
An empty constructor would look like

```python
dtype = np.dtype([('a', np.float64), ('b', np.int64), ('c', np.float32)])
df = pd.DataFrame(columns=['a','b','c'], index=np.arange(100), dtype=dtype)
```

This would produce the same output as

```python
dtype = np.dtype([('a', np.float64), ('b', np.int64), ('c', np.float32)])
arr = np.empty(100, dtype=dtype)
df = pd.DataFrame.from_records(arr, index=np.arange(100))
```

only it wouldn't copy data. Basically the constructor would allow a mixed dtype for the following call, which works but only for a single basic dtype.

```python
df = pd.DataFrame(columns=['a','b','c'], index=np.arange(100), dtype=np.float32)
```

The only other feature would be to prevent it from null-filling int arrays, which has the side effect of converting them to object dtype since there is no missing value for ints. |
@jreback It's not reinventing the wheel, as the dtype spec has several major problems:
In short, the dtype of the record representation is more of a workaround for the lack of a proper spec. It lacks several key features and is much poorer performance-wise. |
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
- closes pandas-dev#8571 by defining __copy__/__deepcopy__
There are many many threads on SO asking for this feature. It seems to me that all these problems stem from BlockManager consolidating separate columns into single memory chunks (the 'blocks'). I have a non-consolidating monkey-patched BlockManager:
You can now pass a dict of Series and copy=False and should be OK |
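A short illustration of that usage (hedged: whether the underlying buffers are actually reused without a copy depends on the pandas version):

```python
import numpy as np
import pandas as pd

x = np.empty(100)                    # pre-allocated, uninitialized storage
y = np.empty(100, dtype=np.int64)

df = pd.DataFrame({'x': pd.Series(x), 'y': pd.Series(y)}, copy=False)

# True when the columns are backed by the original arrays rather than copies.
print(np.shares_memory(df['x'].values, x), np.shares_memory(df['y'].values, y))
```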
Might be good to mention the dict of Series with copy=False solution in the DataFrame docs. |
take |
After hours of tearing my hair, I've come to the conclusion that it is impossible to create a mixed dtype DataFrame without copying all of its data in. That is, no matter what you do, if you want to create a mixed dtype DataFrame, you will inevitably create a temporary version of the data (e.g. using np.empty), and the various DataFrame constructors will always make copies of this temporary. This issue has already been brought up, a year ago: #5902.
This is especially terrible for interoperability with other programming languages. If you plan to populate the data in the DataFrame from e.g. a call to C, the easiest way to do it by far is to create the DataFrame in python, get pointers to the underlying data, which are np.arrays, and pass these np.arrays along so that they can be populated. In this situation, you simply don't care what data the DataFrame starts off with, the goal is just to allocate the memory so you know what you're copying to.
This is also just generally frustrating because it implies that in principle (depending potentially on the specific situation, and the implementation specifics, etc) it is hard to guarantee that you will not end up using twice the memory you really should.
This has an extremely simple solution that is already grounded in the quantitative python stack: have a method analogous to numpy's empty. This allocates the space, but does not actually waste any time writing or copying anything. Since empty is already taken, I would propose calling the method from_empty. It would accept an index (mandatory, most common use case would be to pass np.arange(N)), columns (mandatory, typically a list of strings), and types (a list of acceptable types for the columns, same length as columns). The list of types should include support for all numpy numeric types (ints, floats), as well as special Pandas columns such as DatetimeIndex and Categorical.
As an added bonus, since the implementation is in a completely separate method, it will not interfere with the existing API at all.