Add low-level create_dataframe_from_blocks helper function #58197

Merged
Changes from 1 commit
50 changes: 50 additions & 0 deletions pandas/api/internals.py
@@ -0,0 +1,50 @@
from pandas import DataFrame
Member:

Can we add a docstring (module and/or function level) to the effect of "we discourage this for everyone except pyarrow. if you think you have a use case for this, let us know at [...]"

Member Author (@jorisvandenbossche), Apr 10, 2024:

@phofl might also have a use case in dask (I don't know if you already have a better idea now of whether that would be the case?)

Member:

Yeah, we are working on changing how we shuffle data, where this would be helpful (we will get a huge number of small DataFrames, so overhead is painful), but I agree that we should strengthen the wording a little to make it clear that end users shouldn't need this.

Member Author:

I already added a more generic note "For almost all use cases, you should use the standard pd.DataFrame(..) constructor instead." without naming specific libraries that use this.

What would we gain with an "if you think you have a use case for this, let us know at"? Learning about use cases where people would use this is certainly valuable, but in the end it will be a public developer API, so if we change or remove it in the future we need to go through the normal deprecation process anyway, I think.

Member:

Hopefully we'll never have to revisit this again. But if we do, there is evidence that discussions around a deprecation here would be more painful than elsewhere. It would be helpful to know ahead of such a discussion if anyone else was using it. Moreover, the "let us know" is a chance to try to talk anyone out of using this.

Member Author:

Added "If you are planning to use this function, let us know by opening an issue at https://github.com/pandas-dev/pandas/issues."

from pandas.core.internals.api import _make_block
from pandas.core.internals.managers import BlockManager as _BlockManager


def create_dataframe_from_blocks(blocks, index, columns):
    """
    Low-level function to create a DataFrame from arrays that are already
    laid out as the blocks of the resulting DataFrame.

    Attention: this is an advanced, low-level function that should only be
    used if you know that the below-mentioned assumptions are guaranteed.
    If passing data that do not follow those assumptions, subsequent
    operations on the resulting DataFrame might lead to strange errors.

    Assumptions:

    - Each block array is either a 2D numpy array or a pandas ExtensionArray.
    - In case of a numpy array, it is assumed to already be in the expected
      shape for Blocks (2D, (cols, rows), i.e. transposed compared to the
      DataFrame columns).
    - All arrays are taken as is (no type inference) and expected to have the
      correct size.
    - The placement arrays have the correct length (equalling the number of
      columns that its equivalent block array represents), and all placement
      arrays together form a complete set of 0 to n_columns - 1.

    Parameters
    ----------
    blocks : list of tuples of (block_array, block_placement)
        This should be a list of tuples consisting of (block_array,
        block_placement), where:

        - block_array is a 2D numpy array or a 1D ExtensionArray, following
          the requirements listed above.
        - block_placement is a 1D integer numpy array
    index : Index
        The Index object for the `index` of the resulting DataFrame.
    columns : Index
        The Index object for the `columns` of the resulting DataFrame.

    Returns
    -------
    DataFrame
    """
    blocks = [_make_block(*block) for block in blocks]
    axes = [columns, index]
    mgr = _BlockManager(blocks, axes)
    return DataFrame._from_mgr(mgr, mgr.axes)
33 changes: 32 additions & 1 deletion pandas/core/internals/api.py
@@ -18,10 +18,14 @@
from pandas.core.dtypes.common import pandas_dtype
from pandas.core.dtypes.dtypes import (
    DatetimeTZDtype,
    ExtensionDtype,
    PeriodDtype,
)

from pandas.core.arrays import DatetimeArray
from pandas.core.arrays import (
    DatetimeArray,
    ExtensionArray,
)
from pandas.core.construction import extract_array
from pandas.core.internals.blocks import (
    check_ndim,
@@ -37,6 +41,33 @@
from pandas.core.internals.blocks import Block


def _make_block(values: ExtensionArray | np.ndarray, placement: np.ndarray) -> Block:
    """
    This is an analogue to blocks.new_block(_2d) that ensures:
    1) correct dimension for EAs that support 2D (`ensure_block_shape`), and
    2) correct EA class for datetime64/timedelta64 (`maybe_coerce_values`).

    The input `values` is assumed to be either a numpy array or an
    ExtensionArray:

    - In case of a numpy array, it is assumed to already be in the expected
      shape for Blocks (2D, (cols, rows)).
    - In case of an ExtensionArray the input can be 1D, also for EAs that are
      internally stored as 2D.

    For the rest no preprocessing or validation is done, except for those
    dtypes that are internally stored as EAs but have an exact numpy
    equivalent (and at the moment use that numpy dtype), i.e.
    datetime64/timedelta64.
    """
    dtype = values.dtype
    klass = get_block_type(dtype)
    placement = BlockPlacement(placement)

    if isinstance(dtype, ExtensionDtype) and dtype._supports_2d:
        values = ensure_block_shape(values, ndim=2)

    values = maybe_coerce_values(values)
    return klass(values, ndim=2, placement=placement)


def make_block(
values, placement, klass=None, ndim=None, dtype: Dtype | None = None
) -> Block:
1 change: 1 addition & 0 deletions scripts/validate_unwanted_patterns.py
@@ -54,6 +54,7 @@
    # TODO(4.0): GH#55043 - remove upon removal of CoW option
    "_get_option",
    "_fill_limit_area_1d",
    "_make_block",
}

