Skip to content

Internals: concat does not preserve block types / does not take ndim of blocks into account #17283

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Aug 18, 2017 · 5 comments
Labels
Internals Related to non-user accessible pandas implementation

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 18, 2017

xref #17144

I am trying to get concat (and affiliated functionality) working with the external GeometryBlock (which we are developing in geopandas). This is currently failing, and in such a manner (I think) that it will not be fixable in geopandas.

Currenlty the concat (and the concatenate_join_units) functionality has a lot of hardcoded logic about blocks. Two problems that I encountered (so far):

  • it does not respect the 'dimensionality' of the block.

    • In case of non-consolidatable blocks, the block values are stored in 1D array-likes, while the block is actually 2D. For this, there is some hard-coded logic: in _concat_compat (
      def _concat_compat(to_concat, axis=0):
      , called from concatenate_join_units), eg for tz datetimes it defers to _concat_datetimetz which knows it gets 1D instead of 2D data. The general case however is passed to np.concatenate(to_concat, axis=axis) where the axis does not take the dimensionality of the Block into account (which leads to an error if your block is non-consolidatable with 1D underlying data)
    • A possible solution is to check the dimensionality of the data of the block, and if it is 1D take this into account when concatting the data.
    • However, the problem is then still that np.concatenate does not fully the right thing. So a further possible solution would be that there is eg a class method on the block to concatenate the blocks.
  • In general, Block types are not preserved

    • Eg in the case of concatting two Series objects of the same block type, the values are concatted and the block type is re-inferred. This means that the result can never have the externally defined block type without pandas knowing about this block type. (this does not only happen in concat, but in other places as well)
    • In the specific case of concat, I think we could certainly check if the block types are equal and in that case preserve the block class.
    • More in general, it is difficult to manipulate with BlockManagers while preserving block types. Eg BlockManager.insert/set are not able to add a Block while preserving the block class (they always infer the block type based on the data). Other example is concatenate_block_managers where the blocks are again reconstructed without taking into account the original block class.
    • Would it be desirable/acceptable to more retain the block class? (not sure if this would have impact on pandas itself)

@jreback (cc @mrocklin)

@gfyoung gfyoung added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Internals Related to non-user accessible pandas implementation and removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 18, 2017
@jorisvandenbossche
Copy link
Member Author

@jreback Do you have an opinion about this?
We mentioned it very shortly at the end of the dev call, but I don't remember anymore what you said.

@mrocklin
Copy link
Contributor

mrocklin commented Oct 1, 2017

Thought I'd ping this in case this got lost in @jreback's inbox (which I'm sure is overflowing).

This poses a significant usability issue to the geopandas cythonization process.

@jorisvandenbossche
Copy link
Member Author

@mrocklin FYI, I just started a PR looking at the Series case: #17728

@mrocklin
Copy link
Contributor

mrocklin commented Oct 1, 2017

Oh excellent

@jorisvandenbossche
Copy link
Member Author

I think this can be closed now (given the merged PRs and the ExtensionArray interface functionality). Any remaining problems with concatting would be bugs in the ExtensionArray interface.

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Jul 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

No branches or pull requests

3 participants