REF/INT: concat blocks of same type with preserving block type #17728
Conversation
Hello @jorisvandenbossche! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on October 10, 2017 at 23:27 UTC
Codecov Report
@@ Coverage Diff @@
## master #17728 +/- ##
==========================================
- Coverage 91.27% 91.26% -0.01%
==========================================
Files 163 163
Lines 49765 49793 +28
==========================================
+ Hits 45421 45442 +21
- Misses 4344 4351 +7
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #17728 +/- ##
==========================================
+ Coverage 91.22% 91.24% +0.01%
==========================================
Files 163 163
Lines 50014 50043 +29
==========================================
+ Hits 45627 45661 +34
+ Misses 4387 4382 -5
Continue to review full report at Codecov.
@jreback this PR introduces some additional code without removing any (due to other complexities, like the index not being based on blocks, I don't think I can remove any code), but I think the code addition is not excessive for obtaining the goal. It is still a bit messy where I need to check for the SparseBlock. It also seems to give a modest speed-up on a very quick test, from around 900ms to around 700ms (although performance was not the original motivation):
pandas/core/internals.py (outdated)
@@ -2684,6 +2705,19 @@ def shift(self, periods, axis=0, mgr=None):
        return [self.make_block_same_class(new_values,
                                           placement=self.mgr_locs)]

    def concat_same_type(self, to_concat):
we should probably make a dispatch function in _concat to avoid all of this boilerplate:

def _concat_for_type(to_concat, typ='sparse|datetime|cat')
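A rough sketch of what such a dispatch helper could look like (hypothetical: _concat_for_type does not exist in pandas; _concat_categorical and _concat_datetime are the private helpers living in pandas.core.dtypes.concat at the time):

import numpy as np
from pandas.core.dtypes import concat as _concat

def _concat_for_type(to_concat, typ=None):
    # Hypothetical dispatch table: map a block kind to the low-level
    # routine that concatenates its values.
    dispatch = {
        'categorical': _concat._concat_categorical,
        'datetime': _concat._concat_datetime,
    }
    concatenator = dispatch.get(typ, np.concatenate)
    return concatenator(to_concat)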
pandas/core/reshape/concat.py (outdated)
            # concat Series with length to keep dtype as much
            non_empties = [x for x in self.objs if len(x) > 0]

            # check if all series are of the same block type:
this is the purpose of concat_compat, why are you re-writing?
concat_compat works on the level of values; here I want to work on the level of blocks. I could also try to do that in concat_compat, but I am not sure I can fully switch that to using blocks (e.g. for concatting an index it still needs to work on values).
yeah, we just pushed the sparse handling for this down into the blocks to make .unstack cleaner. It's not that hard, actually: you would just have a BlockManager.concat which does the work for you (e.g. give it a list of other block managers).
right, see my comment. This would be ideal to actually do inside the BlockManager (not objecting to the actual code, more to its location).
For Series this would be SingleBlockManager then?
I would be OK with moving this there, but in SingleBlockManager.concat I would still need to dispatch to the actual Blocks (as I now have the methods on the individual blocks), since my original goal is enabling external blocks to override how their values are concatenated.
sure, that's fine; your code would be pretty similar. The idea is that you just have a Block.concat that works for most cases, and you can override it as needed (e.g. in your custom block).
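As a toy illustration of that pattern (standalone stand-ins for the internal classes, not the actual pandas code):

import numpy as np
import pandas as pd

class ToyBlock(object):
    # Simplified stand-in for pandas' internal Block.
    def __init__(self, values):
        self.values = values

    def concat_same_type(self, to_concat):
        # Default behaviour: plain numpy concatenation of the values,
        # wrapped back into the same block class.
        values = np.concatenate([blk.values for blk in to_concat])
        return type(self)(values)

class ToyCategoricalBlock(ToyBlock):
    def concat_same_type(self, to_concat):
        # Override with a categorical-aware routine so the dtype
        # survives (union_categoricals as a public stand-in for the
        # private _concat_categorical helper).
        values = pd.api.types.union_categoricals(
            [blk.values for blk in to_concat])
        return type(self)(values)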
yeah, the sparse unstack PR is indeed a good example for this
So I made an attempt to move the logic inside SingleBlockManager, see the last commit. I'm not fully sure I find it cleaner (as e.g. the new_axis is passed into the function, since that logic is currently quite ingrained into the _Concatenator class to calculate the new axis values).
I added a similar logic to …
looks pretty reasonable.
pandas/core/internals.py (outdated)
        to_concat = [blk.values for blk in to_concat]
        values = _concat._concat_categorical(to_concat, axis=self.ndim - 1)

        if is_categorical_dtype(values.dtype):
I think you could dispense with all of these if/thens and just use make_block
Yep, I could, but I wanted to take advantage of the fact that in those cases I know it is categorical, so I don't need to re-infer that.
Not sure how important that is for the internal ones though, as the if/else statements in make_block will not add that much overhead (I added the concat_same_type machinery to be able to do that in an external Block, but that does not mean I necessarily need to do it internally).
it's just added code and more complexity. Trying to reduce code here.
So I removed this if/else check in the categorical and datetimetz block's concat_same_type.
But if you want, I can also completely remove all overriding of Block.concat_same_type and use a generic one in the base Block type (using _concat._concat_compat and make_block instead of np.concatenate and self.make_block_same_class). That way we only keep the ability of external block types (like geopandas) to override this, without using that ability internally. But doing that would remove the performance improvement I mentioned above (which is not the main objective of this PR, so not too bad, but having it is still nice).
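For reference, that generic fallback would read roughly as follows (a sketch against the internals of that era, assuming the private _concat_compat helper and the module-level make_block factory):

from pandas.core.dtypes import concat as _concat
from pandas.core.internals import make_block

def concat_same_type(self, to_concat):
    # Generic version on the base Block: concatenate the raw values
    # with the usual dtype-coercion rules and let make_block re-infer
    # the resulting block type.
    values = _concat._concat_compat([blk.values for blk in to_concat])
    return make_block(values, placement=slice(0, len(values), 1))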
""" | ||
return ( | ||
# all blocks need to have the same type |
a mouthful!
can you put blank lines in between statements
    blk_mgr = BlockManager([block], [['col'], range(3)])
    df = pd.DataFrame(blk_mgr)
    assert repr(df) == ' col\n0 Val: 0\n1 Val: 1\n2 Val: 2'


def test_concat_series():
can you add the issue number here
ping when ready to have a look
@@ -314,6 +314,15 @@ def ftype(self):
    def merge(self, other):
        return _merge_blocks([self, other])

    def concat_same_type(self, to_concat, placement=None):
actually you could combine all of these concat_same_type into a single routine if you did this:

def concat_same_type(self, to_concat, placement=None):
    """
    Concatenate list of single blocks of the same type.
    """
    values = self._concatenator([blk.values for blk in to_concat],
                                axis=self.ndim - 1)
    return self.make_block_same_class(
        values, placement=placement or slice(0, len(values), 1))

Then add to Block:

_concatenator = np.concatenate

and to Categorical:

_concatenator = _concat._concat_categorical

etc.
Ah, that would actually be nice. The only problem is that (for the reasons I had the if/else statements originally) self.make_block_same_class will not always work. E.g. _concat_categorical can return both categorical values and object values, and depending on that it should return a CategoricalBlock or another type of Block.
so override for the categorical block; otherwise you end up repeating lots of code
but I need to override for DatetimeTZ as well, so we end up with almost as many overridden ones as now (only for Sparse it would not be needed then).
this is still repeating way too much code. You can just do it this way, in Block:

def concat_same_type(self, to_concat, constructor=None, placement=None):
    values = self._concatenator(......)
    if constructor is None:
        constructor = make_block
    return constructor(....)

then where needed:

def concat_same_type(.......):
    return super(Categorical, self).concat_same_type(
        ....., constructor=self.make_block_same_class)

That way, for an overridden class you are not repeating everything.
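Filling in the elided pieces, the suggested pattern would look roughly like this (a sketch, not the code as merged; the staticmethod wrapper is needed so a stored plain Python function is not bound as a method):

import numpy as np
from pandas.core.dtypes import concat as _concat
from pandas.core.internals import make_block

class Block(object):
    _concatenator = staticmethod(np.concatenate)

    def concat_same_type(self, to_concat, constructor=None, placement=None):
        # One shared implementation: concatenate the values, then build
        # the result. By default make_block re-infers the block type;
        # subclasses pass their own constructor to keep it.
        values = self._concatenator([blk.values for blk in to_concat],
                                    axis=self.ndim - 1)
        if constructor is None:
            constructor = make_block
        return constructor(values,
                           placement=placement or slice(0, len(values), 1))

class CategoricalBlock(Block):
    _concatenator = staticmethod(_concat._concat_categorical)

    def concat_same_type(self, to_concat, placement=None):
        # Only the constructor differs; everything else is inherited.
        return super(CategoricalBlock, self).concat_same_type(
            to_concat, constructor=self.make_block_same_class,
            placement=placement)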
        if len(non_empties) > 0:
            blocks = [obj.blocks[0] for obj in non_empties]

            if all([type(b) is type(blocks[0]) for b in blocks[1:]]):  # noqa
could some of this logic be moved to concat_same_type?
Which logic do you mean exactly?
For now, I coded concat_same_type such that it assumes only blocks of the same type are passed (so I don't do the checking there, but before the method is called).
This looks good to me. OK to merge, and then iterate as needed (basically as driven by geopandas)?
So as long as we agree on the "external interface", which currently is a …
Just to note that I have no time until Friday to do edits (a work-related coding event). But I think the requested changes are more about exact internal implementation cosmetics, more clean-up than fundamental code changes. So depending on when you want to release, I would propose to merge this, and I promise to do a follow-up PR.
Fine by me. I think we can trust you to not flake off :)
thanks @jorisvandenbossche, yep a clean-up would be great.
* Add _concatenator method to GeometryBlock (following on pandas-dev/pandas#17728)
* Use None for missing values: previously we used `Empty Polygon` for missing values; now we revert to using NULL in GeometryArray (as before) and Python None when we convert to shapely objects
* fix align and fillna tests
* implement dropna
* remove comments
* Redefine isna: this makes it so that only Nones and NaNs are considered missing
* clean up tests
* respond to comments
* remove unsupported kwargs
Related to #17283

Goal is to get pd.concat([list of Series]) working with Series of an external block type. Currently the values are always converted to arrays, concatenated, and converted back to a block with block-type inference. This is a proof of concept to check whether something like this would be approved. Tests should still break because I didn't specialize concat_same_type for categorical data.
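A small usage sketch of the behavior this enables on a built-in block type (dtype preservation, which concat_same_type makes overridable per block):

import pandas as pd

# Two categorical Series sharing the same categories.
s1 = pd.Series(pd.Categorical(['a', 'b'], categories=['a', 'b', 'c']))
s2 = pd.Series(pd.Categorical(['c', 'a'], categories=['a', 'b', 'c']))

result = pd.concat([s1, s2], ignore_index=True)

# With blocks concatenated via concat_same_type, the categorical dtype
# is preserved rather than the values being converted to an object
# ndarray and the dtype re-inferred.
print(result.dtype)  # category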