Skip to content

BUG: concat of Series of EA and other dtype fails #20840

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 29, 2018

Conversation

jorisvandenbossche
Copy link
Member

Closes #20832

@jorisvandenbossche jorisvandenbossche added Bug ExtensionArray Extending pandas with custom dtypes or arrays. labels Apr 27, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.23.0 milestone Apr 27, 2018
@codecov
Copy link

codecov bot commented Apr 27, 2018

Codecov Report

Merging #20840 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20840      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%     
==========================================
  Files         153      153              
  Lines       49280    49313      +33     
==========================================
+ Hits        45229    45261      +32     
- Misses       4051     4052       +1
Flag Coverage Δ
#multiple 90.17% <100%> (ø) ⬆️
#single 41.88% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/concat.py 99.18% <100%> (ø) ⬆️
pandas/util/testing.py 84.59% <0%> (-0.21%) ⬇️
pandas/core/arrays/base.py 83.95% <0%> (ø) ⬆️
pandas/api/extensions/__init__.py 100% <0%> (ø) ⬆️
pandas/core/series.py 93.99% <0%> (ø) ⬆️
pandas/core/internals.py 95.59% <0%> (+0.01%) ⬆️
pandas/core/indexing.py 93.14% <0%> (+0.05%) ⬆️
pandas/core/dtypes/missing.py 92.94% <0%> (+0.08%) ⬆️
pandas/core/algorithms.py 94.48% <0%> (+0.08%) ⬆️
pandas/core/dtypes/cast.py 88.06% <0%> (+0.2%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6cacdde...964a5ff. Read the comment docs.

@@ -175,8 +175,8 @@ def is_nonempty(x):
return _concat_sparse(to_concat, axis=axis, typs=typs)

extensions = [is_extension_array_dtype(x) for x in to_concat]
if any(extensions):
to_concat = [np.atleast_2d(x.astype('object')) for x in to_concat]
if any(extensions) and axis == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so what happens on axis=0? coerced toobject?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right (everything is upcast).

@@ -64,6 +64,11 @@ def test_concat_mixed_dtypes(self, data):
expected = pd.concat([df1.astype('object'), df2.astype('object')])
self.assert_frame_equal(result, expected)

result = pd.concat([df1['A'], df2['A']])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this just for axis=1? I would make this a separate test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug is just with axis=0, the default, which this is testing.

For axis=1 there's not any up-casting required.

Though I don't see any tests for concat(..., axis='columns') with the extension array. Can you add some Joris? concat EA and EA, mix of EA and non-EA, aligned and no aligned. Or should I?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, thought we already had some, but indeed, apparently not. Will add some.

if any(extensions):
to_concat = [np.atleast_2d(x.astype('object')) for x in to_concat]
if any(extensions) and axis == 1:
to_concat = [np.atleast_2d(x.astype('object')) for x in to_concat]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason you added this extra indent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, sorry that was from a previous iteration where I had a second if :-)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 28, 2018 via email

@@ -175,7 +175,7 @@ def is_nonempty(x):
return _concat_sparse(to_concat, axis=axis, typs=typs)

extensions = [is_extension_array_dtype(x) for x in to_concat]
if any(extensions):
if any(extensions) and axis == 1:
to_concat = [np.atleast_2d(x.astype('object')) for x in to_concat]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this very strange that you need to add axis=1 here, I don't think this routine should be call at all if axis=1. This seems like it should be handled at a higher level

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this very strange that you need to add axis=1 here, I don't think this routine should be call at all if axis=1

axis=1 is used all over _concat_compat. Why do you think it's strange?

For the specific case from #20832 it is axis=0, but we don't want to reshape to 2d.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the special casing non-consolidateble blocks with axis=1 is indeed an existing pattern.

For categoricals:

if axis == 1:
return res.reshape(1, len(res))

for datetimetz:

if axis == 1:
x = np.atleast_2d(x)

Note that this is all rather confusing, as axis=1 in the "concatting code" actually means axis=0 in user facing code (because we store the 1D data in 2D objects):

# Need to flip BlockManager axis in the DataFrame special case
self._is_frame = isinstance(sample, DataFrame)
if self._is_frame:
axis = 1 if axis == 0 else 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right this should be changed to use _get_block_manager_axis FYI (but can be in future).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right this should be changed to use _get_block_manager_axis FYI (but can be in future).

I don't think so, _get_block_manager_axis is for a full Series/DataFrame, the logic here is dtype-dependent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does your respsone have to do with my comment? this is exactly what _get_block_managed_axis does

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, my response was not very well phrased, but still, then you will have to explain your comment better :-)

I understood your comment as: "we should use _get_block_manager_axis here ", and AFAIK, _get_block_manager_axis lets you convert an axis of 0/1 to the appropriate number for blocks (so for DataFrame it switches 0 and 1). But here, I don't convert any axis argument, I reshape data depending on the axis value (and depending on the dtype).
So I don't fully understand where in the above code lines I would use _get_block_manager_axis

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diff --git a/pandas/core/reshape/concat.py b/pandas/core/reshape/concat.py
index 6e564975f..6570e7a83 100644
--- a/pandas/core/reshape/concat.py
+++ b/pandas/core/reshape/concat.py
@@ -312,7 +312,7 @@ class _Concatenator(object):
         # Need to flip BlockManager axis in the DataFrame special case
         self._is_frame = isinstance(sample, DataFrame)
         if self._is_frame:
-            axis = 1 if axis == 0 else 0
+            axis = sample._get_block_manager_axis(axis)
 
         self._is_series = isinstance(sample, Series)
         if not 0 <= axis <= sample.ndim:

is a more correct usage, though prob doesn't simplify code much. this is spagetti anyhow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is spagetti anyhow.

On that we certainly agree! :-)

@jreback jreback merged commit 6322043 into pandas-dev:master Apr 29, 2018
@jorisvandenbossche jorisvandenbossche deleted the ea-concat-mixed branch April 30, 2018 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants