BUG: concat on axis with both different and duplicate labels raising error #6963
The main issue is how to align indices that both have duplicate items. As of now, indexing with dupes does strange things:

```python
In [1]: pd.Index([1,1,2])
Out[1]: Int64Index([1, 1, 2], dtype='int64')

In [2]: _1.get_indexer_for(_1)
Out[2]: Int64Index([0, 1, 0, 1, 2], dtype='int64')
```

Apparently, for each non-unique element found in the destination, get_indexer tries to insert all locs of this element. I can hardly think of a use case when I'd want that. |
The dup indexers came out of supporting duplicate indexers on a unique index, so you have to dup.

Now maybe outlaw that, I suppose. |
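To make the distinction above concrete, here is a small sketch of the two cases: a duplicated indexer against a unique index (well defined), versus indexing into a non-unique index (the ambiguous case this thread is about).

```python
import pandas as pd

# A duplicated *indexer* against a unique index is well defined:
# each requested label has exactly one location, repeats just repeat it.
print(list(pd.Index([1, 2, 3]).get_indexer_for([1, 1, 2])))  # [0, 0, 1]

# Indexing into a *non-unique* index is the ambiguous case: every
# occurrence of each label is returned, so the result can be longer
# than the indexer itself.
print(list(pd.Index([1, 1, 2]).get_indexer_for([1, 2])))     # [0, 1, 2]
```

The second call is the behaviour being questioned: requesting `1` once returns two locations.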
Maybe it would make more sense to require the destination index to have the same count of duplicate entries for each element present in the source, i.e.:

```python
# e.g. these should be ok
pd.Index([1,1,2]).get_indexer_for([1,1,2])    # should be ok and return [0, 1, 2]
pd.Index([1,1,2]).get_indexer_for([2])        # return [2]
pd.Index([1,1,2]).get_indexer_for([1,2,1])    # return [0, 2, 1]

# but these should be forbidden
pd.Index([1,1,2]).get_indexer_for([1,2])      # which one of `1` did you want?
pd.Index([1,1,2]).get_indexer_for([1,1,1,2])  # which `1` should be duplicated?
```

UPD: or maybe cycle over duplicate elements like np.putmask does:

```python
x = np.arange(5)
np.putmask(x, x > 1, [-33, -44])
print(x)
# array([  0,   1, -33, -44, -33])

# so that
pd.Index([1,1,2]).get_indexer_for([1,1,2,1,1])  # return [0,1,2,0,1] ?
```
|
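The "same count of duplicates" proposal above can be sketched as a standalone helper. `strict_get_indexer` is a hypothetical name, not a pandas API: it only allows a non-unique lookup when every label is requested exactly as often as it occurs in the index, then hands out positions in order of appearance.

```python
from collections import Counter
import pandas as pd

def strict_get_indexer(index: pd.Index, targets) -> list:
    """Hypothetical strict variant of get_indexer_for: duplicated labels
    must be requested exactly as many times as they occur in the index."""
    targets = list(targets)
    src, dst = Counter(index), Counter(targets)
    for label, n in dst.items():
        if src[label] != n:
            raise ValueError(
                f"label {label!r} occurs {src[label]} time(s) in the index "
                f"but is requested {n} time(s)"
            )
    # hand out each label's positions in order of appearance
    pools = {}
    for pos, label in enumerate(index):
        pools.setdefault(label, []).append(pos)
    return [pools[label].pop(0) for label in targets]

idx = pd.Index([1, 1, 2])
print(strict_get_indexer(idx, [1, 1, 2]))  # [0, 1, 2]
print(strict_get_indexer(idx, [1, 2, 1]))  # [0, 2, 1]
# strict_get_indexer(idx, [1, 2]) would raise: which `1` did you want?
```

This matches the "should be ok" cases above and raises on the "forbidden" ones, rather than cycling putmask-style.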
right, so that's a duplicate of a duplicate; yes, that would need to be handled differently |
Some more examples from #17552. |
@xuancong84 this is anti pandas philosophy. Having order dependencies is extremely fragile and would work unexpectedly if you happened to reorder columns or not. We must align columns on the labels; aligning only some of them is just really odd. If you want to treat this as a tensor, then drop the labels. -1 on this proposal. |
@jreback Sorry, I could not get what you mean. Could you be more specific, with examples? Regardless of how you feel, while developing the omnipotent data plotter for pandas (https://github.com/xuancong84/beiwe-visualizer) I am experiencing lots of frustrating limitations of pandas. Many situations that should be handled easily are not handled well; you might want to take a look at my code to see how troublesome the workarounds are to get things working. Regarding order dependencies, I know it is not ideal because a Python dictionary does not preserve order (unless you use OrderedDict). But in cases where there are duplicate names in both row indices and column names, that is the only way to make things work. Otherwise, pandas just ends up with unhelpful crashes such as #28479 and #30772 where it should in principle work correctly. |
I understand. Is it not easy to fix? Could there at least be an exception with a helpful message? I stumbled on this today and it took some time to understand the problem. |
Notifications keep bringing me here :) I haven't touched the pandas codebase for a while, so take my 2¢ with a grain of salt.
I agree with that: order dependencies are fragile and unreliable. And as a maintainer of other projects I get the sentiment of not adding stuff unless really necessary, even if it is conceptual stuff, like "in case of indexing non-unique indexes with a non-unique indexer (re-reading this sentence hurts), matching is performed according to the order of the labels".

But from a pandas end-user perspective I do see this as a UX papercut. Yes, most of the time you shouldn't care about the ordering of columns or rows, but there are cases where there is no way around it. At that point you either verify the existing ordering, or sort by a given criterion, and then for a short period of time you can rely on a specific order to perform a specific operation. A good example of this would be forward-/backward-filling of NAs: ordering along the filling axis will directly influence the outcome, so before applying that I would need to make sure the data is ordered the way I want.

The same approach could be applicable here: if you need to concatenate dataframes with non-unique labels and they are not in the order you want them to be, it's up to you to sort them in whatever order you like. |
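The forward-fill example above can be made concrete: the same data in a different row order fills different values, which is exactly the kind of order dependence being discussed.

```python
import pandas as pd

# Forward-filling is order dependent: the fill propagates positionally,
# so reordering the rows changes which value gets carried forward.
s = pd.Series([1.0, None, 3.0], index=[2, 1, 0])

print(s.ffill().tolist())               # [1.0, 1.0, 3.0]
print(s.sort_index().ffill().tolist())  # [3.0, 3.0, 1.0]
```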
I have a few more cases of the error messages being much less helpful than they could be:

```python
pd.concat([  # One dataframe has repeated column names
    pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((4, 3)), columns=list("abc")),
])
# ValueError: Plan shapes are not aligned

pd.concat([  # Repeated columns (same amount), different column ordering
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((2, 4)), columns=list("abca")),
])
# AssertionError: Number of manager items must equal union of block items
```

Full tracebacks:

```
>>> import pandas as pd, numpy as np
>>> pd.concat([  # One dataframe has repeated column names
...     pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
...     pd.DataFrame(np.ones((4, 3)), columns=list("abc")),
... ])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 287, in concat
    return op.get_result()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 502, in get_result
    new_data = concatenate_block_managers(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 54, in concatenate_block_managers
    for placement, join_units in concat_plan:
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 561, in _combine_concat_plans
    raise ValueError("Plan shapes are not aligned")
ValueError: Plan shapes are not aligned

>>> pd.concat([  # Repeated columns (same amount), different column ordering
...     pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
...     pd.DataFrame(np.ones((4, 4)), columns=list("abca")),
... ])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 287, in concat
    return op.get_result()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 502, in get_result
    new_data = concatenate_block_managers(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 84, in concatenate_block_managers
    return BlockManager(blocks, axes)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 149, in __init__
    self._verify_integrity()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 331, in _verify_integrity
    raise AssertionError(
AssertionError: Number of manager items must equal union of block items
# manager items: 3, # tot_items: 4
```

I think it's a little strange that the following works, but the previous examples don't:

```
>>> pd.concat([  # Repeated columns, same ordering
...     pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
...     pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
... ])
     a    a    b    c
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0
```

Could there be a check for this in concatenation which throws a better error? If non-unique column names are to be disallowed, it could be something simple, like this other error pandas throws:

It could even be more specific and name some of the repeated elements, if you wanted to get fancy. If the case where ordering is preserved will be kept, it could be something like:

|
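The "better error" being asked for could look something like the following sketch. `check_concat_columns` is a hypothetical helper, not pandas code: it only validates that duplicated column labels appear with the same multiplicity in every frame (it does not check ordering), and it names the offending labels in the message.

```python
from collections import Counter
import numpy as np
import pandas as pd

def check_concat_columns(frames) -> None:
    """Hypothetical pre-flight check: if any frame has duplicated column
    labels, require every frame to carry the same labels with the same
    multiplicities, and name the mismatched labels on failure."""
    counts = [Counter(f.columns) for f in frames]
    if not any(n > 1 for c in counts for n in c.values()):
        return  # no duplicates anywhere: normal label alignment applies
    first = counts[0]
    for i, c in enumerate(counts[1:], start=1):
        if c != first:
            offending = sorted(
                set(first) ^ set(c) | {k for k in first if first[k] != c.get(k)}
            )
            raise ValueError(
                f"Cannot align frame 0 and frame {i}: duplicated column "
                f"labels must match exactly; mismatched labels: {offending}"
            )

frames = [
    pd.DataFrame(np.ones((4, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((4, 3)), columns=list("abc")),
]
# check_concat_columns(frames) raises:
# ValueError: Cannot align frame 0 and frame 1: duplicated column labels
# must match exactly; mismatched labels: ['a']
```

Frames with identical duplicated columns (the case shown above that works) pass the check unchanged.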
@ivirshup happy to take a PR to have a better error message. Yeah, duplicates along the axis of concatenation are almost always an error. |
I'd be happy to make a PR. I feel like there might be code in pandas that does checks like these already. Any chance you could point me to places these might be for reference?

Also, I'm assuming you want to keep the existing behaviour of the following working for now?

```python
pd.concat([  # Repeated columns, same order
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
    pd.DataFrame(np.ones((2, 4)), columns=list("aabc")),
])
```

Is that right? |
@jreback, I think this conflicts pretty directly with #36290, which allows duplicate items. Also, I think there are some bugs in that implementation. Using the current release candidate:

```python
import pandas as pd
import numpy as np
from string import ascii_lowercase

letters = np.array(list(ascii_lowercase))
a_int = pd.DataFrame(np.arange(5), index=[0,1,2,3,3], columns=['a'])
b_int = pd.DataFrame(np.arange(5), index=[0,1,2,2,4], columns=['b'])
a_str = a_int.set_index(letters[a_int.index])
b_str = b_int.set_index(letters[b_int.index])
```

This works (the purpose of the PR, and the example in its linked issue):

```python
pd.concat([a_int, b_int], axis=1)
```
This does not work, though I believe it's pretty equivalent to the previous example:

```
>>> pd.concat([a_str, b_str], axis=1)
----> 1 pd.concat([a_str, b_str], axis=1)

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    297 )
    298
--> 299 return op.get_result()
    300
    301

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/reshape/concat.py in get_result(self)
    526 mgrs_indexers.append((obj._mgr, indexers))
    527
--> 528 new_data = concatenate_block_managers(
    529     mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy
    530 )

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/internals/concat.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
     87 blocks.append(b)
     88
---> 89 return BlockManager(blocks, axes)
     90
     91

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    141
    142 if do_integrity_check:
--> 143 self._verify_integrity()
    144
    145 # Populate known_consolidate, blknos, and blklocs lazily

~/miniconda3/envs/pandas-1.2/lib/python3.8/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    321 for block in self.blocks:
    322 if block.shape[1:] != mgr_shape[1:]:
--> 323 raise construction_error(tot_items, block.shape[1:], self.axes)
    324 if len(self.items) != tot_items:
    325 raise AssertionError(
ValueError: Shape of passed values is (6, 2), indices imply (5, 2)
```

As an overall point, I think the target behaviour of that PR is wrong. Here's an example of why:

```python
# Using pandas 1.2.0rc0
df1 = pd.DataFrame(np.arange(3), index=[0,1,1], columns=['a'])
df2 = pd.DataFrame(np.arange(3), index=[1,0,1], columns=['b'])
pd.concat([df1, df2], axis=1)
```

The results here rely on the ordering of the labels (#6963 (comment)), which I agree is brittle. I think there are two more reasonable options for the behaviour.
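One order-independent workaround for the brittle case above (a sketch, not built-in pandas behaviour): disambiguate the duplicated labels by appending each label's occurrence number as an extra index level, so alignment no longer depends on row order. The `dedup` helper name is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 1, 2]}, index=[0, 1, 1])
df2 = pd.DataFrame({"b": [0, 1, 2]}, index=[1, 0, 1])

def dedup(df: pd.DataFrame) -> pd.DataFrame:
    # append the occurrence number of each label as a second index level,
    # making the index unique so concat can align unambiguously
    return df.set_index(df.groupby(level=0).cumcount(), append=True)

out = pd.concat([dedup(df1), dedup(df2)], axis=1)
print(out.sort_index())
```

Here the first `1` in `df1` pairs with the first `1` in `df2` regardless of where either appears, which is one explicit resolution of the ambiguity the thread is debating.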
I'd note the current behaviour of `merge` for comparison.

merge "inner" and "outer" are equivalent for common repeated indices:

```
In [11]: pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
Out[11]:
   a  b
0  0  1
1  1  0
1  1  2
1  2  0
1  2  2

In [12]: pd.merge(df1, df2, left_index=True, right_index=True, how="outer")
Out[12]:
   a  b
0  0  1
1  1  0
1  1  2
1  2  0
1  2  2
```

and they do not match the current behaviour of `concat`.

The current implementation otherwise basically works for outer joins if indices are only repeated in one DataFrame. Using definitions from above, e.g.:

```
a_int = pd.DataFrame(np.random.randn(5), index=[0,1,2,3,3], columns=['a'])
b_int = pd.DataFrame(np.random.randn(5), index=[0,1,2,2,4], columns=['b'])

In [4]: pd.merge(a_int, b_int, left_index=True, right_index=True, how="outer")
Out[4]:
     a    b
0  0.0  0.0
1  1.0  1.0
2  2.0  2.0
2  2.0  3.0
3  3.0  NaN
3  4.0  NaN
4  NaN  4.0

In [89]: pd.concat([a_int, b_int], axis=1, join="outer")
Out[89]:
     a    b
0  0.0  0.0
1  1.0  1.0
2  2.0  2.0
2  2.0  3.0
3  3.0  NaN
3  4.0  NaN
4  NaN  4.0
```

But not for inner joins:

```
In [8]: pd.merge(a_int, b_int, left_index=True, right_index=True, how="inner")
Out[8]:
   a  b
0  0  0
1  1  1
2  2  2
2  2  3

In [9]: pd.concat([a_int, b_int], axis=1, join="inner")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-03adb33c977d> in <module>
----> 1 pd.concat([a_int, b_int], axis=1, join="inner")

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    297 )
    298
--> 299 return op.get_result()
    300
    301

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/reshape/concat.py in get_result(self)
    526 mgrs_indexers.append((obj._mgr, indexers))
    527
--> 528 new_data = concatenate_block_managers(
    529     mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy
    530 )

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/internals/concat.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
     87 blocks.append(b)
     88
---> 89 return BlockManager(blocks, axes)
     90
     91

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    141
    142 if do_integrity_check:
--> 143 self._verify_integrity()
    144
    145 # Populate known_consolidate, blknos, and blklocs lazily

~/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    321 for block in self.blocks:
    322 if block.shape[1:] != mgr_shape[1:]:
--> 323 raise construction_error(tot_items, block.shape[1:], self.axes)
    324 if len(self.items) != tot_items:
    325 raise AssertionError(
ValueError: Shape of passed values is (4, 2), indices imply (3, 2)
```
|
@ivirshup ahh i remember now. yeah handling duplicates is hard. so we can handle some of them. i am actually ok with raising on duplicates in either axis, but would have to see how much would break. |
When concatenating two dataframes where a) one of the dataframes has duplicate columns, and b) both have non-overlapping column names, you get an IndexError.

I don't know if it should work (although I suppose it should, since with only the duplicate columns it does work), but at least the error message is not really helpful.
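A minimal reproduction of this report might look like the following sketch. The exact exception type and message vary by pandas version (the thread above shows IndexError, ValueError, and AssertionError variants for closely related inputs), so the failure is caught generically here.

```python
import numpy as np
import pandas as pd

# Duplicate columns in one frame plus a column the other frame lacks:
# reindexing a non-unique column axis is the problematic step, and the
# resulting error does not mention the duplicate labels at all.
left = pd.DataFrame(np.ones((2, 3)), columns=list("aab"))
right = pd.DataFrame(np.ones((2, 2)), columns=list("ac"))

try:
    pd.concat([left, right])
except Exception as exc:  # exact exception type varies by version
    print(type(exc).__name__, exc)
```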