
API/BUG: pd.concat with axis=1 and copy=True doesn't copy indexes when they are the same #50673


Open
lithomas1 opened this issue Jan 11, 2023 · 3 comments
Labels
Bug Copy / view semantics Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@lithomas1
Member

lithomas1 commented Jan 11, 2023

Example

>>> import numpy as np
>>> import pandas as pd
>>> a = np.array([1,2,3])
>>> c = pd.DataFrame({0: ["a", "b", "c"]}, index=a)
>>> print(c)
   0
1  a
2  b
3  c
>>> d = pd.DataFrame({1: ["d", "e", "f"]}, index=a)
>>> print(d)
   1
1  d
2  e
3  f
>>> cd_df = pd.concat([c,d], axis=1, copy=True)
>>> a[0] = 9
>>> cd_df # Changed
   0  1
9  a  d
2  b  e
3  c  f

When the indexes are the same, concat with axis=1 and copy=True doesn't copy the index.

Originally discovered in #37441 (index was not copied, and was a view pointing to the original ndarray read by PyTables, preventing it from being freed).
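
One way to check whether a result's index still aliases the source array is NumPy's `np.shares_memory`. The sketch below is only an illustration: the helper name `index_aliases` is hypothetical (it is not part of pandas), and whether it reports `True` for the `concat` result may depend on the pandas version and copy-on-write settings, so no output is shown for that case.

```python
import numpy as np
import pandas as pd


def index_aliases(arr, obj):
    """Return True if obj.index is backed by the same memory as arr.

    Hypothetical helper for illustration; not part of pandas.
    """
    return bool(np.shares_memory(arr, obj.index.to_numpy()))


a = np.array([1, 2, 3])
c = pd.DataFrame({0: ["a", "b", "c"]}, index=a)
d = pd.DataFrame({1: ["d", "e", "f"]}, index=a)
cd_df = pd.concat([c, d], axis=1, copy=True)

# The result here is version-dependent, so it is computed but not asserted.
concat_shares = index_aliases(a, cd_df)

# An index built from an explicit copy cannot alias the original array.
independent = pd.DataFrame({0: ["a", "b", "c"]}, index=a.copy())
copy_shares = index_aliases(a, independent)
```

If `concat_shares` comes back `True`, mutating `a` afterwards will also change the labels shown for `cd_df`, which is the behavior reported above.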

@rhshadrach
Member

Indices should be thought of as immutable, no? Much of the functionality (e.g. cached properties) depends on this, I think. Should modifying _data really be supported?

@lithomas1
Member Author

Hmm. This probably wasn't the best example.
I still think we should do the copy, though.

Here's one that doesn't abuse internals:

>>> import numpy as np
>>> import pandas as pd
>>> a = np.array([1,2,3])
>>> c = pd.DataFrame({0: ["a", "b", "c"]}, index=a)
>>> print(c)
   0
1  a
2  b
3  c
>>> d = pd.DataFrame({1: ["d", "e", "f"]}, index=a)
>>> print(d)
   1
1  d
2  e
3  f
>>> cd_df = pd.concat([c,d], axis=1, copy=True)
>>> a[0] = 9
>>> cd_df # Changed
   0  1
9  a  d
2  b  e
3  c  f

@rhshadrach
Member

rhshadrach commented Jan 11, 2023

It's a question of performance vs safety. When pandas creates an Index, it doesn't copy memory unnecessarily. As a result, an index can change even without concat, and cached properties can then go stale:

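import numpy as np
import pandas as pd

a = np.array([1, 2, 3])
# The index is built from `a` without copying it.
c = pd.DataFrame({0: ["a", "b", "c"]}, index=a)
print(c.index.is_monotonic_increasing)
# True

# Mutating the original array also changes the index labels.
a[0] = 9
print(c)
#    0
# 9  a
# 2  b
# 3  c

# Still True, although the labels are now 9, 2, 3.
print(c.index.is_monotonic_increasing)
# True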

So to me, this is more about whether a copy should happen on Index construction. We can only guarantee immutability (assuming no access to protected internals) if a copy is made, but this could be inefficient.

We make a pretty blanket assumption throughout pandas that an Index is immutable, and, assuming that to be the case, I think concat should not make a copy. But perhaps construction should.
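
Until that question is settled, a caller can get the safe behavior by copying the array before handing it to pandas. This is a minimal sketch of the trade-off discussed here, assuming the extra allocation is acceptable; it is a caller-side workaround, not a statement of how pandas itself should or does behave.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])

# Copy the array up front so the Index owns its own memory.
c = pd.DataFrame({0: ["a", "b", "c"]}, index=a.copy())
monotonic_before = c.index.is_monotonic_increasing

# Mutate only the original array; the index was built from a separate copy.
a[0] = 9

labels_after = list(c.index)
monotonic_after = c.index.is_monotonic_increasing
```

Because the index no longer shares memory with `a`, its labels and its cached monotonicity check stay consistent with each other after the mutation.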

Related: #42934; cc @jorisvandenbossche


3 participants