-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
read_excel() modifies provided types dict when accessing file with duplicate column #42508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 10 commits
cf40fb3
a45535c
6c98fa3
e79e9a1
c10b931
4419146
b63aef2
f0f3022
cf27280
52c9bd5
0f78c9f
cb369bb
7bcb504
ffb5852
72d50f4
57c65e5
3df5cf3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,7 +25,6 @@ | |
) | ||
|
||
from pandas.core.dtypes.common import is_integer | ||
from pandas.core.dtypes.inference import is_dict_like | ||
|
||
from pandas.io.parsers.base_parser import ( | ||
ParserBase, | ||
|
@@ -417,21 +416,13 @@ def _infer_columns(self): | |
counts: DefaultDict = defaultdict(int) | ||
|
||
for i, col in enumerate(this_columns): | ||
old_col = col | ||
cur_count = counts[col] | ||
|
||
if cur_count > 0: | ||
while cur_count > 0: | ||
counts[col] = cur_count + 1 | ||
col = f"{col}.{cur_count}" | ||
cur_count = counts[col] | ||
if ( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The Same is done in the c parser, you should remove that too probably? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. made the changes in c parser as well There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Test failure Looks related There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that particular failure is resolved now. I am not getting the current failures in local. |
||
self.dtype is not None | ||
and is_dict_like(self.dtype) | ||
and self.dtype.get(old_col) is not None | ||
and self.dtype.get(col) is None | ||
): | ||
self.dtype.update({col: self.dtype.get(old_col)}) | ||
|
||
this_columns[i] = col | ||
counts[col] = cur_count + 1 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -576,8 +576,11 @@ def test_reader_dtype_str(self, read_ext, dtype, expected): | |
def test_dtype_mangle_dup_cols(self, read_ext, dtypes, exp_value): | ||
# GH#35211 | ||
basename = "df_mangle_dup_col_dtypes" | ||
result = pd.read_excel(basename + read_ext, dtype={"a": str, **dtypes}) | ||
dtype_dict = {"a": str, **dtypes} | ||
dtype_dict_copy = dtype_dict.copy() | ||
result = pd.read_excel(basename + read_ext, dtype=dtype_dict) | ||
expected = DataFrame({"a": ["1"], "a.1": [exp_value]}) | ||
assert dtype_dict == dtype_dict_copy, "dtype dict changed" # GH 42462 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please move the issue reference to the start and remove the comment, same below |
||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_reader_spaces(self, read_ext): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm, can we not leave the original and simply copy self.dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I had done that originally (send a copy of dtype, so that the original is unaffected). But then you suggested to try and not modify the dict at all(#42508 (comment)) 😅 . Also @phofl had mentioned that this was a unwanted side effect of #41411. So, I tried to incorporate those comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please let me know if you want me to revert all the changes, and just pass a copy of dtype dictionary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer keeping the mangle_dup_cols logic together, e.g. simply copying the dict would be the best here. If we spread this over the code base it will be harder to refactor