-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
read_excel() modifies provided types dict when accessing file with duplicate column #42508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
read_excel() modifies provided types dict when accessing file with duplicate column #42508
Conversation
debnathshoham
commented
Jul 12, 2021
- closes read_excel() modifies provided types dict when accessing file with duplicate column #42462
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pr @debnathshoham. Left some comments, but in general this fix looks extremely heavyweight since we are now copying all parameters. I also think the new structure is harder to follow with locals being deleted and everything bundled in kwargs.
IMO a cleaner solution would be to find where the dtypes
dict is being modified, then either
- Change the relevant code so no modification inplace is necessary
- If the above is not possible, then copy only the dtypes dict in the case where we need to modify it
pandas/io/excel/_base.py
Outdated
kwargs = locals().copy() | ||
for each in kwargs: | ||
if isinstance(locals()[each], dict): | ||
kwargs[each] = locals()[each].copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is essentially trying to do a deepcopy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mzeitlin11 . Yes it is. I tried to use deepcopy first, but it was failing multiple tests.
I got the idea of creating a copy of kwargs from read_csv (which as you mentioned doesn't produce this unnecessary side effect). So I thought it would be a good idea to keep it consistent.
pandas/pandas/io/parsers/readers.py
Lines 566 to 586 in d60e687
# locals() should never be modified | |
kwds = locals().copy() | |
del kwds["filepath_or_buffer"] | |
del kwds["sep"] | |
kwds_defaults = _refine_defaults_read( | |
dialect, | |
delimiter, | |
delim_whitespace, | |
engine, | |
sep, | |
error_bad_lines, | |
warn_bad_lines, | |
on_bad_lines, | |
names, | |
prefix, | |
defaults={"delimiter": ","}, | |
) | |
kwds.update(kwds_defaults) | |
return _read(filepath_or_buffer, kwds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got the idea of creating a copy of kwargs from read_csv (which as you mentioned doesn't produce this unnecessary side effect). So I thought it would be a good idea to keep it consistent.
Makes sense. That case is a bit different though because kwargs
itself was being modified. Here the issue is limited to the dtypes
dict. If there's a clean way to copy (or not modify in place) the dtypes dict, I think that would be cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I will work along those lines. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if you run into any issues!
dtype_dict = {"a": str, "b": str, "c": str} | ||
dtype_dict_copy = dtype_dict.copy() | ||
pd.read_excel(filename, dtype=dtype_dict) | ||
assert dtype_dict == dtype_dict_copy, "dtype dict changed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also check that the resulting frame is as expected? (I know this is focusing on the dtypes dict, but may as well also test the reading portion here since unlikely we have great coverage for dtypes
dict with duplicate cols)
@@ -1278,6 +1278,13 @@ def test_ignore_chartsheets_by_int(self, request, read_ext): | |||
): | |||
pd.read_excel("chartsheet" + read_ext, sheet_name=1) | |||
|
|||
def test_dtype_dict(self, read_ext): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please leave a link to the relevant github issue here? And maybe also make the name of this test more specific, eg in this care we about about dtype
argument not being modified in the case of duplicate columns being present
} | ||
) | ||
assert dtype_dict == dtype_dict_copy, "dtype dict changed" | ||
tm.assert_frame_equal(read, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can you please rename read
-> result
(that's the naming convention we use for all tests comparing result
to expected
)
The dict is being changed below. Preventing this will require a bigger change. pandas/pandas/io/parsers/python_parser.py Lines 428 to 434 in 3c10f1f
|
pandas/io/parsers/python_parser.py
Outdated
@@ -81,7 +81,10 @@ def __init__(self, f: FilePathOrBuffer | list, **kwds): | |||
self.verbose = kwds["verbose"] | |||
self.converters = kwds["converters"] | |||
|
|||
self.dtype = kwds["dtype"] | |||
if isinstance(kwds["dtype"], dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please a small comment why this is necessary (even just pointing back to the issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mzeitlin11 added comment pointing back to issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @debnathshoham!
def test_dtype_dict_unchanged_with_duplicate_columns(self, read_ext): | ||
# GH 42462 | ||
|
||
filename = "test_common_headers" + read_ext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you not use existing data? or simply do a round trip; i don't want to add even more files like this
pandas/io/parsers/python_parser.py
Outdated
@@ -81,7 +81,11 @@ def __init__(self, f: FilePathOrBuffer | list, **kwds): | |||
self.verbose = kwds["verbose"] | |||
self.converters = kwds["converters"] | |||
|
|||
self.dtype = kwds["dtype"] | |||
# GH 42462 : dtype dict parameter changes with duplicate columns in input data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just copy.copy
is the answer here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or better to actually see where it is being modified and fix it there
This was introduced in #41411. You can try changing the modified Code there to avoid this and you can use the Tests I have added there |
8786256
to
c10b931
Compare
pandas/io/parsers/python_parser.py
Outdated
cur_count = counts[col] | ||
|
||
if cur_count > 0: | ||
while cur_count > 0: | ||
counts[col] = cur_count + 1 | ||
col = f"{col}.{cur_count}" | ||
cur_count = counts[col] | ||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Same is done in the c parser, you should remove that too probably?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made the changes in c parser as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test failure Looks related
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that particular failure is resolved now. I am not getting the current failures in local.
please review |
HI @phofl @jreback @mzeitlin11 - do you mind taking a look, if this is okay now? |
I‘ll have a look shortly |
pandas/_libs/parsers.pyx
Outdated
@@ -989,6 +981,10 @@ cdef class TextReader: | |||
col_dtype = self.dtype[name] | |||
elif i in self.dtype: | |||
col_dtype = self.dtype[i] | |||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm, can we not leave the original and simply copy self.dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I had done that originally (send a copy of dtype, so that the original is unaffected). But then you suggested to try and not modify the dict at all(#42508 (comment)) 😅 . Also @phofl had mentioned that this was a unwanted side effect of #41411. So, I tried to incorporate those comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please let me know if you want me to revert all the changes, and just pass a copy of dtype dictionary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer keeping the mangle_dup_cols logic together, e.g. simply copying the dict would be the best here. If we spread this over the code base it will be harder to refactor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, can you add a whatsnew note, bug fixes I/O section for 1.4. ping on green.
Lets do this for 1.3.2, was a regression |
Hello @debnathshoham! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-08-04 14:23:43 UTC |
13b4d87
to
3df5cf3
Compare
thanks @debnathshoham |
@meeseeksdev backport 1.3.x |
…ct when accessing file with duplicate column
Something went wrong ... Please have a look at my logs. |
…cessing file with duplicate column (#42893) Co-authored-by: Shoham Debnath <[email protected]>