Skip to content

read_excel() modifies provided types dict when accessing file with duplicate column #42508

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Aug 4, 2021

Conversation

debnathshoham
Copy link
Member

@debnathshoham debnathshoham changed the title Read excel bug dtype BUG: Read excel bug dtype Jul 12, 2021
Copy link
Member

@mzeitlin11 mzeitlin11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pr @debnathshoham. Left some comments, but in general this fix looks extremely heavyweight since we are now copying all parameters. I also think the new structure is harder to follow with locals being deleted and everything bundled in kwargs.

IMO a cleaner solution would be to find where the dtypes dict is being modified, then either

  • Change the relevant code so no modification inplace is necessary
  • If the above is not possible, then copy only the dtypes dict in the case where we need to modify it

kwargs = locals().copy()
for each in kwargs:
if isinstance(locals()[each], dict):
kwargs[each] = locals()[each].copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essentially trying to do a deepcopy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mzeitlin11 . Yes it is. I tried to use deepcopy first, but it was failing multiple tests.

I got the idea of creating a copy of kwargs from read_csv (which as you mentioned doesn't produce this unnecessary side effect). So I thought it would be a good idea to keep it consistent.

# locals() should never be modified
kwds = locals().copy()
del kwds["filepath_or_buffer"]
del kwds["sep"]
kwds_defaults = _refine_defaults_read(
dialect,
delimiter,
delim_whitespace,
engine,
sep,
error_bad_lines,
warn_bad_lines,
on_bad_lines,
names,
prefix,
defaults={"delimiter": ","},
)
kwds.update(kwds_defaults)
return _read(filepath_or_buffer, kwds)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got the idea of creating a copy of kwargs from read_csv (which as you mentioned doesn't produce this unnecessary side effect). So I thought it would be a good idea to keep it consistent.

Makes sense. That case is a bit different though because kwargs itself was being modified. Here the issue is limited to the dtypes dict. If there's a clean way to copy (or not modify in place) the dtypes dict, I think that would be cleaner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I will work along those lines. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you run into any issues!

dtype_dict = {"a": str, "b": str, "c": str}
dtype_dict_copy = dtype_dict.copy()
pd.read_excel(filename, dtype=dtype_dict)
assert dtype_dict == dtype_dict_copy, "dtype dict changed"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also check that the resulting frame is as expected? (I know this is focusing on the dtypes dict, but may as well also test the reading portion here since unlikely we have great coverage for dtypes dict with duplicate cols)

@@ -1278,6 +1278,13 @@ def test_ignore_chartsheets_by_int(self, request, read_ext):
):
pd.read_excel("chartsheet" + read_ext, sheet_name=1)

def test_dtype_dict(self, read_ext):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please leave a link to the relevant github issue here? And maybe also make the name of this test more specific, eg in this care we about about dtype argument not being modified in the case of duplicate columns being present

@mzeitlin11 mzeitlin11 added Bug IO Excel read_excel, to_excel labels Jul 12, 2021
@debnathshoham debnathshoham requested a review from mzeitlin11 July 13, 2021 14:13
}
)
assert dtype_dict == dtype_dict_copy, "dtype dict changed"
tm.assert_frame_equal(read, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can you please rename read -> result (that's the naming convention we use for all tests comparing result to expected)

@debnathshoham
Copy link
Member Author

The dict is being changed below. Preventing this will require a bigger change.

if (
self.dtype is not None
and is_dict_like(self.dtype)
and self.dtype.get(old_col) is not None
and self.dtype.get(col) is None
):
self.dtype.update({col: self.dtype.get(old_col)})

@debnathshoham debnathshoham requested a review from mzeitlin11 July 13, 2021 18:23
@@ -81,7 +81,10 @@ def __init__(self, f: FilePathOrBuffer | list, **kwds):
self.verbose = kwds["verbose"]
self.converters = kwds["converters"]

self.dtype = kwds["dtype"]
if isinstance(kwds["dtype"], dict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please a small comment why this is necessary (even just pointing back to the issue)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mzeitlin11 added comment pointing back to issue

Copy link
Member

@mzeitlin11 mzeitlin11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @debnathshoham!

def test_dtype_dict_unchanged_with_duplicate_columns(self, read_ext):
# GH 42462

filename = "test_common_headers" + read_ext
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you not use existing data? or simply do a round trip; i don't want to add even more files like this

@@ -81,7 +81,11 @@ def __init__(self, f: FilePathOrBuffer | list, **kwds):
self.verbose = kwds["verbose"]
self.converters = kwds["converters"]

self.dtype = kwds["dtype"]
# GH 42462 : dtype dict parameter changes with duplicate columns in input data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just copy.copy is the answer here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or better to actually see where it is being modified and fix it there

@phofl
Copy link
Member

phofl commented Jul 15, 2021

This was introduced in #41411. You can try changing the modified Code there to avoid this and you can use the Tests I have added there

@debnathshoham debnathshoham force-pushed the read-excel-bug-dtype branch from 8786256 to c10b931 Compare July 17, 2021 16:02
@debnathshoham debnathshoham requested a review from jreback July 17, 2021 18:56
cur_count = counts[col]

if cur_count > 0:
while cur_count > 0:
counts[col] = cur_count + 1
col = f"{col}.{cur_count}"
cur_count = counts[col]
if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Same is done in the c parser, you should remove that too probably?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the changes in c parser as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test failure Looks related

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that particular failure is resolved now. I am not getting the current failures in local.

@debnathshoham debnathshoham requested a review from phofl July 18, 2021 08:59
@debnathshoham
Copy link
Member Author

please review

@debnathshoham
Copy link
Member Author

HI @phofl @jreback @mzeitlin11 - do you mind taking a look, if this is okay now?

@phofl
Copy link
Member

phofl commented Jul 27, 2021

I‘ll have a look shortly

@@ -989,6 +981,10 @@ cdef class TextReader:
col_dtype = self.dtype[name]
elif i in self.dtype:
col_dtype = self.dtype[i]
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm, can we not leave the original and simply copy self.dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I had done that originally (send a copy of dtype, so that the original is unaffected). But then you suggested to try and not modify the dict at all(#42508 (comment)) 😅 . Also @phofl had mentioned that this was a unwanted side effect of #41411. So, I tried to incorporate those comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please let me know if you want me to revert all the changes, and just pass a copy of dtype dictionary

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer keeping the mangle_dup_cols logic together, e.g. simply copying the dict would be the best here. If we spread this over the code base it will be harder to refactor

@debnathshoham debnathshoham requested review from jreback and phofl July 31, 2021 07:31
@debnathshoham
Copy link
Member Author

Hi @jreback @phofl could you pls take a look, if this is fine?

@jreback jreback added this to the 1.4 milestone Aug 4, 2021
@jreback jreback changed the title BUG: Read excel bug dtype read_excel() modifies provided types dict when accessing file with duplicate column Aug 4, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, can you add a whatsnew note, bug fixes I/O section for 1.4. ping on green.

@phofl
Copy link
Member

phofl commented Aug 4, 2021

Lets do this for 1.3.2, was a regression

@phofl phofl modified the milestones: 1.4, 1.3.2 Aug 4, 2021
@pep8speaks
Copy link

pep8speaks commented Aug 4, 2021

Hello @debnathshoham! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-08-04 14:23:43 UTC

@debnathshoham
Copy link
Member Author

I have added the whatsnew entry in 1.3.2 as @phofl suggested.
@jreback Greenish (unrelated failures).

@jreback jreback merged commit c182565 into pandas-dev:master Aug 4, 2021
@jreback
Copy link
Contributor

jreback commented Aug 4, 2021

thanks @debnathshoham

@jreback
Copy link
Contributor

jreback commented Aug 4, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app
Copy link

lumberbot-app bot commented Aug 4, 2021

Something went wrong ... Please have a look at my logs.

@debnathshoham debnathshoham deleted the read-excel-bug-dtype branch August 5, 2021 04:02
phofl pushed a commit that referenced this pull request Aug 5, 2021
…cessing file with duplicate column (#42893)

Co-authored-by: Shoham Debnath <[email protected]>
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_excel() modifies provided types dict when accessing file with duplicate column
5 participants