read_excel() modifies provided types dict when accessing file with duplicate column #42508

debnathshoham · 2021-07-12T16:17:21Z

closes read_excel() modifies provided types dict when accessing file with duplicate column #42462
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

mzeitlin11

Thanks for the PR @debnathshoham. Left some comments, but in general this fix looks extremely heavyweight since we are now copying all parameters. I also think the new structure is harder to follow, with locals being deleted and everything bundled in kwargs.

IMO a cleaner solution would be to find where the dtypes dict is being modified, then either:

- Change the relevant code so that no in-place modification is necessary
- If the above is not possible, copy only the dtypes dict in the case where we need to modify it
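The second option could be sketched roughly as follows. This is only an illustration, not the code that was merged: `extend_dtype_for_renamed` and its `renamed` argument are hypothetical names, and the sketch assumes the parser knows which mangled column name maps back to which original name.

```python
def extend_dtype_for_renamed(dtype, renamed):
    """Return a dtype mapping that also covers renamed duplicate columns.

    `renamed` maps a mangled name (e.g. "a.1") to its original name ("a").
    Hypothetical helper: the caller's dict is never mutated, and a copy is
    made only when an entry actually has to be added.
    """
    if not isinstance(dtype, dict):
        return dtype
    result = dtype
    for new_name, old_name in renamed.items():
        if old_name in dtype and new_name not in dtype:
            if result is dtype:
                result = dict(dtype)  # copy lazily, on first modification
            result[new_name] = dtype[old_name]
    return result


user_dtype = {"a": str}
extended = extend_dtype_for_renamed(user_dtype, {"a.1": "a"})
untouched = extend_dtype_for_renamed(user_dtype, {})
```

When nothing needs to be added, the original object is returned as-is, so the common case pays no copying cost.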

mzeitlin11 · 2021-07-12T22:08:13Z

pandas/io/excel/_base.py

+    kwargs = locals().copy()
+    for each in kwargs:
+        if isinstance(locals()[each], dict):
+            kwargs[each] = locals()[each].copy()


This is essentially trying to do a deepcopy?

Hi @mzeitlin11. Yes, it is. I tried to use deepcopy first, but it was failing multiple tests.

I got the idea of creating a copy of kwargs from read_csv (which as you mentioned doesn't produce this unnecessary side effect). So I thought it would be a good idea to keep it consistent.

pandas/pandas/io/parsers/readers.py

Lines 566 to 586 in d60e687

    # locals() should never be modified
    kwds = locals().copy()
    del kwds["filepath_or_buffer"]
    del kwds["sep"]

    kwds_defaults = _refine_defaults_read(
        dialect,
        delimiter,
        delim_whitespace,
        engine,
        sep,
        error_bad_lines,
        warn_bad_lines,
        on_bad_lines,
        names,
        prefix,
        defaults={"delimiter": ","},
    )
    kwds.update(kwds_defaults)

    return _read(filepath_or_buffer, kwds)

Makes sense. That case is a bit different though because kwargs itself was being modified. Here the issue is limited to the dtypes dict. If there's a clean way to copy (or not modify in place) the dtypes dict, I think that would be cleaner.
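The distinction drawn here can be illustrated with a small standalone sketch. The `reader` function below is hypothetical and only mimics the `locals().copy()` pattern; it is not pandas code. A shallow copy protects the top-level mapping, but a nested dict such as `dtype` is still shared with the caller.

```python
def reader(path, dtype=None, sep=","):
    # Shallow copy: the top-level mapping is new, nested objects are shared.
    kwds = locals().copy()
    del kwds["path"]            # safe: only affects the copied mapping
    kwds["dtype"]["a.1"] = str  # not safe: mutates the caller's dict
    return kwds


user_dtype = {"a": str}
kwds = reader("file.csv", dtype=user_dtype)
# user_dtype has gained an "a.1" entry even though kwds was a copy.
```

This is why copying kwargs is enough for `read_csv`-style bookkeeping but does not by itself stop the dtype dict from changing.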

Sure. I will work along those lines. Thanks!

Let me know if you run into any issues!

mzeitlin11 · 2021-07-12T22:09:52Z

pandas/tests/io/excel/test_readers.py

+        dtype_dict = {"a": str, "b": str, "c": str}
+        dtype_dict_copy = dtype_dict.copy()
+        pd.read_excel(filename, dtype=dtype_dict)
+        assert dtype_dict == dtype_dict_copy, "dtype dict changed"


Can we also check that the resulting frame is as expected? (I know this is focusing on the dtypes dict, but we may as well also test the reading portion here, since it's unlikely we have great coverage for a dtypes dict with duplicate cols.)

mzeitlin11 · 2021-07-12T22:10:45Z

pandas/tests/io/excel/test_readers.py

@@ -1278,6 +1278,13 @@ def test_ignore_chartsheets_by_int(self, request, read_ext):
        ):
            pd.read_excel("chartsheet" + read_ext, sheet_name=1)

+    def test_dtype_dict(self, read_ext):


Can you please leave a link to the relevant GitHub issue here? And maybe also make the name of this test more specific, e.g. in this case we care about the dtype argument not being modified when duplicate columns are present.

mzeitlin11 · 2021-07-13T14:17:20Z

pandas/tests/io/excel/test_readers.py

+            }
+        )
+        assert dtype_dict == dtype_dict_copy, "dtype dict changed"
+        tm.assert_frame_equal(read, expected)


Nit: can you please rename read -> result (that's the naming convention we use for all tests comparing result to expected)

debnathshoham · 2021-07-13T18:23:36Z

The dict is being changed below. Preventing this will require a bigger change.

pandas/pandas/io/parsers/python_parser.py

Lines 428 to 434 in 3c10f1f

    
            if (
                self.dtype is not None
                and is_dict_like(self.dtype)
                and self.dtype.get(old_col) is not None
                and self.dtype.get(col) is None
            ):
                self.dtype.update({col: self.dtype.get(old_col)})
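The effect of that in-place update can be reproduced outside pandas with a simplified sketch. `mangle_dupe_cols_sketch` is a hypothetical stand-in that imitates the renaming loop from the parser, with `isinstance(dtype, dict)` used in place of `is_dict_like`; it is not the actual pandas implementation.

```python
from collections import defaultdict


def mangle_dupe_cols_sketch(columns, dtype=None):
    """Hypothetical stand-in for the parser's duplicate-column renaming."""
    counts = defaultdict(int)
    mangled = []
    for col in columns:
        old_col = col
        cur_count = counts[col]
        if cur_count > 0:
            while cur_count > 0:
                counts[col] = cur_count + 1
                col = f"{col}.{cur_count}"
                cur_count = counts[col]
            # Mirrors the quoted condition; isinstance stands in for is_dict_like.
            if (
                isinstance(dtype, dict)
                and dtype.get(old_col) is not None
                and dtype.get(col) is None
            ):
                dtype.update({col: dtype.get(old_col)})  # mutates the argument
        mangled.append(col)
        counts[col] = cur_count + 1
    return mangled


user_dtype = {"a": str}
names = mangle_dupe_cols_sketch(["a", "a", "b"], dtype=user_dtype)
# user_dtype now also has an "a.1" key that the caller never supplied.

safe_dtype = {"a": str}
mangle_dupe_cols_sketch(["a", "a", "b"], dtype=dict(safe_dtype))
# safe_dtype is untouched because only the copy was updated.
```

Passing a copy, as in the second call, leaves the renaming logic in one place while keeping the caller's dict intact.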

mzeitlin11 · 2021-07-13T19:37:17Z

pandas/io/parsers/python_parser.py

@@ -81,7 +81,10 @@ def __init__(self, f: FilePathOrBuffer | list, **kwds):
        self.verbose = kwds["verbose"]
        self.converters = kwds["converters"]

-        self.dtype = kwds["dtype"]
+        if isinstance(kwds["dtype"], dict):


Can you please add a small comment explaining why this is necessary (even just pointing back to the issue)?

@mzeitlin11 added comment pointing back to issue

mzeitlin11

LGTM, thanks @debnathshoham!

jreback · 2021-07-14T23:48:22Z

pandas/tests/io/excel/test_readers.py

+    def test_dtype_dict_unchanged_with_duplicate_columns(self, read_ext):
+        # GH 42462
+
+        filename = "test_common_headers" + read_ext


can you not use existing data? or simply do a round trip; i don't want to add even more files like this

jreback · 2021-07-14T23:48:39Z

pandas/io/parsers/python_parser.py

@@ -81,7 +81,11 @@ def __init__(self, f: FilePathOrBuffer | list, **kwds):
        self.verbose = kwds["verbose"]
        self.converters = kwds["converters"]

-        self.dtype = kwds["dtype"]
+        # GH 42462 : dtype dict parameter changes with duplicate columns in input data


just copy.copy is the answer here

or better to actually see where it is being modified and fix it there
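The `copy.copy` option might look something like the sketch below. `SketchParser` is a hypothetical class used only for illustration; the real parser's constructor takes many more options. `copy.copy` returns a shallow copy for a dict and hands back `None` or a type object such as `str` unchanged, so no `isinstance` check is needed.

```python
import copy


class SketchParser:
    """Hypothetical parser that keeps a private copy of the dtype argument."""

    def __init__(self, **kwds):
        # Work on a shallow copy so later updates never reach the caller's dict.
        self.dtype = copy.copy(kwds["dtype"])

    def register_renamed(self, new_name, old_name):
        # Illustrative only: extend the private mapping for a mangled column.
        if isinstance(self.dtype, dict) and old_name in self.dtype:
            self.dtype.setdefault(new_name, self.dtype[old_name])


user_dtype = {"a": str}
parser = SketchParser(dtype=user_dtype)
parser.register_renamed("a.1", "a")
# parser.dtype has the extra key; user_dtype does not.
```

A shallow copy is sufficient here because only keys are added; the dtype values themselves are never modified.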

phofl · 2021-07-15T01:15:06Z

This was introduced in #41411. You can try changing the code modified there to avoid this, and you can use the tests I added there.

phofl · 2021-07-18T07:49:39Z

pandas/io/parsers/python_parser.py

                        cur_count = counts[col]

                        if cur_count > 0:
                            while cur_count > 0:
                                counts[col] = cur_count + 1
                                col = f"{col}.{cur_count}"
                                cur_count = counts[col]
-                            if (


The same is done in the C parser; you should probably remove that too?

made the changes in c parser as well

Test failure looks related

I think that particular failure is resolved now. I can't reproduce the current failures locally.

debnathshoham · 2021-07-24T16:58:40Z

please review

debnathshoham · 2021-07-27T16:11:47Z

Hi @phofl @jreback @mzeitlin11 - do you mind taking a look to see if this is okay now?

phofl · 2021-07-27T19:05:35Z

I'll have a look shortly

jreback · 2021-07-28T01:05:29Z

pandas/_libs/parsers.pyx

@@ -989,6 +981,10 @@ cdef class TextReader:
                        col_dtype = self.dtype[name]
                    elif i in self.dtype:
                        col_dtype = self.dtype[i]
+                    else:


umm, can we not leave the original and simply copy self.dtype?

Actually, I had done that originally (passing a copy of dtype so that the original is unaffected). But then you suggested trying not to modify the dict at all (#42508 (comment)) 😅. Also, @phofl mentioned that this was an unwanted side effect of #41411, so I tried to incorporate those comments.

Please let me know if you want me to revert all the changes and just pass a copy of the dtype dictionary.

I would prefer keeping the mangle_dup_cols logic together, i.e. simply copying the dict would be best here. If we spread this over the code base, it will be harder to refactor.

debnathshoham · 2021-08-04T04:36:31Z

Hi @jreback @phofl could you pls take a look, if this is fine?

jreback

Thanks, can you add a whatsnew note in the bug fixes I/O section for 1.4? Ping on green.

phofl · 2021-08-04T12:43:56Z

Let's do this for 1.3.2; this was a regression.

pep8speaks · 2021-08-04T13:47:34Z

Hello @debnathshoham! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-08-04 14:23:43 UTC

debnathshoham · 2021-08-04T15:27:42Z

I have added the whatsnew entry in 1.3.2 as @phofl suggested.
@jreback Greenish (unrelated failures).

jreback · 2021-08-04T21:55:19Z

thanks @debnathshoham

jreback · 2021-08-04T21:55:28Z

@meeseeksdev backport 1.3.x

lumberbot-app · 2021-08-04T21:55:47Z

Something went wrong ... Please have a look at my logs.

debnathshoham added 3 commits July 12, 2021 21:37

BUG: read excel changes dtype param

cf40fb3

BUG: read excel changes dtype param

a45535c

BUG: added xlsb

6c98fa3

debnathshoham changed the title ~~Read excel bug dtype~~ BUG: Read excel bug dtype Jul 12, 2021

mzeitlin11 reviewed Jul 12, 2021

mzeitlin11 added Bug IO Excel read_excel, to_excel labels Jul 12, 2021

amended test

e79e9a1

debnathshoham requested a review from mzeitlin11 July 13, 2021 14:13

mzeitlin11 reviewed Jul 13, 2021

modified as suggested

c10b931

debnathshoham requested a review from mzeitlin11 July 13, 2021 18:23

mzeitlin11 reviewed Jul 13, 2021

mzeitlin11 approved these changes Jul 13, 2021

jreback requested changes Jul 14, 2021

debnathshoham force-pushed the read-excel-bug-dtype branch from 8786256 to c10b931 Compare July 17, 2021 16:02

debnathshoham added 2 commits July 17, 2021 21:38

suggested edits

4419146

checked for str

b63aef2

debnathshoham requested a review from jreback July 17, 2021 18:56

phofl reviewed Jul 18, 2021

removed updation of dtype dict from c parser

f0f3022

debnathshoham requested a review from phofl July 18, 2021 08:59

included dtype conversion of mangled cols in c parser

cf27280

Merge branch 'master' into read-excel-bug-dtype

52c9bd5

jreback reviewed Jul 28, 2021

debnathshoham added 4 commits July 30, 2021 21:24

Revert "included dtype conversion of mangled cols in c parser"

cb369bb

This reverts commit cf27280.

Revert "checked for str"

7bcb504

This reverts commit b63aef2.

changed refernce to issue in test

ffb5852

reverted changes

72d50f4

debnathshoham requested review from jreback and phofl July 31, 2021 07:31

jreback added this to the 1.4 milestone Aug 4, 2021

jreback changed the title ~~BUG: Read excel bug dtype~~ read_excel() modifies provided types dict when accessing file with duplicate column Aug 4, 2021

jreback approved these changes Aug 4, 2021

jreback requested changes Aug 4, 2021

phofl modified the milestones: 1.4, 1.3.2 Aug 4, 2021

phofl approved these changes Aug 4, 2021

debnathshoham added 2 commits August 4, 2021 18:50

added whatsnew 1.3.2

57c65e5

Merge branch 'master' into read-excel-bug-dtype

3df5cf3

debnathshoham force-pushed the read-excel-bug-dtype branch from 13b4d87 to 3df5cf3 Compare August 4, 2021 14:23

jreback approved these changes Aug 4, 2021

jreback merged commit c182565 into pandas-dev:master Aug 4, 2021

meeseeksmachine mentioned this pull request Aug 4, 2021

Backport PR #42508 on branch 1.3.x (read_excel() modifies provided types dict when accessing file with duplicate column) #42893

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 4, 2021

Backport PR pandas-dev#42508: read_excel() modifies provided types di…

b907417

…ct when accessing file with duplicate column

debnathshoham deleted the read-excel-bug-dtype branch August 5, 2021 04:02

phofl pushed a commit that referenced this pull request Aug 5, 2021

Backport PR #42508: read_excel() modifies provided types dict when ac…

271f0d0

…cessing file with duplicate column (#42893) Co-authored-by: Shoham Debnath <[email protected]>

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021

read_excel() modifies provided types dict when accessing file with du…

0bece99

…plicate column (pandas-dev#42508)
