Bug in read_csv and read_excel not applying dtype to second col with dup cols #41411

Merged
merged 3 commits into from
May 12, 2021

Conversation

phofl
Member

@phofl phofl commented May 10, 2021

@phofl phofl added IO CSV read_csv, to_csv IO Excel read_excel, to_excel labels May 10, 2021
@jreback jreback added this to the 1.3 milestone May 12, 2021
Contributor

@jreback jreback left a comment

thanks @phofl

counts[name] = count + 1
name = f'{name}.{count}'
count = counts.get(name, 0)
if count > 0:
Contributor

Would be nice to unify this code between here and the Python parser (as a follow-on).

Member Author

Yes, definitely, but I will have to refactor the PythonParser quite a bit and split it into two classes so that it can inherit from TextReader, or rather from a generic Cython class that both TextReader and something like PythonTextReader can inherit from.

I am planning to do this in the (probably medium-term) future.

Contributor

sounds great!

feel free to open an issue for tracking

Member Author

Thought about using #39345 for this.
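
For readers following the review thread above: the reviewed lines appear to belong to the loop that renames duplicate column names (`B`, `B`, ... becomes `B`, `B.1`, ...), and the PR title indicates the fix makes a dtype given for the original name apply to the renamed column as well. The following is a minimal standalone sketch of that idea, not the actual pandas code. The function name `mangle_dupe_names` is hypothetical, and the sketch assumes `dtype` is either `None` or a dict keyed by column name.

```python
def mangle_dupe_names(names, dtype=None):
    """Hypothetical helper: rename duplicate names and propagate dict dtypes.

    Assumes dtype is None or a dict mapping column name -> dtype.
    Returns the de-duplicated names and an updated copy of the dtype dict.
    """
    counts = {}
    result = []
    dtype = dict(dtype) if dtype is not None else {}

    for name in names:
        old_name = name
        count = counts.get(name, 0)
        # Keep appending ".<count>" until the candidate name is unused.
        while count > 0:
            counts[name] = count + 1
            name = f"{name}.{count}"
            count = counts.get(name, 0)
        # Carry the dtype of the original name over to the renamed column,
        # unless the caller already set one for the renamed column.
        if old_name in dtype and name not in dtype:
            dtype[name] = dtype[old_name]
        counts[name] = count + 1
        result.append(name)

    return result, dtype


names, dtypes = mangle_dupe_names(["A", "B", "B"], {"B": "str"})
print(names)   # ['A', 'B', 'B.1']
print(dtypes)  # {'B': 'str', 'B.1': 'str'}
```

Keeping the renaming and the dtype propagation in one helper is only one possible shape for the unification discussed above; the real parsers may split these responsibilities differently.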

@jreback jreback merged commit 76792f1 into pandas-dev:master May 12, 2021
@phofl phofl deleted the 35211 branch May 12, 2021 13:59
@fiendish
Contributor

FYI this breaks parsing with a non-dict dtype.

@phofl
Member Author

phofl commented Jun 15, 2021

Could you provide an example?

@fiendish
Contributor

Documentation for read_csv says dtype may be "a type name or dict".
pandas.read_csv("test.csv", engine="python", dtype="str") and pandas.read_csv("test.csv", engine="python", dtype=str) now error because this code's new calls to .get assume a dict.
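
The failure mode described here can be shown without pandas: calling `.get` on a type such as `str`, or on the string `"str"`, raises `AttributeError`. The sketch below demonstrates that and one possible guard. The helper name `dtype_for_column` is hypothetical and this is not necessarily how pandas resolved the issue; it assumes a non-dict `dtype` is meant to apply to every column.

```python
def dtype_for_column(dtype, name):
    """Hypothetical helper: resolve the dtype for a single column.

    A dict maps column names to dtypes. Anything else (for example
    str or "str") is treated as one dtype that applies to all columns.
    """
    if isinstance(dtype, dict):
        return dtype.get(name)
    return dtype


# Calling .get directly on a non-dict dtype fails.
for bad in (str, "str"):
    try:
        bad.get("B")
    except AttributeError as exc:
        print(f"{bad!r} has no .get: {exc}")

# The guarded lookup handles both forms.
print(dtype_for_column(str, "B.1"))
print(dtype_for_column({"B": "int64"}, "B"))
```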

@phofl
Member Author

phofl commented Jun 15, 2021

Your test file has duplicate columns too? Will look into this later

@fiendish
Contributor

fiendish commented Jun 15, 2021

> Your test file has duplicate columns too? Will look into this later

Indeed it does, good eye. That's the unfortunate reality when non-programmers make documents.

Here's my test file:
test.csv

But of course

A,B,B
1,1,1

shows it too.

@fiendish
Contributor

I should have filed an issue first. I've done that now: #42022
