ENH: read_excel dtypes and converts #8212

Closed
iosonofabio opened this issue Sep 8, 2014 · 8 comments · Fixed by #14786
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Enhancement IO Excel read_excel, to_excel

Comments

@iosonofabio

Dear all,

Here from the mailing list: https://groups.google.com/forum/#!topic/pydata/jKiPOvYUQ1c

I have an excel table about family ages like this

Family   People   Mean size [cm]
Foo      5        173.0
Bar      3        189.0

and I would like to use read_excel to parse it into Python. I would like "People" to be read as an integer, "Mean size [cm]" as a float. (And "Family" as a string, but that might be a different issue.) Now:

  • if I set convert_float=True, the last column reads as int
  • if I set convert_float=False, the second column reads as float
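
The behaviour above can be sketched roughly like this (maybe_convert_float is an illustrative name, not the actual pandas internal, which works cell by cell in the Excel reader):

```python
import numpy as np

# Illustrative sketch of convert_float: whole-number floats coming out of
# the Excel engine are collapsed to int when the flag is on.
def maybe_convert_float(values, convert_float=True):
    arr = np.asarray(values, dtype=float)
    if convert_float and np.all(arr == np.floor(arr)):
        return arr.astype(int)  # whole-number floats become ints
    return arr

people = maybe_convert_float([5.0, 3.0])      # integer dtype, as wanted
sizes = maybe_convert_float([173.0, 189.0])   # also integer -- the problem
```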

Neither is correct, for a silly reason: all the sizes happen to end in .0! So I would like to specify something like:

convert_float = ['People']

so only that column gets converted. An even better solution would be to be explicit about types of some columns, letting pandas perform the automagic for the others, such as:

read_excel('foo.xlsx', types={'People': np.uint8, 'Family': 'S3'})

but this changes the signature of the function more significantly.

Are you folks in favour of any of this? If so, I can take a look and try to code it up.

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions IO Excel read_excel, to_excel labels Sep 8, 2014
@jreback jreback added this to the 0.15.1 milestone Sep 8, 2014
@jreback
Contributor

jreback commented Sep 8, 2014

you could add a dtypes= parameter to coerce, similar to how read_csv does it. I would leave convert_float alone (it's a bit non-standard; maybe even deprecate it once dtypes is in place)
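
For reference, the read_csv behaviour being mirrored here is runnable today; a sketch using CSV input (so no workbook is needed) with the same table from above:

```python
import io
import pandas as pd

csv = io.StringIO(
    "Family,People,Mean size [cm]\n"
    "Foo,5,173.0\n"
    "Bar,3,189.0\n"
)
# Per-column dtype mapping, as read_csv already supports it.
df = pd.read_csv(csv, dtype={"People": "uint8", "Family": str})
print(df.dtypes)
```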

@jtratner ?

@jtratner
Contributor

Yes, I think handling coercion like that is a good idea. Related: #8272

@iosonofabio
Author

Hi,

I finally had some time to look into this. It is evident from our test suite (pandas/io/tests/test_excel.py, function 'test_reader_special_dtypes') that this is actually already implemented!

The way it works is via the 'converters' keyword argument to pandas.read_excel. This is passed down all the way to the parser:

read_excel --> ExcelFile._parse_excel --> TextParser --> TextFileReader._clean_options/_make_engine

So we just need to document this. If that's ok with you folks, I'll make a minimal commit adding the parameter to the docstring(s) and open a pull request.
Or do you prefer 'dtypes' as a kwarg instead of 'converters'? numpy.loadtxt and numpy.genfromtxt support 'dtype' and 'converters', but not 'dtypes', so I'd be for keeping 'converters'.
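
Since converters is forwarded down to the same TextParser machinery that read_csv uses, the behaviour can be sketched with CSV input (again avoiding the need for an actual workbook):

```python
import io
import pandas as pd

data = io.StringIO(
    "Family,People,Mean size [cm]\n"
    "Foo,5,173.0\n"
    "Bar,3,189.0\n"
)
# converters maps column name -> callable applied to each cell's raw value.
df = pd.read_csv(data, converters={"People": int, "Mean size [cm]": float})
```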

@jreback
Contributor

jreback commented Oct 10, 2014

dtypes is the standard parameter for specifying dtype mappings (the numpy text loaders are much less robust, much slower, and not nearly as widely used as read_csv, so that is the de-facto standard nowadays)

converters is somewhat duplicative, but that's a different argument.
convert_float is used in several places in pandas as well.

I think converters is used because ultimately the PythonParser, which is what actually does the parsing here, supports converters but not dtypes (dtype support there is simply not implemented).

I think for consistency I would like to add a dtypes argument and fix the parser to support it.

In the meantime, if you want to document converters, that is ok (copy the doc-string from read_csv).

@iosonofabio
Author

Actually, I tested converters a bit more, and it has issues with missing values. For instance, if you write converters={'mycol': int} and a cell is empty, the parser defaults to an empty string, so when it tries to call int(mycell) it raises:

ValueError: invalid literal for int() with base 10: ''

This happens in _convert_to_ndarrays in pandas/io/parsers.py, while calling pandas.lib.map_infer. This is, in my opinion, a plain bug. The parser should first check for missing values and then convert only the cells that are not missing, but right now it is the opposite: it first tries to convert everything to int, fails with an exception, and never gets to see that there was nothing to convert!
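
The failure is reproducible through the shared parser path with read_csv, no Excel file needed; the empty cell reaches the converter as '' (behaviour as of the parser described here):

```python
import io
import pandas as pd

data = io.StringIO("name,age\nAlice,\nBob,30\n")
try:
    # The empty cell for Alice is passed to int() as '', which raises.
    pd.read_csv(data, converters={"age": int})
    raised = False
except ValueError:
    raised = True
```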

So my suggestion: I fix this bug by reversing the order of the two operations:

check for missing <--> convert dtype

and then I add a docstring to read_excel for converters. The dtypes change can come in a second step.
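
The proposed order of operations could look roughly like this (convert_with_na is an illustrative name, not the actual pandas internal):

```python
import math

# Illustrative only: mask missing values first, then convert the rest.
def convert_with_na(values, converter, na_values=("",)):
    return [math.nan if v in na_values else converter(v) for v in values]

convert_with_na(["", "30"], int)  # [nan, 30]
```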

@jreback
Contributor

jreback commented Oct 10, 2014

if a converter raises, it's on the user
pandas can't guess what to do
you would have to have a try/except in that converter

pls submit a PR and I'll take a look

@iosonofabio
Author

Hi,
If the user specifies converters --> int, pandas need not guess anything. It just needs to convert to int, keeping empty cells as NaN (or masked). I am fixing it and will submit a pull request ASAP, but just to give you the idea, if your table is:

name    age
Alice
Bob     30

with Alice's age missing, pandas does this:

  • read_excel('mytable.xlsx')['age'] --> float64
  • read_excel('mytable.xlsx', converters={'age': int})['age'] --> ValueError!
  • read_excel('mytable.xlsx', converters={'age': (lambda x: int(x) if x else 0)}) --> correct (except you don't get masked)

You will agree that this behaviour is not very friendly... but let me code it and then you can have a look.
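
The lambda workaround from the list above is runnable through the shared parser path with read_csv; this sketch returns NaN instead of 0 so the missing value at least stays missing:

```python
import io
import math
import pandas as pd

data = io.StringIO("name,age\nAlice,\nBob,30\n")
# The converter handles the empty cell itself, mapping it to NaN.
df = pd.read_csv(
    data,
    converters={"age": lambda x: int(x) if x else math.nan},
)
```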

@bolshoibooze

Thanks for the clear explanation @iosonofabio :-)

4 participants