ENH: read_excel: try converting numeric to int #5394
Conversation
@@ -80,7 +80,10 @@ def read_excel(io, sheetname, **kwds):
engine: string, default None
    If io is not a buffer or path, this must be set to identify io.
    Acceptable values are None or xlrd
coerce_number : boolean, default True
We use `convert_numeric` elsewhere (and `coerce_floats` too...), probably not the best descriptions... OK with this... maybe we should change `convert_objects`.
I'm open to anything. `try_coerce_to_int`? "Coerce" isn't great because it suggests forcing it; is downcast/upcast more appropriate?
`convert_numeric` is the standard (now); either that or what you have is fine (and we can change the latter at some point).
This is different, though, and needs to be different from what this means elsewhere. The difference is: `convert_numeric` converts str, float, and int to number, whereas `coerce_number` takes numeric Excel cells and tries to convert float to int (so it's only applicable to Excel, and possibly to JSON too). So `pd.DataFrame({"A": ["1", "2", "3"]}).to_excel()` would read back in as `["1", "2", "3"]` if `convert_numeric` is False, but `pd.DataFrame({"A": [1, 2, 3]}).to_excel()` would read back in as `[1.0, 2.0, 3.0]` if `coerce_number` is False (and be unaffected by `convert_numeric`). This also affects indexes.
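A minimal sketch of the float-to-int coercion being described (not the PR's actual implementation; the helper name is hypothetical): Excel numeric cells always arrive as floats, so only lossless conversions are applied, and strings are never touched.

```python
def maybe_coerce_number(value, coerce_number=True):
    # Hypothetical helper: convert an integral float (the form in which xlrd
    # hands back Excel numeric cells) to int, and leave everything else alone.
    if coerce_number and isinstance(value, float) and value.is_integer():
        return int(value)
    return value

assert maybe_coerce_number(3.0) == 3      # integral float -> int
assert maybe_coerce_number(3.5) == 3.5    # non-integral float unchanged
assert maybe_coerce_number("3") == "3"    # strings untouched, unlike convert_numeric
```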
How about `convert_numbers` then... I think `convert` is better than `coerce`.
Maybe `convert_float` would be clearest?
sure
I think you mean
Yes, that was from the original issue.
You can do this at the very end by (should be faster):
Three issues: 1 - a column like `['ID', '(All)', 100, 312, 15, 22]` will be read in as `Series(['(All)', 100.0, 312.0, 15.0, 22.0], name='ID')`, and then you end up with all floats in the column when you remove the `'(All)'` while cleaning. This wouldn't be a problem if it weren't that `Index([1.0, 2.0, 3.0])` is very different from `Index([1, 2, 3])`.
DataFrame({"A": [1.0, 2.0], "B": ["1.0", "2.0"]}, index=[1.0, 2.0])._data.downcast(dtype='infer')

../pandas/core/internals.pyc in apply(self, f, *args, **kwargs)
   2145                 applied = f(blk, *args, **kwargs)
   2146             else:
-> 2147                 applied = getattr(blk, f)(*args, **kwargs)
   2148
   2149             if isinstance(applied, list):

TypeError: downcast() got an unexpected keyword argument 'dtype'
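Going back to the '(All)' column case above, a tiny illustration (hypothetical data, plain public pandas API) of why the leftover floats are annoying once the placeholder row is dropped:

```python
import pandas as pd

# The header-like "(All)" entry forces the column to object dtype with floats
# inside; dropping it still leaves floats rather than ints.
s = pd.Series(["(All)", 100.0, 312.0, 15.0, 22.0], name="ID")
cleaned = s[s != "(All)"]
print(cleaned.tolist())            # [100.0, 312.0, 15.0, 22.0] -- still floats
print(pd.Index([1.0, 2.0, 3.0]))   # a float index, not an integer one
```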
Sorry, I meant this just calls it; it's pretty conservative about what it tries to do, so you shouldn't see false positives.
Okay, well that suggests the performance difference should be pretty big, given that downcast will be vectorized vs. applied to every element, but I'll need to test it out. The key would be being able to downcast the Index as well (even if it's just "Hey! Are you a Float64Index? Could you be an Int64Index? Pretty please?!? Great!").
Just pushing a fix so that this now works (apparently you can't round an object array), so you can do this (we should probably build a downcast method into Index, but for now):
@jtratner ok...good to go
It turns out the performance hit is really minimal (about 10% at most), and in exchange you get integers in object dtypes as integers as well. Here's using downcast on a 4-column DataFrame and converting the Float64Index to an Int64Index:

Whereas the non-vectorized int conversion is:

I just think the benefits of converting to int far outweigh the downsides, given the small perf hit. [Ran this multiple times and it came out similar.] Downcasting via:

if convert_float:
    if isinstance(result.index, pd.Float64Index):
        casted = com._possibly_downcast_to_dtype(
            result.index.view(np.ndarray).astype(float), 'infer')
        if not issubclass(casted.dtype.type, (np.floating, np.complexfloating)):
            result.index = pd.Index(casted)
    # TODO: Downcast index too!
    result._data = result._data.downcast(dtypes='infer')
The reason that's so awkward is that `possibly_downcast_to_dtype` refuses to downcast object dtype, which means you have to view as ndarray and then try to get to integer. Does it make sense to have the Index type inference try to go to Int64Index if the inferred_type is integer? That would resolve this and other issues (though in 0.12,
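A rough sketch, using only public numpy/pandas calls (the code above goes through the internal `com._possibly_downcast_to_dtype`), of the lossless index downcast being discussed; the helper name is made up:

```python
import numpy as np
import pandas as pd

def maybe_int_index(index):
    # Hypothetical helper: switch a float index to an integer index only when
    # the round trip is lossless.
    values = np.asarray(index, dtype=float)
    as_int = values.astype(np.int64)
    if (as_int == values).all():
        return pd.Index(as_int)
    return index

print(maybe_int_index(pd.Index([1.0, 2.0, 3.0])))  # becomes an integer index
print(maybe_int_index(pd.Index([1.5, 2.0])))       # unchanged float index
```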
All Excel numeric data is stored as floats, so we have to convert it specifically. Changing the default because it's surprising to save something with what looks like a row/column of integers and get a column of floats instead (especially because it can lead to annoying Float64Indexes).
You have something funky in your perf tests. And you don't need to view and all that; just use
And it works just fine with
I'm not really surprised by it, because you still have to iterate over every value regardless to do the other type conversions...
But I'll tweak again and see what I get.
Well... that's the thing... why are you doing that? You should just convert them all at once; and you have a big advantage over read_csv: you already know what they are supposed to be, so you don't need to try them in sequence. (This may of course need to be pushed to 0.14 - the changing of the conversion method.) Just a suggestion, but this PR is good for 0.13.
@jreback because Excel stores everything as float - there's no integer type in Excel. So if you have a column like `("Str2Col", ["a", 3, "c", "d", "e"])`, it'll come back as `("Str2Col", ["a", 3.0, "c", "d", "e"])`. The part where this gets surprising is if you have an Index column like:

Switched to the better way and still don't get a perf boost:

100 entries
100 loops, best of 3: 12 ms per loop
1000 entries
10 loops, best of 3: 86 ms per loop
10000 entries
1 loops, best of 3: 811 ms per loop
100000 entries
1 loops, best of 3: 8.85 s per loop

It's a little faster before this PR on very large input [i.e. no downcast/index conversion and no int conversion in the for loop]:

100 entries
100 loops, best of 3: 10.8 ms per loop
1000 entries
10 loops, best of 3: 84.5 ms per loop
10000 entries
1 loops, best of 3: 809 ms per loop
100000 entries
1 loops, best of 3: 8.55 s per loop

But it's not even a 10% difference.
And just to be clear, you can't tell what the dtype is for those cells, because there are only 8 cell types:
We're already converting date, bool, and error, so we're already incurring the loop cost...
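A rough sketch of the kind of per-cell loop being discussed, using xlrd's public cell-type constants (this is not pandas' actual parser, and it skips date/error handling for brevity):

```python
import xlrd  # note: current xlrd releases only read legacy .xls files

def read_sheet(path, convert_float=True):
    sheet = xlrd.open_workbook(path).sheet_by_index(0)
    rows = []
    for r in range(sheet.nrows):
        row = []
        for c in range(sheet.ncols):
            ctype = sheet.cell_type(r, c)
            value = sheet.cell_value(r, c)
            if ctype == xlrd.XL_CELL_NUMBER and convert_float and value == int(value):
                value = int(value)      # Excel stores every number as a float
            elif ctype == xlrd.XL_CELL_BOOLEAN:
                value = bool(value)     # booleans arrive as 0/1
            row.append(value)
        rows.append(row)
    return rows
```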
I realized that all the xlrd types are just integers, so I spent an hour trying to vectorize this with Cython. I came up with a vectorized format, but it didn't give much of a perf boost because (as you might expect) xlrd is really the limiting factor. Might be worthwhile to bring up again - you might have a better idea of how to work it, especially because I'm betting I've missed some of the Cython optimizations you get when you know the dtype and ndim of your array. (It's my vectorize-excel branch.)
ENH: read_excel: try converting numeric to int
I'm merging this because I think it falls under POLS (principle of least surprise), but Excel I/O can definitely still be improved speed-wise.
@rhshadrach, just a comment: this might be the same as your updated
@denis-bz - sorry, I don't understand. What is the significance of whether it is the same or not? Not sure if this is helpful, but openpyxl already does the conversion to int, so pandas does not need to handle it.
@rhshadrach, my mistake, sorry, delete it.
Thanks, I think I understand now. I believe the code behavior matches the documentation. If you find or think that might not be the case, please do open an issue.
Adds a convert_float=True kwarg to read_excel and parse.
All Excel numeric data is stored as floats, so we have to convert it specifically.
Changing the default because it's surprising to save something with what looks like
a row/column of integers and get a column of floats instead (especially
because it can lead to annoying Float64Indexes).
Resolves recent ML issue too...
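A quick usage sketch of the new kwarg (written against a later pandas 1.x-style call signature; convert_float was eventually deprecated and removed in pandas 2.0, and the file name here is just illustrative):

```python
import pandas as pd

pd.DataFrame({"A": [1, 2, 3]}).to_excel("ints.xlsx")  # illustrative file name

# Default (convert_float=True): integral floats stored by Excel come back as ints.
print(pd.read_excel("ints.xlsx", index_col=0)["A"].dtype)
# convert_float=False: pandas keeps whatever floats the engine hands back
# (though, as noted above, openpyxl may itself already return ints).
print(pd.read_excel("ints.xlsx", index_col=0, convert_float=False)["A"].dtype)
```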