ENH: read_excel: try converting numeric to int #5394


Merged: 2 commits into pandas-dev:master on Nov 1, 2013

Conversation

jtratner (Contributor)

Adds a convert_float=True kwarg to read_excel and parse.

All Excel numeric data is stored as floats, so it has to be converted back explicitly. The default is True because it's surprising to save what looks like a row/column of integers and get a column of floats back instead (especially because it can lead to annoying Float64Indexes).

Resolves a recent mailing-list issue too.
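
A minimal usage sketch under the 0.13-era API (the file name ints.xlsx is illustrative; sheet names are passed positionally here):

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
df.to_excel("ints.xlsx")  # Excel stores all of these numbers as floats

# with the new default, integral floats come back as ints:
pd.read_excel("ints.xlsx", "Sheet1")["A"]                       # 1, 2, 3

# opting out keeps the raw Excel representation:
pd.read_excel("ints.xlsx", "Sheet1", convert_float=False)["A"]  # 1.0, 2.0, 3.0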

@@ -80,7 +80,10 @@ def read_excel(io, sheetname, **kwds):
engine: string, default None
If io is not a buffer or path, this must be set to identify io.
Acceptable values are None or xlrd
coerce_number : boolean, default True
Contributor:

We use convert_numeric elsewhere (and coerce_floats too...); probably not the best descriptions. OK with this... maybe we should change convert_objects.

Contributor Author:

I'm open to anything. try_coerce_to_int? Coerce isn't great because it suggests forcing it; is downcast/upcast appropriate?

Contributor:

convert_numeric is the standard (now); either that or what you have is fine (and we can change the latter at some point).

Contributor Author:

This is different, though, and needs to mean something different from what it means elsewhere.

The difference: convert_numeric converts str, float, and int to numbers, whereas coerce_number takes numeric Excel cells and tries to convert float to int (so it's only applicable to Excel, and possibly to JSON too).

So pd.DataFrame({"A": ["1", "2", "3"]}).to_excel() would read back in as ["1", "2", "3"] if convert_numeric were False, but pd.DataFrame({"A": [1, 2, 3]}).to_excel() would read back in as [1.0, 2.0, 3.0] if coerce_number were False (and be unaffected by convert_numeric). It also affects indexes.
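
As a short illustration of that distinction, a sketch using the 0.13-era convert_objects API (the Series contents are illustrative):

import pandas as pd

# convert_numeric (via Series.convert_objects) coerces *strings* to numbers:
pd.Series(["1", "2", "3"]).convert_objects(convert_numeric=True)

# convert_float (read_excel only) never touches strings; it only turns the
# floats that Excel stores (e.g. 1.0) back into ints when that is lossless.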

Contributor:

How about convert_numbers, then... I think convert is better than coerce.

Contributor Author:

Maybe convert_float would be clearest?

Contributor:

sure

@cancan101 (Contributor)

I think you mean convert_float, not coerce_number.

@jtratner (Contributor Author)

Yes, that was from the original issue.

@jreback (Contributor) commented Oct 31, 2013

You can do this at the very end instead (should be faster):

df._data = df._data.downcast(dtype='infer')

@jtratner (Contributor Author)

Three issues:

1. A column like ['ID', '(All)', 100, 312, 15, 22] will be read in as Series(['(All)', 100.0, 312.0, 15.0, 22.0], name='ID'), and then you end up with all floats in the column when you remove the '(All)' while cleaning.
2. downcast doesn't try to convert the index (there should be a kwarg for that, maybe defaulting to False). This wouldn't be a problem if it weren't that Index([1.0, 2.0, 3.0]) is very different from Index([1, 2, 3]).
3. downcast appears not to always accept the kwarg dtype='infer', as this example generates an error:

DataFrame({"A": [1.0, 2.0], "B": ["1.0", "2.0"]}, index=[1.0, 2.0])._data.downcast(dtype='infer')

../pandas/core/internals.pyc in apply(self, f, *args, **kwargs)
   2145                 applied = f(blk, *args, **kwargs)
   2146             else:
-> 2147                 applied = getattr(blk, f)(*args, **kwargs)
   2148
   2149             if isinstance(applied, list):

TypeError: downcast() got an unexpected keyword argument 'dtype'

@jreback (Contributor) commented Oct 31, 2013

Sorry, I meant dtypes='infer'.

This just calls com._possibly_downcast_to_dtype(arr, dtype) (dtype can be 'infer' here), so you could add that in Index, I suppose.

It is pretty conservative about what it tries to do; you shouldn't see false positives.

In [3]: df = DataFrame({"A": [1.0, 2.0], "B": ["1.0", "2.0"]}, index=[1.0, 2.0])

In [4]: df
Out[4]: 
   A    B
1  1  1.0
2  2  2.0

In [5]: df.dtypes
Out[5]: 
A    float64
B     object
dtype: object

In [6]: df_new = DataFrame(df._data.downcast(dtypes='infer'))

In [7]: df_new
Out[7]: 
   A    B
1  1  1.0
2  2  2.0

In [8]: df_new.dtypes
Out[8]: 
A     int64
B    object
dtype: object

@jtratner (Contributor Author)

Okay, that suggests the performance difference should be pretty big, given that downcast is vectorized vs. applied to every element, but I'll need to test it out. The key would be being able to downcast Index as well (even if it's just "Hey! Are you a Float64Index? Could you be an Int64Index? Pretty please?!? Great!").

@jreback (Contributor) commented Oct 31, 2013

Just pushed a fix so that this now works (apparently you can't round an object array):

In [1]: arr = np.array([1.0,2.0],dtype=object)

In [3]: pandas.core.common._possibly_downcast_to_dtype(arr,'infer')
Out[3]: array([1, 2])

So you can do this (we should probably build a downcast method into Index), but for now:

In [4]: Index([1.,2.])
Out[4]: Float64Index([1.0, 2.0], dtype='object')

In [6]: Index(pandas.core.common._possibly_downcast_to_dtype(Index([1.,2.]).values,'infer'))
Out[6]: Int64Index([1, 2], dtype='int64')

@jreback (Contributor) commented Oct 31, 2013

@jtratner ok...good to go

@jtratner (Contributor Author) commented Nov 1, 2013

It turns out the performance hit is really minimal (about 10% at most), and in exchange integers inside object-dtype columns come back as integers as well:

Here's using downcast on a 4-column DataFrame and converting Float64Index to Int64Index:

100 entries
100 loops, best of 3: 11.8 ms per loop
1000 entries
10 loops, best of 3: 82 ms per loop
10000 entries
1 loops, best of 3: 791 ms per loop
100000 entries
1 loops, best of 3: 8.4 s per loop

Whereas the non-vectorized int conversion is:

100 entries
100 loops, best of 3: 11.2 ms per loop
1000 entries
10 loops, best of 3: 93.3 ms per loop
10000 entries
1 loops, best of 3: 839 ms per loop
100000 entries
1 loops, best of 3: 8.44 s per loop

I just think the benefits of converting to int far outweigh the downsides, given the small perf hit. (I ran this multiple times and it came out similar.)

Downcasting is done via:

        if convert_float:
            # try to convert a Float64Index back to an integer Index
            if isinstance(result.index, pd.Float64Index):
                casted = com._possibly_downcast_to_dtype(
                    result.index.view(np.ndarray).astype(float), 'infer')
                if not issubclass(casted.dtype.type,
                                  (np.floating, np.complexfloating)):
                    result.index = pd.Index(casted)
            # then downcast the blocks themselves (float64 -> int64 where lossless)
            result._data = result._data.downcast(dtypes='infer')

@jtratner (Contributor Author) commented Nov 1, 2013

The reason that's so awkward is that possibly_downcast_to_dtype refuses to downcast object dtype, which means you have to view as ndarray and then try to get to integer. Does it make sense to have the Index type inference try to go to Int64Index if the inferred_type is integer? That would resolve this and other issues (though in 0.12, Index([1.0, 2.0]) resulted in Index([1.0, 2.0]) rather than Int64Index or anything). I can set that up.
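
A hedged sketch of what that inference could look like; maybe_int_index is a hypothetical helper for illustration, not the eventual pandas implementation:

import numpy as np
import pandas as pd

def maybe_int_index(values):
    # build an integer Index when every float value is integral
    arr = np.asarray(values, dtype=float)
    as_int = arr.astype(np.int64)
    if np.all(arr == as_int):
        return pd.Index(as_int)   # an Int64Index in 0.13-era pandas
    return pd.Index(arr)

maybe_int_index([1.0, 2.0, 3.0])  # integer index
maybe_int_index([1.5, 2.0])       # stays float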

@jreback (Contributor) commented Nov 1, 2013

You have something funky in your perf tests. And you don't need to view and all that; just use .values.

In [4]: i = pd.Float64Index(np.arange(100000))

In [5]: from pandas.core import common as com

In [6]: com._possibly_downcast_to_dtype(i.values,'infer')
Out[6]: array([    0,     1,     2, ..., 99997, 99998, 99999])

In [7]: com._possibly_downcast_to_dtype(i.values,'infer').dtype
Out[7]: dtype('int64')

In [8]: %timeit Index(com._possibly_downcast_to_dtype(i.values,'infer'))
100 loops, best of 3: 14.1 ms per loop

And it works just fine with object dtype.

In [13]: i = pd.Index(np.arange(100000),dtype=object)

In [14]: i.dtype
Out[14]: dtype('O')

In [15]: type(i)
Out[15]: pandas.core.index.Index

In [16]: Index(com._possibly_downcast_to_dtype(i.values,'infer')).dtype
Out[16]: dtype('int64')

In [17]: type(Index(com._possibly_downcast_to_dtype(i.values,'infer')))
Out[17]: pandas.core.index.Int64Index

In [18]: %timeit Index(com._possibly_downcast_to_dtype(i.values,'infer'))
100 loops, best of 3: 9.05 ms per loop

@jtratner (Contributor Author) commented Nov 1, 2013

I'm not really surprised by it, because you still have to iterate over every value regardless to do the other type conversions...

@jtratner (Contributor Author) commented Nov 1, 2013

But I'll tweak again and see what I get.

@jreback (Contributor) commented Nov 1, 2013

Well... that's the thing... why are you doing that? You should just convert them all at once. And you have a big advantage over read_csv: you already know what they are supposed to be, so you don't need to try them in sequence.

(This may of course need to be pushed to 0.14, since it changes the conversion method; just a suggestion.)

But this PR is good for 0.13.

@jtratner (Contributor Author) commented Nov 1, 2013

@jreback because Excel stores everything as float - there's no integer type in Excel. So if you have a column like this:

            ("Str2Col", ["a", 3, "c", "d", "e"]),

it'll come back as:

            ("Str2Col", ["a", 3.0, "c", "d", "e"]),

The part where this gets surprising is if you have an index column like Series(["cruft", 1, 2, 3], name="Index"): you strip off the first row after reading in, then set the index, and you end up with a Float64Index, because Excel doesn't have an integer type.
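
A hypothetical illustration of that flow (the column contents mimic an Excel read-back, with the junk first row stripped during cleaning):

import pandas as pd

col = pd.Series(["cruft", 1.0, 2.0, 3.0], name="Index")  # ints arrived as floats
df = pd.DataFrame({"Index": col, "val": ["hdr", "a", "b", "c"]})
df = df.iloc[1:].set_index("Index")
df.index  # holds 1.0, 2.0, 3.0 - floats, not the ints that were saved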

I switched to the better way and still don't get a perf boost:

100 entries
100 loops, best of 3: 12 ms per loop
1000 entries
10 loops, best of 3: 86 ms per loop
10000 entries
1 loops, best of 3: 811 ms per loop
100000 entries
1 loops, best of 3: 8.85 s per loop

It was a little faster before this PR on very large input (i.e., no downcast/index conversion and no int conversion in the for loop):

100 entries
100 loops, best of 3: 10.8 ms per loop
1000 entries
10 loops, best of 3: 84.5 ms per loop
10000 entries
1 loops, best of 3: 809 ms per loop
100000 entries
1 loops, best of 3: 8.55 s per loop

But it's not even a 10% difference.

@jtratner (Contributor Author) commented Nov 1, 2013

And just to be clear, you can't tell what the dtype is for those cells, because xlrd exposes only seven cell types (a conversion sketch follows the list):

  • XL_CELL_EMPTY [u'']
  • XL_CELL_TEXT [unicode string]
  • XL_CELL_NUMBER [float]
  • XL_CELL_DATE [float]
  • XL_CELL_BOOLEAN [1 or 0 int]
  • XL_CELL_ERROR [int error code]
  • XL_CELL_BLANK [u'']
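
A minimal per-cell conversion sketch built on those type codes; convert_cell and its signature are assumptions for illustration, not the actual pandas parser:

import xlrd

def convert_cell(value, typ, datemode, convert_float=True):
    if typ == xlrd.XL_CELL_DATE:
        return xlrd.xldate_as_tuple(value, datemode)  # (y, mo, d, h, mi, s)
    if typ == xlrd.XL_CELL_BOOLEAN:
        return bool(value)
    if typ == xlrd.XL_CELL_ERROR:
        return float("nan")  # error codes become missing values
    if typ == xlrd.XL_CELL_NUMBER:
        v = float(value)
        # Excel stores every number as a float; recover ints when lossless
        if convert_float and v == int(v):
            return int(v)
        return v
    return value  # text, empty, and blank cells pass through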

@jtratner (Contributor Author) commented Nov 1, 2013

We're already converting date, bool, and error, so we're already incurring the loop cost...

@jtratner (Contributor Author) commented Nov 1, 2013

I realized that all the xlrd types are just integers, so I spent an hour trying to vectorize this with Cython. I came up with a vectorized version, but it didn't give much of a perf boost because (as you might expect) xlrd is really the limiting factor. Might be worthwhile to bring up again; you might have a better idea of how to work it, especially because I'm betting I've missed some of the Cython optimizations available when you know the dtype and ndim of your array. (It's my vectorize-excel branch.)

jtratner added a commit that referenced this pull request on Nov 1, 2013: "ENH: read_excel: try converting numeric to int"
@jtratner merged commit fd2d7c2 into pandas-dev:master on Nov 1, 2013
@jtratner deleted the fix-int-handling-excel branch on November 1, 2013 02:52
@jtratner (Contributor Author) commented Nov 1, 2013

I'm merging this because I think it falls under POLS (the principle of least surprise), but Excel I/O can definitely still be improved speed-wise.

@denis-bz commented Jun 6, 2021

@rhshadrach, just a comment: this might be the same as your updated _openpyxl.py _convert_cell; it seems to me consistent with the 1.2.4 doc:

convert_float bool, default True
Convert integral floats to int (i.e., 1.0 –> 1).
If False, all numeric data will be read in as floats

    def _convert_cell(self, cell, convert_float: bool) -> Scalar:
        ...
        elif cell.data_type == TYPE_NUMERIC:  # -> float or int
            f = float(cell.value)
            if convert_float and f == int(f):
                return int(f)
            else:
                return f

        return cell.value

@rhshadrach (Member)

@denis-bz - sorry, I don't understand. What is the significance of whether it is the same or not? Not sure if this is helpful, but openpyxl is already doing conversions to int, so pandas does not need to handle it.

@denis-bz commented Jun 7, 2021

@rhshadrach, my mistake, sorry; delete it.
(By "same" I meant "does the code match the doc?"; it wasn't clear to me.) Thanks.

@rhshadrach (Member)

Thanks, I think I understand now. I believe the code behavior matches the documentation. If you find or think that might not be the case, please do open an issue.

@burnpanck mentioned this pull request on Aug 17, 2023