ENH: read_excel: try converting numeric to int #5394


Merged: 2 commits into pandas-dev:master on Nov 1, 2013

Conversation

jtratner (Contributor)

Adds a convert_float=True kwarg to read_excel and parse.

All Excel numeric data is stored as floats, so it has to be converted back explicitly. The default is True because it's surprising to save what looks like a row/column of integers and get a column of floats back instead (especially because it can lead to annoying Float64Indexes).

Resolves a recent mailing-list issue too.
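
A minimal usage sketch under the 0.13-era API (the file name ints.xlsx is illustrative; sheet names are passed positionally here):

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
df.to_excel("ints.xlsx")  # Excel stores all of these numbers as floats

# with the new default, integral floats come back as ints:
pd.read_excel("ints.xlsx", "Sheet1")["A"]                       # 1, 2, 3

# opting out keeps the raw Excel representation:
pd.read_excel("ints.xlsx", "Sheet1", convert_float=False)["A"]  # 1.0, 2.0, 3.0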

@@ -80,7 +80,10 @@ def read_excel(io, sheetname, **kwds):
engine: string, default None
If io is not a buffer or path, this must be set to identify io.
Acceptable values are None or xlrd
coerce_number : boolean, default True
Contributor:

We use convert_numeric elsewhere (and coerce_floats too...); probably not the best descriptions. OK with this... maybe we should change convert_objects.

Contributor Author:

I'm open to anything. try_coerce_to_int? Coerce isn't great because it suggests forcing it; is downcast/upcast appropriate?

Contributor:

convert_numeric is the standard (now); either that or what you have is fine (and we can change the latter at some point).

Contributor Author:

This is different, though, and needs to mean something different from what it means elsewhere.

The difference: convert_numeric converts str, float, and int to numbers, whereas coerce_number takes numeric Excel cells and tries to convert float to int (so it's only applicable to Excel, and possibly to JSON too).

So pd.DataFrame({"A": ["1", "2", "3"]}).to_excel() would read back in as ["1", "2", "3"] if convert_numeric were False, but pd.DataFrame({"A": [1, 2, 3]}).to_excel() would read back in as [1.0, 2.0, 3.0] if coerce_number were False (and be unaffected by convert_numeric). It also affects indexes.
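
As a short illustration of that distinction, a sketch using the 0.13-era convert_objects API (the Series contents are illustrative):

import pandas as pd

# convert_numeric (via Series.convert_objects) coerces *strings* to numbers:
pd.Series(["1", "2", "3"]).convert_objects(convert_numeric=True)

# convert_float (read_excel only) never touches strings; it only turns the
# floats that Excel stores (e.g. 1.0) back into ints when that is lossless.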

Contributor:

How about convert_numbers, then... I think convert is better than coerce.

Contributor Author:

Maybe convert_float would be clearest?

Contributor:

sure

@cancan101 (Contributor)

I think you mean convert_float, not coerce_number.

@jtratner (Contributor Author)

Yes, that was from the original issue.

@jreback (Contributor) commented Oct 31, 2013

You can do this at the very end instead (should be faster):

df._data = df._data.downcast(dtype='infer')

@jtratner (Contributor Author)

Three issues:

1. A column like ['ID', '(All)', 100, 312, 15, 22] will be read in as Series(['(All)', 100.0, 312.0, 15.0, 22.0], name='ID'), and then you end up with all floats in the column when you remove the '(All)' while cleaning.
2. downcast doesn't try to convert the index (there should be a kwarg for that, maybe defaulting to False). This wouldn't be a problem if it weren't that Index([1.0, 2.0, 3.0]) is very different from Index([1, 2, 3]).
3. downcast appears not to always accept the kwarg dtype='infer', as this example generates an error:

DataFrame({"A": [1.0, 2.0], "B": ["1.0", "2.0"]}, index=[1.0, 2.0])._data.downcast(dtype='infer')

../pandas/core/internals.pyc in apply(self, f, *args, **kwargs)
   2145                 applied = f(blk, *args, **kwargs)
   2146             else:
-> 2147                 applied = getattr(blk, f)(*args, **kwargs)
   2148
   2149             if isinstance(applied, list):

TypeError: downcast() got an unexpected keyword argument 'dtype'

@jreback (Contributor) commented Oct 31, 2013

Sorry, I meant dtypes='infer'.

This just calls com._possibly_downcast_to_dtype(arr, dtype) (dtype can be 'infer' here), so you could add that in Index, I suppose.

It is pretty conservative about what it tries to do; you shouldn't see false positives.

In [3]: df = DataFrame({"A": [1.0, 2.0], "B": ["1.0", "2.0"]}, index=[1.0, 2.0])

In [4]: df
Out[4]: 
   A    B
1  1  1.0
2  2  2.0

In [5]: df.dtypes
Out[5]: 
A    float64
B     object
dtype: object

In [6]: df_new = DataFrame(df._data.downcast(dtypes='infer'))

In [7]: df_new
Out[7]: 
   A    B
1  1  1.0
2  2  2.0

In [8]: df_new.dtypes
Out[8]: 
A     int64
B    object
dtype: object

@jtratner (Contributor Author)

Okay, that suggests the performance difference should be pretty big, given that downcast is vectorized vs. applied to every element, but I'll need to test it out. The key would be being able to downcast Index as well (even if it's just "Hey! Are you a Float64Index? Could you be an Int64Index? Pretty please?!? Great!").

@jreback (Contributor) commented Oct 31, 2013

Just pushed a fix so that this now works (apparently you can't round an object array):

In [1]: arr = np.array([1.0,2.0],dtype=object)

In [3]: pandas.core.common._possibly_downcast_to_dtype(arr,'infer')
Out[3]: array([1, 2])

So you can do this (we should probably build a downcast method into Index), but for now:

In [4]: Index([1.,2.])
Out[4]: Float64Index([1.0, 2.0], dtype='object')

In [6]: Index(pandas.core.common._possibly_downcast_to_dtype(Index([1.,2.]).values,'infer'))
Out[6]: Int64Index([1, 2], dtype='int64')

@jreback (Contributor) commented Oct 31, 2013

@jtratner ok...good to go

@jtratner (Contributor Author) commented Nov 1, 2013

It turns out the performance hit is really minimal (about 10% at most), and in exchange integers inside object-dtype columns come back as integers as well:

Here's using downcast on a 4-column DataFrame and converting Float64Index to Int64Index:

100 entries
100 loops, best of 3: 11.8 ms per loop
1000 entries
10 loops, best of 3: 82 ms per loop
10000 entries
1 loops, best of 3: 791 ms per loop
100000 entries
1 loops, best of 3: 8.4 s per loop

Whereas the non-vectorized int conversion is:

100 entries
100 loops, best of 3: 11.2 ms per loop
1000 entries
10 loops, best of 3: 93.3 ms per loop
10000 entries
1 loops, best of 3: 839 ms per loop
100000 entries
1 loops, best of 3: 8.44 s per loop

I just think the benefits of converting to int far outweigh the downsides, given the small perf hit. (I ran this multiple times and it came out similar.)

Downcasting is done via:

        if convert_float:
            # try to convert a Float64Index back to an integer Index
            if isinstance(result.index, pd.Float64Index):
                casted = com._possibly_downcast_to_dtype(
                    result.index.view(np.ndarray).astype(float), 'infer')
                if not issubclass(casted.dtype.type,
                                  (np.floating, np.complexfloating)):
                    result.index = pd.Index(casted)
            # then downcast the blocks themselves (float64 -> int64 where lossless)
            result._data = result._data.downcast(dtypes='infer')

@jtratner (Contributor Author) commented Nov 1, 2013

The reason that's so awkward is that possibly_downcast_to_dtype refuses to downcast object dtype, which means you have to view as ndarray and then try to get to integer. Does it make sense to have the Index type inference try to go to Int64Index if the inferred_type is integer? That would resolve this and other issues (though in 0.12, Index([1.0, 2.0]) resulted in Index([1.0, 2.0]) rather than Int64Index or anything). I can set that up.
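
A hedged sketch of what that inference could look like; maybe_int_index is a hypothetical helper for illustration, not the eventual pandas implementation:

import numpy as np
import pandas as pd

def maybe_int_index(values):
    # build an integer Index when every float value is integral
    arr = np.asarray(values, dtype=float)
    as_int = arr.astype(np.int64)
    if np.all(arr == as_int):
        return pd.Index(as_int)   # an Int64Index in 0.13-era pandas
    return pd.Index(arr)

maybe_int_index([1.0, 2.0, 3.0])  # integer index
maybe_int_index([1.5, 2.0])       # stays float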

@jreback (Contributor) commented Nov 1, 2013

You have something funky in your perf tests. And you don't need to view and all that; just use .values.

In [4]: i = pd.Float64Index(np.arange(100000))

In [5]: from pandas.core import common as com

In [6]: com._possibly_downcast_to_dtype(i.values,'infer')
Out[6]: array([    0,     1,     2, ..., 99997, 99998, 99999])

In [7]: com._possibly_downcast_to_dtype(i.values,'infer').dtype
Out[7]: dtype('int64')

In [8]: %timeit Index(com._possibly_downcast_to_dtype(i.values,'infer'))
100 loops, best of 3: 14.1 ms per loop

And it works just fine with object dtype.

In [13]: i = pd.Index(np.arange(100000),dtype=object)

In [14]: i.dtype
Out[14]: dtype('O')

In [15]: type(i)
Out[15]: pandas.core.index.Index

In [16]: Index(com._possibly_downcast_to_dtype(i.values,'infer')).dtype
Out[16]: dtype('int64')

In [17]: type(Index(com._possibly_downcast_to_dtype(i.values,'infer')))
Out[17]: pandas.core.index.Int64Index

In [18]: %timeit Index(com._possibly_downcast_to_dtype(i.values,'infer'))
100 loops, best of 3: 9.05 ms per loop

@jtratner (Contributor Author) commented Nov 1, 2013

I'm not really surprised by it, because you still have to iterate over every value regardless to do the other type conversions...

@jtratner (Contributor Author) commented Nov 1, 2013

But I'll tweak again and see what I get.

@jreback (Contributor) commented Nov 1, 2013

Well... that's the thing... why are you doing that? You should just convert them all at once. And you have a big advantage over read_csv: you already know what they are supposed to be, so you don't need to try them in sequence.

(This may of course need to be pushed to 0.14, since it changes the conversion method; just a suggestion.)

But this PR is good for 0.13.

@jtratner (Contributor Author) commented Nov 1, 2013

@jreback because Excel stores everything as float - there's no integer type in Excel. So if you have a column like this:

            ("Str2Col", ["a", 3, "c", "d", "e"]),

it'll come back as:

            ("Str2Col", ["a", 3.0, "c", "d", "e"]),

The part where this gets surprising is if you have an index column like Series(["cruft", 1, 2, 3], name="Index"): you strip off the first row after reading in, then set the index, and you end up with a Float64Index, because Excel doesn't have an integer type.
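
A hypothetical illustration of that flow (the column contents mimic an Excel read-back, with the junk first row stripped during cleaning):

import pandas as pd

col = pd.Series(["cruft", 1.0, 2.0, 3.0], name="Index")  # ints arrived as floats
df = pd.DataFrame({"Index": col, "val": ["hdr", "a", "b", "c"]})
df = df.iloc[1:].set_index("Index")
df.index  # holds 1.0, 2.0, 3.0 - floats, not the ints that were saved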

I switched to the better way and still don't get a perf boost:

100 entries
100 loops, best of 3: 12 ms per loop
1000 entries
10 loops, best of 3: 86 ms per loop
10000 entries
1 loops, best of 3: 811 ms per loop
100000 entries
1 loops, best of 3: 8.85 s per loop

It was a little faster before this PR on very large input (i.e., no downcast/index conversion and no int conversion in the for loop):

100 entries
100 loops, best of 3: 10.8 ms per loop
1000 entries
10 loops, best of 3: 84.5 ms per loop
10000 entries
1 loops, best of 3: 809 ms per loop
100000 entries
1 loops, best of 3: 8.55 s per loop

But it's not even a 10% difference.

@jtratner (Contributor Author) commented Nov 1, 2013

And just to be clear, you can't tell what the dtype is for those cells, because xlrd exposes only seven cell types (a conversion sketch follows the list):

  • XL_CELL_EMPTY [u'']
  • XL_CELL_TEXT [unicode string]
  • XL_CELL_NUMBER [float]
  • XL_CELL_DATE [float]
  • XL_CELL_BOOLEAN [1 or 0 int]
  • XL_CELL_ERROR [int error code]
  • XL_CELL_BLANK [u'']
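
A minimal per-cell conversion sketch built on those type codes; convert_cell and its signature are assumptions for illustration, not the actual pandas parser:

import xlrd

def convert_cell(value, typ, datemode, convert_float=True):
    if typ == xlrd.XL_CELL_DATE:
        return xlrd.xldate_as_tuple(value, datemode)  # (y, mo, d, h, mi, s)
    if typ == xlrd.XL_CELL_BOOLEAN:
        return bool(value)
    if typ == xlrd.XL_CELL_ERROR:
        return float("nan")  # error codes become missing values
    if typ == xlrd.XL_CELL_NUMBER:
        v = float(value)
        # Excel stores every number as a float; recover ints when lossless
        if convert_float and v == int(v):
            return int(v)
        return v
    return value  # text, empty, and blank cells pass through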

@jtratner (Contributor Author) commented Nov 1, 2013

We're already converting date, bool, and error, so we're already incurring the loop cost...

@jtratner (Contributor Author) commented Nov 1, 2013

I realized that all the xlrd types are just integers, so I spent an hour trying to vectorize this with Cython. I came up with a vectorized version, but it didn't give much of a perf boost because (as you might expect) xlrd is really the limiting factor. Might be worthwhile to bring up again; you might have a better idea of how to work it, especially because I'm betting I've missed some of the Cython optimizations available when you know the dtype and ndim of your array. (It's my vectorize-excel branch.)

jtratner added a commit that referenced this pull request on Nov 1, 2013: "ENH: read_excel: try converting numeric to int"
@jtratner merged commit fd2d7c2 into pandas-dev:master on Nov 1, 2013
@jtratner deleted the fix-int-handling-excel branch on November 1, 2013 02:52
@jtratner (Contributor Author) commented Nov 1, 2013

I'm merging this because I think it falls under POLS (the principle of least surprise), but Excel I/O can definitely still be improved speed-wise.

@denis-bz commented Jun 6, 2021

@rhshadrach, just a comment: this might be the same as your updated _openpyxl.py _convert_cell; it seems to me consistent with the 1.2.4 doc:

convert_float bool, default True
Convert integral floats to int (i.e., 1.0 –> 1).
If False, all numeric data will be read in as floats

    def _convert_cell(self, cell, convert_float: bool) -> Scalar:
        ...
        elif cell.data_type == TYPE_NUMERIC:  # -> float or int
            f = float(cell.value)
            if convert_float and f == int(f):
                return int(f)
            else:
                return f

        return cell.value

@rhshadrach (Member)

@denis-bz - sorry, I don't understand. What is the significance of whether it is the same or not? Not sure if this is helpful, but openpyxl is already doing conversions to int, so pandas does not need to handle it.

@denis-bz commented Jun 7, 2021

@rhshadrach, my mistake, sorry; delete it.
(By "same" I meant "does the code match the doc?"; it wasn't clear to me.) Thanks.

@rhshadrach (Member)

Thanks, I think I understand now. I believe the code behavior matches the documentation. If you find or think that might not be the case, please do open an issue.

@burnpanck mentioned this pull request on Aug 17, 2023