BUG: inconsistent and undocumented option "converters" to read_excel #8548


Merged: 2 commits, Nov 15, 2014
24 changes: 24 additions & 0 deletions doc/source/io.rst
@@ -1992,6 +1992,30 @@ indices to be parsed.

read_excel('path_to_file.xls', 'Sheet1', parse_cols=[0, 2, 3])

.. note::

   It is possible to transform the contents of Excel cells via the
   ``converters`` option. It accepts a dictionary of functions: the keys are
   the names or indices of the columns to be transformed, and the values are
   functions that take one input argument, the Excel cell content, and return
   the transformed content. For instance, to convert a column to boolean:

   .. code-block:: python

      read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})

   This option handles missing values and treats exceptions in the converters
   as missing data. Transformations are applied cell by cell rather than to
   the column as a whole, so the array dtype is not guaranteed. For instance,
   a column of integers with missing values cannot be transformed to an array
   with integer dtype, because NaN is strictly a float. You can manually mask
   missing data to recover integer dtype:

   .. code-block:: python

      cfun = lambda x: int(x) if x else -1
      read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})

Contributor:

instead of this, I think just expand the doc-string (just a bit, maybe 1 line) for TextReader and Excel (to make them consistent). This is also true more generally for csv-type reading, so don't want it specifically here.

Author:

I'm sorry, could you please rephrase? I don't understand.

1. What do you want to delete from the docs? I thought you wanted the ``.. note::``, or shall I delete the whole thing again?
2. Which line do you want to add to those docstrings? Could you please write the exact line you want here?
3. TextReader, or TextParser?
4. What is true more generally?

Thanks.

edit: I tried to implement what I understood of your suggestion; please check and let me know.
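A quick numpy illustration of the dtype point made in the note (a standalone sketch, not part of the patch): NaN forces a float array, while a sentinel applied at conversion time preserves an integer dtype:

```python
import numpy as np

# A column with a missing value: NaN is a float, so numpy must
# upcast the whole array and the integer dtype is lost.
with_nan = np.array([1, 2, np.nan, 4])
print(with_nan.dtype)  # float64

# Masking the missing cell with a sentinel (-1) at conversion
# time, as the cfun example does, keeps the values as integers.
cfun = lambda x: int(x) if x else -1
cells = ['1', '2', '', '4']   # raw cell contents, '' meaning missing
as_ints = np.array([cfun(x) for x in cells])
print(as_ints.tolist())  # [1, 2, -1, 4]
```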

To write a DataFrame object to a sheet of an Excel file, you can use the
``to_excel`` instance method. The arguments are largely the same as ``to_csv``
described above, the first argument being the name of the excel file, and the
9 changes: 8 additions & 1 deletion pandas/io/excel.py
@@ -83,6 +83,9 @@ def read_excel(io, sheetname=0, **kwds):
Rows to skip at the beginning (0-indexed)
skip_footer : int, default 0
Rows at the end to skip (0-indexed)
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels
index_col : int, default None
Column to use as the row labels of the DataFrame. Pass None if
there is no such column
@@ -175,7 +178,7 @@ def __init__(self, io, **kwds):
def parse(self, sheetname=0, header=0, skiprows=None, skip_footer=0,
index_col=None, parse_cols=None, parse_dates=False,
date_parser=None, na_values=None, thousands=None, chunksize=None,
convert_float=True, has_index_names=False, **kwds):
convert_float=True, has_index_names=False, converters=None, **kwds):
"""Read an Excel table into DataFrame

Parameters
@@ -188,6 +191,9 @@ def parse(self, sheetname=0, header=0, skiprows=None, skip_footer=0,
Rows to skip at the beginning (0-indexed)
skip_footer : int, default 0
Rows at the end to skip (0-indexed)
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels
index_col : int, default None
Column to use as the row labels of the DataFrame. Pass None if
there is no such column
@@ -235,6 +241,7 @@ def parse(self, sheetname=0, header=0, skiprows=None, skip_footer=0,
thousands=thousands, chunksize=chunksize,
skip_footer=skip_footer,
convert_float=convert_float,
converters=converters,
**kwds)
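As the docstring notes, converter keys can be positional integers or column labels. A small standalone sketch (a hypothetical `resolve_converters` helper, not pandas internals) of how mixed keys can be mapped onto a header row:

```python
def resolve_converters(columns, converters):
    """Map converter keys (labels or positional ints) to column positions."""
    resolved = {}
    for key, func in converters.items():
        if isinstance(key, int):
            resolved[key] = func                  # already positional
        else:
            resolved[columns.index(key)] = func   # label -> position
    return resolved

cols = ['IntCol', 'FloatCol', 'BoolCol', 'StrCol']
conv = {'IntCol': int, 2: str}                    # mixed label and index
print(sorted(resolve_converters(cols, conv)))     # [0, 2]
```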

def _should_parse(self, i, parse_cols):
9 changes: 7 additions & 2 deletions pandas/io/parsers.py
@@ -127,7 +127,7 @@ class ParserWarning(Warning):
Return TextFileReader object for iteration
skipfooter : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine='c')
converters : dict. optional
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels
verbose : boolean, default False
@@ -983,8 +983,13 @@ def _convert_to_ndarrays(self, dct, na_values, na_fvalues, verbose=False,
na_fvalues)
coerce_type = True
if conv_f is not None:
values = lib.map_infer(values, conv_f)
try:
values = lib.map_infer(values, conv_f)
except ValueError:
mask = lib.ismember(values, na_values).view(np.uint8)
values = lib.map_infer_mask(values, conv_f, mask)
coerce_type = False

cvals, na_count = self._convert_types(
values, set(col_na_values) | col_na_fvalues, coerce_type)
result[c] = cvals
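The hunk above falls back from `lib.map_infer` to `lib.map_infer_mask` when a converter raises on NA cells; those are Cython helpers, but the idea can be sketched in plain Python (a hypothetical `apply_converter`, assuming string NA markers):

```python
import numpy as np

def apply_converter(values, conv_f, na_values):
    """Try converting every cell; if the converter raises ValueError
    (e.g. int('') on a missing cell), retry while leaving known NA
    markers untouched, mirroring the map_infer_mask fallback."""
    try:
        return np.array([conv_f(v) for v in values], dtype=object)
    except ValueError:
        return np.array([v if v in na_values else conv_f(v)
                         for v in values], dtype=object)

cells = ['1', '2', '', '4']
out = apply_converter(cells, int, na_values={''})
print(out.tolist())  # [1, 2, '', 4]
```

The untouched NA markers are then picked up by the subsequent `_convert_types` call and turned into missing data, which is why `coerce_type` is set to `False` on this path.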
Binary file added pandas/io/tests/data/test_converters.xls
Binary file not shown.
Binary file added pandas/io/tests/data/test_converters.xlsx
Binary file not shown.
25 changes: 25 additions & 0 deletions pandas/io/tests/test_excel.py
@@ -399,6 +399,31 @@ def test_reader_special_dtypes(self):
convert_float=False)
tm.assert_frame_equal(actual, no_convert_float)

# GH8212 - support for converters and missing values
def test_reader_converters(self):
_skip_if_no_xlrd()

expected = DataFrame.from_items([
("IntCol", [1, 2, -3, -1000, 0]),
("FloatCol", [12.5, np.nan, 18.3, 19.2, 0.000000005]),
("BoolCol", ['Found', 'Found', 'Found', 'Not found', 'Found']),
("StrCol", ['1', np.nan, '3', '4', '5']),
])

converters = {'IntCol': lambda x: int(x) if x != '' else -1000,
'FloatCol': lambda x: 10 * x if x else np.nan,
2: lambda x: 'Found' if x != '' else 'Not found',
3: lambda x: str(x) if x else '',
}

xlsx_path = os.path.join(self.dirpath, 'test_converters.xlsx')
xls_path = os.path.join(self.dirpath, 'test_converters.xls')

# should read in correctly and set types of single cells (not array dtypes)
for path in (xls_path, xlsx_path):
actual = read_excel(path, 'Sheet1', converters=converters)
tm.assert_frame_equal(actual, expected)

def test_reader_seconds(self):
# Test reading times with and without milliseconds. GH5945.
_skip_if_no_xlrd()