Skip to content

ENH: read_excel: Add a callable to access the Cell before getting the value  #49872

Closed
@Polandia94

Description

@Polandia94

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

In pandas.read_excel it's not possible to access information about the cell object before converting to dataframe. In some cases as in #47269,#46895, and #49770 it's necessary to acces the cell, especially number_format to change the values

Feature Description

Add a new parameter to read_excel, something like value_of_cell, that receive a cell and return a value. In my case i change _convert_cell in _openpyxl to:

def _convert_cell(self, cell, convert_float: bool, value_of_cell:Union[None, Callable]) -> Scalar:
        from openpyxl.cell.cell import (
            TYPE_ERROR,
            TYPE_NUMERIC,
        )
        if cell.value is None:
            return ""  # compat with xlrd
        elif cell.data_type == TYPE_ERROR:
            return np.nan
        elif value_of_cell is not None: # this is new
            return value_of_cell(cell) # this is new
        elif cell.data_type == TYPE_NUMERIC:
            # GH5394, GH46988
            if convert_float:
                val = int(cell.value)
                if val == cell.value:
                    return val
            else:
                return float(cell.value)
        return cell.value

so in my code i call

def conv(cell):
            if cell.number_format.startswith("0"):
                return "0" * (len(cell.number_format) - len(str(cell.value))) + str(cell.value)
            else:
                return cell.value
 df = pd.read_excel(file, header=None, na_filter=False, dtype=str, value_of_cell:=conv)

of course is neccesary to edit all the calls from read_excel to _convert_cell ( i think is something like 10 functions), but i will create a pull request if can pass the test. I hope this is useful for the community

Alternative Solutions

Another option is to use openpyxl directly as, but loss the extra functions of read_excel and more code is needed:

        document = load_workbook(file)
        sheet = document.active
        for row in sheet:
            for cell in row:
              cell.value = value_of_cell(cell)
        df = pd.DataFrame(sheet.values)
        df = df.fillna('')

Additional Context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementNeeds TriageIssue that has not been reviewed by a pandas team member

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions