Description
Feature Type
-
Adding new functionality to pandas
-
Changing existing functionality in pandas
-
Removing existing functionality in pandas
Problem Description
In pandas.read_excel it's not possible to access information about the cell object before converting to dataframe. In some cases as in #47269,#46895, and #49770 it's necessary to acces the cell, especially number_format to change the values
Feature Description
Add a new parameter to read_excel, something like value_of_cell, that receive a cell and return a value. In my case i change _convert_cell in _openpyxl to:
def _convert_cell(self, cell, convert_float: bool, value_of_cell:Union[None, Callable]) -> Scalar:
from openpyxl.cell.cell import (
TYPE_ERROR,
TYPE_NUMERIC,
)
if cell.value is None:
return "" # compat with xlrd
elif cell.data_type == TYPE_ERROR:
return np.nan
elif value_of_cell is not None: # this is new
return value_of_cell(cell) # this is new
elif cell.data_type == TYPE_NUMERIC:
# GH5394, GH46988
if convert_float:
val = int(cell.value)
if val == cell.value:
return val
else:
return float(cell.value)
return cell.value
so in my code i call
def conv(cell):
if cell.number_format.startswith("0"):
return "0" * (len(cell.number_format) - len(str(cell.value))) + str(cell.value)
else:
return cell.value
df = pd.read_excel(file, header=None, na_filter=False, dtype=str, value_of_cell:=conv)
of course is neccesary to edit all the calls from read_excel to _convert_cell ( i think is something like 10 functions), but i will create a pull request if can pass the test. I hope this is useful for the community
Alternative Solutions
Another option is to use openpyxl directly as, but loss the extra functions of read_excel and more code is needed:
document = load_workbook(file)
sheet = document.active
for row in sheet:
for cell in row:
cell.value = value_of_cell(cell)
df = pd.DataFrame(sheet.values)
df = df.fillna('')
Additional Context
No response