Skip to content

BUG: Was trying to read an ods file and ran into UnboundLocalError in odfreader.py #35802

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
taksuyu opened this issue Aug 19, 2020 · 8 comments · Fixed by #36175
Closed
2 of 3 tasks

BUG: Was trying to read an ods file and ran into UnboundLocalError in odfreader.py #35802

taksuyu opened this issue Aug 19, 2020 · 8 comments · Fixed by #36175
Labels
Bug IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue
Milestone

Comments

@taksuyu
Copy link

taksuyu commented Aug 19, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

for file in os.listdir('data'): pandas.read_excel(pathlib.Path('data', file), engine='odf')

Sorry I don't have a minimal data example at this time.

Problem description

Was trying to test pandas reading a collection of ods files and ran into this error.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/michael/.local/share/virtualenvs/merge-csv-NFbvYFrS/lib/python3.8/site-packages/pandas/util/_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "/home/michael/.local/share/virtualenvs/merge-csv-NFbvYFrS/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 311, in read_excel
    return io.parse(
  File "/home/michael/.local/share/virtualenvs/merge-csv-NFbvYFrS/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 906, in parse
    return self._reader.parse(
  File "/home/michael/.local/share/virtualenvs/merge-csv-NFbvYFrS/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 443, in parse
    data = self.get_sheet_data(sheet, convert_float)
  File "/home/michael/.local/share/virtualenvs/merge-csv-NFbvYFrS/lib/python3.8/site-packages/pandas/io/excel/_odfreader.py", line 91, in get_sheet_data
    value = self._get_cell_value(sheet_cell, convert_float)
  File "/home/michael/.local/share/virtualenvs/merge-csv-NFbvYFrS/lib/python3.8/site-packages/pandas/io/excel/_odfreader.py", line 175, in _get_cell_value
    return self._get_cell_string_value(cell)
  File "/home/michael/.local/share/virtualenvs/merge-csv-NFbvYFrS/lib/python3.8/site-packages/pandas/io/excel/_odfreader.py", line 211, in _get_cell_string_value
    value.append(" " * spaces)
UnboundLocalError: local variable 'spaces' referenced before assignment

I took a look at the code in question and it seems like the line may be on the wrong indent level?

Expected Output

The usual dataframes 👍

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-42-generic
Version : #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@taksuyu taksuyu added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2020
@WillAyd
Copy link
Member

WillAyd commented Aug 19, 2020

Your proposed fix seems reasonable - want to push a PR with a test case?

@WillAyd WillAyd added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2020
@taksuyu
Copy link
Author

taksuyu commented Aug 20, 2020

I believe I can do that for you. Assuming I can sleuth out what exactly the file is crashing on would a test case similar to this test work?

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Aug 21, 2020

I believe I can do that for you. Assuming I can sleuth out what exactly the file is crashing on would a test case similar to this test work?

hard to say what the best test would be without a reproducible failing code sample. It maybe that a simple roundtrip test could suffice.

@simonjayhawkins simonjayhawkins added the Needs Info Clarification about behavior needed to assess issue label Aug 21, 2020
@openwebcc
Copy link

find a minimal .ods file here that has one column with a header and one data-cell with content Test (1) that causes the fail. Put the .ods in the same directory where you run the following python3 snippet:

python3

import pandas as pd
print(pd.show_versions())
df = pd.read_excel("bug_odfreader_l211_spaces.ods", sheet_name="Test123")

I think, the fail is related to the type of the data-cell where fragment.qname is reported as ('urn:oasis:names:tc:opendocument:xmlns:text:1.0', 'span'). As it is not handled by the if-clause, spaces is not set and value.append(" " * spaces) throws the error. Fixing the indent would solve the problem, but leave the span-clause unimplemented.

I could not reproduce creating a cell of type span in LibreOffice.

Sorry for linking to the .ods file but I was not able to upload it here as .ods is not a supported file type

@taksuyu
Copy link
Author

taksuyu commented Aug 23, 2020

I was working with some bad csv data and using libreoffice to deal with it so that could be how a span, or other element cause I think it might be a line-break, got in there. I got around the issue by returning the data to a csv file and continuing as normal. I'll be taking a look into implementing this test in the coming days now that my work is done.

Judging by the spec there is quite a few elements that aren't being checked for. I'm not all that sure what sort of guarantees pandas makes for reading data so could I get some feedback on whether we should handle all these cases or just set something of a default for spaces so that it doesn't throw?

@thisnamenotavailable
Copy link

Hi. New to coding, but not new to problem solving.

Also have this problem as of this week when I updated my system from Kubuntu 18.04 LTS to 20.04 LTS. With that, the python version changed from 3.6 to 3.8. The pandas version pre- and post- update was version 1.1.x

So, I will tell you my experience and what I have done to trouble-shoot in the hopes that this can help a fix, since the pandas read_excel / odf engine gives me exactly the same errors (line numbers and messages) as originally posted here as well as a different error if the read_excel code is within a try/except block.

Adding to my problem is that I do the same thing in 2 different python3 scripts with the same read_excel code block and I get different results, so there is little consistency in error reproduction other than Pre-update, everything worked fine. Post-update, everything fell apart and nothing works now.

Project is to graph financials and daily trading data for a number of companies. financial data for each company is hand-inserted into spreadsheets saved as .ods files using libreoffice 6. One program reads and graphs the various financial information I want to compare, and another graphs the daily trading close price and volume. Financial information is also used to create a graph of the various financials on each trading day to see how they change over time as the trading price varies.

This is the code being used to read in the various sheets which are then combined into one large pandas dataframe which is used for all the graphs. I have only included the essential lines,

import numpy as np
import pandas as pd
import pandas_path
from pathlib import Path

files_to_read = pd.DataFrame(str(s) for s in Path().glob('**/')) # read in directories which contain the security spreadsheets and graphs

financial_history_filename = security_symbol + ' Financial History.ods'    # reported financial data from company
#(Note: security_symbol is taken from the directory name which is a combination of the name and symbol)


start_file = 7     #    change to zero to run full list. Otherwise, change to any number less than number of files to read for testing
end_file = len(files_to_read)   #   run all files past the start number

for file_number in range (start_file,end_file):  # file numbers are the index of the dataframe

    financial_history_file = Path(security_files_directory[file_number],financial_history_filename[file_number])
    
##  check to see if the required files exist.
    
    if financial_history_file.is_file():    # Load the financial history file for the security if it exists
        
        # load introduction sheet
        security_introduction = pd.read_excel(financial_history_file, 'Introduction', index_col=0, engine='odf')
        
        ##  Load Distribution sheets according to new (2-sheet) or old (1-sheet) format       
        try:    # new format. Separate sheets for declared distributions and the different currencies they are paid in.

            distribution_declared_sheet = pd.read_excel(financial_history_file, 'Distributions Declared', header=1, engine='odf')
            distribution_paid_sheet = pd.read_excel(financial_history_file, 'Distribution Payment Options', header=1, engine='odf')

        except: # old format where only one distribution sheet exists (called Distributions Paid)

            distribution_paid_sheet = pd.read_excel(financial_history_file, 'Distributions Paid', header=1, engine='odf')
            distribution_declared_sheet = distribution_paid_sheet
        
        try:
            financial_summary_sheet = pd.read_excel(financial_history_file, 'Financial Summary', header=1, engine='odf')   # new format

        except:
            financial_summary_sheet = pd.read_excel(financial_history_file, 'Base Financials', header=1, engine='odf') # old format
        
        
    else:                                   # If not, then skip this security (it is essential that Financial history file exists)
        print(security_name[file_number], 'Financial History File does not exist. Can not chart this security until file is in security directory.')
        continue

Of the 100+ securities being read and graphed, some are read in correctly, most are not.

Error # 1:

Traceback (most recent call last):
File "Combined Trading History Graphs.py", line 337, in
security_introduction = pd.read_excel(financial_history_file, 'Introduction', index_col=0, engine='odf')
File "/home/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 311, in read_excel
return io.parse(
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 906, in parse
return self._reader.parse(
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 443, in parse
data = self.get_sheet_data(sheet, convert_float)
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_odfreader.py", line 91, in get_sheet_data
value = self._get_cell_value(sheet_cell, convert_float)
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_odfreader.py", line 175, in _get_cell_value
return self._get_cell_string_value(cell)
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_odfreader.py", line 211, in _get_cell_string_value
value.append(" " * spaces)
UnboundLocalError: local variable 'spaces' referenced before assignment

I see 2 trends in the spreadsheets that cause problems.

  1. In the 'Introduction' sheet
    If the cell contains bold text where the cell is not bold. eg: This text is plain, this text is Bold, the error occurs.
    Remove the bold text or make the entire cell bold, and the UnboundLocalError does not exist.

  2. In the 'Distribution' Sheet:
    If the cell contains superscript, the error occurs.
    LibreOffice defaults to automatically changing the letters in '1st','2nd', etc to a small superscript.
    Remove the superscript and the UnboundLocalError does not exist.

It seems that mixed formatting within the cells causes this error.

However, there are some spreadsheets without data (just templates containing column headers and formulas) that cause this error, but there are also some 'empty' spreadsheets that are read in properly.

So, no real consistency that I can see.

The other problem I encounter is if the spreadsheet being read in is within the Try / Except block where the file may have a spreadsheet with one of 2 names. In this case I get the following errors:

Error # 2:

Traceback (most recent call last):
File "Combined Trading History Graphs.py", line 351, in
financial_summary_sheet = pd.read_excel(financial_history_file, 'Financial Summary', header=1, engine='odf') # new format
File "/home/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 311, in read_excel
return io.parse(
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 906, in parse
return self._reader.parse(
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 439, in parse
sheet = self.get_sheet_by_name(asheetname)
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_odfreader.py", line 66, in get_sheet_by_name
raise ValueError(f"sheet {name} not found")
ValueError: sheet Financial Summary not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "Combined Trading History Graphs.py", line 353, in
financial_summary_sheet = pd.read_excel(financial_history_file, 'Base Financials', header=1, engine='odf') # old format
File "/home/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 311, in read_excel
return io.parse(
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 906, in parse
return self._reader.parse(
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 443, in parse
data = self.get_sheet_data(sheet, convert_float)
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_odfreader.py", line 91, in get_sheet_data
value = self._get_cell_value(sheet_cell, convert_float)
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_odfreader.py", line 175, in _get_cell_value
return self._get_cell_string_value(cell)
File "/home/.local/lib/python3.8/site-packages/pandas/io/excel/_odfreader.py", line 211, in _get_cell_string_value
value.append(" " * spaces)
UnboundLocalError: local variable 'spaces' referenced before assignment

I do not see any similar trend in this spreadsheet. No mixed fonts, etc.
Note that I also get this 2 part error for the Distribution Sheets where the try/except is used

the same code worked fine in Python 3.6 with the same / latest version of pandas (1.1.2). Or maybe it isn't since I get different pandas versions when I try to list/upgrade the module.

I am using these versions of the modules being imported, according to 'pip3 freeze':

numpy==1.17.4
pandas==1.1.1
pandas-path==0.1.2

If I try to install and upgrade pandas using pip, I get the following results:
$ sudo pip3 install pandas --upgrade
Requirement already up-to-date: pandas in /usr/local/lib/python3.8/dist-packages (1.1.2)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /usr/lib/python3/dist-packages (from pandas) (2019.3)
Requirement already satisfied, skipping upgrade: numpy>=1.15.4 in /usr/lib/python3/dist-packages (from pandas) (1.17.4)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in /usr/lib/python3/dist-packages (from pandas) (2.7.3)

As with the original post, I can't attach an offending spreadsheet for inspection since GitHub does not support the .ods file type.

@asishm
Copy link
Contributor

asishm commented Oct 7, 2020

@thisnamenotavailable this is fixed in pandas v1.1.3
Can you check after upgrading?

if it still happens - I believe you can upload a zip that has the .ods file in it.

@thisnamenotavailable
Copy link

thisnamenotavailable commented Oct 7, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants