Skip to content

API: integrate with openpyxl 2.0-2.1 changes #8342

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jtratner opened this issue Sep 21, 2014 · 31 comments
Closed

API: integrate with openpyxl 2.0-2.1 changes #8342

jtratner opened this issue Sep 21, 2014 · 31 comments
Labels
Compat pandas objects compatability with Numpy or Python functions IO Excel read_excel, to_excel
Milestone

Comments

@jtratner
Copy link
Contributor

Builds are failing on Travis due to some issue with Excel (weirdly passed on merge but then started failing a few commits later). Currently investigating. C.f. https://travis-ci.org/pydata/pandas/jobs/35873580

@neirbowj - any idea why these tests stopped passing?

It's all some variation on the following:

======================================================================
ERROR: test_tsframe (pandas.io.tests.test_excel.OpenpyxlTests_NoMerge)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.4.1/lib/python3.4/site-packages/pandas/io/tests/test_excel.py", line 562, in test_tsframe
    df.to_excel(path, 'test1')
  File "/home/travis/virtualenv/python3.4.1/lib/python3.4/site-packages/pandas/util/decorators.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/travis/virtualenv/python3.4.1/lib/python3.4/site-packages/pandas/core/frame.py", line 1228, in to_excel
    startrow=startrow, startcol=startcol)
  File "/home/travis/virtualenv/python3.4.1/lib/python3.4/site-packages/pandas/io/excel.py", line 687, in write_cells
    'number_format':{'format_code': self.datetime_format}}))
  File "/home/travis/virtualenv/python3.4.1/lib/python3.4/site-packages/pandas/io/excel.py", line 757, in _convert_to_style_kwargs
    new_v = _conv_to_x(v)
  File "/home/travis/virtualenv/python3.4.1/lib/python3.4/site-packages/pandas/io/excel.py", line 1043, in _convert_to_number_format
    from openpyxl.styles import NumberFormat
ImportError: cannot import name 'NumberFormat'
----------------------------------------------------------------------
@jreback
Copy link
Contributor

jreback commented Sep 21, 2014

Ha, we just had this discussion!: #8340

apparently openpyxl released 2.0.5 (today 9/21). and made backward incompatible changes (no suprise their). The 3.4 build tests with the latest so picked it up.

So whatever fix is needed, prob should change say the 3.3 build to use the last good version as well (2.04?), though 2.6 build uses 2.03 (to make sure that WE maintain compat).

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions IO Excel read_excel, to_excel labels Sep 21, 2014
@jreback jreback added this to the 0.15.0 milestone Sep 21, 2014
@jtratner
Copy link
Contributor Author

it looks like a really minor change - maybe for now just switch the import
to openpyxl.styles.number.NumberFormat?

On Sun, Sep 21, 2014 at 12:43 PM, jreback [email protected] wrote:

Ha, we just had this discussion!: #8340
#8340

apparently openpyxl released 2.0.5 (today 9/21). and made backward
incompatible changes (no suprise their). The 3.4 build tests with the
latest so picked it up.

So whatever fix is needed, prob should change say the 3.3 build to use the
last good version as well (2.04?), though 2.6 build uses 2.03 (to make sure
that WE maintain compat).


Reply to this email directly or view it on GitHub
#8342 (comment).

@jreback
Copy link
Contributor

jreback commented Sep 21, 2014

@jtratner prob, just need to do it in a try: except: because have to keep the existing one around. Its really annoying when people change the API's around like that

@jreback
Copy link
Contributor

jreback commented Sep 21, 2014

hmm, might be something else as 2.0.5 seems to have been working for last few days (though its possible that what was called 2.0.5 changed), which is a no-no, but its possible

@jorisvandenbossche
Copy link
Member

It is version 2.1.0 that is released, not 2.0.5 I think: http://openpyxl.readthedocs.org/en/latest/changes.html

@jtratner
Copy link
Contributor Author

I just pushed a potential fix. If it works, let's just stick with the minor
changes rather than contorting ourselves to make things work with openpyxl
v2. good thing you said to mark it as experimental :)

On Sun, Sep 21, 2014 at 1:12 PM, Joris Van den Bossche <
[email protected]> wrote:

It is version 2.1.0 that is released, not 2.0.5 I think.


Reply to this email directly or view it on GitHub
#8342 (comment).

@jreback
Copy link
Contributor

jreback commented Sep 21, 2014

@jorisvandenbossche right

@neirbowj
Copy link
Contributor

Grr. The OpenPyxl devs are making me regret getting roped into this.

It's not clear that using the deprecated NumberFormat class is a viable approach. It's been deprecated pretty emphatically:

>>> from openpyxl.styles import Style
>>> from openpyxl.styles.numbers import NumberFormat
>>> nf = NumberFormat(format_code='0.00')
openpyxl.styles.numbers:1: UserWarning: Call to deprecated function or class NumberFormat (Number formats are strings. Use module functions).
>>> s = Style(number_format=nf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "openpyxl/styles/__init__.py", line 45, in __init__
    self.number_format = number_format
  File "openpyxl/styles/hashable.py", line 54, in __setattr__
    return object.__setattr__(self, *args, **kwargs)
  File "openpyxl/styles/numbers.py", line 194, in __set__
    super(NumberFormatDescriptor, self).__set__(instance, value)
  File "openpyxl/descriptors/__init__.py", line 32, in __set__
    raise TypeError('expected ' + str(self.expected_type))
TypeError: expected <type 'basestring'>

It looks like the main alternatives for us are to A) build in separate compatibility for the two flavors of number formats, or B) declare 2.x.y prior to 2.1.0 unsupported, and treat "number_format is a string" as the one true interface. Either way, it seems like this kind of breakage is going to be a fact of life with OpenPyxl going forward.

Thoughts?

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

can u ping their mailing list / issues
and see what the heck is up with all if these API changes

if they come back with - this is the API and it's stable - then if u can fix up
we can just make 2.1 min for 2.x series

otherwise maybe just revert and wait for stability

@neirbowj
Copy link
Contributor

Yeah, I'll see what I can find out.

@neirbowj
Copy link
Contributor

I've opened the hailing frequencies.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

perfect - of course this problem of crazy API changes happened in v2 so not surprised

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

@neirbowj what do you think is best to do in short-run

  • revert your original PR
  • skip tests for openpyxl2
  • comment something out

?

want to do a quick for one of these?

@neirbowj
Copy link
Contributor

I'll be in transit most of the day without access to the Internet (because I'll be damned if I ever knowingly buy GoGo Inflight Internet again). If this doesn't answer the immediate need, the next least worst option is probably to skip tests for openpyxl2, then next to revert. I'll be back online this evening.

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 22, 2014
@jreback jreback changed the title Failed builds due to Excel issue API: integrate with openpyxl 2.0-2.1 changes Sep 22, 2014
@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

@neirbowj ok I merged in your changes (needed a couple of more for compat). Will leave this issue open for you to come back at some point and validate / fix for the continued openpyxl API changes.

thanks!

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@stancikcom
Copy link

Guys, I believe IO modules are essential for Pandas usability (besides Numpy ;-) ).
I have fallen recently into dependency hell as another module (PETL) requires a newer version of Openpyxl. Please, consider to implement an adapter pattern that will support (reasonable) legacy versions.

I think Pandas is a great package and for a production usage it is worth to invest more into reliability (that also means modules).

@khatriindu
Copy link

Hi, I am trying to create an Excel File(showing the differences of two Excel files). I am getting this error:TypeError: expected <class 'openpyxl.styles.fonts.Font'>.
I have already imported : >>> import openpyxl.styles

from openpyxl.styles import fonts

Any solutions please?

@jreback
Copy link
Contributor

jreback commented Mar 23, 2015

@stancikcom and @Anon3
pandas has been compat with all released versions of openpyxl for as far back as I can remember
there were a couple of hiccups iirc 0.15.0
0.15.2 has been out for quite some time and 0.16.0 just released
if u are having difficulties thrn pls pd.show_versions() in your report

@khatriindu
Copy link

@jreback The version is pandas: 0.15.2. Would you suggest me to upgrade it to 0.16.0 ?

@jreback
Copy link
Contributor

jreback commented Mar 23, 2015

@Anon3 well pls report pd.show_versions() as well as as showing what is causing the error

@stancikcom
Copy link

Hi all,

after upgrade, pandas works fine with new version of openpyxl.
my import pandas; pandas.show_versions() dump:

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.0
nose: 1.3.4
Cython: 0.21.2
numpy: 1.9.2
scipy: 0.15.0
statsmodels: 0.6.1
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.2.1
dateutil: 2.4.1
pytz: 2015.2
bottleneck: 1.0.0
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.4.2
openpyxl: 2.2.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: 0.9.9
pymysql: 0.6.6.None
psycopg2: 2.6 (dt dec pq3 ext lo64)

@khatriindu
Copy link

@jreback this is what I used:

import pandas as pd
pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.2.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.0
nose: None
Cython: None
numpy: 1.9.2
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: 2.2.0
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

@khatriindu
Copy link

@jreback and @stancikcom even with 0.16.0 i get this error: TypeError: expected <class 'openpyxl.styles.fonts.Font'>

@stancikcom
Copy link

@Anon3 try to share a snippet of your code which makes you a trouble. We can try to replicate / debug it. Cheers. (ms)

@khatriindu
Copy link

@stancikcom : here is the code that I found (and I am using for official purpose) to compare two excels:

import pandas as pd
import numpy as np
import xlrd

Define the diff function to show the changes in each field

def report_diff(x):
return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

We want to be able to easily tell which rows have changes

def has_change(row):
if "--->" in row.to_string():
return "Y"
else:
return "N"

Read in both excel files

df1 = pd.read_excel('Excel1.xlsx', 'Sheet2', na_values=['NA'])
df2 = pd.read_excel('Excel2.xlsx', 'Sheet2', na_values=['NA'])

Make sure we order by account number so the comparisons work

df1.sort(columns="Col1")
df1=df1.reindex()
df2.sort(columns="Col1")
df2=df2.reindex()

Create a panel of the two dataframes

diff_panel = pd.Panel(dict(df1=df1,df2=df2))

#Apply the diff function
diff_output = diff_panel.apply(report_diff, axis=0)

Flag all the changes

diff_output['has_change'] = diff_output.apply(has_change, axis=1)

#Save the changes to excel but only include the columns we care about
diff_output[(diff_output.has_change == 'Y')].to_excel('my-diff-1.xlsx',index=False,columns=["Col1",
"Col2","Col3","Col4","Col5"])

@stancikcom
Copy link

@Anon3 I have tried your code on my data sets (of course I do have different data than you) and I can confirme that the pure IO from/to excel worked without any problem in my python environment.

However, I think the issues are in your code logic. Currently it won't work, if the data sets are different size, it won't work if the account order is altered e.g. due to added/deleted account in one of the set. It is a total overkill in this case using a 3d pandas.Panel() ;-)

If I were you I would refactor the logic as follows:
-assuming account is a unique primary key for each row, then even without Pandas you can write your function using built_in hasattr() function. Accounts could be a hash key and its value the account attributes (a tuple where each member is taken from one cell from excel). So then

data_set = { Account : (Col1,Col2,Col3,...Coln), ... } 

Consider to address this to stackoverflow which is more focused to optimize algorythms.

@khatriindu
Copy link

@stancikcom thanks for your valuable comments. I am a newbie .. never used hasattr() function.
Let me try learning and using hasattr()

@stancikcom
Copy link

@Anon3 with #python and #pandas you are in the right track... :-)
bdw. if you want to still solve your problem with pandas, merge df1 and df2 on the account column, and then write short function to compare 'left' and 'right' side of the table.

pd.merge(df1,df2,on='account',how='left')

@khatriindu
Copy link

@stancikcom ,
Yes. I wrote something something and I get : ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This is the additional piece of code:
while(df1 and df2):
if(df1[0:len(df1)]==df2[0:len(df2)]):
continue
else:
# something

PS: I am sure there is something wrong in comparison.I'll try to use either of these:a.empty, a.bool(), a.item(), a.any() or a.all()
I'll keep you updated with my final code

@khatriindu
Copy link

@stancikcom ,

how to compare the left and right column from the merged dataframe?

I was trying to find some reference over the stackoverflow, didn't find for the merge but found a different way to compare two data frames. Here is the piece of code which works fine:
import pandas as pd
import numpy as np
import xlrd

Read in both excel files

df1 = pd.read_excel('Excel1.xlsx', 'Sheet2', na_values=['NA'])
df2 = pd.read_excel('Excel2.xlsx', 'Sheet2', na_values=['NA'])

Make sure we order by account number so the comparisons work

df1.sort(columns="Col1")
df1=df1.reindex()
df2.sort(columns="Col1")
df2=df2.reindex()

ne_stacked = (df1 != df2).stack()

changed = ne_stacked[ne_stacked]

changed.index.names = ['id', 'col']

difference_locations = np.where(df1 != df2)

changed_from = df1.values[difference_locations]

changed_to = df2.values[difference_locations]

print(pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index))

@jreback jreback modified the milestones: 0.17.0, Next Major Release Sep 21, 2015
@jreback
Copy link
Contributor

jreback commented Sep 21, 2015

closed in favor of #10125

@jreback jreback closed this as completed Sep 21, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions IO Excel read_excel, to_excel
Projects
None yet
Development

No branches or pull requests

6 participants