Fix read parquet import error message #33361

jfcorbett · 2020-04-07T09:46:52Z

closes BUG: Bad error message on read_parquet() when wrong version of pyarrow is installed #33313
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry N/A

pandas.io.parquet.get_engine() uses ImportError handling for flow control to decide which parquet reader engine is used. In doing so, it swallowed lower-level error messages that would have helped the user diagnose the problem, and replaced them with a misleading error message.

I refactored the error handling so that these lower-level error messages are collected and explicitly "bubbled up", which fixes the incorrect error message.
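
A minimal sketch of the idea described above, not the code merged in this PR: each candidate engine is tried in turn, the message of every ImportError is kept, and the combined text is attached to the final error. The names `pick_engine` and `_load` are hypothetical and only stand in for the pandas internals.

```python
import importlib


def _load(name):
    """Import ``name``; hypothetical stand-in for a loader that also checks versions."""
    return importlib.import_module(name)


def pick_engine(candidates, loader=_load):
    """Return the first importable candidate, or raise with every failure listed."""
    error_msgs = ""
    for name in candidates:
        try:
            return loader(name)
        except ImportError as err:
            # Keep the lower-level message instead of discarding it.
            error_msgs += "\n - " + str(err)
    raise ImportError(
        "Unable to find a usable engine; tried using: "
        + ", ".join(repr(name) for name in candidates)
        + ".\nTrying to import the above resulted in these errors:"
        + error_msgs
    )
```

Using `str(err)` rather than an attribute of the exception keeps the sketch independent of how a type checker models `ImportError`.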

No tests added -- this behaviour is not worthy of testing. Presumably not worthy of a whatsnew entry either.

datapythonista

Thanks @jfcorbett

There is a mypy error, and it would be good to have a test to check the new error message.

Added a few comments about readability, but looks good.

jfcorbett · 2020-04-07T14:00:06Z

@datapythonista Could I get you to point me to an existing test that tests for error messages? It isn't something I've done before, so I'd like to lean on something pre-existing.

jfcorbett · 2020-04-07T14:06:10Z

@datapythonista In particular, I'm not sure what the best way is to mock the absence or wrong version of a dependency, respectively...

datapythonista · 2020-04-07T15:42:57Z

I'm on my phone right now. Can you grep for pytest.raises(ImportError in the tests directory? There is a match parameter to check the error message. Thanks!
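
For readers unfamiliar with the `match` parameter mentioned here, a small sketch of how it is typically used. `load_engine` is a hypothetical stand-in that always fails; it is not a pandas function.

```python
import pytest


def load_engine():
    """Hypothetical stand-in for a call that fails on a bad dependency."""
    raise ImportError("Unable to find a usable engine; tried using: 'pyarrow'")


def test_error_message():
    # ``match`` is a regular expression searched for in str(exception).
    with pytest.raises(ImportError, match="Unable to find a usable engine"):
        load_engine()
```

Because `match` is treated as a regular expression, literal messages containing characters such as parentheses or dots may need `re.escape`.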

jfcorbett · 2020-04-07T20:04:03Z

Ok, found in test_optional_dependency.py.
My problem is, I'm not sure how to monkeypatch the version of pyarrow to something bad, without screwing up later tests. I guess I could make a fixture that sets pyarrow.__version__='bad.version.42', and then set the version back to what it was during teardown, but that seems... not perfectly safe? Or is it ok?
Bit beyond what I've done before, therefore this handholding!
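
One way to address the teardown concern is a patch that undoes itself. The sketch below uses `unittest.mock.patch.object` on a fake module (`fake_pyarrow` is hypothetical and stands in for the real dependency); pytest's `monkeypatch.setattr` fixture gives the same guarantee, reverting the change when the test finishes.

```python
import types
from unittest import mock

# Hypothetical stand-in for an installed dependency such as pyarrow.
fake_pyarrow = types.ModuleType("fake_pyarrow")
fake_pyarrow.__version__ = "1.0.0"


def installed_version(module):
    """Read the version string the way a version check would."""
    return module.__version__


# The attribute is replaced only inside the ``with`` block.
with mock.patch.object(fake_pyarrow, "__version__", "0.0.42"):
    version_inside = installed_version(fake_pyarrow)

# On exit the original value is restored, even if the block raised.
version_after = installed_version(fake_pyarrow)
```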

datapythonista · 2020-04-07T20:13:45Z

Sorry, it wasn't my intention to make things too complicated.

In general, we use pytest markers to run or skip tests depending on whether certain dependencies are installed. I think we're already doing it for pyarrow and fastparquet. For testing the error when the version is too old, I think you can simply use an if pyarrow.__version__ > whatever. Then you'll be testing one error message or the other depending on which version of pyarrow is installed.

In the CI we should have builds with new versions, and with the oldest we support. So, both tests should be executed in one build or another.

Does this sound more reasonable? I guess we can live without this test if it's too complicated, but I think doing this should be quite straightforward.
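
A sketch of the version-dependent check suggested above, under stated assumptions: `version_tuple` and `expected_error` are hypothetical helpers, and versions are assumed to be purely numeric. Versions are compared as integer tuples because comparing the raw strings gives the wrong answer for cases like '0.9.0' versus '0.13.0'. The message format is the one quoted later in this thread.

```python
def version_tuple(version):
    """Parse a dotted numeric version such as '0.13.0'.

    Assumption: no suffixes such as 'rc1' or '.dev0'.
    """
    return tuple(int(part) for part in version.split("."))


def expected_error(package, installed, minimum):
    """Return the message expected for a too-old install, else None."""
    if version_tuple(installed) >= version_tuple(minimum):
        return None
    return (
        f"Pandas requires version '{minimum}' or newer of '{package}' "
        f"(version '{installed}' currently installed)."
    )
```

A test can then assert on the returned message when it is not None, and assert that no error is raised otherwise.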

jreback · 2020-04-07T23:28:59Z

can you show what the error message results in now?

jfcorbett · 2020-04-08T11:36:33Z

@jreback Sure. See the example below, the last three lines in particular.
Also the test I just added documents this. (pandas/tests/io/test_parquet.py::test_get_engine_auto_error_message)

>>> from pandas.io.parquet import get_engine
>>> import pyarrow
>>> import fastparquet
>>> pyarrow.__version__ = '0.0.42'
>>> fastparquet.__version__ = '0.0.1'
>>> get_engine("auto")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\JEACO\repos\pandas\pandas\io\parquet.py", line 32, in get_engine
    "Unable to find a usable engine; "
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
Pandas requires version '0.13.0' or newer of 'pyarrow' (version '0.0.42' currently installed).
Pandas requires version '0.3.2' or newer of 'fastparquet' (version '0.0.1' currently installed).

datapythonista

Thanks. Looks cool. I added a couple of comments that could improve readability, but the changes look good.

datapythonista · 2020-04-08T11:52:21Z

pandas/tests/io/test_parquet.py

+
+    # Do we have engines installed, but a bad version of them?
+    pa_min_ver = VERSIONS.get("pyarrow")
+    fp_min_ver = VERSIONS.get("fastparquet")


Looks like you're repeating the same thing twice, for pyarrow and fastparquet (which makes sense). But it'd probably be worth implementing things just once and parametrizing (with pytest).

You can search for @pytest.mark.parametrize, and you'll find lots of examples of parametrized tests. The idea is that the test will receive a set of variables for each of pyarrow and fastparquet, and pytest will execute it twice with each set of variables.
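
A small illustration of the parametrization described above. `MIN_VERSIONS` and `too_old_message` are hypothetical helpers written for this sketch; the minimum versions and bad versions are the ones quoted in the example error message in this thread.

```python
import pytest

# Minimum versions as quoted in the error message in this thread.
MIN_VERSIONS = {"pyarrow": "0.13.0", "fastparquet": "0.3.2"}


def too_old_message(package, installed):
    """Hypothetical helper building the 'version too old' message."""
    return (
        f"Pandas requires version '{MIN_VERSIONS[package]}' or newer of "
        f"'{package}' (version '{installed}' currently installed)."
    )


@pytest.mark.parametrize(
    "package, bad_version",
    [("pyarrow", "0.0.42"), ("fastparquet", "0.0.1")],
)
def test_too_old_message(package, bad_version):
    # pytest runs this body once per (package, bad_version) pair.
    msg = too_old_message(package, bad_version)
    assert f"'{MIN_VERSIONS[package]}' or newer of '{package}'" in msg
    assert f"version '{bad_version}' currently installed" in msg
```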

jfcorbett · 2020-04-08

Yeah, I thought about parametrizing, but in this case it's tricky: the error message will only ever show up if both pyarrow and fastparquet are inadequately installed (i.e. not installed, or bad version installed). This is embodied in the conditional:

if not have_usable_pa and not have_usable_fp:

The and'ed conditions can't be decoupled. So we'd still need both the lines of code you highlight above.

The only thing that could be de-duplicated with parametrization, is the contents of the if block. But even that doesn't feel quite right; it's just two aspects of the same situation.

Maybe the best thing to do if we want to be absolutely strict, is to take all the boolean flagging that is currently in these two sections

# Do we have engines installed, but a bad version of them?
[...]
# Do we have usable engines installed?

and move it outside of the test function, to module level. And use pytest.mark.skipif (or one of those wacky fixtures that inject pytest.mark.skip) to only run the test when we expect an error message.

At this point, I'm going to ask: how important is it that we do it this way? Because I'm slowly running out of steam for this.

jreback · 2020-04-08

this test is kind of overkill, but ok.

datapythonista · 2020-04-08

I see, didn't realize you need to know about both versions at the same time.

I think you could write this in a very compact way, if we create the _HAVE_USABLE_PYARROW... variables at the beginning of the file, like _HAVE_PYARROW. Then the parametrization would be trivial. But not that important.

Thanks for the work on this!

jfcorbett · 2020-04-08

Cool. I'm happy to come back to it later if someone sees this as adding value. Though I tend to agree with @jreback that we're bordering on overkill. Anyhoo, I'm glad to have this closed before everything drops out of mental cache over Easter.
Cheers all for the good input!!

jreback · 2020-04-08T16:45:49Z

thanks @jfcorbett

jfcorbett added 3 commits April 6, 2020 13:53

Collect import error messages and display them

3d5d488

black

29bfc49

Merge remote-tracking branch 'upstream/master' into fix-read-parquet-…

43de45f

…import-error-message

ShaharNaveh reviewed Apr 7, 2020

datapythonista added the Error Reporting and IO Parquet labels Apr 7, 2020

datapythonista reviewed Apr 7, 2020

jfcorbett and others added 3 commits April 7, 2020 15:46

Placate mypy who insists that "ImportError" has no attribute "msg"

bca4bd1

Rename variables

2bc7dd0

Apply suggestions from code review

ed51950

Strip error string for great minification benefit Co-Authored-By: Marc Garcia <[email protected]>

Refactor extract variable joined_error_messages

7eb45b8

jreback added this to the 1.1 milestone Apr 7, 2020

jfcorbett added 2 commits April 8, 2020 13:26

Add test for get_engine(engine="auto") error messages

59a3877

Merge remote-tracking branch 'upstream/master' into fix-read-parquet-…

c5fdadc

…import-error-message

datapythonista reviewed Apr 8, 2020

jfcorbett added 3 commits April 8, 2020 16:28

Rename

dd529c4

Refactor collection of error messages: use string instead of list

f348001

Merge remote-tracking branch 'upstream/master' into fix-read-parquet-…

7d58483

…import-error-message

jreback merged commit 60d6f28 into pandas-dev:master Apr 8, 2020

turbach mentioned this pull request Sep 24, 2020

pyarrow ImportError kutaslab/spudtr#37

Closed
