Skip to content

Fix read parquet import error message #33361

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 8, 2020
Merged

Fix read parquet import error message #33361

merged 12 commits into from
Apr 8, 2020

Conversation

jfcorbett
Copy link
Contributor

@jfcorbett jfcorbett commented Apr 7, 2020

pandas.io.parquet.get_engine() uses handling of ImportErrors for flow control to decide which parquet reader engine is used. In doing so, it quashed lower-level error messages that would have been helpful to the user attempting to diagnose the error, replacing it with a misleading error message.

I refactored the error handling, allowing for these lower-level error messages to be collected and explicitly "bubbled up". Thus fixing the incorrect error message.

No tests added -- this behaviour is not worthy of testing. Presumably not worthy of a whatsnew entry either.

@datapythonista datapythonista added Error Reporting Incorrect or improved errors from pandas IO Parquet parquet, feather labels Apr 7, 2020
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jfcorbett

There is a mypy error, and would be good to have a test to check the new error message.

Added few comments about readability, but looks good.

@jfcorbett
Copy link
Contributor Author

jfcorbett commented Apr 7, 2020

@datapythonista Could I get you to point me to an existing test that tests for error messages? It isn't something I've done before, so I'd like to lean on something pre-existing.

@jfcorbett
Copy link
Contributor Author

@datapythonista In particular, I'm not sure what the best way is to mock the absence or wrong version of a dependency, respectively...

@datapythonista
Copy link
Member

I'm in my phone right now. Can you grep for pytest.raises(ImportError in the tests directory? There is a match parameter to check the error message. Thanks!

@jfcorbett
Copy link
Contributor Author

Ok, found in test_optional_dependency.py.
My problem is, I'm not sure how to monkeypatch the version of pyarrow to something bad, without screwing up later tests. I guess I could make a fixture that sets pyarrow.__version__='bad.version.42', and then set the version back to what it was during teardown, but that seems... not perfectly safe? Or is it ok?
Bit beyond what I've done before, therefore this handholding!

@datapythonista
Copy link
Member

Sorry, wasn't my idea to make things too complicated.

In general, we use pytest markers to run or skip tests depending on whether certain dependencies are installed. I think we're already doing it for pyarrow and fastparquet. For testing the error when the version is too old, I think you can simply use an if pyarrow.__version__ > whatever. Then you'll be testing one error message or the other depending on which version of pyarrow is installed.

In the CI we should have builds with new versions, and with the oldest we support. So, both tests should be executed in one build or another.

Does this sound more reasonable? I guess we can live without this test if it's too complicated, but I think doing this should be quite straight forward.

@jreback jreback added this to the 1.1 milestone Apr 7, 2020
@jreback
Copy link
Contributor

jreback commented Apr 7, 2020

can you show what the error message results in now?

@jfcorbett
Copy link
Contributor Author

jfcorbett commented Apr 8, 2020

@jreback Sure. See the example below, the last three lines in particular.
Also the test I just added documents this. (pandas/tests/io/test_parquet.py::test_get_engine_auto_error_message)

>>> from pandas.io.parquet import get_engine
>>> import pyarrow
>>> import fastparquet
>>> pyarrow.__version__ = '0.0.42'
>>> fastparquet.__version__ = '0.0.1'
>>> get_engine("auto")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\JEACO\repos\pandas\pandas\io\parquet.py", line 32, in get_engine
    "Unable to find a usable engine; "
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
Pandas requires version '0.13.0' or newer of 'pyarrow' (version '0.0.42' currently installed).
Pandas requires version '0.3.2' or newer of 'fastparquet' (version '0.0.1' currently installed).

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Looks cool, I added couple of comments, that could improve readability, but changes look good.


# Do we have engines installed, but a bad version of them?
pa_min_ver = VERSIONS.get("pyarrow")
fp_min_ver = VERSIONS.get("fastparquet")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're repeating the same twice for pyarrow and fastparquet (which makes sense). But it'd probably be worth to just implement things once, and parametrize (with pytest).

You can search for @pytest.mark.parametrize, and you'll find lots of examples of parametrized tests. The idea is that the test will receive a set of variables for each of pyarrow and fastparquet, and pytest will execute it twice with each set of variables.

Copy link
Contributor Author

@jfcorbett jfcorbett Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about parametrizing, but in this case it's tricky: the error message will only ever show up if both pyarrow and fastparquet are inadequately installed (i.e. not installed, or bad version installed). This is embodied in the conditional:

if not have_usable_pa and not have_usable_fp:

The and'ed conditions can't be decoupled. So we'd still need both the lines of code you highlight above.

The only thing that could be de-duplicated with parametrization, is the contents of the if block. But even that doesn't feel quite right; it's just two aspects of the same situation.

Maybe the best thing to do if we want to be absolutely strict, is to take all the boolean flagging that is currently in these two sections

    # Do we have engines installed, but a bad version of them?
[...]
    # Do we have usable engines installed?

and move it outside of the test function, to module level. And use pytest.mark.skipif (or one of those wacky fixtures that inject pytest.mark.skip) to only run the test when we expect an error message.

At this point, I'm going to ask: how important is it that we do it this way? Because I'm slowly running out of steam for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is kind of overkill , but ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, didn't realize you need to know about both versions at the same time.

I think you could write this in a very compact way, if we create the _HAVE_USABLE_PYARROW... variables at the beginning of the file, like _HAVE_PYARROW. Then the parametrization would be trivial. But not that important.

Thanks for the work on this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I'm happy to come back to it later if someone sees this as adding value. Though I tend to agree with @jreback that we're bordering overkill. Anyhoo, I'm glad to have this closed before everything drops out of mental cache over Easter.
Cheers all for the good input!!

@jreback jreback merged commit 60d6f28 into pandas-dev:master Apr 8, 2020
@jreback
Copy link
Contributor

jreback commented Apr 8, 2020

thanks @jfcorbett

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Bad error message on read_parquet() when wrong version of pyarrow is installed
4 participants