Fix read parquet import error message #33361
Conversation
Thanks @jfcorbett
There is a mypy error, and it would be good to have a test to check the new error message.
Added a few comments about readability, but it looks good.
Strip error string for great minification benefit
Co-Authored-By: Marc Garcia <[email protected]>
@datapythonista Could I get you to point me to an existing test that tests for error messages? It isn't something I've done before, so I'd like to lean on something pre-existing.
@datapythonista In particular, I'm not sure what the best way is to mock the absence or wrong version of a dependency, respectively...
I'm on my phone right now. Can you grep for
Ok, found in
Sorry, wasn't my idea to make things too complicated. In general, we use pytest markers to run or skip tests depending on whether certain dependencies are installed. I think we're already doing it for [...]. In the CI we should have builds with new versions, and with the oldest we support, so both tests should be executed in one build or another. Does this sound more reasonable? I guess we can live without this test if it's too complicated, but I think doing this should be quite straightforward.
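(For illustration, a minimal sketch of that kind of dependency-based skipping, using pytest's built-in `importorskip`; the minimum version below is a placeholder, not pandas' real requirement.)

```python
import pandas as pd
import pytest


def test_pyarrow_roundtrip(tmp_path):
    # Skip (rather than fail) when pyarrow is missing or older than the
    # placeholder minimum version used here.
    pytest.importorskip("pyarrow", minversion="0.13.0")

    df = pd.DataFrame({"a": [1, 2, 3]})
    path = tmp_path / "df.parquet"
    df.to_parquet(path, engine="pyarrow")
    assert pd.read_parquet(path, engine="pyarrow").equals(df)
```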
can you show what the error message results in now?
@jreback Sure. See the example below, the last three lines in particular.
Thanks. Looks cool, I added a couple of comments that could improve readability, but the changes look good.
# Do we have engines installed, but a bad version of them?
pa_min_ver = VERSIONS.get("pyarrow")
fp_min_ver = VERSIONS.get("fastparquet")
Looks like you're repeating the same thing twice, once for pyarrow and once for fastparquet (which makes sense). But it'd probably be worth implementing things once and parametrizing (with pytest). You can search for @pytest.mark.parametrize and you'll find lots of examples of parametrized tests. The idea is that the test will receive a set of variables for each of pyarrow and fastparquet, and pytest will execute it twice, once with each set of variables.
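(A hedged sketch of what that parametrization could look like; the engine names are real, but the file name and the assumption that the requested engine is not importable are illustrative.)

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("engine", ["pyarrow", "fastparquet"])
def test_import_error_mentions_engine(engine):
    # One test body, run once per engine. Assumes the requested engine is not
    # importable in this environment, so read_parquet fails while selecting
    # an engine, before it ever touches the file.
    with pytest.raises(ImportError, match=engine):
        pd.read_parquet("does_not_exist.parquet", engine=engine)
```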
Yeah, I thought about parametrizing, but in this case it's tricky: the error message will only ever show up if both pyarrow and fastparquet are inadequately installed (i.e. not installed, or bad version installed). This is embodied in the conditional:
if not have_usable_pa and not have_usable_fp:
The and'ed conditions can't be decoupled, so we'd still need both of the lines of code you highlight above.
The only thing that could be de-duplicated with parametrization is the contents of the if block. But even that doesn't feel quite right; it's just two aspects of the same situation.
Maybe the best thing to do, if we want to be absolutely strict, is to take all the boolean flagging that is currently in these two sections:
# Do we have engines installed, but a bad version of them?
[...]
# Do we have usable engines installed?
and move it outside of the test function, to module level, and use pytest.mark.skipif (or one of those wacky fixtures that inject pytest.mark.skip) to only run the test when we expect an error message.
At this point, I'm going to ask: how important is it that we do it this way? Because I'm slowly running out of steam for this.
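(A rough sketch of the module-level-flag idea described above; the helper and flag names are hypothetical, and the match string is only illustrative.)

```python
import importlib

import pandas as pd
import pytest


def _usable(name):
    # Hypothetical helper: "usable" here just means importable; the real test
    # would also check the minimum supported version.
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


# Computed once at module level, so the skip decision is made at collection time.
_HAVE_USABLE_PA = _usable("pyarrow")
_HAVE_USABLE_FP = _usable("fastparquet")


@pytest.mark.skipif(
    _HAVE_USABLE_PA or _HAVE_USABLE_FP,
    reason="a usable parquet engine is installed",
)
def test_no_usable_engine_error_message():
    # The combined error message only appears when neither engine is usable.
    with pytest.raises(ImportError, match="usable engine"):
        pd.read_parquet("does_not_exist.parquet")
```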
this test is kind of overkill, but ok.
I see, didn't realize you need to know about both versions at the same time.
I think you could write this in a very compact way, if we create the _HAVE_USABLE_PYARROW... variables at the beginning of the file, like _HAVE_PYARROW. Then the parametrization would be trivial. But not that important.
Thanks for the work on this!
Cool. I'm happy to come back to it later if someone sees this as adding value, though I tend to agree with @jreback that we're bordering on overkill. Anyhoo, I'm glad to have this closed before everything drops out of mental cache over Easter.
Cheers all for the good input!!
thanks @jfcorbett
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
pandas.io.parquet.get_engine() uses handling of ImportErrors for flow control to decide which parquet reader engine is used. In doing so, it quashed the lower-level error messages that would have been helpful to a user trying to diagnose the problem, replacing them with a misleading error message. I refactored the error handling so that these lower-level error messages are collected and explicitly "bubbled up", which fixes the incorrect error message.
No tests added -- this behaviour is not worthy of testing. Presumably not worthy of a whatsnew entry either.