-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG, DOC: Improve dialect handling in read_csv #14911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -193,6 +193,7 @@ Map on Index types now return other Index types | |||
Other API Changes | |||
^^^^^^^^^^^^^^^^^ | |||
|
|||
- ``pd.read_csv()`` will now issue a ``UserWarning`` whenever there are conflicting values provided by the ``dialect`` parameter and the user (:issue:`14898`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the kind of warning we typically have for other parsing warnings?
maybe we should have a ParserWarning (subclass of UserWarning)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, to be consistent with what we do with converter
and dtype
conflicting, I should also issue a ParserWarning
. Good catch.
See csv.Dialect documentation for more details | ||
If provided, this parameter will override values for the following | ||
parameters: `delimiter`, `doublequote`, `escapechar`, `skipinitialspace`, | ||
`quotechar`, and `quoting`. See csv.Dialect documentation for more details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe say that you will get a warning if any of these are non-default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because a user could want the default values on purpose, and they should still know too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but what happens in this case? e.g. does the default set, THEN the values override? (vice versa doesn't make much sense I guess)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct: defaults are set and then are overriden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, maybe make that clear in the doc-string (and warning if not already done so)
conflict_msgs.append(( | ||
"Conflicting values for '{param}': '{val}' was " | ||
"provided, but the dialect specifies '{diaval}'. " | ||
"Using the dialect-specified value.".format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see this is contrary to your previous statement. Is the dialect or the override used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what is unclear about "dialect-specified value." For example, if you pass in explicitly delimiter='.'
, but the dialect specifies delimiter=','
, we use delimiter=','
a3f3a88
to
269d07a
Compare
Current coverage is 84.77% (diff: 100%)@@ master #14911 diff @@
==========================================
Files 145 145
Lines 51118 51129 +11
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
Hits 43343 43343
- Misses 7775 7786 +11
Partials 0 0
|
lgtm. |
c108095
to
ba1bb73
Compare
I am not sure this is worth the discussion (I never user the dialect keyword, not sure if many people do), but, if we are adding this warning, I have the following problem (as also noted in the issue):
|
@jorisvandenbossche : I am not sure I fully understand your concern. What do you suggest then I do? It's better to be safe than sorry if we have conflicting values. |
At some point, yes. But there the
So here the user did not provide Maybe this is only a rare case, so we should not care about it too much. But the same could be said IMO for the warning itself which also adds complexity to the code (my saying: we could also limit it to just a clear documentation of the behaviour) |
@jorisvandenbossche : However, to your point that we should only document this behavior, I am inclined to disagree at this point. We currently warn about conflicts between |
I agree that warning is good when the user provides two conflicting kwargs like dtype and converters (and we indeed do this!). But, my point is that in this case, it is not the user who provides them both, and as such has no control over it. It are the default values that raise the warning. |
Yes, but how do you differentiate between passing the default argument in intentionally and not passing in any argument at all? If you can't tell the difference, then you warn to be safe. More information is better in such cases of uncertainty. |
@gfyoung Just to be clear: ideally, we would be able to deduce whether a user has passed a conflicting kwarg themselves (eg However, this is not possible in all cases, as we cannot distuinguish between a default value or the user passing a default value (in principle we could detect when the user passed a non-default value and only warn in those cases if there is a conflicting value. However, IMO, that is too much effort not worth this specific case. A comment on the actual PR added docs: can you also update io.rst (with the explanation that you updated of the dialect kwarg)? |
See my follow-up comment I was already writing :-) |
ba1bb73
to
4a6bcce
Compare
@jorisvandenbossche : Ah, yes, good catch. |
@jreback , @jorisvandenbossche : I have addressed all comments here. Ready to merge if there are no other concerns. |
# If any are missing, we will raise automatically. | ||
for param in ('delimiter', 'doublequote', 'escapechar', | ||
'skipinitialspace', 'quotechar', 'quoting'): | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems kind of odd to check for, IOW, you are validating the dialect itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IOW, has this ever happened in tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am validating the dialect itself (because we don't check otherwise). I can add a test for this to make sure.
lgtm. just that small comment (no big deal really). |
4a6bcce
to
9b79653
Compare
Still, I am -1 on adding a warning that comes up for overriding default values (as argued above, this is the exact purpose of the keyword). But if nobody else makes the same objection, not going to argue. |
I will disagree with @jorisvandenbossche here. I think the warning is useful (not that I have ever used a dialect!). But I would for sure want to know that I am overriding things. (needs rebase as merged bunches of stuff). |
To give a bit an extreme case. But suppose I specified my own custom dialect:
So if I use this with read_csv:
this will generate (with this PR) a long warning indicating that I am overriding a lot of default values. If you want to get rid of those warnings, you have to do:
so specifying them all individually to not override them. What is then the point of having the But as I never use dialect, I will not argue further:-) (which I actually now did ..). But I find it just introduction of a lot of code for something not useful. |
1) Update documentation about how the dialect parameter is handled. 2) Verify that the dialect parameter passed in is valid before accessing the dialect attributes. Closes pandas-devgh-14898.
9b79653
to
16e48bc
Compare
@jorisvandenbossche : Fair enough - I guess the users shall have the final word 😄 |
ok lgtm. @gfyoung ping on green. |
@jreback : Everything is green. Ready to merge if there are no other concerns. |
thanks! |
getting this warning here: https://travis-ci.org/pandas-dev/pandas/jobs/193465661
? |
Update documentation about how the
dialect
parameter is handled.Verify that the
dialect
parameter passed in is valid before accessing the dialect attributes.Closes #14898.