Skip to content

BUG, DOC: Improve dialect handling in read_csv #14911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Dec 18, 2016

  1. Update documentation about how the dialect parameter is handled.

  2. Verify that the dialect parameter passed in is valid before accessing the dialect attributes.

Closes #14898.

@@ -193,6 +193,7 @@ Map on Index types now return other Index types
Other API Changes
^^^^^^^^^^^^^^^^^

- ``pd.read_csv()`` will now issue a ``UserWarning`` whenever there are conflicting values provided by the ``dialect`` parameter and the user (:issue:`14898`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the kind of warning we typically have for other parsing warnings?

maybe we should have a ParserWarning (subclass of UserWarning)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, to be consistent with what we do with converter and dtype conflicting, I should also issue a ParserWarning. Good catch.

See csv.Dialect documentation for more details
If provided, this parameter will override values for the following
parameters: `delimiter`, `doublequote`, `escapechar`, `skipinitialspace`,
`quotechar`, and `quoting`. See csv.Dialect documentation for more details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe say that you will get a warning if any of these are non-default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because a user could want the default values on purpose, and they should still know too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but what happens in this case? e.g. does the default set, THEN the values override? (vice versa doesn't make much sense I guess)

Copy link
Member Author

@gfyoung gfyoung Dec 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct: defaults are set and then are overriden.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, maybe make that clear in the doc-string (and warning if not already done so)

@jreback jreback added Docs IO CSV read_csv, to_csv labels Dec 18, 2016
conflict_msgs.append((
"Conflicting values for '{param}': '{val}' was "
"provided, but the dialect specifies '{diaval}'. "
"Using the dialect-specified value.".format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see this is contrary to your previous statement. Is the dialect or the override used?

Copy link
Member Author

@gfyoung gfyoung Dec 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what is unclear about "dialect-specified value." For example, if you pass in explicitly delimiter='.', but the dialect specifies delimiter=',', we use delimiter=','

@gfyoung gfyoung force-pushed the csv-dialect-override branch from a3f3a88 to 269d07a Compare December 18, 2016 21:08
@codecov-io
Copy link

codecov-io commented Dec 19, 2016

Current coverage is 84.77% (diff: 100%)

Merging #14911 into master will decrease coverage by 0.01%

@@             master     #14911   diff @@
==========================================
  Files           145        145          
  Lines         51118      51129    +11   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43343      43343          
- Misses         7775       7786    +11   
  Partials          0          0          

Powered by Codecov. Last update caab85b...16e48bc

@jreback jreback added this to the 0.20.0 milestone Dec 19, 2016
@jreback
Copy link
Contributor

jreback commented Dec 19, 2016

lgtm.

@jorisvandenbossche

@gfyoung gfyoung force-pushed the csv-dialect-override branch 5 times, most recently from c108095 to ba1bb73 Compare December 24, 2016 20:22
@jorisvandenbossche
Copy link
Member

I am not sure this is worth the discussion (I never user the dialect keyword, not sure if many people do), but, if we are adding this warning, I have the following problem (as also noted in the issue):

  • For certain cases the default value conflicts with the one specified in a dialect. In those cases you even get a warning when just doing pd.read_csv(path, dialect='specific-dialect'), which seems to defeat the purpose of the dialect a bit, i.e. to specify/override a bunch of arguments together.

@gfyoung
Copy link
Member Author

gfyoung commented Dec 28, 2016

@jorisvandenbossche : I am not sure I fully understand your concern. What do you suggest then I do? It's better to be safe than sorry if we have conflicting values.

@jorisvandenbossche
Copy link
Member

It's better to be safe than sorry if we have conflicting values.

At some point, yes. But there the dialect purpose is to override default values, so it would be strange that you get a warning for the intended use of the dialect kwarg. (if I misunderstand the use, that may well be, as I never use it).
Eg:

In [1]: s = """a,b,c
   ...: 1,2,3"""

In [3]: import csv

In [4]: pd.read_csv(StringIO(s), dialect=csv.unix_dialect)
/home/joris/scipy/pandas/pandas/io/parsers.py:385: ParserWarning: Conflicting values for 'quoting': '0' was provided, but the dialect specifies '1'. Using the dialect-specified value.
  parser = TextFileReader(filepath_or_buffer, **kwds)
Out[4]: 
   a  b  c
0  1  2  3

So here the user did not provide quoting=0 him/herself as the message would indicate, it was just the default.

Maybe this is only a rare case, so we should not care about it too much. But the same could be said IMO for the warning itself which also adds complexity to the code (my saying: we could also limit it to just a clear documentation of the behaviour)

@gfyoung
Copy link
Member Author

gfyoung commented Dec 28, 2016

@jorisvandenbossche : dialect=None by default, so if somebody does pass in a dialect argument, they are of course intending to use it.

However, to your point that we should only document this behavior, I am inclined to disagree at this point. We currently warn about conflicts between dtype and converters for the C engine. If you have conflicting values for a read_csv parameter, I don't see why that would be any different.

@jorisvandenbossche
Copy link
Member

We currently warn about conflicts between dtype and converters for the C engine. If you have conflicting values for a read_csv parameter, I don't see why that would be any different.

I agree that warning is good when the user provides two conflicting kwargs like dtype and converters (and we indeed do this!). But, my point is that in this case, it is not the user who provides them both, and as such has no control over it. It are the default values that raise the warning.
And it is exactly the point of passing a dialect to override default values without having to pass each of those individual kwargs (and as a result, will give conflicts).

@gfyoung
Copy link
Member Author

gfyoung commented Dec 29, 2016

Yes, but how do you differentiate between passing the default argument in intentionally and not passing in any argument at all? If you can't tell the difference, then you warn to be safe. More information is better in such cases of uncertainty.

@jorisvandenbossche
Copy link
Member

@gfyoung Just to be clear: ideally, we would be able to deduce whether a user has passed a conflicting kwarg themselves (eg pd.read_csv(.., dialect="excel", quotechar=None) where quotechar=None would be ignored due to dialect="excel"), and warn in such cases (that I would be totally be OK with).

However, this is not possible in all cases, as we cannot distuinguish between a default value or the user passing a default value (in principle we could detect when the user passed a non-default value and only warn in those cases if there is a conflicting value. However, IMO, that is too much effort not worth this specific case.

A comment on the actual PR added docs: can you also update io.rst (with the explanation that you updated of the dialect kwarg)?

@jorisvandenbossche
Copy link
Member

Yes, but how do you differentiate between passing the default argument in intentionally and not passing in any argument at all? If you can't tell the difference, then you warn to be safe. More information is better.

See my follow-up comment I was already writing :-)
More information is better -> good information is good, but unnecessary warnings are annoying IMO.

@gfyoung gfyoung force-pushed the csv-dialect-override branch from ba1bb73 to 4a6bcce Compare December 29, 2016 01:13
@gfyoung
Copy link
Member Author

gfyoung commented Dec 29, 2016

@jorisvandenbossche : Ah, yes, good catch. io.rst updated as requested.

@gfyoung
Copy link
Member Author

gfyoung commented Dec 30, 2016

@jreback , @jorisvandenbossche : I have addressed all comments here. Ready to merge if there are no other concerns.

# If any are missing, we will raise automatically.
for param in ('delimiter', 'doublequote', 'escapechar',
'skipinitialspace', 'quotechar', 'quoting'):
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems kind of odd to check for, IOW, you are validating the dialect itself?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW, has this ever happened in tests?

Copy link
Member Author

@gfyoung gfyoung Dec 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am validating the dialect itself (because we don't check otherwise). I can add a test for this to make sure.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

lgtm. just that small comment (no big deal really).

@gfyoung gfyoung force-pushed the csv-dialect-override branch from 4a6bcce to 9b79653 Compare December 30, 2016 19:24
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 30, 2016

Still, I am -1 on adding a warning that comes up for overriding default values (as argued above, this is the exact purpose of the keyword). But if nobody else makes the same objection, not going to argue.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

I will disagree with @jorisvandenbossche here. I think the warning is useful (not that I have ever used a dialect!). But I would for sure want to know that I am overriding things.

(needs rebase as merged bunches of stuff).

@jorisvandenbossche
Copy link
Member

To give a bit an extreme case. But suppose I specified my own custom dialect:

In [13]: class my_dialect(csv.Dialect):
    ...:     delimiter = '\t'
    ...:     quotechar = "'"
    ...:     doublequote = False
    ...:     skipinitialspace = False
    ...:     quoting = 2

So if I use this with read_csv:

 pd.read_csv(..., dialect=my_dialect)

this will generate (with this PR) a long warning indicating that I am overriding a lot of default values. If you want to get rid of those warnings, you have to do:

pd.read_csv(..., dialect=my_dialect, quotechar="'", doublequote=False, skipinitialspace=False, quoting=2)

so specifying them all individually to not override them. What is then the point of having the dialect kwarg with which you can override a bunch of keywords at once? (as this is the point of the keyword, to "override values (default or not)" (from the docstring)).

But as I never use dialect, I will not argue further:-) (which I actually now did ..). But I find it just introduction of a lot of code for something not useful.

1) Update documentation about how the dialect
parameter is handled.

2) Verify that the dialect parameter passed in
is valid before accessing the dialect attributes.

Closes pandas-devgh-14898.
@gfyoung gfyoung force-pushed the csv-dialect-override branch from 9b79653 to 16e48bc Compare December 30, 2016 20:56
@gfyoung
Copy link
Member Author

gfyoung commented Dec 30, 2016

@jorisvandenbossche : Fair enough - I guess the users shall have the final word 😄

@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

ok lgtm. @gfyoung ping on green.

@gfyoung
Copy link
Member Author

gfyoung commented Dec 31, 2016

@jreback : Everything is green. Ready to merge if there are no other concerns.

@jreback jreback closed this in 5d4e92c Dec 31, 2016
@jreback
Copy link
Contributor

jreback commented Dec 31, 2016

thanks!

@gfyoung gfyoung deleted the csv-dialect-override branch December 31, 2016 18:52
@jreback
Copy link
Contributor

jreback commented Jan 19, 2017

getting this warning here: https://travis-ci.org/pandas-dev/pandas/jobs/193465661

>>>-------------------------------------------------------------------------
Warning in /tmp/doc/source/io.rst at block ending on line 426
Specify :okwarning: as an option in the ipython:: block to suppress this message
----------------------------------------------------------------------------
/home/travis/build/pandas-dev/pandas/pandas/io/parsers.py:389: ParserWarning: Conflicting values for 'quoting': '0' was provided, but the dialect specifies '3'. Using the dialect-specified value.
  parser = TextFileReader(filepath_or_buffer, **kwds)
<<<-------------------------------------------------------------------------

?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants