BUG, DOC: Improve dialect handling in read_csv #14911

gfyoung · 2016-12-18T20:35:30Z

Update documentation about how the dialect parameter is handled.
Verify that the dialect parameter passed in is valid before accessing the dialect attributes.

jreback · 2016-12-18T20:36:39Z

doc/source/whatsnew/v0.20.0.txt

@@ -193,6 +193,7 @@ Map on Index types now return other Index types
 Other API Changes
 ^^^^^^^^^^^^^^^^^

+- ``pd.read_csv()`` will now issue a ``UserWarning`` whenever there are conflicting values provided by the ``dialect`` parameter and the user (:issue:`14898`)


is this the kind of warning we typically have for other parsing warnings?

maybe we should have a ParserWarning (subclass of UserWarning)?

Actually, to be consistent with what we do with converter and dtype conflicting, I should also issue a ParserWarning. Good catch.

jreback · 2016-12-18T20:37:06Z

pandas/io/parsers.py

-    See csv.Dialect documentation for more details
+    If provided, this parameter will override values for the following
+    parameters: `delimiter`, `doublequote`, `escapechar`, `skipinitialspace`,
+    `quotechar`, and `quoting`. See csv.Dialect documentation for more details.


maybe say that you will get a warning if any of these are non-default?

No, because a user could want the default values on purpose, and they should still know too.

but what happens in this case? e.g. does the default set, THEN the values override? (vice versa doesn't make much sense I guess)

Correct: defaults are set and then are overriden.

ok, maybe make that clear in the doc-string (and warning if not already done so)

jreback · 2016-12-18T21:01:46Z

pandas/io/parsers.py

+                    conflict_msgs.append((
+                        "Conflicting values for '{param}': '{val}' was "
+                        "provided, but the dialect specifies '{diaval}'. "
+                        "Using the dialect-specified value.".format(


see this is contrary to your previous statement. Is the dialect or the override used?

I'm not sure what is unclear about "dialect-specified value." For example, if you pass in explicitly delimiter='.', but the dialect specifies delimiter=',', we use delimiter=','

codecov-io · 2016-12-19T01:01:12Z

Current coverage is 84.77% (diff: 100%)

Merging #14911 into master will decrease coverage by 0.01%

@@             master     #14911   diff @@
==========================================
  Files           145        145          
  Lines         51118      51129    +11   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43343      43343          
- Misses         7775       7786    +11   
  Partials          0          0

Powered by Codecov. Last update caab85b...16e48bc

jreback · 2016-12-19T11:01:56Z

lgtm.

@jorisvandenbossche

jorisvandenbossche · 2016-12-27T16:48:23Z

I am not sure this is worth the discussion (I never user the dialect keyword, not sure if many people do), but, if we are adding this warning, I have the following problem (as also noted in the issue):

For certain cases the default value conflicts with the one specified in a dialect. In those cases you even get a warning when just doing pd.read_csv(path, dialect='specific-dialect'), which seems to defeat the purpose of the dialect a bit, i.e. to specify/override a bunch of arguments together.

gfyoung · 2016-12-28T06:17:21Z

@jorisvandenbossche : I am not sure I fully understand your concern. What do you suggest then I do? It's better to be safe than sorry if we have conflicting values.

jorisvandenbossche · 2016-12-28T11:27:20Z

It's better to be safe than sorry if we have conflicting values.

At some point, yes. But there the dialect purpose is to override default values, so it would be strange that you get a warning for the intended use of the dialect kwarg. (if I misunderstand the use, that may well be, as I never use it).
Eg:

In [1]: s = """a,b,c
   ...: 1,2,3"""

In [3]: import csv

In [4]: pd.read_csv(StringIO(s), dialect=csv.unix_dialect)
/home/joris/scipy/pandas/pandas/io/parsers.py:385: ParserWarning: Conflicting values for 'quoting': '0' was provided, but the dialect specifies '1'. Using the dialect-specified value.
  parser = TextFileReader(filepath_or_buffer, **kwds)
Out[4]: 
   a  b  c
0  1  2  3

So here the user did not provide quoting=0 him/herself as the message would indicate, it was just the default.

Maybe this is only a rare case, so we should not care about it too much. But the same could be said IMO for the warning itself which also adds complexity to the code (my saying: we could also limit it to just a clear documentation of the behaviour)

gfyoung · 2016-12-28T18:46:30Z

@jorisvandenbossche : dialect=None by default, so if somebody does pass in a dialect argument, they are of course intending to use it.

However, to your point that we should only document this behavior, I am inclined to disagree at this point. We currently warn about conflicts between dtype and converters for the C engine. If you have conflicting values for a read_csv parameter, I don't see why that would be any different.

jorisvandenbossche · 2016-12-29T00:04:08Z

We currently warn about conflicts between dtype and converters for the C engine. If you have conflicting values for a read_csv parameter, I don't see why that would be any different.

I agree that warning is good when the user provides two conflicting kwargs like dtype and converters (and we indeed do this!). But, my point is that in this case, it is not the user who provides them both, and as such has no control over it. It are the default values that raise the warning.
And it is exactly the point of passing a dialect to override default values without having to pass each of those individual kwargs (and as a result, will give conflicts).

gfyoung · 2016-12-29T00:08:46Z

Yes, but how do you differentiate between passing the default argument in intentionally and not passing in any argument at all? If you can't tell the difference, then you warn to be safe. More information is better in such cases of uncertainty.

jorisvandenbossche · 2016-12-29T00:11:38Z

@gfyoung Just to be clear: ideally, we would be able to deduce whether a user has passed a conflicting kwarg themselves (eg pd.read_csv(.., dialect="excel", quotechar=None) where quotechar=None would be ignored due to dialect="excel"), and warn in such cases (that I would be totally be OK with).

However, this is not possible in all cases, as we cannot distuinguish between a default value or the user passing a default value (in principle we could detect when the user passed a non-default value and only warn in those cases if there is a conflicting value. However, IMO, that is too much effort not worth this specific case.

A comment on the actual PR added docs: can you also update io.rst (with the explanation that you updated of the dialect kwarg)?

jorisvandenbossche · 2016-12-29T00:13:02Z

Yes, but how do you differentiate between passing the default argument in intentionally and not passing in any argument at all? If you can't tell the difference, then you warn to be safe. More information is better.

See my follow-up comment I was already writing :-)
More information is better -> good information is good, but unnecessary warnings are annoying IMO.

gfyoung · 2016-12-29T01:13:26Z

@jorisvandenbossche : Ah, yes, good catch. io.rst updated as requested.

gfyoung · 2016-12-30T18:44:03Z

@jreback , @jorisvandenbossche : I have addressed all comments here. Ready to merge if there are no other concerns.

jreback · 2016-12-30T18:47:51Z

pandas/io/parsers.py

+            # If any are missing, we will raise automatically.
+            for param in ('delimiter', 'doublequote', 'escapechar',
+                          'skipinitialspace', 'quotechar', 'quoting'):
+                try:


this seems kind of odd to check for, IOW, you are validating the dialect itself?

IOW, has this ever happened in tests?

Yes, I am validating the dialect itself (because we don't check otherwise). I can add a test for this to make sure.

jreback · 2016-12-30T18:48:59Z

lgtm. just that small comment (no big deal really).

jorisvandenbossche · 2016-12-30T19:40:51Z

Still, I am -1 on adding a warning that comes up for overriding default values (as argued above, this is the exact purpose of the keyword). But if nobody else makes the same objection, not going to argue.

jreback · 2016-12-30T19:44:36Z

I will disagree with @jorisvandenbossche here. I think the warning is useful (not that I have ever used a dialect!). But I would for sure want to know that I am overriding things.

(needs rebase as merged bunches of stuff).

jorisvandenbossche · 2016-12-30T19:56:06Z

To give a bit an extreme case. But suppose I specified my own custom dialect:

In [13]: class my_dialect(csv.Dialect):
    ...:     delimiter = '\t'
    ...:     quotechar = "'"
    ...:     doublequote = False
    ...:     skipinitialspace = False
    ...:     quoting = 2

So if I use this with read_csv:

 pd.read_csv(..., dialect=my_dialect)

this will generate (with this PR) a long warning indicating that I am overriding a lot of default values. If you want to get rid of those warnings, you have to do:

pd.read_csv(..., dialect=my_dialect, quotechar="'", doublequote=False, skipinitialspace=False, quoting=2)

so specifying them all individually to not override them. What is then the point of having the dialect kwarg with which you can override a bunch of keywords at once? (as this is the point of the keyword, to "override values (default or not)" (from the docstring)).

But as I never use dialect, I will not argue further:-) (which I actually now did ..). But I find it just introduction of a lot of code for something not useful.

1) Update documentation about how the dialect parameter is handled. 2) Verify that the dialect parameter passed in is valid before accessing the dialect attributes. Closes pandas-devgh-14898.

gfyoung · 2016-12-30T20:57:49Z

@jorisvandenbossche : Fair enough - I guess the users shall have the final word 😄

jreback · 2016-12-30T21:21:07Z

ok lgtm. @gfyoung ping on green.

gfyoung · 2016-12-31T08:16:13Z

@jreback : Everything is green. Ready to merge if there are no other concerns.

jreback · 2016-12-31T15:55:25Z

thanks!

jreback · 2017-01-19T20:58:17Z

getting this warning here: https://travis-ci.org/pandas-dev/pandas/jobs/193465661

>>>-------------------------------------------------------------------------
Warning in /tmp/doc/source/io.rst at block ending on line 426
Specify :okwarning: as an option in the ipython:: block to suppress this message
----------------------------------------------------------------------------
/home/travis/build/pandas-dev/pandas/pandas/io/parsers.py:389: ParserWarning: Conflicting values for 'quoting': '0' was provided, but the dialect specifies '3'. Using the dialect-specified value.
  parser = TextFileReader(filepath_or_buffer, **kwds)
<<<-------------------------------------------------------------------------

?

jreback reviewed Dec 18, 2016

View reviewed changes

jreback added Docs IO CSV read_csv, to_csv labels Dec 18, 2016

jreback reviewed Dec 18, 2016

View reviewed changes

gfyoung force-pushed the csv-dialect-override branch from a3f3a88 to 269d07a Compare December 18, 2016 21:08

jreback added this to the 0.20.0 milestone Dec 19, 2016

gfyoung force-pushed the csv-dialect-override branch 5 times, most recently from c108095 to ba1bb73 Compare December 24, 2016 20:22

gfyoung force-pushed the csv-dialect-override branch from ba1bb73 to 4a6bcce Compare December 29, 2016 01:13

jreback reviewed Dec 30, 2016

View reviewed changes

gfyoung force-pushed the csv-dialect-override branch from 4a6bcce to 9b79653 Compare December 30, 2016 19:24

BUG, DOC: Improve dialect handling in read_csv

16e48bc

1) Update documentation about how the dialect parameter is handled. 2) Verify that the dialect parameter passed in is valid before accessing the dialect attributes. Closes pandas-devgh-14898.

gfyoung force-pushed the csv-dialect-override branch from 9b79653 to 16e48bc Compare December 30, 2016 20:56

jreback closed this in 5d4e92c Dec 31, 2016

gfyoung deleted the csv-dialect-override branch December 31, 2016 18:52

gfyoung mentioned this pull request Jan 20, 2017

BLD: Remove csv dialect warning in io.rst #15177

Merged

Uh oh!

BUG, DOC: Improve dialect handling in read_csv #14911

BUG, DOC: Improve dialect handling in read_csv #14911

Uh oh!

Conversation

gfyoung commented Dec 18, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gfyoung Dec 18, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gfyoung Dec 18, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov-io commented Dec 19, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Current coverage is 84.77% (diff: 100%)

Uh oh!

jreback commented Dec 19, 2016

Uh oh!

jorisvandenbossche commented Dec 27, 2016

Uh oh!

gfyoung commented Dec 28, 2016

Uh oh!

jorisvandenbossche commented Dec 28, 2016

Uh oh!

gfyoung commented Dec 28, 2016

Uh oh!

jorisvandenbossche commented Dec 29, 2016

Uh oh!

gfyoung commented Dec 29, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jorisvandenbossche commented Dec 29, 2016

Uh oh!

jorisvandenbossche commented Dec 29, 2016

Uh oh!

gfyoung commented Dec 29, 2016

Uh oh!

gfyoung commented Dec 30, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gfyoung Dec 30, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Dec 30, 2016

Uh oh!

jorisvandenbossche commented Dec 30, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Dec 30, 2016

Uh oh!

jorisvandenbossche commented Dec 30, 2016

Uh oh!

gfyoung commented Dec 30, 2016

Uh oh!

jreback commented Dec 30, 2016

Uh oh!

gfyoung commented Dec 31, 2016

Uh oh!

jreback commented Dec 31, 2016

gfyoung Dec 18, 2016 •

edited

Loading

gfyoung Dec 18, 2016 •

edited

Loading

codecov-io commented Dec 19, 2016 •

edited

Loading

gfyoung commented Dec 29, 2016 •

edited

Loading

gfyoung Dec 30, 2016 •

edited

Loading

jorisvandenbossche commented Dec 30, 2016 •

edited

Loading