BUG: Warn when dtypes differ in between chunks in csv parser #4991

guyrt · 2013-09-26T03:52:21Z

jreback · 2013-09-26T10:59:37Z

pandas/parser.pyx

+        dtypes = set([a.dtype for a in arrs])
+        if len(dtypes) > 1:
+            common_type = np.find_common_type(dtypes, [])
+            if common_type == np.object:


Don't convert to np.str; leave it as np.object. But I think this might need to be a tad more restrictive. If all types are numeric (and none are np.datetime64 or np.timedelta64), then use the common type. Otherwise, with mixed types, use np.object; the user can then deal with it. You could issue a UserWarning in the second case (i.e. more than one type and the result is going to be object). See if it triggers at all currently (it also needs a test that triggers it); you can use tm.assert_raises_warning.

guyrt · 2013-09-27T00:31:24Z

I've pushed what I think is a good compromise.

If all types are numeric, then coerce appropriately.

If not, use the np.object type (this is actually the default) and issue a warning. The problem here is that an array of type object can hold elements of different types. We either have to coerce the elements to a single type (very complicated with dates, objects, and what have you) or issue a warning and let the user sort it out. I've opted for the warning in this case. Tests are added for both cases.
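
The rule described above can be sketched in pure Python, without NumPy or the real parser. This is only an illustration under stated assumptions: `infer_chunk_type`, `combine_chunk_types`, and the string type labels are hypothetical names, not pandas internals, and the built-in `UserWarning` stands in for whatever warning class the parser ends up using.

```python
import warnings

NUMERIC = ("int", "float")  # hypothetical labels standing in for numeric dtypes


def infer_chunk_type(values):
    """Infer one type label for a chunk of raw string fields."""
    for label, caster in (("int", int), ("float", float)):
        try:
            for v in values:
                caster(v)
            return label
        except ValueError:
            continue
    return "object"


def combine_chunk_types(name, chunk_types):
    """Pick a single type for a column whose chunks were inferred separately."""
    kinds = set(chunk_types)
    if len(kinds) == 1:
        return kinds.pop()
    if kinds <= set(NUMERIC):
        # All chunks are numeric: silently promote to the wider type.
        return "float"
    # Numeric and non-numeric chunks are mixed: fall back to object and warn.
    warnings.warn("Column %s has mixed types." % name, UserWarning)
    return "object"


chunks = [["1", "2"], ["1.0", "2.0"], ["3", "4"]]
types = [infer_chunk_type(c) for c in chunks]
print(types, "->", combine_chunk_types("a", types))
```

In this example the three chunks are inferred as int, float, and int, so the column is promoted to float without a warning. Replacing one chunk with non-numeric strings would take the object branch and emit the warning instead.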

jreback · 2013-09-27T00:36:28Z

OK, that sounds good. Coercing is quite difficult; I fixed a bug in concat that does that, and it's about a page of code! The parser is not set up to do it (nor should it be), so that is fine.

The warning is probably fine, though I don't believe warnings are currently issued for really anything else. Maybe they should be issued in other situations like this. If you think of something, please open an issue.

thanks!

guyrt · 2013-09-27T00:41:59Z

any chance that code in concat can be used for this?

jreback · 2013-09-27T00:53:27Z

maybe
look in tools/merge/concat_single_item

not sure it is worth it though

jreback · 2013-09-27T12:43:18Z

pandas/parser.pyx

+                warning_message = " ".join(["Column %s has mixed types." % name,
+                    "Specify dtype option on import or set low_memory=False."
+                  ])
+                print >> sys.stderr, warning_message


create a parser warning type:

class DtypeAmbiguityWarning(Warning): pass

call like this

warnings.warn(warning_message, DtypeAmbiguityWarning)

you may have to import warnings at the top
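
A minimal sketch of that suggestion, using only the standard library. The class name and message text come from the comments above; the helper `warn_mixed_types` is a hypothetical wrapper added here for illustration and is not part of the PR.

```python
import warnings


class DtypeAmbiguityWarning(Warning):
    """Issued when a column's chunks disagree on dtype."""


def warn_mixed_types(name):
    # Hypothetical helper; the message text follows the diff under review.
    warning_message = " ".join([
        "Column %s has mixed types." % name,
        "Specify dtype option on import or set low_memory=False.",
    ])
    # stacklevel=2 attributes the warning to the caller of this helper.
    warnings.warn(warning_message, DtypeAmbiguityWarning, stacklevel=2)


warn_mixed_types("a")
```

Compared with printing to stderr, a dedicated warning class lets callers silence it with `warnings.simplefilter("ignore", DtypeAmbiguityWarning)` or turn it into an error with `"error"`, and lets tests assert on it.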

guyrt · 2013-09-27T13:58:33Z

Done. Should be ready to go pending tests.

jreback · 2013-09-27T14:02:54Z

pandas/io/tests/test_parsers.py

+        integers = [str(i) for i in range(499999)]
+        data = "a\n" + "\n".join(integers + ["1.0", "2.0"] + integers)
+
+        with warnings.catch_warnings(record=True) as w:


use tm.assert_raises_warning here (same idea), but don't need the fail test

I'm not sure what you mean... we don't want a warning to be raised here.

assert_raises_warning just checks that you are in fact warning (with the correct warning).

I just realized that you passed low_memory, so you need two tests: your existing one is fine, and then one without low_memory (so that the warning happens and you check that it happens). This is just to pin down the function's guarantee about the warning, in case someone changes it in the future.
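
The two-test idea can be sketched with the standard `warnings` module alone. Everything here is an assumption made for illustration: `read_column` is a hypothetical stand-in for the parser (it is not `read_csv`), its `chunk_size` argument plays the role of chunked parsing, and `collect_warnings` is a small hypothetical helper.

```python
import warnings


class DtypeAmbiguityWarning(Warning):
    pass


def read_column(values, chunk_size=None):
    """Hypothetical stand-in for a parser: True if the column is all ints.

    With chunk_size set, each chunk is checked on its own and a warning is
    issued when the chunks disagree.
    """
    def all_ints(chunk):
        return all(v.lstrip("-").isdigit() for v in chunk)

    if chunk_size is None:
        return all_ints(values)
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    results = set(all_ints(c) for c in chunks)
    if len(results) > 1:
        warnings.warn("Column has mixed types.", DtypeAmbiguityWarning)
    return results == {True}


def collect_warnings(func, *args, **kwargs):
    """Run func and return the list of warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is suppressed
        func(*args, **kwargs)
    return caught


data = ["1", "2", "3", "4", "a", "b", "5", "6"]

# Test 1: reading the whole column at once must not warn.
assert collect_warnings(read_column, data) == []

# Test 2: the chunked read must warn, and with the expected category.
caught = collect_warnings(read_column, data, chunk_size=4)
assert len(caught) == 1
assert issubclass(caught[0].category, DtypeAmbiguityWarning)
```

Keeping both tests guards the behaviour in each direction: the first fails if a warning starts appearing where it should not, and the second fails if the warning is ever dropped.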

guyrt · 2013-09-28T21:42:03Z

I made two changes:
Fix extra print (oops)
Change tests to run on all parsers. Only C low memory parser should raise the warning.

jreback · 2013-09-28T21:44:26Z

gr8.....waiting to merge a big change (unrelated to this)....

jreback · 2013-09-29T03:12:07Z

@guyrt #4335?

guyrt · 2013-09-29T17:17:09Z

I'll take a look at that one next.

I've got a data set I've been playing with (NHTSA vehicle safety reports... pretty interesting) that triggers the warning for 11 columns. That's 11 warnings. I'm going to try to squash them into one warning, so hold off on merging this request until I do.
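
One way to squash per-column warnings into a single one is to collect the affected column names first and warn once at the end. The sketch below is an assumption about how that could look, not the code that was merged; `warn_once_for_columns` and the exact message wording are hypothetical.

```python
import warnings


class DtypeAmbiguityWarning(Warning):
    pass


def warn_once_for_columns(mixed_columns):
    """Issue a single warning naming every mixed-type column."""
    if not mixed_columns:
        return None
    names = ",".join(str(c) for c in mixed_columns)
    message = " ".join([
        "Columns (%s) have mixed types." % names,
        "Specify dtype option on import or set low_memory=False.",
    ])
    warnings.warn(message, DtypeAmbiguityWarning, stacklevel=2)
    return message


# Eleven affected columns produce one warning instead of eleven.
warn_once_for_columns(range(11))
```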

jreback · 2013-09-29T17:22:12Z

ok...sure...lmk

closes pandas-dev#3866 Silently fix problem rather than warning if we can coerce to numerical type.

guyrt · 2013-09-29T18:38:55Z

Tests pass. Ready for merge.

jreback · 2013-09-29T19:28:22Z

thank you sir!

jreback · 2013-10-04T20:23:46Z

@guyrt did you have a chance to look at #4335 ?

guyrt · 2013-10-04T20:31:26Z

@jreback yes - I'll add comments there.

BUG: Warn when dtypes differ in between chunks in csv parser

c1836fa

closes pandas-dev#3866 Silently fix problem rather than warning if we can coerce to numerical type.

jreback merged commit c1836fa into pandas-dev:master Sep 29, 2013
