-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Warn when dtypes differ in between chunks in csv parser #4991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dtypes = set([a.dtype for a in arrs]) | ||
if len(dtypes) > 1: | ||
common_type = np.find_common_type(dtypes, []) | ||
if common_type == np.object: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't convert to np.str, leave as np.object. But I think this might need be a tad more restrictive. If all types are numeric (and none of np.datetime64 or np.timedelta64), then use the common. else with mixed typed use np.object. User can then deal with it. You could do a UserWarning in the 2nd case (e.g more than 1 type and its going to be object). See if it triggers at all currently (and need a test for triggering it), you can use tm.assert_raises_warning.
I've pushed what I think is a good compromise. If all types are numeric, then coerce appropriately. If not, then use np.object type (this is actually the default) and issue warning. The problem here is that an array with type object can hold elements of different types. We either have to coerce the types to the same (very complicated with dates, objects, and what have you) or issue a warning and let the user sort it out. I've opted for a warning in this case. Tests added for both cases. |
ok....that sounds good. coercing is quite difficult, I fixed a bug in concat that does that and its about a page of code!. The parser is not setup to do it (nor should it be), so that is fine. The warning is probably fine. Though I don't believe other warnings are issued for really anything. Maybe they should even occur in other situations like this. If you think of something, pls open an issue. thanks! |
any chance that code in concat can be used for this? |
maybe not sure it is worth it though |
warning_message = " ".join(["Column %s has mixed types." % name, | ||
"Specify dtype option on import or set low_memory=False." | ||
]) | ||
print >> sys.stderr, warning_message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create a parser warning type:
class DtypeAmbiguityWarning(Warning): pass
call like this
warnings.warn(warning_message, DtypeAmbiguityWarning)
you may have to import warnings
at the top
Done. Should be ready to go pending tests. |
integers = [str(i) for i in range(499999)] | ||
data = "a\n" + "\n".join(integers + ["1.0", "2.0"] + integers) | ||
|
||
with warnings.catch_warnings(record=True) as w: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use tm.assert_raises_warning
here (same idea), but don't need the fail test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean... we don't want a warning to be raised here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert_raises_warning
just checks that you are in fact warning (with the correct warning).
I just realized that you passed low memory.....so need 2 tests then: your exisiting is fine, and then one w/o low_memory (so that the warning happens and that you check that it happens). this is just to catch the guarantee on the function (about the warning), in case someone changes in the future
I made two changes: |
gr8.....waiting to merge a big change (unrealed to this).... |
I'll take a look at that one next. I've got a data set I've been playing with (nhtsa vehicle safety reports... pretty interesting) that triggers the warning for 11 columns. That's 11 warnings. I'm going to try to squash it to one warning, so hold off on merging this request until I do. |
ok...sure...lmk |
closes pandas-dev#3866 Silently fix problem rather than warning if we can coerce to numerical type.
Tests pass. Ready for merge. |
thank you sir! |
@jreback yes - I'll add comments there. |
Closes #3866