Skip to content

BUG: Warn when dtypes differ in between chunks in csv parser #4991

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 29, 2013

Conversation

guyrt
Copy link
Contributor

@guyrt guyrt commented Sep 26, 2013

Closes #3866

dtypes = set([a.dtype for a in arrs])
if len(dtypes) > 1:
common_type = np.find_common_type(dtypes, [])
if common_type == np.object:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't convert to np.str, leave as np.object. But I think this might need be a tad more restrictive. If all types are numeric (and none of np.datetime64 or np.timedelta64), then use the common. else with mixed typed use np.object. User can then deal with it. You could do a UserWarning in the 2nd case (e.g more than 1 type and its going to be object). See if it triggers at all currently (and need a test for triggering it), you can use tm.assert_raises_warning.

@guyrt
Copy link
Contributor Author

guyrt commented Sep 27, 2013

I've pushed what I think is a good compromise.

If all types are numeric, then coerce appropriately.

If not, then use np.object type (this is actually the default) and issue warning. The problem here is that an array with type object can hold elements of different types. We either have to coerce the types to the same (very complicated with dates, objects, and what have you) or issue a warning and let the user sort it out. I've opted for a warning in this case. Tests added for both cases.

@jreback
Copy link
Contributor

jreback commented Sep 27, 2013

ok....that sounds good. coercing is quite difficult, I fixed a bug in concat that does that and its about a page of code!. The parser is not setup to do it (nor should it be), so that is fine.

The warning is probably fine. Though I don't believe other warnings are issued for really anything. Maybe they should even occur in other situations like this. If you think of something, pls open an issue.

thanks!

@guyrt
Copy link
Contributor Author

guyrt commented Sep 27, 2013

any chance that code in concat can be used for this?

@jreback
Copy link
Contributor

jreback commented Sep 27, 2013

maybe
look in tools/merge/concat_single_item

not sure it is worth it though

warning_message = " ".join(["Column %s has mixed types." % name,
"Specify dtype option on import or set low_memory=False."
])
print >> sys.stderr, warning_message
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create a parser warning type:

class DtypeAmbiguityWarning(Warning): pass

call like this

warnings.warn(warning_message, DtypeAmbiguityWarning)

you may have to import warnings at the top

@guyrt
Copy link
Contributor Author

guyrt commented Sep 27, 2013

Done. Should be ready to go pending tests.

integers = [str(i) for i in range(499999)]
data = "a\n" + "\n".join(integers + ["1.0", "2.0"] + integers)

with warnings.catch_warnings(record=True) as w:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use tm.assert_raises_warning here (same idea), but don't need the fail test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean... we don't want a warning to be raised here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_raises_warning just checks that you are in fact warning (with the correct warning).

I just realized that you passed low memory.....so need 2 tests then: your exisiting is fine, and then one w/o low_memory (so that the warning happens and that you check that it happens). this is just to catch the guarantee on the function (about the warning), in case someone changes in the future

@guyrt
Copy link
Contributor Author

guyrt commented Sep 28, 2013

I made two changes:
Fix extra print (oops)
Change tests to run on all parsers. Only C low memory parser should raise the warning.

@jreback
Copy link
Contributor

jreback commented Sep 28, 2013

gr8.....waiting to merge a big change (unrealed to this)....

@jreback
Copy link
Contributor

jreback commented Sep 29, 2013

@guyrt #4335?

@guyrt
Copy link
Contributor Author

guyrt commented Sep 29, 2013

I'll take a look at that one next.

I've got a data set I've been playing with (nhtsa vehicle safety reports... pretty interesting) that triggers the warning for 11 columns. That's 11 warnings. I'm going to try to squash it to one warning, so hold off on merging this request until I do.

@jreback
Copy link
Contributor

jreback commented Sep 29, 2013

ok...sure...lmk

closes pandas-dev#3866

Silently fix problem rather than warning if we can coerce
to numerical type.
@guyrt
Copy link
Contributor Author

guyrt commented Sep 29, 2013

Tests pass. Ready for merge.

@jreback jreback merged commit c1836fa into pandas-dev:master Sep 29, 2013
@jreback
Copy link
Contributor

jreback commented Sep 29, 2013

thank you sir!

@jreback
Copy link
Contributor

jreback commented Oct 4, 2013

@guyrt did you have a chance to look at #4335 ?

@guyrt
Copy link
Contributor Author

guyrt commented Oct 4, 2013

@jreback yes - I'll add comments there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Unexpected behaviour when reading large text files with mixed datatypes
2 participants