Skip to content

BUG: Avoid flaky usecols set in C engine #14984

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Dec 25, 2016

Explanation of the bug can be found here.

Closes #14792.

@codecov-io
Copy link

codecov-io commented Dec 25, 2016

Current coverage is 84.78% (diff: 100%)

Merging #14984 into master will increase coverage by <.01%

@@             master     #14984   diff @@
==========================================
  Files           145        145          
  Lines         51090      51094     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43315      43319     +4   
  Misses         7775       7775          
  Partials          0          0          

Powered by Codecov. Last update 7f0eefc...82cf55b

@jreback jreback added Bug IO CSV read_csv, to_csv labels Dec 26, 2016
@jreback jreback added this to the 0.20.0 milestone Dec 26, 2016
usecols_dtype = lib.infer_dtype(usecols)
if usecols_dtype not in ('empty', 'integer',
'string', 'unicode'):
raise ValueError(msg)

return set(usecols)
return usecols
return set(usecols), usecols_dtype
Copy link
Contributor

@jreback jreback Dec 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this still have a set conversion here? is this for a different case? if so, then should happen in your validation below (to make it more consistent), IOW, this always returns tuple of (list of usecols, dtype of usecols). Please make the doc-string indicate that as well (t he doc-string for this function)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully follow your comment here. The reason I return set is for the Python engine. I can clarify the doc-string to return set (or None) along the dtype of usecols.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more clear to have _validate_use_cols always return (list, dtype) (for all engines). It makes it very clear what it does. Then have it convert to a set elsewhere (as you do in _set_noconvert_columns. otherwise you have 2 places which do this conversion (from list -> set), depending on the engine. again confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's always set for both engines. This function returns either (independent of engine):

  • set
  • None
  • callable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : I updated the documentation to illustrate the return type is independent of engine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still, why are we returning a set here? why not a list? (for this same case)? then you don't have to rely on the ordering and convert to a list if its integer type?

could just do this: http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-whilst-preserving-order

then make it into a set at the last moment.

I am nitpicking on this because converting this to a set is happening too early, and it is just confusing.

Copy link
Member Author

@gfyoung gfyoung Dec 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early conversion to set has been used for quite some time, especially with the C engine. Also, a list is not necessarily going to help with ordering I might add:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a,b,c\n1,2,3'
>>> read_csv(StringIO(data), usecols=[2, 1])
   b  c
0  2  3

When I say "order," it is not about maintaining the order in which usecols is passed in but about ordering usecols correctly so that the names or indices are in the same order as they are listed in the data columns. That is why converting at the last moment does not present any benefit AFAICT.

@gfyoung gfyoung force-pushed the usecols-parse-dates-list branch from 30bae02 to 6ac5814 Compare December 27, 2016 01:28
if self.usecols_dtype == 'integer':
# A set of integers will be converted to a list in
# the correct order every single time.
usecols = list(self.usecols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is very confusing here. You are now converting to a list. So usecols ends up as a list, None or a callable? when / how does it end up as a set?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a short docstring for this _set_noconvert_columns function?

Copy link
Member Author

@gfyoung gfyoung Dec 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : During __init__, both functions call _validate_usecols_arg when setting self.usecols, which returns either a callable, set, or None.

@jorisvandenbossche : Added docstring.

@gfyoung gfyoung force-pushed the usecols-parse-dates-list branch from 6ac5814 to 82cf55b Compare December 27, 2016 19:16
@gfyoung
Copy link
Member Author

gfyoung commented Dec 30, 2016

@jreback , @jorisvandenbossche : Addressed comments, and everything is green.

@jreback jreback closed this in a42a015 Dec 30, 2016
@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

ok thanks @gfyoung

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants