Auto-detect field widths in read_fwf when unspecified #4488
Comments
I could take this, if that's OK? I just need a few pointers:
Your point 2 is correct. Here is a fix for that (and maybe you should just do this in all cases): don't start with the buffer/path itself, but wrap it in a StringIO-like class that can handle this type of buffering, kind of like the Reader class you created in that generator question on SO. It would also have a reset method (probably under a more consistent name that is compatible with StringIO) that seeks to the beginning and first hands back what it has already read before resuming from the generator. Make sense?
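A minimal sketch of the kind of wrapper described above, for illustration only. The class name `ReplayableLines` and its `reset` method are hypothetical and are not part of pandas or of the code discussed in this thread. It keeps every line it has handed out, so it is only meant for peeking at a small number of leading rows.

```python
class ReplayableLines:
    """Hypothetical wrapper around any iterator of lines.

    Lines pulled from the source are remembered, so that after reset()
    they are handed out again before reading continues from the source.
    """

    def __init__(self, lines):
        self._source = iter(lines)
        self._seen = []  # lines already pulled from the source
        self._pos = 0    # index of the next line to replay from _seen

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos < len(self._seen):
            # Replay a line that was read before the last reset().
            line = self._seen[self._pos]
        else:
            # Nothing left to replay: pull a fresh line from the source.
            # StopIteration from the source simply propagates.
            line = next(self._source)
            self._seen.append(line)
        self._pos += 1
        return line

    def reset(self):
        """Rewind to the first line that was read."""
        self._pos = 0
```

With this, the width detection could read a few rows, call `reset()`, and hand the same object to the parser, which would then see the file from its first line.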
@jreback Yes. I would just avoid buffering the whole data, just the last Also, is adding |
Is that for the 'checking' part? If so, yes. But I would think you only need a few lines (e.g. when there is no header and the user didn't use skiprows, or, if they did, after skipping those rows), and I think you have to assume that the format is the same throughout, yes? It IS fixed width (though I guess you then have to 'validate' it). Maybe raise if you find discrepancies?
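One way the "raise if you find discrepancies" idea could look, as a sketch under stated assumptions: the function name `check_rows_fit` is hypothetical, columns are given as half-open `(start, end)` pairs, and a discrepancy is taken to mean a non-space character that falls outside every column.

```python
def check_rows_fit(rows, colspecs):
    """Raise ValueError if a row has non-space text outside every column.

    rows     -- iterable of text lines
    colspecs -- list of half-open (start, end) character ranges (assumed)
    """
    for lineno, row in enumerate(rows, start=1):
        row = row.rstrip("\r\n")
        # Mark the character positions that belong to some column.
        covered = [False] * len(row)
        for start, end in colspecs:
            for i in range(start, min(end, len(row))):
                covered[i] = True
        # Any non-space character in a gap means the row does not fit.
        for i, ch in enumerate(row):
            if ch != " " and not covered[i]:
                raise ValueError(
                    f"line {lineno}: unexpected character {ch!r} at position {i}"
                )
```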
Yes just for inferring the width of columns. There are a lot of things that should be taken into account, as you mentioned ( |
@alefnula I realize you are right: you probably do need more lines in order to figure out where the proper terminations are, because the fields can be ragged within the 'unknown' widths.
I think I found a simpler solution based on the minimal starting point of each column: detect_fwf_widths. I just have to work out the edge cases and implement the buffering.
@alefnula gr8....prob will need a number of tests for this....what API are you proposing for this? maybe |
@jreback I added a
I wanted to enable user to say either:
If this just complicates things I can use just a simple |
Also, do you have any idea why I got this error from Travis for every Python interpreter except 2.7? The file is in the repository.
You need to edit e.g.
OK, I think I found an optimal algorithm that uses a bitmask to detect the gaps between columns: detect_fwf_widths. I also implemented simple buffering (I don't think it could be simpler, but if you want me to change it I will): get_rows and next. The tests are here. Some of the data sets are pretty messy, but it detects the columns correctly: example. If you have more ideas for test files, please post them. If the naming conventions are OK, I'll update the documentation; if not, just tell me what to change and I'll do it.
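For readers without access to the linked code, the bitmask idea can be sketched roughly as follows. This is an illustration under assumptions, not the linked detect_fwf_widths implementation: the function name `infer_colspecs` is hypothetical, only the space character counts as filler, and a column is taken to be any run of positions where at least one sampled row has a non-space character.

```python
def infer_colspecs(rows):
    """Guess half-open (start, end) column ranges from sample rows."""
    rows = [row.rstrip("\r\n") for row in rows]
    width = max((len(row) for row in rows), default=0)

    # mask[i] is True if any sampled row has a non-space character at i.
    mask = [False] * width
    for row in rows:
        for i, ch in enumerate(row):
            if ch != " ":
                mask[i] = True

    # Each run of True positions is one column; False runs are the gaps.
    colspecs = []
    start = None
    for i, filled in enumerate(mask):
        if filled and start is None:
            start = i
        elif not filled and start is not None:
            colspecs.append((start, i))
            start = None
    if start is not None:
        colspecs.append((start, width))
    return colspecs


sample = [
    "id   name   value",
    "1    alice  3.5",
    "22   bob    10.25",
]
print(infer_colspecs(sample))  # [(0, 2), (5, 10), (12, 17)]
```

The resulting pairs have the same half-open shape that `read_fwf` takes in its `colspecs` argument. A value containing an embedded space in every sampled row would be split into two columns, which is one reason more sample rows help.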
Just to check: would it still work if you had 1000 columns? 10K? or even |
@alefnula, can you open a PR to discuss? It's easier to see diffs there (and you can always change it up by pushing or editing commits).
Here's a quick hack for generating the widths arg. It's basic, but it works for me.
Any takers to flesh this out into a PR with tests?