-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Added colspecs detection to read_fwf #4955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
self.f = f | ||
self.colspecs = colspecs | ||
self.filler = filler # Empty characters between fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happened to encoding? is thousands necessary? (I get filler can prob be removed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They were not used anywhere in this class so I removed them. They are used in the PythonParser
no the FixedWidthReader
. Also I renamed the filler
to delimiter to be consistent with other functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok....encoding should stay in....its prob not tested....can you add something for that? (see the tests for read_csv)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ecoding
is used in the FixedWidthFieldParser
not FixedWidthReader
. All tests passed. I could return it if you want, but as i said it's not used anywhere...
@alefnula pls add an extry in release notes and v0.13.0 example (and put the same example in io.rst). you can do these at the end....just an FYI |
@jreback Yes, I was waiting for some feedback first. :) |
@alefnula apparently my comment got lost: what happens if you have many many columns (e.g., 10,000) or a first row that is very long (maybe 100,000 characters or something)? Will your implementation still work? |
@jtratner It will work. But it won't be very efficient if you have 10k columns. Characters lengths of the row are not such a big deal, I will allocate just one Maybe the |
@alefnula I think buffering is fine as well as what you are doing. The time you spend determining widths is a) much lower than human time doing it!, b) almost always less than actually reading the file.... |
what about doing an API like: |
@jreback As I said, whatever you like :) I would just use: Or maybe if either of the two is |
@alefnula how useful is to inspect certain rows do you think? (as opposed to just top 100) |
@jreback At first I thought it could be useful, but when I started writing the tests I found out that it's much more painful to determine the correct rows from which the But even then I'll inver correctly using this algorithm... When I added that parameter, I had another algorithm in mind, but it turned out that it wasn't correct. |
ok in that case I would then turn auto detect on infer is passed for either widths or column spec |
@jreback Deal. But that will wait have to wait until tomorrow, it's 2AM here :D |
@jreback I would maybe just change one more thing: enable just the Maybe even let it be the default if none of the above is specified? |
@alefnula that's fine..... |
@jreback Which part? Just the |
i like only accepting |
Everything done. If I missed something or have to add something more please tell me. |
@@ -789,6 +791,19 @@ column widths for contiguous columns: | |||
The parser will take care of extra white spaces around the columns | |||
so it's ok to have extra separation between the columns in the file. | |||
|
|||
If your data file has correctly separated columns using the delimiter provided | |||
to the ``read_fwf`` function - like in the case of the ``bar.csv`` file, where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put a similar blurb in v0.13.0.txt....to 'announce' this new feature
@@ -789,6 +791,21 @@ column widths for contiguous columns: | |||
The parser will take care of extra white spaces around the columns | |||
so it's ok to have extra separation between the columns in the file. | |||
|
|||
.. versionadded:: 0.13.0 | |||
|
|||
If your data file has correctly separated columns using the delimiter provided |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about making this paragraph shorter.
By default,
read_fwfwill try to infer the file's colspecs by using the first 100 rows of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtratner I wanted to point out that the columns must be clearly (visually) separated. You cannot have:
12foo
23bar
34baz
And expect to get two columns using the sniffing. Also they must be a valid fwf, so this:
12 foo
12344 bar
334 baz
Even though it's clearly separated is not a valid fwf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, read_fwf will try to infer the file's colspecs by using the first 100 rows of the file, but only if the columns are delimited by whitespace and are aligned between rows.
@alefnula you need to add test cases for what happens if you can't or shouldn't be able to sniff the file (what happens, does it raise an error?) - there must exist files for which this doesn't work. Are there any pathological cases where this sniffing could get confused? Also, please make sure you add a test case where you dynamically build a very long string (i.e., a very 'wide' file) and then test to make sure it can be sniffed appropriately. |
also what happens with different delimiters (can the delimiter be a regular expression?), and unicode (UTF-8, non-UTF-8, variable-width - you can look at some of the other tests for cases where the width of the letters vary, maybe try for some chinese characters?). |
also multicharacter delimiters, if those are accepted. |
@jtratner OK so to recap:
|
Wide files - you don't need to add that, I just wanted to check that it worked. Somehow I thought that this was setup to accept regexes, but I think I was getting confused with csv's separator. To confirm, you're saying that |
@jtratner Yes, that's correct. If it raises an exception it's a bug in my code. |
okay, that all sounds fine. I do want to make sure you test with unicode - but you're working with string and not bytes anyways, so probably okay in Python 3. I don't have a great grasp on whether using something that's variable width unicode would mess up what you're doing. |
@jtratner I'll add tests for that. So just to be clear. I should do the: 1 - Shorten and fix docs |
@alefnula Happy to clarify.
Thanks! |
@jtratner All done. |
@alefnula just need a rebase :) |
@cpcloud I rebased and squashed it. It's just one commit. Or you mean something else? |
@alefnula Sorry. I wasn't being clear. There's a merge conflict with upstream; probably with |
@@ -1945,29 +1951,63 @@ class FixedWidthReader(object): | |||
""" | |||
A reader of fixed-width lines. | |||
""" | |||
def __init__(self, f, colspecs, filler, thousands=None, encoding=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw - were filler, thousands and encoding just doing nothing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thousands
and encoding
were not used, and i renamed the filler
to delimiter
for consistency with other functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure that encoding wasn't used? How does it know how to handle non-utf8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's doing it earlier - my bad :p. Anyways, as soon as you make that one test work for both 2 and 3 I'm good with merging this.
@cpcloud Rebased. |
Implemented an algorithm that uses a bitmask to detect the gaps between the columns. The reader buffers the lines used for detection in case the input stream is not seekable.
Rebased onto master (resolved merge conflicts). |
How clever. I like it. This is fine by me |
ENH: Added colspecs detection to read_fwf
@alefnula thanks for this! |
The new detection routine fails for cases that the #4488 snippet works fine for. For example:
This works with #4488 but fails with this code due to the last line. It's a common example of a Also, the |
closes #4488
Implemented an algorithm that uses a bitmask to detect the gaps between the columns.
Also the reader buffers the lines used for detection in case it's input is not seek-able.
Added tests.