-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
read_fwf colspecs autodetection could be more flexible #5056
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
So what should happen in this case? Are you thinking that detection should be entirely based on first line?
|
Also, as an aside, this is probably a clearer example (when it's at the end, seems like it's a special footer)
|
Not at all. #4488 handles the first case by looking for a pattern which is common to a majority of records, In your example, the first row is indeed the only hint as to the width of the columns, |
cc @alefnula I think that if detection is 'infer' and "doesn't work", then should raise. not sure if you can make an accurate determination in all cases though? |
cc @alefnula I think you can do a percentage computation on the mask that you are computing. If they are say the same in say 90% of the lines then you can go with that (or even higher per). I think that would allow for some (but not most) no-delimeter lines |
Yes, this is a problem. The current implementation cannot detect the columns properly if they are not separated by the filler. But the suggested implementation in issue #4488 didn't work for any case I tested :( For the example you just wrote here it returns
It returns But this is not the primary reason I haven't tried to fix it. The much bigger problem with the splitting approach is that it cannot handle the cases where columns are strings which can contain the filler, for example:
The approach I used supports these cases. So basically these are the reason why I chose to use the bitmask instead of splitting and counting. (and also it's more memory and time efficient - but this is not that important if we read only 100 lines of data). The only drawback is that it cannot handle the columns that are glued together, but that is clearly stated in the documentation. Maybe an approach where the outliers will be discarded can be implemented. But I'll have to find a way to do it efficiently... |
Ultimately, you can still pass colspecs directly in ambiguous cases. |
Also I don't think this is a bug. The behavior of inferring columns is clearly documented, and the function behaves as the documentation states it does... |
@alefnula can you squash the mas column wise and compare how many equal as per of total rows? (not sure if that is really useful.....) just a thought |
Not true. There may be a bug in in #4488 (or not) but the underlying approach should handle strings Documented behaviour is great, but ultimately it's preferable to support more cases rather |
Removed bug label. It's a common case that isn't handled, which is a shame but not a bug. |
@jreback I dont actually think that this problem can be solved in a way that will satisfy everyone :( since it's too ambiguous. The provided example:
Can be split in two different (both valid) ways:
And here is an even worse case.
If we try to implement any kind of heuristic that will determine where to split, there will be a case when we will get it wrong. The user will, maybe, get the correct number of columns but the data will be incorrect. So he will get a false conclusion (especially if he has a large data set that he cannot skim through) that we correctly split the columns, but the data will not be correct. |
@alefnula agreed....false positive is bad here. I was just wondering if you can detect the false positive and raise a helpful error message? |
@jreback In the current implementation you wont get false positives. Either the number column will be correct and you will get valid data or the number of columns wont be correct...
|
In all the examples you provided #4488 would do the right thing. |
@y-p The problem is that there is no right thing :)
Which of the following is right?
Also #4488 does not work for any case I tried :( Maybe you pasted the wrong version of the code there? |
None of them, the right one is:
Both #4488 and your private case of it take the rightmost "filler" column as the edge of each column. How about adding a |
Oh, now I see what you aim for :) Correct me if I'm wrong, but your point is that most of the
Maybe a |
No, it has nothing to do with alignment. Your code is a fine default, I'll add in a kwarg Correction: an aligned kwarg makes sense, but still won't handle the cases I'm concerned with, |
@y-p if you don't have the column headers, you can't know which one is |
I think you're right that the alignment should be consistent, but the |
The algo I had in mind doesn't isn't dependent on col headings in any way. pushed to 0.14, it's not a blocker. (or someday) |
Since there hasn't been much activity for this feature since and there appears to be significant complexity for determining the correct behavior I am going to close for now, but can reopen if there's more interest |
See #4955 (comment)
This works with the snippet in the orig #4488 but fails with the merged #4955 due to the last line.
It's a common example of a fwf file.
The text was updated successfully, but these errors were encountered: