-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Add back skip_blank_lines to read_excel in pandas v>1.1.4 #39808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@jankaWIS Thanks for the bug report! I think this is due to #34464, which dropped support for *kwds that were passed to TextFileReader(skip_blank_lines being one of them). Labeling as regression for now. The fix for this would actually be very simple if the core devs wished to add this back(Adding *kwds here pandas/pandas/io/excel/_base.py Line 373 in fc9fdba
pandas/pandas/io/excel/_base.py Line 336 in fc9fdba
|
We don’t need to add back in the keyword arguments - can just add this parameter to read_excel and document |
If anyone wants to take this up, the fix would be to add skip_blank_lines to the read_excel signature and to the io.parse call. Do note that you'll also need to add back the docstring(see variable |
take |
Will this require a new test, or is it enough that the underlying TextParser already has tests for skip_blank_lines? |
@ahawryluk Thanks for taking this up. I think you need to add a test specific for excel at |
I've figured out how to pass Should we proceed as planned, restoring the <0.24 behaviour, but clarify the docstring? (e.g. "skips blank rows if the worksheet has a single column") I can think of other options, but they're not backwards compatible:
|
I'd vote for 2. but I'm not really in charge :). |
The more I think about it, I also lean towards option 2. The current behaviour (read_excel drops empty cells from single-column spreadsheets) seems more like a bug than a feature, and most of the SO posts that @jankaWIS provided are looking for a way to turn it off. @WillAyd what do you think of this option? It's not really a deprecation, but would it require a FutureWarning? |
Is the nuance you are seeing with the single column specific to excel or is it seen in the CSV parser as well?
…Sent from my iPhone
On Feb 24, 2021, at 11:20 AM, Andrew Hawyrluk ***@***.***> wrote:
The more I think about it, I also lean towards option 2. The current behaviour (read_excel drops empty cells from single-column spreadsheets) seems more like a bug than a feature, and most of the SO posts that @jankaWIS provided are looking for a way to turn it off.
@WillAyd what do you think of this option? It's not really a deprecation, but would it require a FutureWarning?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
If the CSV is generated by pandas (or any other library that depends on the standard library So, a single-column of data with empty elements can be saved to CSV with index=False and reloaded intact. Saving the same data to excel with index=False and reloading will drop all the empty rows. Python 3.5 had a bug in |
Hmm sorry but I think I'm missing the big picture as to how this all relates. So ideally since we have a default of |
The pull request resolves this issue and the new tests pass, so it's definitely an improvement. Someone might read the new docstring and think, "that's an odd option," but at least they can turn it off now. |
#40214 fixed the buggy behavior of read_excel sometimes dropping blank lines and sometimes not. I'm -1 on exposing the |
Was there a work around in 1.1 or 1.2 to get skip_blank_lines=False functionality? |
In both 1.1 and 1.2, I believe blank lines will not be skipped unless a single column is being read. When a single column is read, there is no way currently to not skip blank lines. Is that what you're experiencing @jmathew? |
Yeah it has only one column. |
Thanks @jmathew. I haven't been able to find any workaround for 1.1/1.2, it seems blank lines are removed before any of the other processing in done by the parser. The only viable option I see to not skipping blank lines in 1.1./1.2 would be to backport #40214. Since the buggy behavior is not a regression (although this fact is complicated by having the possible workaround for this buggy behavior removed in 1.1), I do not think the backport should be done in this case. This opinion is also partially based on the thought (hope?) that the number of users impacted is small, and the severity of the impact on those users is also small. If you have any thoughts on this - please do share. |
Thanks for looking @rhshadrach I only need a solution to tide me over until 1.3 is released. So Im fine with the decision not to backport. Off the top of my head, the fix seems really simple. Is monkey patching the fix from my code at runtime an option? I don't know how to do it in python but it seems like it would be a good fit for this scenario. It is blocking for my use case as I need to know the original cell location for each value. However, worst case is I reopen the spreadsheet in a different reader and translate the data frame location to the original for this one narrow situation. |
We ended up making a build of 1.1 with #40214 applied. That seemed to work well. We also explored patching it in at runtime. Works on my machine, but hasn't been tested significantly: import pandas as pd
import ast
import inspect
read_excel_ast = ast.parse(inspect.getsource(pd.read_excel))
# Clear the decorators because they cause errors when exec.
read_excel_ast.body[0].decorator_list.clear()
# Modify the return call to include the skip_blank_lines arg
return_args = read_excel_ast.body[0].body[1].value.keywords
new_arg = ast.keyword('skip_blank_lines',ast.Constant(False))
new_arg.value.col_offset = 27
new_arg.value.lineno = return_args[-1].value.lineno + 1
return_args.append(new_arg)
exec(compile(read_excel_ast, '<string>', 'exec'), pd.__dict__) |
Thanks @jmathew - glad to hear that worked. I'll guess you already considered this, but adding an additional column (e.g. "dropme") to the excel files is another potential workaround, but that is not viable in many situations. |
@rhshadrach Did you have any other thoughts on this issue before it can be closed? #40214 (comment) |
Thanks for the ping @ahawryluk - with 1.3 around the corner I agree this can be closed. |
Is your feature request related to a problem?
Reading Stack Overflow, there has been several posts trying to skip blank lines when reading (post 1, post 2, post 3, ...). The proposed solution works for pandas v<0.24 but there is no mention about this feature and trying to use it I get:
In pandas 1.1.4 and 1.2:
TypeError: read_excel() got an unexpected keyword argument 'skip_blank_lines'
and no mention in the documentation posted above.Describe the solution you'd like
Pandas
read_excel
and/orExcelFile
should get back this argument and functionalityThe text was updated successfully, but these errors were encountered: