-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: pd.read_html argument to extract hrefs along with text from cells #45973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please fix the pre-commit checks?
847d930
to
d69ce74
Compare
Ah I missed that error - sorry about that. I've fixed it now. |
50aab0d
to
70cf3fa
Compare
70cf3fa
to
a13c5f0
Compare
This looks fine to me, albeit there is some agreement on the name of the new kwarg. Not convinced by |
I felt it was more explicit than |
pandas/io/html.py
Outdated
@@ -585,6 +624,10 @@ def _parse_tables(self, doc, match, attrs): | |||
raise ValueError(f"No tables found matching pattern {repr(match.pattern)}") | |||
return result | |||
|
|||
def _href_getter(self, obj): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type the args and returns of all of the added code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've typed the returns, but won't lxml/bs4 be required to type the args?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@attack68 What shall I do about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Sorry to bother you, but I haven't been able to come up with a solution for this. Could you please suggest how I should do it?
To elaborate a bit on my first comment - the requirements may not be installed, and in that case the typing using the custom types defined in the libraries would fail (as far as I understand), so that doesn't seem like a viable solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mroeschke Would you be able to enlighten me regarding this request? I'm still at a loss as to how to approach it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the top of the file you can do:
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from bf4/lxml import ...
Then type obj
. The CI checks have all the optional dependencies installed so these checks should be available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine, except I think you need to test all the combinations "header" "footer" etc.
It would be good if you can put it all into one test and use. @pytest.mark.parametrize
Also have you rendered the docs. I think your version added was in the wrong place so it would be good to post a rendering to show the docs are working correctly.
@attack68 Could you help with this request from jreback? I'm not sure how to go about testing it.
|
f418975
to
490005a
Compare
Rebased and tests are passing. I've kept the extract_links index structure custom (returns a flat tuple index rather than a MultiIndex) for now, but I'm open to reverting that. I'm not sure what is causing the Docstring action to fail - the line is totally different to what is shown in the error. |
this is failing checks.eg pandas/io/html.py line 1024. you cab review the logs for explanations. |
https://github.com/abmyii/pandas/blob/read_html-extract-hrefs/pandas/io/html.py#L1024 is a blank line and 1025 is not a docstring? Is https://github.com/abmyii/pandas/blob/read_html-extract-hrefs/pandas/io/html.py#L1141 the issue - and if so, should it be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc edits
c6b7ea1
to
4c7f532
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fairly good. Just one small comment and a merge conflict
Thanks for sticking with this @abmyii! |
pandas-dev#45973) * ENH: pd.read_html argument to extract hrefs along with text from cells * Fix typing error * Simplify tests * Fix still incorrect typing * Summarise whatsnew entry and move detailed explanation into user guide * More flexible link extraction * Suggested changes * extract_hrefs -> extract_links * Move versionadded to correct place and improve docstring for extract_links (@attack68) * Test for invalid extract_links value * Test all extract_link options * Fix for MultiIndex headers (also fixes tests) * Test that text surrounding <a> tag is still captured * Test for multiple <a> tags in cell * Fix all tests, with both MultiIndex -> Index and np.nan -> None conversions resolved * Add back EOF newline to test_html.py * Correct user guide example * Update pandas/io/html.py * Update pandas/io/html.py * Update pandas/io/html.py * Simplify MultiIndex -> Index conversion * Move unnecessary fixtures into test body * Simplify statement * Fix code checks Co-authored-by: JHM Darbyshire <[email protected]>
pandas-dev#45973) * ENH: pd.read_html argument to extract hrefs along with text from cells * Fix typing error * Simplify tests * Fix still incorrect typing * Summarise whatsnew entry and move detailed explanation into user guide * More flexible link extraction * Suggested changes * extract_hrefs -> extract_links * Move versionadded to correct place and improve docstring for extract_links (@attack68) * Test for invalid extract_links value * Test all extract_link options * Fix for MultiIndex headers (also fixes tests) * Test that text surrounding <a> tag is still captured * Test for multiple <a> tags in cell * Fix all tests, with both MultiIndex -> Index and np.nan -> None conversions resolved * Add back EOF newline to test_html.py * Correct user guide example * Update pandas/io/html.py * Update pandas/io/html.py * Update pandas/io/html.py * Simplify MultiIndex -> Index conversion * Move unnecessary fixtures into test body * Simplify statement * Fix code checks Co-authored-by: JHM Darbyshire <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.