read_docx? #22518
Comments
@jnothman: Do you mind providing an example?
I'm a little hesitant here because I'm not aware of any good Python libraries to read and write Word files (@jnothman, feel free to correct me), so I'm not sure how generalizable reading data from these types of files can be.
I've not looked into these, but I'm not sure if it's relevant as long as we are:
I've converted the I can see in the
Closing and adding to a tracker issue #30407 for IO format requests; can re-open if interest is expressed.
Problem description
I sometimes need to extract tables from docx files, rather than from HTML. Given that docx XML is very HTML-like when it comes to tables, it seems appropriate to reuse pandas' loading facilities, ideally without first converting the whole docx to HTML.
Here is a hacky solution, which simply maps `tbl` to `table` and `tc` to `td`. (I've not looked into how Word marks things corresponding to `th`, `thead` and `tfoot`.)
Working implementation invoked by:
pd.read_html('file://path/to/my.docx', flavor='docx')
Let me know what interest there is, or feel free to use this code in an implementation.
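For readers who want to try the idea, the tag mapping described above can also be sketched without going through HTML at all. This is a minimal, standard-library-only illustration and not the implementation referred to in this issue. `read_docx_tables` is a hypothetical helper name, and the sketch assumes the tables sit directly under `w:body` in `word/document.xml`, with no merged cells and no nested tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_docx_tables(path_or_file):
    """Return each top-level table of a .docx as a list of rows of cell text.

    Hypothetical helper for illustration only; merged cells (gridSpan,
    vMerge) and nested tables are not handled.
    """
    # A .docx file is a zip archive; the main content is word/document.xml
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find("w:body", NS)
    if body is None:
        return []

    tables = []
    for tbl in body.findall("w:tbl", NS):        # w:tbl plays the role of <table>
        rows = []
        for tr in tbl.findall("w:tr", NS):       # w:tr plays the role of <tr>
            row = []
            for tc in tr.findall("w:tc", NS):    # w:tc plays the role of <td>
                # Join the text runs of each paragraph that is a direct
                # child of the cell, one line per paragraph
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
                    for p in tc.findall("w:p", NS)
                ]
                row.append("\n".join(paragraphs))
            rows.append(row)
        tables.append(rows)
    return tables
```

Assuming the first row of a table holds the column names, the result can then be handed to pandas with `pd.DataFrame(rows[1:], columns=rows[0])` for each `rows` in the returned list.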