ENH: Add large file support for read_xml #45724
Conversation
A lot of these tests seem to use the same data source / structure as the tests in test_xml.py. Can they be shared?
Also could you reduce the amount of url reading involved in the testing (i.e. save the source as a file and read directly)? I've been trying to reduce the amount of CI testing flakiness due to this.
"for value in iterparse" | ||
) | ||
|
||
if ( |
@twoertwein, your thoughts on this handling? Since `iterparse` can potentially read very large XML files (1 GB, 5 GB, 10 GB+), this checks for strings, online docs, or compressed docs. The idea is to avoid `get_handle` downloading or extracting such large content in memory and raising `MemoryError` or `OSError`.
I'm not sure, some people have lots of memory and would be fine downloading a 10GB file. Might be easier to have such a warning as part of the doc-string?
Iterparse is memory efficient for large files. Only physical, fully extracted files on the hard disk are allowed in this method. It raises `ParserError` for online or compressed sources to avoid in-memory downloading or decompression by `get_handle`. Users may have a 10 GB XML file on local disk to iterparse here. Do these `if` conditions exhaust all non-file-system possibilities? Could `get_handle` or a related method check whether the path is a local file and return the path without reading its content?
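A hypothetical guard along those lines (a sketch only, not pandas' actual implementation; the helper name and the extension list are assumptions) could look like:

```python
import os
from urllib.parse import urlparse

def is_local_uncompressed_file(path) -> bool:
    """Return True only for an existing, uncompressed file on local disk.

    A rough sketch: reject anything with a URL scheme (http, s3, ftp, ...)
    or a known compression extension before touching the file system.
    """
    if not isinstance(path, (str, os.PathLike)):
        return False
    path = os.fspath(path)
    # A scheme longer than one character rules out Windows drive letters
    # like "C:" while still catching http://, s3://, ftp://, etc.
    if len(urlparse(path).scheme) > 1:
        return False
    if path.lower().endswith((".gz", ".bz2", ".zip", ".xz", ".zst")):
        return False
    return os.path.isfile(path)

print(is_local_uncompressed_file("https://example.com/big.xml"))  # False
```

Checking the path this way never opens or reads the file, so a 10 GB document costs nothing to validate.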
Could use `pandas.io.common.is_url(...)` or `pandas.io.common.is_fsspec_url(...)`
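For reference, a quick sketch of how those two helpers classify sources (they live in pandas' internal `pandas.io.common` module, so their location is subject to change):

```python
from pandas.io.common import is_url, is_fsspec_url

sources = [
    "https://example.com/data.xml",  # plain URL: is_url matches
    "s3://bucket/data.xml",          # fsspec-style URL: is_fsspec_url matches
    "/tmp/data.xml",                 # local path: neither helper matches
]
for src in sources:
    print(src, is_url(src), is_fsspec_url(src))
```

A local-disk-only code path could therefore reject anything for which either helper returns True.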
You are already using these :) along with some more conditions
Could also add a `local_only` keyword to `get_handle`, but that might make `get_handle` even more complex.
@mroeschke, I combined tests with the tests in the original and dtypes modules. For this new feature, one URL is tested only to raise an error, so it may never reach the endpoint. I removed any new S3 calls. Given the large size potential, this feature requires only uncompressed file system paths.
@ParfaitG if you can merge master
def test_parse_dates_true(parser):
    df_result = read_xml(xml_dates, parse_dates=True, parser=parser)
    with tm.ensure_clean() as path:
may want to create a helper function to test iterparse vs full read in a more concise way
Good idea and now implemented.
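The merged helper works roughly along these lines (a sketch with illustrative names: since `iterparse` only accepts a file on disk, the XML string is round-tripped through a temp file so one call can be compared against a plain in-memory read):

```python
import io
import pandas as pd
import pandas._testing as tm

xml = """\
<data>
  <row><shape>square</shape><sides>4</sides></row>
  <row><shape>circle</shape><sides>0</sides></row>
</data>"""

def read_xml_iterparse(data, **kwargs):
    # iterparse requires a path to a physical file, so write the
    # string out to a temp file before parsing it
    with tm.ensure_clean() as path:
        with open(path, "w") as f:
            f.write(data)
        return pd.read_xml(path, **kwargs)

df_full = pd.read_xml(io.StringIO(xml), parser="etree")
df_iter = read_xml_iterparse(
    xml, parser="etree", iterparse={"row": ["shape", "sides"]}
)
tm.assert_frame_equal(df_full, df_iter)
```

Each test can then exercise both code paths with one extra call instead of duplicating the fixture handling.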
@ParfaitG can you merge master
lgtm, @mroeschke if any comments.
@@ -3287,6 +3287,45 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:

    df = pd.read_xml(xml, stylesheet=xsl)
    df
For very large XML files that can range from hundreds of megabytes to gigabytes, :func:`pandas.read_xml` supports parsing such sizeable files using ``iterparse``, a memory-efficient method to iterate through an XML tree and extract specific elements and attributes without holding the entire tree in memory.
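In practice that looks like the following (a small self-contained sample standing in for a multi-gigabyte document; ``iterparse`` requires pandas >= 1.5 and a file on local disk, and the etree parser is used here only to avoid the lxml dependency):

```python
import os
import tempfile
import pandas as pd

xml = """<?xml version="1.0"?>
<geometry>
  <row><shape>square</shape><degrees>360</degrees><sides>4.0</sides></row>
  <row><shape>circle</shape><degrees>360</degrees><sides/></row>
</geometry>"""

# iterparse only works against a file on disk, never a URL,
# buffer, or compressed archive
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write(xml)
    path = f.name

# for each repeating <row> element, pull only the named children,
# streaming the document instead of building the whole tree in memory
df = pd.read_xml(
    path,
    iterparse={"row": ["shape", "degrees", "sides"]},
    parser="etree",
)
os.remove(path)
print(df)
```

The keys of the ``iterparse`` dict name the repeating row element; the values list the child elements or attributes to extract from each row.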
Is this entire xml section in the user guide linked to the `read_xml` docstring? If not, it would be good to link them under the Notes section of the docstring.
Good idea! Done.
One small question otherwise LGTM
Awesome, thanks for sticking with it
Awesome feature, already started using it! What is the purpose of this bit at Line 677 in fa7e31b?
This started giving me "TypeError: 'NoneType' object does not support item deletion" on my XML files. The equivalent ElementTree parser does not have this (only lxml). After removing it, everything works well, at least on my XML file.

As to attribute parsing, shouldn't we parse attributes only on `row_node`? Otherwise, an entirely different element that is a child of `row_node` could have the attribute, but it may not be what we're looking for. Maybe this? (untested)

diff --git a/pandas/io/xml.py b/pandas/io/xml.py
index 181b0fe..8740e8c 100644
--- a/pandas/io/xml.py
+++ b/pandas/io/xml.py
@@ -411,13 +411,14 @@ class _EtreeFrameParser(_XMLFrameParser):
             if event == "start":
                 if curr_elem == row_node:
                     row = {}
+                    for col in self.iterparse[row_node]:
+                        if col in elem.attrib:
+                            row[col] = elem.attrib[col]

             if row is not None:
                 for col in self.iterparse[row_node]:
                     if curr_elem == col:
                         row[col] = elem.text.strip() if elem.text else None
-                    if col in elem.attrib:
-                        row[col] = elem.attrib[col]

             if event == "end":
                 if curr_elem == row_node and row is not None:
@@ -659,22 +660,20 @@ class _LxmlFrameParser(_XMLFrameParser):
             if event == "start":
                 if curr_elem == row_node:
                     row = {}
+                    for col in self.iterparse[row_node]:
+                        if col in elem.attrib:
+                            row[col] = elem.attrib[col]

             if row is not None:
                 for col in self.iterparse[row_node]:
                     if curr_elem == col:
                         row[col] = elem.text.strip() if elem.text else None
-                    if col in elem.attrib:
-                        row[col] = elem.attrib[col]
Thank you for your comment, @bailsman. Given this PR is closed, can you open a potential BUG issue for this new feature with a reproducible example? And possibly open a QST issue for your attribute parsing question, or ask on StackOverflow. Specifically, we need a minimal XML sample and the attempted code with the desired result.
Before I go off and try to debug this so that I can make a reproducible example, can I ask what this line does and what its intended purpose was? Line 677 in fa7e31b

It's lxml specific (the ElementTree equivalent doesn't have it), and, at least in my case, it seems to work fine without it and throws errors if I keep it. On initial investigation, the problem doesn't occur on the small files I've tested, so it would take some additional debugging effort on my part to figure out how to reproduce it, which I'd rather invest with a fuller understanding, to save time.
Correct. That is an lxml-specific method; the iterparse docs indicate that the line intends to clean up already-processed preceding siblings, with the stated goal of keeping memory usage low while iterating.
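For context, the line in question follows the well-known lxml iterparse cleanup idiom; a minimal standalone sketch (requires lxml):

```python
import io
from lxml import etree

xml = b"<data><row><a>1</a></row><row><a>2</a></row><row><a>3</a></row></data>"

values = []
for event, elem in etree.iterparse(io.BytesIO(xml), events=("end",), tag="row"):
    values.append(elem.findtext("a"))
    # free this element's own children ...
    elem.clear()
    # ... then delete already-processed preceding siblings, so the
    # partially built tree does not keep growing with the file
    while elem.getprevious() is not None:
        del elem.getparent()[0]

print(values)  # ['1', '2', '3']
```

The reported TypeError is consistent with `elem.getparent()` returning None for some elements (e.g. the root), so guarding the deletion with a `getparent() is not None` check would presumably sidestep the crash; that is an assumption worth confirming in the bug report.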
* ENH: Add large file support for read_xml
* Combine tests, slightly fix docs
* Adjust pytest decorator on URL test; fix doc strings
* Adjust tests for helper function
* Add iterparse feature to some tests
* Add IO docs link in docstring