Commit fa7e31b

ENH: Add large file support for read_xml (#45724)
* ENH: Add large file support for read_xml
* Combine tests, slightly fix docs
* Adjust pytest decorator on URL test; fix doc strings
* Adjust tests for helper function
* Add iterparse feature to some tests
* Add IO docs link in docstring
1 parent ec3eedd commit fa7e31b

5 files changed

+721
-50
lines changed


doc/source/user_guide/io.rst

+39
@@ -3287,6 +3287,45 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:
     df = pd.read_xml(xml, stylesheet=xsl)
     df

+For very large XML files, which can range from hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
+supports parsing them with `lxml's iterparse`_ and `etree's iterparse`_,
+memory-efficient methods that iterate through an XML tree and extract specific elements and attributes
+without holding the entire tree in memory.
+
+.. versionadded:: 1.5.0
+
+.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
+.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
+
+To use this feature, you must pass a physical XML file path into ``read_xml`` and use the ``iterparse`` argument.
+The file must be stored on local disk; it cannot be compressed or point to an online source. Also, ``iterparse`` should be
+a dictionary where the key is the repeating node in the document (which becomes the rows) and the value is a list of
+any elements or attributes that are descendants (i.e., child, grandchild) of the repeating node. Since XPath is not
+used in this method, descendants do not need to share the same relationship with one another. The example below
+reads Wikipedia's very large (12 GB+) latest article data dump.
+
+.. code-block:: ipython
+
+    In [1]: df = pd.read_xml(
+       ...:     "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
+       ...:     iterparse={"page": ["title", "ns", "id"]}
+       ...: )
+    In [2]: df
+    Out[2]:
+                                                         title   ns        id
+    0                                       Gettysburg Address    0     21450
+    1                                                Main Page    0     42950
+    2                            Declaration by United Nations    0      8435
+    3             Constitution of the United States of America    0      8435
+    4                     Declaration of Independence (Israel)    0     17858
+    ...                                                    ...  ...       ...
+    3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
+    3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
+    3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
+    3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
+    3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450
+
+    [3578765 rows x 3 columns]

 .. _io.xml:

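The ``iterparse`` dictionary described in the io.rst hunk above can be tried on a tiny file before pointing it at a multi-gigabyte dump. The following is a minimal sketch, assuming pandas 1.5.0 or later. The sample document, its tag names (``library``, ``book``, ``details``) and the ``id`` attribute are hypothetical, and ``parser="etree"`` is chosen here only so that lxml is not required.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample document: tag and attribute names are made up for illustration.
XML = """<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="b1">
    <title>Dune</title>
    <details><year>1965</year></details>
  </book>
  <book id="b2">
    <title>Neuromancer</title>
    <details><year>1984</year></details>
  </book>
</library>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # iterparse needs an uncompressed file on local disk, so write one first.
    path = os.path.join(tmp_dir, "library.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(XML)

    # Key: the repeating node that becomes a row.
    # Value: descendants to extract. Here that is an attribute of the row
    # node ("id"), a child ("title"), and a grandchild ("year").
    df = pd.read_xml(
        path,
        iterparse={"book": ["id", "title", "year"]},
        parser="etree",
    )

print(df)
```

Because no XPath is involved, the three requested names sit at different depths under ``book`` and are still collected into the same row.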

doc/source/whatsnew/v1.5.0.rst

+37
@@ -162,6 +162,43 @@ apply converter methods, and parse dates (:issue:`43567`).
     df
     df.dtypes

+.. _whatsnew_150.read_xml_iterparse:
+
+read_xml now supports large XML using ``iterparse``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For very large XML files, which can range from hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
+now supports parsing them with `lxml's iterparse`_ and `etree's iterparse`_,
+memory-efficient methods that iterate through XML trees and extract specific elements
+and attributes without holding the entire tree in memory (:issue:`45442`).
+
+.. code-block:: ipython
+
+    In [1]: df = pd.read_xml(
+       ...:     "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
+       ...:     iterparse={"page": ["title", "ns", "id"]}
+       ...: )
+    In [2]: df
+    Out[2]:
+                                                         title   ns        id
+    0                                       Gettysburg Address    0     21450
+    1                                                Main Page    0     42950
+    2                            Declaration by United Nations    0      8435
+    3             Constitution of the United States of America    0      8435
+    4                     Declaration of Independence (Israel)    0     17858
+    ...                                                    ...  ...       ...
+    3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
+    3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
+    3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
+    3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
+    3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450
+
+    [3578765 rows x 3 columns]
+
+
+.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
+.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
+
 .. _whatsnew_150.api_breaking.api_breaking2:

 api_breaking_change2
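To make the "without holding the entire tree in memory" claim in the whatsnew entry more concrete, here is a rough sketch of the underlying streaming technique using only the standard library's ``xml.etree.ElementTree.iterparse``. It is not the code added by this commit. The helper name ``iter_rows`` is hypothetical, and the two sample ``page`` records simply reuse titles and ids from the output shown above.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a large dump; a real file would be opened from disk instead.
SAMPLE = b"""<pages>
  <page><title>Gettysburg Address</title><ns>0</ns><id>21450</id></page>
  <page><title>Main Page</title><ns>0</ns><id>42950</id></page>
</pages>"""


def iter_rows(source, row_tag, fields):
    """Yield one dict per ``row_tag`` element (hypothetical helper)."""
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            # Look up each requested field among the row's descendants.
            yield {name: elem.findtext(f".//{name}") for name in fields}
            # Drop the row's text and children once they have been read.
            elem.clear()


rows = list(iter_rows(io.BytesIO(SAMPLE), "page", ["title", "ns", "id"]))
print(rows)
```

Each row is turned into a plain dict as soon as its closing tag is seen, so only one ``page`` subtree needs to be populated at a time.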
