@@ -3287,6 +3287,45 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:
df = pd.read_xml(xml, stylesheet = xsl)
df

+ For very large XML files that can range from hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
+ supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_,
+ which are memory-efficient methods to iterate through an XML tree and extract specific elements and attributes
+ without holding the entire tree in memory.
+
+ .. versionadded:: 1.5.0
+
+ .. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
+ .. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
+
+ To use this feature, you must pass a physical XML file path into ``read_xml`` and use the ``iterparse`` argument.
+ Files should not be compressed or point to online sources but be stored on local disk. Also, ``iterparse`` should be
+ a dictionary where the key is the repeating node in the document (which becomes the rows) and the value is a list of
+ any elements or attributes that are descendants (i.e., child, grandchild) of the repeating node. Since XPath is not
+ used in this method, descendants do not need to share the same relationship with one another. The example below
+ reads in Wikipedia's very large (12 GB+) latest article data dump.
+
+ .. code-block:: ipython
+
+     In [1]: df = pd.read_xml(
+        ...:     "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
+        ...:     iterparse={"page": ["title", "ns", "id"]},
+        ...: )
+     In [2]: df
+     Out[2]:
+                                                          title   ns        id
+     0                                       Gettysburg Address    0     21450
+     1                                                Main Page    0     42950
+     2                            Declaration by United Nations    0      8435
+     3             Constitution of the United States of America    0      8435
+     4                     Declaration of Independence (Israel)    0     17858
+     ...                                                    ...  ...       ...
+     3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
+     3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
+     3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
+     3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
+     3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450
+
+     [3578765 rows x 3 columns]

.. _io.xml:
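
For intuition about the streaming pattern that the ``iterparse`` option in the hunk above relies on, here is a minimal sketch using only the standard library's ``xml.etree.ElementTree.iterparse``. It is not pandas's implementation: the helper ``iter_rows``, the ``SAMPLE_XML`` document, and its ``timestamp`` field are hypothetical names made up for illustration, and keeping only the first match per field is an assumption of this sketch.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical miniature document standing in for a large dump; not real data.
SAMPLE_XML = """<mediawiki>
  <page>
    <title>Sample A</title>
    <ns>0</ns>
    <revision><timestamp>2022-01-01</timestamp></revision>
  </page>
  <page>
    <title>Sample B</title>
    <ns>104</ns>
  </page>
</mediawiki>
"""


def iter_rows(path, row_tag, fields):
    """Yield one dict per ``row_tag`` element while reading the file incrementally."""
    row = None
    for event, elem in ET.iterparse(path, events=("start", "end")):
        if event == "start" and elem.tag == row_tag:
            row = {}  # a new repeating node begins: start a new row
        elif event == "end" and row is not None:
            if elem.tag in fields and elem.tag not in row:
                # Descendants may sit at any depth below the repeating node;
                # this sketch keeps the first match for each field name.
                row[elem.tag] = elem.text
            elif elem.tag == row_tag:
                # Fields that never appeared are reported as None.
                yield {name: row.get(name) for name in fields}
                row = None
                elem.clear()  # discard the finished node's children and text


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(SAMPLE_XML)
    # "page" is the repeating node; the list names its descendants to extract.
    rows = list(iter_rows(path, "page", ["title", "ns", "timestamp"]))

print(rows)
```

The ``row_tag`` and ``fields`` arguments play the roles of the key and value in the ``iterparse`` dictionary described in the added documentation. Note that ``timestamp`` is a grandchild of ``page`` while ``title`` and ``ns`` are children, which is the "descendants do not need to share the same relationship" case.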