BUG: read_xml not support large file #45442

Closed
2 of 3 tasks
wangrenz opened this issue Jan 18, 2022 · 5 comments
Labels
Bug IO XML read_xml, to_xml
Milestone
1.5
Comments

@wangrenz

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> df = pd.read_xml('202201140700.xml')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/miniconda3/envs/metenv/lib/python3.8/site-packages/pandas/io/xml.py", line 927, in read_xml
    return _parse(
  File "/home/miniconda3/envs/metenv/lib/python3.8/site-packages/pandas/io/xml.py", line 728, in _parse
    data_dicts = p.parse_data()
  File "/home/miniconda3/envs/metenv/lib/python3.8/site-packages/pandas/io/xml.py", line 391, in parse_data
    self.xml_doc = XML(self._parse_doc(self.path_or_buffer))
  File "/home/miniconda3/envs/metenv/lib/python3.8/site-packages/pandas/io/xml.py", line 553, in _parse_doc
    doc = fromstring(
  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1784, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 61504
lxml.etree.XMLSyntaxError: internal error: Huge input lookup, line 61504, column 702

Issue Description

lxml.etree.XMLSyntaxError: internal error: Huge input lookup, line 61504, column 702

Expected Behavior

read_xml needs to pass huge_tree=True to the lxml parser in pandas/io/xml.py.

for example:

Old parse_data:

self.xml_doc = XML(self._parse_doc(self.path_or_buffer))

New:

curr_parser = XMLParser(encoding=self.encoding, huge_tree=True)
self.xml_doc = XML(self._parse_doc(self.path_or_buffer), parser=curr_parser)

Old _parse_doc:

curr_parser = XMLParser(encoding=self.encoding)

New:

curr_parser = XMLParser(encoding=self.encoding, huge_tree=True)
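As context for the proposed change: libxml2 rejects single text nodes larger than roughly 10 MB by default, and lxml's huge_tree flag lifts that cap. A minimal standalone sketch (plain lxml with synthetic data, not pandas internals) reproducing the error and the effect of the flag:

```python
import lxml.etree as etree

# Build a document whose single text node exceeds libxml2's
# default ~10 MB text-node limit.
big = "<root><row><text>" + "x" * 11_000_000 + "</text></row></root>"

try:
    etree.fromstring(big)
except etree.XMLSyntaxError as err:
    # fails with a "huge text node" / "Huge input lookup" style error
    print("default parser failed:", err)

# huge_tree=True lifts the limit, as proposed above.
parser = etree.XMLParser(huge_tree=True)
root = etree.fromstring(big, parser=parser)
print(len(root[0][0].text))  # 11000000
```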

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit : f00ed8f
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1127.13.1.el7.x86_64
Version : #1 SMP Tue Jun 23 15:46:38 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.22.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.19.0
xlrd : None
xlwt : None
numba : None

@wangrenz wangrenz added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 18, 2022
@ParfaitG ParfaitG added the IO XML read_xml, to_xml label Jan 23, 2022
@ParfaitG
Contributor

ParfaitG commented Jan 23, 2022

Thanks @wangrenz! This is a great use case. Large XML file support was in the works for both the lxml and etree parsers (see #40131). We may not need an additional argument; instead, read_xml could catch this exception and attempt the lxml workaround.

How large was your XML file? Can you post a reproducible example of its content (redact as needed)?

@lithomas1 lithomas1 removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jan 23, 2022
@ParfaitG
Contributor

Using the Wikipedia latest pages-articles dump, a 2.7 GB bzip2 archive that decompresses to 12.4 GB, I am unable to reproduce your lxml error. In fact, on my Ubuntu laptop with 8 GB RAM, the read_xml attempt was killed by the OS.

However, iterparse worked great for both lxml and etree, and could be an approach for large XML files in read_xml: since the entire tree is not read at once, you can process elements one by one and even delete them after use to avoid growing an enormous tree in memory. See the implementation below.

The idea is that users pass an iterparse_items dict parameter, where the key is the repeating element in the document and the value is a list of any descendants or attributes located anywhere under that repeating element. This argument would be used in lieu of default XPath parsing, and works for users who need tags or attributes in heavily nested XML documents that are unrelated to each other except as descendants of the repeating element.

import os

import pandas as pd
import xml.etree.ElementTree as ET
# import lxml.etree as ET  # works with either parser

xml_file = "~/Downloads/enwikisource-latest-pages-articles.xml"
iterparse_items = {"page": ["title", "ns", "id"]}

data = []
node = next(iter(iterparse_items))  # the repeating element, "page"

for event, elem in ET.iterparse(
    os.path.expanduser(xml_file), events=('start', 'end')
):
    # strip any namespace prefix from the tag
    curr_elem = elem.tag.split('}')[1] if '}' in elem.tag else elem.tag

    if event == 'start':
        if curr_elem == node:
            row = {}

        for col in iterparse_items[node]:
            if curr_elem == col:
                row[col] = (
                    elem.text.strip()
                    if elem.text is not None
                    else elem.text
                )
            if col in elem.attrib:
                row[col] = elem.attrib[col].strip()

    if event == 'end':
        if curr_elem == node:
            data.append(row)
        # free the completed element to keep memory flat
        elem.clear()

df = pd.DataFrame(data)
print(df)
                                                     title   ns        id
0                                       Gettysburg Address    0     21450
1                                                Main Page    0     42950
2                            Declaration by United Nations    0      8435
3             Constitution of the United States of America    0      8435
4                     Declaration of Independence (Israel)    0     17858
...                                                    ...  ...       ...
3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

[3578765 rows x 3 columns]

@wangrenz
Author

Thank you @ParfaitG. My XML file is as follows:

https://raw.githubusercontent.com/wangrenz/temp_repo/master/202201150000.xml.bz2

The XML file is about 60 MB.

@ParfaitG
Contributor

What are you attempting to extract from this XML? What is its structure? Without a specific xpath, read_xml defaults to only the children of the root, /*. Recall that XML is an open-ended design format and can range from flat to very dense.

For pandas DataFrames, flat, shallow XML with repeated nodes is expected, matching their two-dimensional structure of rows by columns. See the Notes section of pandas.read_xml. If needed, you can use XSLT via lxml to flatten the document.
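To illustrate the XSLT approach, here is a small sketch with a hypothetical nested document (the element names are invented, not from the reporter's file). The stylesheet flattens each repeating <station> into a shallow <row> suitable for a DataFrame:

```python
import lxml.etree as etree

# Hypothetical nested document standing in for the reporter's file
src = etree.fromstring("""\
<stations>
  <station>
    <name>A01</name>
    <obs><temp>21.5</temp><wind>3.2</wind></obs>
  </station>
  <station>
    <name>B02</name>
    <obs><temp>19.0</temp><wind>5.1</wind></obs>
  </station>
</stations>""")

# XSLT that flattens each <station> into a shallow <row>
xslt = etree.XML("""\
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/stations">
    <data>
      <xsl:for-each select="station">
        <row>
          <name><xsl:value-of select="name"/></name>
          <temp><xsl:value-of select="obs/temp"/></temp>
          <wind><xsl:value-of select="obs/wind"/></wind>
        </row>
      </xsl:for-each>
    </data>
  </xsl:template>
</xsl:stylesheet>""")

flat = etree.XSLT(xslt)(src)
print(etree.tostring(flat, pretty_print=True).decode())
```

Note that read_xml also accepts a stylesheet argument (lxml parser only), which applies such a transform before parsing.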

@jreback jreback added this to the 1.5 milestone Mar 18, 2022
@mroeschke
Member

Closed by #45724
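That PR added an iterparse argument to read_xml, following the streaming approach sketched above. A minimal usage sketch, assuming pandas >= 1.5 (the data and file are invented for illustration; iterparse only accepts a path to a local file, not a buffer or URL):

```python
import os
import tempfile

import pandas as pd

xml = """<data>
  <row><a>1</a><b>x</b></row>
  <row><a>2</a><b>y</b></row>
</data>"""

# iterparse streams the document element-by-element instead of
# loading the whole tree, so it needs a file on disk.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write(xml)
    path = f.name

# key: the repeating element; value: descendants/attributes to extract
df = pd.read_xml(path, iterparse={"row": ["a", "b"]})
os.unlink(path)
print(df)
```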
