BUG: read_xml not support large file #45442

Closed
2 of 3 tasks
wangrenz opened this issue Jan 18, 2022 · 5 comments
Labels
Bug IO XML read_xml, to_xml
Milestone
1.5
Comments

@wangrenz

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> df = pd.read_xml('202201140700.xml')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/miniconda3/envs/metenv/lib/python3.8/site-packages/pandas/io/xml.py", line 927, in read_xml
    return _parse(
  File "/home/miniconda3/envs/metenv/lib/python3.8/site-packages/pandas/io/xml.py", line 728, in _parse
    data_dicts = p.parse_data()
  File "/home/miniconda3/envs/metenv/lib/python3.8/site-packages/pandas/io/xml.py", line 391, in parse_data
    self.xml_doc = XML(self._parse_doc(self.path_or_buffer))
  File "/home/miniconda3/envs/metenv/lib/python3.8/site-packages/pandas/io/xml.py", line 553, in _parse_doc
    doc = fromstring(
  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1784, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 61504
lxml.etree.XMLSyntaxError: internal error: Huge input lookup, line 61504, column 702

Issue Description

lxml.etree.XMLSyntaxError: internal error: Huge input lookup, line 61504, column 702

Expected Behavior

read_xml needs to pass huge_tree=True to the lxml parser in pandas/io/xml.py.

for example:

Old parse_data:

self.xml_doc = XML(self._parse_doc(self.path_or_buffer))

New:

curr_parser = XMLParser(encoding=self.encoding, huge_tree=True)
self.xml_doc = XML(self._parse_doc(self.path_or_buffer), parser=curr_parser)

Old _parse_doc:

curr_parser = XMLParser(encoding=self.encoding)

New:

curr_parser = XMLParser(encoding=self.encoding, huge_tree=True)
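As context for the proposed change: libxml2 rejects single text nodes larger than roughly 10 MB by default, and lxml's huge_tree flag lifts that cap. A minimal standalone sketch (plain lxml with synthetic data, not pandas internals) reproducing the error and the effect of the flag:

```python
import lxml.etree as etree

# Build a document whose single text node exceeds libxml2's
# default ~10 MB text-node limit.
big = "<root><row><text>" + "x" * 11_000_000 + "</text></row></root>"

try:
    etree.fromstring(big)
except etree.XMLSyntaxError as err:
    # fails with a "huge text node" / "Huge input lookup" style error
    print("default parser failed:", err)

# huge_tree=True lifts the limit, as proposed above.
parser = etree.XMLParser(huge_tree=True)
root = etree.fromstring(big, parser=parser)
print(len(root[0][0].text))  # 11000000
```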

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit : f00ed8f
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1127.13.1.el7.x86_64
Version : #1 SMP Tue Jun 23 15:46:38 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.22.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.19.0
xlrd : None
xlwt : None
numba : None

@wangrenz wangrenz added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 18, 2022
@ParfaitG ParfaitG added the IO XML read_xml, to_xml label Jan 23, 2022
@ParfaitG
Contributor

ParfaitG commented Jan 23, 2022

Thanks @wangrenz! This is a great use case. Large XML file support was in the works for both the lxml and etree parsers (see #40131). We may not need an additional argument; instead, read_xml could catch this exception and attempt the lxml workaround.

How large was your XML file? Can you post a reproducible example of its content (redact as needed)?

@lithomas1 lithomas1 removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jan 23, 2022
@ParfaitG
Contributor

Using the Wikipedia latest pages-articles dump, a 2.7 GB bzip2 archive that decompresses to 12.4 GB, I am unable to reproduce your lxml error. In fact, on my Ubuntu laptop with 8 GB RAM, the read_xml attempt was killed by the OS.

However, iterparse worked great for both lxml and etree, and could be an approach for large XML files in read_xml: since the entire tree is not read at once, you can process elements one by one and even delete them after use to avoid growing an enormous tree in memory. See the implementation below.

The idea is that users pass an iterparse_items dict parameter, where the key is the repeating element in the document and the value is a list of any descendants or attributes located anywhere under that repeating element. This argument would be used in lieu of default XPath parsing, and works for users who need tags or attributes in heavily nested XML documents that are unrelated to each other except as descendants of the repeating element.

import os

import pandas as pd
import xml.etree.ElementTree as ET
# import lxml.etree as ET  # works with either parser

xml_file = "~/Downloads/enwikisource-latest-pages-articles.xml"
iterparse_items = {"page": ["title", "ns", "id"]}

data = []
node = next(iter(iterparse_items))  # the repeating element, "page"

for event, elem in ET.iterparse(
    os.path.expanduser(xml_file), events=('start', 'end')
):
    # strip any namespace prefix from the tag
    curr_elem = elem.tag.split('}')[1] if '}' in elem.tag else elem.tag

    if event == 'start':
        if curr_elem == node:
            row = {}

        for col in iterparse_items[node]:
            if curr_elem == col:
                row[col] = (
                    elem.text.strip()
                    if elem.text is not None
                    else elem.text
                )
            if col in elem.attrib:
                row[col] = elem.attrib[col].strip()

    if event == 'end':
        if curr_elem == node:
            data.append(row)
        # free the completed element to keep memory flat
        elem.clear()

df = pd.DataFrame(data)
print(df)
                                                     title   ns        id
0                                       Gettysburg Address    0     21450
1                                                Main Page    0     42950
2                            Declaration by United Nations    0      8435
3             Constitution of the United States of America    0      8435
4                     Declaration of Independence (Israel)    0     17858
...                                                    ...  ...       ...
3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

[3578765 rows x 3 columns]

@wangrenz
Author

Thank you @ParfaitG. My XML file is as follows:

https://raw.githubusercontent.com/wangrenz/temp_repo/master/202201150000.xml.bz2

The XML file is about 60 MB.

@ParfaitG
Contributor

What are you attempting to extract from this XML? What is its structure? Without a specific xpath, read_xml defaults to only the children of the root, /*. Recall that XML is an open-ended design format and can range from flat to very dense.

For pandas DataFrames, flat, shallow XML with repeated nodes is expected, matching their two-dimensional structure of rows by columns. See the Notes section of pandas.read_xml. If needed, you can use XSLT via lxml to flatten the document.
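To illustrate the XSLT approach, here is a small sketch with a hypothetical nested document (the element names are invented, not from the reporter's file). The stylesheet flattens each repeating <station> into a shallow <row> suitable for a DataFrame:

```python
import lxml.etree as etree

# Hypothetical nested document standing in for the reporter's file
src = etree.fromstring("""\
<stations>
  <station>
    <name>A01</name>
    <obs><temp>21.5</temp><wind>3.2</wind></obs>
  </station>
  <station>
    <name>B02</name>
    <obs><temp>19.0</temp><wind>5.1</wind></obs>
  </station>
</stations>""")

# XSLT that flattens each <station> into a shallow <row>
xslt = etree.XML("""\
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/stations">
    <data>
      <xsl:for-each select="station">
        <row>
          <name><xsl:value-of select="name"/></name>
          <temp><xsl:value-of select="obs/temp"/></temp>
          <wind><xsl:value-of select="obs/wind"/></wind>
        </row>
      </xsl:for-each>
    </data>
  </xsl:template>
</xsl:stylesheet>""")

flat = etree.XSLT(xslt)(src)
print(etree.tostring(flat, pretty_print=True).decode())
```

Note that read_xml also accepts a stylesheet argument (lxml parser only), which applies such a transform before parsing.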

@jreback jreback added this to the 1.5 milestone Mar 18, 2022
@mroeschke
Member

Closed by #45724
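That PR added an iterparse argument to read_xml, following the streaming approach sketched above. A minimal usage sketch, assuming pandas >= 1.5 (the data and file are invented for illustration; iterparse only accepts a path to a local file, not a buffer or URL):

```python
import os
import tempfile

import pandas as pd

xml = """<data>
  <row><a>1</a><b>x</b></row>
  <row><a>2</a><b>y</b></row>
</data>"""

# iterparse streams the document element-by-element instead of
# loading the whole tree, so it needs a file on disk.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write(xml)
    path = f.name

# key: the repeating element; value: descendants/attributes to extract
df = pd.read_xml(path, iterparse={"row": ["a", "b"]})
os.unlink(path)
print(df)
```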
