
BUG: iterparse on read_xml overwrites with attributes on child elements #47343

Closed
3 tasks done
bailsman opened this issue Jun 14, 2022 · 3 comments · Fixed by #47414
Labels: Bug, IO XML (read_xml, to_xml)

Comments

@bailsman

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from tempfile import NamedTemporaryFile
import pandas as pd

XML = '''
<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>
'''.encode('utf-8')

with NamedTemporaryFile() as tempfile:
    tempfile.write(XML)
    tempfile.flush()
    df = pd.read_xml(tempfile.name, iterparse={'issue': ['type']})
    issue_type = df.iloc[0]['type']
    print(f"issue_type: expecting BUG, got {issue_type}")

Issue Description

The iterparse code reads attributes from all child elements, rather than only from the main (row) element, so a child's attribute can overwrite the value of a same-named element.

This is probably unintentional: in the example above, the user is most likely interested in the type of the issue rather than the type attribute on some child element.

This feature is planned for 1.5.0 and not yet released. See: #45724 (comment)

Expected Behavior

Attributes are parsed only on the top-level element.

Installed Versions

5465f54

bailsman added the Bug and Needs Triage labels on Jun 14, 2022
phofl added the IO XML label on Jun 14, 2022
@ParfaitG
Contributor

Interesting side effect and possibly an edge case. The parser builds a dictionary using element or attribute names as keys. Since the same-named attribute comes after the same-named element, the dictionary retains the last value. Reverse the reporter and type nodes and you will get BUG.
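
The overwrite described above can be reproduced outside pandas. The following is a minimal sketch using only the standard library; `parse_rows` is a hypothetical, simplified stand-in for the row-building loop, not the actual pandas implementation, and it only inspects "end" events for brevity.

```python
import io
import xml.etree.ElementTree as ET

XML = b"""<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>"""

# Same document with the two child nodes swapped.
REVERSED = b"""<issue>
  <reporter type="newbie">Emanuel</reporter>
  <type>BUG</type>
</issue>"""


def parse_rows(xml_bytes, row_node, cols):
    """Hypothetical simplified row builder (an assumption, not pandas code)."""
    rows, row = [], None
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            if elem.tag == row_node:
                row = {}  # a new row begins at the row element
            continue
        # "end" event: the element's text and attributes are complete.
        if row is not None:
            for col in cols:
                if elem.tag == col:
                    row[col] = elem.text.strip() if elem.text else None
                if col in elem.attrib:
                    # Same dictionary key as the element name, so the
                    # attribute value replaces whatever was stored earlier.
                    row[col] = elem.attrib[col]
        if elem.tag == row_node and row is not None:
            rows.append(row)
            row = None
    return rows


rows = parse_rows(XML, "issue", ["type"])
reversed_rows = parse_rows(REVERSED, "issue", ["type"])
print(rows)           # [{'type': 'newbie'}]
print(reversed_rows)  # [{'type': 'BUG'}]
```

Whichever of the two same-named items is visited last wins, which is why swapping the nodes changes the result.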

Since iterparse does not know levels (i.e., parent, child) but indiscriminately walks down the tree, parsing and then discarding each element, the behavior should align more closely with the names argument. iterparse already supports names, but applies it after parsing rather than during parsing as the XPath approach does. Per the docs, with emphasis:

Column names for DataFrame of parsed XML data. Use this parameter to rename original element names and distinguish same named elements.

So, one thought is the user should pass names to rename duplicate XML names.

df = pd.read_xml(tempfile.name, iterparse={'issue': ['type', 'type']}, names=["type_elem", "type_attr"])

df
  type_elem type_attr
0       BUG    newbie

However, given how iterparse iterates down the tree, passing the same name twice currently causes it to repeatedly overwrite the dictionary value. I will give this some thought. For now, if the XML really isn't that large and users are using iterparse in lieu of xpath, consider running XSLT via the stylesheet argument to rename the same-named items, then iterparse the document.

simonjayhawkins removed the Needs Triage label on Jun 16, 2022
@ParfaitG
Contributor

I have an iterparse solution in the works that uses the names argument consistently with the xpath approach: the logic checks the dictionary's keys and values before assignment. The solution passes all current unit tests for both the lxml and etree parsers, plus a new test inspired by your example. PR forthcoming.

    if row is not None:
        if self.names:
            for col, nm in zip(self.iterparse[row_node], self.names):
                if curr_elem == col:
                    elem_val = elem.text.strip() if elem.text else None
                    if elem_val not in row.values() and nm not in row:
                        row[nm] = elem_val
                if col in elem.attrib:
                    if elem.attrib[col] not in row.values() and nm not in row:
                        row[nm] = elem.attrib[col]
        else:
            for col in self.iterparse[row_node]:
                if curr_elem == col:
                    row[col] = elem.text.strip() if elem.text else None
                if col in elem.attrib:
                    row[col] = elem.attrib[col]
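
To see how the key-and-value check in the names branch behaves on the original example, here is a self-contained sketch built on the standard library. `parse_rows_with_names` is a hypothetical wrapper written for illustration; only the inner loop mirrors the snippet above, and the surrounding event handling (including looking at "end" events only) is an assumption.

```python
import io
import xml.etree.ElementTree as ET

XML = b"""<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>"""


def parse_rows_with_names(xml_bytes, row_node, cols, names):
    """Hypothetical wrapper around the names-aware assignment logic."""
    rows, row = [], None
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            if elem.tag == row_node:
                row = {}  # a new row begins at the row element
            continue
        if row is not None:
            # Pair each requested XML name with its output column name.
            for col, nm in zip(cols, names):
                if elem.tag == col:
                    elem_val = elem.text.strip() if elem.text else None
                    # Assign only if the value is new and the column is unset.
                    if elem_val not in row.values() and nm not in row:
                        row[nm] = elem_val
                if col in elem.attrib:
                    attr_val = elem.attrib[col]
                    if attr_val not in row.values() and nm not in row:
                        row[nm] = attr_val
        if elem.tag == row_node and row is not None:
            rows.append(row)
            row = None
    return rows


named = parse_rows_with_names(
    XML, "issue", ["type", "type"], ["type_elem", "type_attr"]
)
swapped = parse_rows_with_names(
    XML, "issue", ["type", "type"], ["type_attr", "type_elem"]
)
print(named)    # [{'type_elem': 'BUG', 'type_attr': 'newbie'}]
print(swapped)  # [{'type_attr': 'BUG', 'type_elem': 'newbie'}]
```

In this sketch the first unused name goes to the first matching value in document order, so the order of names has to follow the order in which the element and attribute appear in the XML.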

@bailsman
Author

You have to be very careful to put your names in the correct order (maybe someone will say 'Well... obviously!') - but once you do that it works perfectly! Thanks!
