BUG: iterparse on read_xml overwrites with attributes on child elements #47343

bailsman · 2022-06-14T12:17:24Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from tempfile import NamedTemporaryFile
import pandas as pd

XML = '''
<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>
'''.encode('utf-8')

with NamedTemporaryFile() as tempfile:
    tempfile.write(XML)
    tempfile.flush()
    df = pd.read_xml(tempfile.name, iterparse={'issue': ['type']})
    issue_type = df.iloc[0]['type']
    print(f"issue_type: expecting BUG, got {issue_type}")

Issue Description

The parsing code parses attributes on all child elements, rather than just on the main element.

This is probably unintentional: in the example above, the user is most likely interested in the type of the issue (the type element), not the type attribute on some child element.

This feature is planned for 1.5.0 and not yet released. See: #45724 (comment)

Expected Behavior

Attributes are parsed only on the top-level element.

Installed Versions

5465f54

ParfaitG · 2022-06-15T02:45:24Z

Interesting side effect and possibly an edge case. The parser builds a dictionary using element or attribute names as keys. Because the same-named attribute comes after the same-named element, the dictionary retains the last value. Reverse the order of the reporter and type nodes and you will get BUG.
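To make the overwrite concrete, here is a minimal sketch using only the standard library. It is a simplified model of the behavior described above, not the pandas implementation; the helper name `parse_rows` is hypothetical, and reading values on the "end" event is an assumption made for the sketch.

```python
import io
import xml.etree.ElementTree as ET

XML = b"""
<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>
"""


def parse_rows(xml_bytes, row_node, cols):
    """Hypothetical helper: one dict per row_node, keyed by element or attribute name."""
    rows, row = [], None
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start" and elem.tag == row_node:
            row = {}  # a new row begins
        if row is not None and event == "end":
            for col in cols:
                if elem.tag == col:
                    row[col] = elem.text.strip() if elem.text else None
                if col in elem.attrib:
                    # Same key as the element above, so the later value wins.
                    row[col] = elem.attrib[col]
        if event == "end" and elem.tag == row_node:
            rows.append(row)
            row = None
    return rows


print(parse_rows(XML, "issue", ["type"]))
```

With the reporter node after the type node, the single "type" key ends up holding the attribute value; with the two nodes swapped, the element text is written last and BUG is kept.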

Because iterparse has no notion of levels (i.e., parent, child) and indiscriminately walks down the tree, parsing and then discarding each element, its behavior should align more closely with the names argument. iterparse supports names, but applies it after parsing rather than during parsing as the XPath approach does. Per the docs:

Column names for DataFrame of parsed XML data. Use this parameter to rename original element names and distinguish same named elements.

So, one thought is the user should pass names to rename duplicate XML names.

df = pd.read_xml(tempfile.name, iterparse={'issue': ['type', 'type']}, names=["type_elem", "type_attr"])

df
  type_elem type_attr
0       BUG    newbie

However, given how iterparse iterates down the tree, passing the same name twice causes it to repeatedly overwrite the dictionary value. I will give this some thought. For now, if the XML is not that large and you are using iterparse in lieu of xpath, consider running XSLT via the stylesheet argument to rename the same-named items, then iterparse the resulting document.

ParfaitG · 2022-06-17T02:01:32Z

I have an iterparse solution in the works that uses the names argument consistently with the xpath approach: the logic checks the dictionary's keys and values before assignment. The solution passes all current unit tests for both the lxml and etree parsers, plus a new test inspired by your example. PR forthcoming.

    if row is not None:
        if self.names:
            # Pair each requested column with its replacement name.
            for col, nm in zip(self.iterparse[row_node], self.names):
                if curr_elem == col:
                    elem_val = elem.text.strip() if elem.text else None
                    # Assign only if this name is still unset and the value is new.
                    if elem_val not in row.values() and nm not in row:
                        row[nm] = elem_val
                if col in elem.attrib:
                    if elem.attrib[col] not in row.values() and nm not in row:
                        row[nm] = elem.attrib[col]
        else:
            # Without names, a later same-named element or attribute still overwrites.
            for col in self.iterparse[row_node]:
                if curr_elem == col:
                    row[col] = elem.text.strip() if elem.text else None
                if col in elem.attrib:
                    row[col] = elem.attrib[col]
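The key-and-value check can be tried outside pandas with a standalone sketch built on the standard library's ElementTree. This is an illustration under assumptions, not the code from the PR: the function name `parse_rows_named` is hypothetical, and values are read on the "end" event for simplicity.

```python
import io
import xml.etree.ElementTree as ET

XML = b"""
<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>
"""


def parse_rows_named(xml_bytes, row_node, cols, names):
    """Hypothetical helper: like the snippet above, but self-contained."""
    rows, row = [], None
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start" and elem.tag == row_node:
            row = {}
        if row is not None and event == "end":
            for col, nm in zip(cols, names):
                if elem.tag == col:
                    val = elem.text.strip() if elem.text else None
                    # Fill a name only once, and only with a value not yet stored.
                    if val not in row.values() and nm not in row:
                        row[nm] = val
                if col in elem.attrib:
                    if elem.attrib[col] not in row.values() and nm not in row:
                        row[nm] = elem.attrib[col]
        if event == "end" and elem.tag == row_node:
            rows.append(row)
            row = None
    return rows


print(parse_rows_named(XML, "issue", ["type", "type"], ["type_elem", "type_attr"]))
```

In this sketch the names are consumed in document order, so swapping them to ["type_attr", "type_elem"] would label the element text as the attribute and vice versa. The value check also means that a second item carrying an already-stored value is skipped.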

bailsman · 2022-06-22T23:51:32Z

You have to be very careful to put your names in the correct order (maybe someone will say 'Well... obviously!') - but once you do that it works perfectly! Thanks!

bailsman added the Bug and Needs Triage labels Jun 14, 2022

phofl added the IO XML label Jun 14, 2022

simonjayhawkins removed the Needs Triage label Jun 16, 2022

This was referenced Jun 18, 2022

BUG: iterparse of read_xml not parsing duplicate element and attribute names #47409

Closed

BUG: iterparse of read_xml not parsing duplicate element and attribute names #47414

Merged

mroeschke closed this as completed in #47414 Jun 21, 2022

ParfaitG mentioned this issue Jul 8, 2022

BUG: using read_xml with iterparse and names will ignore duplicate values #47483

Closed

bama-chi mentioned this issue Feb 5, 2023

BUG: iterparse on read_xml overwrites nested child elements #51183

Closed
