BUG: iterparse on read_xml overwrites with attributes on child elements #47343
Comments
Interesting side effect and possibly an edge case. The parser builds a dictionary using element or attribute names as keys. Since the same-named attribute comes after the same-named element, the dictionary retains the last value. Reverse the reporter and type nodes and you will then return
Since iterparse does not know levels (i.e., parent, child) but indiscriminately walks down the tree, parsing and then discarding each element, behavior should align more with
So, one thought is that the user should pass names to rename duplicate XML names:

    df = pd.read_xml(tempfile.name, iterparse={'issue': ['type', 'type']}, names=["type_elem", "type_attr"])
    df
      type_elem type_attr
    0       BUG    newbie

However, given how iterparse iterates down the tree, passing the same names causes it to repeatedly overwrite the dictionary value. Will give this some thought. For now, if XML really isn't that large and users are using
I have an

    if row is not None:
        if self.names:
            for col, nm in zip(self.iterparse[row_node], self.names):
                if curr_elem == col:
                    elem_val = elem.text.strip() if elem.text else None
                    if elem_val not in row.values() and nm not in row:
                        row[nm] = elem_val
                if col in elem.attrib:
                    if elem.attrib[col] not in row.values() and nm not in row:
                        row[nm] = elem.attrib[col]
        else:
            for col in self.iterparse[row_node]:
                if curr_elem == col:
                    row[col] = elem.text.strip() if elem.text else None
                if col in elem.attrib:
                    row[col] = elem.attrib[col]
You have to be very careful to put your names in the correct order (maybe someone will say 'Well... obviously!') - but once you do that it works perfectly! Thanks!
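To illustrate why the order of names matters, here is a minimal standalone sketch of the names-based logic. It uses the standard library's `xml.etree.ElementTree.iterparse` rather than pandas internals, and it only reads values on "end" events. The XML document and the function name `parse_with_names` are assumptions made for illustration; the issue's original XML is not reproduced in this thread.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical XML: an element named "type" plus a "type" attribute on a sibling.
XML = b"""<issues>
  <issue>
    <type>BUG</type>
    <reporter type="newbie">someone</reporter>
  </issue>
</issues>"""


def parse_with_names(xml_bytes, row_node, cols, names):
    """Collect one dict per row_node, pairing each col with a name by position."""
    rows, row = [], None
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            if elem.tag == row_node:
                row = {}
            continue  # element text is only reliable on "end" events
        if row is not None:
            for col, nm in zip(cols, names):
                if elem.tag == col:
                    val = elem.text.strip() if elem.text else None
                    # Skip values already stored and names already filled.
                    if val not in row.values() and nm not in row:
                        row[nm] = val
                if col in elem.attrib:
                    if elem.attrib[col] not in row.values() and nm not in row:
                        row[nm] = elem.attrib[col]
        if elem.tag == row_node and row is not None:
            rows.append(row)
            row = None
        elem.clear()  # discard the element once handled
    return rows


in_order = parse_with_names(XML, "issue", ["type", "type"], ["type_elem", "type_attr"])
reversed_names = parse_with_names(XML, "issue", ["type", "type"], ["type_attr", "type_elem"])
```

With this input, `in_order` pairs `type_elem` with `BUG` and `type_attr` with `newbie`, while `reversed_names` attaches the labels the other way round, because names are filled in document order and never overwritten.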
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The parsing code parses attributes on all child elements, rather than just on the main element.
This is probably unintentional - in the example above, the user is most likely interested in the type of the incident rather than the type attribute on some child element.
This feature is planned for 1.5.0 and not yet released. See: #45724 (comment)
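The overwrite described in this section can be reproduced outside pandas with a rough sketch like the one below. It assumes a hypothetical XML document (the original reproducible example is not shown here) and a hypothetical helper `naive_rows` that keys a dictionary on element or attribute names, in the same spirit as the parser but built on the standard library only.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical XML: the "type" attribute on <reporter> follows the <type> element.
XML = b"""<issues>
  <issue>
    <type>BUG</type>
    <reporter type="newbie">someone</reporter>
  </issue>
</issues>"""


def naive_rows(xml_bytes, row_node, cols):
    """Build one dict per row_node without tracking parent/child levels."""
    rows, row = [], None
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            if elem.tag == row_node:
                row = {}
            continue  # element text is only reliable on "end" events
        if row is not None:
            for col in cols:
                if elem.tag == col:
                    row[col] = elem.text.strip() if elem.text else None
                if col in elem.attrib:
                    # An attribute on ANY descendant overwrites the same key.
                    row[col] = elem.attrib[col]
        if elem.tag == row_node and row is not None:
            rows.append(row)
            row = None
        elem.clear()
    return rows


result = naive_rows(XML, "issue", ["type"])
```

Here `result` holds `newbie` under the `type` key, not `BUG`: the later attribute silently replaces the earlier element text.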
Expected Behavior
Attributes are parsed only on the top-level (row) element.
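One possible reading of this expectation, sketched with the standard library only: element text is taken from direct children of the row element, and attributes are read from the row element itself. The XML, the `id` attribute, and the function name `rows_top_level_attrs` are assumptions for illustration, and this is not how pandas implements `iterparse`.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical XML with an attribute on the row element and on a child.
XML = b"""<issues>
  <issue id="1">
    <type>BUG</type>
    <reporter type="newbie">someone</reporter>
  </issue>
</issues>"""


def rows_top_level_attrs(xml_bytes, row_node, cols):
    """Read child text and row-level attributes; ignore child attributes."""
    rows = []
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag != row_node:
            continue  # children stay attached until their row element ends
        row = {}
        for col in cols:
            child = elem.find(col)  # direct children only
            if child is not None:
                row[col] = child.text.strip() if child.text else None
            if col in elem.attrib:  # attributes of the row element only
                row[col] = elem.attrib[col]
        rows.append(row)
        elem.clear()  # free the finished row and its children
    return rows


expected = rows_top_level_attrs(XML, "issue", ["id", "type"])
```

With this input, `type` keeps the element text `BUG`, and the `type` attribute on `reporter` is ignored.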
Installed Versions
5465f54