BUG: using read_xml with iterparse and names will ignore duplicate values #47483

Closed
3 tasks done
bailsman opened this issue Jun 23, 2022 · 4 comments · Fixed by #47630
Labels
Bug, IO XML (read_xml, to_xml)

Comments

@bailsman
Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from tempfile import NamedTemporaryFile
import pandas as pd

XML = '''
<issue>
  <type>BUG</type>
  <category>BUG</category>
</issue>
'''.encode('utf-8')

with NamedTemporaryFile() as tempfile:
    tempfile.write(XML)
    tempfile.flush()
    df = pd.read_xml(tempfile.name, iterparse={'issue': ['type', 'category']})
    assert df.iloc[0]['category'] == 'BUG'  # passes: without names, both columns are populated
    df = pd.read_xml(tempfile.name, iterparse={'issue': ['type', 'category']}, names=['type', 'cat'])
    assert df.iloc[0]['cat'] == 'BUG'  # fails: 'cat' is never set because its value duplicates that of 'type'

Issue Description

When using the names argument to rename columns, any value that duplicates a previous value in the same row is ignored entirely.

Note: this issue concerns a feature that is not yet released (planned for 1.5.0); see #47414.

Expected Behavior

type and cat should both get set to "BUG". Even though the value is duplicated, it's a separate piece of information in the xml.
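To make the expectation concrete, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree` (not pandas) to show that the two elements are distinct nodes that happen to share the same text. The `names` list and the `expected_row` dictionary are illustrative names, not part of any pandas API.

```python
import xml.etree.ElementTree as ET

XML = b"""
<issue>
  <type>BUG</type>
  <category>BUG</category>
</issue>
"""

# Parse the document and list each child element with its text.
root = ET.fromstring(XML)
children = [(child.tag, child.text) for child in root]

# Renaming the columns positionally should keep both values,
# even though the two elements carry identical text.
names = ["type", "cat"]
expected_row = {nm: text for nm, (_tag, text) in zip(names, children)}

print(children)
print(expected_row)
```

Both elements appear in `children`, so a row built from them should carry a value for `cat` as well as for `type`.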

Installed Versions

d43d6e2

@bailsman added the Bug and Needs Triage labels Jun 23, 2022
@haydenw2005
take

@haydenw2005 removed their assignment Jun 24, 2022
@haydenw2005
take

@simonjayhawkins added the IO XML (read_xml, to_xml) label Jun 24, 2022
@haydenw2005
I suspect the bug is on line 345 of `pandas/io/xml.py`. Changing `if elem_val not in row.values() and nm not in row:` to `if nm not in row:` seems to be the fix, though I do not have time to finish writing test cases (I suddenly picked up a new job) and am not entirely sure that it is correct. Sorry for holding the issue; hopefully this helps.

@haydenw2005 removed their assignment Jul 3, 2022
@ParfaitG
Contributor

ParfaitG commented Jul 8, 2022

Thanks @haydenw2005! You are very close. Line 345 should check key and value together, `if row.get(nm) != elem_val and nm not in row`, rather than values alone, which was the fix for a recent issue, #47343, also raised by the OP.
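The three conditions discussed in this thread can be compared with a small, self-contained model. This is a simplified sketch, not the pandas source: `build_row`, the `(tag, text)` pairs, and the predicate names are hypothetical stand-ins for the row-filling loop around line 345, and only the condition expressions themselves are taken from the comments above.

```python
def build_row(elements, columns, names, should_set):
    """Fill a row dict from (tag, text) pairs, renaming columns via names.

    Simplified stand-in for the iterparse row loop; should_set decides
    whether a parsed value is stored under its new name.
    """
    row = {}
    for tag, text in elements:
        for col, nm in zip(columns, names):
            if tag == col and should_set(row, nm, text):
                row[nm] = text
    return row


# The reproducer's data: two elements with the same text.
elements = [("type", "BUG"), ("category", "BUG")]
columns = ["type", "category"]
names = ["type", "cat"]


# Condition reported as buggy: rejects any value already present in the row.
def original(row, nm, val):
    return val not in row.values() and nm not in row


# Condition proposed by haydenw2005: only the column name is checked.
def name_only(row, nm, val):
    return nm not in row


# Condition proposed by ParfaitG: key and value are checked together.
def key_and_value(row, nm, val):
    return row.get(nm) != val and nm not in row


row_original = build_row(elements, columns, names, original)
row_name_only = build_row(elements, columns, names, name_only)
row_key_and_value = build_row(elements, columns, names, key_and_value)

print(row_original)       # {'type': 'BUG'}
print(row_name_only)      # {'type': 'BUG', 'cat': 'BUG'}
print(row_key_and_value)  # {'type': 'BUG', 'cat': 'BUG'}
```

In this model the original condition drops `cat` because `'BUG'` is already among the row's values, while both proposed conditions keep it. Whether the two proposals differ on the cases covered by #47343 is not something this sketch can show.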

@jreback added this to the 1.5 milestone Jul 8, 2022
@jreback removed the Needs Triage label Jul 8, 2022