Skip to content

ENH: Possible to add dtype/converters as arguments for pandas.read_xml() ? #43567

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
soliujing opened this issue Sep 14, 2021 · 5 comments · Fixed by #45411
Closed

ENH: Possible to add dtype/converters as arguments for pandas.read_xml() ? #43567

soliujing opened this issue Sep 14, 2021 · 5 comments · Fixed by #45411
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement IO XML read_xml, to_xml
Milestone

Comments

@soliujing
Copy link

Is your feature request related to a problem?

I am using pandas lib to read xml for further processes, however a number of columns with leading ZERO are always converted to numbers, so I lost the original data.

Describe the solution you'd like

It would be great to add dtype/converter arguments for pandas.read_xml() to force pandas to interprete certain columns with given dtype/converters.
Just like similar IO read (read_csv, read_html, etc)

read_xml
read_csv

API breaking implications

Probably not, this argument could be optional.

Describe alternatives you've considered

Write my own code to pull data by each xml nodes, which results in very bad performance.

@soliujing soliujing added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 14, 2021
@soliujing soliujing changed the title ENH: Possible to add dtype/converters for pandas.read_xml() ? ENH: Possible to add dtype/converters as arguments for pandas.read_xml() ? Sep 14, 2021
@mzeitlin11
Copy link
Member

Could maybe be added to #40131, cc @ParfaitG if any thoughts

@mzeitlin11 mzeitlin11 added Dtype Conversions Unexpected or buggy dtype conversions IO XML read_xml, to_xml labels Sep 14, 2021
@ParfaitG
Copy link
Contributor

Agreed! Good feature to add to running list. Also, read_xml passes parsed data to TextParser shared by other io readers.

@lithomas1 lithomas1 removed the Needs Triage Issue that has not been reviewed by a pandas team member label Sep 15, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Sep 15, 2021
@ParfaitG
Copy link
Contributor

As a current workaround, consider running XSLT to quote the nodes with leading zeroes and then convert on the pandas side. If using the default lxml parser, XSLT 1.0 scripts are supported in read_xml. Below XSLT runs the standard Identity Template and encloses the text values of the zip with double quotes.

import pandas as pd

xml = \
'''<root>
     <row>
        <zip>08540</zip>
        <dat>123</dat>
     </row>
     <row>
        <zip>08628</zip>
        <dat>456</dat>
     </row>
     <row>
        <zip>27599</zip>
        <dat>789</dat>
     </row>
    </root>'''

xsl = \
'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TEMPLATE TO COPY XML AS IS -->
    <xsl:template match="node()|@*">
       <xsl:copy>
         <xsl:apply-templates select="node()|@*"/>
       </xsl:copy>
    </xsl:template>
    
    <!-- ENCLOSE zip NODES WITH DOUBLE QUOTES -->
    <xsl:template match="zip">
      <xsl:copy>
        <xsl:variable name="quot">"</xsl:variable>
        <xsl:value-of select="concat($quot, text(), $quot)"/>
      </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>'''

df = (
    pd.read_xml(xml, stylesheet = xsl)
      .assign(zip = lambda x: x["zip"].str.replace('"', ''))
)

df
     zip  dat
0  08540  123
1  08628  456
2  27599  789

@soliujing
Copy link
Author

@ParfaitG xsl is really a good idea !
I have tried and have questions on namespace.

  • actually the XML file i am processing contains namespace at root node, I am not very familar with xsl, so any suggestion for match & replace ?

I have done as below but cannot get any output..

xsl = \
'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:doc="http://sample.com/sample">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TEMPLATE TO COPY XML AS IS -->
    <xsl:template match="node()|@*">
       <xsl:copy>
         <xsl:apply-templates select="node()|@*"/>
       </xsl:copy>
    </xsl:template>
    
    <!-- ENCLOSE zip NODES WITH DOUBLE QUOTES -->
    <xsl:template match="doc:zip">
      <xsl:copy>
        <xsl:variable name="quot">"</xsl:variable>
        <xsl:value-of select="concat($quot, text(), $quot)"/>
      </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>'''

@ParfaitG
Copy link
Contributor

ParfaitG commented Oct 1, 2021

Hmmmm...At the very least, you should get back the same XML with this XSLT. The identity transform should work across any XML with or without namespaces. I can understand the double quotes not rendering. Possibly, doc: is not the correct namespace for <zip> nodes. Please share sample of XML for reproducible example.

@jreback jreback modified the milestones: Contributions Welcome, 1.5 Jan 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement IO XML read_xml, to_xml
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants