ENH: Possible to add dtype/converters as arguments for pandas.read_xml() ? #43567

soliujing · 2021-09-14T14:38:43Z

Is your feature request related to a problem?

I am using pandas to read XML for further processing. However, columns whose values have leading zeros are always converted to numbers, so the original data is lost.

Describe the solution you'd like

It would be great to add dtype/converters arguments to pandas.read_xml() to force pandas to interpret certain columns with a given dtype or converter, just like the similar IO readers (read_csv, read_html, etc.).

read_xml
read_csv

API breaking implications

Probably not, since the arguments could be optional.

Describe alternatives you've considered

Writing my own code to pull the data node by node, which results in very bad performance.
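For illustration, the manual alternative described above might look roughly like the sketch below. It uses only the standard library's xml.etree.ElementTree; the sample document and the row/zip/dat element names are assumptions made for the example, not the reporter's actual data. Every value stays a string, so leading zeros survive, but each node is visited in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real file was not shared in this issue.
xml = """<root>
  <row><zip>08540</zip><dat>123</dat></row>
  <row><zip>08628</zip><dat>456</dat></row>
</root>"""

root = ET.fromstring(xml)

# Build one dict per <row>, keeping every child's text as a plain string.
rows = [{child.tag: child.text for child in row} for row in root.findall("row")]

print(rows)
```

The resulting list of dicts can be handed to pd.DataFrame(rows) if a DataFrame is needed; any numeric columns then have to be converted explicitly.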


mzeitlin11 · 2021-09-14T19:57:17Z

Could maybe be added to #40131, cc @ParfaitG if any thoughts

ParfaitG · 2021-09-15T01:34:05Z

Agreed! Good feature to add to the running list. Also, read_xml passes parsed data to the TextParser shared by the other IO readers.

ParfaitG · 2021-09-18T14:43:27Z

As a current workaround, consider running XSLT to quote the nodes with leading zeros and then converting on the pandas side. If you use the default lxml parser, read_xml supports XSLT 1.0 scripts. The XSLT below runs the standard identity template and encloses the text values of the zip nodes in double quotes.

import pandas as pd

xml = \
'''<root>
     <row>
        <zip>08540</zip>
        <dat>123</dat>
     </row>
     <row>
        <zip>08628</zip>
        <dat>456</dat>
     </row>
     <row>
        <zip>27599</zip>
        <dat>789</dat>
     </row>
    </root>'''

xsl = \
'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TEMPLATE TO COPY XML AS IS -->
    <xsl:template match="node()|@*">
       <xsl:copy>
         <xsl:apply-templates select="node()|@*"/>
       </xsl:copy>
    </xsl:template>
    
    <!-- ENCLOSE zip NODES WITH DOUBLE QUOTES -->
    <xsl:template match="zip">
      <xsl:copy>
        <xsl:variable name="quot">"</xsl:variable>
        <xsl:value-of select="concat($quot, text(), $quot)"/>
      </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>'''

df = (
    pd.read_xml(xml, stylesheet = xsl)
      .assign(zip = lambda x: x["zip"].str.replace('"', ''))
)

df
     zip  dat
0  08540  123
1  08628  456
2  27599  789

soliujing · 2021-09-28T06:39:48Z

@ParfaitG XSLT is a really good idea!
I tried it and have a question about namespaces.

The XML file I am processing actually declares a namespace on the root node. I am not very familiar with XSLT, so do you have any suggestions for the match and replace?

I tried the stylesheet below but cannot get any output.

xsl = \
'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:doc="http://sample.com/sample">
    <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TEMPLATE TO COPY XML AS IS -->
    <xsl:template match="node()|@*">
       <xsl:copy>
         <xsl:apply-templates select="node()|@*"/>
       </xsl:copy>
    </xsl:template>
    
    <!-- ENCLOSE zip NODES WITH DOUBLE QUOTES -->
    <xsl:template match="doc:zip">
      <xsl:copy>
        <xsl:variable name="quot">"</xsl:variable>
        <xsl:value-of select="concat($quot, text(), $quot)"/>
      </xsl:copy>
    </xsl:template>
    
</xsl:stylesheet>'''

ParfaitG · 2021-10-01T00:16:28Z

Hmmmm... At the very least, you should get back the same XML with this XSLT. The identity transform should work across any XML, with or without namespaces. I can understand the double quotes not rendering. Possibly, the URI bound to doc: is not the correct namespace for the <zip> nodes. Please share a sample of the XML for a reproducible example.
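One way to check which namespace the <zip> nodes really live in is to inspect the parsed tags, as in the rough sketch below. The sample XML is an assumption (no file was shared in this thread); only the http://sample.com/sample URI comes from the stylesheet above. ElementTree reports each tag as {namespace-uri}localname, and a prefixed search matches only when the prefix is bound to exactly that URI, which is the same rule the match="doc:zip" template follows.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a default namespace declared on the root node.
xml = """<root xmlns="http://sample.com/sample">
  <row><zip>08540</zip><dat>123</dat></row>
</root>"""

root = ET.fromstring(xml)

# Tags come back as "{uri}local", which shows the namespace each node belongs to.
tags = sorted({el.tag for el in root.iter()})
print(tags)

# The prefix name is arbitrary; only the URI it is bound to has to match.
matched = root.findall(".//doc:zip", {"doc": "http://sample.com/sample"})

# A prefix bound to a different URI silently matches nothing.
wrong = root.findall(".//doc:zip", {"doc": "http://sample.com/other"})

print(len(matched), len(wrong))
```

If the URI printed for zip differs from the one declared in the stylesheet's xmlns:doc attribute, the doc:zip template never fires and the values are copied through unquoted.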

soliujing added the Enhancement and Needs Triage labels Sep 14, 2021

soliujing changed the title ~~ENH: Possible to add dtype/converters for pandas.read_xml() ?~~ ENH: Possible to add dtype/converters as arguments for pandas.read_xml() ? Sep 14, 2021

mzeitlin11 added the Dtype Conversions and IO XML labels Sep 14, 2021

lithomas1 removed the Needs Triage label Sep 15, 2021

lithomas1 added this to the Contributions Welcome milestone Sep 15, 2021

ParfaitG mentioned this issue Nov 20, 2021

Pandas IO XML Issue Tracker #40131

Closed

14 tasks

This was referenced Jan 16, 2022

ENH: add dtype to read_xml #45341

Closed

ENH: Add dtypes/converters arguments for pandas.read_xml #45411

Merged

jreback modified the milestones: Contributions Welcome, 1.5 Jan 23, 2022

jreback closed this as completed in #45411 Jan 23, 2022
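With the request resolved in #45411 for the 1.5 milestone, the leading zeros can be kept without the XSLT workaround by passing dtype to read_xml. The snippet below is a minimal sketch that assumes pandas 1.5 or later; it reuses the sample XML from earlier in this thread, wraps it in StringIO, and picks parser="etree" only so that it does not depend on lxml being installed.

```python
from io import StringIO

import pandas as pd

xml = """<root>
  <row><zip>08540</zip><dat>123</dat></row>
  <row><zip>08628</zip><dat>456</dat></row>
  <row><zip>27599</zip><dat>789</dat></row>
</root>"""

# dtype maps column names to types; zip is read as text, dat is still inferred.
df = pd.read_xml(StringIO(xml), parser="etree", dtype={"zip": str})

print(df)
```

The converters argument added alongside dtype takes a mapping of column name to function, for cases where a plain type is not enough.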
