ENH: Possible to add dtype/converters as arguments for pandas.read_xml() ? #43567
Comments
Agreed! Good feature to add to running list. Also,
As a current workaround, consider running XSLT to quote the nodes with leading zeroes and then convert on the pandas side. If using the default
import pandas as pd
xml = \
'''<root>
<row>
<zip>08540</zip>
<dat>123</dat>
</row>
<row>
<zip>08628</zip>
<dat>456</dat>
</row>
<row>
<zip>27599</zip>
<dat>789</dat>
</row>
</root>'''
xsl = \
'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- IDENTITY TEMPLATE TO COPY XML AS IS -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- ENCLOSE zip NODES WITH DOUBLE QUOTES -->
<xsl:template match="zip">
<xsl:copy>
<xsl:variable name="quot">"</xsl:variable>
<xsl:value-of select="concat($quot, text(), $quot)"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>'''
df = (
    pd.read_xml(xml, stylesheet=xsl)
      .assign(zip=lambda x: x["zip"].str.replace('"', ''))
)
df
#      zip  dat
# 0  08540  123
# 1  08628  456
# 2  27599  789
@ParfaitG XSL is really a good idea!
I did as below but cannot get any output.
Hmmmm... At the very least, you should get back the same XML with this XSLT. The identity transform should work across any XML with or without namespaces. I can understand the double quotes not rendering. Possibly,
Is your feature request related to a problem?
I am using pandas to read XML for further processing. However, columns whose values have leading zeros are always converted to numbers, so the original data is lost.
Describe the solution you'd like
It would be great to add dtype/converters arguments to
pandas.read_xml()
to force pandas to interpret certain columns with a given dtype or converter, just like similar IO readers (read_csv, read_html, etc.).
read_xml
read_csv
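For comparison, here is a minimal sketch of how read_csv already handles the leading-zero case through its dtype and converters arguments. The sample data is made up to mirror the zip/dat columns used elsewhere in this thread; the request is for read_xml to accept the same kind of arguments.

```python
import io

import pandas as pd

# Illustrative CSV data (an assumption, mirroring the zip/dat example in this thread).
csv_text = "zip,dat\n08540,123\n08628,456\n27599,789\n"

# Default inference: the zip column is parsed as integers, so leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# dtype keeps the zip column as text, preserving the leading zeros.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

# converters achieves the same result with a per-column callable.
converted = pd.read_csv(io.StringIO(csv_text), converters={"zip": str})

print(inferred["zip"].tolist())
print(as_text["zip"].tolist())
print(converted["zip"].tolist())
```

With dtype or converters the zip values stay as strings, which is the behavior being requested for XML input.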
API breaking implications
Probably none; the argument could be optional.
Describe alternatives you've considered
Write my own code to pull the data out of the XML node by node, which performs very poorly.
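A rough sketch of what that manual alternative might look like, using only the standard library. The function name rows_from_xml and the CONVERTERS mapping are hypothetical, chosen to imitate the requested converters argument; the XML is the zip/dat sample from this thread. This is not how pandas parses XML internally.

```python
import xml.etree.ElementTree as ET

# Sample XML from this thread.
XML = """<root>
  <row><zip>08540</zip><dat>123</dat></row>
  <row><zip>08628</zip><dat>456</dat></row>
  <row><zip>27599</zip><dat>789</dat></row>
</root>"""

# Hypothetical per-column converters, imitating the requested `converters` argument.
CONVERTERS = {"zip": str, "dat": int}


def rows_from_xml(xml_text, converters):
    """Walk each <row> element and convert each child's text by tag name."""
    root = ET.fromstring(xml_text)
    records = []
    for row in root.findall("row"):
        record = {}
        for child in row:
            # Columns without an explicit converter are kept as text.
            convert = converters.get(child.tag, str)
            record[child.tag] = None if child.text is None else convert(child.text)
        records.append(record)
    return records


records = rows_from_xml(XML, CONVERTERS)
print(records)
```

The resulting list of dicts can be passed to pd.DataFrame(records). Looping over every node in Python like this is what makes the approach slow on large files.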