-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: to_xml, read_xml #27554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This was discussed in #4734 but never got anywhere. Are there any standards for representing / translating tabular data to/from XML? |
I don't know if there are any standards per se. I used pandas to read csv and json and wondered ever since why xml wasn't supported by pandas since it's also a very popular format. I always figured it was because in order to parse xml you have to know certain things about the data. I got this impression because every example I've seen of using the regular python xml parser to load the data into a pandas dataframe seem to require that the names of certain elements be known (or at least more than just the filename) but today a friend of mine told me that parsing xml is just as hard as json which is why I decided to ask this. |
I'm not aware of any major objections (others can certainly chime in) but would just need research and a PR from the community |
Would be great. I used pandas for work and it would have been great if this was implemented. If it can be added that'd be great so I can add it to my code. |
I can look into this. |
Hi |
Has there been any progress on this ENH? While there is no standard XML, many public and private data dumps, repositories, and APIs that Other database and data science solutions provide such functionality of migrating XML documents to tables or datasets including:
I have answered many Stackoverflow questions where OPs regularly need a migration tool of XML to Data Frames, albeit with varying levels of nested documents. Surely, by While I would endeavor this PR, the Consider the below generalized version to import pandas as pd
import xml.etree.ElementTree as et # PYTHON STANDARD LIBRARY
from urllib import request # PYTHON STANDARD LIBRARY
import lxml.etree as lxet # THIRD-PARTY LIBRARY
def read_xml(io, nodes, namespace=None, url=False):
# URL
if url:
rq = request.urlopen(io)
xtree = et.fromstring(rq.read())
else:
# FILE
xtree = et.parse(io)
# LIST OF MERGED DICTIONARIES WITH TERTIARY OPERATOR FOR NAMESPACE TYPES
data = [{ **{r.tag.split('}')[1] if namespace else r.tag:
r.text.strip() if len(r.text.strip()) > 0 else None
for r in row.findall('*') }, **row.attrib
} for row in xtree.findall(nodes, namespace)]
# PASS PARSED DATA TO DataFrame CONSTRUCTOR
return pd.DataFrame(data)
def write_xml(df, io):
df_dict = df.to_dict(orient='index')
# INITIALIZING XML FILE
root = et.Element('data')
# WRITING TO XML NODES
for i, d in df_dict.items():
childRoot = et.SubElement(root, "row")
for k, v in d.items():
et.SubElement(childRoot, k).text = str(v)
# SAVING XML FILE
with open(io, 'wb') as f:
f.write(et.tostring(root,
xml_declaration=True,
encoding="UTF-8"))
def LXML_write_xml(df, io):
df_dict = df.to_dict(orient='index')
# INITIALIZING XML FILE
root = lxet.Element('data')
# WRITING TO XML NODES
for i, d in df_dict.items():
childRoot = lxet.SubElement(root, "row")
for k, v in d.items():
lxet.SubElement(childRoot, k).text = str(v)
tree = lxet.ElementTree(root)
tree.write(io,
pretty_print=True,
xml_declaration=True,
encoding="UTF-8")
return None
'''
EXAMPLES of read_xml()
'''
xml_df = read_xml("https://opendata.maryland.gov/api/views/un65-7ipd/rows.xml?accessType=DOWNLOAD",
nodes = ".//row/row",
url=True)
print(xml_df)
# year month ... _position _address
# 0 2002 JAN ... 0 https://opendata.maryland.gov/resource/un65-7i...
# 1 2002 FEB ... 0 https://opendata.maryland.gov/resource/un65-7i...
# 2 2002 MAR ... 0 https://opendata.maryland.gov/resource/un65-7i...
# 3 2002 APR ... 0 https://opendata.maryland.gov/resource/un65-7i...
# 4 2002 MAY ... 0 https://opendata.maryland.gov/resource/un65-7i...
# .. ... ... ... ... ...
# 221 2020 JUN ... 0 https://opendata.maryland.gov/resource/un65-7i...
# 222 2020 JUL ... 0 https://opendata.maryland.gov/resource/un65-7i...
# 223 2020 AUG ... 0 https://opendata.maryland.gov/resource/un65-7i...
# 224 2020 SEP ... 0 https://opendata.maryland.gov/resource/un65-7i...
# 225 2020 OCT ... 0 https://opendata.maryland.gov/resource/un65-7i...
xml_df = read_xml("http://ergast.com/api/f1/1954/results/1.xml",
nodes = ".//doc:Race",
namespace = {'doc': "http://ergast.com/mrd/1.4"},
url=True)
print(xml_df)
# RaceName ... url
# 0 Argentine Grand Prix ... http://en.wikipedia.org/wiki/1954_Argentine_Gr...
# 1 Indianapolis 500 ... http://en.wikipedia.org/wiki/1954_Indianapolis...
# 2 Belgian Grand Prix ... http://en.wikipedia.org/wiki/1954_Belgian_Gran...
# 3 French Grand Prix ... http://en.wikipedia.org/wiki/1954_French_Grand...
# 4 British Grand Prix ... http://en.wikipedia.org/wiki/1954_British_Gran...
# 5 German Grand Prix ... http://en.wikipedia.org/wiki/1954_German_Grand...
# 6 Swiss Grand Prix ... http://en.wikipedia.org/wiki/1954_Swiss_Grand_...
# 7 Italian Grand Prix ... http://en.wikipedia.org/wiki/1954_Italian_Gran...
# 8 Spanish Grand Prix ... http://en.wikipedia.org/wiki/1954_Spanish_Gran...
xml_df = read_xml("https://www.sec.gov/info/edgar/edgartaxonomies.xml",
nodes = ".//Loc",
url=True)
print(xml_df)
# Family Version ... Namespace Prefix
# 0 US GAAP 2020 ... http://xbrl.sec.gov/stpr/2018-01-31 stpr
# 1 US GAAP 2020 ... http://xbrl.sec.gov/sic/2020-01-31 sic
# 2 US GAAP 2020 ... http://xbrl.sec.gov/sic/2011-01-31 sic
# 3 US GAAP 2020 ... http://xbrl.sec.gov/naics/2017-01-31 naics
# 4 US GAAP 2020 ... http://xbrl.sec.gov/exch/2020-01-31 exch
# .. ... ... ... ... ...
# 356 BASE 2010 ... http://www.xbrl.org/2004/ref ref
# 357 BASE 2010 ... http://www.w3.org/1999/xlink xlink
# 358 BASE 2010 ... http://www.xbrl.org/2003/XLink xl
# 359 BASE 2010 ... http://www.xbrl.org/2003/linkbase link
# 360 BASE 2010 ... http://www.xbrl.org/2003/instance xbrli
'''
EXAMPLES OF write_xml()
'''
write_xml(xml_df, 'Output.xml')
#<?xml version='1.0' encoding='UTF-8'?>
#<data><row><Family>US GAAP</Family><Version>2020</Version><Href>https://xbrl.sec.gov/...
LXML_write_xml(xml_df, 'Output.xml')
#<?xml version='1.0' encoding='UTF-8'?>
#<data>
# <row>
# <Family>US GAAP</Family>
# <Version>2020</Version>
# <Href>https://xbrl.sec.gov/stpr/2018/stpr-2018-01-31.xsd</Href>
# <AttType>SCH</AttType>
# <FileTypeName>Schema</FileTypeName>
# <Elements>1</Elements>
# <Namespace>http://xbrl.sec.gov/stpr/2018-01-31</Namespace>
# <Prefix>stpr</Prefix>
# </row>
#...
#</data> |
@ParfaitG you are welcome to submit a PR - enhancements happen by volunteers |
Understood @jreback . I can get started with 4 PRs (via 4 different branches) following the style and signature of
I will emphasize in notes that After initial tested PRs, I can work with others more familiar with optimal object-oriented |
tests should be in the same PR as the code change |
Got it. Will plan for 2 PRs (1 and 3) for Edit: Will have to also make a small method add of |
take |
@ParfaitG you sir are a legend! So many people want/need a pandas way of reading XML. For whatever it is worth (completely understand if none of this is useful), I have been using xmltodict as an intermediary from making XML a pandas dataframe. You know, leveraging existing tools. from typing import Optional, List, OrderedDict
import pandas as pd
import xmltodict
def get_to_root_in_dict(the_dict: OrderedDict, root_key_list: Optional[List[str]]=None) -> OrderedDict:
if not root_key_list:
return the_dict
elif len(root_key_list) > 1:
return get_to_root_in_dict(the_dict[root_key_list[0]], root_key_list[1:])
else:
return the_dict[root_key_list[0]]
def read_xml_as_dataframe(xml: str, root_key_list: Optional[List[str]]=None, transpose: bool=False) -> pd.DataFrame:
if transpose:
return pd.DataFrame(get_to_root_in_dict(xmltodict.parse(xml), root_key_list)).T
else:
return pd.DataFrame(get_to_root_in_dict(xmltodict.parse(xml), root_key_list)) I've also notice that XML being just a text file, they sometime come within zips or even double zips... Another thing I've had issues with XML data was that, unlike JSON data, the pandas json_normalize didn't always work well with the intended XML schema. In JSON, a single element can exist within a list, but in XML, if only one tag exists, then the xml to dict process does not put that element into a list. So the resulting dataframe ends up with a single column with some lists and some strings, and some nans. So my package had to include a Anyways, thanks again for picking this up! |
@minchulkim87 - thanks! Your work is interesting and a convenience handler for those unfamiliar with the XML document model. I aim to focus on the XML standard with DOM support of elements, attributes, namespaces, even XPath and XSLT using XML handlers in Python's standard library and few core external libraries that |
I'm sure this has been asked before but I can't find it. Why doesn't pandas have a method called read_xml?
The text was updated successfully, but these errors were encountered: