ENH: Add I/O support of XML with pandas.read_xml and DataFrame.to_xml… #39516

Merged
merged 60 commits into from
Feb 27, 2021
b67d876
ENH: Add i/o support of XML with pandas.read_xml and DataFrame.to_xml…
ParfaitG Feb 1, 2021
98e3bcd
Merge branch 'master' into read_xml
ParfaitG Feb 1, 2021
cd79a06
Refactor code for base classes, add tests, adjust whatsnew entry
ParfaitG Feb 3, 2021
6c06dc2
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 3, 2021
fadcb67
Fixed import_optional_dependency() args
ParfaitG Feb 3, 2021
ac5fd3a
Fix fixture and param name collision and check two errors in tests
ParfaitG Feb 3, 2021
25ba341
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 3, 2021
143402a
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 3, 2021
938b0a0
Adjusted tests to handle etree version issues
ParfaitG Feb 3, 2021
a92c21e
Add appropriate etree skips in tests
ParfaitG Feb 3, 2021
51f10f2
Remove check for warnings in tests
ParfaitG Feb 3, 2021
3520d58
Adjust code to conform to mypy and docstring validation
ParfaitG Feb 4, 2021
4832562
Add read_xml to TestPDApi test and fix for etree test
ParfaitG Feb 4, 2021
2914c32
Add read_xml to TestPDApi test and fix for etree test
ParfaitG Feb 4, 2021
72d0e93
Replace lxml ImportWarning for ImportError with added tests
ParfaitG Feb 4, 2021
6453f6e
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 4, 2021
8af695e
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 5, 2021
b80b8ce
Adjust fixture for lxml skip and add error validation in tests
ParfaitG Feb 5, 2021
a6cfc90
Add conditional skips for envs without lxml
ParfaitG Feb 5, 2021
6c4e0b4
Clean up whatsnew rst of rebase issue
ParfaitG Feb 5, 2021
a57fd35
Fix unescaped emphasis and wording in read_xml docstring
ParfaitG Feb 5, 2021
16cbcd3
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 8, 2021
23439b4
Add XML section in io.rst and lxml dependency for read_xml in install…
ParfaitG Feb 8, 2021
2effae0
Add section title in whatsnew and tree builder for lxml dependency in…
ParfaitG Feb 10, 2021
878eebe
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 10, 2021
35fa6a6
Clean up merge issue in whatsnew, remove escape in io.rst, adjust exc…
ParfaitG Feb 11, 2021
80d44f9
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 11, 2021
f861d53
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 11, 2021
947840a
Remove redundant try/except and fix default namespace condition
ParfaitG Feb 16, 2021
f8dc56c
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 16, 2021
cb34dde
Replace path or buffer handling with get_handle and add compression a…
ParfaitG Feb 20, 2021
3133486
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 20, 2021
a7716b8
Fix issues in tests from other Python envs
ParfaitG Feb 21, 2021
701d225
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 21, 2021
5b93c16
Fix precommit issue with import line
ParfaitG Feb 21, 2021
9a0dfb4
Adjust code and tests per twoertwein comments
ParfaitG Feb 21, 2021
9556035
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 21, 2021
82ac370
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 22, 2021
c478cb0
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 22, 2021
e23200d
Remove redundancy and object names in XML parse and rename tests for …
ParfaitG Feb 23, 2021
b0b3759
Resolve merge conflict with upstream/master
ParfaitG Feb 23, 2021
b48e257
Add XML table in install.rst
ParfaitG Feb 23, 2021
453ac40
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 23, 2021
9b21636
Streamline filepath_or_buffer handling and add TypeError tests
ParfaitG Feb 23, 2021
bea318c
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 23, 2021
49343b1
Fix lxml test on few Python envs
ParfaitG Feb 23, 2021
ce986bc
Adjust io handling in context manager
ParfaitG Feb 24, 2021
347d58b
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 24, 2021
e2f80db
Add and fix tests for special filepath_or_buffer values
ParfaitG Feb 24, 2021
c7e1e11
Fix tests for better example and wrong parser
ParfaitG Feb 24, 2021
9790e7c
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 24, 2021
df9ecf4
Adjust to handle empty string stylesheet with tests
ParfaitG Feb 24, 2021
46719b7
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 24, 2021
5d75d51
Move methods out of class, adjust xpath check, and data frame formatting
ParfaitG Feb 25, 2021
66c01d2
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 25, 2021
5c0af6e
Update tests to conform to mypy
ParfaitG Feb 25, 2021
2eae8ad
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 25, 2021
603644e
Import methods to avoid duplication and add typing to parse_doc
ParfaitG Feb 27, 2021
3ec7297
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 27, 2021
6194f83
Refactor code and revert changes to avoid optional module type hints
ParfaitG Feb 27, 2021
7 changes: 7 additions & 0 deletions doc/source/reference/io.rst
@@ -68,6 +68,13 @@ HTML

read_html

XML
~~~~
.. autosummary::
:toctree: api/

read_xml

HDFStore: PyTables (HDF5)
~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
37 changes: 37 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -33,6 +33,43 @@ For example:
storage_options=headers
)

.. _whatsnew_130.window_method_table:
Contributor:
looks like you picked up another change here

Contributor Author (ParfaitG, Feb 5, 2021):
Should I merge latest? And should I add XML section to io.rst or handle in different PR?

Contributor:
you should always merge latest every time you are pushing

Contributor:
add docs for io.rst in this PR

Contributor:
there is a top-level table in io.rst that needs updating as well (for the I/O read/write methods near the top)

Contributor Author:
Added XML section and updated top-level table.

Contributor:
great, this still looks like an artifact from a merge


:class:`Rolling` and :class:`Expanding` now support a ``method`` argument with a
``'table'`` option that performs the windowing operation over an entire :class:`DataFrame`.
See :ref:`window.overview` for performance and functional benefits. (:issue:`15095`)

.. _whatsnew_130.read_to_xml:

We added I/O support to read and render shallow versions of XML documents with
:func:`pandas.read_xml` and :meth:`DataFrame.to_xml`. Using lxml as parser,
Contributor:
can you add a reference to lxml (same one as we have in install.rst)

Contributor Author:
Will do.

full XPath 1.0 and XSLT 1.0 are available. (:issue:`27554`)

.. ipython:: python

xml = """<?xml version='1.0' encoding='utf-8'?>
<data>
<row>
<shape>square</shape>
<degrees>360</degrees>
<sides>4.0</sides>
</row>
<row>
<shape>circle</shape>
<degrees>360</degrees>
<sides/>
</row>
<row>
<shape>triangle</shape>
<degrees>180</degrees>
<sides>3.0</sides>
</row>
</data>"""

df = pd.read_xml(xml)
Contributor:
you need to show the rendered df, so end the ipython block here, and then add another one for the .to_xml()

Contributor Author:
Will do.


df.to_xml()
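
A minimal round-trip sketch of the two functions described in this whatsnew entry, assuming pandas 1.3 or later is installed. It passes `parser="etree"` so that it does not depend on lxml, and it wraps the document in `io.StringIO` because newer pandas releases deprecate passing a literal XML string. The variable names (`xml`, `out`, `df_again`) are illustrative only.

```python
import io

import pandas as pd

# A shallow document: one <row> per record, one child element per column.
xml = """<data>
  <row><shape>square</shape><degrees>360</degrees><sides>4.0</sides></row>
  <row><shape>circle</shape><degrees>360</degrees><sides/></row>
  <row><shape>triangle</shape><degrees>180</degrees><sides>3.0</sides></row>
</data>"""

# Parse with the standard-library backend; the empty <sides/> becomes NaN.
df = pd.read_xml(io.StringIO(xml), parser="etree")
print(df)

# Render back to XML without the index column.
out = df.to_xml(index=False, parser="etree")
print(out)

# Reading the rendered document again should yield the same columns.
df_again = pd.read_xml(io.StringIO(out), parser="etree")
print(df_again)
```

Printing the frame between the two calls mirrors the reviewer's request to show the parsed result before rendering it back with `to_xml`.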

.. _whatsnew_130.enhancements.other:

Other enhancements
1 change: 1 addition & 0 deletions pandas/__init__.py
@@ -167,6 +167,7 @@
read_feather,
read_gbq,
read_html,
read_xml,
read_json,
read_stata,
read_sas,
172 changes: 172 additions & 0 deletions pandas/core/frame.py
@@ -2604,6 +2604,178 @@ def to_html(
render_links=render_links,
)

def to_xml(
self,
io: Optional[FilePathOrBuffer[str]] = None,
Contributor:
name this path_or_buffer

Contributor Author:
Done.

index: Optional[bool] = True,
root_name: Optional[str] = "data",
row_name: Optional[str] = "row",
na_rep: Optional[str] = None,
attr_cols: Optional[Union[str, List[str]]] = None,
elem_cols: Optional[Union[str, List[str]]] = None,
namespaces: Optional[Union[dict, List[dict]]] = None,
prefix: Optional[str] = None,
encoding: Optional[str] = "utf-8",
xml_declaration: Optional[bool] = True,
pretty_print: Optional[bool] = True,
parser: Optional[str] = "lxml",
stylesheet: Optional[FilePathOrBuffer[str]] = None,
) -> Optional[str]:
"""
Render a DataFrame to an XML document.

.. versionadded:: 1.3.0

Parameters
----------
io : str, path object or file-like object, optional
File to write output to. If None, the output is returned as a
string.
index : bool, optional
Whether to include index in XML document.
root_name : str, default 'data'
The name of root element in XML document.
row_name : str, default 'row'
The name of row element in XML document.
na_rep : str, optional
Missing data representation.
attr_cols : list-like, optional
List of columns to write as attributes in row element.
Hierarchical columns will be flattened with underscore
delimiting the different levels.
elem_cols : list-like, optional
List of columns to write as children in row element. By default,
all columns are output as children of the row element. Hierarchical
columns will be flattened with underscore delimiting the
different levels.
namespaces : dict, optional
All namespaces to be defined in root element. Keys of dict
should be prefix names and values of dict corresponding URIs.
Default namespaces should be given empty string key. For
example, ::

namespaces = {'': 'https://example.com'}

prefix : str, optional
Namespace prefix to be used for every element and/or attribute
in document. This should be one of the keys in ``namespaces``
dict.
encoding : str, default 'utf-8'
Encoding of the resulting document.
xml_declaration : bool, default True
Whether to include the XML declaration at start of document.
pretty_print : bool, optional
Whether output should be pretty printed with indentation and
line breaks.
parser : {'lxml', 'etree'}, default 'lxml'
Parser module to use for building the tree. Only 'lxml' and
'etree' are supported. XSLT stylesheets are supported only
with 'lxml'. If lxml is not installed, a warning is raised
and processing continues with 'etree'.
stylesheet : str, path object or file-like object, optional
A URL, file-like object, or a raw string containing an XSLT
script used to transform the raw XML output. The script should
use the layout of elements and attributes from the original
output. This argument requires ``lxml`` to be installed. Only
XSLT 1.0 scripts are currently supported; later versions are not.

Returns
-------
None or str
If ``io`` is None, returns the resulting XML format as a
string. Otherwise returns None.

See Also
--------
to_json : Convert the pandas object to a JSON string.
to_html : Render a DataFrame as an HTML table.

Examples
--------
>>> df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
... 'degrees': [360, 360, 180],
... 'sides': [4, np.nan, 3]})

>>> df.to_xml()
<?xml version='1.0' encoding='utf-8'?>
<data>
<row>
<index>0</index>
<shape>square</shape>
<degrees>360</degrees>
<sides>4.0</sides>
</row>
<row>
<index>1</index>
<shape>circle</shape>
<degrees>360</degrees>
<sides/>
</row>
<row>
<index>2</index>
<shape>triangle</shape>
<degrees>180</degrees>
<sides>3.0</sides>
</row>
</data>

>>> df.to_xml(attr_cols=['index', 'shape', 'degrees', 'sides'])
<?xml version='1.0' encoding='utf-8'?>
<data>
<row index="0" shape="square" degrees="360" sides="4.0"/>
<row index="1" shape="circle" degrees="360"/>
<row index="2" shape="triangle" degrees="180" sides="3.0"/>
</data>

>>> df.to_xml(namespaces={"doc": "https://example.com"},
...           prefix="doc")
<?xml version='1.0' encoding='utf-8'?>
<doc:data xmlns:doc="https://example.com">
<doc:row>
<doc:index>0</doc:index>
<doc:shape>square</doc:shape>
<doc:degrees>360</doc:degrees>
<doc:sides>4.0</doc:sides>
</doc:row>
<doc:row>
<doc:index>1</doc:index>
<doc:shape>circle</doc:shape>
<doc:degrees>360</doc:degrees>
<doc:sides/>
</doc:row>
<doc:row>
<doc:index>2</doc:index>
<doc:shape>triangle</doc:shape>
<doc:degrees>180</doc:degrees>
<doc:sides>3.0</doc:sides>
</doc:row>
</doc:data>
"""

formatter = fmt.DataFrameFormatter(
self,
index=index,
na_rep=na_rep,
)

return fmt.DataFrameRenderer(formatter).to_xml(
io=io,
index=index,
root_name=root_name,
row_name=row_name,
na_rep=na_rep,
attr_cols=attr_cols,
elem_cols=elem_cols,
namespaces=namespaces,
prefix=prefix,
encoding=encoding,
xml_declaration=xml_declaration,
pretty_print=pretty_print,
parser=parser,
stylesheet=stylesheet,
)
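
The `attr_cols`, `namespaces`, and `prefix` parameters documented in the docstring can be exercised with a short sketch, assuming pandas 1.3 or later. It uses `parser="etree"` to avoid the lxml dependency and `xml_declaration=False` so the result can be fed straight into the standard-library parser. The frame contents and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

import pandas as pd

df = pd.DataFrame({"shape": ["square", "triangle"], "degrees": [360, 180]})

# Write every column as an attribute of <row> instead of a child element.
attr_xml = df.to_xml(
    index=False,
    attr_cols=["shape", "degrees"],
    parser="etree",
    xml_declaration=False,
)
rows = ET.fromstring(attr_xml).findall("row")
print([r.attrib for r in rows])

# Declare a namespace on the root and apply its prefix to every element.
ns_xml = df.to_xml(
    index=False,
    namespaces={"doc": "https://example.com"},
    prefix="doc",
    parser="etree",
    xml_declaration=False,
)
root = ET.fromstring(ns_xml)
print(root.tag)  # ElementTree reports tags in {uri}name form
```

Parsing the output back with `xml.etree.ElementTree` rather than comparing strings keeps the check independent of pretty-printing details.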

# ----------------------------------------------------------------------
@Substitution(
klass="DataFrame",
1 change: 1 addition & 0 deletions pandas/io/api.py
@@ -19,3 +19,4 @@
from pandas.io.spss import read_spss
from pandas.io.sql import read_sql, read_sql_query, read_sql_table
from pandas.io.stata import read_stata
from pandas.io.xml import read_xml
117 changes: 117 additions & 0 deletions pandas/io/formats/format.py
@@ -30,6 +30,7 @@
cast,
)
from unicodedata import east_asian_width
from warnings import warn

import numpy as np

@@ -914,6 +915,7 @@ class DataFrameRenderer:

Called in pandas.core.frame.DataFrame:
- to_html
- to_xml
- to_string

Parameters
@@ -1003,6 +1005,121 @@ def to_html(
string = html_formatter.to_string()
return save_to_buffer(string, buf=buf, encoding=encoding)

def to_xml(
self,
io: Optional[FilePathOrBuffer[str]] = None,
Contributor:
same

index: Optional[bool] = True,
root_name: Optional[str] = "data",
row_name: Optional[str] = "row",
na_rep: Optional[str] = None,
attr_cols: Optional[Union[str, List[str]]] = None,
elem_cols: Optional[Union[str, List[str]]] = None,
namespaces: Optional[Union[dict, List[dict]]] = None,
prefix: Optional[str] = None,
encoding: Optional[str] = "utf-8",
xml_declaration: Optional[bool] = True,
pretty_print: Optional[bool] = True,
parser: Optional[str] = "lxml",
stylesheet: Optional[FilePathOrBuffer[str]] = None,
) -> Optional[str]:
"""
Render a DataFrame to an XML document.

.. versionadded:: 1.3.0

Parameters
----------
io : str, path object or file-like object, optional
File to write output to. If None, the output is returned as a
string.
index : bool, optional
Whether to include index in XML document.
root_name : str, default 'data'
The name of root element in XML document.
row_name : str, default 'row'
The name of row element in XML document.
na_rep : str, optional
Missing data representation.
attr_cols : list-like, optional
List of columns to write as attributes in row element.
Hierarchical columns will be flattened with underscore
delimiting the different levels.
elem_cols : list-like, optional
List of columns to write as children in row element. By default,
all columns are output as children of the row element. Hierarchical
columns will be flattened with underscore delimiting the
different levels.
namespaces : dict, optional
All namespaces to be defined in root element. Keys of dict
should be prefix names and values of dict corresponding URIs.
Default namespaces should be given empty string key. For
example, ::

namespaces = {'': 'https://example.com'}

prefix : str, optional
Namespace prefix to be used for every element and/or attribute
in document. This should be one of the keys in ``namespaces``
dict.
encoding : str, default 'utf-8'
Encoding of the resulting document.
xml_declaration : bool, default True
Whether to include the XML declaration at start of document.
pretty_print : bool, optional
Whether output should be pretty printed with indentation and
line breaks.
parser : {'lxml', 'etree'}, default 'lxml'
Parser module to use for building the tree. Only 'lxml' and
'etree' are supported. XSLT stylesheets are supported only
with 'lxml'. If lxml is not installed, a warning is raised
and processing continues with 'etree'.
stylesheet : str, path object or file-like object, optional
A URL, file-like object, or a raw string containing an XSLT
script used to transform the raw XML output. The script should
use the layout of elements and attributes from the original
output. This argument requires ``lxml`` to be installed. Only
XSLT 1.0 scripts are currently supported; later versions are not.
"""

        from pandas.io.formats.xml import EtreeXMLFormatter, LxmlXMLFormatter

        if parser == "lxml":
            # Attempt the lxml import inside the try block; assigning the
            # formatter class alone can never raise ImportError, so the
            # documented fallback to etree would otherwise be unreachable.
            try:
                import lxml.etree  # noqa: F401
            except ImportError:
                warn(
                    "You do not have lxml installed (default parser). "
                    "Instead, etree will be used.",
                    ImportWarning,
                )
                TreeBuilder = EtreeXMLFormatter
            else:
                TreeBuilder = LxmlXMLFormatter

        elif parser == "etree":
            TreeBuilder = EtreeXMLFormatter

        else:
            raise ValueError("Values for parser can only be lxml or etree.")

        xml_formatter = TreeBuilder(
            self.fmt,
            io=io,
            index=index,
            root_name=root_name,
            row_name=row_name,
            na_rep=na_rep,
            attr_cols=attr_cols,
            elem_cols=elem_cols,
            namespaces=namespaces,
            prefix=prefix,
            encoding=encoding,
            xml_declaration=xml_declaration,
            pretty_print=pretty_print,
            stylesheet=stylesheet,
        )

        return xml_formatter.write_output()
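
The parser selection in this method follows a common optional-dependency pattern: prefer the third-party backend, warn and fall back to the standard library when it is missing, and reject unknown names. Below is a standalone sketch of that pattern, not the pandas implementation. The helper `pick_tree_builder` and its string return values are hypothetical, and availability is probed with `importlib.util.find_spec` rather than a `try`/`except ImportError`, since an `except ImportError` clause only helps when an import is actually attempted inside the `try`.

```python
import importlib.util
import warnings


def pick_tree_builder(parser: str = "lxml") -> str:
    """Return the name of the backend to use (hypothetical helper)."""
    if parser == "lxml":
        # find_spec returns None when the top-level package is not installed.
        if importlib.util.find_spec("lxml") is None:
            warnings.warn(
                "You do not have lxml installed (default parser). "
                "Instead, etree will be used.",
                ImportWarning,
            )
            return "etree"
        return "lxml"
    if parser == "etree":
        return "etree"
    raise ValueError("Values for parser can only be lxml or etree.")


print(pick_tree_builder("etree"))
print(pick_tree_builder("lxml"))  # "lxml" if installed, otherwise "etree"
```

Note that Python ignores `ImportWarning` by default, so the fallback is silent unless warnings are enabled, which may be one reason to raise an error instead.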

def to_string(
self,
buf: Optional[FilePathOrBuffer[str]] = None,