ENH: Add I/O support of XML with pandas.read_xml and DataFrame.to_xml… #39516

ParfaitG · 2021-02-01T01:04:20Z

… (GH27554)

To Do List

Add parse_dates feature for read_xml.
Add tests for storage_options.
Add tests for ParserError, OSError, URLError, etc.
Add xpath_vars feature to pass $ variables in xpath expression. See lxml xpath() method.
Add xsl_params feature to pass values into XSLT script. See lxml stylesheet parameters.
Add iterparse feature for memory efficient parsing of large XML. See etree iterparse and lxml iterparse.
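
The iterparse item in the list above could work along these lines. This is a minimal sketch using only the standard library's `xml.etree.ElementTree.iterparse`; the helper name `iter_rows` and the sample document are made up for illustration and are not part of this PR.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, row_tag):
    """Yield one dict per <row_tag> element (hypothetical helper)."""
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield {child.tag: child.text for child in elem}
            # Drop the element's children so memory use stays roughly flat.
            elem.clear()


xml_bytes = b"""<?xml version="1.0"?>
<data>
  <row><shape>square</shape><sides>4</sides></row>
  <row><shape>circle</shape><sides/></row>
</data>"""

rows = list(iter_rows(io.BytesIO(xml_bytes), "row"))
```

Each row is built and released before the next one is read, which is the property that matters for large files. An empty element such as `<sides/>` yields `None` for its text.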

twoertwein · 2021-02-02T00:56:06Z

pandas/io/formats/xml.py

+
+        try:
+            if self.io:
+                with open(self.io, "wb") as f:


if you want to support compression/fsspec/buffers and so on:

with get_handle(self.io, mode="wb", is_text=False, storage_options=..., compression=...) as handles:
    handles.handle.write(xml_doc)

I am not strong on these features. I attempted compression in a draft version of read_xml looking at _json.py but implementation would require workarounds. Can this be for future PR? If not, I would need several days to incorporate.

totally fine to defer
just add a check list of todos (top of PR is ok or new issue)

twoertwein · 2021-02-02T01:02:55Z

pandas/io/xml.py

+        as string, depending on object type.
+        """
+
+        obj = None


I think this could boil down to the following (except for str/bytes)

with get_handle(self.io, mode="r", is_text=True, encoding=self.encoding) as handles:
    obj = handles.handle.read()

Accepting strings of XML and strings representing filepaths might get tricky in some cases?
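
To illustrate why the two kinds of string can collide, here is a rough sketch of one way to tell them apart. The helper `classify_input` is hypothetical, and stripping leading whitespace before the `<` check is an assumption of this sketch rather than something the PR does.

```python
import os


def classify_input(value):
    """Return 'xml', 'path', or 'unknown' for a string input (hypothetical)."""
    if not isinstance(value, str):
        raise TypeError("expected a str")
    # Treat anything that starts with '<' as literal XML content.
    if value.lstrip().startswith("<"):
        return "xml"
    # Otherwise only call it a path if something exists at that location.
    if os.path.exists(value):
        return "path"
    return "unknown"


kinds = [
    classify_input("<data><row/></data>"),
    classify_input(os.getcwd()),
    classify_input("no_such_file_for_this_sketch.xml"),
]
```

The "unknown" case is the tricky one: a mistyped file name and a malformed XML string both end up there, so the caller still has to decide which error to raise.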

jreback

looks pretty good. a number of comments on the structure. will give more specific comments on the code itself after reorg. the tests look ok so far. need a bunch more on error testing. we want virtually every error line tested (meaning anytime you are raising an exception, each type should be tested).

its ok to add a checklist to the issue to track things (some of which can also be in followup PRs)

jreback · 2021-02-02T00:55:43Z

doc/source/whatsnew/v1.3.0.rst

+.. _whatsnew_130.read_to_xml:
+
+We added I/O support to read and render shallow versions of XML documents with 
+:func:`pandas.read_xml` and :meth:`DataFrame.to_xml`. Using lxml as parser, 


can you add a reference to lxml (same one as we have in install.rst)

jreback · 2021-02-02T00:56:13Z

doc/source/whatsnew/v1.3.0.rst

+      </row>
+    </data>"""
+
+    df = pd.read_xml(xml)


you need to show the rendered df, so end the ipython block here, and then add another one for the .to_xml()

jreback · 2021-02-02T00:57:19Z

pandas/core/frame.py

@@ -2604,6 +2604,178 @@ def to_html(
            render_links=render_links,
        )

+    def to_xml(
+        self,
+        io: Optional[FilePathOrBuffer[str]] = None,


name this path_or_buffer

jreback · 2021-02-02T00:58:34Z

pandas/io/formats/format.py

@@ -1003,6 +1005,121 @@ def to_html(
        string = html_formatter.to_string()
        return save_to_buffer(string, buf=buf, encoding=encoding)

+    def to_xml(
+        self,
+        io: Optional[FilePathOrBuffer[str]] = None,


jreback · 2021-02-02T01:08:38Z

pandas/io/formats/xml.py

+        if isinstance(self.stylesheet, str):
+            obj = self.stylesheet
+
+        if isinstance(self.stylesheet, bytes):


but see @twoertwein comments

jreback · 2021-02-02T01:11:03Z

pandas/io/xml.py

+    fallback option with etree parser.
+    """
+
+    if parser == "lxml":


looks like this is repeated logic here. need to handle this centrally.

jreback · 2021-02-02T01:11:32Z

pandas/io/xml.py

+    return _data_to_frame(data=data_dicts, **kwargs)
+
+
+@deprecate_nonkeyword_arguments(version="2.0")


you don't need to do this, just make everything keyword only (except for io) but rename that to path_or_buf
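
A sketch of what the keyword-only suggestion means in practice. `read_xml_sketch` is a stand-in with an invented parameter list, shown only for the shape of the signature; it is not the signature that was merged.

```python
def read_xml_sketch(path_or_buffer, *, xpath="./*", parser="lxml", encoding="utf-8"):
    """Hypothetical stand-in: only path_or_buffer may be passed positionally."""
    # The bare '*' makes every later parameter keyword-only, so no
    # deprecation decorator is needed for positional use of those arguments.
    return {
        "path_or_buffer": path_or_buffer,
        "xpath": xpath,
        "parser": parser,
        "encoding": encoding,
    }


result = read_xml_sketch("books.xml", xpath="./row")
```

Calling `read_xml_sketch("books.xml", "./row")` raises `TypeError`, which is the behavior the decorator would otherwise have phased in over a deprecation cycle.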

jreback · 2021-02-02T01:12:28Z

pandas/tests/io/formats/test_to_xml.py

+</data>"""
+
+
+@pytest.mark.parametrize("parser", ["lxml", "etree"])


this can be a fixture

jreback · 2021-02-05T19:12:54Z

doc/source/whatsnew/v1.3.0.rst

@@ -33,6 +33,80 @@ For example:
        storage_options=headers
    )

+.. _whatsnew_130.window_method_table:


looks like you picked up another change here

Should I merge latest? And should I add XML section to io.rst or handle in different PR?

you should always merge latest every time you are pushing

add docs for io.rst in this PR

there is a top-level table in io.rst that needs updating as well (for the I/O read/write methods near the top)

Added XML section and updated top-level table.

great, this still looks like an artifact from a merge

jreback · 2021-02-05T19:13:01Z

doc/source/whatsnew/v1.3.0.rst

+We added I/O support to read and render shallow versions of XML documents with
+:func:`pandas.read_xml` and :meth:`DataFrame.to_xml`. Using lxml as parser,
+full XPath 1.0 and XSLT 1.0 is available. (:issue:`27554`)
+=======


rebase issue

Will clean up

jreback · 2021-02-05T19:15:37Z

just a note here. we can merge this as long as it passes all tests and is mostly complete. meaning can certainly have a followup issue to tackle small things / corrections / xfails.

jreback · 2021-02-07T17:21:31Z

doc/source/whatsnew/v1.3.0.rst

@@ -33,6 +33,80 @@ For example:
        storage_options=headers
    )

+.. _whatsnew_130.window_method_table:


you should always merge latest every time you are pushing

jreback · 2021-02-07T17:21:47Z

doc/source/whatsnew/v1.3.0.rst

@@ -33,6 +33,80 @@ For example:
        storage_options=headers
    )

+.. _whatsnew_130.window_method_table:


add docs for io.rst in this PR

jreback · 2021-02-07T17:22:22Z

doc/source/whatsnew/v1.3.0.rst

@@ -33,6 +33,80 @@ For example:
        storage_options=headers
    )

+.. _whatsnew_130.window_method_table:


there is a top-level table in io.rst that needs updating as well (for the I/O read/write methods near the top)

twoertwein · 2021-02-23T15:01:39Z

pandas/io/formats/xml.py

+            style_doc = io.StringIO(style_doc)
+
+        handle_data = self._get_data_from_filepath(style_doc)
+        xml_data = self._preprocess_data(handle_data)


Is it possible to use _get_data_from_filepath here too? Maybe extend it to handle string and bytes as well. That would avoid creating a StringIO/BytesIO that isn't closed (I don't think that unclosed StringIO/BytesIO trigger a ResourceWarning)

Good call. I used StringIO here to pass mypy (which interestingly does not raise on similar line in pandas.io.xml for read_xml maybe due to ternary operator). But I can move the XML string to bytes conversion in _get_data_from_filepath which passes mypy and avoids io objects.

def _get_data_from_filepath(self, filepath_or_buffer):
    """
    Extract raw XML data.
    ...
    """
    filepath_or_buffer = stringify_path(filepath_or_buffer)

    if (
        isinstance(filepath_or_buffer, str)
        and not filepath_or_buffer.startswith(("<?xml", "<"))
    ) and (
        not isinstance(filepath_or_buffer, str)
        or is_url(filepath_or_buffer)
        or is_fsspec_url(filepath_or_buffer)
        or file_exists(filepath_or_buffer)
    ):
        with get_handle(
            filepath_or_buffer,
            "r",
            encoding=self.encoding,
            compression=self.compression,
            storage_options=self.storage_options,
        ) as handle_obj:
            filepath_or_buffer = (
                handle_obj.handle.read()
                if hasattr(handle_obj.handle, "read")
                else handle_obj.handle
            )

    return filepath_or_buffer

I need to convert, otherwise, method errs with below traceback. Any thoughts for conversion workaround?

Traceback (most recent call last):
  ...
  File "/home/pandas-parfaitg/pandas/io/xml.py", line 213, in _get_data_from_filepath
    with get_handle(
  File "/home/pandas-parfaitg/pandas/io/common.py", line 593, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/home/pandas-parfaitg/pandas/io/common.py", line 362, in _get_filepath_or_buffer
    file_obj = fsspec.open(
  File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 429, in open
    return open_files(
  File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 280, in open_files
    fs, fs_token, paths = get_fs_token_paths(
  File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 600, in get_fs_token_paths
    cls = get_filesystem_class(protocol)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/registry.py", line 204, in get_filesystem_class
    raise ValueError("Protocol not known: %s" % protocol)
ValueError: Protocol not known: <?xml version='1.0' encoding='utf-8'?>

based on the traceback it seems you are passing an XML string/bytes to get_handle. I think in your case you can use get_handle for everything except 1) strings representing XML data, 2) any bytes objects, and 3) None (if that is possible).

See updated code block above, _get_data_from_filepath had conditionals before passing into get_handle (omitted earlier for brevity). My addition for your 1 is first if condition (which does not raise errors, passes tests, and mypy). I even add tests for a not file or buffer object.

Remaining CI test fails do not relate to this PR with pandas.io.xml but another test, specifically: pandas.tests.arrays.sparse.test_array. Should I keep upstream/merge?

Great! The IO side looks good to me!

Is xml_data always opened by pandas or is it possible that it is a user-provided file object? If it is always opened by pandas you could have something like the following to make sure that the StringIO/BytesIO are always closed:

with self._preprocess_data(handle_data) as xml_data:
    ....

I think there are a few more places where this could be used (if xml_data cannot be a user-provided file handle).
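
One possible way to cover both cases raised here (buffers created internally and handles supplied by the user) is to track ownership. This is a sketch under that assumption; `preprocessed` is a hypothetical name and not the helper used in the PR.

```python
import io
from contextlib import contextmanager


@contextmanager
def preprocessed(data):
    """Yield a file-like object and close it only if it was created here."""
    if isinstance(data, str):
        handle, owned = io.StringIO(data), True
    elif isinstance(data, bytes):
        handle, owned = io.BytesIO(data), True
    else:
        # Assume anything else is a file-like object owned by the caller.
        handle, owned = data, False
    try:
        yield handle
    finally:
        if owned:
            handle.close()


with preprocessed("<data/>") as internal:
    text = internal.read()

user_handle = io.StringIO("<data/>")
with preprocessed(user_handle) as same_handle:
    same_handle.read()
```

After the first block the internally created `StringIO` is closed; after the second, `user_handle` is still open because the sketch never took ownership of it.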

Remaining CI test fails do not relate to this PR with pandas.io.xml but another test, specifically: pandas.tests.arrays.sparse.test_array. Should I keep upstream/merge?

Multiple commits on master have the same failing test. I wouldn't worry about it.

looking great! I don't think you need to add tests for not intended user input (None/DataFrame).

Understood. But for None, what if user sends a variable to read_xml that contains NoneType? Possibly initialized but never given a value later or variable was assigned None at end of an API process.

twoertwein · 2021-02-24T14:53:28Z

pandas/tests/io/test_xml.py

@@ -237,6 +236,28 @@ def test_file_buffered_reader_no_xml_declaration(datapath, parser, mode):
    tm.assert_frame_equal(df_str, df_expected)


+@td.skip_if_no("lxml")
+def test_closed_file_lxml(datapath):


you can probably parametrize these two tests.

In general, I'm not sure whether it is necessary to enforce generic error messages for "obviously" wrong inputs (None/closed file handles). @jreback

One test to add (or extending an existing test) is to make sure that a user-provided file handle is not closed by read/to_xml.

Understood. I can remove those wrong input tests. I tried simulating how users may behave (having answered many StackOverflow pandas questions from newbies!). Will parametrize and add file handle close tests.
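
The file handle test suggested above could be shaped roughly like this. The sketch uses the standard library parser in place of `read_xml` so that it stands alone; `parse_rows` is a hypothetical reader, and the real test would call the pandas functions instead.

```python
import io
import xml.etree.ElementTree as ET


def parse_rows(handle):
    """Hypothetical reader: parse from an open handle without closing it."""
    tree = ET.parse(handle)
    return [{child.tag: child.text for child in row} for row in tree.getroot()]


def test_user_handle_stays_open():
    handle = io.BytesIO(b"<data><row><a>1</a></row></data>")
    rows = parse_rows(handle)
    assert rows == [{"a": "1"}]
    # The caller opened the handle, so the caller decides when to close it.
    assert not handle.closed


test_user_handle_stays_open()
```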

jreback

a number of comments. overall looks really good.

I think might be ok to centralize all testing in pandas/tests/io/xml/test_to_xml.py and so on (you can move all these tests there)

jreback · 2021-02-24T17:06:41Z

pandas/io/formats/xml.py

+        including replacing missing entities and including indexes.
+        """
+
+        na_dict = {"None": self.na_rep, "NaN": self.na_rep, "nan": self.na_rep}


@ParfaitG can you update this

jreback · 2021-02-24T17:07:02Z

pandas/io/formats/xml.py

+
+        raise AbstractMethodError(self)
+
+    def _get_data_from_filepath(self, filepath_or_buffer):


can you type this?

jreback · 2021-02-24T17:07:13Z

pandas/io/formats/xml.py

+
+        return filepath_or_buffer
+
+    def _preprocess_data(self, data):


can you type

jreback · 2021-02-24T17:07:40Z

pandas/io/formats/xml.py

+        The data either has a `read` attribute (e.g. a file object or a
+        StringIO/BytesIO) or is a string or bytes that is an XML document.
+        """
+        if isinstance(data, str):


hmm why do we accept bytes here? (and not just string)?

why is this a method on the class?

Both _get_data_from_filepath and _preprocess_data methods were borrowed from pandas.io.json._json inside the JsonReader class. How to adjust here?

oh, ok yeah just make a module level function if you need to do this then

jreback · 2021-02-24T17:09:43Z

pandas/io/xml.py

+        children = self.xml_doc.xpath(self.xpath + "/*", namespaces=self.namespaces)
+        attrs = self.xml_doc.xpath(self.xpath + "/@*", namespaces=self.namespaces)
+
+        if (elems == [] and attrs == [] and children == []) or (


this is a very strange condition, can you pull out the attrs & children. can you use not children and so on here
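
A sketch of the truthiness-based checks the comment asks for, written against the standard library's ElementTree rather than lxml. The function name and the error messages are invented for illustration.

```python
import xml.etree.ElementTree as ET


def validate_xpath(root, xpath):
    """Return matched elements, or raise if nothing usable was found."""
    elems = root.findall(xpath)
    if not elems:
        raise ValueError("xpath does not return any nodes")

    # Gather children and attributes separately so each check reads plainly.
    children = [child for el in elems for child in el]
    attrs = [item for el in elems for item in el.attrib.items()]
    if not children and not attrs:
        raise ValueError("xpath matched nodes without children or attributes")
    return elems


root = ET.fromstring("<data><row id='1'><a>1</a></row><empty/></data>")
matched = validate_xpath(root, "./row")
```

Splitting the check in two also lets each failure carry its own message, which makes the error tests requested earlier in the review easier to write.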

jreback · 2021-02-24T17:10:17Z

pandas/io/xml.py

+            return tp.read()
+    except ParserError:
+        raise ParserError(
+            "XML document may be too complex for import. "


is this hit in tests?

This is on to-do list for tests. A catch-all for edge cases that passes other checks but fails here. There may be a complex XML that I have not anticipated. Otherwise I can let TextParser's TypeError raise.

TODO list is fine

jreback · 2021-02-24T17:12:04Z

pandas/tests/io/test_xml.py

+    with open(xsl, mode) as f:
+        xsl_obj = f.read()
+
+    read_xml(kml, stylesheet=xsl_obj)


no comparisons? (but should check something minimal at least)

jreback

lgtm. @pandas-dev/pandas-core if anyone would like to review. will merge in a few days otherwise.

WillAyd

This is really impressive - nice work! Small comments but given how large this PR is I would be OK with merging and tackling as follow ups

WillAyd · 2021-02-26T17:34:43Z

pandas/io/formats/xml.py

+        return bytes(new_doc)
+
+
+def _get_data_from_filepath(


This looks to be one of the only functions not typed; always nice to have (can be done in a follow up)

Also is this a copy of the function in the other xml.py module? Would be nice to de-duplicate

Re parse_doc not typed in Lxml classes, originally I had it typed but lxml unlike etree has its _Element and _ElementTree objects as private variables (with leading underscores) which fails flake8 on import. Can ignore.

Yes, methods do repeat. Can pull out of class as module level method in pandas.io.xml (i.e., read_xml) to be imported in pandas.io.formats.xml (i.e., to_xml).

Digging deeper, the private variables were from modified class objects. lxml does use same named types as etree. Adjusted accordingly.

WillAyd · 2021-02-26T17:35:13Z

pandas/io/xml.py

+    functionality.
+    """
+
+    def __init__(


Similar comment on typing for this method

jreback · 2021-02-27T18:15:28Z

thanks @ParfaitG really nice. please issue PRs for followups when you can. You may want to move the list into an issue for tracking.

ParfaitG · 2021-02-27T19:00:33Z

Thanks @twoertwein for your tremendous help on I/O side!

ParfaitG added 2 commits January 31, 2021 18:32

ENH: Add i/o support of XML with pandas.read_xml and DataFrame.to_xml…

b67d876

… (GH27554)

Merge branch 'master' into read_xml

98e3bcd

twoertwein reviewed Feb 2, 2021

jreback requested changes Feb 2, 2021

jreback added the IO XML read_xml, to_xml label Feb 2, 2021

ParfaitG added 17 commits February 2, 2021 18:53

Refactor code for base classes, add tests, adjust whatsnew entry

cd79a06

Merge remote-tracking branch 'upstream/master' into read_xml

6c06dc2

Fixed import_optional_dependency() args

fadcb67

Fix fixture and param name collision and check two errors in tests

ac5fd3a

Merge remote-tracking branch 'upstream/master' into read_xml

25ba341

Merge remote-tracking branch 'upstream/master' into read_xml

143402a

Adjusted tests to handle etree version issues

938b0a0

Add appropriate etree skips in tests

a92c21e

Remove check for warnings in tests

51f10f2

Adjust code to conform to mypy and docstring validation

3520d58

Add read_xml to TestPDApi test and fix for etree test

4832562

Add read_xml to TestPDApi test and fix for etree test

2914c32

Replace lxml ImportWarning for ImportError with added tests

72d0e93

Merge remote-tracking branch 'upstream/master' into read_xml

6453f6e

Merge remote-tracking branch 'upstream/master' into read_xml

8af695e

Adjust fixture for lxml skip and add error validation in tests

b80b8ce

Add conditional skips for envs without lxml

a6cfc90

jreback requested changes Feb 5, 2021

ParfaitG added 2 commits February 5, 2021 15:22

Clean up whatnew rst of rebase issue

6c4e0b4

Fix unescaped emphasis and wording in read_xml docstring

a57fd35

jreback requested changes Feb 7, 2021

ParfaitG added 2 commits February 7, 2021 19:03

Merge remote-tracking branch 'upstream/master' into read_xml

16cbcd3

Add XML section in io.rst and lxml dependency for read_xml in install…

23439b4

….rst

ParfaitG added 3 commits February 22, 2021 20:17

Resolve merge conflict with upstream/master

b0b3759

Add XML table in install.rst

b48e257

Merge remote-tracking branch 'upstream/master' into read_xml

453ac40

twoertwein reviewed Feb 23, 2021

ParfaitG added 8 commits February 23, 2021 12:52

Streamline filepath_or_buffer handling and add TypeError tests

9b21636

Merge remote-tracking branch 'upstream/master' into read_xml

bea318c

Fix lxml test on few Python envs

49343b1

Adjust io handling in context maanger

ce986bc

Merge remote-tracking branch 'upstream/master' into read_xml

347d58b

Add and fix tests for special filepath_or_buffer values

e2f80db

Fix tests for better example and wrong parser

c7e1e11

Merge remote-tracking branch 'upstream/master' into read_xml

9790e7c

twoertwein reviewed Feb 24, 2021

ParfaitG added 2 commits February 24, 2021 09:09

Adjust to handle empty string stylesheet with tests

df9ecf4

Merge remote-tracking branch 'upstream/master' into read_xml

46719b7

jreback requested changes Feb 24, 2021

ParfaitG added 4 commits February 25, 2021 00:47

Move methods out of class, adjust xpath check, and data frame formatting

5d75d51

Merge remote-tracking branch 'upstream/master' into read_xml

66c01d2

Update tests to conform to mypy

5c0af6e

Merge remote-tracking branch 'upstream/master' into read_xml

2eae8ad

jreback approved these changes Feb 26, 2021

WillAyd approved these changes Feb 26, 2021

ParfaitG added 3 commits February 27, 2021 07:46

Import methods to avoid duplication and add typing to parse_doc

603644e

Merge remote-tracking branch 'upstream/master' into read_xml

3ec7297

Refactor code and revert changes to avoid optional module type hints

6194f83

jreback merged commit 11afc76 into pandas-dev:master Feb 27, 2021

ParfaitG deleted the read_xml branch February 27, 2021 18:59

ParfaitG mentioned this pull request Mar 1, 2021

Pandas IO XML Issue Tracker #40131

Closed

14 tasks
