Skip to content

ENH: Add I/O support of XML with pandas.read_xml and DataFrame.to_xml… #39516

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 60 commits into from
Feb 27, 2021

Conversation

ParfaitG
Copy link
Contributor

@ParfaitG ParfaitG commented Feb 1, 2021

… (GH27554)

To Do List

  • Add parse_dates feature for read_xml.
  • Add tests for storage_options.
  • Add tests for ParserError, OSError, URLError, etc.
  • Add xpath_vars feature to pass $ variables in xpath expression. See lxml xpath() method.
  • Add xsl_params feature to pass values into XSLT script. See lxml stylesheet parameters.
  • Add iterparse feature for memory efficient parsing of large XML. See etree iterparse and lxml iterparse.


try:
if self.io:
with open(self.io, "wb") as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to support compression/fsspec/buffers and so on:

with get_handle(self.io, mode="wb", is_text=False, storage_options=..., compression=...) as handles:
    handles.handle.write(xml_doc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not strong on these features. I attempted compression in a draft version of read_xml looking at _json.py but implementation would require workarounds. Can this be for future PR? If not, I would need several days to incorporate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally fine to defer
just add a check list of todos (top of PR is ok or new issue)

pandas/io/xml.py Outdated
as string, depending on object type.
"""

obj = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could boil down to the following (except for str/bytes)

with get_handle(self.io, mode="r", is_text=True, encoding=self.encoding) as handles:
    obj = handles.handle.read()

Accepting strings of XML and strings representing filepaths might get tricky in some cases?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. a number of comments on the structure. will give more specific comments on the code itself after reorg. the tests look ok so far. need a bunch more on error testing. we want virtually every error line tested (meaning anytime you are raising an excpetion each tyep should be tested).

its ok to add a checklist to the issue to track things (some of which can also be in followup PRs)

.. _whatsnew_130.read_to_xml:

We added I/O support to read and render shallow versions of XML documents with
:func:`pandas.read_xml` and :meth:`DataFrame.to_xml`. Using lxml as parser,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a reference to lxml (same one as we have in install.rst)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

</row>
</data>"""

df = pd.read_xml(xml)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to show the rendered df, so end the ipython block here, and then add another one for the .to_xml()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@@ -2604,6 +2604,178 @@ def to_html(
render_links=render_links,
)

def to_xml(
self,
io: Optional[FilePathOrBuffer[str]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name this path_or_buffer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -1003,6 +1005,121 @@ def to_html(
string = html_formatter.to_string()
return save_to_buffer(string, buf=buf, encoding=encoding)

def to_xml(
self,
io: Optional[FilePathOrBuffer[str]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

if isinstance(self.stylesheet, str):
obj = self.stylesheet

if isinstance(self.stylesheet, bytes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if/elif

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but see @twoertwein comments

fallback option with etree parser.
"""

if parser == "lxml":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like this is repeated logic here. need to handle this centrally.

pandas/io/xml.py Outdated
return _data_to_frame(data=data_dicts, **kwargs)


@deprecate_nonkeyword_arguments(version="2.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to do this, just make everything keyword only (except for io) but rename that to path_or_buf

</data>"""


@pytest.mark.parametrize("parser", ["lxml", "etree"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be a fixture

@jreback jreback added the IO XML read_xml, to_xml label Feb 2, 2021
@@ -33,6 +33,80 @@ For example:
storage_options=headers
)

.. _whatsnew_130.window_method_table:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like you picked up another change here

Copy link
Contributor Author

@ParfaitG ParfaitG Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I merge latest? And should I add XML section to io.rst or handle in different PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should always merge latest every time you are pushing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add docs for io.rst in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a top-level table in io.rst that needs updating as well (for the I/O read/write methods near the top)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added XML section and updated top-level table.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, this still looks like an artfiact from a merge

We added I/O support to read and render shallow versions of XML documents with
:func:`pandas.read_xml` and :meth:`DataFrame.to_xml`. Using lxml as parser,
full XPath 1.0 and XSLT 1.0 is available. (:issue:`27554`)
=======
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebase issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will clean up

@jreback
Copy link
Contributor

jreback commented Feb 5, 2021

just a note here. we can merge this as long as it passes all tests and is mostly complete. meaning can certainly have a followup issue to tackle small things / corrections / xfails.

@@ -33,6 +33,80 @@ For example:
storage_options=headers
)

.. _whatsnew_130.window_method_table:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should always merge latest every time you are pushing

@@ -33,6 +33,80 @@ For example:
storage_options=headers
)

.. _whatsnew_130.window_method_table:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add docs for io.rst in this PR

@@ -33,6 +33,80 @@ For example:
storage_options=headers
)

.. _whatsnew_130.window_method_table:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a top-level table in io.rst that needs updating as well (for the I/O read/write methods near the top)

style_doc = io.StringIO(style_doc)

handle_data = self._get_data_from_filepath(style_doc)
xml_data = self._preprocess_data(handle_data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use _get_data_from_filepath here too? Maybe extend it to handle string and bytes as well. That would avoid creating a StringIO/BytesIO that isn't closed (I don't think that unclosed StringIO/BytesIO trigger a ResourceWarning)

Copy link
Contributor Author

@ParfaitG ParfaitG Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I used StringIO here to pass mypy (which interestingly does not raise on similar line in pandas.io.xml for read_xml maybe due to ternary operator). But I can move the XML string to bytes conversion in _get_data_from_filepath which passes mypy and avoids io objects.

def _get_data_from_filepath(self, filepath_or_buffer):
    def _get_data_from_filepath(self, filepath_or_buffer):
        """
        Extract raw XML data.

        ...
        """
        filepath_or_buffer = stringify_path(filepath_or_buffer)

        if (
            isinstance(filepath_or_buffer, str)
            and not filepath_or_buffer.startswith(("<?xml", "<"))
        ) and (
            not isinstance(filepath_or_buffer, str)
            or is_url(filepath_or_buffer)
            or is_fsspec_url(filepath_or_buffer)
            or file_exists(filepath_or_buffer)
        ):
            with get_handle(
                filepath_or_buffer,
                "r",
                encoding=self.encoding,
                compression=self.compression,
                storage_options=self.storage_options,
            ) as handle_obj:
                filepath_or_buffer = (
                    handle_obj.handle.read()
                    if hasattr(handle_obj.handle, "read")
                    else handle_obj.handle
                )

        return filepath_or_buffer

I need to convert, otherwise, method errs with below traceback. Any thoughts for conversion workaround?

Traceback (most recent call last):
   ...
  File "/home/pandas-parfaitg/pandas/io/xml.py", line 213, in _get_data_from_filepath
    with get_handle(
  File "/home/pandas-parfaitg/pandas/io/common.py", line 593, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/home/pandas-parfaitg/pandas/io/common.py", line 362, in _get_filepath_or_buffer
    file_obj = fsspec.open(
  File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 429, in open
    return open_files(
  File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 280, in open_files
    fs, fs_token, paths = get_fs_token_paths(
  File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 600, in get_fs_token_paths
    cls = get_filesystem_class(protocol)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/registry.py", line 204, in get_filesystem_class
    raise ValueError("Protocol not known: %s" % protocol)
ValueError: Protocol not known: <?xml version='1.0' encoding='utf-8'?>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on the traceback it seem you are you passing an XML string/bytes to get_handle. I think in your case you can use get_handle for everything except 1) strings representing XML data, 2) any bytes objects, and 3) None (if that is possible).

Copy link
Contributor Author

@ParfaitG ParfaitG Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See updated code block above, _get_data_from_filepath had conditionals before passing into get_handle (omitted earlier for brevity). My addition for your 1 is first if condition (which does not raise errors, passes tests, and mypy). I even add tests for a not file or buffer object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remaining CI test fails do not relate to this PR with pandas.io.xml but another test, specifically: pandas.tests.arrays.sparse.test_array. Should I keep upstream/merge?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! The IO side looks good to me!

Is xml_data always opened by pandas or is it possible that it is a user-provided file object? If it is always opened by pandas you could have something like the following to make sure that the StringIO/BytesIO are always closed:

with self._preprocess_data(handle_data) as xml_data:
    ....

I think there are a few more places where this could be used (if xml_data cannot be a user-provided file handle).

Remaining CI test fails do not relate to this PR with pandas.io.xml but another test, specifically: pandas.tests.arrays.sparse.test_array. Should I keep upstream/merge?

Multiple commits on master have the same failing test. I wouldn't worry about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking great! I don't think you need to add tests for not intended user input (None/DataFrame).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. But for None, what if user sends a variable to read_xml that contains NoneType? Possibly initialized but never given a value later or variable was assigned None at end of an API process.

@@ -237,6 +236,28 @@ def test_file_buffered_reader_no_xml_declaration(datapath, parser, mode):
tm.assert_frame_equal(df_str, df_expected)


@td.skip_if_no("lxml")
def test_closed_file_lxml(datapath):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can probably parametrize these two tests.

In general, I'm not sure whether it is necessary to enforce generic error messages for "obviously" wrong inputs (None/closed files handles). @jreback

One test to add (or extending an existing test) is to make sure that a user-provided file handle is not closed by read/to_xml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I can remove those wrong input tests. I tried simulating how users may behave (having answered many StackOverflow pandas answers from newbies!). Will parametrize and add file handle close tests.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a number of comments. overall looks really good.

I think might be ok to centralize all testing in pandas/tests/io/xml/test_to_xml.py and so on (you can move all these tests there)

including replacing missing entities and including indexes.
"""

na_dict = {"None": self.na_rep, "NaN": self.na_rep, "nan": self.na_rep}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ParfaitG can you update this


raise AbstractMethodError(self)

def _get_data_from_filepath(self, filepath_or_buffer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.


return filepath_or_buffer

def _preprocess_data(self, data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

The data either has a `read` attribute (e.g. a file object or a
StringIO/BytesIO) or is a string or bytes that is an XML document.
"""
if isinstance(data, str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm why do we accept bytes here? (and not just string)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a method on the class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both _get_data_from_filepath and _preprocess_data methods were borrowed from pandas.io.json._json inside the JsonReader class. How to adjust here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, ok yeah just make a module level function if you need to do this then

pandas/io/xml.py Outdated
children = self.xml_doc.xpath(self.xpath + "/*", namespaces=self.namespaces)
attrs = self.xml_doc.xpath(self.xpath + "/@*", namespaces=self.namespaces)

if (elems == [] and attrs == [] and children == []) or (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a very strange condition, can you pull out the attrs & children. can you use not children and so on here

return tp.read()
except ParserError:
raise ParserError(
"XML document may be too complex for import. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this hit in tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is on to-do list for tests. A catch-all for edge cases that passes other checks but fails here. There may be a complex XML that I have not anticipated. Otherwise I can let TextParser's TypeError raise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk sure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO list is fine

with open(xsl, mode) as f:
xsl_obj = f.read()

read_xml(kml, stylesheet=xsl_obj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no comparisons? (but should check something minimal at least)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. @pandas-dev/pandas-core if anyone would like to review. will merge in a few days otherwise.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really impressive - nice work! Small comments but given how large this PR is I would be OK with merging and tackling as follow ups

return bytes(new_doc)


def _get_data_from_filepath(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be one of the only functions not typed; always nice to have (can be done in a follow up)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is this a copy of the function in the other xml.py module? Would be nice to de-duplicate

Copy link
Contributor Author

@ParfaitG ParfaitG Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re parse_doc not typed in Lxml classes, originally I had it typed but lxml unlike etree has its _Element and _ElementTree objects as private variables (with leading underscores) which fails flake8 on import. Can ignore.

Yes, methods do repeat. Can pull out of class as module level method in pandas.io.import (i.e., read_xml) to be imported in pandas.io.formats.xml (i.e., to_xml).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Digging deeper, the private variables were from modified class objects. lxml does use same named types as etree. Adjusted accordingly.

functionality.
"""

def __init__(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment on typing for this method

@jreback jreback merged commit 11afc76 into pandas-dev:master Feb 27, 2021
@jreback
Copy link
Contributor

jreback commented Feb 27, 2021

thanks @ParfaitG really nice. please issue PRs for followups when you can. You may want to move the list into an issue for tracking.

@ParfaitG ParfaitG deleted the read_xml branch February 27, 2021 18:59
@ParfaitG
Copy link
Contributor Author

Thanks @twoertwein for your tremendous help on I/O side!

@ParfaitG ParfaitG mentioned this pull request Mar 1, 2021
14 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO XML read_xml, to_xml
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: to_xml, read_xml
4 participants