diff --git a/AUTHORS.rst b/AUTHORS.rst
index 3dde015f..af0a7a65 100644
--- a/AUTHORS.rst
+++ b/AUTHORS.rst
@@ -45,3 +45,4 @@ Patches and suggestions
 - Jon Dufresne
 - Ville Skyttä
 - Jonathan Vanasco
+- Tom Most
diff --git a/CHANGES.rst b/CHANGES.rst
index 047a7545..8690d749 100644
--- a/CHANGES.rst
+++ b/CHANGES.rst
@@ -32,7 +32,7 @@ Released on July 14, 2016
 
 * Cease supporting DATrie under PyPy.
 
-* **Remove ``PullDOM`` support, as this hasn't ever been properly
+* **Remove PullDOM support, as this hasn't ever been properly
   tested, doesn't entirely work, and as far as I can tell is
   completely unused by anyone.**
 
@@ -70,7 +70,7 @@ Released on July 14, 2016
   to clarify their status as public.**
 
 * **Get rid of the sanitizer package. Merge sanitizer.sanitize into the
-  sanitizer.htmlsanitizer module and move that to saniziter. This means
+  sanitizer.htmlsanitizer module and move that to sanitizer. This means
   anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no
   code changes.**
 
diff --git a/doc/html5lib.rst b/doc/html5lib.rst
index f0646aac..2a0b150f 100644
--- a/doc/html5lib.rst
+++ b/doc/html5lib.rst
@@ -1,13 +1,8 @@
 html5lib Package
 ================
 
-:mod:`html5lib` Package
------------------------
-
-.. automodule:: html5lib.__init__
-    :members:
-    :undoc-members:
-    :show-inheritance:
+.. automodule:: html5lib
+    :members: __version__
 
 :mod:`constants` Module
 -----------------------
@@ -26,7 +21,7 @@ html5lib Package
     :show-inheritance:
 
 :mod:`serializer` Module
-----------------------
+------------------------
 
 .. automodule:: html5lib.serializer
     :members:
@@ -41,4 +36,5 @@ Subpackages
     html5lib.filters
     html5lib.treebuilders
     html5lib.treewalkers
+    html5lib.treeadapters
 
diff --git a/doc/html5lib.treeadapters.rst b/doc/html5lib.treeadapters.rst
new file mode 100644
index 00000000..6b2dc78d
--- /dev/null
+++ b/doc/html5lib.treeadapters.rst
@@ -0,0 +1,20 @@
+treebuilders Package
+====================
+
+:mod:`~html5lib.treeadapters` Package
+-------------------------------------
+
+.. automodule:: html5lib.treeadapters
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+.. automodule:: html5lib.treeadapters.genshi
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+.. automodule:: html5lib.treeadapters.sax
+    :members:
+    :undoc-members:
+    :show-inheritance:
diff --git a/doc/html5lib.treewalkers.rst b/doc/html5lib.treewalkers.rst
index 46501258..085d8a98 100644
--- a/doc/html5lib.treewalkers.rst
+++ b/doc/html5lib.treewalkers.rst
@@ -10,7 +10,7 @@ treewalkers Package
     :show-inheritance:
 
 :mod:`base` Module
--------------------
+------------------
 
 .. automodule:: html5lib.treewalkers.base
     :members:
@@ -34,7 +34,7 @@ treewalkers Package
     :show-inheritance:
 
 :mod:`etree_lxml` Module
-----------------------
+------------------------
 
 .. automodule:: html5lib.treewalkers.etree_lxml
     :members:
@@ -43,9 +43,9 @@ treewalkers Package
 
 
 :mod:`genshi` Module
---------------------------
+--------------------
 
 .. automodule:: html5lib.treewalkers.genshi
     :members:
     :undoc-members:
-    :show-inheritance:
\ No newline at end of file
+    :show-inheritance:
diff --git a/doc/movingparts.rst b/doc/movingparts.rst
index 80ee2ad1..6ba367a2 100644
--- a/doc/movingparts.rst
+++ b/doc/movingparts.rst
@@ -4,22 +4,25 @@ The moving parts
 html5lib consists of a number of components, which are responsible for
 handling its features.
 
+Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
+Several tree representations are supported, as are translations to other formats via *tree adapters*.
+The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
+The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.
 
 Tree builders
 -------------
 
 The parser reads HTML by tokenizing the content and building a tree that
-the user can later access. There are three main types of trees that
-html5lib can build:
+the user can later access. html5lib can build three types of trees:
 
-* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
+* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
   which can be found in the standard library. Whenever possible, the
   accelerated ``ElementTree`` implementation (i.e.
   ``xml.etree.cElementTree`` on Python 2.x) is used.
 
-* ``dom`` - builds a tree based on ``xml.dom.minidom``.
+* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.
 
-* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
+* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
   API. The performance gains are relatively small compared to using the
   accelerated ``ElementTree`` module.
 
@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
   with open("mydocument.html", "rb") as f:
       lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
 
-When instantiating a parser object, you have to pass a tree builder
-class in the ``tree`` keyword attribute:
+To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.
 
-.. code-block:: python
-
-  import html5lib
-  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
-  document = parser.parse("<p>Hello World!")
-
-To get a builder class by name, use the ``getTreeBuilder`` function:
+When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:
 
 .. code-block:: python
 
   import html5lib
-  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
+  TreeBuilder = html5lib.getTreeBuilder("dom")
+  parser = html5lib.HTMLParser(tree=TreeBuilder)
   minidom_document = parser.parse("<p>Hello World!")
 
The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,13 @@ The implementation of builders can be found in `html5lib/treebuilders/
Tree walkers
------------
-Once a tree is ready, you can work on it either manually, or using
-a tree walker, which provides a streaming view of the tree. html5lib
-provides walkers for all three supported types of trees (``etree``,
-``dom`` and ``lxml``).
+In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
+html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams Surprise!")
-
-HTMLTokenizer
-~~~~~~~~~~~~~
-
-This is the default tokenizer, the heart of html5lib. The implementation
-can be found in `html5lib/tokenizer.py
-