html5lib · willkg · Nov 6, 2017 · Apr 15, 2017
diff --git a/AUTHORS.rst b/AUTHORS.rst
@@ -45,3 +45,4 @@ Patches and suggestions
 - Jon Dufresne
 - Ville Skyttä
 - Jonathan Vanasco
+- Tom Most
diff --git a/CHANGES.rst b/CHANGES.rst
@@ -32,7 +32,7 @@ Released on July 14, 2016
 
 * Cease supporting DATrie under PyPy.
 
-* **Remove ``PullDOM`` support, as this hasn't ever been properly
+* **Remove PullDOM support, as this hasn't ever been properly
   tested, doesn't entirely work, and as far as I can tell is
   completely unused by anyone.**
 
@@ -70,7 +70,7 @@ Released on July 14, 2016
   to clarify their status as public.**
 
 * **Get rid of the sanitizer package. Merge sanitizer.sanitize into the
-  sanitizer.htmlsanitizer module and move that to saniziter. This means
+  sanitizer.htmlsanitizer module and move that to sanitizer. This means
   anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no
   code changes.**
 

diff --git a/doc/html5lib.rst b/doc/html5lib.rst
@@ -1,13 +1,8 @@
 html5lib Package
 ================
 
-:mod:`html5lib` Package
------------------------
-
-.. automodule:: html5lib.__init__
-    :members:
-    :undoc-members:
-    :show-inheritance:
+.. automodule:: html5lib
+    :members: __version__
 
 :mod:`constants` Module
 -----------------------
@@ -26,7 +21,7 @@ html5lib Package
     :show-inheritance:
 
 :mod:`serializer` Module
-----------------------
+------------------------
 
 .. automodule:: html5lib.serializer
     :members:
@@ -41,4 +36,5 @@ Subpackages
     html5lib.filters
     html5lib.treebuilders
     html5lib.treewalkers
+    html5lib.treeadapters
 
diff --git a/doc/html5lib.treeadapters.rst b/doc/html5lib.treeadapters.rst
@@ -0,0 +1,20 @@
+treeadapters Package
+====================
+
+:mod:`~html5lib.treeadapters` Package
+-------------------------------------
+
+.. automodule:: html5lib.treeadapters
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+.. automodule:: html5lib.treeadapters.genshi
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+.. automodule:: html5lib.treeadapters.sax
+    :members:
+    :undoc-members:
+    :show-inheritance:
diff --git a/doc/html5lib.treewalkers.rst b/doc/html5lib.treewalkers.rst
@@ -10,7 +10,7 @@ treewalkers Package
     :show-inheritance:
 
 :mod:`base` Module
--------------------
+------------------
 
 .. automodule:: html5lib.treewalkers.base
     :members:
@@ -34,7 +34,7 @@ treewalkers Package
     :show-inheritance:
 
 :mod:`etree_lxml` Module
------------------------
+------------------------
 
 .. automodule:: html5lib.treewalkers.etree_lxml
     :members:
@@ -43,9 +43,9 @@ treewalkers Package
 
 
 :mod:`genshi` Module
---------------------------
+--------------------
 
 .. automodule:: html5lib.treewalkers.genshi
     :members:
     :undoc-members:
-    :show-inheritance:
+    :show-inheritance:
diff --git a/doc/movingparts.rst b/doc/movingparts.rst
@@ -4,22 +4,25 @@ The moving parts
 html5lib consists of a number of components, which are responsible for
 handling its features.
 
+Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
+Several tree representations are supported, as are translations to other formats via *tree adapters*.
+The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
+The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.
 
 Tree builders
 -------------
 
 The parser reads HTML by tokenizing the content and building a tree that
-the user can later access. There are three main types of trees that
-html5lib can build:
+the user can later access. html5lib can build three types of trees:
 
-* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
+* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
   which can be found in the standard library. Whenever possible, the
   accelerated ``ElementTree`` implementation (i.e.
   ``xml.etree.cElementTree`` on Python 2.x) is used.
 
-* ``dom`` - builds a tree based on ``xml.dom.minidom``.
+* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.
 
-* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
+* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
   API.  The performance gains are relatively small compared to using the
   accelerated ``ElementTree`` module.
 
@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
   with open("mydocument.html", "rb") as f:
       lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
 
-When instantiating a parser object, you have to pass a tree builder
-class in the ``tree`` keyword attribute:
+To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.
 
-.. code-block:: python
-
-  import html5lib
-  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
-  document = parser.parse("<p>Hello World!")
-
-To get a builder class by name, use the ``getTreeBuilder`` function:
+When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword argument:
 
 .. code-block:: python
 
   import html5lib
-  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
+  TreeBuilder = html5lib.getTreeBuilder("dom")
+  parser = html5lib.HTMLParser(tree=TreeBuilder)
   minidom_document = parser.parse("<p>Hello World!")
 
 The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,13 @@ The implementation of builders can be found in `html5lib/treebuilders/
 Tree walkers
 ------------
 
-Once a tree is ready, you can work on it either manually, or using
-a tree walker, which provides a streaming view of the tree. html5lib
-provides walkers for all three supported types of trees (``etree``,
-``dom`` and ``lxml``).
+In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
+html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
 
 The implementation of walkers can be found in `html5lib/treewalkers/
 <https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
 
-Walkers make consuming HTML easier. html5lib uses them to provide you
-with has a couple of handy tools.
-
+html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes from a token stream, and several filters which manipulate the stream.
 
 HTMLSerializer
 ~~~~~~~~~~~~~~
@@ -90,15 +83,14 @@ The serializer lets you write HTML back as a stream of bytes.
   '>'
   'Witam wszystkich'
 
-You can customize the serializer behaviour in a variety of ways, consult
-the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
-documentation.
+You can customize the serializer behaviour in a variety of ways. Consult
+the :class:`~html5lib.serializer.HTMLSerializer` documentation.
 
 
 Filters
 ~~~~~~~
 
-You can alter the stream content with filters provided by html5lib:
+html5lib provides several filters:
 
 * :class:`alphabeticalattributes.Filter
   <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +102,11 @@ You can alter the stream content with filters provided by html5lib:
   the document
 
 * :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
-  ``LintError`` exceptions on invalid tag and attribute names, invalid
+  :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
   PCDATA, etc.
 
 * :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
-  removes tags from the stream which are not necessary to produce valid
+  removes tags from the token stream which are not necessary to produce valid
   HTML
 
 * :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +117,9 @@ You can alter the stream content with filters provided by html5lib:
 
 * :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
   collapses all whitespace characters to single spaces unless they're in
-  ``<pre/>`` or ``textarea`` tags.
+  ``<pre/>`` or ``<textarea/>`` tags.
 
-To use a filter, simply wrap it around a stream:
+To use a filter, simply wrap it around a token stream:
 
 .. code-block:: python
 
@@ -142,9 +134,11 @@ To use a filter, simply wrap it around a stream:
 Tree adapters
 -------------
 
-Used to translate one type of tree to another. More documentation
-pending, sorry.
+Tree adapters can be used to translate between tree formats.
+Two adapters are provided by html5lib:
 
+* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
+* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
 
 Encoding discovery
 ------------------
@@ -156,54 +150,16 @@ the following way:
 * The encoding may be explicitly specified by passing the name of the
   encoding as the encoding parameter to the
   :meth:`~html5lib.html5parser.HTMLParser.parse` method on
-  ``HTMLParser`` objects.
+  :class:`~html5lib.html5parser.HTMLParser` objects.
 
 * If no encoding is specified, the parser will attempt to detect the
   encoding from a ``<meta>``  element in the first 512 bytes of the
   document (this is only a partial implementation of the current HTML
-  5 specification).
+  specification).
 
-* If no encoding can be found and the chardet library is available, an
+* If no encoding can be found and the :mod:`chardet` library is available, an
   attempt will be made to sniff the encoding from the byte pattern.
 
 * If all else fails, the default encoding will be used. This is usually
   `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
   a common fallback used by Web browsers.
-
-
-Tokenizers
-----------
-
-The part of the parser responsible for translating a raw input stream
-into meaningful tokens is the tokenizer. Currently html5lib provides
-two.
-
-To set up a tokenizer, simply pass it when instantiating
-a :class:`~html5lib.html5parser.HTMLParser`:
-
-.. code-block:: python
-
-  import html5lib
-  from html5lib import sanitizer
-
-  p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
-  p.parse("<p>Surprise!<script>alert('Boo!');</script>")
-
-HTMLTokenizer
-~~~~~~~~~~~~~
-
-This is the default tokenizer, the heart of html5lib. The implementation
-can be found in `html5lib/tokenizer.py
-<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.
-
-HTMLSanitizer
-~~~~~~~~~~~~~
-
-This is a tokenizer that removes unsafe markup and CSS styles from the
-input. Elements that are known to be safe are passed through and the
-rest is converted to visible text. The default configuration of the
-sanitizer follows the `WHATWG Sanitization Rules
-<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
-
-The implementation can be found in `html5lib/sanitizer.py
-<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
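
Note on the "Encoding discovery" section of doc/movingparts.rst: the documented fallback order (explicit encoding, then a ``<meta>`` element in the first 512 bytes, then chardet if available, then the default) can be sketched roughly as follows. This is a standard-library illustration only, not html5lib's implementation; ``guess_encoding``, ``META_CHARSET``, and the regular expression are hypothetical names and simplifications introduced here.

```python
import re

# Hypothetical helper, NOT part of html5lib. It mirrors only the fallback
# order described in the docs; the <meta> matching is deliberately crude.
META_CHARSET = re.compile(rb'''<meta[^>]+charset\s*=\s*["']?([\w-]+)''',
                          re.IGNORECASE)


def guess_encoding(data, explicit=None, default="windows-1252"):
    """Return an (encoding, source) pair for a document given as bytes."""
    # 1. An explicitly supplied encoding always wins.
    if explicit:
        return explicit, "explicit"

    # 2. Look for a <meta> charset declaration in the first 512 bytes only.
    match = META_CHARSET.search(data[:512])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"

    # 3. Sniff the byte pattern, but only if chardet happens to be installed.
    try:
        import chardet
    except ImportError:
        chardet = None
    if chardet is not None:
        guess = chardet.detect(data).get("encoding")
        if guess:
            return guess.lower(), "chardet"

    # 4. Fall back to the default encoding.
    return default, "default"
```

A declaration that appears after byte 512 is ignored by step 2 in this sketch, so such a document falls through to sniffing or the default.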
diff --git a/html5lib/__init__.py b/html5lib/__init__.py
@@ -1,14 +1,23 @@
 """
-HTML parsing library based on the WHATWG "HTML5"
-specification. The parser is designed to be compatible with existing
-HTML found in the wild and implements well-defined error recovery that
+HTML parsing library based on the `WHATWG HTML specification
+<https://whatwg.org/html>`_. The parser is designed to be compatible with
+existing HTML found in the wild and implements well-defined error recovery that
 is largely compatible with modern desktop web browsers.
 
-Example usage:
+Example usage::
 
-import html5lib
-f = open("my_document.html")
-tree = html5lib.parse(f)
+    import html5lib
+    with open("my_document.html", "rb") as f:
+        tree = html5lib.parse(f)
+
+For convenience, this module re-exports the following names:
+
+* :func:`~.html5parser.parse`
+* :func:`~.html5parser.parseFragment`
+* :class:`~.html5parser.HTMLParser`
+* :func:`~.treebuilders.getTreeBuilder`
+* :func:`~.treewalkers.getTreeWalker`
+* :func:`~.serializer.serialize`
 """
 
 from __future__ import absolute_import, division, unicode_literals
@@ -22,4 +31,5 @@
            "getTreeWalker", "serialize"]
 
 # this has to be at the top level, see how setup.py parses this
+#: Distribution version number.
 __version__ = "0.9999999999-dev"
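
Note on the token-stream pipeline that the re-exported ``getTreeWalker`` and ``serialize`` helpers take part in (see doc/movingparts.rst above): a filter is wrapped around a token stream and yields a modified stream. The sketch below shows that wrapping pattern with the standard library alone. ``CollapseWhitespace`` is a hypothetical toy class, not one of html5lib's filters, and the token dicts are hand-written on the assumption that tokens are dicts with ``type`` and ``data`` keys.

```python
import re


class CollapseWhitespace:
    """Toy stand-in for a token-stream filter (hypothetical, not html5lib code)."""

    def __init__(self, source):
        # `source` is any iterable of token dicts, including another filter.
        self.source = source

    def __iter__(self):
        for token in self.source:
            if token["type"] == "Characters":
                # Copy the token so the upstream stream is left untouched.
                token = dict(token, data=re.sub(r"\s+", " ", token["data"]))
            yield token


# Hand-written tokens standing in for what a tree walker would produce.
tokens = [
    {"type": "StartTag", "namespace": None, "name": "p", "data": {}},
    {"type": "Characters", "data": "Hello   \n world"},
    {"type": "EndTag", "namespace": None, "name": "p"},
]

filtered = list(CollapseWhitespace(tokens))
```

Because a filter both consumes and produces an iterable of tokens, filters written this way can be stacked by wrapping one around another.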
diff --git a/tox.ini b/tox.ini
@@ -11,7 +11,12 @@ deps =
   base: webencodings
   py26-base: ordereddict
   optional: -r{toxinidir}/requirements-optional.txt
+  doc: Sphinx
 
 commands =
   {envbindir}/py.test {posargs}
   {toxinidir}/flake8-run.sh
+
+[testenv:doc]
+changedir = doc
+commands = sphinx-build -b html . _build