Update docs #332

Merged
merged 22 commits into from
Nov 6, 2017
1 change: 1 addition & 0 deletions AUTHORS.rst
@@ -45,3 +45,4 @@ Patches and suggestions
- Jon Dufresne
- Ville Skyttä
- Jonathan Vanasco
- Tom Most
4 changes: 2 additions & 2 deletions CHANGES.rst
@@ -32,7 +32,7 @@ Released on July 14, 2016

* Cease supporting DATrie under PyPy.

* **Remove ``PullDOM`` support, as this hasn't ever been properly
* **Remove PullDOM support, as this hasn't ever been properly
tested, doesn't entirely work, and as far as I can tell is
completely unused by anyone.**

@@ -70,7 +70,7 @@ Released on July 14, 2016
to clarify their status as public.**

* **Get rid of the sanitizer package. Merge sanitizer.sanitize into the
sanitizer.htmlsanitizer module and move that to saniziter. This means
sanitizer.htmlsanitizer module and move that to sanitizer. This means
anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no
code changes.**

12 changes: 4 additions & 8 deletions doc/html5lib.rst
@@ -1,13 +1,8 @@
html5lib Package
================

:mod:`html5lib` Package
-----------------------
Review comment (Contributor): Why take the header out here?

Reply (Contributor Author, @twm, Nov 2, 2017): Otherwise there are two headers in a row with exactly the same text.


.. automodule:: html5lib.__init__
:members:
:undoc-members:
:show-inheritance:
.. automodule:: html5lib
:members: __version__

:mod:`constants` Module
-----------------------
@@ -26,7 +21,7 @@ html5lib Package
:show-inheritance:

:mod:`serializer` Module
----------------------
------------------------

.. automodule:: html5lib.serializer
:members:
@@ -41,4 +36,5 @@ Subpackages
html5lib.filters
html5lib.treebuilders
html5lib.treewalkers
html5lib.treeadapters

20 changes: 20 additions & 0 deletions doc/html5lib.treeadapters.rst
@@ -0,0 +1,20 @@
treeadapters Package
====================

:mod:`~html5lib.treeadapters` Package
-------------------------------------

.. automodule:: html5lib.treeadapters
:members:
:undoc-members:
:show-inheritance:

.. automodule:: html5lib.treeadapters.genshi
:members:
:undoc-members:
:show-inheritance:

.. automodule:: html5lib.treeadapters.sax
:members:
:undoc-members:
:show-inheritance:
8 changes: 4 additions & 4 deletions doc/html5lib.treewalkers.rst
@@ -10,7 +10,7 @@ treewalkers Package
:show-inheritance:

:mod:`base` Module
-------------------
------------------

.. automodule:: html5lib.treewalkers.base
:members:
@@ -34,7 +34,7 @@ treewalkers Package
:show-inheritance:

:mod:`etree_lxml` Module
-----------------------
------------------------

.. automodule:: html5lib.treewalkers.etree_lxml
:members:
@@ -43,9 +43,9 @@ treewalkers Package


:mod:`genshi` Module
--------------------------
--------------------

.. automodule:: html5lib.treewalkers.genshi
:members:
:undoc-members:
:show-inheritance:
:show-inheritance:
102 changes: 29 additions & 73 deletions doc/movingparts.rst
@@ -4,22 +4,25 @@ The moving parts
html5lib consists of a number of components, which are responsible for
handling its features.

Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
Several tree representations are supported, as are translations to other formats via *tree adapters*.
The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.

Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. There are three main types of trees that
html5lib can build:
the user can later access. html5lib can build three types of trees:

* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
which can be found in the standard library. Whenever possible, the
accelerated ``ElementTree`` implementation (i.e.
``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on ``xml.dom.minidom``.
* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.

* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
API. The performance gains are relatively small compared to using the
accelerated ``ElementTree`` module.
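
To make the difference between the first two tree types concrete, here is a minimal sketch that uses only the standard library rather than html5lib. It assumes well-formed markup, because the stdlib XML parsers do not perform the HTML error recovery that html5lib does; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Well-formed markup only: this sketch does not use html5lib's parser.
markup = "<p>Hello <b>World</b>!</p>"

# ElementTree style: elements expose .tag, .text and .tail.
etree_root = ET.fromstring(markup)
etree_bold = etree_root.find("b").text
etree_text = "".join(etree_root.itertext())

# minidom style: W3C DOM nodes with explicit text nodes.
dom_root = minidom.parseString(markup).documentElement
dom_bold = dom_root.getElementsByTagName("b")[0].firstChild.data
```

The same document yields the same content either way; what differs is the API used to reach it, which is the choice the ``treebuilder`` name makes.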

@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
with open("mydocument.html", "rb") as f:
lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When instantiating a parser object, you have to pass a tree builder
class in the ``tree`` keyword attribute:
To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.

.. code-block:: python

import html5lib
parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
document = parser.parse("<p>Hello World!")

To get a builder class by name, use the ``getTreeBuilder`` function:
When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword argument:

.. code-block:: python

import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
TreeBuilder = html5lib.getTreeBuilder("dom")
parser = html5lib.HTMLParser(tree=TreeBuilder)
minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,13 @@ The implementation of builders can be found in `html5lib/treebuilders/
Tree walkers
------------

Once a tree is ready, you can work on it either manually, or using
a tree walker, which provides a streaming view of the tree. html5lib
provides walkers for all three supported types of trees (``etree``,
``dom`` and ``lxml``).
In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
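
As a rough illustration of what a "streaming view" of a tree means, the sketch below walks an ``xml.etree`` element and yields simplified token dictionaries in document order. The ``walk`` function and the reduced token shape are assumptions made for this example; html5lib's own walkers emit richer tokens (for instance, with namespaces).

```python
import xml.etree.ElementTree as ET


def walk(element):
    """Yield simplified tokens for an element and its descendants.

    Hypothetical helper for illustration; not part of html5lib.
    """
    yield {"type": "StartTag", "name": element.tag, "data": dict(element.attrib)}
    if element.text:
        yield {"type": "Characters", "data": element.text}
    for child in element:
        yield from walk(child)
        # Text that follows a child element is stored on the child as .tail.
        if child.tail:
            yield {"type": "Characters", "data": child.tail}
    yield {"type": "EndTag", "name": element.tag}


root = ET.fromstring('<p class="x">Hello <b>World</b>!</p>')
tokens = list(walk(root))
token_types = [token["type"] for token in tokens]
text = "".join(t["data"] for t in tokens if t["type"] == "Characters")
```

Because the walker is a generator, a consumer such as a serializer or filter can process tokens one at a time without a second copy of the tree.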

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.

Walkers make consuming HTML easier. html5lib uses them to provide you
with has a couple of handy tools.

html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes from a token stream, and several filters which manipulate the stream.

HTMLSerializer
~~~~~~~~~~~~~~
@@ -90,15 +83,14 @@ The serializer lets you write HTML back as a stream of bytes.
'>'
'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways, consult
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
documentation.
You can customize the serializer behaviour in a variety of ways. Consult
the :class:`~html5lib.serializer.HTMLSerializer` documentation.


Filters
~~~~~~~

You can alter the stream content with filters provided by html5lib:
html5lib provides several filters:

* :class:`alphabeticalattributes.Filter
<html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +102,11 @@ You can alter the stream content with filters provided by html5lib:
the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
``LintError`` exceptions on invalid tag and attribute names, invalid
:exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
Review comment (Contributor): Is it really an AssertionError? If so, we should write up an issue to change that.

Reply (Contributor Author): Yeah, the implementation is basically all assert statements:

        assert namespace is None or isinstance(namespace, text_type)
        assert namespace != ""
        assert isinstance(name, text_type)
        assert name != ""
        assert isinstance(token["data"], dict)
        if (not namespace or namespace == namespaces["html"]) and name in voidElements:
            assert type == "EmptyTag"
        else:
            assert type == "StartTag"
        if type == "StartTag" and self.require_matching_tags:
            open_elements.append((namespace, name))
        for (namespace, name), value in token["data"].items():
            assert namespace is None or isinstance(namespace, text_type)
            assert namespace != ""
            assert isinstance(name, text_type)
            assert name != ""
            assert isinstance(value, text_type)
    elif type == "EndTag":
        namespace = token["namespace"]
        name = token["name"]
        assert namespace is None or isinstance(namespace, text_type)
        assert namespace != ""
        assert isinstance(name, text_type)
        assert name != ""
        if (not namespace or namespace == namespaces["html"]) and name in voidElements:
            assert False, "Void element reported as EndTag token: %(tag)s" % {"tag": name}
        elif self.require_matching_tags:
            start = open_elements.pop()
            assert start == (namespace, name)
    elif type == "Comment":
        data = token["data"]
        assert isinstance(data, text_type)
    elif type in ("Characters", "SpaceCharacters"):
        data = token["data"]
        assert isinstance(data, text_type)
        assert data != ""
        if type == "SpaceCharacters":
            assert data.strip(spaceCharacters) == ""
    elif type == "Doctype":
        name = token["name"]
        assert name is None or isinstance(name, text_type)
        assert token["publicId"] is None or isinstance(name, text_type)
        assert token["systemId"] is None or isinstance(name, text_type)
    elif type == "Entity":
        assert isinstance(token["name"], text_type)
    elif type == "SerializerError":
        assert isinstance(token["data"], text_type)
    else:
        assert False, "Unknown token type: %(type)s" % {"type": type}
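
For readers following this thread, the behaviour under discussion can be reproduced with a much smaller sketch. It is not html5lib's code: ``lint`` here is a hypothetical generator that wraps a stream of simplified token dictionaries, and the token shapes are assumptions made for illustration.

```python
def lint(tokens):
    """Pass tokens through unchanged, asserting basic invariants on each one.

    Hypothetical stand-in for an assert-based lint filter.
    """
    for token in tokens:
        kind = token["type"]
        if kind in ("StartTag", "EndTag"):
            assert isinstance(token["name"], str)
            assert token["name"] != ""
        elif kind == "Characters":
            assert isinstance(token["data"], str)
        else:
            assert False, "Unknown token type: %s" % kind
        yield token


good = [
    {"type": "StartTag", "name": "p"},
    {"type": "Characters", "data": "hi"},
    {"type": "EndTag", "name": "p"},
]
bad = [{"type": "StartTag", "name": ""}]

# A valid stream passes through untouched.
passed = list(lint(good))

# An invalid token surfaces as AssertionError, not a dedicated exception type.
try:
    list(lint(bad))
    caught = False
except AssertionError:
    caught = True
```

One consequence of this style is that Python strips ``assert`` statements when run with ``-O``, so a filter written this way performs no checks in optimized mode.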

PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
removes tags from the stream which are not necessary to produce valid
removes tags from the token stream which are not necessary to produce valid
HTML

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +117,9 @@ You can alter the stream content with filters provided by html5lib:

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
collapses all whitespace characters to single spaces unless they're in
``<pre/>`` or ``textarea`` tags.
``<pre/>`` or ``<textarea/>`` tags.

To use a filter, simply wrap it around a stream:
To use a filter, simply wrap it around a token stream:

.. code-block:: python

@@ -142,9 +134,11 @@ To use a filter, simply wrap it around a stream:
Tree adapters
-------------

Used to translate one type of tree to another. More documentation
pending, sorry.
Tree adapters can be used to translate between tree formats.
Two adapters are provided by html5lib:

* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.

Encoding discovery
------------------
@@ -156,54 +150,16 @@ the following way:
* The encoding may be explicitly specified by passing the name of the
encoding as the encoding parameter to the
:meth:`~html5lib.html5parser.HTMLParser.parse` method on
``HTMLParser`` objects.
:class:`~html5lib.html5parser.HTMLParser` objects.

* If no encoding is specified, the parser will attempt to detect the
encoding from a ``<meta>`` element in the first 512 bytes of the
document (this is only a partial implementation of the current HTML
5 specification).
specification).

* If no encoding can be found and the chardet library is available, an
* If no encoding can be found and the :mod:`chardet` library is available, an
attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually
`Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
a common fallback used by Web browsers.
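
The fallback order described in this list can be sketched as a small function. This is a simplified illustration under stated assumptions, not html5lib's implementation: ``guess_encoding``, its parameters, and the regular expression are hypothetical, and the ``<meta>`` scan is far cruder than a real prescan. Only ``chardet.detect`` is an actual library call, and it is used only if chardet happens to be installed.

```python
import re

# Crude pattern for a charset declaration inside a <meta> tag (assumption:
# real prescan rules are considerably more involved).
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def guess_encoding(data, override=None, use_chardet=True, default="windows-1252"):
    """Return an encoding name for the bytes in data (hypothetical helper)."""
    # 1. An explicitly specified encoding wins.
    if override:
        return override
    # 2. Look for a <meta> declaration in the first 512 bytes only.
    match = META_CHARSET.search(data[:512])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Sniff the byte pattern if chardet is available.
    if use_chardet:
        try:
            import chardet
        except ImportError:
            chardet = None
        if chardet is not None:
            detected = chardet.detect(data)["encoding"]
            if detected:
                return detected
    # 4. Fall back to the default.
    return default


explicit = guess_encoding(b"<p>x", override="utf-8")
from_meta = guess_encoding(b'<meta charset="UTF-8"><p>x', use_chardet=False)
too_late = guess_encoding(b" " * 600 + b'<meta charset="utf-8">', use_chardet=False)
```

The last call shows the effect of the 512-byte window: a declaration that appears later in the document is ignored and the default is returned.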


Tokenizers
----------

The part of the parser responsible for translating a raw input stream
into meaningful tokens is the tokenizer. Currently html5lib provides
two.

To set up a tokenizer, simply pass it when instantiating
a :class:`~html5lib.html5parser.HTMLParser`:

.. code-block:: python

import html5lib
from html5lib import sanitizer

p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
p.parse("<p>Surprise!<script>alert('Boo!');</script>")

HTMLTokenizer
~~~~~~~~~~~~~

This is the default tokenizer, the heart of html5lib. The implementation
can be found in `html5lib/tokenizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.

HTMLSanitizer
~~~~~~~~~~~~~

This is a tokenizer that removes unsafe markup and CSS styles from the
input. Elements that are known to be safe are passed through and the
rest is converted to visible text. The default configuration of the
sanitizer follows the `WHATWG Sanitization Rules
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

The implementation can be found in `html5lib/sanitizer.py
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
24 changes: 17 additions & 7 deletions html5lib/__init__.py
@@ -1,14 +1,23 @@
"""
HTML parsing library based on the WHATWG "HTML5"
specification. The parser is designed to be compatible with existing
HTML found in the wild and implements well-defined error recovery that
HTML parsing library based on the `WHATWG HTML specification
<https://whatwg.org/html>`_. The parser is designed to be compatible with
existing HTML found in the wild and implements well-defined error recovery that
is largely compatible with modern desktop web browsers.

Example usage:
Example usage::

import html5lib
f = open("my_document.html")
tree = html5lib.parse(f)
import html5lib
with open("my_document.html", "rb") as f:
tree = html5lib.parse(f)

For convenience, this module re-exports the following names:

* :func:`~.html5parser.parse`
* :func:`~.html5parser.parseFragment`
* :class:`~.html5parser.HTMLParser`
* :func:`~.treebuilders.getTreeBuilder`
* :func:`~.treewalkers.getTreeWalker`
* :func:`~.serializer.serialize`
"""

from __future__ import absolute_import, division, unicode_literals
@@ -22,4 +31,5 @@
"getTreeWalker", "serialize"]

# this has to be at the top level, see how setup.py parses this
#: Distribution version number.
__version__ = "0.9999999999-dev"
5 changes: 5 additions & 0 deletions tox.ini
@@ -11,7 +11,12 @@ deps =
base: webencodings
py26-base: ordereddict
optional: -r{toxinidir}/requirements-optional.txt
doc: Sphinx

commands =
{envbindir}/py.test {posargs}
{toxinidir}/flake8-run.sh

[testenv:doc]
changedir = doc
commands = sphinx-build -b html . _build