Preserve order of attributes on serialization #37

gsnedders · 2013-05-04T12:06:29Z

Reported by @fantasai, Jun 1, 2010

What steps will reproduce the problem?
Parse an XHTML file containing attributes in unsorted order with lxml and reserialize.

What is the expected output? What do you see instead?
Expect no change.
Got attributes in alphabetical order, which makes the source harder to read (since the order was chosen to optimize readability, e.g. listing the fixed-length rel="stylesheet" before variable-length href="..."). This also makes diffs harder to understand, since there are a lot of unnecessary changes to the source output.

Ideally, html5lib would remember the order of attributes and reserialize in that order. lxml does remember the order, so removing the attrs.sort() line in htmlserializer.py is adequate to fix the problem for serializing an lxml tree.
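To make the difference concrete, here is a minimal sketch of the two behaviours. It is not html5lib's actual serializer: `serialize_start_tag` and its `sort_attributes` flag are hypothetical names, attributes are assumed to arrive as (name, value) pairs in document order, and value escaping is left out for brevity.

```python
def serialize_start_tag(name, attrs, sort_attributes=False):
    """Render a start tag from (name, value) pairs (hypothetical helper).

    With sort_attributes=True this mimics the effect of an attrs.sort()
    call; with False it keeps whatever order it is given.
    Attribute values are not escaped in this sketch.
    """
    items = list(attrs)
    if sort_attributes:
        items.sort()
    parts = [name] + ['%s="%s"' % (key, value) for key, value in items]
    return "<" + " ".join(parts) + ">"


# Attributes in the order the author wrote them.
link_attrs = [("rel", "stylesheet"), ("href", "style.css")]

preserved = serialize_start_tag("link", link_attrs)
sorted_out = serialize_start_tag("link", link_attrs, sort_attributes=True)

print(preserved)   # rel stays first
print(sorted_out)  # href moves first
```

Whether dropping the sort is safe depends on the tree supplying attributes in a stable order, which is the concern raised in the replies below.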

Jul 20, 2010 geoffers
AFAIK the sort is there so that the output order is guaranteed even when the tree-builder in use guarantees no order of its own.

May 22, 2011 geoffers
There's no real way to fix this without relying upon defined-to-be-undefined behaviour in CPython/lxml, and as such I'm reluctant to do so. lxml says attributes are given in an arbitrary order, and they are stored in a dict, whose order CPython does not guarantee. (lxml does always insert attributes in document order into the dict, and dicts are ordered by insertion order, so it does actually work… for now, at least.)

Yes, we could go against both the lxml/CPython documentation and rely upon the ordering, but if either ever changes its behaviour, html5lib could start serializing the same lxml parse-tree in random ways, and I'd much rather go for the definitely-consistent route we have now.

gsnedders · 2013-05-04T12:20:00Z

On the whole I think we should get rid of the sorted call that is there now; however, it's not quite as simple as that, as the serializer tests rely on alphabetical order, so they'll need some filter that ensures this.

Note, also, that no tree type we support preserves order: lxml explicitly returns attributes in an "arbitrary order"; as such, there is no guarantee the treewalker won't change iteration order (and it probably does, for it just returns a dict).

This doesn't do anything about the fact that none of our treebuilders preserve attribute order: it merely avoids the serializer reordering them from the order it receives them in. This changes the serializer tests to use an OrderedDict to get alphabetical order so they continue to meet their expectations.
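A filter of the kind described could look roughly like the sketch below. This is an illustration under assumptions, not the code from the commit: the token shape (a dict with "type", "name" and "data" keys) is simplified, and `alphabetical_attributes` is a hypothetical name.

```python
from collections import OrderedDict


def alphabetical_attributes(tokens):
    """Yield tokens with start-tag attributes re-ordered alphabetically.

    Hypothetical filter: the token layout here is a simplified assumption,
    not necessarily what the treewalkers emit.
    """
    for token in tokens:
        if token["type"] in ("StartTag", "EmptyTag"):
            # Copy so the caller's token is left untouched.
            token = dict(token)
            token["data"] = OrderedDict(sorted(token["data"].items()))
        yield token


stream = [
    {"type": "StartTag", "name": "link",
     "data": OrderedDict([("rel", "stylesheet"), ("href", "style.css")])},
    {"type": "EndTag", "name": "link"},
]

filtered = list(alphabetical_attributes(stream))
print(list(filtered[0]["data"]))  # attribute names in sorted order
```

Putting the sort in a filter keeps the serializer itself order-neutral, while tests that expect alphabetical output can still opt in.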

baldwint · 2013-07-03T10:21:41Z

Another aspect to this issue is that attribute order is shuffled on parse, not just on serialization. The tokenizer preserves attribute ordering, but that is undone in the HTMLParser class by running them all through normalizeToken:

    def normalizeToken(self, token):
        """ HTML5 specific normalizations to the token stream """

        if token["type"] == tokenTypes["StartTag"]:
            token["data"] = dict(token["data"][::-1])

        return token

To achieve good ordering throughout the parse/reserialize roundtrip, this dict would also need to be changed to OrderedDict.
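One detail to watch when making that change: with a plain dict, the `[::-1]` reversal means that when an attribute name is repeated, the first occurrence wins. Passing the reversed list straight to OrderedDict would keep that rule but store the attributes in reverse document order. A minimal sketch of one way to keep both properties follows; `normalize_attributes` is a hypothetical helper, and the input is assumed to be a list of [name, value] pairs as in the snippet above.

```python
from collections import OrderedDict


def normalize_attributes(pairs):
    """Build an ordered mapping from [name, value] pairs (hypothetical helper).

    Keeps document order, and for duplicated names keeps the first
    occurrence, matching what dict(pairs[::-1]) does for duplicates.
    """
    attrs = OrderedDict()
    for name, value in pairs:
        if name not in attrs:
            attrs[name] = value
    return attrs


pairs = [["rel", "stylesheet"], ["href", "a.css"], ["rel", "alternate"]]

ordered = normalize_attributes(pairs)
legacy = dict(pairs[::-1])  # the existing behaviour, order not guaranteed

print(list(ordered.items()))
```

Both versions agree on the attribute values; only the ordered one records the order in which they appeared.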

gsnedders · 2013-07-03T10:48:24Z

Is that the only change needed? I was suspecting it was going to be more than that!

baldwint · 2013-07-03T18:00:25Z

I think it is necessary but not sufficient. When I changed it to OrderedDict I still got shuffled attributes, but that may be due to my treebuilder (I'm using BeautifulSoup 4). In any case, no other treebuilder would have the ability to preserve order if html5lib passes attributes as a dict.

ghost assigned gsnedders May 4, 2013

gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 4, 2013

fixup! Fix html5lib#37: Preserve order of attributes on serialization.

692fa54

gsnedders closed this as completed in 0c99d3a May 16, 2013

gsnedders mentioned this issue Jul 3, 2013

Preserve order of attributes on parsing #86

Open

gsnedders mentioned this issue Nov 3, 2013

Various html5lib improvements #119

Open
