Skip to content

Preserve order of attributes on serialization #37

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gsnedders opened this issue May 4, 2013 · 4 comments
Closed

Preserve order of attributes on serialization #37

gsnedders opened this issue May 4, 2013 · 4 comments
Assignees
Milestone

Comments

@gsnedders
Copy link
Member

From Google Code #153:

Reported by @fantasai, Jun 1, 2010

What steps will reproduce the problem?
Parse an XHTML file containing attributes in unsorted order with lxml and reserialize.

What is the expected output? What do you see instead?
Expect no change.
Got attributes in alphabetical order, which makes the source harder to read (since the order was chosen to optimize readability, e.g. listing the fixed-length rel="stylesheet" before variable-length href="..."). This also makes it harder to understand diffs, since there's a lot of unnecessary changes to the source output.

Ideally, html5lib would remember the order of attributes and reserialize in that order. lxml does remember the order, so removing the attrs.sort() line in htmlserializer.py is adequate to fix the problem for serializing an lxml tree.

Jul 20, 2010 geoffers
AFAIK the reason for the sort being there is so that there is a guaranteed order even when a tree-builder with no guaranteed order is being used.

May 22, 2011 geoffers
There's no real way to fix this without relying upon defined-to-be-undefined behaviour in CPython/lxml, and as such I'm reluctant to do so. lxml says attributes are given in an arbitrary order, and they are stored in a dict which CPython makes no guarantee of the order of. (lxml does always insert attributes in document order into the dict, and dicts are ordered by insertion order, so it does actually work… for now, at least).

Yes, we could go against both the lxml/CPython documentation and rely upon the ordering, but if either ever changes their behaviour, it could mean html5lib could potentially start serializing the same lxml parse-tree in random ways, and I'd much rather go for the definitely-consistent route we have now.

@gsnedders
Copy link
Member Author

On the whole I think we should get rid of the sorted call that is there now — however, it's not quite that simple as that, as the serializer tests rely on alphabetical order, so they'll need some filter that ensures this.

Note, also, that no tree type we support preserves order: lxml explicitly returns them in an "arbitrary order", as such there is no guarantee the treewalker won't change iteration order (and it probably does, for it just returns a dict).

@ghost ghost assigned gsnedders May 4, 2013
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 4, 2013
This doesn't do anything about the fact that none of our
treebuilders preserve attribute order: it merely avoids the
serializer reordering them from the order it receives them in.

This changes the serializer tests to use an OrderedDict to get
alphabetical order so they continue to meet their expectations.
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 4, 2013
This doesn't do anything about the fact that none of our
treebuilders preserve attribute order: it merely avoids the
serializer reordering them from the order it receives them in.

This changes the serializer tests to use an OrderedDict to get
alphabetical order so they continue to meet their expectations.
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 4, 2013
@baldwint
Copy link

baldwint commented Jul 3, 2013

Another aspect to this issue is that attribute order is shuffled on parse, not just on serialization. The tokenizer preserves attribute ordering, but that is undone in the HTMLParser class by running them all through normalizeToken:

    def normalizeToken(self, token):
        """ HTML5 specific normalizations to the token stream """

        if token["type"] == tokenTypes["StartTag"]:
            token["data"] = dict(token["data"][::-1])

        return token

To achieve good ordering throughout the parse/reserialize roundtrip, this dict would also need to be changed to OrderedDict.

@gsnedders
Copy link
Member Author

Is that the only change needed? I was suspecting it was going to be more than that!

@baldwint
Copy link

baldwint commented Jul 3, 2013

I think it is necessary but not sufficient. When I changed it to OrderedDict I still got shuffled attributes, but that may be due to my treebuilder (I'm using BeautifulSoup 4). In any case, no other treebuilder would have the ability to preserve order if html5lib passes attributes as a dict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants