Skip to content

Add sort_keys parameter to Packer for stable dict packing. #164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

Vayu
Copy link

@Vayu Vayu commented Dec 17, 2015

It is often useful to have reproducible serialization. It allows, for instance, efficient comparison of records coming from different sources.

simplejson provides an option sort_keys, which makes serialization of dictionaries reproducible. This patch adds the same option to msgpack Packer class.

Same as for simplejson, allows reproducible serialization of dictionaries.
@methane
Copy link
Member

methane commented Dec 26, 2015

In msgpack, sort_keys doesn't enough for "canonicalize". There are some valid serialization for same data.

msgpack-python is too complex already.
I don't make it more complicated for "python only" feature. So I won't support such feature until mspgack has spec or recommendation about it.

@methane methane closed this Dec 26, 2015
@Vayu
Copy link
Author

Vayu commented Dec 26, 2015

This whole repository is a "python only" feature, since it is a python package.

You claim compatibility with simplejson. It is impossible to add new features without adding complexity, the question is whether the feature is useful enough to justify it.

@methane
Copy link
Member

methane commented Dec 26, 2015

You're right.
And I think sort_keys is less useful than JSON since I can't diff -u a.mpack b.mpack.

@Vayu
Copy link
Author

Vayu commented Dec 26, 2015

sort_keys has nothing to do with diff -u for JSON (that would be indent BTW).
In both JSON and msgpack sort_keys will make diff -q a.mpack b.mpack work.

@methane
Copy link
Member

methane commented Dec 26, 2015

sort_keys has nothing to do with diff -u for JSON (that would be indent BTW).

sort_keys and indent.

This whole repository is a "python only" feature, since it is a python package.

But msgpack is cross-language format.
You can't compare two msgpack from other languages even if the language supports sort_keys since numbers may be represented in other formats.

I don't want to add feature only for comparing two Python data.
There should be better way other than msgpack.

@Vayu
Copy link
Author

Vayu commented Dec 26, 2015

You cannot compare JSON files either because you don't know if sort_keys have been used or if strings have been encoded differently (e.g. "@" vs "\u0040").

Yet the option exists and is useful. So one can serialize in a way that at least the same code (or number of compatible codes) will always produce the same output.

The fact that it may not be always possible should not be a reason for dismissal. BTW this PR is not called "canonicalize", it is called "sort_keys".

@methane
Copy link
Member

methane commented Dec 26, 2015

You cannot compare JSON files either because you don't know if sort_keys have been used or if strings have been encoded differently (e.g. "@" vs "\u0040").

Even so, diff -u is useful for json serialized by sort_keys and indent.

So one can serialize in a way that at least the same code (or number of compatible codes) will always produce the same output.

Same version of same encoder is required. And if you have trouble, investigating it is very hard.

The fact that it may not be always possible should not be a reason for dismissal.

  • It encourage to misuse of msgpack
  • It may make maintaining msgpack-python hard.
  • It may make supporting msgpack-python users hard.

BTW this PR is not called "canonicalize", it is called "sort_keys".

If canonicalize is not purpose, how sort_keys is useful??
In JSON case, python -m json.tool < data.json | less is useful. (While I use jq . for the purpose in these days.)

@methane
Copy link
Member

methane commented Dec 26, 2015

bencode may be stable serialize format. How about it?

@Vayu
Copy link
Author

Vayu commented Dec 26, 2015

It encourage to misuse of msgpack

I was under the impression that msgpack and this package are advertised as a drop-in replacement of json. Python json package supports sort_keys, msgpack doesn't.

If canonicalize is not purpose, how sort_keys is useful??

Canonicalize implies universality, which as you correctly pointed out is impossible. It is not standardized and even if it was, it would be not mandatory therefore one cannot rely on it.

sort_keys does 2 things

  1. Improves compatibility with json package
  2. Gives limited optional reproducibility useful for things like unit-tests and caching. (In tests one is sure that encoder was the same. In caching nothing would break if files got actually different once in a while).
  • it doesn't make code slower
  • it doesn't break compatibility with other msgpack implementations

bencode may be stable serialize format. How about it?

For bencode stability is not just useful it is crucial and mandatory. That is not possible for msgpack since one cannot be both efficient and mandatorily stable. Still doesn't mean that optional limited stability is not useful.

@methane
Copy link
Member

methane commented Dec 26, 2015

I was under the impression that msgpack and this package are advertised as a drop-in replacement of json. Python json package supports sort_keys, msgpack doesn't.

I don't advertise this project as complete drop-in replacement.
I don't support indent option. So API compatibility is not enough reason to support sort_keys.

Gives limited optional reproducibility useful for things like unit-tests and caching.

I don't recommend to use msgpack for unittest since msgpack is less human readable.
If your test fails, it's hard to investigate why two serialized data is different than JSON.

  • it doesn't make code slower

It can disturb optimization in the future. For example, it makes more hard to reimplement in C instead of Cython.

Additionally, people may misuse the option and ask "Why my code doesn't work?".
Investigating difference of binary data is hard job and there are so many pitfalls.

@Vayu
Copy link
Author

Vayu commented Dec 26, 2015

Not recommend and make it impossible are two different things. Libraries are supposed to give people options, because one can misuse pretty much anything anyway.

I can hardly argue with what may happen in the future or what people may ask or how they may misuse the code. Anything can happen, but most of this things are unlikely to happen.

@methane
Copy link
Member

methane commented Dec 27, 2015

Not recommend and make it impossible are two different things.

CESU-8 and tail call optimization are not supported by Python since Python doesn't recommend them.

Libraries are supposed to give people options, because one can misuse pretty much anything anyway.

Only when "how to use the option in right way" is easy to explain.
Otherwise, hundreds of options used by only few people hides important options.

@methane
Copy link
Member

methane commented Dec 27, 2015

There are idea relating to your idea.
https://github.com/msgpack/msgpack/blob/master/spec.md#future-profiles
Could you post an issue on msgpack to discuss about it?

I'll discuss about:

  • Which type is allowd for map keys (only strings? binary and integer are allowed?)
  • Should encoders sort keys of all map in msgpack (your implementation doesn't sort OrderedDict).
  • Best practice to build reproducible serialize

Until consensus built, I think msgpack shouldn't support it.

@Vayu
Copy link
Author

Vayu commented Dec 27, 2015

I read the spec again. The wording is not 100% clear, but since the same language is used for both arrays and maps, it is logical to assume that maps in msgpack are ordered sequences.

Array format family stores a sequence of elements in 1, 3, or 5 bytes of extra bytes in addition to the elements.
Map format family stores a sequence of key-value pairs in 1, 3, or 5 bytes of extra bytes in addition to the key-value pairs.

In this light, I would say that msgpack-python is misusing msgpack format to store unordered python dictionaries in ordered map type. And even worse OrderedDict is deserialized as dict with loss of information.

So until msgpack has unordered map type, sort_keys is pure python feature, because it is up to implementation how to serialize unordered structure into ordered.

To clarify ordered property I created an issue msgpack/msgpack#210

ligurio added a commit to tarantool/tarantool-python that referenced this pull request Jan 18, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
ligurio added a commit to tarantool/tarantool that referenced this pull request Jan 19, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
ligurio added a commit to tarantool/tarantool that referenced this pull request Jan 19, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
ligurio added a commit to tarantool/tarantool that referenced this pull request Jan 19, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Jan 19, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Jan 22, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Jan 26, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Jan 27, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Feb 1, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Feb 8, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Feb 10, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Mar 3, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Mar 3, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Mar 9, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio added a commit to tarantool/tarantool that referenced this pull request Mar 9, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
Part of #5652
ligurio added a commit to tarantool/tarantool that referenced this pull request Mar 12, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3] and changes in test-run's commit
853c9532c56e523da68030a1b7956c2e753b917e "python3: fix printing of called box
command".

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
Part of #5652
ligurio added a commit to tarantool/tarantool that referenced this pull request Mar 13, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio pushed a commit to tarantool/tarantool that referenced this pull request Mar 13, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Follows up #5538
ligurio pushed a commit to tarantool/tarantool that referenced this pull request Mar 13, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Part of #5562
Follows up #5538
ligurio pushed a commit to tarantool/tarantool that referenced this pull request Mar 13, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Part of #5562
Follows up #5538
ligurio pushed a commit to tarantool/tarantool that referenced this pull request Mar 14, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Part of #5652
Follows up #5538
Totktonada added a commit to tarantool/tarantool that referenced this pull request Mar 15, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Part of #5652
Follows up #5538
Totktonada added a commit to tarantool/tarantool that referenced this pull request Mar 15, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Part of #5652
Follows up #5538

(cherry picked from commit 45972c9)
Totktonada added a commit to tarantool/tarantool that referenced this pull request Mar 15, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Part of #5652
Follows up #5538

(cherry picked from commit 45972c9)
Totktonada added a commit to tarantool/tarantool that referenced this pull request Mar 15, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Part of #5652
Follows up #5538

(cherry picked from commit 45972c9)
Totktonada added a commit to tarantool/tarantool that referenced this pull request Mar 16, 2021
Response object is a unpacked msgpack structure, returned by Tarantool.
Nor msgpack-python [1] that used in tarantool-python nor msgpack [2]
itself cannot guarantee an order of keys in unpacked dictionaries.
Therefore we have different keys order with running tests under Python 2
and Python 3, for example box-py/call.test.py. To workaround a problem
proposed a conversion of dictionaries in a tuple to lists before
printing Response object.

Requires changes in tarantool-python [3].

1. msgpack/msgpack-python#164
2. msgpack/msgpack#215
3. tarantool/tarantool-python#186

Part of #5652
Follows up #5538

(cherry picked from commit 45972c9)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants