Skip to content

Encoding failures with unicode in paths in PY2 #35

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ankostis opened this issue Oct 23, 2016 · 8 comments
Closed

Encoding failures with unicode in paths in PY2 #35

ankostis opened this issue Oct 23, 2016 · 8 comments

Comments

@ankostis
Copy link
Contributor

ankostis commented Oct 23, 2016

Having python-2 working correctly (both in Linux and Windows) with unicodes in filepaths (and process-streams, but I don't think there are processes used in this project) is increasingly difficult, e.g. when running with unicodes in TEMPDIR env-var, as seen in this travis job, and discovered initially in gitpython issue #543.

The way to deal with such issues comes from PY3, using pep383 and the error='surrogateescape' error-handler when dencoding filepaths and process-streams. Unfortunately in python-2 there is no such codec error-handler.

Happily, the future projects provides a backported implementation.

I suggest to start depending this future project, for all git-python projects; in the end,
a lot of compatibility code can be substituted. But note that the POV for this project is the opposite from 2to3: you write PY3 code and you make sure it runs on PY2.

In any case, PY3 that is the future...

@Byron
Copy link
Member

Byron commented Oct 23, 2016

If I understand you correctly, GitDB needs the same fix as is already present in GitPython. To keep things minimal, what do you think about moving this implementation to GitDB and re-using it in GitPython to avoid unnecessary duplication ?

As an anecdote of the past, back in the days when I ported GitPython from py2 to py2|3, I started out using libraries which were supposed to make it easy, and pulled them in as a dependency. Even though I ended up with a version that worked, performance was crippled.
So I took back a step and implemented myself only what I needed, and ended up with only a minimal degradation of performance on py3, which I believe is just due to py3 itself.
That is the reason why I would rather not start depending on the future project, at least not if it is only for the surrogate escape.

@ankostis
Copy link
Contributor Author

ankostis commented Oct 23, 2016

No, I'm not suggesting future just for the surrogateescape.
I've used it before, and at least a year ago it has this magical thing, that it backports bytes and str in PY2, making them non-compatible, the same way as PY3 does - eventually you write PY3 code with safety-belts in PY2 that do not allow you to mix these 2 types, and end up os.path.join() do implicit encodings and complain if paths are non-ascii.

Besides these 2 types, it has a lot of other backports to offer (eg unittest.mock and other stuff that we alreayd start to satisfy bringing disparate packages), that do not not affect PY3 performance; I have no clue whether they degrade PY2 to be frank. In general, PY3 performance is superior to PY2 for all other issues than "dencodings", mostly due to core (eg; dict) enhancements, so I could not evaluate safely PY2 slowdowns - to be more frank, I didn't care that much.

I also hate proliferation of dependencies, but given that this project needs a great effort for PY2/3 compatibility, I believe ti will pay out in the end. But of course we can keep thins as they are, and I'm really running out of time for this. So if you do not want to support this, no prob at all.

@Byron
Copy link
Member

Byron commented Oct 23, 2016

Back in the days when I tried one of these compatibility project (I am vague, as I don't know if it was future or not), these things were relatively new. It could very well be that negative performance implications are negligible.
Therefore I would be by no means opposing bring in future.

You mentioned that it solves issues with os.path.join(), which I already ran into recently. There it could be solved with surrogateescape, which to me seems like the prime solution to everything.
In short, if something is a path, GitPython currently tries to make it into a string/unicode object. Previously, it used 'replace', now it does that with 'surrogateescape'. Thus functions that operate on paths would always get what they need, without degenerating information during the bytes->unicode conversion.

For all I am saying: If you think you it will solve major problems and make everything easier, please feel free to bring it in. I would prefer this to happen in a PR though, just to keep an eye out for the performance tests.

@ankostis
Copy link
Contributor Author

ankostis commented Oct 23, 2016

...you mentioned that it solves issues with os.path.join(), which I already ran into recently. There it could be solved with surrogateescape...

I wished it was that simple.
You see, the real problem is the "implicit" mixing of bytes with unicodes strings. If you haven't elimentated all mixing, surrogates can get you even further into trouble, because they will break later, when some half-bytes + half-surrogate-escaped unicode-string exits the app.

Now the typical way to fix PY2/3 encoding problems is this:

  • identify meretriciously all path-bytes at app-entrance points and decode them, and
  • use unicode-literals everywhere in the sources, so there are no implicit operations on bytes.

But if you miss an entrance-point, mixing will still happen in PY2 (PY3 screams). And you won't know it, unless you feed non-ascii unicode bytes from this entrance-point (ie with some TC); and that is not an exact science.

This where future steps in, and forbids mixing in the first place.

Frankly, I cannot get involved in this now, my attention is on the "leaks" of these library. It is good nevertheless to have discussed about it, if you or anyone else finds the time to implement such a strict separation.

@ankostis
Copy link
Contributor Author

@Byron can you explain these lines why they compare to the same value twice?:

@Byron
Copy link
Member

Byron commented Oct 24, 2016

@ankostis I believe this is a bug. The first line taking the last 20bytes of the file is actually the indexfile checksum. However, the 20bytes before those are indeed the packfile checksum.

| +--------------------------------+
| | packfile checksum              |
| +--------------------------------+
| | idxfile checksum               |
| +--------------------------------+

The text above was copied from the git pack file docs.
Please feel free to correct this - great catch !!

@hugovk
Copy link
Contributor

hugovk commented Aug 19, 2021

Can this be closed now support for EOL Python 2 has been dropped? df73d7f

@Byron
Copy link
Member

Byron commented Aug 19, 2021

Thanks for reviving this one to close it forever!

@Byron Byron closed this as completed Aug 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants