Debugger Does Not Handle Unicode/UTF-8 Characters Properly #1392
Comments
Hi @mdowst, thanks for opening an issue! We've been aware of an ongoing encoding issue, but it's been tricky to work out where the problem lies. Hopefully this issue will shed some more light. I have a couple of questions if that's ok:
|
To add to @rjmholt's helpful tips: The likeliest explanation is that your script file is saved as UTF-8 without a BOM, which Windows PowerShell misreads as "ANSI"-encoded. The solution is to always use UTF-8 with BOM as the character encoding, as both Windows PowerShell and PowerShell Core interpret that correctly. The tricky part is that modern editors such as Visual Studio Code create BOM-less UTF-8 files by default, so you have to remember to explicitly change the encoding to UTF-8 with BOM. |
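To make the "UTF-8 with BOM" advice concrete, here is a minimal sketch of checking a script file for a BOM and adding one if it is missing. It is written in Python purely to show the bytes involved; the helper names `has_utf8_bom` and `add_utf8_bom` and the demo file path are hypothetical and not part of the extension or of PowerShell.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def add_utf8_bom(path):
    """Rewrite a BOM-less UTF-8 file so that it starts with a BOM (hypothetical helper)."""
    if has_utf8_bom(path):
        return  # already has a BOM, nothing to do
    # newline="" keeps the file's line endings untouched.
    with open(path, "r", encoding="utf-8", newline="") as f:
        text = f.read()
    # The "utf-8-sig" codec writes the BOM before the encoded text.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)


# Demo: create a BOM-less script like the one in this thread, then add a BOM.
demo_path = os.path.join(tempfile.mkdtemp(), "t.ps1")
with open(demo_path, "w", encoding="utf-8", newline="") as f:
    f.write('"Greek sigma symbol: \u2211"')

before = has_utf8_bom(demo_path)
add_utf8_bom(demo_path)
after = has_utf8_bom(demo_path)
print(before, after)  # False True
```

In an editor, the same result is normally achieved by re-saving the file with the "UTF-8 with BOM" encoding rather than by running a script.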
If what I suspect is the true source of the problem, the behavior is not a bug, but by design. In Windows PowerShell you've always had to either use "ANSI"-encoded source code files (for characters in the system-locale extended-ASCII range only) or one of the standard Unicode encodings with BOM in order for non-ASCII characters in string literals to be recognized correctly. You can reproduce the problem as follows:
# Note: [IO.File]::WriteAllText() writes UTF-8 files *without BOM* by default.
WinPS> [IO.File]::WriteAllText("$PWD/t.ps1", '"Greek sigma symbol: ∑"'); ./t.ps1
Greek sigma symbol: âˆ‘
The 3 bytes that make up the UTF-8 encoded |
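The garbling can be reproduced at the byte level. The sketch below (Python, for illustration only) encodes the summation symbol as UTF-8 and then decodes those bytes as Windows-1252. It assumes the "ANSI" code page is CP1252, as on a typical Western-European or US-English system; other system locales would produce different garbage.

```python
# Illustration only: what happens when UTF-8 bytes are read as Windows-1252.
text = "\u2211"                         # N-ARY SUMMATION, the symbol from this issue
utf8_bytes = text.encode("utf-8")       # the bytes a BOM-less UTF-8 file contains
misread = utf8_bytes.decode("cp1252")   # what an "ANSI" (CP1252) reader makes of them

print(utf8_bytes.hex(" "))  # e2 88 91
print(len(misread))         # 3 (one wrong character per byte)
```

Each of the three bytes is mapped to its own CP1252 character, which is why a single symbol turns into three unrelated ones.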
@rjmholt, thanks for your reply. I have provided the answers to your questions in-line below.
Please let me know if I can provide any additional information or testing. |
Ah, I've worked it out. It's the summation character, not capital sigma (I know you said that, but my mind oversimplified 😄). In UTF-8 that's encoded as three bytes. So as in other scenarios I've seen, the copied glyph is saved as UTF-8 and then PowerShell itself is seeing the bytes as CP1252, causing this problem. In your scenario, it might be worth trying to set the integrated console's default encoding to UTF-8. I think this should work:
[Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
$PSDefaultParameterValues['*:Encoding'] = 'utf8'
But @mklement0 might have better advice there. @tylerl0706 I'm thinking we should look into how to make our hosted PowerShell environment default to UTF-8 encoding... I think that might be the issue plaguing EditorServices here |
Naturally I realise now that @mklement0 was way ahead of me here. But anyway... Yeah, despite the rest of the work going for BOM-less UTF-8, I guess our default for Windows PowerShell should be UTF-8-with-BOM, and for PS Core we should make an informed decision... |
Presumably, that is because it is an in-memory operation based on strings rather than script files using a specific encoding.
As explained, this happens when the file is UTF-8-encoded but lacks a BOM: You can tell by the status bar in VSCode stating just UTF-8. This is purely a PowerShell engine issue: It is about what character encoding it assumes when it reads a script file. Output settings such as
Yes, the extension should be configured to default to encoding
Defaulting to That said, the presence of a BOM on Unix platforms can cause problems when external tools process such files - not sure if that's a real-world concern. |
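The BOM-dependent behavior described in this comment can be sketched as a small decoder. This is a simplified Python model, not PowerShell's actual implementation: the function name `decode_script_bytes` is hypothetical, only the UTF-8 BOM is handled (Windows PowerShell also recognizes the other standard Unicode encodings with BOM), and the CP1252 fallback is an assumption that depends on the system locale.

```python
import codecs


def decode_script_bytes(data, fallback="cp1252"):
    """Decode script bytes: honor a UTF-8 BOM if present, else use the fallback code page."""
    if data.startswith(codecs.BOM_UTF8):
        # Strip the BOM and decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: assume the legacy "ANSI" code page (assumed to be CP1252 here).
    return data.decode(fallback)


source = '"\u2211"'
with_bom = codecs.BOM_UTF8 + source.encode("utf-8")
without_bom = source.encode("utf-8")

print(decode_script_bytes(with_bom) == source)     # True
print(decode_script_bytes(without_bom) == source)  # False
```

The same bytes round-trip correctly only when the BOM is present, which matches the F5 behavior reported in this issue for BOM-less files.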
And just to clarify: They do matter when piping data to and from external programs. To recap from PowerShell/PowerShell#3819 (comment), here's the command needed to make a console fully UTF-8 aware (on Windows):
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
As an aside: In PowerShell Core, this shouldn't be necessary, but currently (v6.1.0-preview.3) still is on Windows: see |
Just for reference, here are other issues that I think have the same root cause as this one:
@mklement0 as you say, this is an issue with Windows PowerShell and there's no simple way to get around it. But, as an extension, we should do our best to handle this, or at least pad around it where we can. Spitballing some things we could try doing to improve the situation:
@tylerl0706, @rkeithhill, @SeeminglyScience, @mklement0 any other ideas here? |
If that's the direction, I'd prefer a preference key. If Core is identifying encoding properly, it could cause unnecessary overhead there. |
To be honest, it's not really my true preference. But managing this issue is tricky, since it's clearly behaving pretty badly (not so much in this issue as in others). Hopefully we can open up the discussion a bit, as well as work out where exactly we need to deal with this problem. |
@TheIncorrigible1: My guess is that the performance impact (of looking for a BOM and selecting the encoding based on that, assuming that's what you meant) is negligible, but I've since noticed that even just using UTF8-with-BOM files with other editors on Unix platforms is problematic:
For that reason alone I now think we should not use a BOM by default when we create new files for PowerShell Core. For Windows PowerShell, however, we should. |
@rjmholt: Here are my thoughts on what the extension should and shouldn't do:
Let me know if that makes sense and/or if I missed something. |
Here you can use the following command to avoid the special-character problem without changing the encoding.
I even tried changing the encoding, but in the end it only worked for the terminal, not at the core level. |
System Details
$PSVersionTable:
PSVersion : 5.0.10586.117
PSCompatibleVersions : 1.0 2.0 3.0 4.0 5.0.10586.117
BuildVersion : 10.0.10586.117
CLRVersion : 4.0.30319.42000
WSManStackVersion : 3.0
PSRemotingProtocolVersion : 2.3
SerializationVersion : 1.1.0.1
Issue Description
I've experienced an issue with the way the debugger handles non-ASCII characters. If I create a script with a Unicode/UTF-8 character in it, for example the sigma symbol "∑", and press F5 to run the script through the debugger, it translates the symbol like this: "âˆ‘". If I highlight the text and run it using F8, it displays the characters correctly.
I've tested this on 3 different machines. One is Windows Server 2012 R2, whose system details I included here. I also tested it on Windows Server 2016 with the same versions of VS Code and the PowerShell extension, and I saw the same results. However, I also tested it on Windows 10 1709, again with the same versions, and it did not have this issue. The only difference between the systems is that the Windows 10 system lists the architecture as ia32 and the two servers are x64. Also, the Windows 10 system is on PowerShell version 5.1.15063.1088 and the 2016 server is on version 5.1.14393.2248.
Here is an example of the code I am running
Attached Logs
logs.zip