Skip to content

Line endings #285

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 3, 2021
Merged

Line endings #285

merged 6 commits into from
Oct 3, 2021

Conversation

jgfoster
Copy link
Member

Possible fix for #283.

@ianfixes
Copy link
Collaborator

I'm nervous about merging this unless/until I can get some indication of what caused the problem to suddenly emerge -- it's been 3 years since any of the cpp/arduino/avr files were touched and roughly 50 of those files were marked as changed on my machine.

@jgfoster
Copy link
Member Author

Every couple years I spend a day or so trying to figure out Git line endings. It seems that every time I run into problems there is some magical incantation that pushes away the problem for another year or two. I'm pretty confident that the reason the problem just appeared for me is that this was my first clone of arduino_ci on a new (to me) machine and I didn't have the typical global configurations set (but I'm investigating this further, so stay tuned!).

I believe that this change will prevent future text files from being committed with CRLF line endings. As a separate PR (#286) I've converted the 266 files I found with CRLF so that they have LF line endings in the repository and check out clean even if you don't have any Git configurations that should make the change.

@jgfoster
Copy link
Member Author

Well, it isn't the new machine or some global configuration!

I went back to another machine that had arduino_ci and did a fetch/merge on master and there were no modified files. On the same machine I then did a fresh clone into a new directory and the files showed as modified. I then went back to the original directory where there were no modified files and executed the following:

grep -l "\r" cpp/arduino/avr/* | wc -l   # 266

This showed that there are 266 files with CRLF in them, but none show as modified. My current suspicion is that the change isn't in the files (whether or not they have CRLF line endings), but in whether Git thinks that they are modified. This is my next line of inquiry.

@jgfoster
Copy link
Member Author

I found git ls-files --eol (see here) to see what the line endings are in the index and working directory, as well as how they are configured based on .gitattributes or other config info. I don't understand why, but it seems that the files really are "wrong" but Git doesn't consistently notice them as being modified.

@jgfoster
Copy link
Member Author

As @prestoncarman, noted elsewhere, this was probably introduced by #215. We tried an experiment where we did a clone of his fork (which predated that merge), and there were no changes. Then we did a clone of this repository, and there were changes. So, I think the reason it "suddenly" appeared is that we did a new clone and the .gitattributes file we currently have is not ideal. I think that the proposed revision will be somewhat equivalent with regard to doing a checkout on Linux or macOS, but will be better for future commits. Ultimately, the proper fix for the existing "modifications" is #286 where the files are recommitted with proper line endings.

@fpistm
Copy link

fpistm commented Feb 17, 2021

You should read this: (if not already done) https://docs.github.com/en/github/using-git/configuring-git-to-handle-line-endings

And also this page from the "Further reading" section: http://adaptivepatchwork.com/2012/03/01/mind-the-end-of-your-line/

You could also consider adding a .editorconfig:
https://editorconfig.org/
Example: https://github.com/stm32duino/Arduino_Core_STM32/blob/master/.editorconfig

@ianfixes
Copy link
Collaborator

TL;DR:

  • line ending modification (based on .gitattributes or otherwise) is an operation that's applied at the moment of checkout
  • by necessity, fixing this will involve actually checking in the whitespace changes on a lot of files

My immediate thought is to favor Windows' preferred line endings, because in my head that works out to "the least risk of receiving contributions with lots of accidental whitespace changes"

@jgfoster
Copy link
Member Author

My understanding is that the line ending modification happens both at checkout and at commit. I believe the best approach is to convert to native on checkout (so unsophisticated Windows editors aren't confused) and convert to LF on commit so we have a uniform format that matches Linux.

@hlovdal
Copy link
Contributor

hlovdal commented Apr 5, 2021

Git crlf handling

I have now spent several hours digging into the git source and documentation and have gotten a decent understanding of how git handles crlf translation.

The TL;DR version:

After a .gitattributes file is added to a git repository, from then on git will (eventually) always do eol normalization (e.g. to LF) of any text file. You can configure what the eol encoding should be in the work-dir, but that is independent of what is stored in the index/git repository.

From https://git-scm.com/docs/gitattributes:

text

This attribute enables and controls end-of-line normalization. When a text file is normalized, its line endings are converted to LF in the repository. To control what line ending style is used in the working directory, use the eol attribute

However, just the action of adding a .gitattributes file does not immediately "clean up"/convert an existing repository (hence "eventually" part in the TL;DR). Adding the attributes file may make files dirty due to eol inconsistencies between .gitattributes configuration and actual work-dir encoding. These conflicts should then be resolved by adding them with git add --renormalize and then making a commit with the corresponding eol changes.

And this adding of files might not be enough. The file2.bat file in my testing below did not end up with expected w/crlf until deleting all files and checking out everything from scratch.

The git documentation also says

Set to string value "auto" ... When the file has been committed with CRLF, no conversion is done.

but that is not true. I testes with all combinations of the text and eol attributes (that are relevant for having files with crlf encoding) and in every single case the final result is i/lf for the last ls-files command.

#!/bin/sh

set -e

for attr in text.no_eol text.eol_crlf text_auto.no_eol text_auto.eol_crlf
do
	figlet -f big $attr 2>/dev/null || banner $attr 2>/dev/null || echo "##### $attr #####"
	mkdir test.$attr
	cd test.$attr

	git init -b main

	echo "== Initial commit =="
	printf "echo line 1\r\n" > file1.bat
	printf "echo line 1\n" > file2.bat
	git add file1.bat file2.bat
	git commit --quiet -m "Initial commit"
	file file*.bat && git check-attr -a file*.bat && git ls-files --eol file*.bat

	echo "== Add .gitattributes file =="
	case $attr
	in
		text.no_eol)
			echo "*.bat text" > .gitattributes
			;;
		text.eol_crlf)
			echo "*.bat text eol=crlf" > .gitattributes
			;;
		text_auto.no_eol)
			echo "*.bat text=auto" > .gitattributes
			;;
		text_auto.eol_crlf)
			echo "*.bat text=auto eol=crlf" > .gitattributes
			;;
	esac
	git add .gitattributes
	git commit --quiet -m "Add .gitattributes file"
	file file*.bat && git check-attr -a file*.bat && git ls-files --eol file*.bat

	echo "== Renormalize =="
	git add --renormalize .
	git commit --quiet -m "Renormalize"
	file file*.bat && git check-attr -a file*.bat && git ls-files --eol file*.bat

	echo "== Modifications & diff =="
	printf "echo line 2\r\n" >> file1.bat
	printf "echo line 2\n" >> file2.bat
	git diff
	file file*.bat && git check-attr -a file*.bat && git ls-files --eol file*.bat

	echo "== Second commit =="
	git add file1.bat file2.bat
	git commit --quiet -m "Second line"
	file file*.bat && git check-attr -a file*.bat && git ls-files --eol file*.bat

	echo "== Clear and reset work-dir =="
	git rm --cached -r .
	git reset --hard
	file file*.bat && git check-attr -a file*.bat && git ls-files --eol file*.bat

	cd ..
done

This comment is already quite long so I am not including output from this, but this should be quick to run and does not have any dependency other than git (never than 2.16 from 2018, which added --renormalize).

CRLF files in arduino_ci

From the above I conclude that having files stored in the index/repository with crlf (i.e. state i/crlf from ls-files) is impossible (long term) when a .gitattributes file is present, and that getting the i/... and w/... states settled to what they are supposed to be is non-trivial and cumbersome.

When starting to look into this my first attempt was to add entries like cpp/arduino/avr/io1200.h text eol=crlfto .gitattributes to resolve the reported inconsistency, except it did not work and thus my deep dive into git's eol handling.

Currently anyone that clones the repository will start with an initial conflict right out of the box:

$ git clone https://github.com/Arduino-CI/arduino_ci
Cloning into 'arduino_ci'...
remote: Enumerating objects: 4416, done.
remote: Total 4416 (delta 0), reused 0 (delta 0), pack-reused 4416
Receiving objects: 100% (4416/4416), 1.67 MiB | 2.71 MiB/s, done.
Resolving deltas: 100% (2812/2812), done.
$ cd arduino_ci/
$ cat .gitattributes
# Set the default behavior, in case people don't have core.autocrlf set.
* text eol=lf
$ git ls-files --eol  | grep crlf | wc -l
266
$ git ls-files --eol  | grep crlf | head -3
i/crlf  w/crlf  attr/text eol=lf        cpp/arduino/avr/io1200.h
i/crlf  w/crlf  attr/text eol=lf        cpp/arduino/avr/io2313.h
i/crlf  w/crlf  attr/text eol=lf        cpp/arduino/avr/io2323.h
$ git ls-files --eol  | grep crlf | tail -3
i/crlf  w/crlf  attr/text eol=lf        cpp/arduino/avr/iox64d3.h
i/crlf  w/crlf  attr/text eol=lf        cpp/arduino/avr/iox64d4.h
i/crlf  w/crlf  attr/text eol=lf        cpp/arduino/avr/iox8e5.h
$ git ls-files --eol  | grep crlf | grep -v cpp/arduino/avr/
$ ls cpp/arduino/avr/ | wc -l
275
$ git status | head
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   cpp/arduino/avr/iom64rfr2.h
        modified:   cpp/arduino/avr/iom8.h
        modified:   cpp/arduino/avr/iom8515.h
        modified:   cpp/arduino/avr/iom8535.h
        modified:   cpp/arduino/avr/iom88.h
        modified:   cpp/arduino/avr/iom88a.h
$ git status | tail
        modified:   cpp/arduino/avr/iox64a3u.h
        modified:   cpp/arduino/avr/iox64a4u.h
        modified:   cpp/arduino/avr/iox64b1.h
        modified:   cpp/arduino/avr/iox64b3.h
        modified:   cpp/arduino/avr/iox64c3.h
        modified:   cpp/arduino/avr/iox64d3.h
        modified:   cpp/arduino/avr/iox64d4.h
        modified:   cpp/arduino/avr/iox8e5.h

no changes added to commit (use "git add" and/or "git commit -a")
$ 

While the changes to .gitattributes proposed in this pull request are fine, they will not resolve this issue.

The core of the problem is the combination of i/crlf and .gitattributes, and in my opinion the only sensible way to resolve this is to run

dos2unix cpp/arduino/avr/*.h
git add cpp/arduino/avr/*.h
git commit -m "Change eol encoding to enforce git repository text normalization"

Then all files will be consistently lf encoded and there should be no problems. While it is also possible to alternatively try to use eol=crlf to keep files with the original encoding this is not worth the effort, especially since it is not consistently all of the header files.

This problem has been present since the .gitattributes file was added in November last year. From a pure content "diff" perspective i/crlf == w/crlf and was not a problem before. But with the addition of the attributes file git started considering attr/text eol=lf which is different and represents a conflict.

@jgfoster Can you update this pull request to contain a commit that changes eol encoding as described above?

@jgfoster
Copy link
Member Author

jgfoster commented Apr 5, 2021

@hlovdal, Does #286 do what you want?

@hlovdal
Copy link
Contributor

hlovdal commented Apr 5, 2021

Yes.

@ianfixes
Copy link
Collaborator

ianfixes commented Apr 5, 2021

So I'm taking this to mean that both #285 and #286 need to be applied to properly fix the issue?

@jgfoster
Copy link
Member Author

jgfoster commented Apr 5, 2021

Yes. They are separated into a PR that has substantive changes (this one) and one where there are hundreds of files modified, but the changes are all whitespace (changing CRLF to LF). I find that having one or two substantive changes mixed in with hundreds of non-substantive changes creates a poor signal-to-noise ratio. I like having a single PR that can be effectively ignored. I would suggest #286 be merged into main/master independently of anything else (and as soon as possible so I stop having hundreds of modified files in my checkout!). There will never be a time when we need to study that PR to figure out when a bug/feature was introduced and combining it into another PR (even a big one) will only create excess noise. (But that's just an explanation of my rationale; I'm delighted to have you merge it in any way you see fit!)

@hlovdal
Copy link
Contributor

hlovdal commented Apr 5, 2021

Pull request #286 is sufficient to solve issue #283, and can preferably be completed right away. This (#285) is related, but not dependent.

This PR could by the way be improved further by the way specifying a diff handler when we already are listing individual file extension entries, and there are a few more that probably are relevant and should be included. I just created the following in one of my projects:

# https://git-scm.com/docs/gitattributes
# https://gitattributes.io/

* text=auto

*.ino   text diff=cpp
*.c     text diff=c
*.cc    text diff=cpp
*.cxx   text diff=cpp
*.cpp   text diff=cpp
*.c++   text diff=cpp
*.hpp   text diff=cpp
*.h     text diff=c
*.h++   text diff=cpp
*.hh    text diff=cpp

@ianfixes
Copy link
Collaborator

ianfixes commented Apr 6, 2021

Merging all this together, I get the following:

# Set the default behavior, in case people don't have core.autocrlf set.
* text=auto eol=lf

# https://git-scm.com/docs/gitattributes
# https://gitattributes.io/

# Explicitly declare text files you want to always be normalized and converted
# to native line endings on checkout. Git would likely get these right, but
# we can be sure by adding them here.
*.ino   text diff=cpp
*.c     text diff=c
*.cc    text diff=cpp
*.cxx   text diff=cpp
*.cpp   text diff=cpp
*.c++   text diff=cpp
*.hpp   text diff=cpp
*.h     text diff=c
*.h++   text diff=cpp
*.hh    text diff=cpp

*.md    text
*.yaml  text
*.yml   text


# Denote all files that are truly binary and should not be modified.
# Even if we don't have any of these, they make a good example.
*.png   binary
*.jpg   binary

Does this look sane? If so I'll add it locally.

@jgfoster
Copy link
Member Author

jgfoster commented Apr 6, 2021

This is different from what I originally proposed in that it explicitly sets eol=lf along with text=auto. I'm not sure of the implications of that. Will it ensure that line endings are LF during a Windows checkout? Is that what you want?

@hlovdal
Copy link
Contributor

hlovdal commented Apr 6, 2021

* text=auto eol=lf would enforce LF everywhere, but I do not see any need for this and it could potentially create problems for Windows users. So just * text=auto is better.

@ianfixes
Copy link
Collaborator

ianfixes commented Apr 6, 2021

OK, will do. I was paying attention to this comment (from @prestoncarman), which appeared to suggest that I was wrong to remove lf:

-* text eol=lf
+* text=off

Just trying to cover all the comments that have been made. Are we all in favor of text=auto ?

@jgfoster
Copy link
Member Author

jgfoster commented Apr 6, 2021

I interpreted Preston to be giving us some background. We were wondering why this problem "suddenly" occurred for several of us in a short period of time, and he pointed out that it was tied to an attempt to fix a Windows problem. I don't think he was claiming that the previous fix was correct, just that it explains why we hadn't seen the issue before.

@jgfoster
Copy link
Member Author

jgfoster commented Apr 6, 2021

I think that the best practice is to store text files in LF format in the Git repository, check them out in native format (CRLF on Windows) so that simplistic native editors aren't confused (I'm looking at you, Notepad), and then return them to LF format during checkin so we don't have unnecessary diffs.

But, now that you bring us back to #215, as I review it I'm not so sure of how this will play out. According to that PR, bundle exec rubocop -D . complains when a Gemfile has a CR in it on Windows. According to my first paragraph above, we do want the Gemfile to have a CRLF format on Windows. So, we could force all text files to have LF only format even on Windows (who in their right mind will use Notepad, or even Windows for that matter! ;-) ). Or we could make an exception for Gemfile and any other files that rubocop flags. Or we could explain to rubocop that CRLF is appropriate for text files on Windows.

So, much as I want this whack-a-mole experience to be over, I'm not sure that we have a solution yet (though I think that fixing the existing CRLF files in #286 was still correct). The remaining issue is whether a Windows checkout should give you LF or CRLF (and how to handle rubocop if we pick CRLF).

@ianfixes
Copy link
Collaborator

ianfixes commented Apr 6, 2021

I have no opinion on text eol=lf, so if this edit looks acceptable to everyone then I'll go with it:

# Set the default behavior, in case people don't have core.autocrlf set.
* text=auto

# https://git-scm.com/docs/gitattributes
# https://gitattributes.io/

# Explicitly declare text files you want to always be normalized and converted
# to native line endings on checkout. Git would likely get these right, but
# we can be sure by adding them here.
*.ino   text diff=cpp
*.c     text diff=c
*.cc    text diff=cpp
*.cxx   text diff=cpp
*.cpp   text diff=cpp
*.c++   text diff=cpp
*.hpp   text diff=cpp
*.h     text diff=c
*.h++   text diff=cpp
*.hh    text diff=cpp

*.md    text
*.yaml  text
*.yml   text


# Denote all files that are truly binary and should not be modified.
# Even if we don't have any of these, they make a good example.
*.png   binary
*.jpg   binary

@jgfoster
Copy link
Member Author

jgfoster commented Apr 6, 2021

Actually, the fact that the checks have passed suggests that rubocop will not complain about the change, so let's move ahead!

@ianfixes
Copy link
Collaborator

ianfixes commented Apr 6, 2021

I just tried to make some edits to your branch, and for some reason those took effect immediately instead of showing up as a PR against your branch... is that some magic of how forks work in GitHub, or am I a contributor on your repo? Apologies if these were unwanted.

@ianfixes ianfixes requested a review from hlovdal April 7, 2021 02:09
@jgfoster
Copy link
Member Author

@hlovdal, I'm returning to a few of the open issues and in my GitHub view you have been invited to review this PR. I'm satisfied with the current files; how about you? If you agree then I'm merge the PR. Thanks!

@jgfoster jgfoster merged commit a242213 into Arduino-CI:master Oct 3, 2021
@jgfoster jgfoster deleted the line-endings branch October 3, 2021 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants