Skip to content

Fix keyword loading to use any whitespace as separator #6693

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

bhagman
Copy link
Contributor

@bhagman bhagman commented Sep 4, 2017

Instead of forcing keywords.txt to use tabs, let library developers use spaces too.

Instead of forcing keywords.txt to use tabs, let library developers use spaces too.
@facchinm facchinm added the feature request A request to make an enhancement (not a bug fix) label Sep 4, 2017
@facchinm facchinm added this to the Next milestone Sep 7, 2017
@bperrybap
Copy link

So is this finally going to be fixed after years of neglect?

@facchinm
Copy link
Member

This PR was merged in beta-1.9 branch as 13824d7 , if no user will report any problem we are going to proceed merging it into master soon.

@cmaglie
Copy link
Member

cmaglie commented Oct 20, 2017

merged in eed9e70

@cmaglie cmaglie closed this Oct 20, 2017
@cmaglie cmaglie modified the milestones: Next, Release 1.8.6 Oct 20, 2017
@cmaglie
Copy link
Member

cmaglie commented Feb 19, 2018

@bhagman @bperrybap
unfortunately it's not as easy as it seems, the "old" keyword.txt format uses \t as a separator (not as a whitespace), if a row contains two consecutive tabs like A\t\tB it means that there are 3 fields: "A", empty string "" and "B".

Some examples are:

operator	LITERAL1		RESERVED_WORD_2
enum	LITERAL1		RESERVED_WORD_2
delete	LITERAL1		RESERVED_WORD_2
bool	LITERAL1		RESERVED_WORD_2
double	LITERAL1		RESERVED_WORD_2
null	LITERAL1		RESERVED_WORD_2
NULL	LITERAL1		RESERVED_WORD_2
new	LITERAL1		RESERVED_WORD_2
private	LITERAL1		RESERVED_WORD_2
protected	LITERAL1		RESERVED_WORD_2
public	LITERAL1		RESERVED_WORD_2
short	LITERAL1		RESERVED_WORD_2
signed	LITERAL1		RESERVED_WORD_2
true	LITERAL1		LITERAL_BOOLEAN
unsigned	LITERAL1		RESERVED_WORD_2
word	LITERAL1		RESERVED_WORD_2
Keyboard	KEYWORD1		DATA_TYPE
Mouse	KEYWORD1		DATA_TYPE
override	KEYWORD3		RESERVED_WORD
final	KEYWORD3		RESERVED_WORD
goto	KEYWORD3		RESERVED_WORD
throw	KEYWORD3		RESERVED_WORD
try	KEYWORD3		RESERVED_WORD
export	KEYWORD3		RESERVED_WORD

this is the IDE before the patch:

image

and this is after the patch:

image

as you can see the color coding is not taken but instead is interpreted as a reference-link (3rd position instead of 4th) this is confirmed if I right-click on the operator word and select "Find in reference":

image

this failure is also reported by the PdeKeywordsTest test:

java.lang.AssertionError: expected:<IncrementCompound> but was:<null>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:144)
->	at processing.app.syntax.PdeKeywordsTest.testKeywordsTxtParsing(PdeKeywordsTest.java:49)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:539)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:761)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:461)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:207)

I'm going to revert this patch, until we find a different solution.

At this point I don't know if a backward-compatible solution may be find, maybe the best way forward is to keep the current format and provide a better documentation?

@bperrybap
Copy link

@cmaglie
I'm not understanding this:

if a row contains two consecutive tabs like A\t\tB it means that there are 3 fields: "A", empty string "" and "B"

Regardless of how the current code is parsing it, shouldn't a line like that actually mean two fields "A" and "B" ?

So are you saying that the keywords file format intentionally supports the ability to skip fields or just that the current parsing code to parse it is broken?

Off topic but, but your comments seem to suggest that there are fields that I'm not familiar with.
Is there any documentation on the keywords file?
I've not been able to find any.

Regardless of the number of fields in the keywords file,
it would seem better for the keywords file format to NOT support blank/null fields and use whitespace as a field separator.
Supporting null/blank fields is mess and forever precludes the ability to use whitespace as a field separator.

When using whitespace as a field seperator, you won't have empty fields anymore since whitespace means 1 or more of either or

It appears that the patch used didn't use whitespace as a separator but rather simply used a single space or a single tab which is not the same as whitespace.

@cmaglie
Copy link
Member

cmaglie commented Feb 19, 2018

So are you saying that the keywords file format intentionally supports the ability to skip fields or just that the current parsing code to parse it is broken?

Yes, there is the ability to skip a field using two tabs as in my A\t\tB example. This is how the parser works now, it's counter-intuitive I know, in fact I discovered it after merging this PR.

We can argue if this is nice or bad, but this is not something we can change now because hundreds of keywords.txt are based on that parsing, so we should support the old format.

Anyway let's continue the discussion in #7245 that looks more promising.

@bperrybap
Copy link

What is most important is the intended file format not the way the code may be parsing it.
Was the old format intentionally allowing blank fields? And that old format must be supported?
If so, then it is over since you can never use whitespace as a field separator.
IMO, supporting blank fields is a bad idea.

@per1234
Copy link
Collaborator

per1234 commented Feb 19, 2018

The only documentation I know of is here:
https://github.com/arduino/Arduino/wiki/Arduino-IDE-1.5:-Library-specification#keywords
and a brief mention here:
https://www.arduino.cc/en/Hacking/LibraryTutorial

Note they both do mention a tab separator. After looking at hundreds of keywords.txt files I can tell you that the use of spaces instead of tabs is very common (~15% of libraries). Of the rest, the great majority use multiple tabs as a separator (even though the documentation doesn't specify that being supported) and disturbingly high number have a random mixture of tabs and spaces.

I don't understand the purpose of the multiple identifiers per keyword used in the IDE's global keywords.txt. If relevant to libraries, it would be nice if that usage was documented on the library specification page. I'm happy to do the writing but can't without understanding how it works.

@bperrybap
Copy link

It is really messy since depending on the version of the IDE the parsers were changed/modified which alters the expected format and what is supported.
I think the library spec page should be updated to reflect fields supported and the expected format, even if the format (field separators) actually supported under the hood is a bit more liberal.
Currently, I've not found any information other than Cristian's recent comment here about the file format and fields supported:
#7245 (comment)

@CieNTi
Copy link

CieNTi commented May 28, 2019

Thanks to @per1234 to point me to this issue.

I didn't realize about the fact that is completely mandatory, inside keywords file, to use tabulator as member separator. First by the fact that the IDE provides a poor variety of code editors capabilities, and as I'm used to Sublime Text, it add 0 work to see a function as a function, or a variable as a variable.

Being a tabulator a (extremely old) type of CSV file, and giving tons of headaches along lot and lot of years because it's a width-variable and non-visible character. Why is still in use, and moreover as a unique and a mandatory way to create keywords in Arduino?

As a CSV-like file (which means positional members with the error-prone that it directly inferes), Why is not at least selectable which character to use for fields separators? If keywords does not allow spaces, what is the sense of use a non-space character for that?

There is a lot of ways to define keywords, in use in top editors:

Not a single one use positional fields for that.

And just offtopic (as we're not talking about indentation), but nice to read:

Come on Arduino! Great devices, great software, great libraries ... but this as the unique way for syntax highlight of custom keywords?

My support to @per1234, hard work he does!

@bperrybap
Copy link

bperrybap commented May 28, 2019

In a bit over 35 years of using s/w on many operating systems. The only tool and files that I have ever seen that have used such a rigid format for separators is make. makefiles are stupidly rigid and use a single tab field as separator for rules.

The format of the keyword file is even worse, in that not only is it using an invisible field separator, but the format of the file is not formally defined and its format has changed over the years making it impossible to create something that is fully compatible with all possible file formats from the past.

IMO, it is time to move to a new/extended format.
A format that is simple and mostly compatible with the existing format.

My suggestion would be to put a version #, key, or identifier on the first line so that the new format could be detected and then used.
If the identifier is not there, fall back to the older/current form of parsing.

The new format should be VERY liberal, and allow whitespace or a comma as field separators and NOT allow blank/missing/skipped fields.

That way, nearly everybody should be able to get something they like and get things to look "pretty" in the file.
single tab, multiple tabs, single space, multiple spaces, single comma, or any combination of those would be a single field separator.
i.e.

<tab><space><space><tab><comma> 

would be a single field separator.

If you want/need to skip a field or have a null field, use commas and put nothing between the two commas for that field.
That way it is obvious that the field is not being used.

The point is to be very liberal in the format to support a few different styles of field separation all while keeping the parsing very simple.
But mainly to totally eliminate the use of hidden/invisible non used fields which is what is creating all the issues.

@bperrybap
Copy link

Oh and once there is a version identifier in the file, it would be easy to change formats in the future and or add support for additional formats.
The other thing to consider is that this file is not really intended for end user consumption so it doesn't necessarily have to be a simple field separated format like it is today.

@per1234 per1234 added the Type: Invalid Off topic for this repository, or a bug report determined to not actually represent a bug label Mar 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request A request to make an enhancement (not a bug fix) Type: Invalid Off topic for this repository, or a bug report determined to not actually represent a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants