Uniform extention of Kotlin syntax


#1

Imagine if you could extend Kotlin grammar directly in the code. In c-like languages such extensions may be more or less precisely identified with macro definitions. But macros are ‘stupid’ in the sense that they are straightforward, dont’ consider any semantical or syntactical fitness (you can easily break type safety or syntactical correctness with such macro-substitutions).

What I’m speaking about may be exactly expressed as: we extend the grammar of the language inside code. Of course, to do it we must expand the core grammar with special construction for the description of the new grammar rule and it’s meaning. And the semantics of such construction is straightforward: given an AST node of newly defined kind, we transform the tree back to the core syntax, using the definition of the new syntax construction.

The main point of doing this: such things as syntactic sugar would be no more tied strongly to the core language, but may be left for the consideration of developers.

So, pros and contras of this feature:
pros:

  • syntax is becoming very flexible
  • significant part of discussions about the language design will move to the standard library design
  • some syntactic sugar features of the language may be transferred from the core to the library, so core language will become smaller
    contras:
  • in case of misuse this feature makes it possible to break the language syntax.
  • even in case of not so responsible usage it may introduce undesired ambiguity and inefficient parsing
  • performance overhead for the dynamic nature of the parser

The technical side of this proposal. If I didn’t implement something alike, I wouldn’t have raised such discussion. I made a parser for the language Russell, which has no built-in syntax of expressions. The parser for the syntax of expressions is generated on the fly, while parsing the main source. The size of the source files for Russell is quite large: about 150 Mb, the syntax of expressions is also quite complex: about a thousand of rules, and the parsing of such amount of source takes it all in all about 2 minutes (4 core CPU).

Some info about my implementation of such a dynamic parser. At first I tried naive LL parser, but it failed with grammar ambiguity. Then I made a huge effort and implemented an LR parser, which was very fast to run, but very slow to generate - so unfortunately, it was unacceptable - we can’t wait for hours to get the source compiled. At last, I ended with a version of LL parser which could parse all of the expressions in reasonable time, despite the fact of ambiguity.

Main problems with implementation of such feature:

  • All parsing subsystem of Kotlin must be completely revised
  • Operator precedence rules of the grammar: yet I don’t know how to implement it in the parsing algorithm which I used for Russell.
  • critical - the speed of parsing. It’s clear, that the dynamic parser will be slower then the highly optimized hand-written parser for a fixed grammar. The main point is the degree of this gap in performance.

As for me, I’m reservedly optimistic with these problems, but it’s clear, that it’s a huge work to implement such feature.


#2

Cons: source code becomes unreadable to every unskilled programmer.


#3

I tend to disagree. With flexible languages like Kotlin, Groovy, Scala, etc., there are many ways to solve a problem. Whether or not a developer less familiar with the language is able to comprehend what is written depends more on how easy it is to navigate to the definition of a construct, than on the knowledge about the construct itself. if you can navigate to it, you can see how it is done, and you can search for documentation about it.

For example: Function types are awesome, but they are less easy to navigate than interfaces. If you use an interface you can easily find all implementations. If you use a function type, you are lost at that point for two reasons:

  • It is not possible to find all implementations.
  • And even if you could find all implementations, the same function type is reused for many different purposes. So you still won’t know which functions are actually used in the code you are looking at right now.

If some sort of macro system is added to Kotlin (which IMHO would give a lot more flexibility when designing a DSL), you need to be able to easily navigate the macro definitions. This way you can quickly build an inlined version in your head, and understand what it is doing.

Although I am in favor of a macro system, I understand that this is hard to get right in terms of language design and performance. So not adding it to Kotlin is an understandable choice.


#4

Cons: source code becomes unsupported by IDE (syntax highligting/completion, refactoring, etc.) and other tools, that depend on syntax analysis (static analyzers, pretty printers, etc.)


#5

This will happen any time that you extend or modify a language. Java was extended with generics and annotations, which both broke existing tooling.

I do not say that existing tooling should not be taken into account, but it should also not be used to stop all language changes.


#6

fixed grammar << grammar evolution through releases << dynamic runtime grammar addition/mdofications are different things all together. The later will IMO break at first all tools outside the reach of the compiler vendor.

BTW: Have you ever tried PEG parsers? They are real fast, can deal with grammar ambiguity in a structured and predictable way and have been battle-tested with large langauges like Fortress.
And: there have been efforts to use PEG for “A Programming Language Where the Syntax and Semantics Are Mutable at Runtime”.
Please share your thoughts on this topic.


#7

I interpret dynamic runtime grammer as being able to extend a language during compilation (using specifications written in code, embedded in modules, etc.).

Again, breaking third-party tools will happen any time you extend a language. Several breaking changes have been introduced to Java over the years, and external tools simply had to adapt.

The questions are whether it is possible for external tools to adapt, and how much effort it will require.

If the language extensions are black boxes so tools cannot transform a new language construct to the eventual constructs in the base language, then tools will never be able to understand the future language. I am totally against this because tooling support is extremely important for the development experience. (I already think Kotlin is lacking in development experience at some points at this moment, and would hate for this to get worse because of some new language feature.)

Or if the effort to build a parser for an extensible language would require many man-years to complete, the chances of third-party tools being updated to handle the language changes will decrease significantly.

I would appreciate an extension mechanism, but both developers and tools need to be able to easily find the final code that is compiled so they can understand what a construct actually does. Developers will need to be able to inline, and tools need to be able to inspect.

No, I never tried those. I did look at XLR in the past. From a quick glance at Katahdin, I can say it looks very interesting.

I can’t really comment on language design and parsers as I am more of a language user. I think DSLs can be nicer with language extensions, but I have some grasp of the difficulties involved with implementing those. Tooling support being one of them.


#8

About PEG parsers. I’m grateful for the aiming me to it (shame on me, I didn’t know it before).
I found (abandoned) git repository Monaco. It has only 4 commits, last done 3 years ago. The source doesn’t compile on current kotlin compiler. I would like to repair it, but there’s no license, so by default I’m prohibited to do it.

Although, I have a little scepticism to the PEG parsers, as they fix parsing algorithm too strictly. I mean that it may be not so obvious how to write down proper PEG grammar for a reasonable complex language. Grammar rules as understood in a classical way are more easy to operate when building up a grammar.

As I see, the main objections for a dynamic grammar is:

  • external tools need to become too complex
  • navigation inside a code with a flexible grammar is not clear.

The first reason seems not principal to me: to have the complete control over the code it’s sufficient to use native parser/semantic engine of kotlin, which is open source. The other way how to deal with it: write an external libraries for parsing/semantic analysis (not including code generation) and use the in external tools.

The second reason is more subtle. Actually I think that it’s not possible to know the correct answer unless you implement an experimental system and use it for some time. The technical side of the problem is not an issue - you can use common tools to find declaration of new syntax constructions or to find all places when it is used (or any part of it). The principal question is human perception: will it be ok for a coder to operate in a language with constantly changing syntax or he will be get stuck at some point?

Personally I’m moderately optimistic about the ability of a man to work with a flexible syntax. I suspect that it will be rather unusual for those, who used fixed syntax for decades, but it’s not something completely mind breaking and user will get used to it at some point.


#9

I’m working on something in that vein for my PhD thesis.

It’s still in the (early) design / implementation phase, but all work on the topic will pop up here:

The idea is to enable the extension of base languages through module-scoped extensions (language extensions as libraries) defined with the language that is being extended.

It requires a lot of work upfront for the base language however (essentially a compiler re-implementation). Currently I’m targeting Java, but I hope to do something with Kotlin later.

Feel free to contact me if you have some interesting use cases!

PS: It does use PEG parsing :slight_smile:


#10

Really it’s very interesting (I mean whimsy). I browsed a little the repository, but found a lot of stubs for documentation, so yet I have no idea what whimsy is intended to be. Anyway, I have 2 questions concerning whimsy and PEG parsing in general.

  1. About PEG grammars. As far as grammar rules become order-dependant, it seems that PEG grammars are non-monotonic. By monotonicity I mean the following property:

to extend language it’s sufficient to add some extra rules, but there’s no need to modify the existing rules.

The classical CFG surely has monotonicity. Now let’s take the followong example: given the grammar

S ← ‘if’ C ‘then’ S

we want to extend it in a fashion that it should admit if-then-else statements. Then we should add rule

S ← ‘if’ C ‘then’ S ‘else’ S

after the initial one, so, in my understanding of PEG grammars it will always be shadowed with the first rule. Maybe I’m wrong with this concrete example, but I would like to be sure, that PEG grammars are monotonic (up to now it’s very doubtful for me)

  1. As I understand, you want to develop a language (whimsy), which supports the language extension. The extensions may be purely syntactical (i.e. syntactic sugar) - and it’s clear how to implement it. All other extensions are dealing with semantics, and I don’t see a uniform simple way how to define a semantic extension. For example imagine that you want to extend your imperative language with first-class function types - it’s a very complex extension, which demands a lot of semantics introduced and extended. There is a solution for such things - formal semantics, defined in some logical framework, like k framework, but it’s really HUGE project, involving advanced formal methods for analysis and synthesis of the proper semantic engine, capable of interpreting the ‘generated on-the-fly’ semantics. So, the second question is: how you are going to deal with extension of semantics?

#11

@norswap Then you might be interested in some prior work in this area: https://github.com/rsdn/nitra


#12

@dmitry-vlasov

There is a trade-off between PEG and CFG: by adding rules to a CFG, you risk making it ambiguous, while by adding rules to a PEG, you risk “hiding” valid constructions (the shadowing you refer to). Neither of these problems admits good solutions, so care is required in either case.

I’m currently looking at that side of things. The current plan is to build a “rule engine” that reactively derives properties or transformations of the AST.

It should make some extensions relatively easy (higher order functions should fall in that category). I expect some extensions will be quite hard however: if the require the re-interpretations of some constructs in the base language, then you have to override some rules (e.g. the typing rules for base language constructs) and that’s potentially a lot of work.

@ilya.gorbunov

Thanks, I didn’t know that one yet.

It seems to be slightly different than what I’m aiming at however. It’s a language workbench (like MPS, Xtend and Spoofax) whose goal is to build DSLs (which are limited in scope). In my case I want to make a base language (that can be an existing language or a new one) extensible. It’s aiming at that old pipe dream: to be able to do everything in the same language, as conveniently as possible. The cost to set up the base language is stupidly expensive: you essentially have to write a compiler (the framework should make it easier than writing a compiler from scratch - but it’s still hard work). However, after that you should be able to grow the language into whatever you like.


#13

What came to my mind right now (it’s strange that I didn’t think about it before) is that there’s no need to mix program code and language extension code. I mean that language extension may be stored in a separate file set, so it becomes more easy to track the change in syntax and to insert new rules.

Even more, you may store rules inside “patches”, which will, for example, place a new rule in the right place, actually changing the previously defined grammar. To do it easily, one should be able to see the whole grammar, available at some point, and if it is stored separately, then it becomes easier.

On the other hand, anyway there’s a need for some automation for grammar tracking, and for automatic tools it’s not a big deal to operate on the original source (with program and language extension code mixed). So, there’s subject to think over.


#14

That’s indeed how I intend it to work.

The physical separation makes a lot of sense from a UX perspective, because extension code will be run at compile-time.

And naturally, you don’t edit the base language definition when writing extensions (that would be forking rather than extending and completely removes any hope of composition).

Tooling is a big challenge. The direction that seems most promising is to annotate extensions with semantic information that can be used to perform lookups, refactorings, etc.


#15

Meanwhile I started an experimental project dynaparse in c++, to try the concept of dynamically extendible syntax.

I test it on the oberon syntax (frankly speaking it doesn’t work yet). The reason is: oberon is simple and well-known general purpose language.

Some thoughts about extensibility of semantics. With a great regret I leave the hope to use logical (i.e. formal) methods and come to the straight ad-hoc implementational approach. I mean that we can just equip the newly introduced syntax construction with a piece of code, written in the same language, as a compiler, which describes the semantics. In the process of compilation a compiler, as soon as it meets the definition of a new construction, should run the compiler to compile the definition of a construction to a shared library and instantly load it. Further, in case it meets the construction usage, it parses it with a dynamical parser and forward the implementation of it’s semantics to the compiled piece of code.


#16

Looks like you are looking for a way to xtend a language :slight_smile: