Notes on Extending JavaScript Syntax - Early Tooling Pitfalls

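I haven't written an article in a long time, so I might as well post one to make up the numbers.

A few of us were bored and suddenly decided to build HaLang.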

What's HaLang? https://github.com/laosb/halang

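Why build it? Well, we simply had too much time on our hands.

So where do we start?

All it takes is replacing a few keywords and getting the parser to accept Chinese punctuation. There is no need to reinvent a parser for that; we would only be tweaking the tokenizer and adding a few tokens. Besides, the JavaScript ecosystem is full of wheels of every odd shape, so naturally we would rather solve the problem with an existing square wheel.

The transpiler we use most day to day is babel, which bills itself as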

The compiler for writing next generation JavaScript

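On top of that, every feature we use day to day ships as a plugin. That looks like a perfect fit for us: writing a babel-plugin would minimize development cost and maximize the payoff.

babel itself relies on babylon as its parser. babylon also supports plugins; its jsx and flow support exist as plugins. With that, the blueprint in front of us is very clear: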

- babel-preset-halang

- babel-plugin-halang

- babel-plugin-halang-jsx

- babel-plugin-halang-flow

- babel-polyfill-halang

- babylon-plugin-halang

- babylon-plugin-halang-jsx

- babylon-plugin-halang-flow

- eslint-standard-halang

Finally, bootstrap it. Simply perfect.

No time to waste: let's look at babylon's jsx plugin as an example.

import { TokenType, types as tt } from "../../tokenizer/types";
import { TokContext, types as tc } from "../../tokenizer/context";
import Parser from "../../parser";

tc.j_oTag = new TokContext("<tag", false);
tc.j_cTag = new TokContext("</tag", false);
tc.j_expr = new TokContext("<tag>...</tag>", true, true);

tt.jsxName = new TokenType("jsxName");
tt.jsxText = new TokenType("jsxText", { beforeExpr: true });
tt.jsxTagStart = new TokenType("jsxTagStart", { startsExpr: true });
tt.jsxTagEnd = new TokenType("jsxTagEnd");

tt.jsxTagStart.updateContext = // ...
tt.jsxTagEnd.updateContext = // ...
const pp = Parser.prototype;
// pp.jsxReadToken = function() {...}
// ...

export default function(instance) {
  instance.extend("parseExprAtom", function(inner) {
    return function(refShortHandDefaultPos) {
      // Do something
      return inner.call(this, refShortHandDefaultPos);
    };
  });

  instance.extend("readToken", function(inner) {
    return function(code) {
      // Do something
      return inner.call(this, code);
    };
  });

  instance.extend("updateContext", function(inner) {
    return function(prevType) {
      // Do something
      return inner.call(this, prevType);
    };
  });
}

Hmm, a very clear structure. Roughly, it defines some tokens first, then adds methods to Parser's prototype to do the rest.

extend(name: string, f: Function) {    
  this[name] = f(this[name]);    
}

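This extend function is a very simple plugin mechanism. It replaces the Parser's original implementation with the plugin's, so the plugin gets the first shot and falls back to the original implementation (passed in as inner) when it cannot handle the input. The jsx code above gives a rough idea of how this works. It is not very pure, but it gives plugins maximum freedom: whatever we do, we are probably not bound by anything other than the architecture itself.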
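To make the wrapping concrete, here is a minimal sketch. MiniParser, its token objects, and fullWidthParenPlugin are all made up for illustration; only the body of extend mirrors the snippet above.

```javascript
// Hypothetical mini parser; only extend() mirrors the babylon snippet above.
class MiniParser {
  // Stand-in for the real tokenizer: every char code becomes a generic token.
  readToken(code) {
    return { type: "char", value: String.fromCharCode(code) };
  }

  // Replace this[name] with whatever the plugin builds around the old version.
  extend(name, f) {
    this[name] = f(this[name]);
  }
}

// Hypothetical plugin: handles the full-width "（" itself, defers everything else.
function fullWidthParenPlugin(instance) {
  instance.extend("readToken", function (inner) {
    return function (code) {
      if (code === 0xff08) {
        // U+FF08 FULLWIDTH LEFT PARENTHESIS: the plugin takes this one.
        return { type: "parenL", value: "(" };
      }
      // Not ours: fall back to the original implementation.
      return inner.call(this, code);
    };
  });
}

const parser = new MiniParser();
fullWidthParenPlugin(parser);

const handled = parser.readToken("（".charCodeAt(0));
const deferred = parser.readToken("a".charCodeAt(0));
console.log(handled, deferred);
```

Because each plugin wraps whatever is currently installed, the plugin loaded last sits outermost and sees the input first, which is presumably why load order matters.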

Now let's go back and look at the function that enables plugins: https://github.com/babel/babylon/blob/b6c3b5aa8319582f695e7944503612e39aa08261/src/parser/index.js#L68

  loadPlugins(pluginList: Array<string>): { [key: string]: boolean } {
    // TODO: Deprecate "*" option in next major version of Babylon
    if (pluginList.indexOf("*") >= 0) {
      this.loadAllPlugins();

      return { "*": true };
    }

    const pluginMap = {};

    if (pluginList.indexOf("flow") >= 0) {
      // ensure flow plugin loads last
      pluginList = pluginList.filter((plugin) => plugin !== "flow");
      pluginList.push("flow");
    }

    for (const name of pluginList) {
      if (!pluginMap[name]) {
        pluginMap[name] = true;

        const plugin = plugins[name];
        if (plugin) plugin(this);
      }
    }

    return pluginMap;
  }

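This feels a bit off. A plugin is supposed to "plug in", not have the system reserve a dedicated slot for it. Giving flow special treatment in the core is not sound design; the sound approach would be a priority option and a single sort. Never mind, let's skip this for now.

As we can see, a plugin takes effect like this: the plugin's function is first registered in plugins[name], then the system passes the this instance to the plugin's exported function, which calls extend, and the plugin is now part of the system.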
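For what it's worth, the priority-and-sort idea could look roughly like the sketch below. Nothing here is babylon's API: registry, registerPlugin, the priority numbers, and the halang plugin name are all assumptions for illustration.

```javascript
// Hypothetical registry: each plugin carries a priority, so the core
// no longer needs to hard-code "flow goes last".
const registry = new Map();

function registerPlugin(name, apply, priority = 0) {
  registry.set(name, { name, apply, priority });
}

function loadPlugins(instance, pluginList) {
  // Drop duplicates and unknown names, as the original loop effectively does.
  const known = [...new Set(pluginList)]
    .filter((name) => registry.has(name))
    .map((name) => registry.get(name));

  // Lower priority loads first; Array.prototype.sort is stable in ES2019+,
  // so plugins with equal priority keep the order the caller gave.
  known.sort((a, b) => a.priority - b.priority);

  for (const plugin of known) plugin.apply(instance);
  return known.map((plugin) => plugin.name);
}

// Toy plugins that only record when they were applied.
registerPlugin("flow", (inst) => inst.log.push("flow"), 100);
registerPlugin("jsx", (inst) => inst.log.push("jsx"));
registerPlugin("halang", (inst) => inst.log.push("halang"));

const instance = { log: [] };
const loaded = loadPlugins(instance, ["flow", "jsx", "halang", "jsx"]);
console.log(loaded);
```

With this shape, "flow loads last" becomes data that flow declares about itself rather than a special case in the loader.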

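Looks fine. Let's start writing.

For convenience, we first need to import those Types and Context. But here comes the problem.

babylon's package on npm is bundled with rollup, so of course there is no src directory. Moreover, babylon exports only the parse method and nothing else. If we point the npm dependency at the GitHub source instead, we can work around this in a sense. Let's write a Hello World first.

...Hmm? How do we register a plugin? The docs don't say! Let's see where that plugins object is used: https://github.com/babel/babylon/blob/b6c3b5aa8319582f695e7944503612e39aa08261/src/index.js#L14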

import flowPlugin from "./plugins/flow";    
import jsxPlugin from "./plugins/jsx";    
plugins.flow = flowPlugin;    
plugins.jsx = jsxPlugin;

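...

So the flow and jsx plugins are, quite perfectly, baked right into the system. babylon does not export this plugins object either. In other words, you don't get to play.

But according to the docs, it also supports the following plugins. How does it manage that?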

jsx
flow
doExpressions
objectRestSpread
decorators (Based on an outdated version of the Decorators proposal. Will be removed in a future version of Babylon)
classProperties
exportExtensions
asyncGenerators
functionBind
functionSent
dynamicImport

The answer is very simple: https://github.com/babel/babylon/blob/b6c3b5aa8319582f695e7944503612e39aa08261/src/parser/expression.js#L433

case tt._do:    
    if (this.hasPlugin("doExpressions")) {    
        const node = this.startNode();    
        this.next();    
        // ....

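After all that, babylon never intended to build a plugin mechanism. These so-called plugins are really just feature switches... As for jsx and flow, they look more like two modules forcibly split out for the sake of decoupling.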
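Boiled down, the "plugin as feature switch" pattern looks something like this sketch. FlagParser and its single method are invented for illustration; only the hasPlugin("doExpressions") check is taken from the excerpt above.

```javascript
// Hypothetical parser: the grammar for every "plugin" already lives in the core,
// and the plugin name merely switches that branch on or off.
class FlagParser {
  constructor(plugins = []) {
    this.plugins = new Set(plugins);
  }

  hasPlugin(name) {
    return this.plugins.has(name);
  }

  // Toy stand-in for parsing an expression atom from a single keyword.
  parseExprAtom(word) {
    if (word === "do" && this.hasPlugin("doExpressions")) {
      return { type: "DoExpression" };
    }
    throw new SyntaxError(`Unexpected token ${word}`);
  }
}

const withFlag = new FlagParser(["doExpressions"]).parseExprAtom("do");

let withoutFlagError = null;
try {
  new FlagParser().parseExprAtom("do");
} catch (err) {
  withoutFlagError = err;
}
console.log(withFlag.type, withoutFlagError && withoutFlagError.message);
```

Passing a name the core has never heard of changes nothing, which is exactly why an outside package has no way to add syntax through this door.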

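Judging from the discussion on the babel forum... well... it looks like we can forget about building a babylon-plugin: http://discuss.babeljs.io/t/syntax-extensions-parser-plugins/122

So we have to give up on the babylon-plugin route. Let's try another one: for example sweet.js, a project from Mozilla (https://github.com/sweet-js/sweet.js).

sweet.js proudly proclaims "Hygienic Macros for JavaScript!" and promises that "Macros allow you to build the language of your dreams." Exactly what I wanted... or is it?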

Bazinga!

Did you think you could just implement

#define （ (
#define ） )

like that? Naive!

syntax ！ = function (ctx) {
  return #`!`;
}

Code like this will of course fail with [Unexpected "！"]! Why?

First, look at how it recognizes an Identifier: https://github.com/sweet-js/sweet.js/blob/2051c1bb737f45a5028260afa63ed778869f8895/src/reader/read-identifier.js#L23

if (char === '\\' || 0xD800 <= code && code <= 0xDBFF) {    
    return new IdentifierToken({    
        value: getEscapedIdentifier.call(this, stream)    
    });    
}

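It handles that whole pile of UTF-16 surrogate pairs!

Wake up, that has nothing to do with us: 0xD800 to 0xDBFF is the high-surrogate range, and full-width punctuation such as ！ (U+FF01) is nowhere near it. Further down there is a check that calls isIdentifierStart, which is implemented by esutils: https://github.com/estools/esutils/blob/master/lib/code.js#L115

Honestly, once we end up in esutils, I don't think we need to look any further...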

    // 7.6 Identifier Names and Identifiers

    function fromCodePoint(cp) {
        if (cp <= 0xFFFF) { return String.fromCharCode(cp); }
        var cu1 = String.fromCharCode(Math.floor((cp - 0x10000) / 0x400) + 0xD800);
        var cu2 = String.fromCharCode(((cp - 0x10000) % 0x400) + 0xDC00);
        return cu1 + cu2;
    }

    IDENTIFIER_START = new Array(0x80);
    for(ch = 0; ch < 0x80; ++ch) {
        IDENTIFIER_START[ch] =
            ch >= 0x61 && ch <= 0x7A ||  // a..z
            ch >= 0x41 && ch <= 0x5A ||  // A..Z
            ch === 0x24 || ch === 0x5F;  // $ (dollar) and _ (underscore)
    }
    
    function isIdentifierStartES6(ch) {
        return ch < 0x80 ? IDENTIFIER_START[ch] : ES6Regex.NonAsciiIdentifierStart.test(fromCodePoint(ch));
    }

Good Game!
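A quick way to see why ！ never had a chance: the sketch below approximates the non-ASCII branch with the Unicode ID_Start property via a regex property escape. That substitution is my assumption; esutils ships its own generated ES6Regex table rather than using \p{...}.

```javascript
// Approximation of isIdentifierStartES6 for a single character.
// Assumption: \p{ID_Start} stands in for ES6Regex.NonAsciiIdentifierStart.
function isIdentifierStart(ch) {
  return ch === "$" || ch === "_" || /^\p{ID_Start}$/u.test(ch);
}

// A letter, a dollar sign, a CJK character, and three full-width punctuation marks.
const samples = ["a", "$", "哈", "！", "（", "）"];
const results = samples.map((ch) => [ch, isIdentifierStart(ch)]);
console.log(results);
```

CJK characters such as 哈 are letters and pass, so Chinese identifiers were never the problem. Full-width punctuation is classified as punctuation, so it can never start an identifier, and a macro name like ！ is rejected before the macro system even gets involved.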

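All right, let's go dig in a proper JavaScript parser instead. Back to acorn. acorn has a proper plugin mechanism, and there are already several fairly successful examples of it. In fact, the plugin mechanism we dissected above makes it obvious that babylon is a heavily modified acorn. Looking back at babel's history, before babylon it used a modified acorn and even discussed switching back to the original acorn; it then went further and further down the modified-acorn road and ended up with babylon, which is "Heavily based on acorn and acorn-jsx". So it looks like swapping out babel's underlying parser should be easy, especially since babel-core has this API: transformFromAst.

Things did not go as hoped.

First, the AST that acorn produces follows the ESTree standard (https://github.com/estree/estree). babylon's does not; babylon uses the Babel AST Format (https://github.com/babel/babylon/blob/master/ast/spec.md). It is nominally based on ESTree, but it has been modified so much that acorn can no longer be fully compatible with it. Another parser, Esprima, also produces ESTree-compliant ASTs, so it is ruled out for the same reason.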
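One of the more visible differences, as far as I understand the Babel AST spec, is that ESTree's single Literal node is split into StringLiteral, NumericLiteral, BooleanLiteral, NullLiteral and RegExpLiteral. The sketch below converts just that one node type; the function name is hypothetical, and a real bridge would also have to handle properties, methods, directives, locations and more.

```javascript
// Hypothetical helper: maps one ESTree Literal node to its Babel-style counterpart.
// Everything else about the node (loc, range, raw) is ignored in this sketch.
function estreeLiteralToBabel(node) {
  if (node.type !== "Literal") return node;

  // Check regex first: its `value` may be null on engines that cannot build the RegExp.
  if (node.regex) {
    return { type: "RegExpLiteral", pattern: node.regex.pattern, flags: node.regex.flags };
  }
  if (node.value === null) return { type: "NullLiteral" };

  switch (typeof node.value) {
    case "string":
      return { type: "StringLiteral", value: node.value };
    case "number":
      return { type: "NumericLiteral", value: node.value };
    case "boolean":
      return { type: "BooleanLiteral", value: node.value };
    default:
      throw new TypeError(`Unsupported literal value: ${String(node.value)}`);
  }
}

const converted = [
  { type: "Literal", value: "ha" },
  { type: "Literal", value: 42 },
  { type: "Literal", value: null },
  { type: "Literal", value: null, regex: { pattern: "a+", flags: "g" } },
].map(estreeLiteralToBabel);
console.log(converted.map((n) => n.type));
```

Multiply this by every node type that differs and it becomes clear why feeding an acorn AST to transformFromAst is not a drop-in swap.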

In that case, can we do without babel altogether? There are plenty of other transpilers, after all, such as Bublé. Bublé is indeed based on acorn, however (https://buble.surge.sh/guide/#faqs):

Can we have plugins?
No. The whole point of Bublé is that it's quick to install and takes no mental effort to set up. Plugins would ruin that.
How is Bublé so fast?
The traditional approach to code transformation is to generate an abstract syntax tree (or AST), analyse it, manipulate it, and then generate code from it. Bublé only does the first two parts. Rather than manipulating the AST, Bublé manipulates the source code directly, using the AST as a guide – so there's no costly code generation step at the end. As a bonus, your formatting and code style are left untouched.
This is made possible by magic-string.
Forgoing the flexibility of a plugin system also makes it easy for Bublé to be fast.

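In short, there is no plugin mechanism, and Bublé operates directly on the source code, so it looks like we would have to hack a large part of it.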
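To show what "manipulating the source code directly, using the AST as a guide" means, here is a rough sketch that does not use magic-string at all. applyEdits and the hand-written edit list are made up; in Bublé the start and end offsets would come from AST nodes.

```javascript
// Hypothetical helper: overwrite character ranges of the original source.
// Assumes the edit ranges do not overlap.
function applyEdits(source, edits) {
  // Apply from the back so earlier offsets stay valid after each splice.
  const ordered = [...edits].sort((a, b) => b.start - a.start);
  let out = source;
  for (const { start, end, text } of ordered) {
    out = out.slice(0, start) + text + out.slice(end);
  }
  return out;
}

const source = "const answer = 42;";
const patched = applyEdits(source, [
  { start: 0, end: 5, text: "var" },  // the `const` keyword
  { start: 15, end: 17, text: "43" }, // the numeric literal
]);
console.log(patched);
```

Everything outside the edited ranges is copied through untouched, which is why formatting survives. It is also why new syntax is awkward here: every transform has to be written as a text patch instead of a tree rewrite.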

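We could also try hacking uglifyjs2, but on second thought, let's not.

Of course, there is another reason not to use these transpilers: their support for new language features still lags far behind babel's.

In the end, we set off down the road of no return: acorn + astring plugins. To implement what we need, here is what we have put out so far:

- A Pull Request: https://github.com/ternjs/acorn/pull/495

- The language spec: https://laosb.github.io/halang/definitions.html

- The implementation in progress: https://github.com/laosb/hatp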