Extracting Function Bodies from Source Code with JavaScript Regular Expressions

 How to use regex to capture and extract “class/function” context in source code

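Software engineering research often involves processing code snippets.

I recently ran into one such problem in my own research: how do you extract the contents of each contract and function from Solidity source code?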

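Prerequisites

1. Basic knowledge of regular expressions (regex)
2. Node.js or another JavaScript runtime (for example, a browser)

Problems encountered

- Curly braces nested many levels deep, e.g. {{{{}}}}
- String literals in quotes, especially ones containing characters such as {} and () that the regex would otherwise need to match
- Chinese and other non-ASCII characters
- Braces, quotes, and similar characters inside comments

Step 1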

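First, strip the comments from the code. If they are left in, the regex parsing in the later steps will trip over all kinds of unexpected characters.

Comments can be removed as follows: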
```
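// sources: a collection of objects that each carry a `content` string
function clearCode(sources) {
    // merge all files into one string (the newline keeps adjacent files apart)
    let code = ''
    for (const i in sources) code += sources[i].content + '\n'
    // remove comments: /* ... */ blocks, and // line comments not preceded by ':'
    code = code.replace(/(\/\*[\s\S]*?\*\/)|((?<!:)\/\/.*)/g, '')
    // remove import statements
    code = code.replace(/import.*;/g, '')
    // remove pragma statements
    code = code.replace(/pragma.*;/g, '')
    return code.trim()
}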
```
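The regex doing the real work here is: (\/\*[\s\S]*?\*\/)|((?<!:)\/\/.*)
Note that it excludes every :// sequence, because code often contains strings such as http:// and ftp://.
The function also removes the preamble of each .sol file, namely the pragma compiler version and the import statements that pull in other contracts.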
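To make the effect of the lookbehind concrete, here is a minimal sketch that applies the same comment pattern to a small sample. The name stripComments and the sample text are made up for illustration and are not part of the original code.

```javascript
// Same pattern as in clearCode: block comments, or line comments not preceded by ':'
const COMMENT_RE = /(\/\*[\s\S]*?\*\/)|((?<!:)\/\/.*)/g

// Hypothetical helper: only the comment-stripping part of clearCode
function stripComments(code) {
    return code.replace(COMMENT_RE, '')
}

const sample = [
    '/* header comment */',
    'contract Demo { // trailing comment',
    '    string url = "http://example.com";',
    '}',
].join('\n')

const stripped = stripComments(sample)
console.log(stripped)

// Known limitation: a '//' inside a string that is not preceded by ':' is still
// treated as the start of a comment, so the rest of the line is cut off.
const truncated = stripComments('string s = "a // b";')
console.log(truncated)
```

The URL survives because its // follows a colon, while both real comments are removed. The second call shows the limit of this approach: comments are stripped before strings are masked, so a bare // inside a string literal is cut as if it were a comment.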

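Step 2

Next, mask the quoted content (the string literals in the code) with base64, so that special characters inside strings cannot interfere with the regex matching that follows.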
```
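    // Mask string literals ('...' and "...") when flag is true; restore them when flag is false
    hashStr(text, flag = true) {
        if (flag) {
            // replace each quoted string, quotes included, with its base64 form in double quotes
            const reg = /("[^"]*")|('[^']*')/g
            text = text.replace(reg, s => `"${this.encode(s)}"`)
        } else {
            // every double-quoted run is now base64: strip the quotes and decode
            const reg = /"[^"]*"/g
            text = text.replace(reg, s => this.decode(s.slice(1, -1)))
        }
        return text
    }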
```
```
    // UTF-8 encode first, then base64 (btoa alone only accepts Latin-1 characters)
    encode(text) {
        return btoa(unescape(encodeURIComponent(text)))
    }
```
```
    // Reverse of encode: base64 decode, then interpret the bytes as UTF-8
    decode(text) {
        return decodeURIComponent(escape(atob(text)))
    }
```
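hashStr masks every string literal in the code. The idea is to convert each string, quotes included, to base64.

Note that the text must be UTF-8 encoded before it goes through base64, which is what encode does. Strings in real code often contain UTF-8 content, while btoa only accepts Latin-1 characters (code points up to 255). Conversely, after decoding the base64 you need decode to recover the original text. This is a common pitfall.

The key regexes in hashStr are:
When masking: ("[^"]*")|('[^']*')
When unmasking: "[^"]*"
Unmasking only matches double quotes because every base64 payload was wrapped in double quotes during masking, which makes it easy to find again when decoding. Neither pattern accounts for escaped quotes such as \" inside a string.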
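As a rough illustration of the round trip described above, the sketch below masks and unmasks one line that contains braces and non-ASCII text inside strings. It assumes Node.js and swaps the btoa/atob helpers for Buffer, which should yield the same base64 for UTF-8 text. The names mask and unmask are hypothetical standalone stand-ins for the two branches of hashStr.

```javascript
// Node.js variant of encode/decode (assumption: the global Buffer is available)
const encode = text => Buffer.from(text, 'utf8').toString('base64')
const decode = text => Buffer.from(text, 'base64').toString('utf8')

// Replace each quoted string, quotes included, with "<base64>"
function mask(code) {
    return code.replace(/("[^"]*")|('[^']*')/g, s => `"${encode(s)}"`)
}

// Every double-quoted run is base64 after masking: strip the quotes and decode
function unmask(code) {
    return code.replace(/"[^"]*"/g, s => decode(s.slice(1, -1)))
}

const line = 'emit Log("余额 {balance}", \'}\');'
const masked = mask(line)
const restored = unmask(masked)

console.log(masked)            // braces inside the strings are no longer visible
console.log(restored === line) // the original text comes back unchanged
```

Base64 output uses only letters, digits, +, / and =, so the masked text contains no braces or quotes other than the wrapping double quotes.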

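Step 3

Finally, with comments removed and strings masked, we can safely run the very long regexes below to capture the contract and function contents.
Extracting classes/contracts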
```
    getContracts(text) {
        // mask string literals so braces inside strings cannot confuse the match
        text = this.hashStr(text)
        const reg =
            /(^|\s)(contract|interface|library|abstract\s+contract)\s[^;{}]*{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{[^{}]*})*})*})*})*})*})*})*})*})*}/g
        const res = text.match(reg) || []
        // restore the original strings in every match
        for (const i in res) res[i] = this.hashStr(res[i].trim(), false)
        return res
    }
```
Extracting functions
```
    getFunctions(text) {
        // mask string literals so braces inside strings cannot confuse the match
        text = this.hashStr(text)
        const reg =
            /(^|\s)((function|event)\s|constructor\s*\(.*\))[^{};]*({(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{[^}{]*})*})*})*})*})*})*})*})*}|;)/g
        const res = text.match(reg) || []
        // restore the original strings in every match
        for (const i in res) res[i] = this.hashStr(res[i].trim(), false)
        return res
    }
```
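Note that JavaScript regular expressions cannot match arbitrarily deep nesting, so these patterns fall back on spelling out each level of braces by hand.

getContracts supports 10 levels of nesting. Its regex is: (^|\s)(contract|interface|library|abstract\s+contract)\s[^;{}]*{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{(?:[^{}]+|{[^{}]*})*})*})*})*})*})*})*})*})*}

getFunctions supports 9 levels of nesting. Its regex is: (^|\s)((function|event)\s|constructor\s*\(.*\))[^{};]*({(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{(?:[^}{]+|{[^}{]*})*})*})*})*})*})*})*})*}|;)

Code nested more deeply than the preset number of levels will not be captured in full, and the attempt can run long enough for JavaScript to time out. Add further levels to the pattern if you need them.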
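Rather than typing extra levels by hand, the brace part of the pattern can be generated. The sketch below is one possible way to do that and is not from the original code: nestedBraces is a hypothetical helper, and the keyword list in the usage example is shortened for illustration.

```javascript
// Build a pattern that matches a {...} block containing up to `depth` levels of braces
function nestedBraces(depth) {
    let pattern = '{[^{}]*}' // innermost level: no further braces allowed
    for (let i = 1; i < depth; i++) {
        // wrap the current pattern in one more level of braces
        pattern = '{(?:[^{}]+|' + pattern + ')*}'
    }
    return pattern
}

// Usage: a shortened contract matcher built with a chosen depth
function contractRegex(depth) {
    return new RegExp('(^|\\s)(contract|interface|library)\\s[^;{}]*' + nestedBraces(depth), 'g')
}

const src = 'contract A { function f() { if (x) { y(); } } } contract B { }'

// Three levels are enough for contract A (contract, function, if)
const deepEnough = (src.match(contractRegex(3)) || []).map(s => s.trim())
// With two levels, contract A is too deep and only contract B is found
const tooShallow = (src.match(contractRegex(2)) || []).map(s => s.trim())

console.log(deepEnough)
console.log(tooShallow)
```

With this helper, nestedBraces(10) reproduces the brace portion of the getContracts pattern and nestedBraces(9) that of getFunctions (up to the order of the characters inside the brackets).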

Supplement

To make it easy to get function and class names as well, here is a set of helper methods:
```
    // Return the n-th word (0-based) of str, or an array of words when n is a list of indices
    nWord(str, n) {
        if (typeof n === 'number') {
            // skip n "word + separator" pairs, then capture the next word
            const m = str.match(new RegExp('^(?:\\w+\\W+){' + n + '}(\\w+)'))
            return m && m[1]
        } else if (n.length) {
            const arr = []
            for (const i of n) arr.push(this.nWord(str, i))
            return arr
        }
    },
    // Contract name: the word after `contract`, or after `abstract contract`
    getContractName(contractCode) {
        const words = this.nWord(contractCode, [0, 1, 2])
        if (words[0] === 'abstract') return words[2]
        else return words[1]
    },
    // Function name: the word after `function` or `event`; constructors are reported as 'constructor'
    getFunctionName(functionCode) {
        const words = this.nWord(functionCode, [0, 1])
        if (words[0] === 'constructor') return words[0]
        else return words[1]
    }
```
nWord returns the n-th word (counting from 0), getContractName returns the contract/class name, and getFunctionName returns the function name.
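For a quick check of how these helpers behave, here is a small standalone sketch. It rewrites the three methods as plain functions (so no `this` is needed) and runs them on made-up snippets. The sample inputs are assumptions for illustration only.

```javascript
// Standalone version of nWord for a single index: skip n words, capture the next one
function nWord(str, n) {
    const m = str.match(new RegExp('^(?:\\w+\\W+){' + n + '}(\\w+)'))
    return m ? m[1] : null
}

function getContractName(contractCode) {
    // `abstract contract Name` has the name in third position, otherwise it is second
    return nWord(contractCode, 0) === 'abstract' ? nWord(contractCode, 2) : nWord(contractCode, 1)
}

function getFunctionName(functionCode) {
    // constructors have no name of their own, so the keyword itself is returned
    return nWord(functionCode, 0) === 'constructor' ? 'constructor' : nWord(functionCode, 1)
}

const names = [
    getContractName('abstract contract Vault is Ownable { }'),
    getContractName('contract Token { }'),
    getFunctionName('function transfer(address to) public { }'),
    getFunctionName('constructor() { }'),
]
console.log(names)
```

One caveat: \w does not match $, which Solidity permits in identifiers, so a name containing $ would be cut short at that character.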

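Combining all of the steps above gives you the contents and names of the functions and classes.

Together with the ECharts library, you can also build a function tree from the results.

Useful regex tool sites during development:

Tool 1: shows a regex as a flow (railroad-style) diagram
https://www.debuggex.com/

Tool 2: test a regex online against the sample text you want to match
https://regexr.com/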
Original article: https://blog.csdn.net/u014466109/article/details/125530650
