NLP - monocleaner
Contents
- About monocleaner
  Installation
  Scoring
  Downloading data with monocleaner-download
  Training with monocleaner-train
About monocleaner

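monocleaner is a tool for detecting the fluency of monolingual sentences.

I recommend running monocleaner on Linux. FastSpell, one of its dependencies, failed to install on my Mac (if you got it working, please tell me how), so I do not recommend using it on a Mac.
- It ships a training tool, monocleaner-train, and you can also use a ready-made language pack directly.
- You can download the latest data with the monocleaner-download tool, or get it from https://github.com/bitextor/monocleaner-data/releases/latest.
Installation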
```
python3.7 -m pip install monocleaner
```
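Dependencies
- Most dependencies are downloaded automatically when monocleaner is installed.
- KenLM must be installed beforehand. See: https://blog.csdn.net/lovechris00/article/details/125424808
- monocleaner also depends on FastSpell. I could not install this library on macOS, so in practice I could only use monocleaner on Linux.
  FastSpell: https://github.com/mbanon/fastspell
  FastSpell depends on python-dev and libhunspell-dev (install with: sudo apt install python-dev libhunspell-dev)
- If you need to support the languages that FastSpell lists as similar, you need to install hunspell-es (sudo apt-get install hunspell-es), or download external resources such as https://github.com/wooorm/dictionaries/tree/main/dictionaries
  You can also configure the path to the Hunspell dictionary folder.
  - If you installed with pip, the setting is in venv/lib/python3.7/site-packages/fastspell/config/hunspell.yaml
  - If you installed with setup.py, the setting is in /config/hunspell.yaml
  - If you run directly from the source code, the default path is /usr/share/hunspell.
After a successful installation, three executables (monocleaner, monocleaner-train and monocleaner-download) are placed under python/installation/prefix/bin.
For example:
On my Mac, they are in /Library/Frameworks/Python.framework/Versions/3.7/bin/
On Linux I use the Python that ships with Anaconda, so the executables are in /home/newtranx/anaconda3/bin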
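If you are unsure where the executables ended up, a small script can help. The following is a minimal sketch, not part of monocleaner itself: it asks the running interpreter for its scripts directory and checks whether the three tools are reachable on PATH. It assumes the tools were installed for the same interpreter that runs the script; a user-level install may place them in a different directory.

```python
import shutil
import sysconfig

# Names of the three command-line tools installed with monocleaner.
TOOLS = ("monocleaner", "monocleaner-train", "monocleaner-download")


def locate_tools(names=TOOLS):
    """Map each tool name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


# Directory where this interpreter's console scripts are normally installed.
scripts_dir = sysconfig.get_path("scripts")
found = locate_tools()

print("scripts directory:", scripts_dir)
for name, path in found.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a tool shows up as not found, adding the printed scripts directory to PATH is usually the first thing to try.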

Check the version and the help text
```
$ monocleaner -v
monocleaner Version 1.1.0 # 2021-03-07 # Add lang ident column # Jaume Zaragoza

$ monocleaner -h
usage: monocleaner [-h] [--scol SCOL] [--disable_lang_ident] [--disable_hardrules] [--disable_minimal_length] [--score_only]
                   [--add_lang_ident] [--annotated_output] [--debug] [-q] [-v]
                   model_dir [input] [output]

positional arguments:
  model_dir             Model directory to store LM file and metadata.
  input                 Input file. If omitted, read from 'stdin'.
  output                Output tab-separated text file adding monocleaner score. When omitted output will be written to stdout.

optional arguments:
  -h, --help            show this help message and exit
  --scol SCOL           Sentence column (starting in 1)
  --disable_lang_ident  Disables language identification in hardrules
  --disable_hardrules   Disables the hardrules filtering (only monocleaner fluency scoring is applied)
  --disable_minimal_length
                        Don't apply minimal length (3 words) rule
  --score_only          Only print the score for each sentence, omit all fields
  --add_lang_ident      Add another column with the identified language if it's not disabled.
  --annotated_output    Add hardrules annotation for each sentence
  --debug
  -q, --quiet
  -v, --version         show version of this script and exit
```
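Scoring
- monocleaner is mainly used to measure the fluency of monolingual sentences.
- Each sentence gets a fluency score in the range 0 to 1; the higher the score, the more fluent the sentence.
- In addition to the continuous score, a set of hard-coded rules (hardrules) assigns a score of 0 to obviously problematic sentences.
- The input file must contain one sentence per line.
- The output has the same number of lines as the input, with one extra column holding the score.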
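Because the score is appended as an extra tab-separated column, the output is easy to post-process. Here is a minimal sketch of reading scored lines and keeping only the more fluent sentences. It assumes the default column layout, where the score is the last column; the 0.7 threshold and the function names are hypothetical choices for illustration, not something monocleaner prescribes.

```python
def parse_scored_line(line):
    """Split one output line into (sentence, score).

    Assumes the score is the last tab-separated column.
    """
    text, _, score = line.rstrip("\n").rpartition("\t")
    return text, float(score)


def keep_fluent(lines, threshold=0.7):
    """Yield sentences whose score is at least `threshold` (a hypothetical cutoff)."""
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        text, score = parse_scored_line(line)
        if score >= threshold:
            yield text


# Two sample lines in the tab-separated output format.
sample = [
    "hello, this my name is\t0.676\n",
    "hello, this is my name\t0.706\n",
]
kept = list(keep_fluent(sample, threshold=0.7))
print(kept)
```

A suitable threshold depends on the language and the corpus, so it is worth inspecting a sample of scored sentences before fixing one.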
The command-line syntax of the tool is:
```
monocleaner [-h]
            [--disable_minimal_length]
            [--disable_hardrules]
            [--score_only]
            [--annotated_output]
            [--add_lang_ident]
            [--debug]
            [-q]
            model_dir [input] [output]
```
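Parameters
- Positional arguments:
  - model_dir: directory where the model is stored
  - input: path to the input file. If omitted, input is read from stdin.
  - output: output file, tab-separated, with the monocleaner score added. If omitted, output is written to stdout.
- Optional arguments:
  - --score_only: print only the score for each sentence (default: False)
  - --add_lang_ident: add another column with the identified language, unless language identification is disabled.
  - --disable_hardrules: disable the hardrules filtering, so that only fluency scoring is applied (default: False)
  - --disable_minimal_length: do not apply the minimal length rule of 3 words (default: False)
- Logging:
  - -q, --quiet: silent logging mode (default: False)
  - --debug: debug logging mode (default: False)
  - -v, --version: show version information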
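When calling monocleaner from a script, it helps to assemble the argument list in one place. The sketch below builds the command from the options described above and runs it with `subprocess`. The function names and the `models/en` path are hypothetical; only flags that appear in the help text are used.

```python
import subprocess


def build_monocleaner_cmd(model_dir, input_path=None, output_path=None,
                          score_only=False, add_lang_ident=False,
                          disable_hardrules=False,
                          disable_minimal_length=False, quiet=False):
    """Assemble an argv list for the monocleaner command-line tool."""
    if output_path is not None and input_path is None:
        # Arguments are positional, so an output file requires an input file.
        raise ValueError("output_path requires input_path")
    cmd = ["monocleaner"]
    if score_only:
        cmd.append("--score_only")
    if add_lang_ident:
        cmd.append("--add_lang_ident")
    if disable_hardrules:
        cmd.append("--disable_hardrules")
    if disable_minimal_length:
        cmd.append("--disable_minimal_length")
    if quiet:
        cmd.append("-q")
    cmd.append(str(model_dir))
    for path in (input_path, output_path):
        if path is not None:
            cmd.append(str(path))
    return cmd


def run_monocleaner(**kwargs):
    """Run monocleaner and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_monocleaner_cmd(**kwargs), check=True)


# Hypothetical paths, for illustration only.
cmd = build_monocleaner_cmd("models/en", "input.txt", "scored.tsv", quiet=True)
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.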
Usage example. With no input file given, sentences are typed on stdin and each one is printed back with its score:
```
$ monocleaner xx/monocleaner/models/en
2022-06-25 13:17:35,372 - WARNING - Downloading FastText model...
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2022-06-25 13:18:01,280 - INFO - Start scoring text
hello, this my name is
hello, this my name is	0.676
hello, this is my name
hello, this is my name	0.706
```
Show only the score
```
$ monocleaner --score_only xx/monocleaner/models/en
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2022-06-25 13:23:13,298 - INFO - Start scoring text
hi, I wanna fly to the sky!
0.603
you're beautiful in white
0.800

```
Downloading data with monocleaner-download

monocleaner-download does not seem to have a version option; passing --version simply prints the usage message:
```
$ monocleaner-download --version
Wrong number of arguments: --version
Script to download Bicleaner language packs.

Usage: monocleaner-download <lang> <download_path>
      <lang>         Language code.
      <download_path> Path where downloaded language pack should be placed.
```
With that, we can download a language pack:
```
$ monocleaner-download es xx/monocleaner/models/
```
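PS: I have not seen a zh language pack so far. You can train one with monocleaner-train.
You can also go to https://github.com/bitextor/monocleaner-data/releases/latest to download packs or to see which languages are already supported.

Training with monocleaner-train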
```
$ monocleaner-train -h
usage: monocleaner-train [-h] -l LANGUAGE [--dev_size DEV_SIZE]
                         [--lm_type {PLACEHOLDER,CHARACTER}]
                         [--tokenizer_command TOKENIZER_COMMAND] [--debug]
                         [-q]
                         train model_dir
```
- Positional arguments:
  - train: training corpus file, one monolingual sentence per line.
  - model_dir: model directory in which the LM file and metadata are stored.
- Optional arguments:
  - -h, --help: show the help message and exit
  - -l LANGUAGE, --language LANGUAGE: language code of the model.
  - --dev_size DEV_SIZE: number of sentences used to estimate the mean and standard deviation of perplexity on noisy and clean text. They are extracted from the training corpus.
  - --lm_type {PLACEHOLDER,CHARACTER}
  - --tokenizer_command TOKENIZER_COMMAND: tokenizer command that replaces the Moses tokenizer when using the PLACEHOLDER LM type.
  - --debug
  - -q, --quiet
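Since the training file must hold one sentence per line, a small preparation step can be useful before calling monocleaner-train. The following is a minimal sketch under my own assumptions: dropping blank lines and exact duplicates is an illustrative choice, not a documented requirement, and the file names, the `zh` language code and the helper names are hypothetical. I have not run a training job with this.

```python
import os
import tempfile


def write_training_file(sentences, path):
    """Write one sentence per line, skipping blanks and exact duplicates.

    Returns the number of lines written. Input order is preserved.
    """
    seen = set()
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            # Collapse whitespace so a sentence never spans several lines.
            line = " ".join(sentence.split())
            if not line or line in seen:
                continue
            seen.add(line)
            out.write(line + "\n")
            count += 1
    return count


def build_train_cmd(language, train_path, model_dir, dev_size=None):
    """Assemble an argv list for monocleaner-train from the options listed above."""
    cmd = ["monocleaner-train", "-l", language]
    if dev_size is not None:
        cmd += ["--dev_size", str(dev_size)]
    cmd += [str(train_path), str(model_dir)]
    return cmd


# Tiny demonstration with a temporary file and made-up sentences.
with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.zh")
    written = write_training_file(
        ["今天天气很好。", "", "今天天气很好。", "我喜欢自然语言处理。"], train_path)
    with open(train_path, encoding="utf-8") as f:
        contents = f.read()

train_cmd = build_train_cmd("zh", "train.zh", "models/zh")
print(written, train_cmd)
```

The resulting list can be passed to `subprocess.run(train_cmd, check=True)` once a real corpus is in place.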
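I have not run any training myself, so I cannot describe the results or the problems you may run into here. I will add that when I get the chance.

伊织, 2022-06-25 (Saturday)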
Original article: https://blog.csdn.net/lovechris00/article/details/125458461
