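monocleaner is a tool for scoring the fluency of monolingual sentences.
I recommend running monocleaner on Linux. Its dependency FastSpell failed to install on my Mac (if you got it working, please tell me how), so I do not recommend using it on a Mac.
You can train your own model with monocleaner-train, or use a ready-made language pack directly. The monocleaner-download tool fetches the latest packs, or you can download them from https://github.com/bitextor/monocleaner-data/releases/latest.
Install with pip:
$ python3.7 -m pip install monocleaner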
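Dependencies
The FastSpell dependency needs python-dev and libhunspell-dev (install: sudo apt install python-dev libhunspell-dev).
It also needs Hunspell dictionaries for the languages you work with, for example hunspell-es (sudo apt-get install hunspell-es), or you can download them from an external source such as https://github.com/wooorm/dictionaries/tree/main/dictionaries
The dictionary location is configured in hunspell.yaml. For a pip install the file sits at a path like venv/lib/python3.7/site-packages/fastspell/config/hunspell.yaml; for a setup.py install it is /config/hunspell.yaml. The default dictionary path is /usr/share/hunspell.
After a successful install, three executables (monocleaner, monocleaner-train, monocleaner-download) are placed in python/installation/prefix/bin.
For example:
On my Mac, they are in /Library/Frameworks/Python.framework/Versions/3.7/bin/
On Linux, I use the Python that ships with Anaconda, so the executables are in /home/newtranx/anaconda3/bin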
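Since the install prefix differs between machines, a quick way to find where the three executables ended up is to look them up on PATH. This is a minimal sketch using only the Python standard library; the helper name locate_tools is my own and not part of monocleaner.

```python
import shutil

# The three command names listed above.
TOOLS = ("monocleaner", "monocleaner-train", "monocleaner-download")


def locate_tools(names=TOOLS):
    """Map each command name to its full path on PATH, or None if it is absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a tool shows up as not found, the bin directory of the Python that ran pip is probably not on PATH.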
Checking the version and help
$ monocleaner -v
monocleaner Version 1.1.0 # 2021-03-07 # Add lang ident column # Jaume Zaragoza
$ monocleaner -h
usage: monocleaner [-h] [--scol SCOL] [--disable_lang_ident] [--disable_hardrules] [--disable_minimal_length] [--score_only]
[--add_lang_ident] [--annotated_output] [--debug] [-q] [-v]
model_dir [input] [output]
positional arguments:
model_dir Model directory to store LM file and metadata.
input Input file. If omitted, read from 'stdin'.
output Output tab-separated text file adding monocleaner score. When omitted output will be written to stdout.
optional arguments:
-h, --help show this help message and exit
--scol SCOL Sentence column (starting in 1)
--disable_lang_ident Disables language identification in hardrules
--disable_hardrules Disables the hardrules filtering (only monocleaner fluency scoring is applied)
--disable_minimal_length
Don't apply minimal length (3 words) rule
--score_only Only print the score for each sentence, omit all fields
--add_lang_ident Add another column with the identified language if it's not disabled.
--annotated_output Add hardrules annotation for each sentence
--debug
-q, --quiet
-v, --version show version of this script and exit
The tool is invoked with the following syntax:
monocleaner [-h]
[--disable_minimal_length]
[--disable_hardrules]
[--score_only]
[--annotated_output]
[--add_lang_ident]
[--debug]
[-q]
model_dir [input] [output]
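Parameters
model_dir: directory where the model is stored
input: path of the input file. If omitted, input is read from stdin.
output: output file, tab-separated. If omitted, output is written to stdout.
--score_only: print only the score (default: False)
--add_lang_ident: add a column with the identified language, unless language identification is disabled
--disable_hardrules: disable hardrules filtering, so that only fluency scoring is applied (default: False)
--disable_minimal_length: do not apply the minimal length rule of 3 words (default: False)
-q, --quiet: quiet logging mode (default: False)
--debug: debug logging mode (default: False)
-v, --version: show version information
Usage example:
Enter command +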
$ monocleaner xx/monocleaner/models/en
2022-06-25 13:17:35,372 - WARNING - Downloading FastText model...
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2022-06-25 13:18:01,280 - INFO - Start scoring text
hello, this my name is
hello, this my name is 0.676
hello, this is my name
hello, this is my name 0.706
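The scored lines above can be post-processed with a few lines of Python. The sketch below assumes the score is the last tab-separated column, as the help text describes for the output file; the function names and the 0.7 threshold are my own illustrative choices, not something monocleaner prescribes.

```python
def parse_scored_line(line):
    """Split 'sentence<TAB>score' into (sentence, score as float)."""
    sentence, _, score = line.rstrip("\n").rpartition("\t")
    return sentence, float(score)


def keep_fluent(lines, threshold=0.7):
    """Return the sentences whose score is at least `threshold` (hypothetical cutoff)."""
    return [s for s, score in map(parse_scored_line, lines) if score >= threshold]


# The two sentences and scores from the session above.
sample = [
    "hello, this my name is\t0.676",
    "hello, this is my name\t0.706",
]
print(keep_fluent(sample))  # ['hello, this is my name']
```

A suitable threshold depends on your data, so inspect the score distribution before settling on one.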
Showing only the score
$ monocleaner --score_only xx/monocleaner/models/en
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
2022-06-25 13:23:13,298 - INFO - Start scoring text
hi, I wanna fly to the sky!
0.603
you're beautiful in white
0.800
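The interactive sessions above can also be scripted. Here is a minimal sketch that assembles the command line from the flags in the help output and pipes sentences through stdin. The function names are mine, and I am assuming that with --score_only the tool writes one score per line to stdout while the log messages go elsewhere; I have not verified that.

```python
import subprocess


def build_command(model_dir, input_path=None, output_path=None,
                  score_only=False, disable_hardrules=False, quiet=False):
    """Assemble a monocleaner argument list using only flags shown in `monocleaner -h`."""
    if output_path is not None and input_path is None:
        # The arguments are positional, so an output needs an input before it.
        raise ValueError("output_path requires input_path")
    cmd = ["monocleaner"]
    if score_only:
        cmd.append("--score_only")
    if disable_hardrules:
        cmd.append("--disable_hardrules")
    if quiet:
        cmd.append("--quiet")
    cmd.append(model_dir)
    if input_path is not None:
        cmd.append(input_path)
    if output_path is not None:
        cmd.append(output_path)
    return cmd


def score_sentences(model_dir, sentences):
    """Send sentences via stdin and read scores back (assumes stdout holds only scores)."""
    result = subprocess.run(
        build_command(model_dir, score_only=True, quiet=True),
        input="\n".join(sentences) + "\n",
        capture_output=True, text=True, check=True,
    )
    return [float(token) for token in result.stdout.split()]
```

With the model from the examples, score_sentences("xx/monocleaner/models/en", ["you're beautiful in white"]) would run the same command as the session above.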
monocleaner-download does not seem to have a version option; passing one just prints the usage message
$ monocleaner-download --version
Wrong number of arguments: --version
Script to download Bicleaner language packs.
Usage: monocleaner-download <lang> <download_path>
<lang> Language code.
<download_path> Path where downloaded language pack should be placed.
So now we can download language packs freely
$ monocleaner-download es xx/monocleaner/models/
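PS: I have not seen a zh pack so far. You can train one with monocleaner-train.
You can also go to https://github.com/bitextor/monocleaner-data/releases/latest to download packs or to check which languages are already supported.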
$ monocleaner-train -h
usage: monocleaner-train [-h] -l LANGUAGE [--dev_size DEV_SIZE]
[--lm_type {PLACEHOLDER,CHARACTER}]
[--tokenizer_command TOKENIZER_COMMAND] [--debug]
[-q]
train model_dir
train: training data file, one monolingual sentence per line.
model_dir: model directory, used to store the LM file and metadata.
-h, --help: show this help message and exit
-l LANGUAGE, --language LANGUAGE: language code of the model.
--dev_size DEV_SIZE: number of sentences used to estimate the mean and stddev of perplexity on noisy and clean text. They are extracted from the training corpus.
--lm_type {PLACEHOLDER,CHARACTER}
--tokenizer_command TOKENIZER_COMMAND: tokenizer command that replaces the Moses tokenizer when the PLACEHOLDER LM type is used.
--debug
-q, --quiet
I did not train a model here, so I will not cover training results or the problems that come up. I will add them when I get the chance.
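For reference, a training command can be assembled from the options listed above. This is only a sketch of how the arguments fit together: the helper name, the corpus file name and the model path are hypothetical, and since I have not run a training myself I cannot say how any particular option behaves.

```python
# The two choices shown for --lm_type in `monocleaner-train -h`.
LM_TYPES = ("PLACEHOLDER", "CHARACTER")


def build_train_command(train_file, model_dir, language,
                        dev_size=None, lm_type=None, tokenizer_command=None):
    """Assemble a monocleaner-train argument list from the documented options."""
    if lm_type is not None and lm_type not in LM_TYPES:
        raise ValueError(f"lm_type must be one of {LM_TYPES}")
    cmd = ["monocleaner-train", "-l", language]
    if dev_size is not None:
        cmd += ["--dev_size", str(dev_size)]
    if lm_type is not None:
        cmd += ["--lm_type", lm_type]
    if tokenizer_command is not None:
        cmd += ["--tokenizer_command", tokenizer_command]
    # Positional arguments come last: the training file, then the model directory.
    cmd += [train_file, model_dir]
    return cmd


# Hypothetical corpus file and model path, for illustration only.
print(" ".join(build_train_command("corpus.zh.txt", "xx/monocleaner/models/zh", "zh")))
```

The returned list can be passed to subprocess.run once a real corpus is available.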
伊织, 2022-06-25 (Sat)