• Reproducing Pretrained Language Models: CPT-1 & reStructured Pre-training


    (1) CPT pretraining

    CPT's parameters are not randomly initialized; the encoder inherits the parameters of Chinese RoBERTa.
    From the repository layout: roberta_zh/: Place the checkpoint of Chinese RoBERTa, as CPT initializes the encoder from that checkpoint.
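
    Below is a hedged sketch of how such initialization can work; the checkpoint filename and the "roberta."/"encoder." key prefixes are assumptions for illustration, not CPT's actual mapping (that lives in the repo's pretrain scripts).

```python
# Hypothetical sketch: seed a CPT-style encoder from a Chinese RoBERTa
# checkpoint. The file name and parameter-name prefixes are assumptions.
import torch

roberta_state = torch.load("roberta_zh/pytorch_model.bin", map_location="cpu")

# Keep only encoder weights and rename them into the target model's namespace.
encoder_state = {
    name.replace("roberta.", "encoder."): tensor
    for name, tensor in roberta_state.items()
    if name.startswith("roberta.")
}
# cpt_model.load_state_dict(encoder_state, strict=False)  # decoder stays random
```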

    CPT's pretraining is based on Megatron-LM (officially documented on GitHub); see the pretrain directory in the code.
    Megatron-LM implements model parallelism, i.e., training with the model split across multiple GPUs: either different layers of the model are placed on different GPUs (pipeline parallelism), or a single layer's tensor computation is split across GPUs (tensor parallelism); see the sketch below.
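
    A toy illustration of the two styles in plain PyTorch (not Megatron-LM's actual code; it assumes two CUDA devices are available):

```python
# Toy illustration of pipeline vs. tensor model parallelism with plain
# PyTorch; assumes two GPUs. Megatron-LM implements these far more carefully.
import torch
import torch.nn as nn

x = torch.randn(8, 1024, device="cuda:0")

# (a) Pipeline style: different layers live on different GPUs.
layer0 = nn.Linear(1024, 1024).to("cuda:0")
layer1 = nn.Linear(1024, 1024).to("cuda:1")
y = layer1(layer0(x).to("cuda:1"))  # activations hop between devices

# (b) Tensor style: one layer's weight matrix is split column-wise.
half_a = nn.Linear(1024, 512, bias=False).to("cuda:0")  # left half of W
half_b = nn.Linear(1024, 512, bias=False).to("cuda:1")  # right half of W
out = torch.cat([half_a(x), half_b(x.to("cuda:1")).to("cuda:0")], dim=-1)
```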

    Megatron-LM is a package: a toolkit for training PLMs.

    The whole CPT workflow follows Megatron-LM's pipeline: data preprocessing, pretraining, fine-tuning, and downstream task evaluation.
    CPT's introduction states that the process follows Megatron-LM.
    The whole pretraining process of the CPT model: https://github.com/fastnlp/CPT/blob/master/pretrain/README.md

    (2) reStructured Pre-training (RST): PLM process

    Data format: JSON lines (not plain text). Data tool: DataLab, used to load the data.
    https://github.com/ExpressAI/DataLab/tree/main/datasets

    {"text": "Parker was selected to represent Queensland as an interchange for games II and III of the 2004 State of Origin series and game III of the 2005 State of Origin series.", "entities": ["Queensland", "2004 State of Origin", "2005 State of Origin"]}
    {"text": "The image is only a small portion of the commercial product.", "entities": []}
    {"text": "Rock of Ages is a stylised English version of the thirteenth century Hebrew Hanukkah hymn Ma'oz Tzur.", "entities": ["Hanukkah", "Ma'oz Tzur"]}
    {"text": "Ok, nevermind, I'm apparently clueless as to what's been going on with these boxes. If Cide could look into it, I know he has a lot more experience with this than I do. z4ns4tsu\talk 19:22, 12 September 2006 (UTC)", "entities": []}
    {"text": "I hope the line height will be reduced soon, if not to the density that I prefer.", "entities": []}
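
    A minimal sketch of consuming such JSON-lines records with the standard library ("rst_data.jsonl" is a hypothetical filename; DataLab itself exposes a HuggingFace-datasets-style loading API for its hosted datasets):

```python
import json

# Read the JSON-lines pretraining data shown above, one record per line.
with open("rst_data.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        text, entities = record["text"], record["entities"]
        # Records with an empty "entities" list still contribute plain text.
        print(len(entities), text[:50])
```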

    The publicly released models differ somewhat in the task types they can handle; model variants are released according to the task types they support.
    Model details:

    The model has roughly 11 billion parameters (you can locate it against an existing PLM list for comparison; this parameter count is decent: https://openbmb.github.io/BMList/).


    Q1: What data format is used for the PLM?
    Q2: Is the same template used for every task?

    Answers:

    Signal structure:
    1. General signals (a span-corruption sketch follows this list):

    source: {corrupted text}
    target: {corrupted position1}{target span1}{corrupted position2}{target span2}…

    For example, the corrupted source could be "Thank you <X> me to your party <Y> week." and the corresponding target would be "<X> for inviting <Y> last <Z>".
    2. Task-related signals:
      • multiple-choice format
      • generation format
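
    Here is a minimal sketch of the general signal's span corruption, assuming whitespace tokenization and literal <X>/<Y>/<Z> sentinel strings (real tokenizers use dedicated sentinel token ids):

```python
def corrupt(tokens, spans):
    """Mask the given (start, end) spans; spans must be sorted and disjoint."""
    sentinels = ["<X>", "<Y>", "<Z>"]
    source, target, prev = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        source += tokens[prev:start] + [sentinel]   # sentinel replaces the span
        target += [sentinel] + tokens[start:end]    # target restores the span
        prev = end
    source += tokens[prev:]
    target += [sentinels[len(spans)]]               # closing sentinel
    return " ".join(source), " ".join(target)

tokens = "Thank you for inviting me to your party last week .".split()
src, tgt = corrupt(tokens, [(2, 4), (8, 9)])
print(src)  # Thank you <X> me to your party <Y> week .
print(tgt)  # <X> for inviting <Y> last <Z>
```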

    A multiple-choice format prompt for the sentiment classification task could be the following: I like this movie. Is this text "positive" or "negative"?, while a generation format prompt could be the following: I like this movie. What's the sentiment of the previous text? Two special markers, "TEXT:" and "QUERY:", separate the general context from the intended task to be completed.

    For each type of signal, multiple prompts are constructed so that the model can learn various query forms; the paper designs a total of 1,124 prompts for the 30 signals.
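
    A small sketch of such prompt construction using the TEXT:/QUERY: markers; the template wording below is illustrative, not the paper's actual prompt set:

```python
# Turn one sentiment "signal" into several prompted source/target pairs.
# Template wording here is hypothetical.
TEMPLATES = [
    'TEXT: {text} QUERY: Is this text "positive" or "negative"?',
    "TEXT: {text} QUERY: What's the sentiment of the previous text?",
]

def build_prompts(text, label):
    return [(template.format(text=text), label) for template in TEMPLATES]

for source, target in build_prompts("I like this movie.", "positive"):
    print(source, "->", target)
```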

  • Original article: https://blog.csdn.net/Hekena/article/details/127706497