Chinese Pre-trained ALBERT Models

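Overview

ALBERT models pre-trained on Chinese corpora: fewer parameters, better results. Even a small pre-trained model can handle 13 NLP tasks; ALBERT's three key modifications took it to the top of the GLUE benchmark.

For one-click runs on 10 datasets and 9 baseline models, with a detailed comparison of model performance across tasks, see the Chinese Language Understanding Evaluation benchmark (CLUE benchmark).

Run the CLUE Chinese tasks with one command: 6 Chinese classification or sentence-pair tasks (new)

Usage:

1. Clone the project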

   git clone https://github.com/brightmart/albert_zh.git

2. Run the one-click script (GPU mode). It automatically downloads the model and all task data, then starts running.

   bash run_classifier_clue.sh

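Running this script automatically downloads all task data, finds the best model for each task, and then runs on the test sets to produce submission results.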

Download Pre-trained Chinese Models

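1. albert_tiny_zh, albert_tiny_zh (trained longer, 2 billion cumulative training examples): file size 16M, 4M parameters

Training and inference are about 10x faster with accuracy largely retained, and the model is 1/25 the size of BERT. It reaches 85.4% on the test set of the LCQMC semantic similarity dataset, only 1.5 points below bert_base.
LCQMC training used the following parameters: --max_seq_length=128 --train_batch_size=64 --learning_rate=1e-4 --num_train_epochs=5
albert_tiny uses the same large-scale Chinese corpus, has only 4 layers, and greatly reduces vector dimensions such as hidden size. Try the following learning rates for better results: {2e-5, 6e-5, 1e-4}
[Use cases] Relatively simple tasks or tasks with strict real-time requirements, such as classification and sentence-pair tasks like semantic similarity. For harder tasks such as reading comprehension, use one of the larger models.
For example, the model can be deployed on mobile devices with TensorFlow Lite. This document covers that later, including how to convert the model to TensorFlow Lite format and how to benchmark it.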

Run albert_tiny_zh with one command (Linux, LCQMC task):

git clone https://github.com/brightmart/albert_zh
cd albert_zh
bash run_classifier_lcqmc.sh

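albert_tiny_google_zh (1 billion cumulative training examples, Google version): model size 16M, performance on par with albert_tiny_zh
albert_small_google_zh (1 billion cumulative training examples, Google version): 4x faster than bert_base; only 0.9 points below BERT on the LCQMC test set; model size after removing Adam is 18.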

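2. albert_large_zh: parameter count, 24 layers, file size 64M

Parameter count and model size are one sixth of bert_base; on the test set of LCQMC, a dataset of colloquial sentence similarity, compared with ber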

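3. albert_base_zh (trained on an additional 150 million instances, i.e. 36k steps * batch_size 4096); albert_base_zh (small-model trial version): 12M parameters, 12 layers, size 40M

Parameter count is one tenth of bert_base, and so is model size. On the test set of LCQMC, a dataset of colloquial sentence similarity, it is about 0.6 to 1 point below bert_base. Compared with no pre-training, albert_base gains 14 points.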
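
The "additional 150 million instances" figure follows directly from the step count and batch size quoted for albert_base_zh. A quick check, using only those two numbers from this document:

```python
# Numbers quoted above for the extra training of albert_base_zh.
extra_steps = 36_000
batch_size = 4096

# Each step consumes one batch, so instances seen = steps * batch size.
extra_instances = extra_steps * batch_size

print(f"{extra_instances:,} instances")  # 147,456,000, i.e. roughly 150 million
```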

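4. albert_xlarge_zh_177k; albert_xlarge_zh_183k (try this one first): parameter count, 24 layers, file size 230M

Parameter count and model size are half of bert_base; requires a GPU with a large amount of memory; a full test comparison will be added later; batch_si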

Quick Loading

With Huggingface-Transformers 2.2.2, the models above can be loaded easily.

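from transformers import AutoModel, AutoTokenizer
# Replace "MODEL_NAME" with one of the identifiers listed below.
tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")
model = AutoModel.from_pretrained("MODEL_NAME")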

The MODEL_NAME values are listed below:

Model	MODEL_NAME
albert_tiny_google_zh	voidful/albert_chinese_tiny
albert_small_google_zh	voidful/albert_chinese_small
albert_base_zh (from google)	voidful/albert_chinese_base
albert_large_zh (from google)	voidful/albert_chinese_large
albert_xlarge_zh (from google)	voidful/albert_chinese_xlarge
albert_xxlarge_zh (from google)	voidful/albert_chinese_xxlarge
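
As a rough sketch, the table above can be kept as a lookup so that scripts refer to models by their short names. The helper `load_albert` is hypothetical (it is not part of this project); it only wraps the two `from_pretrained` calls shown earlier and needs the transformers package and network access when called.

```python
# Hub identifiers copied from the table above.
MODEL_NAMES = {
    "albert_tiny_google_zh": "voidful/albert_chinese_tiny",
    "albert_small_google_zh": "voidful/albert_chinese_small",
    "albert_base_zh": "voidful/albert_chinese_base",
    "albert_large_zh": "voidful/albert_chinese_large",
    "albert_xlarge_zh": "voidful/albert_chinese_xlarge",
    "albert_xxlarge_zh": "voidful/albert_chinese_xxlarge",
}


def load_albert(name):
    """Load tokenizer and model for a table entry (hypothetical helper)."""
    # Imported lazily so the lookup table is usable without transformers.
    from transformers import AutoModel, AutoTokenizer

    model_id = MODEL_NAMES[name]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


print(MODEL_NAMES["albert_tiny_google_zh"])
```

Calling `load_albert("albert_tiny_google_zh")` would then download and return the tiny model and its tokenizer.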

More examples of using ALBERT through transformers

Pre-training

Generate files in the required format (tfrecords) by running the following command. The project includes a sample text file (data/news_zh_1.txt).

   bash create_pretrain_data.sh

If you have many text files, you can generate multiple tfrecords files by passing in arguments.
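
One way to handle many text files is to build one conversion command per file. The sketch below only prints the commands. The flag names (`--input_file`, `--output_file`, `--vocab_file`) are assumptions that this document does not confirm, the second input file is made up, and the helper `build_tfrecord_commands` is hypothetical; check create_pretrain_data.sh for the arguments the script actually accepts.

```python
import shlex
from pathlib import Path


def build_tfrecord_commands(text_files, out_dir="./data",
                            vocab_file="./albert_config/vocab.txt"):
    """Return one create_pretraining_data.py command per input text file.

    The flag names used here are assumed, not confirmed by this document.
    """
    commands = []
    for text_file in text_files:
        # Name outputs tf_<stem>.tfrecord so a tf*.tfrecord glob matches them.
        output = Path(out_dir) / f"tf_{Path(text_file).stem}.tfrecord"
        commands.append([
            "python3", "create_pretraining_data.py",
            f"--input_file={text_file}",
            f"--output_file={output}",
            f"--vocab_file={vocab_file}",
        ])
    return commands


# data/news_zh_2.txt is a made-up second file for illustration.
for cmd in build_tfrecord_commands(["data/news_zh_1.txt", "data/news_zh_2.txt"]):
    print(shlex.join(cmd))
```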

Support for English and other non-Chinese languages:

If you are pre-training for English or another non-Chinese language,
set the non_chinese hyperparameter to True in create_pretraining_data.py;
otherwise, by default, it does Chinese pre-training using whole word mask of chi

Run pre-training on GPU/TPU with the following commands

GPU (brightmart version, tiny model):
export BERT_BASE_DIR=./albert_tiny_zh
nohup python3 run_pretraining.py --input_file=./data/tf*.tfrecord  \
--output_dir=./my_new_model_path --do_train=True --do_eval=True --bert_config_file=$BERT_BASE_DIR/albert_config_tiny.json \
--train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=51 \
--num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176    \
--save_checkpoints_steps=2000  --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &

GPU (Google version, small model):
export BERT_BASE_DIR=./albert_small_zh_google
nohup python3 run_pretraining_google.py --input_file=./data/tf*.tfrecord --eval_batch_size=64 \
--output_dir=./my_new_model_path --do_train=True --do_eval=True --albert_config_file=$BERT_BASE_DIR/albert_config_small_google.json  --export_dir=./my_new_model_path_export \
--train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=20 \
--num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176   \
--save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt

TPU, add something like this:
    --use_tpu=True  --tpu_name=grpc://10.240.1.66:8470 --tpu_zone=us-central1-a
    
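Note: if you train from scratch, you can omit init_checkpoint.
If you continue training from an existing model, set the BERT_BASE_DIR path and make sure the values of bert_config_file and init_checkpoint point to the corresponding files.
For domain-specific pre-training, depending on the size of the data, you may not need to train for very long.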
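
To get a feel for how long a full run with the flags above is (and how much shorter a domain-specific run could be), the totals can be worked out from the flag values. This is plain arithmetic on the numbers in the commands, not output from the training scripts.

```python
# Values copied from the GPU pre-training commands above.
train_batch_size = 4096
num_train_steps = 125_000
num_warmup_steps = 12_500
max_seq_length = 512

# Total sequences consumed over the run: one batch per step.
total_examples = train_batch_size * num_train_steps

# Fraction of the run spent in learning-rate warmup.
warmup_fraction = num_warmup_steps / num_train_steps

# Upper bound on tokens processed, if every sequence were full length.
max_tokens = total_examples * max_seq_length

print(f"examples seen:   {total_examples:,}")
print(f"warmup fraction: {warmup_fraction:.0%}")
print(f"token budget:    {max_tokens:,}")
```

Lowering num_train_steps scales the first and third numbers proportionally; keeping num_warmup_steps at the same fraction of the run is a common choice, though this document does not prescribe one.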

Environment

Python 3 +
TensorFlow 1.x, e.g. TensorFlow 1.4 or 1.5

Fine-tuning on Downstream Tasks

With TensorFlow: take fine-tuning albert_base on the LCQMC task as an example. LCQMC is a corpus of colloquial language used to train and predict the semantic similarity of a pair of sentences.

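Download the LCQMC dataset, which contains training, validation, and test sets. The training set contains 240,000 colloquial Chinese sentence pairs labeled 1 or 0: 1 means the two sentences are semantically similar, 0 means they are not.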
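
Before fine-tuning, it can help to sanity-check the downloaded files. The sketch below assumes each line holds two sentences and a label separated by tabs; that column layout is an assumption not stated in this document, so verify it against your copy of the dataset. The helper `read_sentence_pairs` and the two sample pairs are made up for illustration.

```python
import csv
import io
from collections import Counter


def read_sentence_pairs(handle):
    """Yield (sentence_a, sentence_b, label) from a tab-separated file.

    Assumes one example per line: sentence_a <TAB> sentence_b <TAB> label,
    with label "1" (similar) or "0" (not similar).
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3 or row[2] not in {"0", "1"}:
            continue  # skip blank, header, or malformed lines
        sentence_a, sentence_b, label = row
        yield sentence_a, sentence_b, int(label)


# Tiny in-memory stand-in for a real training file (made-up examples).
sample = io.StringIO(
    "今天天气怎么样\t今天天气如何\t1\n"
    "今天天气怎么样\t明天去哪里玩\t0\n"
)
pairs = list(read_sentence_pairs(sample))
print(Counter(label for _, _, label in pairs))
```

For a real file, pass `open(path, encoding="utf-8", newline="")` instead of the in-memory sample.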

Fine-tune on the LCQMC dataset by running the following commands:

1. Clone this project:
      
      git clone https://github.com/brightmart/albert_zh.git
      
2. Fine-tuning by running the following command.
    brightmart version, tiny model:
    export BERT_BASE_DIR=./albert_tiny_zh
    export TEXT_DIR=./lcqmc
    nohup python3 run_classifier.py   --task_name=lcqmc_pair   --do_train=true   --do_eval=true   --data_dir=$TEXT_DIR   --vocab_file=./albert_config/vocab.txt  \
    --bert_config_file=./albert_config/albert_config_tiny.json --max_seq_length=128 --train_batch_size=64   --learning_rate=1e-4  --num_train_epochs=5 \
    --output_dir=./albert_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &
    
    Google version, small model:
    export BERT_BASE_DIR=./albert_small_zh
    export TEXT_DIR=./lcqmc
    nohup python3 run_classifier_sp_google.py --task_name=lcqmc_pair   --do_train=true   --do_eval=true   --data_dir=$TEXT_DIR   --vocab_file=./albert_config/vocab.txt  \
    --albert_config_file=./$BERT_BASE_DIR/albert_config_small_google.json --max_seq_length=128 --train_batch_size=64   --learning_rate=1e-4   --num_train_epochs=5 \
    --output_dir=./albert_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt &
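
This document suggests trying the learning rates {2e-5, 6e-5, 1e-4}. A minimal sketch of such a sweep for the brightmart tiny model follows. It reuses the flags from the command above and only prints the commands. The helper `build_finetune_command` and the per-run output directory names are hypothetical.

```python
import shlex

# Learning rates suggested in this document.
LEARNING_RATES = ["2e-5", "6e-5", "1e-4"]


def build_finetune_command(learning_rate, model_dir="./albert_tiny_zh",
                           data_dir="./lcqmc"):
    """Return the run_classifier.py argument list for one learning rate."""
    return [
        "python3", "run_classifier.py",
        "--task_name=lcqmc_pair",
        "--do_train=true",
        "--do_eval=true",
        f"--data_dir={data_dir}",
        "--vocab_file=./albert_config/vocab.txt",
        "--bert_config_file=./albert_config/albert_config_tiny.json",
        "--max_seq_length=128",
        "--train_batch_size=64",
        f"--learning_rate={learning_rate}",
        "--num_train_epochs=5",
        # Separate output directory per run (naming scheme is hypothetical).
        f"--output_dir=./albert_lcqmc_checkpoints_lr{learning_rate}",
        f"--init_checkpoint={model_dir}/albert_model.ckpt",
    ]


commands = [build_finetune_command(lr) for lr in LEARNING_RATES]
for cmd in commands:
    print(shlex.join(cmd))
    # To actually launch a run: subprocess.run(cmd, check=True)
```

Writing each run to its own output directory keeps checkpoints from different learning rates from overwriting each other, so the evaluation results can be compared afterwards.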

Notice:
    1) You need to download the pre-trained Chinese ALBERT model and place it in the project directory (assume the directory is named albert_tiny_zh).
    You also need to download the LCQMC dataset and place it in the project directory (assume the dataset directory is named lcqmc).

    2) For fine-tuning, you can try adding a small amount of dropout (e.g. 0.1) by changing the
      attention_probs_dropout_prob and hidden_dropout_prob parameters in albert_config_xxx.json. By default, dropout is set to zero.
    
    3) You can try different learning rates {2e-5, 6e-5, 1e-4} for better performance.
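
The dropout change in note 2 is a small edit to the JSON config. A sketch of doing it programmatically follows. The helper `set_dropout` is hypothetical, and the two-field demo config is a stand-in for a real albert_config_xxx.json, which contains many more fields.

```python
import json
import tempfile
from pathlib import Path


def set_dropout(config_path, rate=0.1):
    """Set both dropout fields in an ALBERT config JSON file, in place."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["attention_probs_dropout_prob"] = rate
    config["hidden_dropout_prob"] = rate
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo on a throwaway file so no real config is modified.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "albert_config_demo.json"
    demo_path.write_text(
        json.dumps({"attention_probs_dropout_prob": 0.0,
                    "hidden_dropout_prob": 0.0}),
        encoding="utf-8",
    )
    updated = set_dropout(demo_path, rate=0.1)
    print(updated)
```

Working on a copy of the config file makes it easy to switch back to the default of zero dropout.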