PyTorch-Transformers
Model Description
PyTorch-Transformers (formerly known as pytorch-pretrained-bert) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).
The library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for the following models:
- BERT (from Google), released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
- GPT (from OpenAI), released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
- GPT-2 (from OpenAI), released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
- Transformer-XL (from Google/CMU), released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.
- XLNet (from Google/CMU), released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
- XLM (from Facebook), released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
- RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov.
- DistilBERT (from HuggingFace), released together with the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.
The components made available here are based on the AutoModel and AutoTokenizer classes of the pytorch-transformers library.
Requirements
Unlike most other PyTorch Hub models, BERT requires a few additional Python packages to be installed.
pip install tqdm boto3 requests regex sentencepiece sacremoses
Usage
The available methods are the following:
- config: returns a configuration item corresponding to the specified model or path.
- tokenizer: returns a tokenizer corresponding to the specified model or path.
- model: returns a model corresponding to the specified model or path.
- modelForCausalLM: returns a model with a language modeling head corresponding to the specified model or path.
- modelForSequenceClassification: returns a model with a sequence classification head corresponding to the specified model or path.
- modelForQuestionAnswering: returns a model with a question answering head corresponding to the specified model or path.
All these methods share the following argument: pretrained_model_or_path, a string identifying the pre-trained model or path from which the instance will be returned. Several checkpoints are available for each model, which are detailed below:
The available models are listed on the models page of the transformers documentation.
Documentation
Here are a few examples detailing the usage of each available method.
Tokenizer
The tokenizer object allows the conversion from character strings to tokens understood by the different models. Each model has its own tokenizer, and some tokenization methods differ between tokenizers. The complete documentation can be found here.
import torch
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased') # Download vocabulary from S3 and cache.
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', './test/bert_saved_model/') # E.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`
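Once loaded, a tokenizer converts text to token ids and back. A minimal sketch (the example sentence is hypothetical, and the exact word pieces and ids depend on the checkpoint's vocabulary):
text = "Hello, how are you?"                          # hypothetical example sentence
token_ids = tokenizer.encode(text, add_special_tokens=True)
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # human-readable word pieces, e.g. ['[CLS]', 'hello', ...]
decoded = tokenizer.decode(token_ids)                  # back to a string, special tokens included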
Models
The model object is a model instance that inherits from nn.Module. Each model comes with its own saving/loading methods, either from a local file or directory, or from a pre-trained configuration (see the config described above). Each model works differently; a complete overview of the different models can be found in the documentation.
import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased') # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased', output_attentions=True) # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
from transformers import AutoConfig  # or `from pytorch_transformers import AutoConfig`, depending on the installed package
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
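The local directories referenced in the snippets above are produced with the library's save_pretrained method. A minimal sketch of the round trip (the './test/saved_model/' path is just an example):
model.save_pretrained('./test/saved_model/')  # writes config.json and the model weights to the directory
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './test/saved_model/')  # reload from disk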
Models with a language modeling head
The previously mentioned model instance with an additional language modeling head.
import torch
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2') # Download model and configuration from huggingface.co and cache.
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './test/saved_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2', output_attentions=True) # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = AutoConfig.from_pretrained('./tf_model/gpt_tf_model_config.json')
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './tf_model/gpt_tf_checkpoint.ckpt.index', from_tf=True, config=config)
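As a quick sanity check, a causal LM loaded as above can generate a continuation for a prompt. A minimal sketch, assuming the matching GPT-2 tokenizer and the generate() API available in recent transformers releases; the prompt is hypothetical:
tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'gpt2')
input_ids = torch.tensor([tokenizer.encode("The quick brown fox")])  # hypothetical prompt
output_ids = model.generate(input_ids, max_length=20)                # greedy decoding by default
print(tokenizer.decode(output_ids[0]))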
Models with a sequence classification head
The previously mentioned model instance with an additional sequence classification head.
import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-uncased') # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', './test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-uncased', output_attentions=True) # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
Models with a question answering head
The previously mentioned model instance with an additional question answering head.
import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-base-uncased') # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', './test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-base-uncased', output_attentions=True) # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
Configuration
The configuration is optional. The configuration object holds information about the model, such as the number of heads/layers, whether the model should output attentions or hidden states, and whether it should be adapted for TorchScript. Many parameters are available, some specific to each model. The complete documentation can be found here.
import torch
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased') # Download configuration from S3 and cache.
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './test/bert_saved_model/') # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './test/bert_saved_model/my_configuration.json')
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False)
assert config.output_attentions == True
config, unused_kwargs = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False, return_unused_kwargs=True)
assert config.output_attentions == True
assert unused_kwargs == {'foo': False}
# Using the configuration with a model
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased')
config.output_attentions = True
config.output_hidden_states = True
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased', config=config)
# Model will now output attentions and hidden states as well
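With this configuration, the forward pass also returns the attention and hidden-state tensors. A minimal sketch, assuming a recent transformers release where the outputs expose .attentions and .hidden_states attributes (older releases append them to the returned tuple instead):
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')
inputs = torch.tensor([tokenizer.encode("Hello world", add_special_tokens=True)])
with torch.no_grad():
    outputs = model(inputs)
print(len(outputs.hidden_states))   # embedding layer + one tensor per transformer layer
print(outputs.attentions[0].shape)  # (batch, num_heads, seq_len, seq_len)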
Example Usage
Here is an example on how to tokenize the input text so it can be fed to a BERT model, then get the hidden states computed by such a model, or predict masked tokens using a BERT model with a language modeling head.
First, tokenize the input
import torch
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
# Tokenized input with special tokens around it (for BERT: [CLS] at the beginning and [SEP] at the end)
indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)
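The segment ids hard-coded in the next step can also be obtained from the tokenizer itself. A minimal sketch, assuming the encode_plus API of recent tokenizer versions:
encoded = tokenizer.encode_plus(text_1, text_2, add_special_tokens=True, return_token_type_ids=True)
assert encoded['input_ids'] == indexed_tokens  # same ids as tokenizer.encode above
print(encoded['token_type_ids'])               # 0 for text_1 tokens, 1 for text_2 tokens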
Use BertModel to encode the input sentence in a sequence of last layer hidden-states
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
# Convert inputs to PyTorch tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-cased')
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, token_type_ids=segments_tensors)
Use modelForMaskedLM to predict a masked token with BERT
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
indexed_tokens[masked_index] = tokenizer.mask_token_id
tokens_tensor = torch.tensor([indexed_tokens])
masked_lm_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForMaskedLM', 'bert-base-cased')
with torch.no_grad():
    predictions = masked_lm_model(tokens_tensor, token_type_ids=segments_tensors)
# Get the predicted token
predicted_index = torch.argmax(predictions[0][0], dim=1)[masked_index].item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'Jim'
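Beyond the single most likely token, the logits at the masked position can be inspected for the top candidates. A minimal sketch using torch.topk:
top_k = torch.topk(predictions[0][0, masked_index], k=5)
print(tokenizer.convert_ids_to_tokens(top_k.indices.tolist()))  # five most likely fillers for the masked position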
Use modelForQuestionAnswering to do question answering with BERT
question_answering_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-large-uncased-whole-word-masking-finetuned-squad')
question_answering_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-large-uncased-whole-word-masking-finetuned-squad')
# The format is paragraph first and then question
text_1 = "Jim Henson was a puppeteer"
text_2 = "Who was Jim Henson ?"
indexed_tokens = question_answering_tokenizer.encode(text_1, text_2, add_special_tokens=True)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])
# Predict the start and end positions logits
with torch.no_grad():
    out = question_answering_model(tokens_tensor, token_type_ids=segments_tensors)
# get the highest prediction
answer = question_answering_tokenizer.decode(indexed_tokens[torch.argmax(out.start_logits):torch.argmax(out.end_logits)+1])
assert answer == "puppeteer"
# Or get the total loss which is the sum of the CrossEntropy loss for the start and end token positions (set model to train mode before if used for training)
start_positions, end_positions = torch.tensor([12]), torch.tensor([14])
multiple_choice_loss = question_answering_model(tokens_tensor, token_type_ids=segments_tensors, start_positions=start_positions, end_positions=end_positions)
Use modelForSequenceClassification to do paraphrase classification with BERT
sequence_classification_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-cased-finetuned-mrpc')
sequence_classification_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased-finetuned-mrpc')
text_1 = "Jim Henson was a puppeteer"
text_2 = "Who was Jim Henson ?"
indexed_tokens = sequence_classification_tokenizer.encode(text_1, text_2, add_special_tokens=True)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])
# Predict the sequence classification logits
with torch.no_grad():
    seq_classif_logits = sequence_classification_model(tokens_tensor, token_type_ids=segments_tensors)
predicted_labels = torch.argmax(seq_classif_logits[0]).item()
assert predicted_labels == 0 # In MRPC dataset this means the two sentences are not paraphrasing each other
# Or get the sequence classification loss (set model to train mode before if used for training)
labels = torch.tensor([1])
seq_classif_loss = sequence_classification_model(tokens_tensor, token_type_ids=segments_tensors, labels=labels)