PyTorch-Transformers

Model Description

PyTorch-Transformers (formerly known as pytorch-pretrained-bert) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).

The library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for the following models:

  1. BERT (from Google), released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  2. GPT (from OpenAI), released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
  3. GPT-2 (from OpenAI), released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
  4. Transformer-XL (from Google/CMU), released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.
  5. XLNet (from Google/CMU), released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
  6. XLM (from Facebook), released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.
  7. RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov.
  8. DistilBERT (from HuggingFace), released together with the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.

The components made available here are based on the AutoModel and AutoTokenizer classes of the pytorch-transformers library.

Requirements

Unlike most other PyTorch Hub models, BERT requires a few additional Python packages to be installed.

pip install tqdm boto3 requests regex sentencepiece sacremoses

Usage

The available methods are the following:

  • config: returns a configuration item corresponding to the specified model or path.
  • tokenizer: returns a tokenizer corresponding to the specified model or path.
  • model: returns a model corresponding to the specified model or path.
  • modelForCausalLM: returns a model with a language modeling head corresponding to the specified model or path.
  • modelForSequenceClassification: returns a model with a sequence classification head corresponding to the specified model or path.
  • modelForQuestionAnswering: returns a model with a question answering head corresponding to the specified model or path.

All these methods share the following argument: pretrained_model_or_path, a string identifying the pre-trained model or path from which the returned instance is built. Several checkpoints are available for each model, as detailed below:

The available models are listed on the models page of the transformers documentation.
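
For instance, the same method accepts different checkpoints of the same architecture; here is a minimal sketch comparing the configurations of two publicly available BERT checkpoints:

import torch

# Different checkpoints of the same architecture are selected purely by the identifier string.
base_config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased')
large_config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-large-uncased')
assert large_config.hidden_size > base_config.hidden_size  # 1024 vs. 768 for these two checkpoints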

Documentation

Here are a few examples detailing the usage of each available method.

Tokenizer

The tokenizer object allows the conversion from character strings to tokens understood by the different models. Each model has its own tokenizer, and some tokenizing methods differ between tokenizers. The complete documentation can be found here.

import torch
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')    # Download vocabulary from S3 and cache.
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', './test/bert_saved_model/')  # E.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`
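
As a quick sketch of what the tokenizer does once loaded (assuming the bert-base-uncased tokenizer from the first line above):

text = "Hello, how are you?"
token_ids = tokenizer.encode(text, add_special_tokens=True)  # string -> list of vocabulary ids
tokens = tokenizer.convert_ids_to_tokens(token_ids)          # ids -> string tokens, e.g. '[CLS]', 'hello', ...
decoded = tokenizer.decode(token_ids)                        # ids -> single string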

Models

The model object is a model instance inheriting from nn.Module. Each model comes with its own saving/loading method, either from a local file or directory, or from a pre-trained configuration (see the previously described config). Each model works differently; a complete overview of the different models can be found in the documentation.

import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
from transformers import AutoConfig  # assumes the transformers package is installed
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
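
A model loaded this way can also be saved locally with save_pretrained and reloaded from the resulting directory (a minimal sketch; the directory path is illustrative):

model.save_pretrained('./test/saved_model/')  # writes the configuration and the model weights to disk
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './test/saved_model/')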

Models with a language modeling head

Previously mentioned model instance with an additional language modeling head.

import torch
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2')    # Download model and configuration from huggingface.co and cache.
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
from transformers import AutoConfig  # assumes the transformers package is installed
config = AutoConfig.from_pretrained('./tf_model/gpt_tf_model_config.json')
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './tf_model/gpt_tf_checkpoint.ckpt.index', from_tf=True, config=config)
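
As a brief illustration of what the language modeling head adds, a GPT-2 model loaded as above returns a vocabulary-sized logit for every position, from which the most likely next token can be read off (a sketch assuming a recent transformers version):

tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'gpt2')
input_ids = torch.tensor([tokenizer.encode("The Manhattan bridge is")])

with torch.no_grad():
    outputs = model(input_ids)

next_token_logits = outputs[0][0, -1, :]  # logits for the token following the prompt
next_token = tokenizer.decode([torch.argmax(next_token_logits).item()])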

Models with a sequence classification head

Previously mentioned model instance with an additional sequence classification head.

import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
from transformers import AutoConfig  # assumes the transformers package is installed
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)

Models with a question answering head

Previously mentioned model instance with an additional question answering head.

import torch
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
from transformers import AutoConfig  # assumes the transformers package is installed
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)

Configuration

The configuration is optional. The configuration object holds information concerning the model, such as the number of heads/layers, whether the model should output attentions or hidden states, or whether it should be adapted for TorchScript. Many parameters are available, some specific to each model. The complete documentation can be found here.

import torch
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased')  # Download configuration from S3 and cache.
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './test/bert_saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
config = torch.hub.load('huggingface/pytorch-transformers', 'config', './test/bert_saved_model/my_configuration.json')
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False)
assert config.output_attentions == True
config, unused_kwargs = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False, return_unused_kwargs=True)
assert config.output_attentions == True
assert unused_kwargs == {'foo': False}

# Using the configuration with a model
config = torch.hub.load('huggingface/pytorch-transformers', 'config', 'bert-base-uncased')
config.output_attentions = True
config.output_hidden_states = True
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased', config=config)
# Model will now output attentions and hidden states as well

Example Usage

Here is an example of how to tokenize the input text so it can be fed to a BERT model, and then get the hidden states computed by such a model, or predict masked tokens using the language modeling BERT model.

First, tokenize the input

import torch
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')

text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"

# Tokenized input with special tokens around it (for BERT: [CLS] at the beginning and [SEP] at the end)
indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)

Use BertModel to encode the input sentence into a sequence of last-layer hidden states

# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-cased')

with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    encoded_layers = outputs[0]  # last-layer hidden states, shape (batch, sequence, hidden size)
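
Rather than writing the segments_ids list by hand, a recent tokenizer can also produce the token type ids (and input ids) directly; a minimal sketch assuming the same bert-base-cased tokenizer:

encoded = tokenizer(text_1, text_2, return_tensors='pt')
tokens_tensor = encoded['input_ids']          # same content as torch.tensor([indexed_tokens])
segments_tensors = encoded['token_type_ids']  # 0 for sentence A tokens, 1 for sentence B tokens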

Use modelForMaskedLM to predict a masked token with BERT

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
indexed_tokens[masked_index] = tokenizer.mask_token_id
tokens_tensor = torch.tensor([indexed_tokens])

masked_lm_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForMaskedLM', 'bert-base-cased')

with torch.no_grad():
    predictions = masked_lm_model(tokens_tensor, token_type_ids=segments_tensors)

# Get the predicted token
predicted_index = torch.argmax(predictions[0][0], dim=1)[masked_index].item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'Jim'

Use modelForQuestionAnswering to do question answering with BERT

question_answering_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-large-uncased-whole-word-masking-finetuned-squad')
question_answering_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-large-uncased-whole-word-masking-finetuned-squad')

# The format is paragraph first and then question
text_1 = "Jim Henson was a puppeteer"
text_2 = "Who was Jim Henson ?"
indexed_tokens = question_answering_tokenizer.encode(text_1, text_2, add_special_tokens=True)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# Predict the start and end positions logits
with torch.no_grad():
    out = question_answering_model(tokens_tensor, token_type_ids=segments_tensors)

# get the highest prediction
answer = question_answering_tokenizer.decode(indexed_tokens[torch.argmax(out.start_logits):torch.argmax(out.end_logits)+1])
assert answer == "puppeteer"

# Or get the total loss which is the sum of the CrossEntropy loss for the start and end token positions (set model to train mode before if used for training)
start_positions, end_positions = torch.tensor([12]), torch.tensor([14])
multiple_choice_loss = question_answering_model(tokens_tensor, token_type_ids=segments_tensors, start_positions=start_positions, end_positions=end_positions)

Use modelForSequenceClassification to do paraphrase classification with BERT

sequence_classification_model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-cased-finetuned-mrpc')
sequence_classification_tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased-finetuned-mrpc')

text_1 = "Jim Henson was a puppeteer"
text_2 = "Who was Jim Henson ?"
indexed_tokens = sequence_classification_tokenizer.encode(text_1, text_2, add_special_tokens=True)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

# Predict the sequence classification logits
with torch.no_grad():
    seq_classif_logits = sequence_classification_model(tokens_tensor, token_type_ids=segments_tensors)

predicted_labels = torch.argmax(seq_classif_logits[0]).item()

assert predicted_labels == 0  # In MRPC dataset this means the two sentences are not paraphrasing each other

# Or get the sequence classification loss (set model to train mode before if used for training)
labels = torch.tensor([1])
seq_classif_loss = sequence_classification_model(tokens_tensor, token_type_ids=segments_tensors, labels=labels)

PyTorch implementations of popular NLP Transformers

Model type: NLP
Submitted by: HuggingFace Team