PromptTensorDictTokenizer

class torchrl.data.PromptTensorDictTokenizer(tokenizer, max_length, key='prompt', padding='max_length', truncation=True, return_tensordict=True, device=None)

Tokenization recipe for prompt datasets.

Returns a tokenizer function that reads an example containing a prompt and a label, and tokenizes them.

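Parameters:

tokenizer (tokenizer from the transformers library) – the tokenizer to use.
max_length (int) – maximum length of the sequence.
key (str, optional) – the key under which the text is found. Defaults to "prompt".
padding (str, optional) – type of padding. Defaults to "max_length".
truncation (bool, optional) – whether the sequences should be truncated to max_length. Defaults to True.
return_tensordict (bool, optional) – if True, a TensorDict is returned. Otherwise, the original data is returned. Defaults to True.
device (torch.device, optional) – the device on which to store the data. This option is ignored if return_tensordict=False. Defaults to None.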

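The __call__() method of this class executes the following operations:

Reads the prompt string concatenated with the label string and tokenizes the result. The tokens are stored in the "input_ids" TensorDict entry.

Writes a "prompt_rindex" entry containing the index of the last valid token of the prompt.

Writes a "valid_sample" entry that identifies which entries in the tensordict have enough tokens to meet the max_length criterion.

Returns a tensordict.TensorDict instance with the tokenized inputs.

The batch size of the tensordict matches the batch size of the input.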
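The steps above can be sketched in plain Python with a toy whitespace tokenizer. This is only an illustration and not the library's implementation: toy_tokenize, toy_prompt_tokenizer and the "<pad>" token are hypothetical names, tokens are kept as strings rather than integer ids, and the rule used for "valid_sample" (the concatenated sequence fits within max_length without truncation) is an assumed reading of the description.

```python
from typing import Dict, List

PAD = "<pad>"  # hypothetical padding token used only by this toy sketch


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: splits on whitespace (not a transformers tokenizer)."""
    return text.split()


def toy_prompt_tokenizer(
    example: Dict[str, List[str]], max_length: int, key: str = "prompt"
) -> Dict[str, list]:
    """Mimic the documented steps on a batch of prompt/label strings."""
    input_ids, attention_mask, prompt_rindex, valid_sample = [], [], [], []
    for prompt, label in zip(example[key], example["label"]):
        # Index of the last prompt token (prompt alone, capped at max_length).
        prompt_tokens = toy_tokenize(prompt)[:max_length]
        prompt_rindex.append(len(prompt_tokens) - 1)

        # Tokenize prompt and label together. A space is inserted only because
        # the toy tokenizer splits on whitespace.
        tokens = toy_tokenize(prompt + " " + label)

        # Assumption: a sample is valid when it fits without truncation.
        valid_sample.append(len(tokens) <= max_length)

        # Truncate, then pad to max_length (padding="max_length" behaviour).
        tokens = tokens[:max_length]
        n_pad = max_length - len(tokens)
        input_ids.append(tokens + [PAD] * n_pad)
        attention_mask.append([1] * len(tokens) + [0] * n_pad)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "prompt_rindex": prompt_rindex,
        "valid_sample": valid_sample,
    }


batch = {"prompt": ["a b c", "d e"], "label": ["x y", "z"]}
out = toy_prompt_tokenizer(batch, max_length=6)
print(out["prompt_rindex"], out["valid_sample"])
```

Every output list has one entry per input example, which mirrors the statement that the output batch size matches the input batch size.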

Examples

>>> from transformers import AutoTokenizer
>>> from torchrl.data import PromptTensorDictTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> tokenizer.pad_token = tokenizer.eos_token
>>> example = {
...     "prompt": ["This prompt is long enough to be tokenized.", "this one too!"],
...     "label": ["Indeed it is.", 'It might as well be.'],
... }
>>> fn = PromptTensorDictTokenizer(tokenizer, 50)
>>> print(fn(example))
TensorDict(
    fields={
        attention_mask: Tensor(shape=torch.Size([2, 50]), device=cpu, dtype=torch.int64, is_shared=False),
        input_ids: Tensor(shape=torch.Size([2, 50]), device=cpu, dtype=torch.int64, is_shared=False),
        prompt_rindex: Tensor(shape=torch.Size([2]), device=cpu, dtype=torch.int64, is_shared=False),
        valid_sample: Tensor(shape=torch.Size([2]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([2]),
    device=None,
    is_shared=False)
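One possible use of the "prompt_rindex" entry, sketched here under stated assumptions, is to separate label tokens from prompt and padding tokens. The helper label_token_mask is hypothetical and not part of TorchRL, and the attention_mask and prompt_rindex values below are hand-written nested lists for illustration, not real tokenizer output.

```python
from typing import List


def label_token_mask(
    attention_mask: List[List[int]], prompt_rindex: List[int]
) -> List[List[bool]]:
    """Mark positions that come after the prompt and are not padding."""
    masks = []
    for row, rindex in zip(attention_mask, prompt_rindex):
        # A position belongs to the label when it is attended to (mask == 1)
        # and lies strictly after the last prompt token.
        masks.append([bool(m) and pos > rindex for pos, m in enumerate(row)])
    return masks


# Illustrative values: two samples padded to length 6.
attention_mask = [[1, 1, 1, 1, 1, 0], [1, 1, 1, 0, 0, 0]]
prompt_rindex = [2, 1]
mask = label_token_mask(attention_mask, prompt_rindex)
print(mask)
```

Relying on attention_mask rather than on the token ids matters when the padding token is the same as the end-of-sequence token, as in the example above where pad_token is set to eos_token.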
