Natural language processing (NLP) is advancing rapidly, and language translation, one of its key applications, is changing how people communicate across languages. PyTorch, a widely used deep learning framework, together with the TorchText library, provides powerful and convenient tools for building translation models. Even if you are new to programming, this article walks you step by step through building a language translation model with PyTorch and TorchText, and can serve as a starting point for your NLP journey.
Before building the translation model, make sure your development environment has the required dependencies installed; everything that follows depends on it. Note that the code in this article uses TorchText's legacy Field/BucketIterator API, which requires torchtext 0.8 or earlier (in 0.9-0.11 the same classes live under torchtext.legacy), and that recent spaCy releases replace the en/de shortcut names below with en_core_web_sm and de_core_news_sm, so you may need to pin compatible versions. The setup steps are:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install torchtext
pip install spacy
python -m spacy download en
python -m spacy download de
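Before moving on, it is worth confirming that the installation works. The snippet below is a minimal Python sanity check, assuming the versions installed above (in particular, spacy.load("de") relies on the de shortcut model; newer spaCy releases use de_core_news_sm):

import torch
import torchtext
import spacy

# Print the installed versions and check whether a GPU is visible
print(torch.__version__, torchtext.__version__, spacy.__version__)
print("CUDA available:", torch.cuda.is_available())

# Raises OSError if the German spaCy model was not downloaded
spacy.load("de")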
Data is the foundation of any model. For the translation task we use the Multi30k dataset, which contains roughly 30,000 English-German sentence pairs with an average sentence length of about 13 words, making it a good basis for training and evaluating a translation model.
To preprocess the raw text, such as tokenizing it, adding start and end tokens, and lowercasing, TorchText's legacy API uses Field objects. A field defines how each sentence is processed, and because the dataset loader applies the fields to every example, they are defined before the data is loaded. The following code defines fields for the source language (German) and the target language (English):

from torchtext.data import Field

# German field (source language): spaCy tokenization, <sos>/<eos> markers, lowercasing
SRC = Field(tokenize="spacy", tokenizer_language="de", init_token="<sos>", eos_token="<eos>", lower=True)

# English field (target language)
TRG = Field(tokenize="spacy", tokenizer_language="en", init_token="<sos>", eos_token="<eos>", lower=True)

With the fields in place, TorchText can load the Multi30k dataset and apply the preprocessing in a single step:

from torchtext.datasets import Multi30k

# Load the German-English sentence pairs and apply the fields defined above
train_data, valid_data, test_data = Multi30k.splits(exts=(".de", ".en"), fields=(SRC, TRG))
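As a quick sanity check on the loaded data, each example exposes the tokenized German source and English target under the field names src and trg (a small illustrative snippet using the legacy API):

# Number of training pairs and the structure of one preprocessed example
print(f"Training pairs: {len(train_data.examples)}")
print(vars(train_data.examples[0]))  # {'src': [... German tokens ...], 'trg': [... English tokens ...]}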
Before the data can be encoded and decoded, we build vocabularies that map words to numerical indices. This step is essential for converting text into the numerical form the model can process; tokens that occur fewer than min_freq times are mapped to the unknown token:

# Build the German vocabulary
SRC.build_vocab(train_data, min_freq=2)

# Build the English vocabulary
TRG.build_vocab(train_data, min_freq=2)
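Each vocabulary offers a string-to-index mapping (stoi) and its inverse (itos). A small illustrative check:

# Vocabulary sizes and a few token-to-index lookups
print(f"German vocabulary size: {len(SRC.vocab)}")
print(f"English vocabulary size: {len(TRG.vocab)}")
print(TRG.vocab.stoi["<sos>"], TRG.vocab.stoi["<eos>"], TRG.vocab.stoi["<pad>"])
print(TRG.vocab.itos[:10])  # special tokens first, then tokens ordered by frequency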
To feed data to the model efficiently during training and evaluation, we use data iterators. TorchText provides BucketIterator, which groups examples of similar length into the same batch, minimizing padding and improving training efficiency:

import torch
from torchtext.data import BucketIterator

# Select the device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Batch size
BATCH_SIZE = 128

# Create the training, validation and test iterators
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device
)
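Inspecting one batch shows the layout the model expects: tensors of shape [sequence length, batch size], already padded and numericalized by the fields (an illustrative check):

# Peek at the first training batch
batch = next(iter(train_iterator))
print(batch.src.shape)  # torch.Size([src_len, 128])
print(batch.trg.shape)  # torch.Size([trg_len, 128])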
The encoder reads the source sentence with a bidirectional GRU. It produces a hidden state for every source token (these are later consulted by the attention mechanism) and compresses the final forward and backward states into a single context vector that initializes the decoder. The encoder is implemented as follows:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        # Bidirectional GRU: every source token gets a forward and a backward state
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        # Map the concatenated final forward/backward states to the decoder's hidden size
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: [src_len, batch]
        embedded = self.dropout(self.embedding(src))   # [src_len, batch, emb_dim]
        outputs, hidden = self.rnn(embedded)           # outputs: [src_len, batch, enc_hid_dim * 2]
        # Concatenate the last forward and backward hidden states and project them;
        # the result becomes the decoder's initial hidden state.
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)))
        return outputs, hidden                         # hidden: [batch, dec_hid_dim]
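A quick shape check with small, illustrative dimensions (not the hyperparameters used later) makes the encoder's contract explicit:

# For 4 sentences of length 7: per-token outputs of size enc_hid_dim * 2,
# plus a single [batch, dec_hid_dim] state for the decoder
enc_demo = Encoder(input_dim=100, emb_dim=8, enc_hid_dim=16, dec_hid_dim=16, dropout=0.5)
src_demo = torch.randint(0, 100, (7, 4))  # [src_len, batch]
outputs_demo, hidden_demo = enc_demo(src_demo)
print(outputs_demo.shape)  # torch.Size([7, 4, 32])
print(hidden_demo.shape)   # torch.Size([4, 16])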
The attention mechanism lets the decoder focus on different positions of the source sentence at each decoding step, which helps the model capture the correspondence between source and target words. It scores every encoder output against the current decoder state and turns the scores into a probability distribution over source positions:

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim, attn_dim):
        super().__init__()
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        # Each score is computed from an encoder output (bidirectional, hence * 2)
        # concatenated with the current decoder hidden state
        self.attn_in = (enc_hid_dim * 2) + dec_hid_dim
        self.attn = nn.Linear(self.attn_in, attn_dim)

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: [batch, dec_hid_dim]; encoder_outputs: [src_len, batch, enc_hid_dim * 2]
        src_len = encoder_outputs.shape[0]
        repeated_decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)  # [batch, src_len, enc_hid_dim * 2]
        energy = torch.tanh(self.attn(torch.cat((repeated_decoder_hidden, encoder_outputs), dim=2)))
        attention = torch.sum(energy, dim=2)                # [batch, src_len]
        # Normalize the scores into weights over the source positions
        return torch.softmax(attention, dim=1)
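The weights returned by the attention module form a probability distribution over source positions, which can be verified with small illustrative tensors:

# Shape [batch, src_len]; every row sums to 1
attn_demo = Attention(enc_hid_dim=16, dec_hid_dim=16, attn_dim=8)
dec_hidden_demo = torch.zeros(4, 16)       # [batch, dec_hid_dim]
enc_outputs_demo = torch.randn(7, 4, 32)   # [src_len, batch, enc_hid_dim * 2]
weights_demo = attn_demo(dec_hidden_demo, enc_outputs_demo)
print(weights_demo.shape)       # torch.Size([4, 7])
print(weights_demo.sum(dim=1))  # each entry is 1 (up to floating point)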
The decoder generates the target sentence one token at a time. At each step it combines the embedding of the previous target token with a weighted sum of the encoder outputs (the weights come from the attention module), updates its hidden state with a GRU, and a linear layer then produces scores over the target vocabulary:

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        # Input: previous token embedding concatenated with the attended encoder representation
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.out = nn.Linear(self.attention.attn_in + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def _weighted_encoder_rep(self, decoder_hidden, encoder_outputs):
        # Weighted sum of encoder outputs according to the attention distribution
        a = self.attention(decoder_hidden, encoder_outputs)   # [batch, src_len]
        a = a.unsqueeze(1)                                     # [batch, 1, src_len]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)     # [batch, src_len, enc_hid_dim * 2]
        weighted_encoder_rep = torch.bmm(a, encoder_outputs)   # [batch, 1, enc_hid_dim * 2]
        weighted_encoder_rep = weighted_encoder_rep.permute(1, 0, 2)
        return weighted_encoder_rep                            # [1, batch, enc_hid_dim * 2]

    def forward(self, input, decoder_hidden, encoder_outputs):
        # input: [batch] - the previous target token for each sentence in the batch
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))         # [1, batch, emb_dim]
        weighted_encoder_rep = self._weighted_encoder_rep(decoder_hidden, encoder_outputs)
        rnn_input = torch.cat((embedded, weighted_encoder_rep), dim=2)
        output, decoder_hidden = self.rnn(rnn_input, decoder_hidden.unsqueeze(0))
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted_encoder_rep = weighted_encoder_rep.squeeze(0)
        # Predict the next token from the RNN output, the attended context and the embedding
        output = self.out(torch.cat((output, weighted_encoder_rep, embedded), dim=1))
        return output, decoder_hidden.squeeze(0)               # [batch, output_dim], [batch, dec_hid_dim]
The sequence-to-sequence model ties the encoder and decoder together into the complete translation architecture. During training it uses teacher forcing: with probability teacher_forcing_ratio the decoder is fed the ground-truth previous token instead of its own prediction, which stabilizes learning:

import random

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src: [src_len, batch]; trg: [trg_len, batch]
        batch_size = src.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        # Tensor to hold the decoder's scores for every target position
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        encoder_outputs, hidden = self.encoder(src)
        # The first decoder input is the <sos> token
        output = trg[0, :]
        for t in range(1, max_len):
            output, hidden = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            # Teacher forcing: sometimes feed the ground-truth token, sometimes the prediction
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (trg[t] if teacher_force else top1)
        return outputs
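A small end-to-end smoke test with tiny, illustrative dimensions shows how the pieces fit together and what shape the model returns:

# The model maps [src_len, batch] source tokens and [trg_len, batch] target tokens
# to scores of shape [trg_len, batch, target vocabulary size]
enc_demo = Encoder(input_dim=100, emb_dim=8, enc_hid_dim=16, dec_hid_dim=16, dropout=0.5)
attn_demo = Attention(enc_hid_dim=16, dec_hid_dim=16, attn_dim=8)
dec_demo = Decoder(output_dim=200, emb_dim=8, enc_hid_dim=16, dec_hid_dim=16, dropout=0.5, attention=attn_demo)
model_demo = Seq2Seq(enc_demo, dec_demo, torch.device("cpu"))

src_demo = torch.randint(0, 100, (7, 4))     # [src_len, batch]
trg_demo = torch.randint(0, 200, (9, 4))     # [trg_len, batch]
print(model_demo(src_demo, trg_demo).shape)  # torch.Size([9, 4, 200])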
Before training, we set the hyperparameters that control the model's capacity and the training process:

# Hyperparameters
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 32
DEC_EMB_DIM = 32
ENC_HID_DIM = 64
DEC_HID_DIM = 64
ATTN_DIM = 8
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
Next, we instantiate the encoder, attention module, decoder and sequence-to-sequence model from these hyperparameters, and define the optimizer that will update the model's parameters:

# Build the encoder, attention module, decoder and the full sequence-to-sequence model
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
attn = Attention(ENC_HID_DIM, DEC_HID_DIM, ATTN_DIM)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)
model = Seq2Seq(enc, dec, device).to(device)

# Optimizer
optimizer = torch.optim.Adam(model.parameters())
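Before training it is often useful to know how many trainable parameters the model has. A small helper (the function name is our own, not part of the pipeline above):

# Count trainable parameters as a sanity check on model size
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has {count_parameters(model):,} trainable parameters")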
To measure model quality and drive training, we define the loss function. For translation, the standard choice is cross-entropy, with the padding index ignored so that padded positions contribute nothing to the loss:

# Loss function: ignore positions whose target is the <pad> token
PAD_IDX = TRG.vocab.stoi["<pad>"]
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
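A tiny illustrative example of what ignore_index does: the criterion expects [N, vocab] scores and [N] target indices, and averages only over the non-padding positions:

# Three target positions, the middle one is padding and is excluded from the loss
logits_demo = torch.randn(3, len(TRG.vocab))
targets_demo = torch.tensor([5, PAD_IDX, 7])
print(criterion(logits_demo, targets_demo))  # mean over the two non-pad positions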
Training is where the model learns from the data and gradually improves its translations. In each epoch the model processes the entire training set, and its parameters are updated from the loss computed on each batch; gradient clipping keeps the recurrent networks' gradients from exploding. The training function is:

def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for _, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg
        optimizer.zero_grad()
        output = model(src, trg)
        # Drop the first position (the <sos> placeholder) and flatten for the loss
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)
        loss = criterion(output, trg)
        loss.backward()
        # Clip gradients to stabilize RNN training
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)
The evaluation function measures the model's performance on the validation or test set, that is, on data it has not been trained on. Teacher forcing is turned off so the model must rely on its own predictions:

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for _, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg
            # teacher_forcing_ratio = 0: always feed the model its own predictions
            output = model(src, trg, 0)
            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)
With the training and evaluation functions defined, we can train the model and track its progress. The small epoch_time helper formats the elapsed time per epoch, and the weights with the lowest validation loss are saved as a checkpoint (the file name is arbitrary):

import math
import time

def epoch_time(start_time, end_time):
    # Convert elapsed seconds into minutes and seconds
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

N_EPOCHS = 10
CLIP = 1
best_valid_loss = float("inf")

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    # Keep the weights with the best validation loss (checkpoint name is arbitrary)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), "translation-model.pt")
    print(f"Epoch: {epoch + 1:02} | Time: {epoch_mins}m {epoch_secs}s")
    print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}")
    print(f"\t Val. Loss: {valid_loss:.3f} | Val. PPL: {math.exp(valid_loss):7.3f}")

test_loss = evaluate(model, test_iterator, criterion)
print(f"| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |")
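After training, the model can be tried on a single sentence. The following is a minimal greedy-decoding sketch written for the legacy API used above; the translate_sentence helper, the example sentence and max_len are our own additions, not part of the original pipeline:

import spacy

def translate_sentence(sentence, src_field, trg_field, model, device, max_len=50):
    model.eval()
    # Tokenize and preprocess exactly as the SRC field does
    nlp = spacy.load("de")
    tokens = [tok.text.lower() for tok in nlp(sentence)]
    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
    src_indexes = [src_field.vocab.stoi[tok] for tok in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)  # [src_len, 1]
    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(src_tensor)
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]
    for _ in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)  # previous token, [1]
        with torch.no_grad():
            output, hidden = model.decoder(trg_tensor, hidden, encoder_outputs)
        pred_token = output.argmax(1).item()
        trg_indexes.append(pred_token)
        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    return [trg_field.vocab.itos[i] for i in trg_indexes[1:]]

print(translate_sentence("ein mann geht die straße entlang .", SRC, TRG, model, device))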
In this article you have seen how to build a basic language translation model with PyTorch and TorchText, covering environment setup, data preparation, model construction, and training and evaluation; each step matters. In practice you can go further, for example by tuning the hyperparameters or switching to a more powerful architecture such as the Transformer, to improve translation quality. As deep learning continues to advance, translation models will keep getting smarter and more accurate, making cross-language communication ever easier. 編程獅 will continue to provide quality tutorials and resources to support your programming journey.