
An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention


Bibliographic Details
Main Authors: Li, Dongxing, Luo, Zuying
Format: Online Article Text
Language: English
Published: Hindawi 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9239798/
https://www.ncbi.nlm.nih.gov/pubmed/35774445
http://dx.doi.org/10.1155/2022/2998242
_version_ 1784737384467791872
author Li, Dongxing
Luo, Zuying
author_facet Li, Dongxing
Luo, Zuying
author_sort Li, Dongxing
collection PubMed
description Transformer-based models have achieved significant advances in neural machine translation (NMT). The main component of the transformer is the multihead attention layer. In theory, more heads enhance the expressive power of the NMT model, but this is not always the case in practice. On the one hand, the computations of each attention head are conducted in the same subspace, without considering the different subspaces of all the tokens. On the other hand, a low-rank bottleneck may occur when the number of heads surpasses a threshold. To address the low-rank bottleneck, the two mainstream methods either make the head size equal to the sequence length or complicate the distribution of self-attention heads. However, these methods are challenged by the variable sequence length in the corpus and the sheer number of parameters to be learned. Therefore, this paper proposes the interacting-head attention mechanism, which induces deeper and wider interactions across the attention heads through low-dimension computations in different subspaces of all the tokens, and chooses an appropriate number of heads to avoid the low-rank bottleneck. The proposed model was tested on the machine translation tasks of IWSLT2016 DE-EN, WMT17 EN-DE, and WMT17 EN-CS. Compared to the original multihead attention, our model improved performance by 2.78 BLEU/0.85 WER/2.90 METEOR/2.65 ROUGE_L/0.29 CIDEr/2.97 YiSi and 2.43 BLEU/1.38 WER/3.05 METEOR/2.70 ROUGE_L/0.30 CIDEr/3.59 YiSi on the evaluation set and the test set, respectively, for IWSLT2016 DE-EN; by 2.31 BLEU/5.94 WER/1.46 METEOR/1.35 ROUGE_L/0.07 CIDEr/0.33 YiSi and 1.62 BLEU/6.04 WER/1.39 METEOR/0.11 CIDEr/0.87 YiSi on the evaluation set and newstest2014, respectively, for WMT17 EN-DE; and by 3.87 BLEU/3.05 WER/9.22 METEOR/3.81 ROUGE_L/0.36 CIDEr/4.14 YiSi and 4.62 BLEU/2.41 WER/9.82 METEOR/4.82 ROUGE_L/0.44 CIDEr/5.25 YiSi on the evaluation set and newstest2014, respectively, for WMT17 EN-CS.
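The abstract refers to standard multihead attention, in which the model dimension is split evenly across heads, so the per-head size shrinks as the head count grows; this is the source of the low-rank bottleneck it mentions. The sketch below is not the authors' code and does not implement their interacting-head mechanism; it is a minimal NumPy illustration of the standard mechanism, with random matrices standing in for learned projections and the name multihead_attention chosen purely for illustration.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(x, num_heads, rng):
    """Standard multihead attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    head_dim = d_model // num_heads            # shrinks as num_heads grows
    outputs = []
    for _ in range(num_heads):
        # Random projections stand in for learned W_Q, W_K, W_V (illustrative only).
        w_q = rng.standard_normal((d_model, head_dim)) / np.sqrt(d_model)
        w_k = rng.standard_normal((d_model, head_dim)) / np.sqrt(d_model)
        w_v = rng.standard_normal((d_model, head_dim)) / np.sqrt(d_model)
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        # The pre-softmax score matrix q @ k.T has rank at most head_dim, so once
        # head_dim < seq_len it cannot express arbitrary token-to-token patterns:
        # the low-rank bottleneck that appears when the head count grows too large.
        scores = softmax(q @ k.T / np.sqrt(head_dim), axis=-1)
        outputs.append(scores @ v)
    return np.concatenate(outputs, axis=-1)    # back to (seq_len, d_model)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 512))             # 32 tokens, d_model = 512
for h in (8, 64):
    print(h, "heads -> head_dim =", 512 // h)  # 64 heads leave only 8 dims per head
    _ = multihead_attention(x, h, rng)

Raising the head count from 8 to 64 with d_model fixed at 512 drops head_dim from 64 to 8, which is why adding heads does not always add expressive power in practice.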
format Online
Article
Text
id pubmed-9239798
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-92397982022-06-29 An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention Li, Dongxing Luo, Zuying Comput Intell Neurosci Research Article Transformer-based models have achieved significant advances in neural machine translation (NMT). The main component of the transformer is the multihead attention layer. In theory, more heads enhance the expressive power of the NMT model, but this is not always the case in practice. On the one hand, the computations of each attention head are conducted in the same subspace, without considering the different subspaces of all the tokens. On the other hand, a low-rank bottleneck may occur when the number of heads surpasses a threshold. To address the low-rank bottleneck, the two mainstream methods either make the head size equal to the sequence length or complicate the distribution of self-attention heads. However, these methods are challenged by the variable sequence length in the corpus and the sheer number of parameters to be learned. Therefore, this paper proposes the interacting-head attention mechanism, which induces deeper and wider interactions across the attention heads through low-dimension computations in different subspaces of all the tokens, and chooses an appropriate number of heads to avoid the low-rank bottleneck. The proposed model was tested on the machine translation tasks of IWSLT2016 DE-EN, WMT17 EN-DE, and WMT17 EN-CS. Compared to the original multihead attention, our model improved performance by 2.78 BLEU/0.85 WER/2.90 METEOR/2.65 ROUGE_L/0.29 CIDEr/2.97 YiSi and 2.43 BLEU/1.38 WER/3.05 METEOR/2.70 ROUGE_L/0.30 CIDEr/3.59 YiSi on the evaluation set and the test set, respectively, for IWSLT2016 DE-EN; by 2.31 BLEU/5.94 WER/1.46 METEOR/1.35 ROUGE_L/0.07 CIDEr/0.33 YiSi and 1.62 BLEU/6.04 WER/1.39 METEOR/0.11 CIDEr/0.87 YiSi on the evaluation set and newstest2014, respectively, for WMT17 EN-DE; and by 3.87 BLEU/3.05 WER/9.22 METEOR/3.81 ROUGE_L/0.36 CIDEr/4.14 YiSi and 4.62 BLEU/2.41 WER/9.82 METEOR/4.82 ROUGE_L/0.44 CIDEr/5.25 YiSi on the evaluation set and newstest2014, respectively, for WMT17 EN-CS. Hindawi 2022-06-21 /pmc/articles/PMC9239798/ /pubmed/35774445 http://dx.doi.org/10.1155/2022/2998242 Text en Copyright © 2022 Dongxing Li and Zuying Luo. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Li, Dongxing
Luo, Zuying
An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
title An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
title_full An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
title_fullStr An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
title_full_unstemmed An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
title_short An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
title_sort improved transformer-based neural machine translation strategy: interacting-head attention
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9239798/
https://www.ncbi.nlm.nih.gov/pubmed/35774445
http://dx.doi.org/10.1155/2022/2998242
work_keys_str_mv AT lidongxing animprovedtransformerbasedneuralmachinetranslationstrategyinteractingheadattention
AT luozuying animprovedtransformerbasedneuralmachinetranslationstrategyinteractingheadattention
AT lidongxing improvedtransformerbasedneuralmachinetranslationstrategyinteractingheadattention
AT luozuying improvedtransformerbasedneuralmachinetranslationstrategyinteractingheadattention