An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
Main Authors: | Li, Dongxing; Luo, Zuying |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Hindawi, 2022 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9239798/ https://www.ncbi.nlm.nih.gov/pubmed/35774445 http://dx.doi.org/10.1155/2022/2998242 |
_version_ | 1784737384467791872 |
---|---|
author | Li, Dongxing; Luo, Zuying |
author_facet | Li, Dongxing; Luo, Zuying |
author_sort | Li, Dongxing |
collection | PubMed |
description | Transformer-based models have achieved significant advances in neural machine translation (NMT). The main component of the transformer is the multihead attention layer. In theory, more heads enhance the expressive power of the NMT model, but this is not always the case in practice. On the one hand, the computations of each attention head are conducted in the same subspace, without considering the different subspaces of all the tokens. On the other hand, a low-rank bottleneck may occur when the number of heads surpasses a threshold. To address the low-rank bottleneck, the two mainstream methods either make the head size equal to the sequence length or complicate the distribution of self-attention heads. However, these methods are challenged by the variable sequence length in the corpus and the sheer number of parameters to be learned. Therefore, this paper proposes the interacting-head attention mechanism, which induces deeper and wider interactions across the attention heads through low-dimension computations in different subspaces of all the tokens, and chooses an appropriate number of heads to avoid the low-rank bottleneck. The proposed model was tested on the machine translation tasks of IWSLT2016 DE-EN, WMT17 EN-DE, and WMT17 EN-CS. Compared to the original multihead attention, our model improved performance by 2.78 BLEU/0.85 WER/2.90 METEOR/2.65 ROUGE_L/0.29 CIDEr/2.97 YiSi and 2.43 BLEU/1.38 WER/3.05 METEOR/2.70 ROUGE_L/0.30 CIDEr/3.59 YiSi on the evaluation set and the test set, respectively, for IWSLT2016 DE-EN; 2.31 BLEU/5.94 WER/1.46 METEOR/1.35 ROUGE_L/0.07 CIDEr/0.33 YiSi and 1.62 BLEU/6.04 WER/1.39 METEOR/0.11 CIDEr/0.87 YiSi on the evaluation set and newstest2014, respectively, for WMT17 EN-DE; and 3.87 BLEU/3.05 WER/9.22 METEOR/3.81 ROUGE_L/0.36 CIDEr/4.14 YiSi and 4.62 BLEU/2.41 WER/9.82 METEOR/4.82 ROUGE_L/0.44 CIDEr/5.25 YiSi on the evaluation set and newstest2014, respectively, for WMT17 EN-CS. |
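The low-rank bottleneck mentioned in the abstract follows from the arithmetic of standard multihead attention: with model width d_model split across h heads, each head projects queries and keys into d_model/h dimensions, so each head's attention-score matrix QKᵀ has rank at most d_model/h; once h is large enough that d_model/h falls below the sequence length, no head can realize a full-rank attention map. A minimal NumPy sketch of vanilla multihead self-attention (not the paper's interacting-head variant; random matrices stand in for learned weights) illustrates the mechanics:

```python
import numpy as np

def multihead_attention(X, num_heads, seed=0):
    """Standard multihead self-attention with random stand-in weights."""
    rng = np.random.default_rng(seed)
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads  # head size shrinks as head count grows
    outputs = []
    for _ in range(num_heads):
        Wq = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Wk = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Wv = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # scores is (seq_len, seq_len) but has rank <= d_head, since it is
        # a product of two rank-<=d_head factors: the low-rank bottleneck
        # appears once d_head = d_model/num_heads < seq_len.
        scores = Q @ K.T / np.sqrt(d_head)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)  # row-wise softmax
        outputs.append(attn @ V)
    # concatenate heads back to the model width
    return np.concatenate(outputs, axis=-1)

X = np.random.default_rng(1).standard_normal((10, 64))
out = multihead_attention(X, num_heads=8)  # d_head = 8 < seq_len = 10
```

With 8 heads over a 64-dimensional model and 10 tokens, each head's score matrix is 10×10 but rank-limited to 8, which is the situation the paper's choice of head count is designed to avoid.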
format | Online Article Text |
id | pubmed-9239798 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Hindawi |
record_format | MEDLINE/PubMed |
spelling | pubmed-9239798 2022-06-29 An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention Li, Dongxing; Luo, Zuying Comput Intell Neurosci Research Article Hindawi 2022-06-21 /pmc/articles/PMC9239798/ /pubmed/35774445 http://dx.doi.org/10.1155/2022/2998242 Text en Copyright © 2022 Dongxing Li and Zuying Luo. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Li, Dongxing Luo, Zuying An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention |
title | An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention |
title_full | An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention |
title_fullStr | An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention |
title_full_unstemmed | An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention |
title_short | An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention |
title_sort | improved transformer-based neural machine translation strategy: interacting-head attention |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9239798/ https://www.ncbi.nlm.nih.gov/pubmed/35774445 http://dx.doi.org/10.1155/2022/2998242 |
work_keys_str_mv | AT lidongxing animprovedtransformerbasedneuralmachinetranslationstrategyinteractingheadattention AT luozuying animprovedtransformerbasedneuralmachinetranslationstrategyinteractingheadattention AT lidongxing improvedtransformerbasedneuralmachinetranslationstrategyinteractingheadattention AT luozuying improvedtransformerbasedneuralmachinetranslationstrategyinteractingheadattention |