ACTA Scientiarum Naturalium Universitatis Pekinensis
Integrating Voice Features into Japanese-English Hierarchical Phrase-Based Model
WANG Nan, XU Jin’an†, MING Fang, CHEN Yufeng, ZHANG Yujie
School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044; † Corresponding author, E-mail: jaxu@bjtu.edu.cn
Abstract Grammatical voice is realized through different syntactic structures in different languages, and this mismatch leads to relatively low machine translation quality. To address the problem, an approach is proposed that integrates voice features into hierarchical phrase-based (HPB) models. First, the corpus is classified on the Japanese side into three categories: passive voice, potential voice, and others. Second, the passive and potential sentences are divided into several groups according to the characteristics of the English side, and maximum entropy models are built for the rules. Finally, bilingual voice features are integrated into the log-linear model to improve translation results and the accuracy of rule selection when translating passive and potential sentences. In a Japanese-to-English translation task, a large-scale experiment shows that the proposed method both alleviates the problem of long-distance reordering and improves translation quality on the passive and potential voice test sets.
Key words passive voice; potential voice; statistical machine translation; maximum entropy models
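Japanese expresses voice through changes in the inflectional ending of the predicate. Because the passive and potential voices share some of these endings, they are difficult to identify and translate correctly in machine translation. Japanese and English also differ markedly in structure: Japanese is an SOV (subject-object-verb) language, whereas English is SVO (subject-verb-object), and this difference in syntactic structure affects the quality of Japanese-English machine translation. Among the resulting problems, inaccurate lexical translation and inappropriate sentence structure caused by differences in voice are especially prominent. How to translate passive and potential sentences correctly is an important task in Japanese-English translation. Most existing studies distinguish the Japanese potential voice from the passive voice on semantic and structural grounds[1] and handle the different voices by formulating translation rules[2-3], but rule-based translation methods cannot be applied directly to statistical machine translation systems. Statistical translation models select rules according to probability; potential and passive voice data are sparse in the training corpus, and statistical methods have difficulty handling long-distance reordering and cannot make effective use of the global structure of a sentence. These characteristics lead to low translation accuracy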