ACTA Scientiarum Naturalium Universitatis Pekinensis
Research on the Construction and Application of Paraphrase Parallel Corpus
WANG Yasong, LIU Mingtong, ZHANG Yujie†, XU Jin’an, CHEN Yufeng
School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044; † Corresponding author, E-mail: yjzhang@bjtu.edu.cn
Abstract Taking Chinese as the research object, the authors put forward the method to construct large-scale and high-quality paraphrase parallel corpora. The paraphrase data augmentation method include transfering English paraphrase corpus to Chinese, by using the method of translation engines, and manually annotating evaluation data set. Based on the constructed Chinese paraphrase data, the validity of the paraphrase data construction application method is verified in the paraphrase recognition task and natural language inference task. Firstly, the paraphrase recognition data is generated based on the constructed paraphrase corpus, and the attention-based neural network model of sentence matching is pre-trained to capture the paraphrase information. Then, the pre-trained model is applied to the natural language inference task to improve the performance. The experimental results on the open set show that the constructed paraphrase corpus can be effectively applied to the paraphrase recognition task, and the model can learn paraphrase knowledge. When applied to natural language inference task, paraphrase knowledge can effectively improve the accuracy of natural language inference models and verify the effectiveness of paraphrase knowledge for downstream semantic understanding tasks. Meanwhile, the proposed construction method for the paraphrase corpus is language-independent, which can provide more training data for other languages and fields, generate high-quality paraphrase data, and further improve the performance of other tasks. Key words paraphrase corpus construction; data augmentation; transfer learning; paraphrase recognition; natural language inference