ACTA Scientiarum Naturalium Universitatis Pekinensis

Domain Term Extraction Using URL-KEY

LÜ Shuning1, DONG Zhian2,†


1. School of Software Engineering, Beijing University of Technology, Beijing 100124; 2. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101; † Corresponding author, E-mail: dong.zhian@163.com

Abstract A new approach to domain term extraction using URL-KEY is presented. With the help of the domains of known URL-KEYs, the domains of unknown URL-KEYs can be identified. First, based on the frequency with which each URL-KEY appears in various domains, a variance-based method is proposed to identify domain URL-KEYs and build a dictionary of domain URL-KEYs. Then, pseudo-relevance feedback is used to construct the URL-KEY vector of each candidate domain term. Finally, an SVM is applied to extract terms. Experiments were conducted on Chinese term extraction in four different domains. The results indicate that the proposed method is quite effective. In addition, it effectively addresses the recognition of low-frequency terms and provides a new way to identify them.

Key words URL; URL-KEY; domain term; low-frequency term; SVM
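The variance-based first step can be illustrated with a minimal sketch. The abstract does not give the formula, so everything here is an assumption made for illustration: a URL-KEY is treated as a string key, its counts per domain are normalized to shares, the population variance of those shares is compared with a hypothetical threshold, and the function names, domain labels, and threshold value are all invented rather than taken from the paper.

```python
from statistics import pvariance


def domain_of_url_key(counts, threshold=0.05):
    """Return the dominant domain of one URL-KEY, or None if it is not domain-specific.

    counts: dict mapping a domain name to how often the URL-KEY occurs in it.
    threshold: hypothetical cut-off on the variance of the per-domain shares.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    # Normalize counts so that frequent and rare URL-KEYs are comparable.
    shares = {domain: count / total for domain, count in counts.items()}
    # An even spread across domains gives a variance near zero (a general key);
    # a concentrated spread gives a high variance (a domain key).
    if pvariance(list(shares.values())) < threshold:
        return None
    return max(shares, key=shares.get)


def build_domain_dictionary(key_counts, threshold=0.05):
    """Map each domain-specific URL-KEY to its domain (assumed dictionary shape)."""
    dictionary = {}
    for url_key, counts in key_counts.items():
        domain = domain_of_url_key(counts, threshold)
        if domain is not None:
            dictionary[url_key] = domain
    return dictionary


# Illustrative counts only; the keys and domain names are hypothetical.
example_counts = {
    "stock": {"finance": 90, "sports": 5, "health": 3, "military": 2},
    "news": {"finance": 25, "sports": 25, "health": 25, "military": 25},
}
domain_dictionary = build_domain_dictionary(example_counts)
```

In this sketch a key concentrated in one domain is kept and a key spread evenly over all domains is dropped; how the paper actually normalizes frequencies or sets the threshold is not stated in the abstract.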

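With the development of technology, the Internet has changed enormously: people are no longer limited to obtaining data through the network, but are also creators of Internet data. We have now entered the era of "big data", in which data is not only large in scale but also intricate and complex. New theories, new methods, and new concepts keep emerging, and large numbers of new domain terms are produced along with them. Constructing domain terminology manually is not only time-consuming and laborious but also hard to keep up to date, so automatic domain term recognition has become an important research topic in Chinese natural language processing.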

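Term recognition is foundational research that supports the updating of domain dictionaries, the construction of domain ontologies, and research on syntactic parsing. Term recognition research is usually divided into two steps: candidate term extraction and recognition. Candidate term extraction can be viewed as a term boundary recognition problem. In Chinese, there are no explicit segmentation boundaries between characters, which makes recognition very difficult. Ji et al. [1] proposed using specific words as boundaries to extract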
