ACTA Scientiarum Naturalium Universitatis Pekinensis
Domain Term Extraction Using URL-KEY
LÜ Shuning1, DONG Zhian2,†
1. School of Software Engineering, Beijing University of Technology, Beijing 100124; 2. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101; † Corresponding author, E-mail: dong.zhian@163.com
Abstract A new approach to domain term extraction based on URL-KEY is presented. Using URL-KEYs whose domain is known, the domain of unknown URL-KEYs can be identified. First, according to the frequency with which each URL-KEY appears in various domains, a variance-based method is proposed to identify domain URL-KEYs and build a dictionary of domain URL-KEYs. Then, pseudo-relevance feedback is used to construct the URL-KEY vector of each candidate domain term. Finally, an SVM is applied to extract terms. Experiments on Chinese term extraction were conducted in four different domains. The results indicate that the proposed method is quite effective. In addition, it can effectively address the recognition of low-frequency terms and provides a new way to identify them.
Key words URL; URL-KEY; domain term; low-frequency term; SVM
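With the development of technology, the Internet has changed enormously: people are no longer limited to obtaining data through the network, but are also creators of Internet data. We have now entered the era of "big data", in which data are not only large in scale but also intricate and complex. New theories, methods, and concepts keep emerging, and with them a large number of new domain terms. Compiling domain terms manually is time-consuming and laborious, and the result is hard to keep up to date, so automatic recognition of domain terms has become an important research topic in Chinese natural language processing.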
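Term recognition is foundational research that supports the updating of domain dictionaries, the construction of domain ontologies, and research on syntactic parsing. Research on term recognition is usually divided into two steps: candidate term extraction and recognition. Candidate term extraction can be viewed as a problem of identifying term boundaries. In Chinese there are no explicit segmentation boundaries between characters, which makes identification very difficult. Ji et al.[1] proposed using specific words as boundaries to extract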