ACTA Scientiarum Naturalium Universitatis Pekinensis

Regular Path Queries on Large Graph Data

ZHANG Yu, ZENG Li, ZOU Lei†

2018-03-20 -

School of Electronics and Computer Science, Peking University, Beijing 100871; † Corresponding author, E-mail: zoulei@pku.edu.cn

Abstract The authors propose a divide-and-conquer based solution over gstore, an existing RDF search engine, to process property path query on large scale graph data. In proposed solution, regular expression is partitioned within the path query and then preprocess strings of fixed length. The authors handle the search over those subqueries of wildcards. The proposed method is able to filter lots of unpromising search and efficient on solving the regular path match problem over large scale graph data. The corresponding experiments on Dbpedia and LUBM confirm that proposed method can response for queries in seconds on average. Key words property path; regular expression; SPARQL; RDF; gstore

随着语义网的发展, 网络上涌现大量的资源描述框架[1](resource description framework, RDF)数据集, 如 YAGO[2], Freebase[3]和 Dbpedia[4]等, 这些数据的 RDF 三元组数量通常都在十亿甚至百亿以上, 海量的数据使 RDF 数据查询面临新的挑战。SPARQL[5] (SPARQL Protocol and RDF Query Language)查询是W3C 推出的一种 RDF 数据查询语言,语法与 SQL 语句的语法相似。SPARQL 1.1 标准定义了属性路径(property paths)的概念和基本查询单元, 使用简洁的方式为属性路径的查询提供支持。但是, 对于一些复杂的语义, 在使用嵌套式的正则表达式时, 就需要一些特殊的查询方法来解决,因此正则路径查询(regular path queries, RPQS)成为一个被广泛研究的课题。

本文提出的基于 gstore[6]引擎的正则路径检索方法, 有两个方面的创新: 一是解决了正则表达式路径匹配问题, 能够根据 SPARQL 路径查询中的正则表达式, 返回搜索到的匹配路径; 二是基于gstore 系统, 采用子图匹配的方式进行 SPARQL 查询, 能够大大提升查询性能, 并且随着 gstore 系统性能的提升, 本方法所支持的图数据规模以及查询性能也会随之提升。

1 研究背景1.1 RDF 和 SPARQL

RDF 通常称为一种“语言”, 但更准确地说, 它是一种数据模型, 由 W3C 作为语义网运动的组成部分而推出。RDF 由一系列陈述, 即主体(subject,

Regular Path Queries on Large Graph Data

ZHANG Yu, ZENG Li, ZOU Lei†

Newspapers in Chinese (Simplified)

Newspapers from China