Forming Exclusive Groups

02 Nov 2017

Keywords – SPARQL, FedX, Semantic Web.

Introduction

Our work is based on studying the FedX framework, a practical framework that enables efficient SPARQL query processing on heterogeneous, virtually integrated Linked Data sources. We also tried to tweak its source selection parameters so as to get better performance. We used Virtuoso Open Source [3] for creating live SPARQL endpoints. We initially tried Fuseki but found that Virtuoso was more appropriate.

We used various datasets, including LinkedMDB [4], DBpedia [5], GeoNames [6], and NY Times [7].

In Sect. 2 we discuss the basic structure of our experiment, introducing each technology used as well as the specifications of the machine used in the experiment. Sect. 3 describes the algorithm used to optimize source selection, Sect. 4 covers the queries used to test the system, Sect. 5 compares FedX with and without optimized source selection, and Sect. 6 presents the conclusion and future work.

Basic Structure

Following are the tools and technologies used in the experiment:

FedX

Virtuoso-Open Source

Data Sets Used

System Specification

2.1 FedX

FedX has been implemented in Java and extends the Sesame framework.

Fig. 1 The query processing model of FedX.

Optimization techniques already incorporated in FedX are as follows:

2.1.1 Source Selection

FedX uses SPARQL ASK requests for this purpose [1]. A cache is used to reduce the number of ASK queries sent. Further details are explained in Sect. 3.
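As an illustration, the shape of such an ASK probe for a single triple pattern can be sketched as follows (a hypothetical helper, not FedX's actual code; literals and blank nodes are ignored for brevity):

```python
def ask_query_for(triple):
    """Build a SPARQL ASK query string for one triple pattern.

    `triple` is a (subject, predicate, object) tuple where variables
    start with '?' and IRIs are given without angle brackets.
    Hypothetical helper for illustration; literal objects (e.g.
    'Tarzan') are not handled here.
    """
    def term(t):
        # Variables pass through; everything else is treated as an IRI.
        return t if t.startswith("?") else "<%s>" % t

    s, p, o = triple
    return "ASK { %s %s %s }" % (term(s), term(p), term(o))
```

An endpoint answering `true` to this probe is annotated as a relevant source for the pattern.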

2.1.2 Group Ordering

The join order determines the number of intermediate results and is therefore a very important factor in optimization. FedX uses a rule-based join optimizer, which orders a list of join arguments (i.e., triple patterns or groups of triple patterns) according to a heuristics-based cost estimation [8].

2.1.3 Forming Exclusive Groups

These are the triple patterns that have a single relevant source. Exclusive groups of size ≥ 2 can be used for optimization: rather than sending such triple patterns sequentially to the (single) relevant source, they are sent together as a conjunctive query, thus executing them in a single subquery at the respective endpoint. This drastically improves the performance of the system, as both the number of remote requests and the amount of data sent over the network are minimized.
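A minimal sketch of this grouping step, assuming each triple pattern has already been annotated with its set of relevant sources (the pattern strings and source names below are made up for illustration):

```python
from collections import defaultdict

def form_exclusive_groups(annotated_patterns):
    """Group triple patterns that have exactly one relevant source.

    `annotated_patterns` is a list of (pattern, sources) pairs, where
    `pattern` is a triple pattern (a plain string here, for simplicity)
    and `sources` is the set of sources selected for it.  Patterns
    whose single relevant source coincides are merged into one
    conjunctive subquery for that source.  Illustrative sketch, not
    FedX's actual implementation.
    """
    groups = defaultdict(list)
    for pattern, sources in annotated_patterns:
        if len(sources) == 1:
            groups[next(iter(sources))].append(pattern)
    # Only groups of size >= 2 save remote requests when sent together.
    return {src: " . ".join(pats)
            for src, pats in groups.items() if len(pats) >= 2}
```

Here the two LinkedMDB-only patterns of a query like CD3 would be merged into a single conjunctive subquery for that endpoint.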

2.1.4 Bound Join

If the joins are computed in a block nested loop fashion, i.e., as a distributed semijoin, the number of requests can be reduced by a factor equal to the block size, in the following referred to as the input sequence. In this optimization a set of mappings is grouped into a single subquery using SPARQL UNION constructs. This grouped subquery is then sent to the relevant data sources in a single remote request; a little post-processing is required to obtain the final result [8].
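The grouping of a block of input mappings into one UNION subquery can be sketched as follows (simplified: real FedX also renames variables per branch so that results can be attributed to the correct input mapping during post-processing):

```python
def bound_join_subquery(pattern, bindings, var):
    """Build one UNION subquery evaluating `pattern` for a whole block
    of input mappings, in the spirit of FedX's bound joins.

    `pattern` contains the placeholder variable `var`; each binding
    (an IRI string) is substituted in turn and the instantiated
    patterns are combined with UNION, so the whole block costs a
    single remote request instead of one request per binding.
    Illustrative sketch only.
    """
    branches = ["{ %s }" % pattern.replace(var, "<%s>" % b)
                for b in bindings]
    return "SELECT * WHERE { %s }" % " UNION ".join(branches)
```

With a block size of 15, for example, 15 nested-loop requests collapse into one such UNION request.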

2.2 Virtuoso-Open Source

We used Virtuoso for creating the SPARQL endpoints on which the queries are fired. For this we loaded the Virtuoso database with ".nt" (N-Triples) files, a line-based, plain-text serialization format for RDF (Resource Description Framework) graphs and a subset of the Turtle (Terse RDF Triple Language) format. We created 5 live SPARQL endpoints for the experiment. The ".nt" files are available for download at the respective websites.
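For illustration, a single N-Triples line with IRI terms can be split as follows (a minimal sketch; a real N-Triples parser also handles literals, blank nodes, and escape sequences):

```python
def parse_ntriple(line):
    """Split one N-Triples line into (subject, predicate, object).

    Assumes the simplest case: three whitespace-separated IRI terms in
    angle brackets followed by ' .'.  Literals and blank nodes, which
    also occur in real .nt files, are deliberately not handled here.
    """
    body = line.rstrip().rstrip(".").strip()
    s, p, o = body.split(" ", 2)
    return s, p, o
```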

2.3 Data Sets Used

2.3.1 LinkedMDB[3]

It publishes Linked Open Data about movies using a D2R server. It has a very large collection of RDF triples and was the first Semantic Web dataset in its domain.

2.3.2 Dbpedia[4]

It is the centre of the Web of Linked Open Data. It extracts structured information from Wikipedia, one of the biggest knowledge bases. DBpedia also tracks the changes being made in Wikipedia. Since it takes its information from Wikipedia, it covers various domains and is multilingual. It allows complex queries against Wikipedia data.

2.3.3 GeoNames[5]

It is a geographical dataset which provides information about different locations. GeoNames provides various types of information about every location, such as population, nearby places, elevation, a map of the place, etc. In this dataset each feature is described as a resource and given a unique URI. In our experiment this dataset provided information such as the map, population, and nearby places of a shooting location.

2.3.4 NYTimes[6]

It is one of the most authoritative news vocabularies ever created. The New York Times has published approximately 10,000 subject headings as Linked Open Data under a CC BY license. The various tag types and mapping strategies include people, organizations, locations, and descriptors.

2.4 Experimental Setup

The experiments were carried out on an Intel® Core™ i3-350M processor (2.26 GHz, 3 MB L3 cache) with 3 GB DDR3 RAM and a 320 GB hard drive, running a 64-bit Windows 8 operating system. Virtuoso Open Source was used for creating the SPARQL endpoints. We ran all tests against local SPARQL endpoints to avoid the effects of unpredictable network delay. All queries were run 10 times and their average execution time is reported.
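The measurement procedure described above (each query run 10 times, average reported) corresponds to a harness along these lines (a generic sketch; `run_query` stands for whatever callable executes one federated query):

```python
import time

def average_runtime_ms(run_query, repetitions=10):
    """Run `run_query` the given number of times and return the mean
    wall-clock time in milliseconds, mirroring the measurement
    procedure used in the experiments."""
    total = 0.0
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        total += time.perf_counter() - start
    return total / repetitions * 1000.0
```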

Source Selection Optimization

FedX evaluates the triple patterns of a SPARQL query against the data sources and retrieves results from them, but in reality each triple pattern needs to be evaluated only on the relevant data sources. Relevant sources could be determined by maintaining metadata about the sources, but this is highly inefficient and impractical. Therefore FedX sends SPARQL ASK queries to the federation members and, on the basis of the results obtained, annotates each triple pattern with its relevant sources. FedX also has a cache that remembers whether a source is relevant for a given triple pattern; this cache helps reduce the number of SPARQL ASK queries sent to the data sources. Our work helps in retrieving relevant endpoints from the cache using the following algorithm:

Obtaining triples from SPARQL query:

A SPARQL query is passed in the form of a string and needs to be decomposed into its triple patterns so that a relevant source can be found for each of them.

E.g., for the following SPARQL query,

SELECT ?predicate ?object WHERE {

{ <http://dbpedia.org/resource/Barack_Obama> ?predicate ?object }

UNION

{ ?subject <http://www.w3.org/2002/07/owl#sameAs>

<http://dbpedia.org/resource/Barack_Obama> .

?subject ?predicate ?object }

}

Set of triples are:

<http://dbpedia.org/resource/Barack_Obama> ?predicate ?object

?subject <http://www.w3.org/2002/07/owl#sameAs>

<http://dbpedia.org/resource/Barack_Obama>

?subject ?predicate ?object

Converting triples to subqueries:

Once the set of triples is obtained, the variables are to be removed and the subject, predicate, and object identified for each triple. Note that triples like (?x c c) and (?y c c) should be treated as the same. Hence all variables are substituted with null values in the subquery.

E.g., for the following triples:

?subject1 <http://www.w3.org/2002/07/owl#sameAs>

<http://dbpedia.org/resource/Barack_Obama>

?subject2 <http://www.w3.org/2002/07/owl#sameAs>

<http://dbpedia.org/resource/Barack_Obama>

the corresponding subquery for both triples is:

<null> <http://www.w3.org/2002/07/owl#sameAs>

<http://dbpedia.org/resource/Barack_Obama>
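This normalization step can be sketched as follows, using `None` as the null placeholder for variables (an illustrative sketch; FedX's actual cache keys are built from Sesame Value objects):

```python
def to_subquery(triple):
    """Normalize a triple pattern into a cache key by replacing every
    variable (terms starting with '?') with None, so that e.g.
    (?x c c) and (?y c c) map to the same subquery."""
    return tuple(None if t.startswith("?") else t for t in triple)
```

Because variable names are erased, the two `?subject1`/`?subject2` triples above produce the same cache key.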

Determining endpoints for ASK queries:

We need to form two different sets of endpoints:

endpoints which have a cache entry and can possibly act as relevant sources, and

endpoints which cannot be determined as relevant from the cache and need to be checked with ASK queries.

While checking all the endpoints, if an entry for an endpoint exists in the cache, the enum value HAS_LOCAL_STATEMENTS is returned; if the endpoint does not have any entry, the enum value POSSIBLY_HAS_STATEMENTS is returned. Endpoints returning POSSIBLY_HAS_STATEMENTS are then queried with ASK queries.
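This cache lookup can be sketched as follows, with the cache as a map from subqueries to the endpoints known to be relevant (enum names follow the text; the ASK probing itself is omitted):

```python
from enum import Enum

class StatementSourceState(Enum):
    HAS_LOCAL_STATEMENTS = 1      # cache says the endpoint is relevant
    POSSIBLY_HAS_STATEMENTS = 2   # unknown: must be probed with ASK

def lookup(cache, subquery, endpoint):
    """Return the cache verdict for (subquery, endpoint).

    `cache` maps normalized subqueries to the set of endpoints known to
    hold matching statements.  Endpoints that come back as
    POSSIBLY_HAS_STATEMENTS are subsequently queried with ASK.
    Illustrative sketch of the behaviour described above.
    """
    if endpoint in cache.get(subquery, set()):
        return StatementSourceState.HAS_LOCAL_STATEMENTS
    return StatementSourceState.POSSIBLY_HAS_STATEMENTS
```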

Determining relevant sources and remaining endpoints for ASK queries:

The cache is implemented as a hashmap of subqueries mapped to endpoints. For the given subquery, if the endpoint has the value HAS_LOCAL_STATEMENTS, then one of the following is possible:

If the exact subquery exists in the cache, then the endpoint mapped to it is a relevant source and HAS_STATEMENTS is returned for the given endpoint.

Subset inference: if the given subquery is (?x ?y c) and the cache has an entry for subquery (?x c c), then the endpoint mapped to it must contain results for (?x ?y c). Hence HAS_STATEMENTS is returned for the given endpoint.

Superset inference: if the given subquery is (?x c c) and the cache has an entry for subquery (?x ?y c), then the endpoint mapped to it may or may not contain results for (?x c c). Hence POSSIBLY_HAS_STATEMENTS is returned for the given endpoint, which is then queried with an ASK query.
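The subset/superset inference over cached subqueries can be sketched as follows, with `None` marking a variable position in a normalized subquery (an illustrative sketch, not FedX's implementation):

```python
def specializes(specific, general):
    """True if `specific` can be obtained from `general` by binding
    variables (None marks a variable position)."""
    return all(g is None or g == s for s, g in zip(specific, general))

def infer(cached_subqueries, subquery):
    """Decide the verdict for `subquery` from an endpoint's cached
    subqueries, following the rules described above:

    - exact match or subset inference: some cached subquery is a
      specialization of the given one, so every one of its results also
      matches the given subquery -> 'HAS_STATEMENTS';
    - superset inference (or no related entry): the endpoint may or may
      not match -> 'POSSIBLY_HAS_STATEMENTS', to be probed with ASK.
    """
    for cached in cached_subqueries:
        if specializes(cached, subquery):
            return "HAS_STATEMENTS"
    return "POSSIBLY_HAS_STATEMENTS"
```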

Fig.2 The Flow chart showing Source Selection Process

Queries & Experimental results

4.1 Queries

The following cross-domain queries were used to test the system. These are the same queries used by FedBench [x] in their experiments:

Query CD1: Find all information about Barack Obama.

SELECT ?predicate ?object WHERE {

{ <http://dbpedia.org/resource/Barack_Obama> ?predicate ?object }

UNION

{ ?subject <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Barack_Obama> .

?subject ?predicate ?object }

}

Query CD2: Return Barack Obama's party membership and news pages.

SELECT ?party ?page WHERE {

<http://dbpedia.org/resource/Barack_Obama> <http://dbpedia.org/ontology/party> ?party .

?x <http://data.nytimes.com/elements/topicPage> ?page .

?x <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Barack_Obama> .

}

Query CD3: Find all news about actors starring in a movie with name Tarzan.

SELECT ?actor ?news WHERE {

?film <http://purl.org/dc/terms/title> 'Tarzan' .

?film <http://data.linkedmdb.org/resource/movie/actor> ?actor .

?actor <http://www.w3.org/2002/07/owl#sameAs> ?x .

?y <http://www.w3.org/2002/07/owl#sameAs> ?x .

?y <http://data.nytimes.com/elements/topicPage> ?news

}

Query CD4: Find all news about the locations in California.

SELECT ?location ?news WHERE {

?location <http://www.geonames.org/ontology#parentFeature> ?parent .

?parent <http://www.geonames.org/ontology#name> 'California' .

?y <http://www.w3.org/2002/07/owl#sameAs> ?location .

?y <http://data.nytimes.com/elements/topicPage> ?news

}

4.2 Experimental Results

The following table shows the comparison between the execution times of the optimized and unoptimized source selection processes.

Fig. 3 Results (time in ms)

Query   Unoptimized   Optimized
CD-1    387           345
CD-2    324           293
CD-3    481           409
CD-4    369           313

Conclusion and future work

We worked on tweaking the source selection process of FedX, and the results show a good reduction in query execution time. We were successful in reducing the number of ASK requests sent to the endpoints. If repositories set up on a network were used instead of local repositories, the reduction in the number of ASK queries would also reduce the network overhead of query execution compared to the unoptimized source selection process.


