Forming Exclusive Groups

02 Nov 2017

Keywords – SPARQL, FedX, Semantic Web.

Introduction

Our work is based on studying the FedX framework, a practical framework that enables efficient SPARQL query processing on heterogeneous, virtually integrated Linked Data sources. We also tried to tweak its source selection parameters so as to get better performance. We used Virtuoso Open Source [3] for creating live SPARQL endpoints. We initially tried Fuseki but found that Virtuoso was more appropriate.

We used various datasets, including LinkedMDB [4], DBpedia [5], GeoNames [6], and NY Times [7].

In Sect. 2 we discuss the basic structure of our experiment, introducing each technology used as well as the specifications of the machine used in the experiment. Sect. 3 describes the algorithm used to optimize source selection, Sect. 4 covers the queries used to test the system, Sect. 5 compares FedX with and without optimized source selection, and Sect. 6 presents the conclusion and future work.

Basic Structure

Following are the tools and technologies used in the experiment:

FedX

Virtuoso-Open Source

Data Sets Used

System Specification

2.1 FedX

FedX has been implemented in Java and extends the Sesame framework.

Fig. 1 The query processing model of FedX.

Optimization techniques already incorporated in FedX are as follows:

2.1.1 Source Selection

FedX uses SPARQL ASK requests for this purpose [1]. A cache is used to reduce the number of ASK queries sent. Further details are explained in Sect. 3.
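As an illustration, the shape of such an ASK probe for a single triple pattern can be sketched as follows (a hypothetical helper, not FedX's actual code; literals and blank nodes are ignored for brevity):

```python
def ask_query_for(triple):
    """Build a SPARQL ASK query string for one triple pattern.

    `triple` is a (subject, predicate, object) tuple where variables
    start with '?' and IRIs are given without angle brackets.
    Hypothetical helper for illustration; literal objects (e.g.
    'Tarzan') are not handled here.
    """
    def term(t):
        # Variables pass through; everything else is treated as an IRI.
        return t if t.startswith("?") else "<%s>" % t

    s, p, o = triple
    return "ASK { %s %s %s }" % (term(s), term(p), term(o))
```

An endpoint answering `true` to this probe is annotated as a relevant source for the pattern.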

2.1.2 Group Ordering

The join order determines the number of intermediate results and is therefore a very important factor in optimization. FedX uses a rule-based join optimizer, which orders a list of join arguments (i.e., triple patterns or groups of triple patterns) according to a heuristics-based cost estimation [8].

2.1.3 Forming Exclusive Groups

These are the triple patterns that have a single relevant source. Exclusive groups of size ≥ 2 can be used for optimization: rather than sending such triple patterns sequentially to the (single) relevant source, they are sent together as a conjunctive query, thus executing them in a single subquery at the respective endpoint. This drastically improves the performance of the system, as both the number of remote requests and the amount of data sent over the network are minimized.
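A minimal sketch of this grouping step, assuming each triple pattern has already been annotated with its set of relevant sources (the pattern strings and source names below are made up for illustration):

```python
from collections import defaultdict

def form_exclusive_groups(annotated_patterns):
    """Group triple patterns that have exactly one relevant source.

    `annotated_patterns` is a list of (pattern, sources) pairs, where
    `pattern` is a triple pattern (a plain string here, for simplicity)
    and `sources` is the set of sources selected for it.  Patterns
    whose single relevant source coincides are merged into one
    conjunctive subquery for that source.  Illustrative sketch, not
    FedX's actual implementation.
    """
    groups = defaultdict(list)
    for pattern, sources in annotated_patterns:
        if len(sources) == 1:
            groups[next(iter(sources))].append(pattern)
    # Only groups of size >= 2 save remote requests when sent together.
    return {src: " . ".join(pats)
            for src, pats in groups.items() if len(pats) >= 2}
```

Here the two LinkedMDB-only patterns of a query like CD3 would be merged into a single conjunctive subquery for that endpoint.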

2.1.4 Bound Join

If the joins are computed in a block nested loop fashion, i.e., as a distributed semijoin, the number of requests can be reduced by a factor equal to the block size, in the following referred to as the input sequence. In this optimization a set of mappings is grouped into a single subquery using SPARQL UNION constructs. This grouped subquery is then sent to the relevant data sources in a single remote request; a little post-processing is required to obtain the final result [8].
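The grouping of a block of input mappings into one UNION subquery can be sketched as follows (simplified: real FedX also renames variables per branch so that results can be attributed to the correct input mapping during post-processing):

```python
def bound_join_subquery(pattern, bindings, var):
    """Build one UNION subquery evaluating `pattern` for a whole block
    of input mappings, in the spirit of FedX's bound joins.

    `pattern` contains the placeholder variable `var`; each binding
    (an IRI string) is substituted in turn and the instantiated
    patterns are combined with UNION, so the whole block costs a
    single remote request instead of one request per binding.
    Illustrative sketch only.
    """
    branches = ["{ %s }" % pattern.replace(var, "<%s>" % b)
                for b in bindings]
    return "SELECT * WHERE { %s }" % " UNION ".join(branches)
```

With a block size of 15, for example, 15 nested-loop requests collapse into one such UNION request.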

2.2 Virtuoso-Open Source

We used Virtuoso for creating the SPARQL endpoints on which the queries are fired. For this we loaded the Virtuoso database with ".nt" (N-Triples) files, a line-based, plain-text serialization format for RDF (Resource Description Framework) graphs and a subset of the Turtle (Terse RDF Triple Language) format. We created 5 live SPARQL endpoints for the experiment. The ".nt" files are available for download at the respective websites.
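For illustration, a single N-Triples line with IRI terms can be split as follows (a minimal sketch; a real N-Triples parser also handles literals, blank nodes, and escape sequences):

```python
def parse_ntriple(line):
    """Split one N-Triples line into (subject, predicate, object).

    Assumes the simplest case: three whitespace-separated IRI terms in
    angle brackets followed by ' .'.  Literals and blank nodes, which
    also occur in real .nt files, are deliberately not handled here.
    """
    body = line.rstrip().rstrip(".").strip()
    s, p, o = body.split(" ", 2)
    return s, p, o
```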

2.3 Data Sets Used

2.3.1 LinkedMDB[3]

It publishes Linked Open Data about movies using a D2R server. It has a very large collection of RDF triples and was the first Semantic Web dataset in its domain.

2.3.2 Dbpedia[4]

It is the centre of the Web of Linked Open Data. It extracts structured information from Wikipedia, one of the biggest knowledge bases. DBpedia also tracks the changes being made in Wikipedia. Since it takes its information from Wikipedia, it covers various domains and is multilingual. It allows complex queries against Wikipedia data.

2.3.3 GeoNames[5]

It is a geographical dataset which provides information about different locations. GeoNames provides various types of information about every location, such as population, nearby places, elevation, a map of the place, etc. In this dataset each feature is described as a resource and given a unique URI. In our experiment this dataset provided information such as the map, population, and nearby places of a shooting location.

2.3.4 NYTimes[6]

It is one of the most authoritative news vocabularies ever created. The New York Times has published approximately 10,000 subject headings as Linked Open Data under a CC BY license. The various tag types and mapping strategies include people, organizations, locations, and descriptors.

2.4 Experimental Setup

The experiments were carried out on an Intel® Core™ i3-350M processor (2.26 GHz, 3 MB L3 cache) with 3 GB DDR3 RAM and a 320 GB hard drive, running a 64-bit Windows 8 operating system. Virtuoso Open Source was used for creating the SPARQL endpoints. We ran all tests against local SPARQL endpoints to avoid the effects of unpredictable network delay. All queries were run 10 times and their average execution time is reported.
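The measurement procedure described above (each query run 10 times, average reported) corresponds to a harness along these lines (a generic sketch; `run_query` stands for whatever callable executes one federated query):

```python
import time

def average_runtime_ms(run_query, repetitions=10):
    """Run `run_query` the given number of times and return the mean
    wall-clock time in milliseconds, mirroring the measurement
    procedure used in the experiments."""
    total = 0.0
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        total += time.perf_counter() - start
    return total / repetitions * 1000.0
```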

Source Selection Optimization

FedX evaluates the triple patterns of a SPARQL query against the data sources and retrieves results from them, but in reality each triple pattern needs to be evaluated only on the relevant data sources. Relevant sources could be determined by maintaining metadata about the sources, but this is highly inefficient and impractical. Therefore FedX sends SPARQL ASK queries to the federation members and, on the basis of the results obtained, annotates each triple pattern with its relevant sources. FedX also has a cache that remembers whether a source is relevant for a given triple pattern; this cache helps reduce the number of SPARQL ASK queries sent to the data sources. Our work helps in retrieving relevant endpoints from the cache using the following algorithm:

Obtaining triples from SPARQL query:

A SPARQL query is passed in the form of a string and needs to be decomposed into its triple patterns so that a relevant source can be found for each of them.

E.g., for the following SPARQL query,

SELECT ?predicate ?object WHERE {

{ <http://dbpedia.org/resource/Barack_Obama> ?predicate ?object }

UNION

{ ?subject <http://www.w3.org/2002/07/owl#sameAs>

<http://dbpedia.org/resource/Barack_Obama> .

?subject ?predicate ?object }

}

Set of triples are:

<http://dbpedia.org/resource/Barack_Obama> ?predicate ?object

?subject <http://www.w3.org/2002/07/owl#sameAs>

<http://dbpedia.org/resource/Barack_Obama>

?subject ?predicate ?object

Converting triples to subqueries:

Once the set of triples is obtained, the variables are to be removed and the subject, predicate, and object identified for each triple. Note that triples like (?x c c) and (?y c c) should be treated as the same. Hence all variables are substituted with null values in the subquery.

E.g., for the following triples:

?subject1 <http://www.w3.org/2002/07/owl#sameAs>

<http://dbpedia.org/resource/Barack_Obama>

?subject2 <http://www.w3.org/2002/07/owl#sameAs>

<http://dbpedia.org/resource/Barack_Obama>

the corresponding subquery for both triples is:

<null> <http://www.w3.org/2002/07/owl#sameAs>

<http://dbpedia.org/resource/Barack_Obama>
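This normalization step can be sketched as follows, using `None` as the null placeholder for variables (an illustrative sketch; FedX's actual cache keys are built from Sesame Value objects):

```python
def to_subquery(triple):
    """Normalize a triple pattern into a cache key by replacing every
    variable (terms starting with '?') with None, so that e.g.
    (?x c c) and (?y c c) map to the same subquery."""
    return tuple(None if t.startswith("?") else t for t in triple)
```

Because variable names are erased, the two `?subject1`/`?subject2` triples above produce the same cache key.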

Determining endpoints for ASK queries:

We need to form two different sets of endpoints:

endpoints which have a cache entry and can possibly act as relevant sources, and

endpoints which cannot be determined as relevant from the cache and need to be checked with ASK queries.

While checking all the endpoints, if an entry for an endpoint exists in the cache, the enum value HAS_LOCAL_STATEMENTS is returned; if the endpoint does not have any entry, the enum value POSSIBLY_HAS_STATEMENTS is returned. Endpoints returning POSSIBLY_HAS_STATEMENTS are then queried with ASK queries.
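This cache lookup can be sketched as follows, with the cache as a map from subqueries to the endpoints known to be relevant (enum names follow the text; the ASK probing itself is omitted):

```python
from enum import Enum

class StatementSourceState(Enum):
    HAS_LOCAL_STATEMENTS = 1      # cache says the endpoint is relevant
    POSSIBLY_HAS_STATEMENTS = 2   # unknown: must be probed with ASK

def lookup(cache, subquery, endpoint):
    """Return the cache verdict for (subquery, endpoint).

    `cache` maps normalized subqueries to the set of endpoints known to
    hold matching statements.  Endpoints that come back as
    POSSIBLY_HAS_STATEMENTS are subsequently queried with ASK.
    Illustrative sketch of the behaviour described above.
    """
    if endpoint in cache.get(subquery, set()):
        return StatementSourceState.HAS_LOCAL_STATEMENTS
    return StatementSourceState.POSSIBLY_HAS_STATEMENTS
```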

Determining relevant sources and remaining endpoints for ASK queries:

The cache is implemented as a hashmap of subqueries mapped to endpoints. For the given subquery, if the endpoint has the value HAS_LOCAL_STATEMENTS, then one of the following is possible:

If the exact subquery exists in the cache, then the endpoint mapped to it is a relevant source and HAS_STATEMENTS is returned for the given endpoint.

Subset inference: if the given subquery is (?x ?y c) and the cache has an entry for subquery (?x c c), then the endpoint mapped to it must contain results for (?x ?y c). Hence HAS_STATEMENTS is returned for the given endpoint.

Superset inference: if the given subquery is (?x c c) and the cache has an entry for subquery (?x ?y c), then the endpoint mapped to it may or may not contain results for (?x c c). Hence POSSIBLY_HAS_STATEMENTS is returned for the given endpoint, which is then queried with an ASK query.
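The subset/superset inference over cached subqueries can be sketched as follows, with `None` marking a variable position in a normalized subquery (an illustrative sketch, not FedX's implementation):

```python
def specializes(specific, general):
    """True if `specific` can be obtained from `general` by binding
    variables (None marks a variable position)."""
    return all(g is None or g == s for s, g in zip(specific, general))

def infer(cached_subqueries, subquery):
    """Decide the verdict for `subquery` from an endpoint's cached
    subqueries, following the rules described above:

    - exact match or subset inference: some cached subquery is a
      specialization of the given one, so every one of its results also
      matches the given subquery -> 'HAS_STATEMENTS';
    - superset inference (or no related entry): the endpoint may or may
      not match -> 'POSSIBLY_HAS_STATEMENTS', to be probed with ASK.
    """
    for cached in cached_subqueries:
        if specializes(cached, subquery):
            return "HAS_STATEMENTS"
    return "POSSIBLY_HAS_STATEMENTS"
```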

Fig.2 The Flow chart showing Source Selection Process

Queries & Experimental results

4.1 Queries

The following cross-domain queries were used to test the system. These are the same queries used by FedBench [x] in their experiments:

Query CD1: Find all information about Barack Obama.

SELECT ?predicate ?object WHERE {

{ <http://dbpedia.org/resource/Barack_Obama> ?predicate ?object }

UNION

{ ?subject <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Barack_Obama> .

?subject ?predicate ?object }

}

Query CD2: Return Barack Obama's party membership and news pages.

SELECT ?party ?page WHERE {

<http://dbpedia.org/resource/Barack_Obama> <http://dbpedia.org/ontology/party> ?party .

?x <http://data.nytimes.com/elements/topicPage> ?page .

?x <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Barack_Obama> .

}

Query CD3: Find all news about actors starring in a movie with name Tarzan.

SELECT ?actor ?news WHERE {

?film <http://purl.org/dc/terms/title> 'Tarzan' .

?film <http://data.linkedmdb.org/resource/movie/actor> ?actor .

?actor <http://www.w3.org/2002/07/owl#sameAs> ?x .

?y <http://www.w3.org/2002/07/owl#sameAs> ?x .

?y <http://data.nytimes.com/elements/topicPage> ?news

}

Query CD4: Find all news about the locations in California.

SELECT ?location ?news WHERE {

?location <http://www.geonames.org/ontology#parentFeature> ?parent .

?parent <http://www.geonames.org/ontology#name> 'California' .

?y <http://www.w3.org/2002/07/owl#sameAs> ?location .

?y <http://data.nytimes.com/elements/topicPage> ?news

}

4.2 Experimental Results

The following table shows the comparison between the execution times of the optimized and unoptimized source selection processes.

Fig. 3 Results (time in ms)

Query   Unoptimized   Optimized
CD-1    387           345
CD-2    324           293
CD-3    481           409
CD-4    369           313

Conclusion and future work

We worked on tweaking the source selection process of FedX, and the results show a good reduction in query execution time. We were successful in reducing the number of ASK requests sent to the endpoints. If repositories set up on a network were used instead of local repositories, the reduction in the number of ASK queries would also reduce the network overhead of query execution compared to the unoptimized source selection process.


