Self-Sustaining Ontologies for Search Engines


Abstract—Ontologies have become the de-facto modeling tool of choice, employed in many applications and prominently in the Semantic Web. Nevertheless, ontology construction remains a daunting task. Ontological bootstrapping, which aims at automatically generating concepts and their relations in a given domain, is a promising technique for ontology construction. Bootstrapping an ontology based on a set of predefined textual sources, such as Web services, must address the problem of multiple, largely unrelated concepts. In this paper, we propose an ontology bootstrapping process for Web services. We exploit the advantage that Web services usually consist of both WSDL and free text descriptors. The WSDL descriptor is evaluated using two methods, namely Term Frequency/Inverse Document Frequency (TF/IDF) and Web context generation. Our proposed ontology bootstrapping process integrates the results of both methods and applies a third method to validate the concepts using the service free text descriptor, thereby offering a more accurate definition of ontologies. We extensively validated our bootstrapping method using a large repository of real-world Web services and verified the results against existing ontologies. The experimental results indicate high precision. Furthermore, the recall versus precision comparison of the results when each method is separately implemented presents the advantage of our integrated bootstrapping approach.

Keywords— Web Services Discovery, Metadata of Services Interfaces, Service-Oriented Relationship Modeling

Introduction

ONTOLOGIES are used in an increasing range of applications, notably the Semantic Web, and have essentially become the preferred modeling tool. However, the design and maintenance of ontologies is a formidable process [1], [2]. Ontology bootstrapping, which has recently emerged as an important technology for ontology construction, involves the automatic identification of concepts relevant to a domain and of the relations between those concepts [3]. Previous work on ontology bootstrapping focused either on a limited domain [4] or on expanding an existing ontology. In the field of Web services, registries such as the Universal Description, Discovery and Integration (UDDI) have been created to encourage interoperability and adoption of Web services. Unfortunately, UDDI registries have some major flaws. In particular, UDDI registries either are publicly available and contain many obsolete entries or require registration that limits access. In either case, a registry only stores a limited description of the available services. Ontologies created for classifying and utilizing Web services can serve as an alternative solution. However, the increasing number of available Web services makes it difficult to classify them using a single domain ontology or a set of existing ontologies created for other purposes. Furthermore, the constant increase in the number of Web services requires continuous manual effort to evolve such an ontology.

The web service ontology bootstrapping process proposed in this project is based on the advantage that a web service can be separated into two types of descriptions: 1) the Web Service Description Language (WSDL) describing "how" the service should be used and 2) a textual description of the web service in free text describing "what" the service does. This advantage allows bootstrapping the ontology based on WSDL and verifying the process based on the web service free text descriptor.

The ontology bootstrapping process is based on analyzing a web service using three different methods, where each method represents a different perspective of viewing the web service. As a result, the process provides a more accurate definition of the ontology and yields better results. In particular, the Term Frequency/Inverse Document Frequency (TF/IDF) method analyzes the web service from an internal point of view, i.e., what concept in the text best describes the WSDL document content. The Web Context Extraction method describes the WSDL document from an external point of view, i.e., what most common concept represents the answers to the web search queries based on the WSDL content. Finally, the Free Text Description Verification method is used to resolve inconsistencies with the current ontology. An ontology evolution is performed when all three analysis methods agree on the identification of a new concept or a relation change between the ontology concepts. The relation between two concepts is defined using the descriptors related to both concepts. This approach can assist in ontology construction and reduce the maintenance effort substantially. The approach facilitates automatic building of an ontology that can assist in expanding, classifying, and retrieving relevant services, without the prior training required by previously developed approaches.

ONTOLOGY CREATION AND EVOLUTION

The field of automatic annotation of Web services contains several works relevant to our research. Patil et al. [5] present a combined approach toward automatic semantic annotation of Web services. The approach relies on several matchers (e.g., a string matcher, a structural matcher, and a synonym finder), which are combined using a simple aggregation function. Chabeb et al. [6] describe a related technique for semantic annotation of Web services. Oldham et al. [7] use a simple machine learning (ML) technique, namely a Naïve Bayesian classifier, to improve the precision of service annotation. Machine learning is also used in a tool called ASSAM, which uses existing annotations of semantic Web services to improve new annotations. Categorizing and matching Web services against an existing ontology was proposed by Liang [8]. A context-based semantic approach to the problem of matching and ranking Web services for possible service composition is suggested in [9]. Unfortunately, all these approaches require clear and formal semantic mappings to existing ontologies.

Recent work has focused on ontology creation and evolution, and in particular on schema matching. Many heuristics were proposed for the automatic matching of schemata (e.g., GLUE [10] and OntoBuilder [11]), and several theoretical models were proposed to represent various aspects of the matching process, such as the representation of mappings between ontologies, ontology matching using upper ontologies, and modeling and evaluating automatic semantic reconciliation. However, all the methodologies described require a comparison between existing ontologies. The realm of information science has produced an extensive body of literature and practice in ontology construction. Work has also been done in ontology learning, such as Text-To-Onto [12]. Finally, researchers in the field of knowledge representation have studied ontology interoperability. The works described are limited to ontology management involving manual assistance in the ontology construction process.

Ontology evolution has been researched on domain-specific Web sites and digital library collections [4]. A bootstrapping approach to knowledge acquisition in the fields of visual media and multimedia uses existing ontologies for ontology evolution. Another perspective focuses on reusing ontologies and language components for ontology generation. Noy and Klein [1] defined a set of ontology-change operations and their effects on instance data used during the ontology evolution process. Unlike previous work, which relied heavily on an existing ontology or was domain specific, our work automatically evolves an ontology for Web services from scratch.

Surveys on ontology techniques applied to the Semantic Web and on service discovery approaches suggest ontology evolution as one of the future directions of research. Ontology learning tools for semantic Web service descriptions have been developed based on Natural Language Processing (NLP); that work notes the importance of further research on context-directed ontology learning in order to overcome the limitations of NLP. In addition, a survey on state-of-the-art Web service repositories suggests that analyzing the Web service textual description in addition to the WSDL description can be more useful than analyzing each descriptor separately. The survey also mentions that existing ontology evolution techniques yield low recall.

THE BOOTSTRAPPING ONTOLOGY MODEL

The bootstrapping ontology model proposed in this paper is based on the continuous analysis of WSDL documents and employs an ontology model based on concepts and relationships. The innovation of the proposed bootstrapping model centers on i) the combination of two different extraction methods, TF/IDF and Web-based concept generation, and ii) the verification of the results using a Free Text Description Verification method that analyzes the external service descriptor. We utilize these three methods to demonstrate the feasibility of our model. It should be noted that other, more complex methods from the fields of Machine Learning (ML) and Information Retrieval (IR) can also be used to implement the model. However, using the methods in a straightforward manner emphasizes that many methods can be "plugged in" and that the results are attributed to the model's process of combination and verification. Our model integrates these three specific methods since each presents a unique advantage: an internal perspective of the Web service (TF/IDF), an external perspective (Web Context Extraction), and a comparison against the free text description, a manually supplied account of the service, for verification purposes.

Fig. 1. Web Service Ontology Bootstrapping Process

The overall bootstrapping ontology process is described in Figure 1. There are four main steps in the process. The token extraction step extracts tokens representing relevant information from a WSDL document. This step extracts all the name labels, parses the tokens, and performs initial filtering.

The second step analyzes the extracted WSDL tokens in parallel using two methods. In particular, TF/IDF analyzes the most common terms appearing in each Web service document that appear less frequently in other documents. Web Context Extraction uses the sets of tokens as queries to a search engine, clusters the results according to textual descriptors, and classifies which set of descriptors identifies the context of the Web service. The concept evocation step identifies the descriptors that appear in both the TF/IDF method and the Web context method. These descriptors identify possible concept names that could be utilized by the ontology evolution. The context descriptors also assist in the convergence process of the relations between concepts. Finally, the ontology evolution step expands the ontology as required according to the newly identified concepts and modifies the relations between them. The relations are defined in an ongoing process according to the most common context descriptors between the concepts. After the ontology evolution, the whole process continues to the next WSDL document with the evolved ontology concepts and relations. It should be noted that the processing order of WSDL documents is arbitrary. In the following, we describe each step of our approach in detail. The following three Web services will be used as a running example to illustrate our approach (a high-level sketch of the overall loop follows their descriptions):

DomainSpy is a Web service that allows domain registrants to be identified by region or registrant name. It maintains an XML-based domain database with over 7 million domain registrants in the U.S.

AcademicVerifier is a Web service that determines whether an email address or domain name belongs to an academic institution.

ZipCodeResolver is a Web service that resolves partial U.S. mailing addresses and returns the proper ZIP code. The service uses an XML interface.
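As promised above, the following minimal Python sketch renders the loop of Figure 1 at a high level. Each helper function here is a hypothetical stand-in for a step detailed in the coming subsections; none of this is the original system's code.

```python
# Hypothetical, high-level rendering of the bootstrapping loop in Figure 1.
# Every helper named below is a stand-in for a step described later.

def bootstrap_ontology(services, ontology):
    """Evolve `ontology` from (wsdl_document, free_text) service pairs."""
    for wsdl, free_text in services:                 # processing order is arbitrary
        tokens = extract_tokens(wsdl)                # step 1: token extraction
        tfidf_terms = tfidf_analysis(tokens)         # step 2a: internal view
        context_terms = web_context(tokens)          # step 2b: external view
        candidates = set(tfidf_terms) & set(context_terms)  # step 3: concept evocation
        evolve(ontology, candidates, free_text)      # step 4: evolution + verification
    return ontology
```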

Token Extraction

The analysis starts with token extraction, representing each service S by a set of tokens called a descriptor. Each token is a textual term, extracted by simply parsing the underlying documentation of the service. The descriptor represents the WSDL document, formally defined as $D = \{t_1, t_2, \ldots, t_n\}$, where $t_i$ is a token. WSDL tokens require special handling, since meaningful tokens (such as names of parameters and operations) are usually composed of a sequence of words with the first letter of each word capitalized (e.g., GetDomainsByRegistrantNameResponse). Therefore, such names are divided into separate tokens. It is worth mentioning that we initially considered using predefined WSDL documentation tags for extraction and evaluation but found them less valuable, since Web service developers usually do not include such tags in their services. Figure 2 depicts a WSDL document with the token list bolded. The extracted token list serves as a baseline. These tokens are extracted from the WSDL document of the Web service DomainSpy. The service is used as the initial step of our example in building the ontology. Additional services will be used later to illustrate the process of expanding the ontology. All elements classified as name are extracted, including tokens that might be less relevant. Composite names are split into separate words at each capital letter, as mentioned above. The tokens are then filtered using a list of stop-words, removing words with no substantive semantics. Next, we describe the two methods used for the description extraction of Web services: TF/IDF and context extraction.

Fig. 2. WSDL Example of the Service DomainSpy
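As a concrete illustration of this step, a minimal extraction routine might look as follows; the stop-word list and the splitting pattern are illustrative choices, not the exact ones used in our implementation.

```python
import re

# A minimal sketch of token extraction, assuming the name labels have
# already been pulled out of the WSDL document. STOP_WORDS and the regex
# are illustrative, not the implementation's exact choices.

STOP_WORDS = {"get", "by", "the", "of", "response", "result"}

def split_composite_name(label: str) -> list[str]:
    """Split e.g. 'GetDomainsByRegistrantNameResponse' at capital letters."""
    return re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", label)

def extract_tokens(name_labels: list[str]) -> list[str]:
    """Parse name labels into tokens and drop non-semantic stop-words."""
    tokens = []
    for label in name_labels:
        for word in split_composite_name(label):
            if word.lower() not in STOP_WORDS:
                tokens.append(word)
    return tokens

# extract_tokens(["GetDomainsByRegistrantName"])
# -> ['Domains', 'Registrant', 'Name']
```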

TF/IDF Analysis

TF/IDF is a common mechanism in IR for generating a robust set of representative keywords from a corpus of documents. The method is applied here to the WSDL descriptors. By building an independent corpus for each document, irrelevant terms are more distinct and can be thrown away with higher confidence. To formally define TF/IDF, we start by defining $freq(t_i, D_i)$ as the number of occurrences of the token $t_i$ within the document descriptor $D_i$. We define the term frequency of each token $t_i$ as:

$$tf(t_i) = \frac{freq(t_i, D_i)}{|D_i|}$$

Fig. 3. Example of the TF/IDF Method Results for DomainSpy (top-ranked tokens: Domain, Address, Registrant, Name, Location)

We define $D_{wsdl}$ to be the corpus of WSDL descriptors. The inverse document frequency is calculated as the ratio between the total number of documents and the number of documents that contain the term:

$$idf(t_i) = \log \frac{|D_{wsdl}|}{|\{D \in D_{wsdl} : t_i \in D\}|}$$

Here, $D$ is defined as a specific WSDL descriptor. The TF/IDF weight of a token, annotated as $w(t_i)$, is calculated as:

$$w(t_i) = tf(t_i) \cdot idf^2(t_i)$$

While the common implementation of TF/IDF gives equal weight to the term frequency and the inverse document frequency (i.e., $w = tf \cdot idf$), we chose to give a higher weight to the idf value. The reason behind this modification is to normalize the inherent bias of the tf measure in short documents. Traditional TF/IDF applications are concerned with verbose documents (e.g., books, articles, and human-readable Web pages). However, WSDL documents have relatively short descriptions; the frequency of a word within a document tends to be incidental, and the document length component of tf generally has little or no influence. The token weight is used to induce a ranking over the descriptor's tokens. We define the ranking using a precedence relation $\preceq_{tf/idf}$, which is a partial order over $D$, such that $t_l \preceq_{tf/idf} t_k$ if $w(t_l) < w(t_k)$. The ranking is used to filter the tokens according to a threshold set at two standard deviations from the average token weight $w$. The effectiveness of this threshold was validated by our experiments. Figure 3 presents the list of tokens that received a weight higher than the threshold for the DomainSpy service. Several tokens that appeared in the baseline list (see Figure 2) were removed by the filtering process. For instance, words such as "Response", "Result", and "Get" received below-threshold TF/IDF weights due to their low idf values, since they appear in many WSDL documents.
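The following sketch summarizes this weighting and filtering scheme. Squaring the idf term is one way to realize the "higher weight to idf" described above, and setting the threshold at two standard deviations above the mean weight is likewise our reading of the description; both are assumptions rather than the paper's exact formulas.

```python
import math
from collections import Counter

# Sketch of the modified TF/IDF weighting; idf**2 and the mean + 2*std
# threshold are assumed readings of the scheme described in the text.

def tfidf_weights(descriptor: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """Compute w(t_i) = tf(t_i) * idf(t_i)^2 for one WSDL descriptor."""
    weights = {}
    for token, freq in Counter(descriptor).items():
        tf = freq / len(descriptor)                 # term frequency
        df = sum(1 for d in corpus if token in d)   # documents containing token
        idf = math.log(len(corpus) / df)            # inverse document frequency
        weights[token] = tf * idf ** 2
    return weights

def filter_tokens(weights: dict[str, float]) -> list[str]:
    """Keep only tokens whose weight clears the threshold."""
    mean = sum(weights.values()) / len(weights)
    std = (sum((w - mean) ** 2 for w in weights.values()) / len(weights)) ** 0.5
    threshold = mean + 2 * std
    return [t for t, w in weights.items() if w > threshold]
```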

Web Context Extraction

We define a context descriptor $c_i$ from domain $DOM$ as an index term used to identify a record of information, which in our case is a Web service. It can consist of a word, phrase, or alphanumerical term. A weight $w_i \in \Re$ identifies the importance of descriptor $c_i$ in relation to the Web service. For example, we can have a descriptor $c_1 = Address$ with $w_1 = 42$. A descriptor set $\{\langle c_i, w_i \rangle\}_i$ is defined as a set of descriptor-weight pairs. Each descriptor can define a different point of view of the concept. The descriptor set eventually defines all the different perspectives and their relevant weights, which identify the importance of each perspective.

By collecting all the different viewpoints delineated by the different descriptors, we obtain the context. A context $C = \{\{\langle c_{ij}, w_{ij} \rangle\}_i\}_j$ is a set of finite sets of descriptors, where $i$ indexes the context descriptors and $j$ indexes the sets. For example, a context $C$ may be a set of words (hence $DOM$ is the set of all possible character combinations) defining a Web service, where the weights represent the relevance of a descriptor to the Web service. The input of the algorithm is the set of tokens extracted from the Web service WSDL descriptor. The token sets are extracted from elements classified as name, for example Get Domains By Zip, as described in Figure 4. Each set of tokens is then sent to a Web search engine, and a set of descriptors is extracted by clustering the Web page search results for each token set. The Web page clustering algorithm is based on the concise all pairs profiling (CAPP) clustering method, which approximates profiling of large classifications. It compares all classes pairwise and then minimizes the total number of features required to guarantee that each pair of classes is contrasted by at least one feature. Each class profile is then assigned its own minimized list of features, characterized by how these features differentiate the class from the others. Figure 4 shows an example of the extraction and clustering performed on the tokens Get Domains By Zip. The context descriptors extracted include:

{⟨Zip Code, (50, 2)⟩, ⟨Download, (35, 1)⟩, ⟨Registration, (27, 7)⟩, ⟨Sale, (15, 1)⟩, ⟨Security, (10, 1)⟩, ⟨Network, (12, 1)⟩, ⟨Picture, (9, 1)⟩, ⟨Free Domains, (4, 3)⟩}. A different point of view of the concept can be seen in the token set Domains, where the context descriptors extracted include {⟨Hosting, (46, 1)⟩, ⟨Domain, (27, 7)⟩, ⟨Address, (9, 4)⟩, ⟨Sale, (5, 1)⟩, ⟨Premium, (5, 1)⟩, ⟨Whois, (5, 1)⟩}. It should be noted that each descriptor is accompanied by two initial weights. The first weight represents the number of references on the Web (i.e., the number of returned Web pages) for that descriptor in the specific query. The second weight represents the number of references to the descriptor in the WSDL (i.e., for how many name token sets the descriptor was retrieved). For instance, in the above example, Registration appeared in 27 Web pages, and 7 different name token sets in the WSDL referred to it. The algorithm then calculates, for each descriptor, the sum of the number of Web pages that identify it and the sum of the number of references to it in the WSDL. A high ranking in only one of the weights does not necessarily indicate the importance of the context descriptor. For example, a high ranking in Web references alone may mean that the descriptor widely appears on the Web but is not relevant to the topic of the Web service (e.g., the Download descriptor for the DomainSpy Web service, see Figure 4). To combine the values of both the Web page references and the appearances in the WSDL, the two values are weighted to contribute equally to the final weight value.

Fig. 4. Example of the Context Extraction Method for DomainSpy
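The sketch below summarizes the extraction and weight combination just described. `search_web` and `capp_cluster` are hypothetical placeholders for the search engine query and the CAPP clustering step, and the min-max normalization used to let both weights contribute equally is our illustrative choice, not a formula taken from the paper.

```python
from dataclasses import dataclass

# `search_web` and `capp_cluster` are hypothetical placeholders; the
# system queries a real search engine and clusters result pages with CAPP.

@dataclass
class ContextDescriptor:
    term: str
    web_refs: int = 0   # Web pages returned for the descriptor
    wsdl_refs: int = 0  # name token sets that retrieved the descriptor

def extract_context(token_sets: list[list[str]]) -> list[ContextDescriptor]:
    """Query the Web once per token set and aggregate the two weights."""
    found: dict[str, ContextDescriptor] = {}
    for tokens in token_sets:
        pages = search_web(" ".join(tokens))          # e.g., "Get Domains By Zip"
        for term, page_count in capp_cluster(pages):  # cluster the result pages
            d = found.setdefault(term, ContextDescriptor(term))
            d.web_refs += page_count
            d.wsdl_refs += 1
    return list(found.values())

def top_context(descriptors: list[ContextDescriptor], top_k: int = 6) -> list[str]:
    """Normalize both weights to [0, 1] and keep the highest-ranking terms."""
    max_web = max(d.web_refs for d in descriptors)
    max_wsdl = max(d.wsdl_refs for d in descriptors)
    ranked = sorted(descriptors,
                    key=lambda d: d.web_refs / max_web + d.wsdl_refs / max_wsdl,
                    reverse=True)
    return [d.term for d in ranked[:top_k]]
```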

Figure 4 provides the outcome of the Web context extraction method for the DomainSpy service (see the bottom right part of the figure); only the highest-ranking descriptors are included in the context. For example, Domain, Address, Registration, Hosting, Software, and Search are the context descriptors selected to describe the DomainSpy service.

Ontology Evolution

The ontology evolution consists of four steps: i) building new concepts, ii) determining the concept relations, iii) identifying relation types, and iv) resetting the process for the next WSDL document. Each candidate concept is further validated using the textual service descriptor. The analysis is based on the advantage that a Web service can be separated into two descriptions: the WSDL description and a textual description of the Web service in free text. The WSDL descriptor is analyzed to extract the context descriptors and possible concepts as described previously. The second descriptor represents the textual description of the service supplied by the service developer in free text. These descriptions are relatively short, comprising up to a few sentences describing the Web service. Figure 6 presents an example of the free text description for the DomainSpy Web service. The verification process performs simple string matching between the concept descriptors and all the descriptors of the service textual descriptor. We use a simple string-matching function, matchstr, which returns 1 if two strings match and 0 otherwise.
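A minimal sketch of this verification step follows. matchstr implements the definition given in the text; the tokenization of the free text description is a simplification we assume for illustration.

```python
# Sketch of the Free Text Description Verification step. matchstr follows
# the definition in the text; the tokenization here is simplified.

def matchstr(a: str, b: str) -> int:
    """Return 1 if the two strings match (ignoring case), 0 otherwise."""
    return int(a.lower() == b.lower())

def verify_concepts(candidates: list[str], free_text: str) -> list[str]:
    """Keep only candidate concepts that also appear in the free text."""
    text_tokens = free_text.replace(".", " ").replace(",", " ").split()
    return [c for c in candidates
            if any(matchstr(c, t) for t in text_tokens)]

# For AcademicVerifier: verify_concepts(["Domain"], "... whether an email
# address or domain name belongs to an academic institution") keeps
# "Domain", while an unverified candidate such as "XML" for the
# ZipCodeResolver service would be dropped.
```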

Expanding the example in Figure 5, we can see the concept evocation step at the top and the ontology evolution at the bottom, both based on the same set of services. Analysis of the AcademicVerifier service yields only one descriptor as a possible concept: the descriptor Domain was identified by both the TF/IDF and the Web context results and matched with a textual descriptor. The same holds for Domain and Address in the DomainSpy service. For the ZipCodeResolver service, however, both Address and XML are possible concepts, but only Address passes the verification against the textual descriptor. As a result, the concept is split into two separate concepts, and the ZipCodeResolver service descriptors are associated with both of them.

To evaluate the relation between concepts, we analyze the overlapping context descriptors between different concepts. In this case, we use descriptors that are included in the union of the descriptors extracted by the TF/IDF and Web context methods. Precedence is given to descriptors that appear in both concept definitions over descriptors that appear only among the context descriptors. In our example, the descriptors related to both Domain and Domain Address are: Software, Registration, Domain, Name, and Address. However, only the Domain descriptor belongs to both concepts, and it therefore receives priority to serve as the relation. The result is a relation that can be identified as a subclass, where Domain Address is a subclass of Domain.

The process of analyzing the relation between concepts is performed after the concepts are identified. Identifying the concepts before the relations makes it possible, in the case of Domain Address and Address, to again apply the subclass relation based on the shared concept descriptor. However, the relation between the Address and XML concepts remains undefined at the current iteration of the process, since it would have to include all the descriptors that relate to the ZipCodeResolver service. The relation described in the example is based on descriptors that form the intersection of the concepts. Basing the relations on a minimum number of Web services belonging to both concepts would result in a less rigid classification of relations.
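The precedence rule just described can be sketched as follows; the decision logic is a hypothetical reconstruction from the Domain / Domain Address example, not code from the original system.

```python
# Hypothetical reconstruction of relation identification: descriptors
# appearing in both concept definitions take precedence over descriptors
# shared only by the surrounding context descriptors.

def relation_descriptors(concept_a: set[str], concept_b: set[str],
                         context_a: set[str], context_b: set[str]) -> set[str]:
    shared = concept_a & concept_b   # descriptors in both concept definitions
    if shared:
        return shared                # e.g., {"Domain"} -> subclass relation
    return context_a & context_b     # otherwise fall back to context overlap

# relation_descriptors({"Domain"}, {"Domain", "Address"}, set(), set())
# -> {"Domain"}: Domain Address is identified as a subclass of Domain.
```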

The process is performed iteratively for each additional service that is related to the ontology. The concepts and relations are defined iteratively as more services are added. The iterations stop once all the services are analyzed.

Fig. 5. Example of Web Service Ontology Bootstrapping

CONCLUSION

The paper proposes an approach for bootstrapping an ontology based on Web service descriptions. The approach is based on analyzing Web services from multiple perspectives and integrating the results. Our approach takes advantage of the fact that Web services usually consist of both WSDL and free text descriptors. This allows bootstrapping the ontology based on WSDL and verifying the process based on the Web service free text descriptor.

The main advantage of the proposed approach is its high precision and its recall versus precision results for the ontology concepts. The value of the concept relations is obtained by analyzing the union and intersection of the concept results. The approach enables the automatic construction of an ontology that can assist in expanding, classifying, and retrieving relevant services, without the prior training required by previously developed methods. As a result, ontology construction and maintenance effort can be substantially reduced. Since the task of designing and maintaining ontologies remains difficult, our approach, as presented in this paper, can be valuable in practice.


