Personalization Of User Models Using Clustering Approach

Published Date: 02 Nov 2017

ABSTRACT

This paper presents a knowledge discovery framework for the construction of Web directories. In this context, the Web directory is viewed as a thematic hierarchy and personalization is realized by constructing user models on the basis of usage data. In contrast to most of the work on Web usage mining, the usage data analyzed here correspond to user navigation throughout the Web, rather than within a particular Web site, and as a result exhibit a high degree of thematic diversity. For modeling the user communities, we introduce a novel methodology that combines the users' browsing behavior with thematic information from the Web directories. Following this methodology, we enhance both the clustering and the data preprocessing approaches. The proposed personalization methodology is evaluated on a specialized artificial Web directory, indicating its potential value to the Web user. The experiments also assess the effectiveness of the different machine learning techniques on the task.

Introduction

We applied personalization to an artificial Web directory that was constructed, using document clustering, from the Web pages included in the log files themselves.

One approach towards the alleviation of this problem is the organization of Web content into thematic hierarchies, also known as Web directories. A Web directory, such as Yahoo! [19] or the Open Directory Project [17], allows Web users to locate information that relates to their interests through a hierarchy navigation process. This approach, however, suffers from a number of problems. The manual creation and maintenance of Web directories leads to limited coverage of the topics contained in those directories, since there are millions of Web pages and the rate of expansion is very high. In addition, the size and complexity of the directories cancel out any gains that were expected with respect to the information overload problem, i.e., it is often difficult for a particular user to navigate to interesting information.

An alternative solution is the personalization of the services on the Web. Web Personalization [12] focuses on the adaptability of Web-based information systems to the needs and interests of individuals or groups of users and aims to make the Web a friendlier environment. Typically, a personalized Web site recognizes its users, collects information about their preferences and adapts its services in order to match the users' needs. A major obstacle towards realizing Web personalization is the acquisition of accurate and operational models for the users. Reliance on manual creation of these models, either by the users or by domain experts, is inadequate for various reasons, among them the annoyance of the users and the difficulty of verifying and maintaining the resulting models. An alternative approach is that of Web Usage Mining [18], which uses data mining methods to create models based on the analysis of usage data, i.e., records of how a service on the Web is used.

The construction of community directories with usage mining raises a number of interesting research issues, which are addressed in this paper. The first challenge is the analysis of large datasets in order to identify community behavior. In addition to the heavy traffic expected at a central node, such as an ISP, a peculiarity of the data is that they do not correspond to hits within the boundaries of a site, but record outgoing traffic to the whole of the Web. This fact leads to the increased dimensionality and the semantic incoherence of the data, i.e., the Web pages that have been accessed. In order to address these issues we create a thematic hierarchy of the Web pages by examining their content, and assign the Web pages to the categories of this hierarchy. An agglomerative clustering approach is used to construct the hierarchy with nodes representing content categories. A community construction method then exploits the constructed hierarchy and specializes it to the interests of particular communities.

The basic data mining algorithm that has been developed for that purpose, the Directory Miner (DM), is an extension of the cluster mining algorithm, which has been used for the construction of site-specific communities. The new method proposed here is able to ascend an existing directory in order to arrive at a suitable level of semantic characterization of the interests of a particular community.

The rest of this paper is organized as follows. Section 2 presents existing approaches to Web personalization with usage mining methods that are related to the work presented here. Section 3 presents in detail our methodology for the construction of Web directories. Section 4 provides results of the application of the methodology to the usage data of an ISP. Finally, Section 5 summarizes the most interesting conclusions of this work and presents promising paths for future research.

2 Related Work

In recent years, the exploitation of usage mining methods for Web personalization has attracted considerable attention, and a number of systems use information from Web server log files to construct user models that represent the behavior of the users. They differ in the method that they employ for the construction of user models, as well as in the way that this knowledge, i.e., the models, is exploited. Clustering methods, e.g. [8], [11] and [20], classification methods, e.g. [14], and sequential pattern discovery, e.g. [16], have been employed to create user models. These models are subsequently used to customize the Web site and recommend links to follow. Usage data have also been combined with the content of Web pages in [13]. In this approach content profiles are created using clustering techniques. Content profiles represent the users' interests for accessed pages with similar content and are combined with usage profiles to support the recommendation process. A similar approach is presented in [7]. Content and usage data are aggregated and clustering methods are employed for the creation of richer user profiles. In [5], Web content data are clustered for the categorization of the Web pages that are accessed by users. These categories are subsequently used to classify Web usage data.

Personalized Web directories, on the other hand, are mainly associated with services such as Yahoo! [19] and Excite [6], which support manual personalization by the user. A semi-automated approach for the personalization of a Web directory like ODP is presented in [17]. In this work, a high-level language, named Semantic Channel Specification Language (SCSL), is defined in order to allow users to specify a personalized view of the directory. This view consists of categories from the Web directory that are chosen by exploiting the declarations offered by SCSL.

Full automation of the personalization process, with the aid of usage mining methods is proposed in the Montage system [1]. This system is used to create personalized portals, consisting primarily of links to the Web pages that a particular user has visited, organized into thematic categories according to the ODP directory. For the construction of the user model, a number of heuristic metrics are used, such as the interest in a page or a topic, the probability of revisiting a page, etc. An alternative approach is the construction of a directory of useful links (bookmarks) for an individual user, as adopted by the PowerBookmarks system [9]. The system collects "bookmark" information for a particular user, such as frequently visited pages, query results from a search engine, etc. Text classification techniques are used for the assignment of labels to Web pages. An important issue regarding these methods is the scalability of the classification methods that they use. These methods may be suitable for constructing models of what a single user usually views, but their extendibility to aggregate user models is questionable. Furthermore, the requirement for a small set of predefined classes complicates the construction of rich hierarchical models.

3 Constructing Web Community Directories

The construction of Web community directories is seen here as the end result of a usage mining process on data collected at the proxy servers of a central service on the Web. This process consists of the following steps:

- Data Collection and Preprocessing, comprising the collection and cleaning of the data, their characterization using the content of the Web pages, and the identification of user sessions. Note that this step involves a separate data mining process for the discovery of content categories and the characterization of the pages.

- Pattern Discovery, comprising the extraction of user communities from the data with a suitably extended cluster mining technique, which is able to ascend a thematic hierarchy in order to discover interesting patterns.

- Knowledge Post-Processing, comprising the translation of community models into Web community directories and their evaluation.

An architectural overview of the discovery process is given in Figure 1, and described in the following sections.

Figure 1: Architectural overview of the discovery process.

3.1 Data Collection and Preprocessing

The usage data that form the basis for the construction of the communities are collected in the access log files of proxy servers, e.g. ISP cache proxy servers. These data record the navigation of the subscribers through the Web. No record of the user's identification is used, in order to avoid privacy violations.

However, the data collected in the logs are usually diverse and voluminous. The outgoing traffic is much higher than the usual incoming traffic of a Web site and the visited pages less coherent semantically. The task of data preprocessing is to assemble these data into a consistent, integrated and comprehensive view, in order to be used for pattern discovery.

The first stage of data preprocessing involves data cleaning. The aim is to remove as much noise from the data as possible, in order to keep only the Web pages that are directly related to the user behavior. This involves the filtering of the log files to remove data that are downloaded without a user explicitly requesting them, such as multimedia content, advertisements, Web counters, etc. Records with HTTP error codes that correspond to bad requests, or unauthorized accesses are also removed.
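The cleaning rules above can be sketched as a simple record filter. This is only an illustration under stated assumptions: the record layout (a dict with `url` and `status` keys), the extension list, the ad/counter host markers, and the choice to drop every response with a status of 400 or above are all hypothetical and are not taken from the paper.

```python
from urllib.parse import urlparse

# Hypothetical filter lists; the paper does not enumerate them.
NON_PAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".swf", ".mp3", ".avi")
AD_HOST_MARKERS = ("ads.", "adserver.", "counter.")


def is_user_request(record):
    """Return True if a log record looks like a page the user explicitly asked for."""
    # Assumption: treat every 4xx/5xx response as an error record and drop it.
    if record["status"] >= 400:
        return False
    parsed = urlparse(record["url"])
    # Drop embedded resources that are downloaded automatically with a page.
    if parsed.path.lower().endswith(NON_PAGE_EXTENSIONS):
        return False
    # Drop hosts that look like ad servers or Web counters.
    host = parsed.netloc.lower()
    if any(host.startswith(marker) for marker in AD_HOST_MARKERS):
        return False
    return True


def clean_log(records):
    """Keep only the records that pass the user-request filter."""
    return [r for r in records if is_user_request(r)]


sample_log = [
    {"ip": "10.0.0.1", "url": "http://example.org/news/index.html", "status": 200},
    {"ip": "10.0.0.1", "url": "http://example.org/img/logo.gif", "status": 200},
    {"ip": "10.0.0.2", "url": "http://ads.example.com/banner", "status": 200},
    {"ip": "10.0.0.2", "url": "http://example.org/private", "status": 403},
]
cleaned = clean_log(sample_log)
```

In this toy log only the first record survives; a real deployment would need filter lists tuned to the proxy's actual traffic.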

3.1.1 Hierarchical Agglomerative Clustering Algorithm

The second stage of data preprocessing categorizes the Web pages thematically, thus reducing the dimensionality and the semantic diversity of the data; for this we introduce a new approach. Typically, Web page categorization approaches, e.g. [4] and [10], use text classification methods to construct models for a small number of known thematic categories of a Web directory, such as that of Yahoo!. These models are then used to assign each visited page to a category. The limitation of this approach, with respect to the methodology proposed here, is that it is based on a dataset for training the classifiers, which is usually limited in scope, i.e., covers only part of the directory. Furthermore, a manually constructed Web directory is required, which suffers from low coverage of the Web.

Unlike partitional algorithms, which build the hierarchical solution from top to bottom, agglomerative algorithms build the solution by initially assigning each document to its own cluster and then repeatedly selecting and merging pairs of clusters, until a single all-inclusive cluster is obtained. The key parameter in agglomerative algorithms is the method used to determine the pair of clusters to be merged at each step. In most agglomerative algorithms, this is accomplished by selecting the most similar pair of clusters, and numerous approaches have been developed for computing the similarity between two clusters.

The single-link scheme measures the similarity of two clusters by the maximum similarity between the documents from each cluster. That is, the similarity between two clusters Ci and Cj is given by

$$\mathrm{sim}_{\text{single}}(C_i, C_j) = \max_{d_i \in C_i,\ d_j \in C_j} \mathrm{sim}(d_i, d_j) \qquad (1)$$

In contrast, the complete-link scheme uses the minimum similarity between a pair of documents, one from each cluster, to measure the same similarity. That is,

$$\mathrm{sim}_{\text{complete}}(C_i, C_j) = \min_{d_i \in C_i,\ d_j \in C_j} \mathrm{sim}(d_i, d_j) \qquad (2)$$

In general, neither the single-link nor the complete-link approach works very well, because they either base their decisions on a limited amount of information (single-link) or assume that all the documents in the cluster are very similar to each other (complete-link). We overcome these problems by measuring the similarity of two clusters as the average of the pairwise similarities of the documents from each cluster. That is,

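$$\mathrm{sim}_{\text{average}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{d_i \in C_i,\ d_j \in C_j} \mathrm{sim}(d_i, d_j) \qquad (3)$$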
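A minimal sketch of the three cluster-similarity schemes and of the agglomerative merge loop may help make the description concrete. It assumes documents are plain term-weight tuples and uses cosine similarity as the document similarity; the paper does not state which measure it uses, so `cosine`, the function names, and the toy vectors are illustrative assumptions only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two term-weight vectors (an assumed document similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_link(ci, cj, sim=cosine):
    """Maximum similarity over all cross-cluster document pairs."""
    return max(sim(a, b) for a in ci for b in cj)


def complete_link(ci, cj, sim=cosine):
    """Minimum similarity over all cross-cluster document pairs."""
    return min(sim(a, b) for a in ci for b in cj)


def average_link(ci, cj, sim=cosine):
    """Average similarity over all cross-cluster document pairs."""
    return sum(sim(a, b) for a in ci for b in cj) / (len(ci) * len(cj))


def agglomerate(docs, linkage=average_link):
    """Merge the most similar pair of clusters until a single cluster remains.

    Returns the merges in order, each as a pair of tuples of document indices.
    """
    clusters = {i: [d] for i, d in enumerate(docs)}  # cluster id -> member vectors
    members = {i: (i,) for i in clusters}            # cluster id -> document indices
    merges = []
    next_id = len(docs)
    while len(clusters) > 1:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((members[a], members[b]))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return merges


docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.2, 0.8)]
merges = agglomerate(docs)
```

The sequence of merges defines the tree: each merge creates an internal node whose children are the two merged clusters. This naive loop recomputes every pairwise linkage at each step, which is adequate for illustration but not for collections of realistic size.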

The nodes of the resulting hierarchy represent clusters of Web pages that form thematic categories. By exploiting this taxonomy, a mapping can be obtained between the Web pages and the categories that each page is assigned to. Moreover, the most important terms for each category can be extracted, and be used for descriptive labeling of the category. For the sake of brevity we choose to label each category using a numeric coding scheme, representing the path from the root to the category node, e.g. "1.4.8.19" where "1" corresponds to the root of the tree.
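The numeric path labels make it straightforward to move up the hierarchy, which is what a method that ascends the directory needs. The helpers below are a small sketch; the function names are hypothetical, and the only assumption is the dot-separated label format described above.

```python
def parent(label):
    """Return the label of the parent category, or None for the root."""
    parts = label.split(".")
    return ".".join(parts[:-1]) if len(parts) > 1 else None


def ancestors(label):
    """Return all ancestor labels, ordered from the root down to the parent."""
    parts = label.split(".")
    return [".".join(parts[:k]) for k in range(1, len(parts))]


def level(label):
    """Return the depth of a category, counting the root as level 1."""
    return len(label.split("."))
```

For the label "1.4.8.19", `parent` gives "1.4.8" and `ancestors` gives the path "1", "1.4", "1.4.8".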

This document clustering approach has the following advantages: first a hierarchical classification of Web documents is constructed without any human expert intervention or other external knowledge; second the dimensionality of the space is significantly reduced since we are now examining the page categories instead of the pages themselves; and third the thematic categorization is directly related to the preferences and interests of the users, i.e. the pages they have chosen to visit.

The third stage of preprocessing involves the extraction of access sessions. An access session is a sequence of log entries, i.e., accesses to Web pages by the same IP address, where the time between two consecutive entries does not exceed a given threshold. In our approach, pages are mapped onto thematic categories and therefore an access session is translated into a sequence of categories. Access sessions are the main input to the pattern discovery phase, and are extracted as follows:

1. Grouping the logs by date and IP address.

2. Selecting a time-frame within which two records from the same IP address can be considered to belong in the same access session.

3. Grouping the Web pages (thematic categories) accessed by the same IP address within the selected time-frame to form a session.
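The three steps above can be sketched as follows. The record layout `(ip, timestamp, category)`, the function name, and the 30-minute default for the time-frame are assumptions made for illustration; the text does not fix a specific value.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def extract_sessions(records, time_frame=timedelta(minutes=30)):
    """Split (ip, timestamp, category) records into access sessions.

    Returns a list of sessions, each a list of categories in access order.
    """
    # Step 1: group the log entries by date and IP address.
    by_key = defaultdict(list)
    for ip, ts, category in records:
        by_key[(ts.date(), ip)].append((ts, category))

    sessions = []
    for key in sorted(by_key):
        entries = sorted(by_key[key])  # chronological order within the group
        current = [entries[0][1]]
        # Steps 2 and 3: start a new session when the gap exceeds the time-frame.
        for (prev_ts, _), (ts, category) in zip(entries, entries[1:]):
            if ts - prev_ts > time_frame:
                sessions.append(current)
                current = []
            current.append(category)
        sessions.append(current)
    return sessions


start = datetime(2024, 1, 15, 9, 0)
records = [
    ("10.0.0.1", start, "1.3.5"),
    ("10.0.0.1", start + timedelta(minutes=5), "1.7.8"),
    ("10.0.0.1", start + timedelta(hours=2), "1.3.5"),
    ("10.0.0.2", start + timedelta(minutes=1), "1.4"),
]
sessions = extract_sessions(records)
```

Here the first address yields two sessions, because its third access comes almost two hours after the second, and the second address yields one.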

Finally, access sessions are translated into binary feature vectors. Each feature in the vector represents the presence of a category in that session.
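As a sketch of this encoding, the function below maps each session to a 0/1 vector over the sorted set of categories seen in the data. The function name and the toy sessions are hypothetical.

```python
def to_binary_vectors(sessions):
    """Encode sessions as binary vectors over the sorted list of observed categories."""
    categories = sorted({c for session in sessions for c in session})
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for session in sessions:
        vector = [0] * len(categories)
        for c in session:
            vector[index[c]] = 1  # presence only; repeated visits are not counted
        vectors.append(vector)
    return categories, vectors


toy_sessions = [["1.3.5", "1.7.8", "1.3.5"], ["1.4"], ["1.4", "1.7.8"]]
categories, vectors = to_binary_vectors(toy_sessions)
```

Note that the order of accesses and the number of visits to a category are both discarded by this representation.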

3.2 Extraction of Web Community Directories

Once the data have been translated into feature vectors, they are used to discover patterns of interest, in the form of community models. This is done by the Directory Miner (DM), an enhanced version of the cluster mining algorithm.

Cluster mining discovers patterns of common behavior by looking for all maximal fully-connected subgraphs (cliques) of a graph that represents the users' characteristic features, i.e., thematic categories in our case. The method starts by constructing a weighted graph G(A, E, WA, WE). The set of vertices A corresponds to the descriptive features used in the input data. The set of edges E corresponds to feature co-occurrence as observed in the data. For instance, if the user visits pages belonging to categories "1.3.5" and "1.7.8" an edge is created between the relevant vertices. The weights on the vertices WA and the edges WE are computed as the feature occurrence and co-occurrence frequencies respectively.
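The graph construction and clique search can be sketched as below. Two assumptions are made that the text does not confirm: frequencies are taken relative to the number of sessions, and cliques are enumerated with a basic Bron-Kerbosch recursion. The function names and toy sessions are hypothetical, and this is not the Directory Miner itself, which additionally ascends the category hierarchy.

```python
from collections import Counter
from itertools import combinations


def build_graph(sessions):
    """Return (vertex_weights, edge_weights) as occurrence and co-occurrence frequencies."""
    n = len(sessions)
    vertex_counts, edge_counts = Counter(), Counter()
    for session in sessions:
        cats = sorted(set(session))                 # each category counts once per session
        vertex_counts.update(cats)
        edge_counts.update(combinations(cats, 2))   # edges stored as sorted pairs
    w_a = {a: count / n for a, count in vertex_counts.items()}
    w_e = {e: count / n for e, count in edge_counts.items()}
    return w_a, w_e


def maximal_cliques(vertices, edges):
    """Enumerate all maximal cliques with the Bron-Kerbosch algorithm (no pivoting)."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & neighbours[v], x & neighbours[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(vertices), set())
    return sorted(cliques)


toy_sessions = [["1.3.5", "1.7.8"], ["1.3.5", "1.7.8", "1.4"], ["1.4", "1.9"], ["1.3.5"]]
w_a, w_e = build_graph(toy_sessions)
cliques = maximal_cliques(w_a, w_e)
```

In this toy data the categories "1.3.5", "1.4" and "1.7.8" form one clique and "1.4" with "1.9" forms another; each clique is a candidate community model.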

4 Experimental Results

The methodology introduced in this paper for the construction of Web directories has been tested in the context of a research project, which focuses on the analysis of usage data from the proxy server logs of an Internet Service Provider.

5 Conclusions and Future Work

This paper has introduced the concept of a Web Community Directory, as a Web Directory that specializes to the needs and interests of particular user communities.

Furthermore, it presented a novel methodology for the construction of such directories, with the aid of document clustering and Web usage mining. In this case, user community models take the form of thematic hierarchies and are constructed by a cluster mining algorithm, which has been extended to take advantage of an existing directory, and ascend its hierarchical structure. The initial directory is generated by a document clustering algorithm, based on the content of the pages appearing in an access log.
