Data Mining Using Reverse Nearest Neighbor Search


ABSTRACT

Reverse nearest neighbor (RNN) queries are useful in identifying objects that are of significant influence or importance. Existing methods rely on pre-computation of nearest neighbor distances, do not scale well with high dimensionality, or do not produce exact solutions. In this work we motivate and investigate the problem of reverse nearest neighbor search on high dimensional, multimedia data. We propose several heuristic algorithms to approximate the optimal solution and reduce the computational complexity. We propose exact and approximate algorithms that do not require pre-computation of nearest neighbor distances, and can potentially prune off most of the search space. We demonstrate the utility of reverse nearest neighbor search by showing how it can help improve classification accuracy.

With the increasing presence and adoption of Web services on the World Wide Web, the demand for efficient Web service quality evaluation approaches is becoming unprecedentedly strong. To avoid expensive and time-consuming Web service invocations, this paper proposes a collaborative Quality-of-Service (QOS) prediction approach for Web services that takes advantage of the past Web service usage experiences of service users. We first apply the concept of user-collaboration for Web service QOS information sharing. Then, based on the collected QOS data, a neighborhood-integrated approach is designed for personalized Web service QOS value prediction. Comprehensive experimental studies show that our proposed approach achieves higher prediction accuracy than other approaches. The public release of our Web service QOS dataset provides valuable real-world data for future research. Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. We also propose an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information.

INTRODUCTION

The nearest neighbor (NN) search has long been accepted as one of the classic data mining methods, and its role in classification and similarity search is well documented. Given a query object, nearest neighbor search returns the object in the database that is the most similar to the query object. Similarly, the K nearest neighbor search, or K-NN search, returns the k most similar objects to the query. NN and K-NN search problems have applications in many disciplines: information retrieval (find the most similar website to the query website), GIS (find the closest hospital to a certain location), etc. In contrast to nearest neighbor search, the related but computationally more demanding problem of reverse nearest neighbor (RNN) search has received far less attention. Our heuristic algorithms include Successive Selection, Greedy, Quasi-Greedy and K-Best.
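To make the basic operation concrete, here is a minimal brute-force sketch in Python; it is not tied to any particular system, and the function names, the choice of Euclidean distance, and the toy points are our own illustrative assumptions.

import math

def euclidean(a, b):
    # Euclidean distance between two equal-length numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_search(database, query, k):
    # Brute-force k-NN: score every object and keep the k smallest distances.
    return sorted(database, key=lambda obj: euclidean(obj, query))[:k]

def nn_search(database, query):
    # Nearest neighbor search is simply the case k = 1.
    return knn_search(database, query, 1)[0]

# Toy usage with 2-D points
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
print(knn_search(points, (0.9, 0.9), 2))   # -> [(1.0, 1.0), (0.0, 0.0)]

In practice an index structure replaces the linear scan over the database; the cost of that scan is exactly what motivates the indexing and pruning techniques discussed later.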

Web services are self-described software applications designed to support interoperable machine to machine interaction over a network via standard interfaces and communication protocols. Strongly promoted by the leading industrial companies, Web services have been widely employed in a lot of domains. Quality of Service (QOS) is usually employed to describe the non-functional characteristics of Web services. With the growing presence and adoption of Web services on the World Wide Web, QOS has become an important selling and differentiating point of the functionally equivalent Web services.

Accurate QOS values of Web services are required for these QOS-based approaches to work well. To address the fundamental problem of how to obtain the Web service QOS values, effective and efficient Web service QOS value obtaining approaches are urgently needed.

The QOS values of Web services can be measured either at the server-side or at the client-side. QOS values measured at the server-side (e.g., price, popularity, etc.) are usually advertised by the service providers and identical for different users, while QOS values measured at the client-side (e.g., response time, throughput, availability, etc.) can vary widely among users influenced by the unpredictable Internet connections and the heterogeneous user environments. To obtain accurate and personalized client-side Web service QOS values for different service users, client-side Web service evaluations are usually needed.

However, conducting real-world Web service evaluation at the client-side is difficult and sometimes even impossible. This is because: (a) Web service invocations may be charged, since the Web services are usually provided and hosted by other organizations. Even if the Web services are free, executing real-world Web service invocations for evaluation purposes consumes resources of service providers and imposes costs on service users. (b) It is time-consuming and impractical for service users to evaluate all the Web service candidates, since there are a lot of Web services on the Internet. (c) Service users are usually not experts in Web service evaluation, and the common time-to-market constraints make in-depth evaluations of the target Web services difficult.

Without sufficient client-side evaluations, accurate Web service QOS values cannot be obtained. It is thus difficult for various QOS-based approaches, which employ these QOS values as input, to work well. To attack this critical challenge, we propose a neighborhood-integrated matrix factorization (NIMF) approach for collaborative and personalized Web service QOS value prediction. The idea is that client-side Web service QOS values of a service user can be predicted by taking advantage of the past Web service usage experiences of other service users.

To encourage QOS value sharing among service users (usually developers of service-oriented systems), a framework is proposed based on a key concept of Web 2.0, i.e., user-collaboration. In this framework, the users are encouraged to contribute their individually observed Web service QOS information in exchange for accurate and personalized Web service QOS prediction. Employing the Web service QOS values from different users, our neighborhood-integrated matrix factorization (NIMF) approach first finds a set of similar users for the current user by calculating user similarities. Then, the NIMF approach employs both the local information of similar users and the global information of all available QOS values to fit a factor model, and uses this factor model to make personalized Web service QOS predictions.

Given a query object, reverse nearest neighbor search finds all objects in the database whose nearest neighbor is the query object. Note that since the NN relation is not symmetric, the NN of a query object might differ from its RNN(s). Object B being the nearest neighbor of object A does not automatically make it the reverse nearest neighbor of A, since A may not be the nearest neighbor of B. The problem of reverse nearest neighbor search has many practical applications in areas such as business impact analysis and customer profile analysis.

An example of business impact analysis using RNN search can be found in the selection of a location for a new supermarket. In the decision process of where to locate a new supermarket, we may wish to evaluate the number of potential customers who would find this store to be the closest supermarket to their homes. Another example can be found in the area of customer profiling. Emerging technologies on the internet have allowed businesses to push information such as news articles or advertisements to their customer’s browser.

An effective marketing strategy will attempt to filter this data, so that only those items of interest to the customer will be pushed. A customer who is flooded with information that is not of interest to them will soon stop viewing the data or stop using the service. In this example, an RNN search can be used to identify the customer profiles that find the information closest to their liking. It is interesting to note that the examples above present a slight variation of the RNN problem known as the bichromatic RNN. In the bichromatic problem, we consider two classes of objects in our set S. The set has been subdivided into supermarkets and potential customers in the first example, and information/products and internet viewers in the second example. We are interested in distances between objects from different classes.

On the other hand, the Monochromatic RNN search is one in which all objects in the database are treated the same and we are interested in similarities between them. For example, in the medical domain, doctors might wish to identify all patients who exhibit certain medical conditions such as heartbeat patterns similar to a specific patient or a specific prototypical heartbeat pattern. In all the examples described above, it is important to note the distinction between (k-) NN queries, range queries, and RNN queries.

Nearest Neighbor Search: Concepts and Algorithm

The nearest-neighbor (NN) problem occurs in the literature under many names, including the best match or the post office problem. The problem is of significant importance to several areas of computer science, including pattern recognition, searching in multimedia data, vector compression, computational statistics, and data mining. For many of these applications, including some described in this work, large amounts of data are available. This makes nearest-neighbor approaches particularly appealing, but on the other hand it increases the concern regarding the computational complexity of NN search. Thus it is important to design algorithms for nearest-neighbor search, as well as for the related classification, regression, and retrieval tasks, which remain efficient even as the number of points or the dimensionality of the data grows large. This is a research area on the boundary of a number of disciplines: computational geometry, algorithmic theory, and application fields such as machine learning. Below we define the exact and approximate nearest-neighbor search problems, and briefly survey a number of popular data structures and algorithms developed for these problems.

1.2. Heuristic Algorithms

The most important among the variety of topics related to computation are algorithm validation, complexity estimation and optimization. A wide part of theoretical computer science deals with these tasks. The complexity of tasks is generally examined by studying the most relevant computational resources, such as execution time and space. Classifying the problems that are solvable with a given limited amount of time and space into well-defined classes is an intricate task, but it can save a great deal of the time and money spent on algorithm design. A vast collection of papers has been dedicated to algorithm development. We do not discuss precise definitions of algorithm and complexity here. Modern problems tend to be very intricate and to involve the analysis of large data sets. Even if an exact algorithm can be developed, its time or space complexity may turn out to be unacceptable.

In reality, however, it is often sufficient to find an approximate or partial solution. Relaxing the requirement of exactness extends the set of techniques available to cope with the problem. We discuss heuristic algorithms, which provide approximations to the solution of optimization problems. In such problems the objective is to find the best of all possible solutions, that is, one that minimizes or maximizes an objective function. The objective function is a function used to evaluate the quality of a generated solution. Many real-world issues are easily stated as optimization problems.

The collection of all possible solutions for a given problem can be regarded as a search space, and optimization algorithms, in turn, are often referred to as search algorithms. Approximate algorithms entail the interesting issue of estimating the quality of the solutions they find.

Taking into account that normally the optimal solution is unknown, this problem can be a real challenge involving strong mathematical analysis. In connection with the quality issue, the goal of a heuristic algorithm is to find as good a solution as possible for all instances of the problem. There are general heuristic strategies that have been successfully applied to a wide range of problems.

Algorithms and complexity

It is difficult to imagine the variety of existing computational tasks and the number of algorithms developed to solve them. Algorithms that either give nearly the right answer or provide a solution for only some instances of the problem are called heuristic algorithms. This group includes a broad spectrum of methods based on traditional techniques as well as on problem-specific ones. To begin, we summarize the main principles of traditional search algorithms. The simplest search algorithm is exhaustive search, which tries all possible solutions from a predetermined set and then picks the best one. Local search is a version of exhaustive search that focuses only on a limited area of the search space. Local search can be organized in different ways.

Popular hill-climbing techniques belong to this class. Such algorithms repeatedly replace the current solution with the best of its neighbors if that neighbor is better than the current one. For example, heuristics for the problem of intergroup replication in a peer-to-peer multimedia distribution service are based on a hill-climbing strategy. Divide-and-conquer algorithms try to split a problem into smaller problems that are easier to solve. Solutions of the small problems must be combinable into a solution for the original one. This technique is effective, but its use is limited because not many problems can be easily partitioned and recombined in this way.
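As a rough illustration of the hill-climbing idea, the following generic Python sketch repeatedly moves to the best neighbor while it improves the objective; the way neighbors and the objective are supplied is our own assumption, and the sketch is not taken from the replication heuristic cited above.

import random

def hill_climb(initial, neighbors, objective, max_steps=1000):
    # Repeatedly replace the current solution with its best neighbor
    # as long as that neighbor improves the objective (maximization).
    current = initial
    for _ in range(max_steps):
        candidates = neighbors(current)
        best = max(candidates, key=objective) if candidates else None
        if best is None or objective(best) <= objective(current):
            return current   # local optimum reached
        current = best
    return current

# Toy usage: maximize f(x) = -(x - 7)^2 over the integers, stepping by +/- 1
f = lambda x: -(x - 7) ** 2
step = lambda x: [x - 1, x + 1]
print(hill_climb(random.randint(-100, 100), step, f))   # -> 7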

The branch-and-bound technique is a controlled enumeration of the search space: it enumerates candidate solutions, but constantly tries to rule out parts of the search space that cannot contain the best solution. Dynamic programming is an exhaustive search that avoids re-computation by storing the solutions of subproblems.

The key point in using this technique is formulating the solution process as a recursion. A popular method for constructing the space of solutions successively is the greedy technique, which is based on the simple principle of taking the (locally) best choice at each stage of the algorithm in order to find the global optimum of some objective function. Usually heuristic algorithms are used for problems that cannot be easily solved. Classes of time complexity are defined to distinguish problems according to their "hardness". Class P consists of all those problems that can be solved on a deterministic Turing machine in time polynomial in the size of the input. Turing machines are an abstraction used to formalize the notions of algorithm and computational complexity.
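The small sketch below illustrates these two ideas on a toy coin-change problem of our own choosing (not from the text): the solution is written as a recursion, memoization stores subproblem results so they are never re-computed, and the comment contrasts the result with a naive greedy choice.

from functools import lru_cache

def min_coins(amount, coins=(1, 3, 4)):
    # Dynamic programming: formulate the solution as a recursion and cache
    # the solutions of subproblems to avoid re-computation.
    @lru_cache(maxsize=None)
    def solve(a):
        if a == 0:
            return 0
        return min(1 + solve(a - c) for c in coins if c <= a)
    return solve(amount)

# min_coins(6) -> 2 (3 + 3); a greedy strategy that always takes the largest
# coin first would use 3 coins (4 + 1 + 1), i.e., the locally best choice
# does not always lead to the global optimum.
print(min_coins(6))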

Class NP consists of all those problems whose solution can be verified, or found on a non-deterministic Turing machine, in polynomial time. Since such a machine does not exist, in practice this means that an exponential algorithm can be written for an NP problem; nothing is asserted about whether a polynomial algorithm exists or not. A subclass of NP, the class of NP-complete problems, includes problems such that a polynomial algorithm for one of them could be transformed into polynomial algorithms for all other NP problems. Finally, the class NP-hard can be understood as the class of problems that are NP-complete or harder. NP-hard problems have the same trait as NP-complete problems, but they do not necessarily belong to class NP; that is, the class NP-hard also includes problems for which no algorithm at all can be provided.

1.3. K-Best Algorithm

The K-Best algorithm disables M − K monitors in a single step, based on their performance in the All-On stage. It starts from the All-On configuration and calculates the maximum achievable β and the optimal traffic assignment γ^y_ij. It then ranks all monitors in ascending order using one of the following metrics and directly disables the top M − K monitors:

Least-utility (Σ_y p_ij γ^y_ij I_y). We disable the monitors with the least measurement utility. Since measurement utility is the same as our optimization objective, we expect this metric to achieve the best β.

Least-traffic (Σ_y γ^y_ij φ_y). The intuition behind this metric is that the monitors with the least amount of traffic passing through them are also expected to contribute the least to the overall measurement utility.

Least-importance (Σ_y γ^y_ij I_y). This metric only considers the flowset importance, regardless of the sampling rate. It treats all flowsets as having the same traffic demand and all monitors as having the same sampling rate.

Least-rate (p_ij). We disable the monitors with the least sampling rates, since they are the least capable.

Least-neighbor (Σ_{k:(k,i)∈E} 1 + Σ_{k:(j,k)∈E} 1). From a topology perspective, the monitors that are the least connected are also likely to provide the least amount of freedom to MMPR for routing optimization.

The K-Best algorithm greatly saves computation time since only two LP problems are involved. The first LP decides the γ^y_ij for the All-On stage. Ranked in ascending order using one of the above metrics, the top M − K monitors are disabled. Then, with these M − K monitors turned off, a second LP is solved to maximize β using MeasuRouting. However, since K-Best ranks the importance of each monitor based on metrics evaluated only in the initial All-On stage, its measurement gain is expected to diverge from the optimal.
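A minimal sketch of this one-shot selection step is shown below in Python; the monitor identifiers and metric values are illustrative placeholders, and the two LP solves that produce γ^y_ij and the final β are deliberately not shown, since they would require an LP solver.

def k_best_selection(monitors, metric_value, k):
    # monitors: list of monitor identifiers, e.g. directed links (i, j)
    # metric_value: dict monitor -> ranking metric computed from the All-On
    #   LP solution (e.g. the least-utility metric above)
    # Rank in ascending order and disable the M - K monitors with the
    # smallest metric values; the remaining K monitors stay enabled.
    ranked = sorted(monitors, key=lambda m: metric_value[m])
    disabled = set(ranked[: len(monitors) - k])
    kept = set(ranked[len(monitors) - k:])
    return kept, disabled

# Toy usage with made-up utilities for five monitors, keeping K = 2
utility = {("a", "b"): 0.9, ("b", "c"): 0.1, ("c", "d"): 0.4,
           ("d", "e"): 0.05, ("a", "c"): 0.7}
kept, disabled = k_best_selection(list(utility), utility, k=2)
print(kept)   # -> {('a', 'b'), ('a', 'c')}, the two highest-utility monitors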

Using a fixed number of proofs to approximate the probability is fundamental when many queries have to be evaluated, because it allows controlling the overall complexity. The k-best algorithm uses branch and bound to find the k most probable explanations, where k is a user-defined parameter. The algorithm records the k best explanations. Given a partial explanation, its probability (obtained by multiplying the probability of each atomic choice it contains) is an upper bound on the probability that a complete explanation extending it can achieve. Therefore, a partial explanation can be pruned if its probability falls below the probability of the k-th best explanation. Our implementation of the k-best algorithm interleaves tree expansion and pruning: a set of partial explanations is kept and iteratively expanded for some steps, and those whose upper bound is worse than the k-th best explanation are pruned. Once the proof tree has been completely expanded, the k best explanations are translated into a BDD to compute a lower bound of the probability of the query. This solution uses a meta-interpreter, while ProbLog uses a form of iterative deepening that builds derivations up to a certain probability threshold and then increases the threshold if k explanations have not been found. The meta-interpreter approach has the advantage of avoiding repeated resolution steps, at the expense of more complex bookkeeping.

Algorithm 1 Function solve

1: function solve(Goal, Explan)
2:   if Goal is empty then
3:     return 1
4:   else
5:     Let Goal = [G|Tail]
6:     if G = (\+ Atom) then
7:       Valid := solve([Atom], Explan)
8:       if Valid = 0 then
9:         return solve(Tail, Explan)
10:      else
11:        return 0
12:      end if
13:    else
14:      Let L be the list of couples (GL, Step), where GL is obtained by resolving
15:        Goal on G with a program clause C on head i with substitution θ,
16:        and Step = (C, θ, i)
17:      return sample_cycle(L, Explan)
18:    end if
19:  end if
20: end function

1.4 Successive Selection Algorithm

Algorithm 2 Successive Selection Algorithm

1: while more than K monitors are left do
2:   Maximize β by using all remaining monitors
3:   Find the corresponding γ^y_ij
4:   for each remaining monitor (i, j) ∈ M̂ do
5:     Calculate its performance metric for one of the five principles with γ^y_ij
6:   end for
7:   Disable the D monitors with the least performance metric
8: end while

The Successive Selection algorithm also starts from the initial All-On configuration with all M monitors and iteratively chooses D monitors to disable. Here, we use the same five metrics introduced in Section IV-B. The selection of which D monitors to disable is based on the ranking of the remaining monitors M̂ using one of the five metrics. In particular, it disables D monitors based on their ranking calculated from the previous iteration (Line 7). This means we use the information from the previous iteration (i.e., the planned routes γ^y_ij, etc.) to calculate the metric for each monitor in the current iteration (Line 5). Note that if the metric used is either the "least-rate" or the "least-neighbor" metric, both Successive Selection and K-Best will have the same selection of monitors and the same measurement gain, since these metrics do not involve γ^y_ij.

Greedy Algorithm

Similar to Successive Selection, the Greedy algorithm also disables D monitors in each iteration, until K monitors are left. However, it is more complicated, since it tests all remaining monitors M̂ in each iteration. In order to test a monitor, it re-computes the maximized β after turning that monitor off (Lines 2-7), which essentially involves using MeasuRouting [12] (Line 4). Based on the testing of every remaining monitor, it disables the D of them that have the least impact on β (Line 8).

Algorithm 3 Greedy Algorithm

1: while more than K monitors are left do
2:   for each remaining monitor (i, j) ∈ M̂ do
3:     Disable the monitor
4:     Maximize β based on the remaining monitors
5:     Store β
6:     Enable the monitor
7:   end for
8:   Find the D monitors in M̂ with the largest β when they are disabled
9:   M̂ ← M̂ \ {(i, j) ∈ D}
10: end while

Since the Greedy algorithm exhaustively tests individual monitors at each iteration, its performance is expected to be close to the optimal solution. It is still suboptimal, since it tests individual monitors instead of every possible combination. However, the algorithm remains computationally costly, since it tests O(|M̂|) monitors by solving O(|M̂|) LP problems in each iteration. To reduce the computation time, we propose a more lightweight algorithm called "Quasi-Greedy", a variant of the Greedy algorithm. Instead of testing every remaining monitor, Quasi-Greedy only tests a fraction λ of them as candidates, where 0 < λ < 1. We use C to denote the candidate set.

Algorithm 4 Quasi-Greedy Algorithm (λ)

1: while more than K monitors are left do
2:   Maximize β by using all remaining monitors
3:   Calculate the measurement utility of each monitor (i, j) ∈ M̂
4:   Choose a fraction λ of the remaining monitors (i, j) ∈ M̂ as the candidate set C
5:   for each candidate monitor ∈ C do
6:     Disable the monitor
7:     Maximize β based on the remaining monitors
8:     Store β
9:     Enable the monitor
10:  end for
11:  Find the D monitors ∈ C with the largest β when they are disabled
12:  M̂ ← M̂ \ {(i, j) ∈ D}
13: end while

The candidates C are chosen based on the least-utility metric (Line 4), where utility is defined as Σ_y p_ij γ^y_ij I_y; it benchmarks how much utility a monitor measures (Line 3). In each iteration, the Quasi-Greedy algorithm re-computes the corresponding β by turning off the remaining monitors in C one by one, to find the D least important monitors to disable (Lines 5-11). It then removes these chosen D monitors from the remaining monitor set M̂ (Line 12).
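The sketch below mirrors this loop in simplified Python, under our own assumptions: utility() and maximize_beta() are placeholders for the metric computation and for the LP/MeasuRouting step, which are not reproduced here.

import math

def quasi_greedy(monitors, k, d, lam, utility, maximize_beta):
    # monitors: initially enabled monitors; k: monitors to keep
    # d: monitors disabled per iteration; lam: candidate fraction (0 < lam < 1)
    # utility(m, remaining): measurement utility of monitor m under the
    #   current solution; maximize_beta(remaining): maximized beta for a set
    remaining = set(monitors)
    while len(remaining) > k:
        n_cand = max(d, math.ceil(lam * len(remaining)))
        # candidates: the lam fraction of remaining monitors with least utility
        candidates = sorted(remaining, key=lambda m: utility(m, remaining))[:n_cand]
        # test each candidate: beta achievable if that candidate alone is disabled
        score = {m: maximize_beta(remaining - {m}) for m in candidates}
        # disable the d candidates whose removal hurts beta the least
        n_disable = min(d, len(remaining) - k)
        for m in sorted(score, key=score.get, reverse=True)[:n_disable]:
            remaining.discard(m)
    return remaining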

1.5. Sensitivity Analysis of Heuristic Algorithms

Due to the potentially long computation times required to solve for the optimal, we propose several heuristic algorithms to reduce the computational complexity. They are categorized as "K-Best", "Successive Selection", "Greedy" and "Quasi-Greedy". We omit performance results for Greedy since it is computationally too costly. We first compare the K-Best algorithms [1] (using different metrics) with the optimal solution in Figure 1. As expected, using the least-utility metric achieves the best β (very close to optimal) in all three topologies. It achieves 2.36X (2.6/1.1) higher measurement gain compared to using the least-importance metric, with only a 1.1X (1.22/1.1) increase in computation time in AS6461.

As mentioned earlier, the computation time is only collected for the LP solver. Results show that using different ranking metrics leads to very similar computation times. From the perspective of an LP solver, an unsuitable monitor placement means either that more steps are needed to achieve the optimal β (which is more time consuming), or that there is no way to achieve a very large β (which means a shorter solving time).

If we also consider the other numerical computations (i.e., computing each metric and ranking monitors by metric value), "least-utility", "least-traffic", and "least-importance" definitely take longer, since the calculation of these three metrics involves all flows and monitors. In contrast, "least-rate" and "least-neighbor" only need topology information. Figure 2 compares the Successive Selection algorithms with different ranking metrics, and the same trend is observed in the Abilene, AS6461, and GEANT networks. We omit the "least-rate" and "least-neighbor" cases since they have measurement gain similar to the corresponding K-Best cases. Successive Selection with the least-utility metric also achieves the best performance. Similar to Fig. 1, the three metrics share very close computation times (figures are omitted); the computation time mostly depends on the number of iterations, which is linear with respect to the number of monitors in the Successive Selection algorithm.

Fig 1. Performance of the K-Best algorithms for different networks (Abilene, AS6461 and GEANT).

Fig 2. Performance of the Successive Selection algorithms for different networks (Abilene, AS6461 and GEANT).

Fig 3. Performance of the Quasi-Greedy algorithm for AS6461.

Finally, we compare the Quasi-Greedy algorithm (with different λ values) against the optimal solution in Figure 3. Since Quasi-Greedy is still computationally intensive, we only present results for AS6461[4]. Note that there are no obvious improvements on measurement gain for larger λ’s. However, the computation time increases substantially with larger λ’s. This implies that even with a smaller number of candidates, the Quasi-Greedy algorithm can perform very close to the optimal and saves computation time.

Fig 4. Comparison of the heuristic algorithms for different networks (Abilene, AS6461 and GEANT).

1.5.1 Comparing K-Best, SS, and QG

In this section, we compare all three heuristic algorithms with the optimal solution. Results from the previous section show that "least-utility" is the most effective metric for ranking the importance of monitors. We therefore adopt "least-utility" metric as a basis for comparing the K-Best, Successive Selection, and Quasi-Greedy methods.

1.5.2 Neighborhood-Integrated Matrix Factorization (NIMF)

A popular approach to predict missing values is to fit a factor model to the user-item matrix, and use this factor model to make further predictions. The premise behind a low-dimensional factor model is that there is a small number of factors influencing the QOS usage experiences, and that a user’s QOS usage experience on a Web service is determined by how each factor applies to the user and the Web service.


Fig 5. NIMF QOS Prediction Framework.

Consider an m × n user-item matrix R. The matrix factorization method employs a rank-l matrix X = U^T V to fit it, where U ∈ R^{l×m} and V ∈ R^{l×n}. From this definition, we can see that the low-dimensional matrices U and V are unknown and need to be estimated. Moreover, these feature representations have clear physical meanings. In this linear factor model, a user's Web service QOS values correspond to a linear combination of the factor vectors, with user-specific coefficients. More specifically, each column of U serves as a "factor vector" for a user, and each column of V is a linear predictor for a Web service, predicting the entries in the corresponding column of the user-item matrix R based on the "factors" in U.

The number of factors (in other words, the length of the "factor vector") is called the dimensionality. By adding constraints on the norms of U and V to penalize large values of U and V, we have the following optimization problem:

min_{U,V} (1/2) Σ_{i=1}^{m} Σ_{j=1}^{n} I^R_{ij} (R_{ij} − U_i^T V_j)^2 + (λ_U/2) ||U||^2_F + (λ_V/2) ||V||^2_F,

where I^R_{ij} is the indicator function that is equal to 1 if user u_i invoked Web service v_j and equal to 0 otherwise, || · ||^2_F denotes the Frobenius norm, and λ_U and λ_V are two parameters. This optimization problem minimizes the sum-of-squared-errors objective function with quadratic regularization terms. It also has a probabilistic interpretation with Gaussian observation noise. The above approach utilizes the global information of all the available QOS values in the user-item matrix for predicting the missing values. This approach is generally effective at estimating the overall structure (global information) that relates simultaneously to all users or items.

However, this model is poor at detecting strong associations among a small set of closely related users or items (local information), which is precisely where neighborhood models perform better. Normally, the available Web service QOS values in the user-item matrix are very sparse; hence, neither the matrix factorization approach nor the neighborhood-based approach alone can generate optimal QOS value predictions.

In order to preserve both the global information and the local information mentioned above, we employ a balance parameter to fuse these two types of information. The idea is that every time we factorize a QOS value, we treat it as the ensemble of the user's own information and the user's neighbors' information. The neighbors of the current user can be obtained by employing the PCC-based similarity computation described below. Hence, we minimize a sum-of-squared-errors objective function with quadratic regularization terms that combines both sources of information.
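As a minimal sketch of the factor-model part only, the following numpy code fits the global matrix factorization term by plain gradient descent; the neighborhood term and the balance parameter of NIMF are omitted, and the learning rate, dimensionality and toy data are our own assumptions.

import numpy as np

def matrix_factorization(R, mask, l=2, lam=0.1, lr=0.01, iters=2000, seed=0):
    # R: m x n user-item matrix of observed QOS values; mask: 1 where observed.
    # Fits X = U^T V by minimizing the squared error on observed entries
    # plus quadratic (Frobenius-norm) regularization on U and V.
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((l, m))
    V = 0.1 * rng.standard_normal((l, n))
    for _ in range(iters):
        E = mask * (R - U.T @ V)          # errors on observed entries only
        U -= lr * (-(V @ E.T) + lam * U)  # gradient step on U
        V -= lr * (-(U @ E) + lam * V)    # gradient step on V
    return U, V

# Toy usage: 4 users x 3 services with one unobserved entry per row (mask = 0)
R = np.array([[0.2, 1.1, 0.0], [0.3, 0.0, 2.0], [0.0, 1.0, 1.9], [0.25, 1.05, 2.1]])
mask = (R > 0).astype(float)
U, V = matrix_factorization(R, mask)
print((U.T @ V).round(2))   # predicted matrix, including the unobserved entries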

Fig 6. Examples

Research Problem Description

The process of Web service QOS value prediction usually involves a user-item matrix as shown in Figure 5(a), where each entry represents the value of a certain QOS property of a Web service observed by a service user. As shown in Figure 5(a), each service user has several response-time values for their invoked Web services. Similarities between two different users in the matrix can be calculated by analyzing their QOS values on the same Web services. The Pearson Correlation Coefficient (PCC) is usually employed for this similarity computation. As shown in the similarity graph in Figure 5(b), a total of 5 users (nodes u1 to u5) are connected by 10 edges. Each edge is associated with a PCC value in the range of [−1, 1] to specify the similarity between user ui and user uj, where a larger PCC value stands for higher similarity.

The symbol N/A means that the similarity between user ui and user uj is non-available, since they do not have any commonly invoked Web services. The problem we study in this paper is how to accurately predict the missing QOS values in the user item matrix by employing the available QOS values.
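A small sketch of how such a PCC similarity can be computed from two users' observed QOS records is shown below; the dictionary-based data layout and the function name are our own, and the similarity is reported as non-available (None) when there are no commonly invoked services.

import math

def pcc_similarity(qos_a, qos_b):
    # qos_a, qos_b: dicts mapping Web service id -> observed QOS value
    # (e.g. response time). PCC is computed only over commonly invoked services.
    common = set(qos_a) & set(qos_b)
    if not common:
        return None   # N/A: no commonly invoked Web services
    mean_a = sum(qos_a[s] for s in common) / len(common)
    mean_b = sum(qos_b[s] for s in common) / len(common)
    num = sum((qos_a[s] - mean_a) * (qos_b[s] - mean_b) for s in common)
    den = (math.sqrt(sum((qos_a[s] - mean_a) ** 2 for s in common)) *
           math.sqrt(sum((qos_b[s] - mean_b) ** 2 for s in common)))
    return num / den if den != 0 else 0.0

u1 = {"ws1": 0.3, "ws2": 1.2, "ws3": 5.0}
u2 = {"ws1": 0.4, "ws2": 1.0, "ws3": 4.8}
u3 = {"ws9": 2.0}
print(pcc_similarity(u1, u2))   # close to 1.0: very similar usage experience
print(pcc_similarity(u1, u3))   # None: similarity is non-available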

By predicting the Web service QOS values in the user item matrix, we can provide personalized QOS value prediction on the unused Web services for the service users, who can employ these Web service QOS values for making service selection, service ranking, automatic service composition, etc.

To obtain the missing values in the user-item matrix, we can employ the Web service QOS values observed by other service users to predict the Web service performance for the current user. However, since service users are in different geographic locations and under different network conditions, the current user may not experience QOS performance similar to that of other service users.


To address this challenging Web service QOS value prediction problem, we propose a neighborhood-integrated matrix factorization (NIMF) approach, which makes use of both the local information of similar users and the global information of all the available QOS values in the user-item matrix to achieve better prediction accuracy.

2.1. Complexity Analysis

The main computation of the gradient methods is to evaluate the objective function L and its gradients with respect to the variables. Because of the sparsity of the matrices R and S, the computational complexity of evaluating the objective function L is O(ρ_R l + ρ_R K l), where ρ_R is the number of nonzero entries in the matrix R and K is the number of similar neighbors. K is normally a small number, since a large K will introduce noise, which can hurt the prediction accuracy. The computational complexities of the gradients ∂L/∂U and ∂L/∂V in Eq. (7) are O(ρ_R K l + ρ_R K^2 l) and O(ρ_R l + ρ_R K l), respectively. Therefore, the total computational complexity of one iteration is O(ρ_R K l + ρ_R K^2 l), which indicates that, theoretically, the computational time of our method is linear with respect to the number of observations in the user-item matrix R. This complexity analysis shows that our proposed approach is very efficient and can scale to very large datasets.


2.2. Reverse Nearest Neighbor Search

We now describe our general approach for answering RNN queries. The algorithm proceeds in two steps: in the first step, a k-NN search is performed, returning the k closest data objects to the query; in the second step, these k objects are efficiently analyzed to answer the query. The value of k that should be used in the first step depends on the data set. Based on the values of average recall, we can choose the value of k we want to start with. The following subsections explain the second step in detail, with the necessary optimizations that can be used.

The most popular variant of nearest neighbor query is reverse nearest neighbor queries that focus on the inverse relation among points. A reverse nearest neighbor (RNN) query q is to find all the objects for which q is their nearest neighbor. A reverse nearest neighbor query is formally defined below.

Definition: Given a set of objects P and a query object q, a reverse nearest neighbor query finds the set RNN(q) = {p ∈ P : d(p, q) ≤ d(p, p′) for every p′ ∈ P, p′ ≠ p}; in other words, it returns every object that has q as its nearest neighbor.

The RNN set of a query q may be empty or may have one or more elements. Korn et al. [KM00] defined RNN queries and provided a large number of applications. For example, a two-dimensional RNN query may ask for the set of customers affected by the opening of a new store outlet at a given location, in order to inform the relevant customers. This query can also be used to identify the location that maximizes the number of potential customers. As another example, an RNN query may be issued to find the store outlets that are affected by opening a new store outlet at some specific location. Note that in the first example there are two different sets (stores and customers) involved in the RNN query, whereas in the second example there is only one set (stores).

Korn et al. defined two variants of RNN queries. A bichromatic query (the first example) finds the reverse nearest neighbors when the underlying data set consists of two different types of objects. A monochromatic RNN query (the second example) finds the reverse nearest neighbors when the data set contains only one type of object. The problem of reverse nearest neighbors has been extensively studied in the past few years. Below we briefly describe only the most popular and general algorithms.
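For illustration, a direct (brute-force) monochromatic implementation of the definition looks like the following sketch (Python, Euclidean distance, toy points of our own choosing); the indexed algorithms described below exist precisely to avoid this quadratic scan.

import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rnn_brute_force(P, q):
    # p belongs to RNN(q) iff q is at least as close to p as every other
    # object in P, i.e. q is p's nearest neighbor.
    result = []
    for p in P:
        if p == q:
            continue
        d_pq = dist(p, q)
        if all(dist(p, o) >= d_pq for o in P if o != p and o != q):
            result.append(p)
    return result

P = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
print(rnn_brute_force(P, (2.5, 0.0)))
# -> [(4.0, 0.0)]; (1.0, 0.0) is not an RNN because (0.0, 0.0) is closer to it than q.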


Korn and Muthukrishnan [KM00] answer RNN queries by pre-calculating, for each object p, a circle such that the nearest neighbor of p lies on the perimeter of the circle. The MBRs of all these circles are indexed by an R-tree called the RNN-tree. The RNN query is thus reduced to a point location query on the RNN-tree that returns all the circles containing q; every object whose circle contains q is a reverse nearest neighbor of q. Yang et al. [YL01] improved this method with the RdNN-tree. Similar to the RNN-tree, a leaf node of the RdNN-tree contains the circles of the data points. The intermediate nodes contain the minimum bounding rectangles (MBRs) of the underlying points, along with the maximum distance from every point in the sub-tree to its nearest neighbor.

2.3.1 Filtering step I

After the first step of the RNN algorithm we have k-NN of a query point. However, some of them can be ruled out from further elaboration by local distance calculations performed within the k objects, i.e., the candidate set. The principle of the elimination is based on the fact that a point that does not have the query point as the nearest neighbor among the candidate set can never be the RNN of the query point in the whole data set.

Therefore a point that has another candidate as its closest neighbor among those k-NN points is eliminated from the search (note that only the k objects are involved in this test). We refer to this step as filtering step I. Our experiments show that this step can efficiently filter out a significant number of candidates. For example, for 60-NN queries on our stock market time-series data, on average 50% of the 60 candidates are eliminated using only this filtering. In this step, we can also avoid some of the distance calculations (out of a possible k(k − 1)/2) by using an optimization based on comparison of the already computed distances.

This elimination makes use of the fact that a multi-dimensional distance calculation is typically more expensive than a single floating point distance value comparison. This is formalized by the following lemma.

Lemma 1. Let x and y be two of the k nearest neighbors of the query point q, and let d(x, y) denote the distance between data points x and y. If d(x, y) ≤ d(x, q) and d(x, q) ≤ d(y, q), then neither x nor y can be an RNN of q.
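A small sketch of this filtering step is given below; it assumes a dist function such as the Euclidean one used earlier, and that the candidate set is the list of k nearest neighbors returned by the first step.

def filter_step_one(candidates, q, dist):
    # Keep only those candidates x for which no other candidate y is at least
    # as close to x as q is; by Lemma 1 the eliminated points cannot be RNNs of q.
    survivors = []
    for x in candidates:
        d_xq = dist(x, q)
        if all(dist(x, y) > d_xq for y in candidates if y != x):
            survivors.append(x)
    return survivors

# Usage sketch: candidates = knn_search(data, q, k)
#               remaining = filter_step_one(candidates, q, euclidean)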


Fig. Flow of the proposed RNN algorithm: given a query object q and a query parameter k, a k-NN query on the index structure finds candidates for RNN(q); local distance calculations efficiently eliminate some candidates; a Boolean range query on the remaining candidates determines the actual RNN, giving the final RNN set.

Proposed Algorithm for RNN

After filtering step I, we have a set of candidate data points, each of which has the query point as its local NN (considering only the data points in the candidate set). For a point to be an RNN, we need to verify that the query point is its actual NN considering the whole data set. Two different approaches can be followed here. One crude approach is to find the global NN of each point and check whether it is the query point; if so, that point is an RNN, otherwise it is not. Another approach, which we refer to as filtering step II, is to run a new kind of query called the Boolean range query, which is covered in the following subsection.

2.3.2. Boolean Range Queries (Filtering step II)

A Boolean range query BRQ(q, r) returns true if the set {t ∈ S | d(q, t) < r} is not empty and returns false otherwise. A Boolean range query is a special case of a range query which returns either true or false depending on whether there is any point inside the given range. Boolean range queries can naturally be handled more efficiently than range queries, and this advantage is exploited in the proposed schemes. For each candidate point, we define a circular range with the candidate point as the center and the distance to the query point as the radius. If this Boolean range query returns false then the point is an RNN, otherwise the point is not an RNN. Range queries are typically more efficient than NN queries [14], and Boolean range queries are much more efficient than range queries. Our explanation here is based on an R-tree based index structure, but the general algorithm is not restricted to R-trees and can easily be applied to other indexing structures.

The pseudocode for the Boolean range query is given below.

Procedure booleanrangequery(node n, point q, radius r)
Input: node n to start the search; query point q; radius r, equal to the distance between the query point and the candidate point.
Output: true if at least one point lies inside the search region, false otherwise.
1: if n is a leaf node then
2:   if there exists a point p in n such that dist(p, q) <= r then return true
3: else (n is an internal node), for each branch do:
4:   if any node (MBR) intersects the search region such that at least one edge of the MBR is inside the search region, i.e., minmaxdist <= r, then return true
5:   else sort the intersecting MBRs with respect to some criterion, such as mindist, and traverse them in that order;
6:     if any of the MBRs contains a point inside the search region then return true
7: return false

Boolean Range Query
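The early-exit behavior that distinguishes a Boolean range query from an ordinary range query can also be shown with a simple brute-force sketch (no R-tree involved); dist is any distance function, and the is_rnn helper is our own illustration of how the query is used in filtering step II.

def boolean_range_query(points, center, radius, dist):
    # Returns True as soon as ANY point (other than the center itself) lies
    # strictly inside the region; it never collects the full answer set.
    return any(dist(p, center) < radius for p in points if p != center)

def range_query(points, center, radius, dist):
    # An ordinary range query must gather every point inside the region.
    return [p for p in points if p != center and dist(p, center) < radius]

def is_rnn(points, q, candidate, dist):
    # Candidate c is a true RNN of q iff no other point lies strictly inside
    # the circle centered at c with radius dist(c, q).
    return not boolean_range_query(points, candidate, dist(candidate, q), dist)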

Fig. Cases for a Boolean range query intersecting with a region, for MBRs A, B, C, D and E.

2.3.3. Boolean Range Query versus Range Query

In a range query, typically multiple MBRs intersect the query region. Especially when the number of dimensions is high, the number of MBRs intersecting the query region is also high. For range queries we have to access all the MBRs that intersect the search region. The main strength of the Boolean range query over the traditional range query is that even when multiple MBRs intersect the search region, we do not need to access all of them. The figure above shows five possible cases in which MBRs can partially or fully overlap a circular search region. The search region is denoted by a circle and the MBRs are denoted by A, B, C, D and E.

1. MBR A: fully contained in search region. This means that search region has at least one point.

2. MBR B: only one vertex of B is in search region.

3. MBR C: two of C's vertices are in search region. This means that an entire edge of MBR is in search region, which guarantees that at least one data point is in search region.

4. MBR D: no vertices of D are in search region.

5. MBR E: Search region is fully contained in the MBR E.

For a Boolean range query, if we can guarantee that at least one edge of an MBR is inside the search region then we don't need any page access (cases A and C in the figure above). This can easily be checked by calculating the minmaxdist, which is defined as the minimum of the maximum possible distances between a point (here the candidate point, i.e., the center of the search region) and the faces of the intersecting MBR. If minmaxdist is less than or equal to the radius of the search region, then there is at least one point in the intersecting MBR, so we do not need any page access.

For example, we found that for the Isolet data set, 35% of intersecting MBRs have this property. For the other cases B, D, and E we cannot say whether there is a point contained in both the search region and the MBR, so traversal of the corresponding branch may be necessary to answer the query if no relevant point was found previously. When a range query intersects a number of MBRs, it has to retrieve all of them and check whether they contain data points in the range. The answer to the range query is all the points that lie in the range.

Even if the query asks only for the average of the data objects in a given range, i.e., an aggregation query, it needs to read all the data points within the MBRs and check whether they lie within the given range. However, a Boolean range query can be answered without any page accesses, as described above, or with a minimal number of page accesses.

If a single relevant point is found in one of the MBRs, then we do not have to look for other points and we can safely say that the corresponding candidate is not an RNN. So, for the case of multiple MBRs intersecting the query region, a decision has to be made to maximize the chance of finding a point in an MBR, so that the query can be answered with a minimal number of MBR accesses. In choosing the MBRs, there are a number of possible choices:

1. Sort the MBRs with respect to their overlap with the search region and choose the MBR with the maximum overlap.

2. Sort the MBRs with respect to mindist and choose the MBR with the minimum mindist.

3. Choose randomly.

We tried all three possibilities experimentally and found that choices 1 and 2 are comparable and both are better than choice 3. Most of the time, the set of MBRs retrieved to memory in the first phase of the algorithm is enough to guarantee that a point in the candidate set is not an RNN. In this case, no additional I/O is needed in the second phase. Therefore, with almost no additional overhead on the k-NN algorithm, we can answer RNN queries.

2.3.4. Multiple Boolean Range Queries

In filtering step II we need to run multiple Boolean range queries, since multiple points remain after filtering step I. Since these candidates are very close to each other, multiple-query optimization becomes very natural and effective in this framework. Multiple queries are defined as a set of queries issued simultaneously, as opposed to single queries issued independently. There has been a significant amount of research investigating this approach for other kinds of queries. In our algorithm, a Boolean range query needs to be executed for each point in the candidate set. Since the radii of the queries are expected to be very close to each other (because they are all in the k-NN of the query), executing multiple queries simultaneously reduces the I/O cost significantly. First we read a single page for the whole set of queries. The criterion for deciding which page to access first is to retrieve the page that has the largest number of intersections with the queries. The performance gain of this approach is discussed next.

K-Nearest Neighbor Search

Introduction

kNN is a non-parametric, lazy learning algorithm. That is a pretty concise statement. When you say a technique is non-parametric, it means that it does not make any assumptions about the underlying data distribution. This is quite useful because, in the real world, most practical data does not obey the typical theoretical assumptions (e.g., Gaussian mixtures, linear separability, etc.). Non-parametric algorithms like kNN come to the rescue here.

It is also a lazy algorithm. What this means is that it does not use the training data points to do any generalization. In other words, there is no explicit training phase, or it is very minimal. This means the training phase is pretty fast. The lack of generalization means that kNN keeps all the training data; more exactly, all the training data is needed during the testing phase. (Well, this is an exaggeration, but not far from the truth.) This is in contrast to other techniques like SVM, where you can discard all non-support vectors without any problem. Most lazy algorithms, and kNN in particular, make decisions based on the entire training data set (in the best case, a subset of it).

The dichotomy is pretty obvious here: there is a nonexistent or minimal training phase but a costly testing phase. The cost is in terms of both time and memory. More time might be needed because, in the worst case, all data points might take part in the decision. More memory is needed because we need to store all the training data. Recall that in classification problems, the task is to decide to which of a predefined, finite set of categories an object belongs.

We now look at another classification method, known as the k-nearest-neighbours or kNN method.

We need a dataset of examples. Each example describes an instance and gives the class to which it belongs. As before, we'll assume instances are described by a set of attribute-value pairs, and there is a finite set of class labels L. So the dataset comprises examples of the form ⟨{A1 = a1, A2 = a2, ..., An = an}, class = cl⟩. For a particular instance x, we will refer to x's value for attribute Ai as x.Ai.

In the probabilistic methods that we looked at, the learning step involved computing probabilities from the dataset. Once this was done, in principle the dataset could be thrown away; classification was done using just the probabilities. In k-nearest-neighbours, the learning step is trivial: we simply store the dataset in the system's memory. Yes, that's it! In the classification step, we are given an instance q (the query), whose attributes we will refer to as q.Ai, and we wish to know its class. In kNN, the class of q is found as follows:

1. Find the k instances in the dataset that are closest to q.

2. These k instances then vote to determine the class of q.

Ties (whether they arise when finding the closest instances to q or when voting for the class of q) are broken arbitrarily.

In the following visualisation of this method, we assume there are only two attributes A1 and A2, and two different classes (where circles with solid fill represent instances in the dataset of one class, and circles with hashed fill represent instances of the other class). Query q is here being classified by its 3 nearest neighbours.

All that remains to do is discuss how distance is measured, and how the voting works.

3.1. Local distance functions, global distance functions and weights

A global distance function, dist, can be defined by combining in some way a number of local distance functions, distA, one per attribute.

The easiest way of combining them is to sum them:

dist(x, q) =def Σ_{i=1}^{n} dist_{Ai}(x.Ai, q.Ai)

More generally, the global distance can be defined as a weighted sum of the local distances. The weights wi allow different attributes to have different importance in the computation of the overall distance. Weights sometimes lie between zero and one; a weight of zero would indicate a totally irrelevant attribute.

dist(x, q) =def Σ_{i=1}^{n} wi × dist_{Ai}(x.Ai, q.Ai)

A weighted average is also common:

dist(x, q) =def (Σ_{i=1}^{n} wi × dist_{Ai}(x.Ai, q.Ai)) / (Σ_{i=1}^{n} wi)

The weights can be supplied by the system designer. There are also ways of learning weights from a dataset. What we haven’t discussed is how to define the local distance functions. We will exemplify the different definitions by computing the distance between query q and x1 and x2 below:

The attributes and their values are: sex (male/female), weight (between 50 and 150 inclusive), amount of alcohol consumed in units (1-16 inc.), last meal consumed today (none, snack or full meal), and duration of drinking session (20-320 minutes inc.). The classes are: over or under the drink driving limit.

3.1.2. Hamming distance

The simplest local distance function, known as the overlap function, returns 0 if the two values are equal and 1 otherwise: dist_A(x.A, q.A) =def 0 if x.A = q.A, and 1 otherwise.

If the global distance function is defined as the sum of the local distances, then we’re simply counting the number of attributes on which the two instances disagree. This is called Hamming distance. Weighted sums and weighted averages are also possible.

Class Exercise. What is the Hamming distance between x1 and q and between x2 and q?

3.1.3 Manhattan distance for numeric attributes

If an attribute is numeric, then the local distance function can be defined as the absolute difference of the values:

dist_A(x.A, q.A) =def |x.A − q.A|

If the global distance is computed as the sum of these local distances, then we refer to it as the Manhattan distance. Weighted sums and weighted averages are also possible.

One weakness of this kind of scheme is that if one of the attributes has a relatively large range of possible values, then it can overpower the other attributes. In the example, this might be the case with duration relative to weight and amount. Therefore, local distances are often normalised so that they lie in the range 0 . . . 1. There are many ways to normalise, some being better-motivated than others. For simplicity, we will look at only one. We will divide by the range of permissible values:

dist_A(x.A, q.A) =def |x.A − q.A| / (A_max − A_min), where A_max is the largest possible value for attribute A, and A_min is its smallest possible value. We'll call this the range-normalised absolute difference. The other weakness of this scheme, of course, is that it can be used only on numeric attributes.

3.1.4 Heterogeneous local distance functions

We can combine the range-normalised absolute differences and the overlaps in order to handle both numeric and symbolic attributes: for each attribute, the local distance is the overlap function if the attribute is symbolic, and the range-normalised absolute difference if it is numeric.

As usual, the global distance can be computed as a sum, weighted sum or weighted average of the local distances. Class Exercise. Taking a sum of local distances defined heterogeneously, what is the distance between x1 and q?
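The following sketch combines the two kinds of local distance into one global weighted sum; the attribute names and ranges follow the drinkers example above, but the particular instance and query values are our own illustrative assumptions.

def local_distance(x_val, q_val, value_range=None):
    # Numeric attribute (a (min, max) range is given): range-normalised
    # absolute difference. Symbolic attribute (no range): overlap function.
    if value_range is not None:
        lo, hi = value_range
        return abs(x_val - q_val) / (hi - lo)
    return 0 if x_val == q_val else 1

def global_distance(x, q, ranges, weights=None):
    # Weighted sum of local distances over all attributes of the query.
    total = 0.0
    for attr in q:
        w = 1.0 if weights is None else weights[attr]
        total += w * local_distance(x[attr], q[attr], ranges.get(attr))
    return total

ranges = {"weight": (50, 150), "amount": (1, 16), "duration": (20, 320)}
x = {"sex": "male", "weight": 90, "amount": 6, "meal": "snack", "duration": 120}
q = {"sex": "female", "weight": 70, "amount": 4, "meal": "full", "duration": 60}
print(global_distance(x, q, ranges))   # unweighted sum of the five local distances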

3.1.5 Knowledge-intensive distance functions

Human experts can sometimes define domain-specific local distance functions, especially for symbolic-valued attributes. In this way, they can bring their prior knowledge to bear.

A simple but common example is when there is already some total ordering defined over the values of the symbolic attribute. For example, the last meal a person ate has values none, snack and full. These can be thought of as totally ordered by the amount of food consumed:

none < snack < full

We can assign integers to the values in a way that respects the ordering: none = 0, snack = 1 and full = 2. Now, we can use a range-normalised absolute difference function on these integers. So, for example, dist_meal(none, full) = |0 − 2| / (2 − 0) = 1.

Class Exercise. Let the weights for sex, weight, amount, meal and duration be 4, 3, 5, 1 and 2 respectively. Taking a weighted sum of local distances defined heterogeneously, where distmeal is defined as above, what is the distance between x2 and q?

To explore this idea a bit further, we’ll invent another attribute, one that wasn’t present in the original dataset, type, with values {lager, stout, whiteWine, redWine, whisky, vodka}, to denote the main kind of beverage the person has been drinking. We’ll look at a variety of ways of defining some knowledge-intensive local distance functions for this attribute. One possibility, of course, is to define a total ordering, based perhaps on alcohol content, e.g.:

LAGER < STOUT < WHITEWINE < REDWINE < WHISKY < VODKA

And then to assign consecutive integers and compute a range-normalised absolute difference. Hence, for example, with LAGER = 0 through VODKA = 5, dist_type(LAGER, VODKA) = |0 − 5| / (5 − 0) = 1.

Another possibility is to define a taxonomy (class hierarchy) of beverage types, for example one grouping the values into beers (lager, stout), wines (whiteWine, redWine) and spirits (whisky, vodka).

Based on distances in this tree, we can again compute range-normalised distances; for example, values that share a parent in the taxonomy are closer to each other than values from different subtrees.

The final approach that we will consider is simply to enumerate all the distances in a matrix. This allows the designers to use whatever distances they feel make sense in their domain. Note that, since the diagonal of the matrix will represent the distance between a value and itself, it should be filled in with zeros. And, assuming that distance is symmetric, the lower triangle will be a reflection along this diagonal of the upper triangle. We have looked at only three ways of defining a distance function for type. For other attributes there could be numerous other domain-specific approaches.

Voting

Once we have obtained q’s k-nearest-neighbours using the distance function, it is time for the neighbours to vote in order to predict q’s class. Two approaches are common.

Majority voting: In this approach, all votes are equal. For each class cl ∈ L, we count how many of the k neighbours have that class. We return the class with the most votes.

Inverse distance-weighted voting: In this approach, closer neighbours get higher votes. While there are better-motivated methods, the simplest version is to take a neighbour's vote to be the inverse of its distance to q: vote(x) = 1 / dist(x, q). Then we sum the votes per class and return the class with the highest total vote.
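Both voting rules can be written in a few lines, given the classes and distances of the k nearest neighbours. This is a minimal sketch: the (class, distance) pairs are assumed to come from a separate neighbour-search step, and the small epsilon guarding against zero distances is our own addition.

from collections import Counter, defaultdict

def majority_vote(neighbours):
    # neighbours: list of (class_label, distance) pairs for the k nearest neighbours.
    counts = Counter(cl for cl, _ in neighbours)
    return counts.most_common(1)[0][0]          # ties broken arbitrarily

def inverse_distance_vote(neighbours, eps=1e-9):
    # Each neighbour votes with weight 1 / distance; closer neighbours count more.
    votes = defaultdict(float)
    for cl, d in neighbours:
        votes[cl] += 1.0 / (d + eps)
    return max(votes, key=votes.get)

neighbours = [("over", 0.5), ("under", 1.0), ("under", 1.5)]
print(majority_vote(neighbours))           # -> 'under' (two votes against one)
print(inverse_distance_vote(neighbours))   # -> 'over'  (2.0 outweighs 1.0 + 0.67)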

Class Exercise. Suppose k = 3 and q's 3-nearest-neighbours from the dataset are instances x7, x35 and x38. (For conciseness, I won't show their attribute values; they aren't needed at this step anyway.) Here are their classes and the distances we computed. What is q's predicted class using (a) majority voting, and (b) inverse distance-weighted voting?

There are numerous other voting schemes. Human experts might even define a domain-specific scheme. For example, in spam-filtering, the cost of misclassifying ham as spam (when legitimate email ends up in your spam folder) is higher than the cost of misclassifying spam as ham (when spam ends up in your in-box). A domain-specific voting scheme might be defined to skew the classifier away from the former kind of error. For example, we could predict class = spam if all k neighbours vote spam; otherwise, we would predict class = ham.
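A minimal sketch of that skewed rule, assuming the class labels are the strings "spam" and "ham":

def spam_biased_vote(neighbour_labels):
    """Predict 'spam' only if every one of the k neighbours is spam; otherwise
    predict 'ham'. The label strings are assumptions for illustration."""
    return "spam" if all(label == "spam" for label in neighbour_labels) else "ham"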

Using k-Nearest-Neighbors for Other Tasks

We’ve been looking at classification. But k-nearest-neighbours is useful for other tasks too.

4.1. Regression

In classification, there is a finite set of class labels L from which we must choose. In regression, the task is, given the description of an object, to predict a real value. Examples abound: from meteorological data, predict tomorrow’s rainfall; from information about a software project, predict how long it will take to complete; from information about stock market movements, predict the value that my shares will have by close of business tomorrow.

In regression, the dataset comprises examples of the form ⟨{A1 = a1, A2 = a2, . . . , An = an}, value = r⟩, where r is a real number. The only part of k-nearest neighbours that needs changing is the voting. In other words, again we find the k instances in the dataset that are closest to q. Now we must use their values to predict q’s value. While there may be better-motivated approaches, the simplest approach is to take the mean of the neighbours’ values.
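A minimal sketch of this change, assuming the k nearest neighbours have already been found and are given as (value, distance) pairs:

def knn_regression_predict(neighbours):
    """neighbours: list of (value, distance) pairs for the k nearest neighbours.
    The simplest prediction is the unweighted mean of the neighbours' values;
    a distance-weighted mean is a common refinement."""
    return sum(value for value, _ in neighbours) / len(neighbours)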

Class Exercise. Suppose our drinkers dataset no longer contains a class for each instance (over or under). Instead, it specifies the blood alcohol content (BAC) of the person, as measured by a breathalyser. This will be a real in the range 0 . . . 100 (a percentage).

Suppose k = 3 and q’s 3-nearest-neighbours from the dataset are instances x7, x35 and x38. Here are their BACs and the distances we computed:

What is q’s predicted BAC?

Don’t go away with the idea that regression is something that kNN methods can do that naïve Bayesian methods cannot: it is just as possible to apply naïve Bayesian methods to regression problems. In fact, neither is likely to be the best method to use for regression!

Nearest Neighbor Classification

What is called supervised learning is the most fundamental task in machine learning. In supervised learning, we have training examples and test examples. A training example is an ordered pair ⟨x, y⟩, where x is an instance and y is a label. A test example is an instance x with unknown label. The goal is to predict labels for test examples. The name "supervised" comes from the fact that some supervisor or teacher has provided explicit labels for training examples.

Assume that instances x are members of the set X, while labels y are members of the set Y. Then a classifier is any function f : X → Y. A supervised learning algorithm is not a classifier. Instead, it is an algorithm whose output is a classifier. Mathematically, a supervised learning algorithm is a higher-order function: it is a function of type (X × Y)^n → (X → Y), where n is the cardinality of the training set.

The nearest-neighbor method is perhaps the simplest of all algorithms for predicting the class of a test example. The training phase is trivial: simply store every training example, with its label. To make a prediction for a test example, first compute its distance to every training example. Then, keep the k closest training examples, where k ≥ 1 is a fixed integer. Look for the label that is most common among these examples.

This label is the prediction for this test example. Using the same set notation as above, the nearest-neighbor method is a function of type (X × Y)^n × X → Y. A distance function has type X × X → R. This basic method is called the kNN algorithm. There are two major design choices to make: the value of k, and the distance function to use. When there are two alternative classes, the most common choice for k is a small odd integer, for example k = 3, in order to avoid ties. If there are more than two classes, then ties are possible even when k is odd. Ties can also arise when two distance values are the same. An implementation of kNN needs a sensible algorithm to break ties; there is no consensus on the best way to do this.

When each example is a fixed-length vector of real numbers, the most common distance function is Euclidean distance:

d(x, y) = ||x − y|| = sqrt( Σ_{i=1}^{m} (x_i − y_i)² )

where m is the dimensionality of the vectors.
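Putting the pieces together, here is a minimal Python sketch of the basic kNN classifier just described, using Euclidean distance and majority voting; ties are broken arbitrarily, which is only one of the possible tie-breaking policies.

import math
from collections import Counter

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) for two equal-length real vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(training_set, query, k=3, distance=euclidean):
    """training_set: list of (vector, label) pairs.
    Returns the most common label among the k training examples closest to
    query. Ties in distance or vote count are broken arbitrarily."""
    nearest = sorted(training_set, key=lambda ex: distance(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]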

5.1 Categorization of supervised learning tasks

A major advantage of the kNN method is that it can be used to predict labels of any type. Suppose that training and test examples belong to some set X, while labels belong to some set Y. As mentioned above, formally a classifier is a function X → Y. Supervised learning problems can be categorized as follows.

• The simplest case is |Y| = 2; the task is a binary classifier learning problem.

• If Y is finite and discrete, with |Y| ≥ 3, the task is called a multiclass learning problem.

• If Y = R then the task is called regression.

• If Y = R^q with q ≥ 1, then the task is called multidimensional regression.

Note that the word "multivariate" typically refers to the fact that the dimensionality of input examples is m > 1.

• If Y is a power set, that is Y = 2^Z for some finite discrete set Z, then the task is called multilabel learning. For example, X might be the set of all newspaper articles and Z might be the set of labels {SPORTS, POLITICS, BUSINESS, . . .}. Technically, the label of an example is a set, and one should use a different name such as "tag" for each component of a label.

• If X = A* and Y = B*, where A and B are sets of symbols and A* denotes the set of all finite sequences of symbols from A, then we have a sequence labeling problem. For example, A* might be the set of all English sentences, and B* might be the set of part-of-speech sequences such as NOUN VERB ADJECTIVE.

Tasks such as the last two above are called structured prediction problems. The word "structured" refers to the fact that there are internal patterns in each output label. For example, suppose that x is a sentence of length 2. The part-of-speech labels y = NOUN VERB and y = VERB NOUN both have high probability, while the labels y = NOUN NOUN and y = VERB VERB do not.

One major theme of recent research in machine lear


