A Methodology For Enhancing Template Extraction Accuracy


ABSTRACT:

Today, websites contain a large number of pages that are generated by populating common templates with contents. Irrelevant terms in these templates degrade the accuracy of web applications, so template detection techniques have recently received a lot of attention. To extract the template from heterogeneous documents, we use algorithms that measure the similarity of the underlying structure of the documents, so that templates are extracted for the resulting clusters. We implement several algorithms to find the similarity between web pages. The earlier algorithms, TEXT-HASH and TEXT-MAX, use the Jaccard coefficient, but their time and space requirements are high. In this paper, we implement TEXT-HASH and TEXT-MAX with the Jaccard as well as the Dice coefficient. The space and time consumed with the Dice coefficient are lower than with the Jaccard coefficient.

KEYWORDS:

Template Extraction, Clustering, MDL, TEXT-MAX, TEXT-HASH, Jaccard Coefficient, Dice Coefficient.

INTRODUCTION:

The Internet has been the major source of information in recent decades. Many web pages are created from common templates, but unwanted terms in those templates reduce the performance of search engines. In this project, we detect and extract the common template from heterogeneous web pages, which improves the performance of search engines, classification, and clustering. A good template extraction technique improves the performance of applications in domains such as industry, medicine, and government.

Approaches that extract templates from DOM trees use tree edit distance measures to find common templates. However, it is not easy to select proper training data, and such approaches do not always work. The earlier approach employs the Minimum Description Length (MDL) principle to cluster the web documents and estimates the Jaccard coefficient between sets; the MDL cost of the clusters and the execution time are high.

Our approach manages an unknown number of templates and improves the efficiency and scalability of the template detection and extraction algorithm. We extend MinHash by estimating Dice's coefficient between two sets, and we estimate the MDL cost with partial information of the documents. The method is fully automated and robust, without requiring many parameters.

PROPOSED SYSTEM:

The World Wide Web (WWW) is widely used to publish and access information on the Internet. To achieve high publishing productivity, the web pages of many websites are automatically populated by using common templates with contents.

For human beings, the templates provide readers with easy access to the contents, guided by consistent structures, even though the templates are not explicitly announced. For machines, however, the unknown templates are considered harmful because the irrelevant terms in the templates degrade accuracy and performance.

Thus, template detection and extraction techniques have recently received a lot of attention to improve the performance of web applications such as data integration, search engines, and classification of web documents. Since an HTML document is naturally represented by a Document Object Model (DOM) tree, web documents are treated as trees. In order to alleviate the limitations of the state-of-the-art technologies, we investigate the problem of detecting templates from heterogeneous web documents and present novel algorithms called TEXT (auTomatic tEmplate eXTraction).

We propose to represent a web document and a template as a set of paths in a DOM tree. Our goal is to manage an unknown number of templates and to improve the efficiency and scalability of template detection and extraction algorithms. To deal with the unknown number of templates and select good partitioning from all possible partitions of web documents, we employ Rissanen’s Minimum Description Length (MDL) principle.

In order to improve efficiency and scalability to handle a large number of web documents for clustering, we extend MinHash. While the traditional MinHash is used to estimate the Jaccard coefficient between sets, we propose an extended MinHash to estimate our MDL cost measure with partial information of documents.

Moreover, our proposed algorithms are fully automated and robust without requiring many parameters. Our proposed work consists of two algorithms, TEXT-HASH and TEXT-MAX, each evaluated with the Jaccard and the Dice coefficient. We implement and compare these algorithms and conclude that TEXT-MAX with the Dice coefficient is the best one.

TEXT-HASH is an agglomerative clustering algorithm that uses MinHash signatures; it requires the length of the MinHash signature as an input parameter. TEXT-MAX is a clustering algorithm that uses MinHash signatures together with Heuristic 1 to reduce the search space; it also requires the signature length as an input parameter. TEXT-MAX with the Dice coefficient further reduces the search space by replacing the Jaccard coefficient with the Dice coefficient.

SYSTEM ARCHITECTURE:

Fig 2: System Architecture

SYSTEM FLOW DIAGRAM

The system flow proceeds from the HTML documents to the Document Object Model, then to the essential paths and templates (the set of paths, the supports of the paths, and the minimum support threshold), and finally to the matrices (ME, MT, MD) and the clustering step. Three clustering variants are shown: TEXT-HASH (Initial Cluster, GetBestPair, GetHashMDLCost, Final Cluster, with an approximate MDL cost), TEXT-MAX (Initial Cluster, GetInitBestPair, GetHashMDLCost, Final Cluster, with an approximate MDL cost), and TEXT-MAX DICE (Initial Cluster, GetBestPair, GetMDLCost, Final Cluster, with an approximate MDL cost).

Fig 3: System Flow Diagram

MODULES

HTML Documents and Document Object Model

We collect the HTML documents as input. The DOM defines a standard for accessing documents such as HTML and XML and presents an HTML document as a tree structure. The entire document is a document node, every HTML element is an element node, the text inside HTML elements forms text nodes, every HTML attribute is an attribute node, and comments are comment nodes.

For example, let us consider the simple HTML documents in Fig. 4, for which we construct the Document Object Model (DOM) trees. For instance, the DOM tree of the simple document d3 is shown in Fig. 5. The support values of the paths in the documents, computed from the DOM trees, are shown in Table 1; the support of a path is the number of documents in which the path appears. Document d1 is represented as the set of paths {p1, p2, p3}, and document d2 as the set of paths {p1, p2, p3, p4}.

Fig-4: Simple Web Documents

Fig-5 : DOM Tree of d3 in Fig. 4.

For a node in a DOM tree, we denote the path of the node by listing the nodes from the root to the node, using '\' as a delimiter between nodes. For example, in the DOM tree of d3 in Fig. 5, the path of the node 'Template Extraction' is Document\<html>\<body>\<h1>\Template Extraction. A small sketch of this path extraction is given after Table 1.

Path   Token path                                          Sup
P1     Document\<html>                                      4
P2     Document\<html>\<body>                               4
P3     Document\<html>\<body>\<br>                          3
P4     Document\<html>\<body>\list                          2
P5     Document\<html>\<body>\<h1>                          2
P6     Document\<html>\<body>\<big>                         2
P7     Document\<html>\<body>\<b>                           2
P8     Document\<html>\<body>\<h1>\Template Extraction      1
P9     Document\<html>\<body>\<h1>\data                     1

Table 1: Paths of Tokens and their Supports
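
As an illustration of how such root-to-node paths and their supports can be obtained, the following Python sketch uses the standard html.parser module; the class and function names, the handling of void tags such as <br>, and the two tiny input documents are illustrative assumptions, not part of the original system.

```python
from collections import Counter
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collects the root-to-node paths of one HTML document, using '\\' as delimiter."""
    VOID = {"br", "hr", "img", "meta", "link", "input"}   # tags that never get an end tag

    def __init__(self):
        super().__init__()
        self.stack = ["Document"]
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append("<%s>" % tag)
        self.paths.add("\\".join(self.stack))
        if tag in self.VOID:                  # void elements are popped immediately
            self.stack.pop()

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == "<%s>" % tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:                              # text nodes become leaf paths
            self.paths.add("\\".join(self.stack + [text]))

def path_supports(documents):
    """Support of a path = number of documents in which the path appears."""
    support = Counter()
    for html in documents:
        collector = PathCollector()
        collector.feed(html)
        for path in collector.paths:
            support[path] += 1
    return support

# Two tiny placeholder documents (not the actual documents of Fig. 4):
docs = ["<html><body><br></body></html>",
        "<html><body><h1>Template Extraction</h1></body></html>"]
for path, sup in sorted(path_supports(docs).items()):
    print(sup, path)
```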

Essential Paths and Templates

A collection of web documents is represented by D = {d1, d2, ..., dn}, and we define the path set PD as the set of all paths in D. For each document di, we calculate a threshold value; the threshold values of the documents may differ. The mode of the support values of a document's paths is very effective for keeping template paths while eliminating content paths. We therefore use the mode of the support values of the paths in each document as the minimum support threshold of that document. If there are several modes of the support values, we take the smallest mode.

For example, in Table 1, the paths appearing in document d1 are p1, p2, and p3, whose supports are 4, 4, and 3, respectively. Since 4 is their mode, we use 4 as the minimum support threshold td1. Then p1 and p2 are the essential paths of d1.

Similarly, the minimum support thresholds td2, td3, and td4 are 4, 2, and 2. Based on the threshold of each document, we construct the essential path matrix, in which rows represent the paths in the document set and columns represent the documents. We use a |PD| x |D| matrix ME with 0/1 values to represent the documents with their essential paths: the entry of ME is 1 if the path pi is an essential path of the document dj, and 0 otherwise.

Example 1: Consider the HTML documents D = {d1, d2, d3, d4} in Fig. 4. All the paths and their frequencies in D are given in Table 1. Assume that the minimum support thresholds td1, td2, td3, and td4 are 4, 4, 2, and 2, respectively. The essential path sets are E(d1) = {p1, p2}, E(d2) = {p1, p2}, E(d3) = {p1, p2, p4, p5, p6, p7}, and E(d4) = {p1, p2, p4, p5, p6, p7}. We have the path set PD = {pi | 1 ≤ i ≤ 9}, and the matrix ME (rows p1 to p9, columns d1 to d4) becomes as follows:

p1: 1 1 1 1
p2: 1 1 1 1
p3: 0 0 0 0
p4: 0 0 1 1
p5: 0 0 1 1
p6: 0 0 1 1
p7: 0 0 1 1
p8: 0 0 0 0
p9: 0 0 0 0
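
The following Python sketch reproduces Example 1 under stated assumptions: the supports are taken from Table 1, the per-document path sets for d3 and d4 are assumed (they are not listed explicitly in the text), and the smallest mode of each document's support values is used as its threshold.

```python
from collections import Counter

def minimum_support_threshold(paths, support):
    """Smallest mode of the support values of a document's paths."""
    counts = Counter(support[p] for p in paths)
    best = max(counts.values())
    return min(s for s, c in counts.items() if c == best)

def essential_paths(paths, support):
    """Paths of the document whose support reaches its minimum support threshold."""
    threshold = minimum_support_threshold(paths, support)
    return {p for p in paths if support[p] >= threshold}

def essential_path_matrix(all_paths, essentials):
    """|PD| x |D| 0/1 matrix ME; an entry is 1 iff the path is essential in the document."""
    return [[1 if p in e else 0 for e in essentials] for p in all_paths]

# Supports of p1..p9 from Table 1.
support = {"p1": 4, "p2": 4, "p3": 3, "p4": 2, "p5": 2,
           "p6": 2, "p7": 2, "p8": 1, "p9": 1}
doc_paths = [{"p1", "p2", "p3"},                              # d1 (given)
             {"p1", "p2", "p3", "p4"},                        # d2 (given)
             {"p1", "p2", "p4", "p5", "p6", "p7", "p8"},      # d3 (assumed)
             {"p1", "p2", "p4", "p5", "p6", "p7", "p9"}]      # d4 (assumed)
essentials = [essential_paths(paths, support) for paths in doc_paths]
PD = ["p%d" % i for i in range(1, 10)]
for label, row in zip(PD, essential_path_matrix(PD, essentials)):
    print(label, row)
```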

CLUSTERING BASED ON MINIMUM DESCRIPTION LENGTH

Initially, each document is considered as a cluster. A cluster is denoted by a pair (Ti, Di), where Ti is a set of essential paths and Di is a set of documents in the cluster. The set of all clusters is represented by C = {c1, c2, ..., cn} for a web document set D; that is, we initially have n clusters for a web document set D. We construct the matrices MT, MD, and MΔ: MT represents each cluster with its template paths, MD represents each cluster with its member documents, and MΔ is constructed so that ME = MT · MD + MΔ. The MDL cost of a clustering and of a matrix are denoted by L(C) and L(M), respectively. The cost of a clustering is calculated by L(C) = L(MT) + L(MD) + L(MΔ). L(MD) becomes |D| · log2 |D|, while L(MT) and L(MΔ) are calculated by

H(X) = − Σx Pr(x) · log2 Pr(x),

L(M) = |M| · H(X),

where x ranges over the distinct values appearing in the matrix M.

For example, consider the web documents in Fig. 4 and ME in Example 1 again. Assume that we have a clustering C = {c1, c2}, where c1 = ({p1, p2}, {d1, d2}) and c2 = ({p1, p2, p3, p4, p5, p6, p7}, {d3, d4}). Then MT, MD, and MΔ are determined so that ME = MT · MD + MΔ.

Then, with MT, Pr(1) = 9/36 and Pr(0) = 27/36, and we have L(MT) = |MT| · H(X) = 36 · (−9/36 · log2(9/36) − 27/36 · log2(27/36)) = 29.21. Similarly, L(MD) = 8 and L(MΔ) = 11.145, and thus L(C) = 48.35. In this way, the MDL cost is calculated for different combinations of clusters, and the clustering with the minimum MDL cost is selected as the best one.
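
To make the entropy-based encoding concrete, here is a minimal Python sketch of L(M) = |M| · H(X) and of L(C); the function names are illustrative, and the example matrix only reproduces the value distribution of the worked example (9 ones out of 36 entries), not the actual matrices of the figure.

```python
import math
from collections import Counter

def encoding_cost(matrix):
    """L(M) = |M| * H(X), where H(X) is the entropy of the values occurring in M."""
    values = [v for row in matrix for v in row]
    total = len(values)
    h = -sum((c / total) * math.log2(c / total) for c in Counter(values).values())
    return total * h

def clustering_cost(MT, MDelta, num_docs):
    """L(C) = L(MT) + L(MD) + L(MDelta), with L(MD) = |D| * log2|D|."""
    return encoding_cost(MT) + num_docs * math.log2(num_docs) + encoding_cost(MDelta)

# A 0/1 matrix with 9 ones among 36 entries reproduces L(MT) = 29.21 from the example.
MT = [[1] * 9 + [0] * 27]
print(round(encoding_cost(MT), 2))   # 29.21
```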

TEXT-MDL ALGORITHM USING AGGLOMERATIVE CLUSTERING

For efficient computation of the optimal cost, the MDL cost formula is rewritten as

|ME| · (β/α) · (Pr(1) of MT + (Pr(1) + Pr(−1)) of MΔ) + L(MD) = (β/α) · (number of 1s in MT + number of 1s and −1s in MΔ) + L(MD)    (1)

From this formulation, the cost contribution of a cluster does not depend on any other cluster in the cluster set C. The following algorithm calculates the MDL cost for different clusterings and finds the clustering with the minimum MDL cost using the agglomerative clustering technique. In this algorithm, each document is initially considered as its own cluster, and clusters are merged based on the MDL cost until the best clustering is found.

The GetMDLCost function is used to find the MDL cost of the current clustering; two candidate clusters and the current clustering are passed to this function as input.

In the GetBestPair function, we first get the current best pair, merge the two clusters, and calculate the MDL cost of the resulting clustering. If this MDL cost is smaller than the previous MDL cost, the two clusters are merged; otherwise, another pair of clusters is tried, and this continues until the best clustering is obtained.
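
The overall loop can be sketched as follows; this is a simplified greedy version in which the MDL cost function is passed as a callback, not the exact GetBestPair/GetMDLCost routines of the algorithm.

```python
import itertools

def agglomerative_clustering(documents, mdl_cost):
    """Greedy agglomerative clustering driven by an MDL cost function.

    documents: list of essential-path sets, one per document.
    mdl_cost:  function taking a clustering (a list of clusters, each a list of
               essential-path sets) and returning its MDL cost.
    """
    clusters = [[d] for d in documents]           # each document starts as its own cluster
    best_cost = mdl_cost(clusters)
    while len(clusters) > 1:
        best_pair, best_merged_cost = None, best_cost
        for i, j in itertools.combinations(range(len(clusters)), 2):
            candidate = [c for k, c in enumerate(clusters) if k not in (i, j)]
            candidate.append(clusters[i] + clusters[j])
            cost = mdl_cost(candidate)            # cost of the clustering after this merge
            if cost < best_merged_cost:
                best_pair, best_merged_cost = (i, j), cost
        if best_pair is None:                     # no merge reduces the MDL cost: stop
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        best_cost = best_merged_cost
    return clusters, best_cost
```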

TEXT-HASH ALGORITHM USING THE MINHASH FUNCTION

The MDL cost of a clustering is estimated by MinHash. In this way, we can reduce the dimensionality of the documents and find the best pair quickly.

MinHash

Jaccard's coefficient between two sets A1 and A2 is defined as

γ(A1, A2) = |A1 ∩ A2| / |A1 ∪ A2|

and Dice's coefficient between two sets A1 and A2 is defined as

∂(A1, A2) = 2 · |A1 ∩ A2| / (|A1| + |A2|).

Both can be estimated using min-wise independent permutations, which estimate the coefficient by repeatedly assigning random ranks to the universal set and comparing the minimum ranks of each set.

Consider a set of random permutations Π = {π1, ..., πL} on a universal set U = {r1, ..., rM} and a set A1 ⊆ U. Let πi(rj) be the rank of rj in a permutation πi, and let min(πi(A1)) denote min{πi(rj) | rj ∈ A1}. Π is called min-wise independent if Pr(min(πi(A1)) = πi(x)) = 1/|A1| for every set A1 ⊆ U, every x ∈ A1, and every πi ∈ Π. Then, for any sets A1, A2 ⊆ U and every πi ∈ Π, we have Pr(min(πi(A1)) = min(πi(A2))) = γ(A1, A2), the Jaccard coefficient defined above; Dice's coefficient can then be obtained from it, since ∂ = 2γ / (1 + γ).

For k sets A1, A2, ..., Ak, the Jaccard coefficient is defined by

γ(A1, ..., Ak) = |A1 ∩ ... ∩ Ak| / |A1 ∪ ... ∪ Ak|

and the Dice coefficient is defined by

∂(A1, ..., Ak) = 2 · |A1 ∩ ... ∩ Ak| / (|A1| + ... + |Ak|).

We can estimate γ(A1, ..., Ak) using the signature vectors as follows:

γ(A1, ..., Ak) = |{i | sigA1[i] = ... = sigAk[i]}| / |Π|
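
A minimal sketch of this estimation is given below; the random permutations are simulated with salted hash functions (an implementation assumption for illustration), and the estimate counts the signature positions where all the sets' minima agree, as in the formula above.

```python
import random

def minhash_signature(elements, num_perms, seed=0):
    """Signature of a set: its minimum 'rank' under each simulated permutation."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perms)]
    return [min(hash((salt, e)) for e in elements) for salt in salts]

def estimate_jaccard(signatures):
    """Fraction of positions where all signatures agree: an estimate of gamma."""
    num_perms = len(signatures[0])
    agreeing = sum(1 for i in range(num_perms)
                   if len({sig[i] for sig in signatures}) == 1)
    return agreeing / num_perms

A1, A2 = {"p1", "p2", "p3"}, {"p1", "p2", "p4"}
signatures = [minhash_signature(A, 256) for A in (A1, A2)]
print(estimate_jaccard(signatures))   # close to the true Jaccard coefficient 2/4 = 0.5
```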

Extended MinHash

Thus, given a collection of sets S = {A1, ..., Ak}, we extend MinHash to estimate the probabilities needed to compute the MDL cost. For a collection of sets X, we denote the probability as

ξ(X, m) = |{rj | rj is included in m of the sets in X}| / |A1 ∪ ... ∪ Ak|

Then, ξ(X,m) is defined for 1≤ m≤ |S| and ξ(S,|S|) is the same as the Jaccard’s coefficient of sets in S.

We can estimate ξ(X,m) with sigX as follows [1].

ξ(X, m) = |{i | n(sigX[i]) = m}| / |Π|    (2)
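
The sketch below illustrates equation (2) under one explicit assumption: the extended signature is taken to store, for each simulated permutation, the overall minimum over the sets in X together with the number of sets attaining that minimum, and n(sigX[i]) is read as that count.

```python
import random

def extended_signature(sets, num_perms, seed=0):
    """For each simulated permutation: (overall minimum, number of sets attaining it)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perms)]
    signature = []
    for salt in salts:
        minima = [min(hash((salt, e)) for e in s) for s in sets]
        overall = min(minima)
        signature.append((overall, minima.count(overall)))   # the count plays the role of n(sigX[i])
    return signature

def estimate_xi(signature, m):
    """Estimate of xi(X, m): fraction of positions whose count equals m."""
    return sum(1 for _, n in signature if n == m) / len(signature)

# Three small sets; their union has 5 elements, of which 3 lie in exactly one set,
# 1 in exactly two sets, and 1 in all three sets.
X = [{"p1", "p2", "p3"}, {"p1", "p2", "p4"}, {"p2", "p5"}]
sig = extended_signature(X, 512)
print([round(estimate_xi(sig, m), 2) for m in (1, 2, 3)])   # roughly [0.6, 0.2, 0.2]
```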

Calculation of MDL Cost using MinHash

The MDL cost is computed using n(Di, k). Recall that sup(px, Di) is the number of documents in Di having the path px as an essential path, and let n(Di, k) denote the number of paths px whose sup(px, Di) is k. The following formulas show how the numbers of 1s and −1s in MT and MΔ can be counted:

(3)

For k with 1 ≤ k ≤ |Di|,

(4)

This algorithm estimates the MDL cost but does not generate the template paths of each cluster. cb is initialized as the empty set, and the signature of cb is maintained to estimate the MDL cost.

TEXT-MAX ALGORITHM USING THE MINHASH FUNCTION

When the clusters are merged hierarchically, the algorithm selects the two clusters whose merging maximizes the reduction of the MDL cost. In order to find the nearest cluster efficiently, we can calculate the Jaccard coefficient between two clusters cm and cn on their path sets Tm and Tn as

γ(cm, cn) = |Tm ∩ Tn| / |Tm ∪ Tn|.

Then, given three clusters cm, cn, and ck, if the Jaccard coefficient between cm and cn is greater than that between cm and ck, we assume that the reduction of the MDL cost obtained by merging cm and cn will be greater than that obtained by merging cm and ck.

By using this approach, the search space for finding the nearest cluster is reduced: it becomes only those clusters whose Jaccard coefficient with cm is maximal. The previous algorithm is extended by calculating the Jaccard coefficient between the two sets.
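
The heuristic can be sketched as follows; the cluster path sets and the function names are illustrative, and in the actual algorithm the Jaccard coefficient would be estimated from MinHash signatures rather than computed exactly.

```python
def jaccard(Tm, Tn):
    """Jaccard coefficient of two clusters' path sets."""
    return len(Tm & Tn) / len(Tm | Tn) if (Tm | Tn) else 0.0

def candidate_partners(clusters, m):
    """Indices of the clusters whose Jaccard coefficient with cluster m is maximal."""
    scores = [(jaccard(clusters[m], clusters[n]), n)
              for n in range(len(clusters)) if n != m]
    best = max(score for score, _ in scores)
    return [n for score, n in scores if score == best]

# Only the candidates returned here are evaluated with the approximate MDL cost,
# which reduces the search space compared with trying every possible pair.
clusters = [{"p1", "p2"}, {"p1", "p2", "p4"}, {"p5", "p6"}]
print(candidate_partners(clusters, 0))   # [1]
```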

TEXT-MAX USING DICE COEFFICIENT

In order to improve the efficiency of clustering, we implement TEXT-MAX with the Dice coefficient, which reduces the execution time. Dice's coefficient between two clusters cm and cn can be calculated on their path sets Tm and Tn as

∂(cm, cn) = 2 · |Tm ∩ Tn| / (|Tm| + |Tn|).

We change the previous algorithm by replacing the Jaccard coefficient with the Dice coefficient. This algorithm improves the execution time, although the MDL cost is higher compared with the previous algorithms.
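
A corresponding sketch for the Dice variant is shown below; it replaces the Jaccard coefficient with Dice's coefficient and, when only a MinHash-based Jaccard estimate is available, converts it using the identity ∂ = 2γ / (1 + γ).

```python
def dice(Tm, Tn):
    """Dice coefficient of two clusters' path sets."""
    return 2 * len(Tm & Tn) / (len(Tm) + len(Tn)) if (Tm or Tn) else 0.0

def dice_from_jaccard(j):
    """Convert a (MinHash-estimated) Jaccard coefficient into a Dice coefficient."""
    return 2 * j / (1 + j)

print(dice({"p1", "p2"}, {"p1", "p2", "p4"}))   # 0.8
print(dice_from_jaccard(2 / 3))                 # 0.8 as well
```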


