Web Page Visual Similarity Computer Science Essay


Vanita V. Sawant, ME-II Computer Engg., Vidya Pratishthan's College of Engineering, Baramati.

Prof. Takale S. A., Assistant Professor, Information Technology Department, VPCOE Baramati, Pune University.

Abstract

Today the Internet has become part of our lives, and the World Wide Web is its most widely used service because it allows information such as documents and images to be presented. Web page visual similarity assessment has been employed to address problems in different fields, including phishing detection, web archiving, web search engines, and truth discovery of web sites. An effective approach to similarity assessment of Web pages is proposed, which uses the Earth Mover's Distance (EMD) to measure Web page visual similarity. In this approach, the involved Web pages are first converted into low-resolution images, and color and coordinate features are then used to represent the image signatures; EMD is used to calculate the signature distances of the images of the Web pages. The first step is to get the Web page from its URL and compute the layout similarity using the tag and template comparison algorithms. If the layouts match, link analysis is performed to compare inward and outward links. If these also match, the pages are converted into normalized images, and their image signatures are represented with features composed of the dominant color categories and their corresponding centroid coordinates, from which the visual similarity of the two Web pages is calculated. If the EMD value is zero, the Web pages are considered similar; otherwise they are non-similar.

Key terms

visual similarity, EMD, link analysis, layout similarity, web similarity, text similarity.

1. Introduction

Web pages are the main content units of the World Wide Web, and similarity of Web pages is very useful for Web content analysis. Several similarity computation methods have been used to compare Web pages. However, text-based similarity computation alone is not sufficient for Web page comparison, because a Web page consists not only of text but also of multimedia content such as audio, video, images, and hyperlink structure. This paper proposes a new approach that evaluates the visual similarity of Web pages by considering these contents; it makes Web page similarity computation more exact and brings benefits for Web analysis. The World Wide Web contains large repositories of electronically stored information, and the size, dynamic nature, and diversity of this content make effective search tools necessary. Web search engines are today among the most frequently used tools for retrieving information from the Web. Apart from research into methods for effective retrieval of information on the Web, there has also been a considerable increase in research into methods for effective information organisation, access, and navigation. For such research problems, relationships (i.e., similarities) between Web pages become important. Subjects in which the notion of similarity finds use include cluster-based search engines (e.g., Vivisimo, iBoogie), web communities, the related-pages function of search engines, collaborative filtering, identification of fake Web pages, and visualization. Typical scenarios in which a suitable notion of Web page similarity is desirable include search engines, testing tools, document wrappers, detection of duplicated Web pages, and Web data mining.

This paper proposes an effective approach for detecting similarity of Web pages, which employs the Earth Mover's Distance (EMD) to calculate the visual similarity of Web pages. The most important reason that Internet users become victims of phishing attacks is that phishing Web pages usually have high visual similarity with the real Web pages: visually similar page layouts, dominant colors, images, etc. Visually similar URLs can be obtained from www.phishtank.com or any other source. First, the layout similarity of two Web pages is calculated from their associated DOM-tree representations using the Templates Computation Algorithm and the Simple Tags Comparison Algorithm. If the layouts of both Web pages are similar, link analysis is computed; otherwise a dissimilarity report is generated. Link analysis is performed by analyzing forward and backward links; if they match, the EMD is computed, otherwise a dissimilarity report is generated. To compute the EMD, the Web pages are first converted into normalized images, and their image signatures are represented with features composed of the dominant color categories and their corresponding centroid coordinates, from which the visual similarity of the two Web pages is calculated. The linear programming approach for EMD is applied to the visual similarity computation of the two signatures. If the EMD-based visual similarity of a Web page exceeds the threshold, we classify the Web page as non-similar; if the EMD value is 0, we classify the Web pages as similar, otherwise as non-similar.

Web page visual similarity is assessed in four ways, which are as follows:

1. Layout Similarity.

2. Link Analysis.

3. Text Similarity.

4. Image-based EMD.

2. Related Work

2.1. Literature Survey

The layout similarity method is based on AntiPhish [1], also known as DOMAntiPhish. The key disadvantage of AntiPhish is that manual user interaction is required to specify the information on a web site that is considered sensitive. A number of anti-phishing solutions have been proposed to date. These approaches can generally be classified into five main categories: email-based [2], blacklist-based [3], visual-clue-based [4], website-feature-based [5], and information-flow-based approaches [6].

Web pages consist of features such as layout content, textual content, and visual content. Textual content is defined as the terms or words that appear in a given Web page [19], except for the stop words (a set of common words like "a," "the," "this," etc.). We first separate the main text content from the HTML tags and apply stemming [7] to each word. Stems are used as basic features instead of the original words; for example, "program," "programs," and "programming" are stemmed into "program" and considered as the same word. A variety of similarity and distance measures have been proposed and widely applied, such as cosine similarity, the Jaccard correlation coefficient, and Euclidean distance. Meanwhile, similarity is often conceived in terms of dissimilarity or distance as well [8].

Kleinberg [9] introduced the concepts of "authorities" and "hubs"; his paper is perhaps one of the most widely cited in the area of hyperlinked environments. He utilized his algorithm to solve "similarity queries," but he did not present a similarity measure. There are other systems that find similar pages online, such as Google with its "Similar Pages" feature [10] and Netscape with its "What's Related?" option [11]. Nevertheless, none of them present the concepts and techniques used in semantic link analysis.

In this paper, we propose an effective approach for comparing two Web pages, which employs the Earth Mover's Distance (EMD) [12] to calculate the visual similarity of Web pages. Previous research on duplicated document detection focuses only on plain-text documents and uses pure text features in the similarity measure, such as collection statistics, syntactic analysis, display structure, visual-based understanding [13], and the vector space model [14]. Hoad and Zobel surveyed various methods for plagiarized document detection in [15]. However, as Liu et al. [16] demonstrated, pure text features are not sufficient for phishing Web page detection, since phishing Web pages mainly employ visual similarity to scam users. DOM-based [17] visual similarity of Web pages is oriented toward this problem, and it was there that the concept of a visual approach to phishing detection was first introduced.

EMD is a method to evaluate the distance (dissimilarity) between two Web page signatures, where a signature is a set of features and their corresponding weights. The method comes from the well-known transportation problem.

To provide a solution to Web page assessment, we provide a user-friendly system. The system is developed using the NetBeans 7.1.2 IDE with Java at the front end. The Webscreenshot API is required to get the Web page images in PNG format, and an Internet connection is required to run the system. There are no hardware requirements beyond a mouse and keyboard. A detailed description of the system and its methods is given in the proposed work.

3. Problem Statement

To find out the dissimilarity between two Web pages, Web page visual similarity assessment is performed. It is based on the visual content and on features of the Web page such as text, HTML tags, hyperlinks, and Web page image pixels. After the two Web pages are entered for assessment, they are compared with four methods, viz. layout similarity assessment, textual similarity assessment, hyperlink structure analysis, and image-based Earth Mover's Distance, respectively. Only if the first three methods report similarity is image-based EMD performed. Image-based EMD involves the following Web page preprocessing procedure:

1. Get the image of the Web page from its URL.

2. Perform normalization to increase the sharpness of the Web page image.

3. Represent the normalized Web page image as a Web page visual signature (consisting of color and coordinate features), and then compute the visual similarity using EMD.

If the EMD value is zero, the two entered Web pages are considered similar; otherwise they are non-similar.

3.1. Solving Approach

To address the given problem, the system requires two URLs as input which are visually similar (e.g., as available on www.phishtank.com). The first method is then applied, using the tag and template comparison algorithms to check the layouts of the two Web pages. If the layouts match, the text data present on the Web pages is compared with the simple cosine similarity measure. If that also matches, the inward and outward links are compared. If any method generates a dissimilar result, that becomes the final output of the system; otherwise the image-based EMD method is performed. Its first step is preprocessing of the Web pages to obtain their signatures, after which the EMD value is calculated. If it is zero, the two Web pages are similar; otherwise they are non-similar.

4. Proposed Work

The proposed methods for comparing Web pages are given below.

4.1. Layout Similarity

The layout similarity of two web sites is calculated from their associated DOM-tree representations. We start from the assumption that if two web sites have the same DOM-tree, then they must produce an identical layout; it is still possible that two web sites with two different DOM-trees render the same layout. Given two DOM-trees, we compare their similarity in two different ways: 1) comparing the tags of the two Web pages; 2) extracting regular subgraphs from the trees. Templates represent particular subgraphs of the original graph with at least two instances (i.e., tags). The generic regularity extraction problem consists of identifying large existing templates with many instances and covering the graph with the identified templates. The layout similarity of the two web sites is defined as the ratio of the weighted number of matched vertices of the DOM-trees to the total number of vertices in the Web page, as shown in Equation 1:

$$\sigma = \sum_{n=0}^{v} Wt(V_n) / |V| \qquad (1)$$

where $Wt$ is a function that assigns a similarity weight between 0 and 1 to each vertex of the DOM-tree, $V_n$ represents the n-th vertex of the DOM-tree, and $|V|$ is the total number of vertices. If the layout similarity value $\sigma$, as defined in (1), exceeds a certain threshold $\theta$, the two pages are considered similar.
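As an illustration only, the ratio of Equation 1 can be computed once the comparison algorithms have assigned a match weight to each vertex. The following minimal Java sketch assumes a hypothetical list of vertex identifiers and a weight map, neither of which is specified by the original system:

```java
import java.util.List;
import java.util.Map;

// Minimal sketch of Equation 1: the layout similarity is the sum of the
// per-vertex match weights divided by the total number of vertices.
// The vertex list and weight map are hypothetical; the weights (in [0,1])
// are assumed to come from the tag/template comparison algorithms.
public class LayoutSimilarity {
    public static double similarity(List<String> vertices, Map<String, Double> matchWeight) {
        if (vertices.isEmpty()) return 0.0;
        double weighted = 0.0;
        for (String v : vertices) {
            weighted += matchWeight.getOrDefault(v, 0.0); // unmatched vertices contribute 0
        }
        return weighted / vertices.size();
    }
}
```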

The tree represents the web site structure: the set of vertices models the set of tags, and the hierarchy among the tags gives the set of edges. A subgraph $G_n(V_n, E_n)$ of a graph $G(V, E)$ is consistent if $V_n \subseteq V$, $E_n \subseteq E$, and it contains no disconnected parts. Two subgraphs are equivalent if they are isomorphic, the types of the corresponding vertices are the same, and the indices of the edges are the same. We consider a weaker equivalence definition, since requiring all three conditions is a very strong condition: we assert only the equivalence of the types of the vertices, and add a similarity penalty when the other two conditions are not completely respected. The template comparison algorithm and the tag comparison algorithm [20] are as follows:

1. In the initialization step, build a map of compatible pairs of vertices, i.e., tags of the same type. The map consists of the seeds for the next template computation phase. Meta tags are excluded, because they have no effect on the layout.

2. The algorithm starts from the seeds computed in the initialization phase and proceeds by comparing the children of each root iteratively, as long as compatible vertices are found.

3. For each root with a different number of children with respect to its counterpart in the compatibility list, or with different types of children, or with different attributes attached to the vertices, we add a similarity penalty using the weight function $Wt$ of Equation 1.

4. All the computed templates are stored in a template map, which is analysed to extract the best-fitting template in the next step.

5. The template with the maximum number of vertices and the minimum penalty is extracted. We call this coverage criterion Best-Match-Fit-First. The vertices that are part of the extracted template cover the tags of both DOM-trees. This step is repeated until no more templates can be extracted or both DOM-trees are covered.

Tag comparison algorithm:

1. Every tag of the first Web page is compared with every tag of the second Web page.

2. If a match is found, the identified pair of tags is covered.

3. The routine is repeated until every tag of both Web pages is covered, or no more matches are found.

The similarity value is computed using Equation 1, with the $Wt$ function constant at 1 and no penalty considered.
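A minimal Java sketch of the simple tag comparison follows, assuming the tag names have already been extracted from each DOM tree (with meta tags excluded). With $Wt$ constant at 1 and no penalty, the score reduces to covered vertices over total vertices:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the simple tag comparison algorithm: every tag of page one is
// matched against the tags of page two; each matched pair is "covered" and
// removed from further consideration. With Wt constant at 1 and no penalty,
// the similarity reduces to matched vertices over total vertices.
public class TagComparison {
    public static double tagSimilarity(List<String> tagsA, List<String> tagsB) {
        List<String> remaining = new ArrayList<>(tagsB);
        int matched = 0;
        for (String tag : tagsA) {
            if (remaining.remove(tag)) {   // cover the identified pair of tags
                matched++;
            }
        }
        int total = tagsA.size() + tagsB.size();
        return total == 0 ? 0.0 : (2.0 * matched) / total; // matched vertices across both trees
    }
}
```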

4.2. Textual Similarity

In this approach, text data is extracted from the Web page. There are several ways to model a text document. For example, it can be represented as a bag of words, where each word is assumed to appear independently and the order is immaterial. The bag-of-words model is widely used in text mining [18] and information retrieval. Words are counted in the bag, which differs from the mathematical definition of a set. We use the frequency of each word as its weight, which means words that appear more frequently are more important and descriptive for the document. Let $D = \{d_1, \ldots, d_n\}$ be a set of documents and $W = \{w_1, \ldots, w_n\}$ the set of distinct terms occurring in $D$. We discuss more precisely what we mean by "terms" below; for the moment just assume they are words. A document is then represented as an n-dimensional vector $\vec{t}_d$. Let $tf(d, w)$ denote the frequency of term $w \in W$ in document $d \in D$. Then the vector representation of a document $d$ is $\vec{t}_d = (tf(d, w_1), \ldots, tf(d, w_n))$. Although more frequent words are assumed to be more important, as mentioned above, this is not usually the case in practice. For example, words like "a," "the," and "an" are probably the most frequent words that appear in English text, but they are neither descriptive of nor important for the document's subject. With documents represented as vectors, we measure the degree of similarity of two documents as the correlation between their corresponding vectors, which can be further quantified as the cosine of the angle between the two vectors. Terms are basically words, but we applied standard transformations on the basic term vector to represent it as a keyword vector. First, we removed stop words: words that are non-descriptive of the topic of a document, such as "a," "and," "are," and "do." Words were then stemmed using Porter's suffix-stripping algorithm [7], so that words with different endings are mapped into a single word; for example, "connection," "connected," and "connects" are mapped to the stem "connect." In subsequent experiments we use the tf-idf value instead of the absolute term frequency of each term to build the keyword vectors.
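The keyword-vector construction can be sketched as follows; the stop-word list here is abbreviated for illustration, and stem() is a hypothetical placeholder for a Porter stemmer, which in practice would come from a library:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: build a term-frequency map from page text after stop-word removal.
// The stop-word list is abbreviated; stem() stands in for a Porter stemmer,
// which is assumed to be supplied by a library.
public class KeywordVector {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "are", "do", "this"));

    static String stem(String word) { return word; } // placeholder for Porter stemming

    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            tf.merge(stem(token), 1, Integer::sum);
        }
        return tf;
    }

    // Standard tf-idf weight for one term: tf * log(totalDocs / docsContainingTerm).
    public static double tfIdf(int tf, int totalDocs, int docsWithTerm) {
        return docsWithTerm == 0 ? 0.0 : tf * Math.log((double) totalDocs / docsWithTerm);
    }
}
```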

Cosine Similarity:

When documents are represented as keyword vectors, the similarity of two documents corresponds to the correlation between the vectors. This is quantified as the cosine of the angle between the vectors, called the cosine similarity. Cosine similarity is one of the most popular similarity measures applied to text documents, for example in numerous information retrieval applications [21]. Given two documents $\vec{t}_x$ and $\vec{t}_y$, their cosine similarity is:

$$Sim(\vec{t}_x, \vec{t}_y) = \frac{\vec{t}_x \cdot \vec{t}_y}{|\vec{t}_x| \times |\vec{t}_y|} \qquad (2)$$

where $\vec{t}_x$ and $\vec{t}_y$ are m-dimensional keyword vectors over the term set $T = \{w_1, \ldots, w_m\}$. Each dimension represents a term with its weight in the document, which is non-negative. As a result, the cosine similarity is non-negative and bounded in [0,1]. If the cosine similarity is 1, the Web pages are similar; otherwise they are non-similar.
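A minimal Java sketch of Equation 2, with each keyword vector held as a term-to-weight map (term frequency or tf-idf):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of Equation 2: cosine similarity of two keyword vectors held as
// term -> weight maps. Returns a value in [0,1] for non-negative weights.
public class CosineSimilarity {
    public static double cosine(Map<String, Double> tx, Map<String, Double> ty) {
        Set<String> shared = new HashSet<>(tx.keySet());
        shared.retainAll(ty.keySet());          // only shared terms contribute to the dot product
        double dot = 0.0;
        for (String term : shared) {
            dot += tx.get(term) * ty.get(term);
        }
        double normX = norm(tx), normY = norm(ty);
        return (normX == 0 || normY == 0) ? 0.0 : dot / (normX * normY);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
    }
}
```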

4.3. Link Analysis

Hyperlinks inside HTML pages contain a wealth of information about the relationships among Web pages. Kleinberg introduced the concepts of "authorities" and "hubs." His article presented an analysis of the link structure based on the idea that a link recognizes the authority of the other document: those conferring the recognition are called "hubs" and those receiving it are called "authorities" [21]. Our definition of a core is similar to the idea of a hub. In this section we are primarily interested in similarity based on the hyperlink structure among the pages. We propose several hyperlink-based similarity measures below.

Predecessor Check: Our basic assumption is that if two Web pages are similar, there will be other Web pages that reference both of them; the more such pages exist, the more similar the two pages are to each other. If two pages are referenced by many pages simultaneously, then they are similar. Definition: Given a set of Web pages W and two pages x and y in W, the similarity of these two pages is:

$$\sigma_1(x, y) = \frac{|prd(x) \cap prd(y)|}{|prd(x) \cup prd(y)|} \qquad (3)$$

if the denominator is non-zero; otherwise $\sigma_1(x, y) = 0$. We have thus defined a quantitative measure of the similarity of two Web pages, between 0 and 1. If we start with a large set W, the similarity of two Web pages may be relatively small, so apart from its absolute value, one should use the number in a relative way. In this definition and in all subsequent definitions, the denominator in the definition of similarity is called the support of the similarity. For the similarity to be meaningful, the support has to be at a certain level. For example, if there is only one page in W that links to x or y, the similarity may be 1 but with a support of only 1, which means we cannot trust the similarity much, since so few pages reference them. On the other hand, if we have a similarity of 0.1 with a support of 1000, we can be confident that there is a great deal of similarity between the pages. We can also extend our similarity definition from two pages to two sets of pages. Given two subsets A and B of W, the similarity is defined as:

$$\sigma_1(A, B) = \frac{|core(A) \cap core(B)|}{|core(A) \cup core(B)|} \qquad (4)$$

if the denominator is non-zero; otherwise $\sigma_1(A, B) = 0$.

Successor Check: If two Web pages link to the same Web page, we may also consider these two pages similar. This leads us to the second definition.

Definition: Given a set of Web pages W and two pages x and y in W, the similarity of these two pages is:

$$\sigma_2(x, y) = \frac{|suc(x) \cap suc(y)|}{|suc(x) \cup suc(y)|} \qquad (5)$$

if the denominator is non-zero; otherwise $\sigma_2(x, y) = 0$. Similarly, we can extend this similarity definition from two pages to two sets of pages. Given two subsets A and B of W, the similarity is defined as:

$$\sigma_2(A, B) = \frac{|closure(A) \cap closure(B)|}{|closure(A) \cup closure(B)|} \qquad (6)$$

if the denominator is non-zero; otherwise $\sigma_2(A, B) = 0$.
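Equations 3 and 5 share the same Jaccard form, so one sketch covers both; it can be applied to predecessor (inward-link) sets or successor (outward-link) sets alike, with the union size serving as the support discussed above:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of Equations 3 and 5: the Jaccard overlap of two link sets, used
// with predecessor (inward-link) sets and successor (outward-link) sets
// alike. The denominator (union size) is the "support" of the similarity.
public class LinkSimilarity {
    public static double jaccard(Set<String> linksX, Set<String> linksY) {
        Set<String> union = new HashSet<>(linksX);
        union.addAll(linksY);
        if (union.isEmpty()) return 0.0;       // sigma = 0 when the denominator is 0
        Set<String> intersection = new HashSet<>(linksX);
        intersection.retainAll(linksY);
        return (double) intersection.size() / union.size();
    }
}
```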

4.4. Earth Mover’s Distance

As given in [22], EMD is a method to evaluate the dissimilarity, or distance, between two signatures, where a signature consists of a set of features and their corresponding weights. The method comes from the well-known transportation (producer/consumer) problem: producers want to transport their products to consumers. Suppose the distances between each pair of producer and consumer are given and represented in a distance matrix, which is defined before calculating the EMD. All producers produce the same product and all consumers consume the same product. The transportation fee is proportional to both the distance and the weight of the product. The task is to find a flow matrix, whose entries indicate the amount of product to be moved from each producer to each consumer, such that the transported amount is as large as possible and the total transportation fee is minimized. It has been practically proved that EMD is advantageous in representing problems involving multi-featured signatures; it allows partial matches in a very natural way and is especially fit for cognitive distance evaluation. To calculate the EMD, the input is two URLs, and the calculation consists of the following tasks.

Page Processing and Signature Generation:

Our Web page preprocessing approach contains three procedures: 1) obtain the image of a Web page from its URL, 2) perform normalization, and 3) represent the Web page image as a Web page visual signature (consisting of color and coordinate features), which is used to evaluate the visual similarity of a pair of Web pages.

1) Web Page Rendering Process:

The Web page rendering process is the process of displaying a Web page in a Web browser on the screen from the HTML and accessory files (including images, Flash movies, ActiveX plugins, Java applets, etc.). We use webscreencapture to get the Web page images (in PNG format).

2) Perform Normalization:

The images of the original sizes are processed into images of a normalized size (e.g., 10×10). The Lanczos algorithm is used to calculate the resized image because it has very strong antialiasing properties in the Fourier domain and is also easy to compute in the spatial domain. Sharp images can be generated with the Lanczos algorithm, and intuitively, sharp images provide a better signature for identification than others. We use the normalized images to represent the signature of each Web page.
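A sketch of the normalization step is given below. Note that the standard Java 2D API does not ship a Lanczos filter, so bicubic interpolation stands in here as an assumption; a true Lanczos resampler would require an external imaging library:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Sketch of the normalization step: shrink the captured page screenshot to a
// fixed size (e.g. 10x10). The paper uses the Lanczos filter; standard
// Java 2D does not provide Lanczos, so bicubic interpolation stands in here.
public class PageNormalizer {
    public static BufferedImage normalize(BufferedImage page, int w, int h) {
        BufferedImage resized = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = resized.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(page, 0, 0, w, h, null);
        g.dispose();
        return resized;
    }
}
```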

3) Signature Generation:

A signature of an image is a feature vector which can effectively represent the Web page image. The signature of an image in our approach is comprised of features and their corresponding weights. A feature is comprised of a color and the centroid of that color's position distribution in the image.

The color of each pixel in the resized images is represented using the ARGB (alpha, red, green, and blue) scheme with 32 bits, so a color can be represented as a 4-tuple $\langle A, R, G, B \rangle$. However, this is a huge color space, which includes $2^{32} = 4{,}294{,}967{,}296$ colors. In practice, we use a degraded color space to represent the signature of an image. Define the Color Degrading Factor (DF) to be the scale at which each color component changes; we thus have $(2^8/DF)^4$ colors in our degraded color space. A degraded color can be represented as:

$$\langle A - (A \bmod DF),\ R - (R \bmod DF),\ G - (G \bmod DF),\ B - (B \bmod DF) \rangle$$

For example, when DF = 32, we have 4,096 colors in the degraded color space. The centroid of each degraded color is calculated as

$$Cen_{dc} = \sum_{i=1}^{N_{dc}} Cen_{dc,i} / N_{dc}$$

where $Cen_{dc}$ is the centroid of degraded color dc, $Cen_{dc,i}$ is the coordinate of the i-th pixel that has degraded color dc, and $N_{dc}$ is the total number of pixels that have degraded color dc, i.e., the frequency of dc. A feature F with degraded color dc can be represented by dc and $Cen_{dc}$: $F = \langle dc, Cen_{dc} \rangle$. The weight corresponding to this feature is the color's frequency $N_{dc}$. A complete signature $S_c$ is represented as

$$S_c = \langle \langle F_{dc1}, N_{dc1} \rangle, \langle F_{dc2}, N_{dc2} \rangle, \ldots, \langle F_{dcN}, N_{dcN} \rangle \rangle$$

where N is the total number of degraded colors. The feature-weight tuples in $S_c$ are ranked in descending order of their weights. In our approach, we do not use all of the features: we choose the first $N_{si}$ most frequent colors in $S_c$ to be the signature, where $N_{si} \le N$, and we denote it as $S_{si}$. When N is less than $N_{si}$, $S_c$ itself is taken as $S_{si}$.
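A minimal sketch of the signature generation follows, degrading each ARGB pixel by DF and accumulating per-color frequencies and centroids; the map layout used here is an illustrative choice, not the system's actual data structure:

```java
import java.awt.image.BufferedImage;
import java.util.HashMap;
import java.util.Map;

// Sketch of signature generation: degrade each ARGB pixel by the Color
// Degrading Factor DF, accumulate per-color pixel counts (the weights) and
// coordinate sums, then convert the sums to centroids.
public class SignatureGenerator {
    public static Map<Integer, double[]> signature(BufferedImage img, int df) {
        // degraded color -> {count, sumX, sumY}
        Map<Integer, double[]> acc = new HashMap<>();
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int argb = img.getRGB(x, y);
                int a = (argb >>> 24) & 0xFF, r = (argb >>> 16) & 0xFF;
                int g = (argb >>> 8) & 0xFF, b = argb & 0xFF;
                // component - (component mod DF), applied to A, R, G, B
                int dc = ((a - a % df) << 24) | ((r - r % df) << 16)
                       | ((g - g % df) << 8) | (b - b % df);
                double[] e = acc.computeIfAbsent(dc, k -> new double[3]);
                e[0]++; e[1] += x; e[2] += y;
            }
        }
        // convert coordinate sums to centroids: Cen = sum(Cen_i) / N_dc
        for (double[] e : acc.values()) {
            e[1] /= e[0];
            e[2] /= e[0];
        }
        return acc; // the N_si most frequent entries would form the signature S_si
    }
}
```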

Computing Visual Similarity From EMD:

We use EMD to calculate the similarity of two Web pages based on their signatures as follows. The distance matrix $D_s = [dm_{ij}]$, $1 \le i \le m$ and $1 \le j \le n$, is defined in advance in a straightforward way: first calculate the normalized Euclidean distance of the degraded ARGB colors, then the normalized Euclidean distance of the centroids, and add the two distances with weights a and b respectively, where $a + b = 1$, to form the feature distance.

Suppose we have features $\phi_i = \langle dc_i, Cen_i \rangle$ with $dc_i = \langle dA_i, dR_i, dG_i, dB_i \rangle$ and $\phi_j = \langle dc_j, Cen_j \rangle$ with $dc_j = \langle dA_j, dR_j, dG_j, dB_j \rangle$. The maximum color distance is $MD_{colr} = \| \langle MaxA - 0, MaxR - 0, MaxG - 0, MaxB - 0 \rangle \|$, where MaxA, MaxR, MaxG, and MaxB are the maximum values of the four ARGB components in the specified color space, and the maximum centroid distance is $MD_{cen} = \sqrt{w^2 + h^2}$, where w and h are the width and height of the resized images, respectively.

The normalized color distance $ND_{colr}(dc_i, dc_j)$ is defined as

$$\sqrt{(dc_i - dc_j)(dc_i - dc_j)^T} / MD_{colr} \qquad (7)$$

The normalized centroid distance $ND_{cen}(Cen_i, Cen_j)$ is defined as

$$\sqrt{(Cen_i - Cen_j)(Cen_i - Cen_j)^T} / MD_{cen} \qquad (8)$$

The normalized feature distance $ND_{feature}(\phi_i, \phi_j)$ between $\phi_i$ and $\phi_j$ is defined as

$$a \cdot ND_{colr}(dc_i, dc_j) + b \cdot ND_{cen}(Cen_i, Cen_j) \qquad (9)$$
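Equations 7 through 9 can be computed directly; the following sketch assumes the degraded colors are held as four-component arrays and the centroids as (x, y) pairs:

```java
// Sketch of Equations 7-9: the normalized feature distance between two
// signature features, each a degraded ARGB color plus a centroid. maxA..maxB
// are the maxima of the color components; w and h are the resized image
// dimensions; a + b = 1 weights the two parts.
public class FeatureDistance {
    public static double distance(int[] dcI, double[] cenI,
                                  int[] dcJ, double[] cenJ,
                                  int maxA, int maxR, int maxG, int maxB,
                                  int w, int h, double a, double b) {
        double mdColr = Math.sqrt((double) maxA * maxA + (double) maxR * maxR
                                + (double) maxG * maxG + (double) maxB * maxB);
        double mdCen = Math.sqrt((double) w * w + (double) h * h);
        double colr = 0.0;
        for (int k = 0; k < 4; k++) {
            double d = dcI[k] - dcJ[k];
            colr += d * d;
        }
        double ndColr = Math.sqrt(colr) / mdColr;                      // Eq. (7)
        double dx = cenI[0] - cenJ[0], dy = cenI[1] - cenJ[1];
        double ndCen = Math.sqrt(dx * dx + dy * dy) / mdCen;           // Eq. (8)
        return a * ndColr + b * ndCen;                                 // Eq. (9)
    }
}
```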

Thus $D_s = [dm_{ij}]$, where $dm_{ij} = ND_{feature}(\phi_i, \phi_j)$, can be calculated before performing the EMD calculation. Suppose we have signatures $S_{s,x}$ with m features and $S_{s,y}$ with n features. The flow matrix $F_{xy} = [fm_{ij}]$, $1 \le i \le m$ and $1 \le j \le n$, can be calculated through linear programming, and the EMD between $S_{s,x}$ and $S_{s,y}$ is:

$$EMD(S_{s,x}, S_{s,y}, D) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} fm_{ij} \cdot dm_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} fm_{ij}} \qquad (10)$$

The EMD-based visual similarity of two images is defined as:

$$VS(S_{s,x}, S_{s,y}) = 1 - [EMD(S_{s,x}, S_{s,y}, D)]^{\alpha} \qquad (11)$$

where $\alpha \in (0, +\infty)$ is the visual similarity amplifier. We use $\alpha$ to make the visual similarity better distributed in (0,1), rather than too dense at either side, without affecting the ranking relationship of the visual similarity values of the Web pages.
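Given a flow matrix from a linear-programming transportation solver (not shown here), Equations 10 and 11 reduce to simple arithmetic, as in this sketch:

```java
// Sketch of Equations 10 and 11: given a flow matrix (which in the full
// system comes from a linear-programming transportation solver), compute
// the EMD as the fee-to-flow ratio and map it to a visual similarity with
// the amplifier alpha.
public class EmdSimilarity {
    public static double emd(double[][] flow, double[][] dist) {
        double fee = 0.0, totalFlow = 0.0;
        for (int i = 0; i < flow.length; i++) {
            for (int j = 0; j < flow[i].length; j++) {
                fee += flow[i][j] * dist[i][j];
                totalFlow += flow[i][j];
            }
        }
        return totalFlow == 0 ? 0.0 : fee / totalFlow;   // Eq. (10)
    }

    public static double visualSimilarity(double emd, double alpha) {
        return 1.0 - Math.pow(emd, alpha);               // Eq. (11)
    }
}
```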

5. Programmer's Design

The aim of Web page similarity assessment is to compare two Web pages based on four methods, viz. layout similarity, textual similarity, link analysis, and EMD, and to generate a similarity report. This aim is satisfied by the following steps (a code sketch of this cascade follows the list).

1. Input the two URLs.

2. Extract the DOM tree of both URLs and compute the layout similarity with the tag and template comparison algorithms; if the layouts are similar, go to the next step, else go to step 6.

3. Perform textual similarity assessment of the Web pages using cosine similarity; if they match, go to the next step, else go to step 6.

4. Perform hyperlink structure analysis of both Web pages to compare inward and outward links; if they are the same, go to the next step, else go to step 6.

5. Preprocess both Web pages, generate their signatures, and apply EMD. If EMD = 0, generate the report that the Web pages are visually similar; else go to step 6.

6. Generate the report that the Web pages are not similar.
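A minimal sketch of this cascade is given below; the four check methods are hypothetical stand-ins for the components of Sections 4.1 through 4.4:

```java
// Sketch of the cascaded pipeline from the steps above: each cheaper check
// gates the next, and image-based EMD runs only when layout, text, and link
// analysis all report a match. The four check methods are hypothetical
// placeholders for the components described in Sections 4.1-4.4.
public class SimilarityPipeline {
    public static String assess(String urlA, String urlB) {
        if (!layoutSimilar(urlA, urlB))  return "Not similar (layout)";
        if (!textSimilar(urlA, urlB))    return "Not similar (text)";
        if (!linksSimilar(urlA, urlB))   return "Not similar (links)";
        double emd = imageEmd(urlA, urlB);
        return emd == 0.0 ? "Visually similar" : "Not similar (EMD = " + emd + ")";
    }

    // Placeholders for the methods of Sections 4.1-4.4.
    static boolean layoutSimilar(String a, String b) { return true; }
    static boolean textSimilar(String a, String b)   { return true; }
    static boolean linksSimilar(String a, String b)  { return true; }
    static double  imageEmd(String a, String b)      { return 0.0; }
}
```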

5.1. Mathematical Model

5.1.1. Set Theory

1. Let S be a system that describes visual similarity assessment for two Web pages: S = {...}.

2. Identify the input as Ui: S = {Ui, ...}, where Ui = {U1, U2} and each U's elements are the Web page's tags, subgraphs, text, hyperlinks, and pixels.

3. Identify the output as O: S = {Ui, O, ...}, where O = {Oi} and Oi is the result of the Web page comparison.

4. Identify the process as P: S = {Ui, O, P, ...}, where P = {Ls, Ts, Hs, Is}.

5. Ls = {De, Tg, Tm, Lr}, where:

• De: DOM tree of the Web page; extracts the HTML tags as input to Tg.

• Tg: simple tag comparison algorithm to count and compare the HTML tags of both Web pages.

• Tm: template comparison algorithm to compare subgraphs.

• Lr: layout similarity report.

6. Ts = {V, Cs, Tr}, where:

• V: vector formed by extracting the text, removing stop words, and applying the Porter stemmer algorithm; it consists of a set of keywords.

• Cs: cosine similarity applied to the vectors.

• Tr: text similarity report.

7. Hs = {Hi, Ho, Hr}, where:

• Hi: compute the inward links.

• Ho: compute the outward links.

• Hr: link analysis report.

8. Is = {Wp, Gs, E, R}, where:

• Wp: Web page preprocessing, Wp = {Wi, In}. Wi is the Web page converted into an image by the webscreenshot API; In is the normalized image, resized to 10×10 pixels by the Lanczos algorithm.

• Gs: signature generation for an input image, Gs = {Nc, Nd, Fv}. Nc is the normalized color, Nd the normalized centroid, and Fv the feature vector formed from Nc and Nd, which represents the signature.

• E: Earth Mover's Distance, applied to Ed and Fv, where Ed is the Euclidean distance.

• R: the final report.

9. Identify the failure cases as F: S = {Ui, O, P, F, ...}. Failure occurs on incorrect input URLs or on Internet connection failure.

10. Identify the success cases as Sc: S = {Ui, O, P, F, Sc, ...}: the correct report is generated.

11. Identify the initial condition as So: S = {Ui, O, P, F, Sc, So}. An Internet connection must be available before starting the system.

5.1.2. Mathematical Representation

Let S be the system:

S = {Ui, O, Ls, De, Tg, Tm, Lr, Ts, V, Cs, Tr, Hs, Hi, Ho, Hr, Is, Wp, Gs, E, R, F, Sc, So}

where:

Ui = set of valid URLs.
O = set of output results.
Ls = layout similarity of the Web pages.
De = DOM tree extraction of a given input URL.
Tg = comparison of the HTML tags present on the Web pages.
Tm = template comparison.
Lr = layout comparison report.
Ts = text similarity of the Web pages.
V = vector of keyword frequencies.
Cs = cosine similarity.
Tr = text similarity report.
Hs = hyperlink similarity.
Hi = inward links.
Ho = outward links.
Hr = hyperlink similarity report.
Is = image-based similarity.
Wp = Web page preprocessing.
Gs = signature generation.
E = Earth Mover's Distance.
R = final report.
F = set of failure states.
Sc = set of success states.
So = set of initial states.

5.2. Data independence and Data Flow architecture

The data flow architecture is as shown in Figure 1. The input data for the system is two Web pages, and the produced output data is the comparison report for them.

Figure 1: System Architecture

5.3. Multiplexer Logic

In this system, after the two URLs are entered, the first method, layout comparison, is executed; whether the next method runs depends on its result. Only if the layout comparison report is similar do we perform the text comparison; otherwise the final result is generated. The same holds for the remaining methods, i.e., link analysis and image-based EMD comparison. After all the methods have run, the final result is generated. Each method can also be executed separately to generate an individual result.

5.4. Turing Machine

The Turing machine for the system is shown in Figure 2.

Figure 2: State diagram

0 = Read URLs.
1 = DOM tree extraction.
2 = Layout similarity assessment.
3 = Textual similarity assessment.
4 = Hyperlink structure analysis.
5 = Image-based EMD comparison.
6 = Stop and generate result.

6. Results and Discussion

To evaluate the system, we check it on visually similar Web pages. We can also compare the result of each method with the others, as well as with the existing system. To get URLs which are visually similar, we can use www.phishtank.com. The results of the first two methods are compared with the System Analyser tool, as shown in Figure 3.

Figure 3: Result comparison

7. Conclusion

Web page visual similarity assessment is based on the following features of Web pages: HTML tags, templates, text, hyperlinks, and pixels, which increases the effectiveness of the result. The system is a combination of layout, textual, hyperlink, and image-based EMD comparisons, and the algorithms used for these comparisons are feasible. The system can compare Web pages in all manners, viz. visual content, textual content, etc. It is applicable to various problems such as phishing detection and Web page archiving. The system could be extended to applications such as copyright checking, Web search engines, and truth discovery of Web pages.



