Intelligent Web Application To Timeline Data


02 Nov 2017


The internet is a huge source of unstructured information, and web mining is used to organize the vast amounts of data available online. In this paper we propose a technique to retrieve data and present it to clients in the form of a timeline. For crawling the web, we use the efficient Boyer-Moore text-searching algorithm, which reduces the number of comparisons and returns the matching links. From these links we extract the required data and cluster it according to time period; in this way the data can be arranged on a timeline.

Keywords: Web Crawler, Data Extraction, Data Clustering, Timelining, K-means Algorithm, Tag Identification, Boyer-Moore Algorithm

I. INTRODUCTION

The main objective of this paper is to develop an application that provides answers to the client in a structured manner by categorizing the results, since otherwise the user has to sift through scores of pages to find the information he or she desires. Taking one seed URL as input and searching by keyword, the web pages containing the keyword are retrieved, and the data is clustered on the basis of time/date. Clustering improves the efficiency and quality of retrieval so that the results satisfy the user's needs, making the search specific enough that the client can view the changes or progress of a company or country over a period of time.

Web crawlers are small programs that "browse" the web on the search engine's behalf. The programs are given URLs, whose pages they retrieve from the web. The crawler extracts the URLs appearing in the retrieved pages and passes this information to the crawler control module, which determines what links to visit next and feeds those links back to the crawlers [1]. The web crawler collects information about the website and its links: the website URL, the page title, the meta tag information, the web page content, and the links on the page and where they go. A focused crawler is developed to extract only the relevant web pages. The crawler uses pattern-matching algorithms and counts the number of times the input text occurs in the text found at a link [1]. Using the Boyer-Moore pattern-matching algorithm, we can search for our keyword in different web pages.
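As a sketch, the crawl-and-rank loop described above might look as follows. The `fetch` callable, the naive keyword-count scoring, and all names here are illustrative assumptions, not the paper's actual implementation:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags -- the links the crawler
    hands back to the control module."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

def crawl(seed, fetch, keyword, max_pages=10):
    """Visit pages starting from the seed URL, score each page by how
    often the keyword occurs in it, and return URLs ranked by score.
    fetch(url) stands in for a real HTTP request."""
    frontier, seen, scores = [seed], set(), {}
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        scores[url] = page.lower().count(keyword.lower())
        collector = LinkCollector()
        collector.feed(page)
        frontier.extend(collector.links)   # feed new links back to the frontier
    return sorted(scores, key=scores.get, reverse=True)
```

In the full system, the scoring step would use the Boyer-Moore search of Section II rather than a plain substring count.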

Data extraction is the act or process of retrieving data out of data sources for further data processing or data storage (data migration).

This paper suggests how to extract data from tables and lists in web pages. Tokenization of the web pages [4] is done, and the pages are analysed to find their common structure. During tokenization, the text of each web page is split into individual words, or more accurately tokens, and each token is assigned one or more syntactic types based on the characters appearing in it. Columns are then identified, followed by rows. Tag identification is the process in which tags are identified in the web pages and data is extracted using those tags. The extraction of data from tables, grouping it into rows and columns [4], is done automatically, making some general assumptions about the structure of the data. The approach allows for tables with irregular structure, and even when such irregularities exist, the quality of the extracted data is not hampered. It requires several pages to be analysed before data can be extracted from a single list.

The data that is segregated by row according to the previous section [4] has to be clustered on the basis of time. Here we use the k-means algorithm [9] to cluster the data. Initially we choose a few centroids, and every record is considered a separate cluster. The distance between each centroid and each record is calculated using the distance formula, and groups are formed on the basis of these distances. New centroids are then calculated, and this procedure is repeated until the groups remain the same in successive iterations. In this way we can effectively cluster the data.

The WWW is a huge collection of data. Whenever a user searches for a keyword, the user's focus is usually on the most recently updated information, yet the search engine returns a long list of links, and it takes the user time to find the data of interest. Using timelining in our application, we try to address this problem. The information given to the client is categorized on the basis of time, i.e. the data is clustered by time/date. In our application the main focus is the Indian cricket team, so when the user enters a player name, for example M. S. Dhoni, the user can view the changes or progress in the profile of M. S. Dhoni over a period of time.


II. CRAWLER

In the Boyer-Moore algorithm [1], the pattern is scanned from right to left while proceeding through the text. Boyer-Moore [1] uses two different pre-processing strategies to determine the smallest possible shift; each time a mismatch occurs, the algorithm computes both and then chooses the larger shift, thus using the more effective strategy for each individual case.

The first strategy is the "bad character" heuristic [1]. It concentrates on the "bad character" in the text, the one that causes the mismatch. If that character is not contained in the pattern P at all, the pattern can be shifted past it; if it occurs somewhere in the pattern, we find the rightmost appearance of the bad character in the pattern and align it against the text.

Auxiliary function for the "bad character" heuristic:

For each character of the alphabet, we determine its rightmost occurrence in the pattern and write the result into an array. Then, each time a mismatch occurs, we look up the "last" value for the bad character and find out how far the pattern can be shifted to the right. This gives a simple algorithm [1] that uses only the bad-character heuristic.

First we check whether the pattern is longer than the text. We set the text and pattern pointers to the starting point, which is the rightmost character of the pattern, and then compare. When j equals m-1 matched characters, we know we have found the full match and return the position of the valid shift. If not, j and i are decremented and we continue the comparison. When the text and pattern characters do not match, the auxiliary function is called: we determine the rightmost occurrence of the bad character in the pattern and modify j and i accordingly. If we have examined all valid shifts and no match has been found, the pattern does not appear in the text and we return -1.
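The procedure just described can be sketched in Python. This is a minimal bad-character-only variant (the function names are our own); it returns the position of the first match, or -1 when the pattern does not occur:

```python
def build_last_occurrence(pattern):
    """Rightmost index of each character appearing in the pattern."""
    return {ch: idx for idx, ch in enumerate(pattern)}

def boyer_moore_search(text, pattern):
    """Return the index of the first match of pattern in text, or -1.
    Bad-character heuristic only; a sketch, not the paper's code."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    last = build_last_occurrence(pattern)
    i = j = m - 1          # i scans the text, j scans the pattern
    while i < n:
        if text[i] == pattern[j]:
            if j == 0:
                return i   # full match: report the valid shift
            i -= 1
            j -= 1
        else:
            # jump past the rightmost occurrence of the bad character
            i += m - min(j, 1 + last.get(text[i], -1))
            j = m - 1
    return -1
```

On the running example, `boyer_moore_search("Best ever century by Sachin", "Sachin")` locates the pattern after only a handful of comparisons.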

Example:

Pattern: Sachin

Text: Best ever century by Sachin.

There is a table shift[128] that stores, for every character that can occur, its respective shift value.

The letters 'B', 'E', 'V' and so on are not present in the pattern "SACHIN".

So when these letters are found, the shift will be the length of the pattern "SACHIN", i.e. 6. Hence we reduce the number of computations.

When letters such as 's', 'a', 'c', 'h', 'i', 'n' are found, their respective shifts are performed.

In the text above:

- Best ever century by Sachin
- Best ever century by Sachin
- Best ever century by Sachin // 'n' found, so shift to the left one by one, as 'n' is the last letter of the pattern
- Best ever century by Sachin // 'e' found, so shift by 6
- Best ever century by Sachin // space found, so shift by 6
- Best ever century by Sachin // 'c' found, so first shift to the right by 3
- Best ever century by Sachin // now shift to the left and compare with the pattern
- Best ever century by Sachin // pattern found

The more often the pattern is matched on a page, the higher that page is ranked, and its links are returned accordingly.

Fig. 1. State Diagram for the Boyer-Moore Algorithm

Refer to Fig. 1 for the following explanation. The input to the system is a string; the text to be searched for is the pattern. Max is the length of the pattern; Min is looked up in a table. The characters of the string are compared in state s1. If the character is a bad character, i.e. it is not in the pattern, the machine moves to state s2, else to state s3. If it moves to s2 it shifts by Max, else it shifts by Min, which leads to state s4, where we get a character. This character is compared: we reach state s6 if the pattern is not matched, else s7. If the pattern is matched, state s8 is reached, wherein a shift to the left is performed. It again compares with the pattern: if matched then s7, else s6, and back to s1.

Formula for the shift calculation:

shift[character] = min{ s | s = maxlength, or (1 <= s < maxlength and pattern[maxlength - s] = character) }

A finite automaton is a 5-tuple (Q, q0, Sigma, F, delta), where

Q = {q1, q2, q3, q4, q5, q6, q7, q8, q9}

q0 = q1

Sigma = {bad character (B), good character (G), end of text ($)}

F = {q9}

delta = { delta(q1, B) = q2,
delta(q1, G) = q3,
delta(q3, G) = q4,
delta(q3, B) = q4,
delta(q4, G) = q5,
delta(q4, B) = q5,
delta(q5, G) = q7,
delta(q5, B) = q6,
delta(q6, $) = q9,
delta(q6, B) = q1,
delta(q7, G) = q8,
delta(q8, G) = q5,
delta(q8, B) = q1,
delta(q8, $) = q9 }

The language accepted is L = { x in A* | there exists u in A* such that ux = w }, where A is the alphabet and w is a word.

Example: Input: {Best ever Sachin}, pattern: Sachin, Max = 6.

Min for 's' = 5, Min for 'a' = 4, Min for 'c' = 3, Min for 'h' = 2, Min for 'i' = 1, Min for 'n' = 0.
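The Min values above can be precomputed as a shift table. This sketch (names are our own) fills a 128-entry array in which each character in the pattern gets its distance from the pattern's end, and absent characters get the full pattern length:

```python
def build_shift_table(pattern, alphabet_size=128):
    """shift[c] = distance from the rightmost occurrence of character c
    in the pattern to the end of the pattern; characters absent from
    the pattern get the full pattern length."""
    m = len(pattern)
    shift = [m] * alphabet_size
    for idx, ch in enumerate(pattern):
        shift[ord(ch)] = m - 1 - idx
    return shift

table = build_shift_table("Sachin")
# 'B', 'E', 'v' etc. are absent from the pattern, so they shift by 6;
# 'S' gets 5, 'a' gets 4, ..., 'n' gets 0, matching the Min values above.
```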

Fig. 2. Transitions for the string "Best Ever Sachin"

As in Fig. 2, the first state is s1, wherein the character 'S' is obtained. It is a bad character, hence shift by 6 (s2). We reach state s4, wherein we get the character 'v'; it is compared against the pattern, and as it does not match, state s6 is reached, which leads back to state s1. Again a bad character is obtained, hence shift by 6. We get the character 'c' (s3). By shifting by Min we get the character 'n' (s4). Compared with the pattern, it matches, hence state s7. Now shift to the left by 1 (s8) and compare with the pattern (s5). The loop s5, s7, s8 continues until the pattern is matched. The program exits on completion of the text, else it goes back to state s1.

III. DATA EXTRACTION

Finding the page template: During the tokenization [4] step, the text of each web page is split into individual words, or more accurately tokens, and each token is assigned one or more syntactic types based on the characters appearing in it. Thus, a token can be an HTML token, an alphanumeric token, or a punctuation token. An alphanumeric token may also belong to one of two subcategories, numeric or alphabetic; alphabetic tokens are further divided into capitalized or lowercased types, and so on.
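A minimal tokenization sketch along these lines follows; the regular expression and the type names are illustrative assumptions, not the scheme of [4]:

```python
import re

# Split a page into HTML tags, words/numbers, and punctuation tokens.
TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")

def token_types(token):
    """Assign one or more syntactic types to a token, general to specific."""
    if token.startswith("<") and token.endswith(">"):
        return ["html"]
    if token.isdigit():
        return ["alphanumeric", "numeric"]
    if token.isalpha():
        case = "capitalized" if token[0].isupper() else "lowercased"
        return ["alphanumeric", "alphabetic", case]
    return ["punctuation"]

def tokenize(page):
    """Return (token, types) pairs for every token on the page."""
    return [(tok, token_types(tok)) for tok in TOKEN_RE.findall(page)]
```

For example, `tokenize("<td>natwest</td> 245")` classifies `<td>` as an HTML token, `natwest` as a lowercased alphabetic token, and `245` as a numeric token.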

Pages are represented as

<html>

<head> Sachin Tendulkar </head>

<table>

<tr>

<th>series</th>

<th>score</th>

<th> year</th>

</tr>

<tr>

<td>natwest</td>

<td> 245</td>

<td>2003</td>

</tr>

<tr>

<td>ind-pak</td>

<td> 745</td>

<td>2007</td>

</tr>

<tr>

<td>indo-pak</td>

<td> 453</td>

<td>2003</td>

</tr>

</table>

</html>

All the tags, such as <html>, <head>, and <table>, and words such as "natwest" and "indo-pak" series are also tokenized.

Identifying columns [4]: We expect all data in the same column to be of the same type, e.g., book prices; therefore, we may be able to identify columns by grouping extracts by similarity of content. In addition to content, layout hints and separators may be useful evidence for arranging extracts by column. The algorithm below is from reference [4].

begin
  for each x in X
    (Vx)1 = leftseparator(x)
    (Vx)2 = rightseparator(x)
  end for
  C = cluster(V)
  P = null
  for each c in C
    addpatterns(patterns(c), P)
  end for
  for each x in X
    for i = firstpattern(P) to lastpattern(P)
      if matches(x, pattern(i, P))
        (Vx)i+2 = 1
      else
        (Vx)i+2 = 0
      end if
    end for
  end for
end

With the above algorithm, the fields appearing in different <td> or <th> tags are clustered separately.


After the clusters are formed, we check whether any clusters have similar behaviour; e.g. "series" and "natwest" behave similarly and so should be grouped together: after the <tr> tag, the first <th> or <td> tag contains these two texts, hence they are grouped together.

Identifying rows [4]: The rows follow the same pattern of representation (after the series, the score and the year are mentioned), so this pattern can be used to group the rows together. After the columns are clustered, it becomes easy to identify the rows.
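For the sample page shown earlier, the header (<th>) and data (<td>) rows can be pulled apart with a small parser. This sketch assumes a single well-formed table and is not the paper's actual extractor:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect <th> cells into a header list and <td> cells into rows."""
    def __init__(self):
        super().__init__()
        self.header, self.rows = [], []
        self._row, self._cell = None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = tag

    def handle_data(self, data):
        if self._cell and data.strip():
            self._row.append((self._cell, data.strip()))

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # a row of <th> cells is the header; <td> cells are data
            if all(kind == "th" for kind, _ in self._row):
                self.header = [text for _, text in self._row]
            else:
                self.rows.append([text for _, text in self._row])
            self._row = None

page = ("<table><tr><th>series</th><th>score</th><th>year</th></tr>"
        "<tr><td>natwest</td><td>245</td><td>2003</td></tr>"
        "<tr><td>ind-pak</td><td>745</td><td>2007</td></tr></table>")
p = TableExtractor()
p.feed(page)
# p.header -> ['series', 'score', 'year']
# p.rows   -> [['natwest', '245', '2003'], ['ind-pak', '745', '2007']]
```

The last column of each extracted row (the year) is what the clustering stage of Section IV consumes.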

IV. DATA CLUSTERING

Input: N objects to be clustered {x1, x2, ..., xN}; the number of clusters k.

Output: k clusters such that the sum of the dissimilarities between each object and its nearest cluster centre is smallest.

1) Arbitrarily select k objects as the initial cluster centres (z1, z2, ..., zk).

2) Calculate the distance [9] between each object xi and each cluster centre zj, then assign each object to the nearest cluster. The formula for calculating the distance is:

d(xi, zj) = sqrt( sum over l of (xil - zjl)^2 ), i = 1...N; j = 1...k;

where d(xi, zj) is the distance between object i and cluster centre j.

3) Calculate the mean of the objects in each cluster as the new cluster centre [9]:

zj = (1/nj) * sum of xi over all xi in cluster j, j = 1, 2, ..., k;

where nj is the number of samples in the current cluster j.

4) Repeat steps 2) and 3) until the criterion function E converges, then return (z1, z2, ..., zk); the algorithm terminates [9].
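Steps 1) to 4) can be sketched for one-dimensional data such as years (function and variable names are our own, not from [9]):

```python
def kmeans_1d(objects, centres, max_iter=100):
    """Plain k-means on scalar values, following steps 1-4 above."""
    clusters = [[] for _ in centres]
    for _ in range(max_iter):
        # step 2: assign each object to the nearest cluster centre
        clusters = [[] for _ in centres]
        for x in objects:
            j = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
            clusters[j].append(x)
        # step 3: recompute each centre as the mean of its cluster
        new_centres = [sum(c) / len(c) if c else centres[j]
                       for j, c in enumerate(clusters)]
        # step 4: stop when the centres no longer change
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

centres, clusters = kmeans_1d([2003, 2007, 2003], [2003.0, 2007.0])
# the two 2003 records fall into one cluster, 2007 into the other
```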

Example:

m: the set of initial cluster centres, containing k objects.

x: set of objects to be clustered.

m = {1990, 1994, 1998, 2002, 2006, 2010}

k=6

x = {2003, 2007, 2003}

For Iteration0,

Centroids: 2003, 2007, 2003

Distance matrix:

We can observe that G0 = G1, which shows that the objects no longer move between groups. Thus the k-means clustering has reached stability and no more iterations are needed.

The following diagram shows how the clustering takes place. The rows to be clustered are provided as input. At S1 the clustering process starts. We initially consider that we have k clusters. Then we calculate the distance between each cluster centre/centroid and the object, reaching state S2. At state S3 we compare the distances. If the object has the same distance to an existing cluster centre, we add it to that cluster; otherwise we form a new cluster. We then check the next element in the row. This process continues until the input ends, at which point we receive the clustered data.

Fig.3 State Transition Diagram for Data Clustering

For example,

In the above example our centroids are the years. The distance between rows is calculated on the basis of the year. The three rows have the years 2003, 2007, and 2003, i.e. the input is {2003, 2007, 2003}, with 2003 as one centroid.

Input :{ 2003, 2007, 2003}

Output: {{2003, 2003}, {2007}}

5

Fig. 4. Transitions for Data Clustering

Initially there are three clusters. The first element in the row is 2003, so its distance to the centroid is d = 0 and a cluster is formed. The next input is 2007; its distance from the centroid is d = 4, so a new cluster is formed. The next element in the row is 2003, so d = 0 and we combine the first and last elements into one cluster for 2003. Thus the 1st and 3rd clusters are combined, as the year is the same, and the 2nd cluster stands alone. Hence the records with 2003 as the year are clustered together, as the distance between them is smallest. In this way we obtain the clustered data in the form of a timeline.

V. FUTURE WORK

The proposed intelligent web application can be integrated by various organizations, companies, the stock market, politics, or any other sector to timeline their respective data. Because of the timeline representation, the application provides an interesting and interactive interface to the user.

VI. CONCLUSION

Thus, as described above, we can successfully cluster data and present it in a timeline format. The system works as follows: the user enters the text to be searched, the text is searched on cricket websites, the links of matching web pages are returned, the data is extracted and clustered, and this data is presented to the user.

REFERENCES

[1] P. Gupta and K. Johari, "Implementation of Web Crawler," Lingaya's University / Center for Development of Advanced Computing, NOIDA, Second International Conference on Emerging Trends in Engineering and Technology (ICETET-09), IEEE, 2009.

[2] Anuradha and A. K. Sharma, "A Novel Technique for Data Extraction from Hidden Web Databases," International Journal of Computer Applications (0975-8887), Vol. 15, No. 4, February 2011. YMCA University of Science & Technology, Faridabad, India.

[3] X. Zheng, Y. Gu, and Y. Li, "Data Extraction from Web Pages Based on Structural-Semantic Entropy," WWW 2012 Companion, April 16-20, 2012, Lyon, France. School of Computer Science, Fudan University, Shanghai, China.

[4] K. Lerman, C. Knoblock, and S. Minton, "Automatic Data Extraction from Lists and Tables in Web Sources," Information Sciences Institute, University of Southern California / Fetch Technologies.

[5] G. Gardarin, "Gradual Clustering Algorithms for Metric Spaces," PRiSM Laboratory, University of Versailles, France.

[6] A. De Lucia, M. Risi, G. Scanniello, and G. Tortora, "Automatic Clustering of Similar Web Pages," University of Salerno / University of Basilicata, Italy.

[7] R. Etemadi and N. Moghaddam, "An Approach in Web Content Mining for Clustering Web Pages," IEEE, 2010. Islamic Azad University, Bonab / Tarbiat Modares University, Tehran, Iran.

[8] O. H. Odukoya, G. A. Aderounmu, and E. R. Adagunodo, "An Improved Data Clustering Algorithm for Mining Web Documents," Obafemi Awolowo University, Ile-Ife, Nigeria.

[9] J. Wang and X. Su, "An Improved K-means Clustering Algorithm," China University of Mining & Technology, Xuzhou, China.


