Algorithm For Organizational Page Evaluation


02 Nov 2017


Abstract: Automatic classification of web pages is an effective way to deal with the difficulty of retrieving information from the Internet. The long-term goal is to incorporate web page filtering into the search process to improve the quality of search results. We propose a method to distinguish home pages from non-home pages and to classify home pages as personal, corporate or organization home pages. Results indicate that the classifier can distinguish home pages from non-home pages and, within the home page genre, personal from corporate home pages. Organization home pages, however, were more difficult to distinguish from personal and corporate home pages.

Keywords: web page, home page, non-home page.

INTRODUCTION

As the World Wide Web continues to grow exponentially, researchers and search engine companies continue to look for techniques that will improve the quality of search results. The proposed software aims to improve search quality by automatically identifying home pages and the type of home page (sub-genre). A web classifier is used to distinguish home pages from non-home pages and to classify home pages as personal, corporate or organization home pages.

Fig. 1. Automatic Web classifier

SCOPE

It appears that organization home pages do not have a style that is unique to them, whereas personal and corporate home pages each have a more distinctive style. Organization home pages can look like either a personal or a corporate home page, depending on who creates the page. A number of open research questions remain to be investigated in this area. A machine learning model could be developed; for example, a neural net classifier could be trained to distinguish home pages from non-home pages and to classify home pages as personal, corporate or organization home pages. With the growing importance of the web as a repository of information, it is important to develop mechanisms that improve the quality of search engine results, and incorporating genre into the search equation may be one way of doing this.

To be truly adaptive in this environment, a classifier would have to:

1. Track a recognized genre as it evolves.

2. Recognize the introduction of a novel genre that has not been seen previously.

The first requirement would entail continuous learning, while the second would entail examining the set of web pages identified as "noise", possibly clustering this set to identify new classes of genre.

DEFINITIONS, ACRONYMS, AND ABBREVIATIONS

Search engine: An information retrieval system designed to help find information stored on a computer system, such as on the World Wide Web, inside a corporate or proprietary network, or on a personal computer.

Home Page: A Web document that serves as a starting point or organizational centre for a collection of Web documents.

Personal Page: Personal home pages were defined to be home pages that contain information describing the interests and ambitions of a person, where those ambitions do not include making profit through selling some product or service.

Corporate Page: Corporate home pages were defined as web pages describing the interests and ambitions of companies whose purpose for existing is to make profit through selling some product or service.

Organization home page: Organization home pages were defined to be home pages that contain information describing the interests and ambitions of a group (such as a society or religious organization, etc.), where those ambitions do not include making profit through selling some product or service. Organization home pages appear to fill the role of home pages that do not fall into the personal or corporate categories.

Noise: Incomprehensibility resulting from irrelevant information or meaningless facts or remarks on a webpage.

URL: Uniform Resource Locator (the address of a web page on the World Wide Web).

2. LITERATURE SURVEY

2.1 Existing System

Available search tools on the Web fall into two categories: net directories and search engines. Net directories, such as the one provided by Yahoo!, give a hierarchical classification of documents; each document in the directory is associated with a node of the tree (either a leaf or an internal node). Moving along the tree, a user can access a set of pages that have been manually pre-classified and placed in the tree.

Yahoo!, for example, currently consists of a classification tree with a depth of 10 or more (depending on the path followed). About 10-30 branches at each level of the tree lead to a total of a few hundred thousand pages. Searching a net directory is very convenient and usually leads the user to the set of documents he is seeking, but it covers only a small fraction of the Web (often the commercial part). This limited coverage stems from the slow rate of manual classification.

Search engines such as Google and Bing cover a large portion of the Web. The drawback of these search engines is that they only support syntactic, keyword-oriented search, i.e., the search returns a list of pages that include a given set of keywords (or phrases). They do not give any information about the type of the web page.

Example: the Google search result for the keyword "Ratan Tata" typed in the search box:

Ratan N Tata - Tata group
Ratan N Tata has been chairman of Tata Sons, the Tata promoter company, since 1991. He is also chairman of other Tata companies, including Tata Motors,
www.tata.com/aboutus/articles/inside.aspx?artid=uBZjT+/ooH8= - Cached

The information contained in this result is:

1. A link to a web page related to Ratan Tata.

2. A "Cached" link that opens the search engine's stored copy of the page.

3. The opening lines of the web page, shown as a summary.

The drawbacks with the existing system are:

1. The result says nothing about the type of the page, i.e. whether it is a personal, corporate or organizational home page, or a non-home page.

2. The user must open the first 4-5 pages in the results to confirm that he has reached the right web page (here, the personal home page of "Ratan Tata").

3. As a result, there is a delay in reaching the right page.

2.2. Methodology

In order to classify these pages, an appropriate set of features needed to be determined.

1. Content:

1. Number of Meta tags used.

2. Does the page contain any phone numbers?

3. List of the most common words that appear in the page.

2. Form:

1. Does the page have its own domain, or is it in a sub-directory within a domain?

2. Size of file in bytes.

3. Number of words in the page.

3. Functionality:

1. Number of Links in the Web Page.

2. Number of E-mail Links.

3. Proportion of links that are navigational links to other web pages within the same site.

4. Proportion of links that are links to locations within the same page.

5. Number of form inputs.
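The features listed above could be collected along the following lines. This is only a minimal sketch using the Python standard library, not the implementation used in this work: the names FeatureParser, extract_features and PHONE_RE are hypothetical, the phone number pattern is a rough guess, and "own domain" is approximated by checking whether the URL path is empty.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class FeatureParser(HTMLParser):
    """Collects tag counts, link targets and visible text from one page."""

    def __init__(self):
        super().__init__()
        self.meta_tags = 0
        self.form_inputs = 0
        self.hrefs = []
        self.text_parts = []
        self._skip = 0  # depth inside <script>/<style>, whose text is ignored

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            self.meta_tags += 1
        elif tag in ("input", "select", "textarea", "button"):
            self.form_inputs += 1
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


# Assumed pattern: a digit, at least 7 digits/separators, then a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_features(html, url):
    """Return the content, form and functionality features as a dict."""
    parser = FeatureParser()
    parser.feed(html)
    text = " ".join(parser.text_parts)
    site = urlparse(url).netloc
    links = parser.hrefs
    n = len(links) or 1  # avoid division by zero on pages without links
    same_page = sum(h.startswith("#") for h in links)
    mailto = sum(h.lower().startswith("mailto:") for h in links)
    # A link counts as navigational if it is relative or targets the same host.
    internal = sum(
        1 for h in links
        if not h.startswith("#")
        and not h.lower().startswith("mailto:")
        and urlparse(h).netloc in ("", site)
    )
    return {
        "meta_tags": parser.meta_tags,
        "has_phone": bool(PHONE_RE.search(text)),
        "own_domain": urlparse(url).path in ("", "/"),
        "size_bytes": len(html.encode("utf-8")),
        "word_count": len(re.findall(r"[A-Za-z']+", text)),
        "links": len(links),
        "email_links": mailto,
        "internal_link_ratio": internal / n,
        "same_page_link_ratio": same_page / n,
        "form_inputs": parser.form_inputs,
    }
```

The list of most common words is left out here; it could be derived from the same extracted text with a word counter.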

3. PROPOSED SYSTEM

3.1 System Design

The automatic identification of home pages is a system consisting of an interactive interface and a classifier that separates pages into different categories. The entry point is an interactive interface that takes the keyword to be searched from the user. The client and the server form a distributed system. The classifier part of the system is specified primarily by a class model. Figure 3.1 shows the proposed architecture of the system.

Persistent data produced by the system is stored on the client system. A database ensures that the data is consistent and available for analysis. The server sends the search results to the client, and the classifier processes each web page according to its nature. The classifier and the search engine are event driven.

The search engine at the server has minimal functionality: it simply replies with the results of the keyword search. The classifier must be efficient enough to handle the classification load. It may be acceptable to block an occasional website, provided users are given an appropriate message.

Fig. 3.1. Architecture of the automatic web page classifier

The classifier is the only unit with nontrivial procedures; the remaining complexity comes from failure handling. The system must have the capacity to handle the expected worst-case load, and there must be enough disk storage to hold all search classifications. Each physical unit must protect itself against failure or disconnection from the rest of the network. A database protects against the loss of data. A lost connection should also be detected.

3.2 System Work

The software contains a search box and a search button. When the user enters text and clicks the search button, the software redirects to the search results and lists the available pages, with each page classified as a home page or a non-home page. A home page is further classified as a personal, organizational or corporate page.

3.2.1 Functions

Based on the user input, the web pages listed by Google are processed by counting occurrences of keywords, and each page is classified on that basis.

Example:

1. Personal Home Page: my, me, I, ’t.

2. Corporate Home Page: we, services, service, available, fax, our, us, com, contact, copyright, free.

3. Organization Home Page: events, community, organization, help, its, members, news, information.
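As a rough illustration of how such keyword lists might be turned into a score, the sketch below computes, for each class, the fraction of words on a page that belong to that class's list. The names KEYWORDS and keyword_support are hypothetical, the tokenization is an assumption, and the "t" entry from the personal list is omitted because its intended tokenization is not clear from the text.

```python
import re
from collections import Counter

# Keyword lists taken from the examples above (lower-cased).
KEYWORDS = {
    "personal": {"my", "me", "i"},
    "corporate": {"we", "services", "service", "available", "fax", "our",
                  "us", "com", "contact", "copyright", "free"},
    "organization": {"events", "community", "organization", "help", "its",
                     "members", "news", "information"},
}


def keyword_support(text):
    """Return, per class, the share of words in `text` that are class keywords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty pages
    return {
        label: sum(counts[k] for k in keywords) / total
        for label, keywords in KEYWORDS.items()
    }


support = keyword_support("Welcome to my page. I like my dog.")
print(support["personal"])  # 3 keyword hits out of 8 words
```

Normalizing by the total word count is one possible choice; raw counts would favour long pages.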

3.2.2 Validity checks

1. An internet connection must be present.

2. Some sites cannot be classified for security reasons.

3. An error page is generated when no internet connection is detected.

3.3 Algorithms Used

The following modules are used:

3.3.1 Personal Page Evaluation

This module takes the URL of the web page to be analyzed as a parameter. The module opens the page source and searches the content for the keywords I, me, my, ’t, mine and am. The count for each keyword is recorded, and the support is calculated from the frequency of occurrence. The module also examines the content, form and functionality of the web page.

Algorithm for personal page evaluation

Input: valid URL of a web page

Output: percentage of support for the web page being personal

Algorithm personal page evaluation

Begin :{

1. Open the URL of the web page

2. Look for the keywords expected in a personal page

3. Count the occurrences of each keyword

4. Count the form input items (buttons, text boxes, drop-down lists, labels)

5. Count the links, JavaScript and meta tags

6. Look for email links, phone numbers and other contact details

7. Combine the content, form and functionality calculations

8. Calculate the support and confidence for the URL being a personal page

End :}

The personal page evaluation module runs as a thread so that it can execute concurrently with the modules that calculate the support and confidence for the other classifications. This makes execution faster and more efficient. A constructor is defined for the personal page class to initialize its variables, and the variables are declared private for data security.
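Running the evaluation modules side by side might look like the sketch below, which submits one task per classification to a thread pool. It is an illustration under assumptions, not the original implementation: make_evaluator, EVALUATORS and evaluate_all are hypothetical names, the evaluators work on already downloaded page text rather than a URL, and only the keyword step of the algorithm is modelled (the ’t keyword is omitted).

```python
import re
from concurrent.futures import ThreadPoolExecutor


def make_evaluator(keywords):
    """Build an evaluator returning the percentage of words that are keywords."""
    def evaluate(text):
        words = re.findall(r"[a-z']+", text.lower())
        hits = sum(1 for w in words if w in keywords)
        return 100.0 * hits / len(words) if words else 0.0
    return evaluate


# Keyword sets as given for the three evaluation modules in this section.
EVALUATORS = {
    "personal": make_evaluator({"i", "me", "my", "mine", "am"}),
    "organizational": make_evaluator(
        {"events", "community", "location", "organization"}),
    "corporate": make_evaluator(
        {"we", "service", "copyright", "members", "client", "help"}),
}


def evaluate_all(text):
    """Run every evaluator in its own worker thread and collect the scores."""
    with ThreadPoolExecutor(max_workers=len(EVALUATORS)) as pool:
        futures = {label: pool.submit(fn, text)
                   for label, fn in EVALUATORS.items()}
        return {label: future.result() for label, future in futures.items()}


scores = evaluate_all("I am proud of my garden")
print(scores)
```

In Python, threads mainly help when each module also downloads its page, since that work is I/O-bound; pure keyword counting gains little from threading.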

3.3.2 Organizational Page Evaluation

This module takes the URL of the web page to be analyzed as a parameter. The module opens the page source and searches the content for the keywords events, community, location and organization. The count for each keyword is recorded, and the support is calculated from the frequency of occurrence. The module also examines the content, form and functionality of the web page.

Algorithm for organizational page evaluation

Input: valid URL of a web page

Output: percentage of support for the web page being organizational

Algorithm organizational page evaluation

Begin :{

1. Open the URL of the web page

2. Look for the keywords expected in an organizational page

3. Count the occurrences of each keyword

4. Count the form input items (buttons, text boxes, drop-down lists, labels)

5. Count the links, JavaScript and meta tags

6. Look for email links, phone numbers and other contact details

7. Combine the content, form and functionality calculations

8. Calculate the support and confidence for the URL being an organizational page

End :}

The organizational page evaluation module runs as a thread so that it can execute concurrently with the modules that calculate the support and confidence for the other classifications. This makes execution faster and more efficient. A constructor is defined for the organizational page class to initialize its variables, and the variables are declared private for data security.

3.3.3 Corporate Page Evaluation

This module takes the URL of the web page to be analyzed as a parameter. The module opens the page source and searches the content for keywords such as we, service, copyright, members, client and help. The count for each keyword is recorded, and the support is calculated from the frequency of occurrence. The module also examines the content, form and functionality of the web page.

Algorithm for corporate page evaluation

Input: valid URL of a web page

Output: percentage of support for the web page being corporate

Algorithm corporate page evaluation

Begin :{

1. Open the URL of the web page

2. Look for the keywords expected in a corporate page

3. Count the occurrences of each keyword

4. Count the form input items (buttons, text boxes, drop-down lists, labels)

5. Count the links, JavaScript and meta tags

6. Look for email links, phone numbers and other contact details

7. Combine the content, form and functionality calculations

8. Calculate the support and confidence for the URL being a corporate page

End :}

The corporate page evaluation module runs as a thread so that it can execute concurrently with the modules that calculate the support and confidence for the other classifications. This makes execution faster and more efficient. A constructor is defined for the corporate page class to initialize its variables, and the variables are declared private for data security.
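The text does not spell out how the three support percentages are combined into a final label. One plausible rule, shown here purely as an assumption, is to pick the class with the highest support and to fall back to "non-home page" when even the best score is below a threshold. The function name classify and the threshold value min_support are hypothetical.

```python
def classify(scores, min_support=5.0):
    """Pick a label from per-class support percentages.

    `scores` maps a class name to its support percentage. The threshold
    `min_support` is an illustrative value, not one taken from this work.
    """
    label, best = max(scores.items(), key=lambda item: item[1])
    if best < min_support:
        return "non-home page"
    return label + " home page"


print(classify({"personal": 50.0, "organizational": 0.0, "corporate": 2.0}))
print(classify({"personal": 1.0, "organizational": 0.5, "corporate": 2.0}))
```

A trained classifier, as suggested elsewhere in this essay, could replace this fixed rule.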

CONCLUSION

An automatic classifier can be trained to distinguish home pages from non-home pages and to classify home pages as personal, corporate or organization home pages. Since the web acts as a repository of information, it is important to develop mechanisms that improve the quality of search engine results, and incorporating a classifier into the search equation may be one way of doing this. Using this classifier can reduce the overall time a user spends searching, because the user can reach the required information in one or two hits.

FUTURE SCOPE

It appears that organization home pages do not have a style that is unique to them, whereas personal and corporate home pages each have a more distinctive style. Organization home pages can look like either a personal or a corporate home page, depending on who creates the page. A number of open research questions remain to be investigated in this area. A machine learning model could be developed; for example, a neural net classifier could be trained to distinguish home pages from non-home pages and to classify home pages as personal, corporate or organization home pages. With the growing importance of the web as a repository of information, it is important to develop mechanisms that improve the quality of search engine results, and incorporating genre into the search equation may be one way of doing this. To be truly adaptive in this environment, a classifier would have to:

1. Track a recognized genre as it evolves.

2. Recognize the introduction of a novel genre that has not been seen previously.

The first requirement would entail continuous learning, while the second would entail examining the set of web pages identified as "noise", possibly clustering this set to identify new classes of genre. While we have not yet addressed this, an examination of the topic detection and tracking literature may provide useful insights into this problem.


