The Fast Growth of the World Wide Web


Abstract—Automatic web page classification plays an essential role in many information retrieval, web mining and web semantics applications. Web pages have special characteristics (such as HTML tags, hyperlinks, etc.) that make their classification different from standard text categorization. Thus, traditional text classifiers do not usually produce promising results when applied to web data. In this paper, we propose an approach that categorizes web pages by exploiting both plain text and the text contained in HTML tags. Our method operates in two steps. In step 1, we use SVM classifiers to generate, for each target web page (the page to classify), a reduced vector representation based on plain text and text from HTML tags. In step 2, we submit this vector representation to a Naïve Bayes algorithm to determine the final class of the web page. Our experimental results, on two large datasets of pages from the ODP (Open Directory Project) and WebKB (Web Knowledge Base), show that combining HTML tag content with plain text yields significantly better web page classification performance than using text alone, in terms of both Micro-F1 and Macro-F1.

Index Terms—HTML tags, Naïve Bayes, machine learning, SVM, web mining, web page classification


1 Introduction

With the fast growth of the World Wide Web, the task of locating relevant information on the web has become much more complex. The web contains billions of pages, making the need for effective automatic web page classification techniques ever more noticeable. Web page classification raises new research issues because of the rich information enclosed within hypertext. Standard text classifiers usually work on structured data collections but do not account for the special characteristics of web pages such as HTML tags and hyperlinks. Research demonstrates that integrating hypertext information with the content of web pages [1]–[3] provides a rich source of information for web classification.

On the other hand, some web pages contain HTML tags that treat related but different subjects, which do not necessarily indicate the major class of those pages. Also, some web developers fill HTML tags with content that is irrelevant to the category of the page so that the page ranks higher in search engines. This poses a serious challenge to classifiers with respect to discriminating informative tags from noisy tags. For instance, a course home page can give course information in its title (<title> tag), refer to the course teachers' home pages in anchors (<a> tag) and talk about students in paragraphs (<p> tag). In this case, the <a> and <p> tags provide no indication of the web page category, whereas the <title> tag is more informative. Given this, the question of how to effectively employ HTML tag content becomes crucial for web classification.

In this paper, we address the problem of how HTML tag content can be utilized to enhance web page categorization. We propose an HTML-tag-content-based web page classification approach that proceeds in two phases. In the first phase, we construct seven virtual documents for each target page. We create the first virtual document from the web page's plain text alone and each of the six others from the text extracted from one of the HTML tags <title>, <h1>, <h2>, <strong>, <em> and <a>. We then use an SVM classifier [4] to obtain a class for each of the seven virtual documents, and represent each web page by the vector composed of the seven classes obtained by SVM. In the second phase, we submit this vector to a Naïve Bayes classifier [5] to find the final class of the web page. The main contributions of this work can be summarized as follows:

We suggest an approach using two classifications: one to better represent the target web page, and another for its final classification. This approach is distinguished by (1) a new technique that uses SVM classifiers to create a reduced and more refined vector representation of the target web page, based mainly on HTML tag information and without any weighting scheme for these tags; and (2) the use of the NB model for the final classification, where the naïve part, namely the independence assumption over the topics vector components, is experimentally verified thanks to the very small dimension of the representation vector.

We test our approach on two different web corpora (ODP [6] and WebKB [7]). Our study shows that combining plain text with HTML tag information results in better web page classification performance than using plain text alone.

The remainder of the paper is organized as follows. In section 2, we review recent work on using metadata and hypertext information to improve web page classification. In section 3, we present our approach in detail. In section 4, we describe the experimental setting and present and discuss the results obtained. In section 5, we conclude our work and outline future perspectives.

2 RELATED WORK

Several research works have used web page features as a rich source of information to boost hypertext classification performance.

Fang et al. [8] discussed web page classification using hypertext features such as the text of the web page, the title, headings, URL and anchor text. They investigated five different SVM-based classification methods that use individual or combined features, and showed that combining these features improves classification performance. They also proposed a voting technique that further improves performance compared to individual classifiers. Zhongli et al. [9] proposed a new web classification approach that uses the output of MFICA (improved feature selection using Independent Component Analysis) as input to an NB classifier, and experimentally demonstrated that their method provides acceptable classification accuracy. Youquan et al. [3] improved the NB algorithm by considering the semi-structured nature of web pages. They split HTML tags into "supervise tags" groups to guide classification and introduced a new weighting strategy based on these groups, concluding that NB classifier performance benefits from their weighting policy. Sun et al. [10] employed the SVM algorithm to classify web pages using their content and context. They showed that combining text with the <title> and <a> HTML tags improves classification performance compared to using text alone, and that this combination helps SVM perform very well compared to the FOIL-PILFS algorithm. Li et al. [11] introduced a method that calculates a word's mixed weight from HTML tag features and then mines classification rules based on this mixed weight to classify web pages, achieving better performance than traditional associative classification methods. Fernandez et al. [12] tried four types of weighting based on term frequency and HTML tags, as well as combinations of these weights using ACC ("analytic combination criteria") and FCC ("fuzzy combination criteria"). Their experiments showed that Naïve Bayes together with the weighted Gaussian model and FCC gives the best results, and they concluded that the use of HTML tags improves classification performance. Ardo and Golub [13] showed that a weighted linear combination of the contents of the title, headings, metadata and text works best for classification. Kwon et al. [14] supplemented the KNN approach with a feature selection method and a term weighting scheme using markup tags, and also reformulated the inter-document similarity measure used in the vector space model, showing that their method improves KNN classification performance.

Böttcher et al. [15] presented an approach that uses HTML tag weighting to separate relevant information in hypertext documents from noise. They evaluated their method on web page classification and showed that their technique significantly helps classification.

All the research cited above introduces interesting ideas on hypertext classification. This paper proposes a two-stage classification approach: it employs SVM classifiers to build a reduced vector representation of the web page, and then applies a Naïve Bayes classifier to this vector representation to find the final class.

3 PROPOSED APPROACH

Web page authors usually embed very important text within HTML tags such as <title> and <h1>. On the other hand, web pages have a complex structure where the plain text and the HTML tag contents may talk about related but different subjects. Even though these subjects may not be the same, they all contribute, with various degrees, to the final class of the page. Our approach is motivated by exploiting these possibly different subjects for the final classification of the target page. This motivation is driven by two questions: (1) how can these subjects be exploited to best represent the target web page? (2) how can they be made optimally informative about the right class of the target page? In this work, we construct our topics using the plain text part and the text from the six HTML tags described in the table below:

Tag | Role
<title> | Contains the page's title.
<h1> | Contains the most important heading of the web page.
<h2> | Contains second-level headings (e.g. section titles).
<a> | Contains a description of the neighboring web page whose URL is in the tag's "href" attribute.
<strong>, <em> | Indicate an emphasised content.

Table 1: Descriptions of the HTML tags used in our approach

Our technique operates in two stages. First, we construct, for each target page, seven virtual documents (VDs): the first is based on plain text alone, whereas the six others are created from the text extracted respectively from <title>, <h1>, <h2>, <a>, <strong> and <em>. We then separately submit these VDs to SVM classifiers to obtain seven classes (called "topics"). These topics, which are not necessarily the same, form a vector (the "Topics" vector) that represents the target page. Second, we supply the "Topics" vector to an NB classifier to compute the final class.

2.1 "TOPICS" VECTOR REPRESENTATION USING PLAIN TEXT AND HTML TAGS

As mentioned earlier, we create seven virtual documents for each target web page. One VD contains the terms extracted from the plain text, whereas each of the six others is created from the text included in one pair of the six HTML tags (<title></title>, <h1></h1>, <h2></h2>, <strong></strong>, <em></em> and <a></a>). We define our VDs as follows (t denotes a term):

DX(P) = { t | t ∈ X }, where X is the plain text of the target web page P.
DT(P) = { t | t ∈ title }, where "title" is the content of the <title> tag.
DH1(P) = { t | t ∈ h1 }, where "h1" is the content of the <h1> tags.
DH2(P) = { t | t ∈ h2 }, where "h2" is the content of the <h2> tags.
DA(P) = { t | t ∈ a }, where "a" is the content of the <a> tags.
DS(P) = { t | t ∈ strong }, where "strong" is the content of the <strong> tags.
DE(P) = { t | t ∈ em }, where "em" is the content of the <em> tags.

Finally, we generate the "Topics" vector from the seven classes obtained previously: Vtopics(P) is the 7-component vector whose components are the classes returned by SVM(X), SVM(T), SVM(H1), SVM(H2), SVM(S), SVM(E) and SVM(A) for the corresponding virtual documents of P.
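As an illustrative sketch only (not the authors' implementation), the virtual documents could be assembled from a page's HTML with a parser such as BeautifulSoup; the function name build_virtual_documents and the handling of scripts and styles are our own assumptions.

```python
from bs4 import BeautifulSoup

# Tags whose text feeds the six tag-based virtual documents.
TAGS = ["title", "h1", "h2", "a", "strong", "em"]

def build_virtual_documents(html):
    """Return a dict mapping a channel ('X', 'title', 'h1', ...) to the text
    of the corresponding virtual document of the page."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop scripts and styles so they do not pollute the plain-text channel.
    for element in soup(["script", "style"]):
        element.decompose()

    docs = {"X": soup.get_text(separator=" ", strip=True)}   # plain-text VD
    for tag in TAGS:
        docs[tag] = " ".join(t.get_text(separator=" ", strip=True)
                             for t in soup.find_all(tag))
    return docs
```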

To find the most suitable algorithm for classifying each VD, we experimentally considered several techniques and opted for the SVM algorithm, which demonstrates high performance in this type of classification [8], [10], [16].

We applied an SVM classifier to each VD representation of the target web page, as shown in the following table:

Virtual document representing web page P | SVM classifier for this virtual document
DX(P) | SVM(X)
DT(P) | SVM(T)
DH1(P) | SVM(H1)
DH2(P) | SVM(H2)
DS(P) | SVM(S)
DE(P) | SVM(E)
DA(P) | SVM(A)

Table 2: Virtual documents representing target web pages and their SVM classifiers; the result of each classifier is one component of the "Topics" vector.
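A minimal sketch of this first stage, assuming scikit-learn, could train one TF-IDF plus linear-SVM pipeline per virtual document channel; the value C=1.0 is a placeholder (the paper tunes C empirically) and the helper names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

CHANNELS = ["X", "title", "h1", "h2", "strong", "em", "a"]

# One independent SVM per virtual-document channel, mirroring
# SVM(X), SVM(T), SVM(H1), SVM(H2), SVM(S), SVM(E) and SVM(A).
svms = {c: Pipeline([("tfidf", TfidfVectorizer()),
                     ("svm", LinearSVC(C=1.0))]) for c in CHANNELS}

def train_channel_svms(train_pages, train_labels):
    # train_pages: list of dicts produced by build_virtual_documents above.
    for c in CHANNELS:
        svms[c].fit([page[c] for page in train_pages], train_labels)

def topics_vector(page):
    # The 7-component "Topics" vector of one page.
    return [svms[c].predict([page[c]])[0] for c in CHANNELS]
```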

3.2 NAÏVE BAYES CLASSIFICATION

In this step, we use the NB classifier to compute the final class of the target page represented by its "Topics" vector. NB is a simple and effective text classification algorithm which has been shown to perform very well in practice [17], [18]. Figure 1 provides a summary of our approach:

Figure 1: The proposed approach
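For the second stage, a minimal sketch (assuming scikit-learn and treating each of the seven topics as a categorical feature) could look like the following; it illustrates the idea rather than reproducing the authors' code.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()       # encodes each topic label as an integer code
nb = CategoricalNB(alpha=1.0)    # alpha is a smoothing parameter (assumed value)

def train_final_classifier(topics_vectors, labels):
    # topics_vectors: list of 7-component "Topics" vectors, one per training page.
    X = encoder.fit_transform(topics_vectors)
    nb.fit(X, labels)

def classify_page(topics_vector):
    # Returns the final class for one page, given its "Topics" vector.
    X = encoder.transform([topics_vector])
    return nb.predict(X)[0]
```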

4 EXPERIMENTS

In this section, we present the experiments conducted to validate the advantage of combining HTML tag text with plain text for the web page classification task. We describe the experimental design adopted, i.e. the preprocessing techniques, the classifiers applied, the datasets employed, the performance evaluation metrics used, and the results obtained, followed by a discussion.

4.1 EXPERIMENTAL SETUP

4.1.1 PRE-PROCESSING

In the first stage of our approach, we applied a number of techniques to clean and normalize the raw text contained in each of the seven parts of the target page. In the tokenization step, we converted all word tokens to lower case to avoid the problem of case differences. We removed special characters, punctuation marks, scripts, styles, MIME headers and numbers, and we also eliminated stop words and terms that occur fewer than twice. For stemming, we applied the Porter method [19]. After the pre-processing stage, we build the dictionary: we consider web pages as bags of words, so the dictionary consists of the words resulting from pre-processing. We then represent web pages with the conventional Vector Space Model [20], mapping each web page j onto a vector whose component wij denotes the weight of term i with respect to that page. We adopted the Inverse Document Frequency [21] based weighting model to obtain the weights:

wij = tfij × log(N / ni)

where tfij is the frequency of term i in page j, N is the total number of target web pages and ni is the number of pages in which term i appears.
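A rough equivalent of this preprocessing and weighting chain, assuming NLTK for stop words and Porter stemming and scikit-learn for the IDF-based weighting, might look like this; note that min_df=2 only approximates the removal of terms occurring fewer than twice.

```python
import re

from nltk.corpus import stopwords            # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, keep alphabetic tokens only (drops punctuation and numbers),
    # remove stop words and apply Porter stemming.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

# IDF-based vector space model; min_df=2 approximates dropping terms
# that occur fewer than twice.
vectorizer = TfidfVectorizer(preprocessor=preprocess, min_df=2)
# X = vectorizer.fit_transform(list_of_virtual_documents)
```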

In the second stage, we did not perform any pre-processing on the "Topics" vectors representing the target web pages.

4.1.2 CLASSIFIERS USED

We employed two different classifiers, namely Support Vector Machines [4] in step 1 and Naïve Bayes [5] in step 2. We provide a succinct description of each algorithm below.

4.1.2.1 SUPPORT VECTOR MACHINE (SVM)

SVM is a powerful learning technique that has been successfully applied to text classification [16]. It is based on the Structural Risk Minimization theory, which aims at minimizing the generalization error instead of the empirical error on the training data alone. Multiple versions of SVM have been developed [22]; in this work, we used SMO (Sequential Minimal Optimization) [23], [24]. We empirically adjust the C parameter, which defines the tolerance to errors, to the value maximizing both Micro-F1 and Macro-F1 on the datasets. We use a linear kernel, which has proved efficient for text categorization with high-dimensional feature vectors [16]; this is the case in both the ODP and WebKB datasets, which count 29,888 terms and more than 17,215 terms respectively.
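The search for a suitable C value could be reproduced along the following lines with scikit-learn's GridSearchCV; the candidate grid and the choice of the micro-F1 scorer are our assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}   # assumed candidate values

search = GridSearchCV(LinearSVC(),            # linear-kernel SVM
                      param_grid,
                      scoring="f1_micro",     # "f1_macro" can be used as well
                      cv=5)
# search.fit(X_train, y_train)
# best_C = search.best_params_["C"]
```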

4.1.2.2 NAÏVE BAYES (NB)

NB is a simple and very popular text classification algorithm [17], [18]. Its basic principle is to use the joint probabilities of words and categories to estimate the probability of a category given a document. In this work, we apply Bayes' rule to the joint probabilities of topics (provided by the SVM classifiers) and categories in order to estimate the probability of a category given a web page. The rule can be formalized as follows:

P(A | B) = P(B | A) P(A) / [ P(B | A) P(A) + P(B | Ā) P(Ā) ]

where A and B are events and Ā is the complementary event of A.

Let T1, T2, ..., T7 be the components of the topics vector and let C = {c1, ..., cm} be the set of categories. A web page P is represented by a topics vector (t1, t2, ..., t7), where tk is a possible value of Tk.

By making the assumption that all attributes are conditionally independent given a class, the probability of a category can be formalized as follows:

P(cj | t1, ..., t7) ∝ P(cj) × P(t1 | cj) × ... × P(t7 | cj)

To estimate posterior probabilities, we used the m-estimate model [25].
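For reference, the m-estimate smooths a raw conditional frequency with a prior; a minimal helper (with illustrative parameter names) could be written as follows.

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of a conditional probability P(topic value | category).

    n_c -- number of training pages of the category having this topic value
    n   -- number of training pages of the category
    p   -- prior estimate of the probability (e.g. 1 / number of possible values)
    m   -- equivalent sample size weighting the prior
    """
    return (n_c + m * p) / (n + m)
```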

In this work, the reduced dimension of the topics vector (7 components) made it possible to experimentally verify the naïve part of the Bayesian model, namely the assumption of topic independence. We show experimentally that the conditional probability of each topic given a category does not depend on the values of the other 6 topics given that category. For each data collection used, we consider the 21 pairs of topics (Tk, Tk'), with k, k' = 1, ..., 7 and k ≠ k'. We verified experimentally that, for each such pair and for each category cj:

P(Tk = tk | C = cj) ≈ P(Tk = tk | Tk' = tk', C = cj)
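A sketch of how such a check can be computed from the training "Topics" vectors is given below; the function and variable names are illustrative, and the exact averaging used by the authors may differ.

```python
from itertools import combinations

def avg_independence_gap(topics_vectors, labels, category):
    """Average |P(Tk = v | c) - P(Tk = v | Tk' = v', c)| over the 21 topic pairs."""
    rows = [t for t, y in zip(topics_vectors, labels) if y == category]
    gaps = []
    for k, k2 in combinations(range(7), 2):          # the 21 pairs (Tk, Tk')
        for v in set(r[k] for r in rows):
            p_uncond = sum(r[k] == v for r in rows) / len(rows)
            for v2 in set(r[k2] for r in rows):
                sub = [r for r in rows if r[k2] == v2]
                p_cond = sum(r[k] == v for r in sub) / len(sub)
                gaps.append(abs(p_uncond - p_cond))
    return sum(gaps) / len(gaps)
```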

The following tables present, for both datasets, the average of the differences between these probabilities.

Category | Average difference
Course | 0.003
Faculty | 0.005
Other | 0.021
Project | 0.02
Student | 0.006

Table 3: Average of the differences between the probabilities for WebKB

Category | Average difference
Cancer | 0.005
Cardiovascular_Disorders | 0.004
Immune_Disorders | 0.006
Infectious_Diseases | 0.004
Musculoskeletal_Disorders | 0.002
Neurological_Disorders | 0.001
Other | 0.001

Table 4: Average of the differences between the probabilities for ODP

4.1.3 DATASETS

In this article, we use two different data collections to evaluate our approach: the first is taken from the Open Directory Project (ODP) and the second is extracted from the Web Knowledge Base (WebKB).

The ODP is a huge repository containing around 4.6 million URLs organized into 765,282 categories and subcategories [26]. We extracted our dataset from the category "Top/Health/Conditions_and_Diseases", which counts 21 subcategories, only 6 of which are sufficiently populated for our experiments; the 15 others were grouped under the subcategory "Other". We therefore conducted our experiments on the 6 most populated subcategories and the subcategory "Other", i.e. the following seven categories:

Cancer

Cardiovascular_Disorders

Immune_Disorders

Infectious_Diseases

Musculoskeletal_Disorders

Neurological_Disorders

Other

WebKB is a collection of web pages from the computer science departments of several American universities (such as Cornell, Texas, Washington and Wisconsin). It contains 8,282 web pages in 7 categories: Student (1,641 pages), Faculty (1,124 pages), Course (930 pages), Project (504 pages), Department (182 pages), Staff (137 pages) and Other (3,764 pages). The category "Other" contains pages that are instances of the other categories; to avoid any possible bias to the classifiers, we replaced its content with the "Staff" and "Department" pages. In this paper, we consider 5 categories for our experiments: the 4 most populated categories "Student", "Faculty", "Project" and "Course", plus the "Other" category, whose content was replaced by the 2 least populated categories "Department" and "Staff". The categories used here are therefore:

Student

Faculty

Course

Project

Other

Our approach proceeds in two stages and applies two learning classifiers that each need training and test datasets, with the "Topics" vectors output by the first stage used as input for the second stage. We therefore carefully prepared a data partitioning protocol for each data collection, dividing each corpus into 3 disjoint datasets as follows:

A dataset (Tr-set1) used to train the SVM classifiers.

A dataset (Test-set1) on which the SVM classifiers are applied to generate Vtopics vectors, which are then used as the training set (Vtopics-Tr) for the NB classifier.

A dataset (Test-set2) used to test the entire approach: Test-set2 is submitted to the SVM classifiers to create Vtopics vectors, which are used as the test set (Vtopics-test) for NB to produce the final classes. Figure 2 outlines our data partitioning workflow; a simple sketch of such a split is given below.
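Assuming the pages and their labels are already loaded into lists, a simple random partition into the three disjoint sets could be produced as follows (the split proportions are our assumption, not the paper's):

```python
from sklearn.model_selection import train_test_split

# First split off Tr-set1 (training data for the seven SVMs); the remainder is
# then split into Test-set1 (-> Vtopics-Tr for NB) and Test-set2 (-> Vtopics-test).
tr_set1, rest, y_tr1, y_rest = train_test_split(
    pages, labels, test_size=0.5, random_state=0)
test_set1, test_set2, y_test1, y_test2 = train_test_split(
    rest, y_rest, test_size=0.5, random_state=0)
```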

Figure 2: Data partitioning workflow

4.1.4 EVALUATION MEASURES

To evaluate the performance of both classifiers, we employ the standard metrics recall, precision and F1, which are commonly used in classification tasks. Recall is the ratio of the correct assignments made by the system to the total number of correct assignments. Precision is the proportion of correct assignments within the total number of assignments made by the system. F1, introduced by van Rijsbergen [27], combines recall and precision with equal weights as stated below:

F1 = 2 × Precision × Recall / (Precision + Recall)

To measure global averages across multiple categories, we use two averaging techniques: micro-averaging and macro-averaging. The former, widely used in cross-method comparisons, assigns equal weight to every document; the latter, used in some studies [28], gives equal weight to every category regardless of its frequency. The micro-averaged scores are dominated by the classifier's performance on the most populated categories, whereas the macro-averaged scores are influenced more by rare categories. In our experiments, both kinds of scores are used to assess the effectiveness of classification.
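Both averages can be computed directly from the gold labels and the predictions, for instance with scikit-learn's metrics:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluation_report(y_true, y_pred):
    return {
        "micro-F1": f1_score(y_true, y_pred, average="micro"),
        "macro-F1": f1_score(y_true, y_pred, average="macro"),
        "macro-precision": precision_score(y_true, y_pred, average="macro"),
        "macro-recall": recall_score(y_true, y_pred, average="macro"),
    }
```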

4.2 RESULTS

The results on the WebKB dataset (Table 5 and Figure 3) indicate that, for all categories and for the recall, precision and F1 measures, the NB classifier based on our proposed model performs better than the NB classifier based on text alone. For a more complete picture, we also provide Micro-F1 and Macro-F1 averages, which show that our proposed scheme also helps the NB classifier improve performance on both common and rare categories.

Similarly, the results on the ODP dataset (Table 6 and Figure 4) confirm that, for recall, precision and F1, our approach helps the NB algorithm improve classification compared to NB using text alone. However, we notice that for the category "Infectious_Diseases" the difference in performance is not significant. This is because many web pages in that category happen to lack some HTML tags; in our experiments we replace missing tags with the plain text of the web page, which explains the close match between the results obtained by the NB classifier with our proposed model and with text alone.

We also supply averaged evaluations for the ODP dataset. These measures show that NB on the proposed model also outperforms NB on text alone for both frequent and uncommon categories.

Category | Method | P | R | F1
Course | NB (Tags + SVM) | 0.942 | 0.936 | 0.934
Course | NB (Text) | 0.912 | 0.797 | 0.84
Faculty | NB (Tags + SVM) | 0.683 | 0.857 | 0.758
Faculty | NB (Text) | 0.5 | 0.635 | 0.557
Project | NB (Tags + SVM) | 0.827 | 0.755 | 0.781
Project | NB (Text) | 0.425 | 0.708 | 0.52
Student | NB (Tags + SVM) | 0.91 | 0.892 | 0.899
Student | NB (Text) | 0.869 | 0.755 | 0.807

Method | Macro-F1 | Micro-F1
NB (Tags + SVM) | 0.843 | 0.874
NB (Text) | 0.681 | 0.742

Table 5: Comparison between NB based on our proposed model and NB based on text alone for the WebKB dataset

Figure 3: Comparison between NB based on our proposed model and NB based on text alone for the WebKB dataset

Category | Method | P | R | F1
Cancer | NB (Tags + SVM) | 0.9 | 0.826 | 0.861
Cancer | NB (Text) | 0.211 | 0.515 | 0.299
Cardiovascular_Disorders | NB (Tags + SVM) | 0.755 | 0.432 | 0.55
Cardiovascular_Disorders | NB (Text) | 0.889 | 0.23 | 0.365
Immune_Disorders | NB (Tags + SVM) | 0.877 | 0.4 | 0.549
Immune_Disorders | NB (Text) | 0.263 | 0.2 | 0.227
Infectious_Diseases | NB (Tags + SVM) | 0.57 | 0.633 | 0.6
Infectious_Diseases | NB (Text) | 0.634 | 0.5 | 0.559
Musculoskeletal_Disorders | NB (Tags + SVM) | 0.717 | 0.25 | 0.371
Musculoskeletal_Disorders | NB (Text) | 0.442 | 0.125 | 0.195
Neurological_Disorders | NB (Tags + SVM) | 0.435 | 0.825 | 0.57
Neurological_Disorders | NB (Text) | 0.418 | 0.3 | 0.349

Method | Macro-F1 | Micro-F1
NB (Tags + SVM) | 0.584 | 0.613
NB (Text) | 0.332 | 0.323

Table 6: Comparison between NB based on our proposed model and NB based on text alone for the ODP dataset

Figure 4: Comparison between NB based on our proposed model and NB based on text alone for the ODP dataset

5 CONCLUSION AND FUTURE PERSPECTIVES


