Text-Based E-Learning

Published Date: 02 Nov 2017

Analytical Study under Information Retrieval in Indian Languages

Kolikipogu Ramakrishna

Department of Information Technology

Sridevi Women's Engineering College,

Hyderabad, India

[email protected]

B. Padmaja Rani

Department of Computer Science and Engineering

JNTUH College of Engineering, JNTUH

Hyderabad, India

[email protected]

Abstract: Information Retrieval in Telugu Language (IRTL) is a challenging research area within the application domain of Information Retrieval systems. An Information Retrieval system presents information to naïve users while reducing the overhead of typing complex queries and searching repositories, and it presents the results according to the user's interest in a limited, organized visual space suited to the place of use. Every e-learning application that includes audio, video, graphics and text uses some Information Retrieval method. Information Retrieval (IR) is the process of storing, retrieving and presenting processed information. Naïve users expect fully accurate and fast retrieval for their queries. IR in English and other European languages is comparatively well developed, owing to their less complex structure and the wide availability of language processing tools. Indian languages are more complex in nature than other languages. In this paper we give a survey and review of Information Retrieval in Telugu and other Indian languages. Telugu is one of the 22 constitutional languages of India; it ranks third among the languages spoken in India and fifteenth in the world. Major research on information processing and retrieval in Indian languages is under way in almost all universities in India, supported by various funding agencies. We studied the difficulties of processing and retrieving information in Indian languages, especially Telugu. We also studied various language processing tools and their use in the Information Retrieval process. Finally, as a case study, we concentrated on how retrieval performance is affected by the use of language processing tools for Telugu.

Keywords: Information Retrieval; Indian Languages; Telugu Language; Language Processing Tools; Stemmer; POS Tagger; Morphological Analyzer; Named Entity Recognizer; Dictionaries; WordNet; Ontology.

Introduction

An Information Retrieval System (IRS) is a multi-phase information processing system. It collects information objects, normalizes them through standardization, stores them in a searchable structure known as a repository, retrieves the objects of interest, and presents them to the end user in a visualization space in response to his/her query. The process of Information Retrieval (IR) is the same for all types of objects (text documents, image files, audio files, video files, web resources and so on), but the method used to implement each phase depends on the type of object involved. In this paper we limit our study and review to text documents. Information Retrieval systems are designed with the objective of providing, in response to a user query, references to documents that would contain the information desired by the user [1].

Indian Languages

Spoken Languages in India

India is a multilingual country. The languages spoken in India come from different language families, such as Indo-Aryan (a subset of the Indo-European languages), Dravidian, Austro-Asiatic and Tibeto-Burmese. The use of multiple languages in different regions of the Indian subcontinent arose from the migration of various peoples and from the rule of kings who arrived through invasion. Over time, people living in different regions who spoke the same language changed their accent, which gave rise to different dialects of the same language. Hundreds of dialects are spoken by particular groups of people in specific regions of India. Because the same language has variant dialects in different regions, it is very difficult to establish the identity of a specific language. However, in spite of this diversity, almost all the scripts are derived from Brahmi, and the order of the alphabet is similar in all of them.

Indian scripts have no letter cases (i.e. UPPER and lower), and vowels appear in their independent form at the beginning of a word, unlike English, where the same vowel letters are used anywhere within a word. The rules for recognizing language features differ from language to language, so language processing tools must be built for each language according to its unique features. A great deal of research in this domain is funded by the Government of India so that all people can freely use these resources in their day-to-day life. Bilingual, multilingual and cross-lingual Information Retrieval are challenging tasks for developers. In this paper we explore the major challenges in building language processing tools for Telugu and other Indian languages.

There are around seven hundred languages [1] in India. Of these, 22 languages from four language families have been given official status. Telugu belongs to the South-Central branch of the Dravidian languages.

Table 1. The 22 languages given official status by the Govt. of India in the 8th Schedule [3].

I. Indo-Aryan Languages: 1. Assamese, 2. Bengali, 3. Dogri, 4. Gujarati, 5. Hindi, 6. Kashmiri, 7. Konkani, 8. Maithili, 9. Marathi, 10. Nepali, 11. Oriya, 12. Punjabi, 13. Sanskrit, 14. Sindhi, 15. Urdu

II. Dravidian Languages: 1. Kannada, 2. Malayalam, 3. Tamil, 4. Telugu

III. Austro-Asiatic: 1. Santali

IV. Tibeto-Burmese: 1. Bodo, 2. Manipuri

Fifteen of the 24 Indo-Aryan languages are scheduled, as are four of the 17 Dravidian languages and one of the 14 Austro-Asiatic languages. Two languages are scheduled under the Tibeto-Burmese family. The Government of India has given these 22 "languages of the 8th Schedule" the status of official language, according to Census India, 2010-11 [3].

Telugu Language

Telugu is an Indian language belonging to the Dravidian family. It is the second most spoken language in India [5] and ranks fifteenth among the spoken languages of the world. Telugu is the official first language of the state of Andhra Pradesh, with 7,40,02,856 native speakers, which is 7.19% of the Indian population [3] as per the 2001 statistics. It is in third place by population with Telugu as mother tongue. The Indian Constitution recognizes Telugu as one of the 22 official languages out of the hundreds of languages spoken. Its unique classical features have earned Telugu the name "Roman of the East" among language enthusiasts, and it is considered easy to learn, speak and write. Since a considerable number of Telugu-speaking minorities live in other states of India and in other parts of the world (Maharashtra, Orissa, Madhya Pradesh, West Bengal, the United States of America, Australia and Europe), the Telugu-speaking population as a group is large across the country and the world [6]. In Andhra Pradesh, Telugu has three dialects, corresponding to three distinct regions: Coastal Andhra, Rayalaseema and Telangana. Most syntactic and semantic features are the same in these three dialects, which differ mainly in pronunciation and in a small amount of vocabulary. The language used in electronic and printed media is the same in all the regions. As with other languages, processing the document repository therefore raises no particular issue, but the word corpus must be somewhat larger to cover the dialects of the three regions. A standard word corpus was collected to handle all types of queries with dialectal features.

Text Based E-Learning System

Electronic learning is the process of training and educating users through a computer system by providing the tools and software needed to access resources held at a remote location. E-learning is the delivery of a learning, training or education program by electronic means [7]. Resources such as audio, video, text and other forms of static or dynamic data are placed at a remote location, and host computers are connected to the server through a network. In this paper we limit ourselves to a text repository placed on a server and accessed by end users through an Information Retrieval system. Applications of e-learning under Information Retrieval cover information extraction, summarization, personalization, document retrieval, paragraph extraction, duplicate assignment detection and so on. Recent "online" learning applications are deployed on the World Wide Web, over the Internet or an intranet.

Figure 1: Text-based E-Learning System. [Flow diagram. Query side: User Interface (Query Space), Query, Tokenize the Query, Characterize the Tokens (POS-Tagging), Eliminate Stop-words, Query Indexing. Document side: Text Documents Collection, Document Indexing (Frequency Based Indexing, Inverse Document Frequency, Latent Semantic Indexing). Both sides feed Searching, which produces the Result Set.]

Figure 1 shows the step-by-step implementation of the text-based E-Learning system using the Information Retrieval process. The complete system, from query processing to result extraction, is implemented in Telugu on a sample collection of 3,500 documents. These 3,500 documents were first classified into ten categories (Business, Devotional, Editorial, Historical, Literature, Politics, Science, Songs, Sports and Stories), with a word corpus of more than one lakh (100,000) words. For experimental purposes we made it a static online system with a fixed-size document collection. The steps involved in preparing the document collection and the user query for the search process are explained in the next section, following Figure 1.

Information Retrieval Steps In E-Learning

Information Retrieval is the process of retrieving and presenting content objects relevant to the user's query from a standardized collection of objects drawn from different sources or repositories. Users submit short queries that do not consider the variety of terms used to describe a topic, resulting in poor recall [8]. It is difficult to search an un-normalized document collection. Normalization, called indexing, reduces the complexity of the search process by representing a whole document with a limited set of words or phrases. Indexing is the process of identifying keywords that represent a document based on its contents. It is a very important phase of an Information Retrieval system because it creates a searchable unit for the given query. Indexing [9] is performed by assigning to each document keywords or descriptive terms that represent it.

Query Pre-processing

Registered users are permitted to log in to the E-Learning system and access the text by writing queries that express their information needs. The query entered in the User Interface must properly represent the documents the user expects to study. It is practically impossible for the user to frame a query based on the document vocabulary, because repositories are kept at a remote location and it is difficult to locate a particular document in a huge collection. Irrespective of the document store, the user query must therefore specify the information need. Lengthy queries are not advisable in search engines, so a few query words need to represent the whole concept. This can be done using the indexing process: the query terms are indexed with steps similar to those used for indexing documents. Query pre-processing involves tokenization, characterization, applying a stop list, stemming, and indexing the resulting bag of words to create a searchable data structure for the given query. This step greatly influences the final search results, depending on the level of word independence. Different searching methods, one of them phrase search, are adopted to compensate for the drop in final results. Telugu terms are difficult to process because of the complexity of syllable identification. To overcome this, all pre-processing steps at both ends (query and document) are implemented on Romanized text in WX-Notation. Initially the Telugu text is available in Unicode format (UTF). The UTF-8 queries are converted into WX-Notation [10], and the pre-processing steps are then applied to obtain the indexed words.

Example 1: The user query "ఈ రోజుల్లో స్త్రీలలో ఈ చైతన్యదీప్తి కాలానుగుణ్యమే" is given in the UTF representation of Telugu. Word conflation is high in Telugu, and it is difficult to obtain the root word from a word form. The query is converted into WX-Notation and then pre-processed. The WX-Notation for the above query is "I rOjullO swrIlallO I caEwanyaxIpwi kAlAnuguNyamE".
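The query pre-processing steps (tokenization, stop list, stemming, bag of words) can be sketched on the WX form of the query from Example 1. This is only a minimal illustration, not the system described in the paper: the stop list `STOP_WORDS` and the suffix list `SUFFIXES` are hypothetical and hand-picked for this one query, and the suffix stripping is a crude stand-in for the morphological analysis the paper relies on.

```python
from collections import Counter

# Hypothetical resources in WX notation, chosen for this example only.
STOP_WORDS = {"I", "A", "mariyu"}
SUFFIXES = ("lallO", "llO", "lO")  # tried in this order, longest first


def tokenize(query):
    """Split a WX-encoded query on whitespace."""
    return query.split()


def strip_suffix(token):
    """Crude stemmer: drop the first matching suffix, keeping at least 2 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


def preprocess_query(query):
    """Tokenize, drop stop words, stem, and return a bag of words."""
    tokens = [t for t in tokenize(query) if t not in STOP_WORDS]
    return Counter(strip_suffix(t) for t in tokens)


bag = preprocess_query("I rOjullO swrIlallO I caEwanyaxIpwi kAlAnuguNyamE")
print(sorted(bag))
# -> ['caEwanyaxIpwi', 'kAlAnuguNyamE', 'rOju', 'swrI']
```

With these assumed lists the two inflected forms reduce to "rOju" and "swrI", while the long compound forms pass through unchanged, which illustrates why a real stemmer or morphological analyzer is needed for Telugu.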

WX-Representation for Telugu Scripts

Vowels: అ [a], ఆ [A], ఇ [i], ఈ [I], ఉ [u], ఊ [U], ఋ [q], ఎ [e], ఏ [eV], ఐ [E], ఒ [o], ఔ [oV], అం [aM], అః [aH]

Consonants: క [ka], ఖ [Ka], గ [ga], ఘ [Ga], ఙ [fa], చ [ca], ఛ [Ca], జ [ja], ఝ [Ja], ఞ [Fa], ట [ta], ఠ [Ta], డ [da], ఢ [Da], ణ [Na], త [wa], థ [Wa], ద [xa], ధ [Xa], న [na], ప [pa], ఫ [Pa], బ [ba], భ [Ba], మ [ma], య [ya], ర [ra], ల [la], వ [va], స [sa], శ [Sa], ష [Ra], హ [ha], ళ [lYa], క్ష [kRa], ఱ [rYa]

Table 2. Telugu script in UTF and WX-Notation
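A conversion from UTF Telugu text to WX-Notation can be sketched from a subset of Table 2. The sketch below is an assumption-laden illustration rather than the converter used in the paper: Table 2 lists only independent letters, so the mapping of dependent vowel signs in `VOWEL_SIGNS` is assumed, and the e/o vowel family is left out. A consonant carries an inherent "a" unless it is followed by a vowel sign or by the virama.

```python
# Independent vowels and consonants: a subset of Table 2.
VOWELS = {"అ": "a", "ఆ": "A", "ఇ": "i", "ఈ": "I", "ఉ": "u", "ఊ": "U"}
CONSONANTS = {
    "క": "k", "గ": "g", "చ": "c", "జ": "j", "ణ": "N", "త": "w", "ద": "x",
    "న": "n", "ప": "p", "మ": "m", "య": "y", "ర": "r", "ల": "l", "వ": "v",
    "స": "s",
}
# Dependent vowel signs are not in Table 2; this mapping is an assumption.
VOWEL_SIGNS = {
    "\u0c3e": "A",  # sign AA
    "\u0c3f": "i",  # sign I
    "\u0c40": "I",  # sign II
    "\u0c41": "u",  # sign U
    "\u0c42": "U",  # sign UU
}
VIRAMA = "\u0c4d"  # suppresses the inherent "a" of the preceding consonant


def to_wx(text):
    """Transliterate Telugu text to WX for the letters covered above."""
    out = []
    pending_a = False  # True while the last consonant still owes its inherent "a"
    for ch in text:
        if ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
            pending_a = False
            continue
        if ch == VIRAMA:
            pending_a = False
            continue
        if pending_a:
            out.append("a")
            pending_a = False
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            pending_a = True
        else:
            # Independent vowels are mapped; anything else passes through unchanged.
            out.append(VOWELS.get(ch, ch))
    if pending_a:
        out.append("a")
    return "".join(out)


print(to_wx("కాలానుగుణ్య"))  # -> kAlAnuguNya
print(to_wx("స్త్రీ"))  # -> swrI
```

Characters outside the three dictionaries are copied unchanged, so spaces and digits survive the conversion; a complete converter would also need the remaining letters of Table 2 and the reverse WX-to-UTF direction.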

Document Preprocessing

Before indexing, documents need to be cleaned by a standardization process. Assume that all the documents in the repository are in UTF-encoded format. The steps involved in creating the searchable data structure are as follows:

Let $D = \{d_1, d_2, d_3, \ldots, d_n\}$ be the set of documents, where $d_j$ is the $j$-th document of $D$ for $j = 1$ to $n$. In this paper $n = 3500$.

Split $d_j$ into sentences using the . (dot) delimiter:

$S_j = \{s_{j,1}, s_{j,2}, s_{j,3}, \ldots, s_{j,m}\}$, where $S_j$ is the set of sentences of $d_j \in D$, $j = 1$ to $n$ and $k = 1$ to $m$.

If a whole sentence is treated as an index term, precision may be promising, but recall will be too low.

Split the sentences into words, using white space and new line as delimiters:

$W_{j,k} = \{w_{j,k,1}, w_{j,k,2}, w_{j,k,3}, \ldots, w_{j,k,p}\}$, where $W_{j,k}$ is the set of words of sentence $s_{j,k}$, which belongs to document $d_j$ of the set $D$.

Characterize the words by applying a POS tagger. In this paper we used a Morphological Analyzer to generate the root words and their POS tags.

Apply a stop list to eliminate the most frequent words, such as prepositions, conjunctions, interjections, numerals and pronouns. The weight of term $T_i$ in document $d_j$ is its relative frequency:

$$W_{ij} = \frac{\text{number of occurrences of term } T_i \text{ in document } d_j}{\text{total number of words in document } d_j}$$

The list of stop words was collected from various sources [11], and a few were collected manually from the corpus.

By setting an upper and a lower threshold on this weight, a few more functional words are identified. A wrong choice of threshold value leads to poor indexing.

Select an indexing method to weight the terms of the documents and obtain better-tuned index terms:

Frequency-based Indexing [12]

Inverse Document Frequency Indexing [13]

Latent Semantic Indexing using Singular Value Decomposition [14].
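The document pre-processing steps above (sentence splitting, word splitting, relative term frequency and threshold filtering) can be sketched as follows. This is a minimal illustration under stated assumptions: the tokens are short placeholders rather than real WX words, the threshold values are hypothetical, and treating the two thresholds as a band of acceptable weights is one possible reading of the text.

```python
from collections import Counter


def split_sentences(document):
    """Split a document into sentences on the '.' delimiter."""
    return [s.strip() for s in document.split(".") if s.strip()]


def split_words(sentence):
    """Split a sentence into words on white space and new lines."""
    return sentence.split()


def term_weights(document):
    """Return W_ij = occurrences of a term in the document / total words in the document."""
    words = [w for s in split_sentences(document) for w in split_words(s)]
    counts = Counter(words)
    total = len(words)
    return {term: n / total for term, n in counts.items()}


def index_terms(weights, lower, upper):
    """Keep terms whose weight lies in [lower, upper]; the rest are treated as functional words."""
    return {t: w for t, w in weights.items() if lower <= w <= upper}


doc = "a b a c. a d b.\na e"          # placeholder tokens, 9 words in total
weights = term_weights(doc)           # a: 4/9, b: 2/9, c, d, e: 1/9 each
kept = index_terms(weights, lower=0.15, upper=0.4)  # hypothetical thresholds
print(sorted(kept))                   # -> ['b']
```

Here the very frequent term "a" is dropped by the upper threshold and the rare terms by the lower one, which also shows how a badly chosen threshold can discard useful index terms.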

Experiments and Results Analysis

The proposed work has been implemented on a sample collection of 3,500 documents manually categorized into ten groups. In each category, 10 queries were created by random word selection, each with a length of 3 content words.

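Table 3. Input Statistics (table body not available).

The weight $\text{tf-idf}_{t,d}$ is assigned to a term $t$ in a document $d$:

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$$

The query terms are used to rank the documents by their score values:

$$\text{SCORE}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d}$$

where the sum runs over each query term $t$ in $q$.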

The indexing process was completed for the whole document collection. Documents are retrieved for a query q by taking the documents d ∈ D with the highest scores for the query terms that occur in them.
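The scoring and ranking described above can be sketched in a few lines. The paper does not spell out its tf and idf definitions, so this sketch assumes tf is the relative term frequency within a document and idf is log(N / df), a common textbook choice; the document ids and index terms are made-up placeholders.

```python
import math
from collections import Counter


def build_index(docs):
    """docs maps a document id to its list of index terms. Returns (tf, idf)."""
    n_docs = len(docs)
    tf = {}
    df = Counter()
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        tf[doc_id] = {t: c / len(terms) for t, c in counts.items()}
        df.update(counts.keys())  # each document counts once per term
    idf = {t: math.log(n_docs / d) for t, d in df.items()}
    return tf, idf


def score(query_terms, doc_id, tf, idf):
    """SCORE(q, d): sum of tf-idf over the query terms, zero for absent terms."""
    return sum(tf[doc_id].get(t, 0.0) * idf.get(t, 0.0) for t in query_terms)


def rank(query_terms, tf, idf):
    """Return document ids by descending score, dropping documents that score zero."""
    scored = [(score(query_terms, d, tf, idf), d) for d in tf]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [d for s, d in ordered if s > 0]


docs = {  # placeholder index terms, not taken from the paper's corpus
    "d1": ["rOju", "swrI", "rOju"],
    "d2": ["swrI", "kAlaM"],
    "d3": ["pATa"],
}
tf, idf = build_index(docs)
print(rank(["swrI", "rOju"], tf, idf))  # -> ['d1', 'd2']
```

A term that occurs in every document gets idf = 0 under this definition, so it contributes nothing to the score, which is the behaviour that distinguishes this weighting from plain frequency-based indexing.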

Precision and Recall

Precision is the ratio of the number of relevant items retrieved to the total number of items retrieved.

$$\text{Precision} = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}}$$

Recall is the ratio of the number of relevant items retrieved to the total number of relevant items.

$$\text{Recall} = \frac{\text{number of relevant items retrieved}}{\text{total number of relevant items}}$$
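As a small worked example of the two measures, the sketch below computes precision and recall for one query. The document ids and relevance judgments are invented for illustration, and returning 0.0 for an empty set is a convention chosen here, not something stated in the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three judged relevant, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p)  # -> 0.5
print(round(r, 3))  # -> 0.667
```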

Conclusion & Future Work

To build e-learning applications for text retrieval, Information Retrieval methods are a good start. Cataloging in a library system motivated us to build an online library whose source is a collection of categorized text documents gathered from daily newspapers. Many issues arose while implementing the proposed system. Because we used a UTF-to-WX and WX-to-UTF converter before and after pre-processing the text, identifying foreign words and numbers is a very important task. Indexing based on Inverse Document Frequency gave better results than our baseline, frequency-based indexing. Still, the results are not up to the mark as far as concept description is concerned. As future work, we propose to use lexical resources to expand documents during indexing and to expand queries during the search process. No Telugu-to-Telugu lexical and semantic resources are available. Better representation of documents and queries at the semantic level needs concept descriptions. Latent Semantic Indexing addresses the synonym problem. High-level concepts can be represented using lexical resources such as WordNet, concept hierarchies (semantic networks) and ontologies. Building these kinds of language resources is a challenging task for researchers and developers because of the richness of the Telugu language. Multilingual and cross-lingual Information Retrieval systems depend on a source language and a mapping to another language, whereas for monolingual retrieval it is important to identify language features, which vary from language to language.
