Application Of Data Warehousing And Mining


02 Nov 2017

Disclaimer:
This essay has been written and submitted by students and is not an example of our work. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of EssayCompany.

Abstract—Major libraries hold large collections of data. Managing libraries electronically has resulted in the creation and management of large library databases, and searching the information stored in them is therefore a significant challenge. In this context, data mining improves the quality and efficiency of search results. Many multinational companies have exploited such opportunities by applying data warehouse technology to their data repositories, discovering knowledge that has helped them gain competitive advantage through better decision making. The same can be done for libraries using their large databases. This paper shows how data warehouse technology could help libraries discover knowledge and improve services.

Index Terms—Data mining in Library Database

Introduction

A library records data about its books in library catalogues. Each catalogue entry consists of data about the author, title, subject, publisher, edition, place, year, language and ISBN of the book. In a manual library system, catalogues are typically maintained only in author sequence, because each catalogue entry requires a separate entry card placed in the catalogue drawer. A user can therefore search for a catalogue entry only by author name.

To improve on this in an online library management system, we can apply data mining and improve search results on the basis of users' previous searches. We can also analyze the criticality of books so that the library is alerted when a critical book needs to be added. To find critical books, we can analyze the pattern in a line plot of the number of copies of each book against its search count.

SOME DEFINITIONS AND PRELIMINARIES

E-library

A library managing an e-catalogue would have three categories of data, namely bibliographic (catalogue), circulation (borrowing) and acquisition (purchasing). For each catalogue entry there would be some information about its acquisition process.

Library circulation data must be kept in the database until the borrowed items are returned. However, as we will see later, retaining these data beyond that point can help a librarian in decision making.

Digital library

The digital library, also called an electronic library, is being widely adopted across many libraries.

The functions of a library have grown beyond maintaining books, magazines and newspapers. Many libraries also provide CD and video lending, online searching, reservations and e-journal browsing. Certain universities and libraries have moved beyond even this level and provide full-text books, multimedia manuscripts and periodicals (Chen 2000). Newspapers and technical documents are also being made available on the web alongside their print editions.

Now let us try to quantify the data involved in transforming a paper-based library into a digital library, starting with the storage requirements for digital material. A scanned A4 page takes on average about 50 kilobytes of storage (White paper 22009). A standard storage box or drawer is estimated to hold about 2500 pages of information, or roughly 125 megabytes, so a CD could store the content of a four-drawer file cabinet. This type of calculation suits organizations that want to keep their records electronically. For books, scanning may at first seem impractical because it could damage the book, especially an old one. However, the way digital copiers handle photocopying suggests that it is not that difficult. If we assume that a book has 250 pages on average, then about 40 books fit on one CD. The requirement goes up for high-resolution color material, and similarly for audio and video content. Thus, when digital material is included, e-catalogues are considered digital libraries using very large databases (VLDB). Current digital libraries must also contain the various scholarly papers required by both students and teachers.
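The storage estimate above can be checked with a few lines of arithmetic. The sketch below uses the page, drawer and book figures from the text; the CD capacity of 650 MB and the conversion 1 MB = 1000 KB are assumptions made here for round numbers, not values given in the essay.

```python
# Figures from the text.
KB_PER_PAGE = 50          # average scanned A4 page
PAGES_PER_DRAWER = 2500   # one storage box or drawer
PAGES_PER_BOOK = 250      # average book length

# Assumptions not stated in the text.
CD_CAPACITY_MB = 650      # nominal CD-ROM capacity
KB_PER_MB = 1000          # decimal megabytes, for round figures

drawer_mb = KB_PER_PAGE * PAGES_PER_DRAWER / KB_PER_MB   # storage for one drawer
book_mb = KB_PER_PAGE * PAGES_PER_BOOK / KB_PER_MB       # storage for one book

four_drawer_cabinet_mb = 4 * drawer_mb
forty_books_mb = 40 * book_mb

print(f"One drawer: {drawer_mb} MB, one book: {book_mb} MB")
print(f"Four-drawer cabinet: {four_drawer_cabinet_mb} MB")
print(f"Forty books: {forty_books_mb} MB")
print("Cabinet fits on one CD:", four_drawer_cabinet_mb <= CD_CAPACITY_MB)
print("Forty books fit on one CD:", forty_books_mb <= CD_CAPACITY_MB)
```

Under these assumptions both the four-drawer cabinet and the 40 books come to 500 MB, which is within the assumed CD capacity.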

Information in a library is of two kinds. There is the content: the collection, everything that resides in books, journals and special collections. And there is the information about that content, the metadata: where things are located, how they relate to other things, and how often they circulate (but rarely, for privacy reasons, who actually accesses and reads the content). It is this latter kind of information, the metadata, that data mining in libraries draws on.

Data Warehousing

A data warehouse is defined as a "subject-oriented, integrated, time variant, non-volatile collection of data" that serves as a physical implementation of a decision support data model and stores the information an enterprise needs to make strategic decisions. In data warehouses, historical, summarized and consolidated data are more important than detailed, individual records. Since data warehouses contain consolidated data, perhaps from several operational databases, over potentially long periods of time, they tend to be much larger than operational databases.

Data Mining

Data mining is the extraction, or "mining", of knowledge from a large amount of data or a data warehouse. To do this, data mining combines artificial intelligence, statistical analysis and database management systems to pull knowledge from stored data.

Data mining is the process of applying intelligent methods to extract data patterns. This is done using front-end tools. The spreadsheet is still the most compelling front-end application for Online Analytical Processing (OLAP). The challenge in supporting a query environment for OLAP can be crudely summarized as that of supporting spreadsheet operations efficiently over large multi-gigabyte databases.

Data, Information, Knowledge and Data Mining

Data

Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:

operational or transactional data, such as sales, cost, inventory, payroll, and accounting

nonoperational data, such as industry sales, forecast data, and macro-economic data

metadata, that is, data about the data itself, such as logical database design or data dictionary definitions

Data Format

Data items can exist in many formats such as text, integer and floating-point decimal. Data format refers to the form of the data in the database.

Information

The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.

Knowledge

Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

Binning

A data preparation activity that converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values. For example, age could be converted to bins such as 20 or under, 21-40, 41-65 and over 65.
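As a minimal sketch of binning, the function below maps an age to one of the four bins named above. The names `UPPER_BOUNDS`, `BIN_LABELS` and `age_bin` are illustrative, and the sketch assumes ages are given in whole years.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three bins; ages above 65 fall in the last bin.
UPPER_BOUNDS = [20, 40, 65]
BIN_LABELS = ["20 or under", "21-40", "41-65", "over 65"]

def age_bin(age):
    """Return the bin label for an age given in whole years."""
    # bisect_left finds the first bound that is >= age, which is the bin index.
    return BIN_LABELS[bisect_left(UPPER_BOUNDS, age)]

for age in (18, 20, 21, 40, 65, 66):
    print(age, "->", age_bin(age))
```

Replacing each continuous value with its bin label in this way turns a numeric column into a small set of categories that later mining steps can count and compare.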

Data Mining

Data mining can be defined as "An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis."

Bibliomining

Use of data mining to examine library data records might be aptly termed bibliomining. With widespread adoption of computerized catalogs and search facilities over the past quarter century, library and information scientists have often used bibliometric methods (e.g. the discovery of patterns in authorship and citation within a field) to explore patterns in bibliographic information. During the same period, various researchers have developed and tested data mining techniques — advanced statistical and visualization methods to locate non-trivial patterns in large data sets. Bibliomining refers to the use of these techniques to plumb the enormous quantities of data generated by the typical automated library.

Data Mining and the Libraries

So far we have discussed data mining and how it works. We now explore how data mining can be useful in the field of library and information science. According to the fifth law of library science, "Library is a growing organism" [9], so the volume of library data is also growing at an enormous rate. Library automation and e-libraries arose from the need to administer libraries efficiently and effectively and to extend library services. But simply automating the library or developing an e-library is not a complete solution unless we can also uncover the hidden information in these large databases. This can be done by applying data mining to library data.

We now take a glance at the possibilities that data mining opens up in the field of library and information science.

Classification- By using data mining we can develop a computer program that replaces manual classification with automatic classification of library contents. Classification mimics library cataloging procedures by grouping structured and unstructured data according to criteria such as source (e.g., government bodies), document type (e.g., maps), language, subject, or a number of other criteria [3].

Link analysis- As with paper materials, where similar documents tend to have similar bibliographical references and frequency of citation is often considered to reflect the quality or importance of a document, link analysis assumes that higher-quality or otherwise more desirable documents will generally be linked to more frequently than other documents, and that the links in a document reveal something about its content. Link analysis can place frequently linked-to documents at the top of a list or identify documents that are associated with each other [3].

Sequence analysis- Sequence analysis uses statistical analysis to identify unlinked documents that users are likely to want to read together. It examines the paths that users follow when searching for information and can help identify which documents users are likely to want together [3].

Summarization- Although machine-generated abstracts are inferior to human-generated ones in terms of readability and content, they can be very useful for helping users decide what items they need. Abstract-generating software typically works by identifying significant words or phrases based on their position within documents or their association with critical phrases [3].

Clustering- Clustering is similar to classification, except that the classes are determined by finding natural groupings in the data items based on probability analyses rather than by predetermined groupings. Clustering and classification are often used as a starting point for exploring further relationships in data. For example, many search engines (such as Northern Light) break down sites by location, subject, or language before sub-arranging data [3].
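Of the techniques listed above, link analysis is perhaps the simplest to illustrate. The sketch below ranks documents by how many other documents link to them, which is only the most basic form of the idea; the `links` data and the function name `rank_by_inbound_links` are hypothetical.

```python
from collections import Counter

# Hypothetical link data: each document maps to the documents it links to.
links = {
    "doc_a": ["doc_b", "doc_c"],
    "doc_b": ["doc_c"],
    "doc_c": [],
    "doc_d": ["doc_c", "doc_b"],
}

def rank_by_inbound_links(link_map):
    """Return (document, inbound link count) pairs, most linked-to first."""
    inbound = Counter({doc: 0 for doc in link_map})
    for source, targets in link_map.items():
        for target in set(targets):     # count each linking document once
            if target != source:        # ignore self-links
                inbound[target] += 1
    # Sort by descending count, then by name so ties come out in a fixed order.
    return sorted(inbound.items(), key=lambda item: (-item[1], item[0]))

ranking = rank_by_inbound_links(links)
for doc, count in ranking:
    print(doc, count)
```

A production system would normally weight links rather than simply count them, but the ordering step is the same: frequently linked-to documents are placed at the top of the list.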

Data mining consists of five major elements

1. Extract, transform, and load transaction data onto the data warehouse system.

2. Store and manage the data in a multidimensional database system.

3. Provide data access to business analysts and information technology professionals.

4. Analyze the data by application software.

5. Present the data in a useful format, such as a graph or table.
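The five elements above can be sketched end to end on a toy scale. In the sketch below a plain Python list stands in for the warehouse, so steps 2 and 3 (multidimensional storage and shared access) are only hinted at; the record layout `date,isbn,subject` and all function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical raw circulation transactions in the form "date,isbn,subject".
RAW_TRANSACTIONS = [
    "2017-10-01,9780000000001, History ",
    "2017-10-01,9780000000002,science",
    "2017-10-02,9780000000001,history",
    "bad record",
]

def extract_transform(raw_lines):
    """Step 1: parse each line, normalise the subject, skip malformed rows."""
    rows = []
    for line in raw_lines:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue  # cleaning: drop rows that do not have three fields
        date, isbn, subject = parts
        rows.append({"date": date, "isbn": isbn, "subject": subject.lower()})
    return rows

def analyze(rows):
    """Step 4: count loans per subject."""
    return Counter(row["subject"] for row in rows)

def present(counts):
    """Step 5: format the counts as a small text table."""
    lines = [f"{'subject':<10} loans"]
    for subject, loans in counts.most_common():
        lines.append(f"{subject:<10} {loans}")
    return "\n".join(lines)

warehouse = extract_transform(RAW_TRANSACTIONS)  # the list stands in for steps 2 and 3
subject_counts = analyze(warehouse)
print(present(subject_counts))
```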

Applications in Library Management System

After an information resource appears in the library’s collection, users locate it using catalog search systems and bibliographic databases. Although little uniformity exists with regard to the specifics of the user interfaces for these systems, most catalogs and bibliographic databases support a standard Web browser client as the user interface. Increasingly, catalogs and databases are cross-linked, and each user’s search record and traversal of links appears in log files. When users find resources that they wish to borrow, the circulation department records their selection in a database that tracks the location of each resource owned by the library. As this overview suggests, all functional processes of the library – collection assessment, acquisition, cataloging, end user searching, and circulation – generate large reserves of available data that document information resource acquisition and use. Library information systems frequently use large relational databases to store user information, resource information, circulation information, and possibly bibliographic search logs. The vast data stored in the databases of traditional and digital libraries represent the behavioral patterns of two important constituencies: library staff and library users. In the case of library staff, mining available acquisitions and bibliographic data could provide important clues to understanding and enhancing the effectiveness of the library’s own internal functions. Mining user data for knowledge about what information library users are seeking, whether they find what they need, and whether their questions are answered, could provide critical insights useful in customer relations and knowledge management. These kinds of information can have strategic utility within the larger organization in which the library is situated.

Integrated Library Systems and Data Warehouses- Although the system used in most parts of the library is commonly known as an Integrated Library System (ILS), very few ILS vendors facilitate access to the data generated by the system in an integrated fashion. Instead, most librarians conceptualize their system as a set of separate data sources. While a relational database stands at the heart of most ILS systems, few system vendors provide sophisticated analytical tools that would promote useful access to this raw data. Instead, vendors encourage library staff to use pre-built front ends to access their ILS databases; these front ends typically have no features that allow exploration of patterns or findings across multiple data sets.

As a first step, most managers who wish to explore bibliomining will need to work with the technical staff of their ILS vendors to gain access to the databases that underlie their system.

Once the vendor has revealed the location and format of key databases, the next step in bibliomining is the creation of a data warehouse. As with most data mining tasks, the cleaning and pre-processing of the data can absorb a significant amount of time and effort. A truly useful data warehouse requires integration methods to permit queries and joins across multiple heterogeneous data sources. Only by combining and linking different data sources can managers uncover the hidden patterns that can help them understand library operations and users. After the data warehouse is set up, it can be used not only for traditional SQL-based question-answering, but also for online analytical processing (OLAP) and data mining. Multidimensional analysis tools for OLAP (e.g., Cognos) would allow library managers to explore their traditional frequency-based data in new ways by looking at statistics along easily changeable dimensions. The same data warehouse that supports OLAP also sets the stage for data mining. This data warehouse will lower the cost of each bibliomining project, which will improve the cost/benefit ratio for these projects. The remainder of this paper builds on the assumption that this data warehouse is available.
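To make the idea of joining data sources and changing dimensions concrete, the sketch below loads two small tables into an in-memory SQLite database and counts loans along either of two dimensions. The table and column names (`catalogue`, `circulation`, `subject`, `loan_month`) are hypothetical and do not describe any particular ILS; a real OLAP tool would offer far more than this single roll-up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalogue (isbn TEXT PRIMARY KEY, title TEXT, subject TEXT);
    CREATE TABLE circulation (isbn TEXT, loan_month TEXT);
""")
# Hypothetical sample rows standing in for two separate ILS data sources.
conn.executemany(
    "INSERT INTO catalogue VALUES (?, ?, ?)",
    [("111", "Book A", "history"), ("222", "Book B", "science")],
)
conn.executemany(
    "INSERT INTO circulation VALUES (?, ?)",
    [("111", "2017-09"), ("111", "2017-10"), ("222", "2017-10")],
)

ALLOWED_DIMENSIONS = {"subject", "loan_month"}

def loans_by(dimension):
    """Count loans grouped by one dimension, joining circulation to the catalogue."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unsupported dimension: {dimension}")
    # The dimension name is checked against a whitelist before being placed in the SQL.
    query = f"""
        SELECT {dimension}, COUNT(*)
        FROM circulation JOIN catalogue USING (isbn)
        GROUP BY {dimension}
        ORDER BY {dimension}
    """
    return conn.execute(query).fetchall()

print(loans_by("subject"))
print(loans_by("loan_month"))
```

Switching the argument from `subject` to `loan_month` is a small-scale version of looking at the same statistic along a different dimension.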

Bibliomining to Improve Library Services

The users of library services are one of the most important constituencies in most library organizations. Most libraries exist to serve the information needs of users, and therefore understanding those needs is crucial to a library's success. Examining an individual user's behavior may aid in understanding that individual, but it tells librarians very little about the larger audience of users. Examining the behavior of a large group of users for regular patterns can give the library a better idea of the information needs of its user base, and therefore allow it to better customize library services to meet those needs. For many decades libraries have provided readers' advisory services with the help of librarians who know the collection well enough to help a user choose a work similar to other works. Market basket analysis can provide the same function by examining circulation histories to locate related works. In addition, this information could be provided to the OPAC to allow users to see works similar to the one they have selected, based upon circulation histories. While it is technically possible to build a profile for users based upon their own circulation history (as Amazon.com does, for example), it may be legally and ethically questionable to do this without a user's permission. Nonetheless, by obtaining and using anonymous data from a large number of users, one can obtain similar results.

In order to locate works in the library, users rely on the OPAC. Librarians often examine user comments and surveys to assess user satisfaction with these tools. However, searches also leave artifacts, such as log entries, that show where users ran into trouble, so librarians may wish to examine those artifacts for problem areas instead of relying only on comments and surveys. When upgrading or changing library system interfaces, librarians can explore the patterns of common mistakes found there in order to make informed decisions about system improvements.
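A minimal sketch of locating related works from anonymous circulation histories is shown below. It simply counts how often two titles appear in the same borrower's history, which is the first step of a market basket analysis rather than a full implementation; the `histories` data, the function names and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical anonymised histories: one set of titles per borrower, no user IDs kept.
histories = [
    {"Title A", "Title B", "Title C"},
    {"Title A", "Title B"},
    {"Title B", "Title C"},
    {"Title A", "Title D"},
]

def co_circulation_counts(baskets):
    """Count how many histories contain each pair of titles."""
    pair_counts = Counter()
    for basket in baskets:
        # Sorting gives each pair a single canonical order.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    return pair_counts

def similar_works(title, pair_counts, min_support=2):
    """Return titles borrowed together with `title` in at least `min_support` histories."""
    related = []
    for (first, second), count in pair_counts.items():
        if count >= min_support and title in (first, second):
            other = second if first == title else first
            related.append((other, count))
    return sorted(related, key=lambda item: (-item[1], item[0]))

pair_counts = co_circulation_counts(histories)
print(similar_works("Title B", pair_counts))
```

Because the input holds only sets of titles, a list of similar works can be produced without building a profile of any individual user.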

Undergraduates don't use the precise terminology of Library of Congress classifications. They may be thrown by library terminology: Several students in the study thought that a link to "scores" pertained to sports scores, not musical scores, while others thought a link to "maps" would provide maps and directions to the library, not items from map collections.

Although students aren't discriminating about library classifications, they are concerned about finding sources for their research, and Web searches aren't meeting that need. In this respect, RLG's user studies echo the findings of a 2002 study by the Digital Library Federation and Outsell, Inc. That study found that existing information sources available to students are falling far short of their expectations in terms of quality, ease of access, subject coverage, and other aspects they considered most important. Ninety-one percent of students felt that high-quality information sources were very important; only 51 percent felt that available electronic information sources are meeting that need. A study published by the Online Computer Library Center (OCLC) in June 2002 confirms these findings, showing that accuracy is the most important attribute of online information for students—yet only half of those surveyed found that information on the Web is acceptable in this regard.

For example, a student might enter a search for the keywords "Civil War" without specifying the American, Spanish, or other civil wars. Using Recommend, RedLightGreen can organize the results in clusters of related items, letting the student pick which civil war interests her. At the same time the application can insert more specific, scholarly subject classification terms into the search that have been derived from the MindServer data. A search for "New York riots" would turn up records pertaining to the Irish draft riots of 1863, with subject headings such as "New York—History—Civil War, 1861-1865" and "Civil War, 1861-1865—Fiction"—even though those headings don't match the keywords exactly.

Bibliomining for Predicting Future User Needs

By looking for patterns in high-use items, librarians can better predict the demand for new items in order to determine how many copies of a work to order. To prevent inventory loss, predictive modeling can be used to look for patterns commonly associated with lost or stolen books and high user fees. Once these patterns have been discovered, appropriate policies can be put in place to reduce inventory losses. In addition, fraud models can be used to determine the appropriate course of action for users who are chronically late in returning materials. The library can also better serve its user audience by determining areas of deficiency in the collection. The reference desk and the OPAC are two sources of data that can aid in solving problems with the collection. If the topics of questions asked at the reference desk are recorded along with the perceived outcome of the interaction, then patterns can be discovered to guide librarians to areas that need attention in the collection.
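As a small illustration of the reference-desk idea, the sketch below tallies how often questions on each topic went unanswered and flags topics above a threshold. The log format, the outcome labels and the 0.5 threshold are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical reference-desk log: (topic, perceived outcome).
desk_log = [
    ("genealogy", "answered"),
    ("genealogy", "not answered"),
    ("genealogy", "not answered"),
    ("tax law", "answered"),
    ("tax law", "answered"),
    ("robotics", "not answered"),
]

def unanswered_rate_by_topic(log):
    """Return the share of questions per topic that were not answered."""
    totals = defaultdict(int)
    unanswered = defaultdict(int)
    for topic, outcome in log:
        totals[topic] += 1
        if outcome == "not answered":
            unanswered[topic] += 1
    return {topic: unanswered[topic] / totals[topic] for topic in totals}

def topics_needing_attention(log, threshold=0.5):
    """List topics whose unanswered rate is at or above the threshold."""
    rates = unanswered_rate_by_topic(log)
    return sorted(topic for topic, rate in rates.items() if rate >= threshold)

print(topics_needing_attention(desk_log))
```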

In the future, data mining can provide a new road map for the next generation of libraries when applied to the following library activities.

Searching of Information (Reference Service)- Library data are growing continuously at an exponential rate, and the main problem is how to find the required information among the large amount of redundant information in the library. Data mining techniques make this possible, so one can say that data mining is the future of reference service.

Classification- Data mining will replace manual classification of library content with computer-assisted classification, so that the task can be accomplished quickly and efficiently by a less skilled person. This will simplify the classification work of the library.

Acquisition- According to the third law of library science, "Every book its reader" [9]. By applying data mining to library data, one can easily find out which contents need to be acquired next. This will reduce the acquisition-related work of library staff and make more efficient use of the budget allocated to the library.

Finding Criticality Of Books

We need to find the critical books in the library. To do so, we analyze the pattern between the number of copies of each book present in the library and the number of times that book is searched for. On the basis of this analysis we can cluster the books into two groups, critical books and safe books. If the search count of a particular book is very low, there is little need to add copies of that book to the library. If it is relatively high with respect to the average, the book needs to be added. Clustering can be done in many ways. A graph can be plotted in the following way:

Fig. 6.1- Line plot of search count and no. of books for each book.
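One simple way to separate critical from safe books is sketched below. It is an assumption of this sketch, not a method stated in the essay, that demand is measured as searches per copy held and that a book is critical when its demand exceeds the average over all books; the `books` data is hypothetical and every book is assumed to have at least one copy.

```python
from statistics import mean

# Hypothetical data: copies held and search count for each book.
books = {
    "Title A": {"copies": 5, "searches": 20},
    "Title B": {"copies": 1, "searches": 30},
    "Title C": {"copies": 2, "searches": 4},
    "Title D": {"copies": 2, "searches": 28},
}

def classify_books(records):
    """Label each book 'critical' or 'safe' by comparing searches per copy to the average."""
    demand = {
        title: record["searches"] / record["copies"]   # assumes copies >= 1
        for title, record in records.items()
    }
    average = mean(demand.values())
    return {
        title: "critical" if value > average else "safe"
        for title, value in demand.items()
    }

labels = classify_books(books)
for title, label in labels.items():
    print(title, label)
```

A threshold on the average is the crudest possible split into two groups; a clustering algorithm run over the same two columns would be the natural next step.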

Conclusion

It can be concluded that data mining techniques are needed to redesign and simplify library work such as classification, acquisition, circulation and referencing. The main use of data mining is in referencing, but it can be used for other library work as well. Systematic efforts are therefore urgently needed to develop data mining techniques and algorithms for library databases.

Owing to automation, libraries have gathered data about their collections and users for years, but have rarely used those data for better decision making. By taking a more active approach based on applications of data mining, data visualization, and statistics, information organizations can get a clearer picture of their information delivery and management needs. At the same time, libraries must continue to protect their users and employees from misuse of personally identifiable data records. Because libraries must compete against online booksellers, downloadable audio books, and the vast supply of "free" information of varying quality from the Internet, librarians must begin to take the initiative in using their systems and data for competitive advantage and to justify continued support and funding of libraries.


