Characteristics Of Big Data

Abstract

The unremitting increase in computational power has produced a tremendous flow of data in the past two decades. This tremendous flow of data is known as "big data". Big data is data which cannot be processed with the aid of existing tools or techniques, and which, if processed, can yield interesting information such as analysis of user behavior and business intelligence. This paper discusses the difference between traditional relational databases and big data, and it presents the characteristics of big data. The paper also focuses on the challenges that arise across the big data pipeline, as well as on how big data is a solution for organizations. Finally, it discusses Hadoop, an open-source framework that allows distributed processing of massive data sets on clusters of computers. Big data is concerned not only with storing and handling large volumes of data but also with analyzing it and extracting the correct information from it in a shorter time span.

1. Introduction

The term big data applies to information or data that cannot be handled or analyzed using existing or traditional data mining techniques and tools. Many definitions of big data have been given by researchers; for example, big data [1] is data that is too massive, too fast, or too hard for existing tools to handle and process. Here, massive means that the size of the data can range from petabytes (PB) to exabytes (EB) or zettabytes (ZB). It can be generated from client-server interactions (known as click-stream data), sensors, retail, the finance sector, customer call records, transaction histories, and so on.

Big data refers to the technologies and architectures developed to capture, store, and process higher volumes of data in less time, or even in real time. It is a system that makes it possible to digitize massive amounts of information and amalgamate it with existing databases. An application of big data where its methods and technologies have a particularly strong impact is e-health [3], because a large amount of information exists in this segment. Of all the segments and applications where systems and technological solutions are evolving towards the Internet of the Future, e-health may be the sector where the methods and technologies applied to the data have the greatest impact. Health care providers, drug companies, and patients are the major beneficiaries of this new concept implemented by public and private institutions. The concern is not confined to the security of the data; it also covers the storage capacity, the need to share the data across multiple platforms and techniques, and the latent importance of reasoning about the data. This will improve research and treatment, which in turn will support decisions made on a genuine knowledge base, resulting in improved care, steps forward, and savings in management and monitoring.

Hadoop, an open-source framework, employs a simple programming model that allows distributed processing of massive data sets on clusters of computers. In the Hadoop stack, MapReduce is a software programming framework. It simplifies processing on big data sets and gives analysts a common method for expressing and coordinating complex processing tasks within clusters of computers. MapReduce applications manage the processing of tasks by scheduling jobs on cluster nodes, supervising activity, and re-executing unsuccessful tasks. The Hadoop Distributed File System stores the input and output. Usually the data is processed and stored on the same node, which makes it efficient to schedule tasks where the data is already located and results in high aggregate bandwidth across the cluster.

This paper is organized as follows: section 2 is the literature survey, section 3 describes the characteristics of big data, section 4 points out the challenges, section 5 throws light on big data as a solution for enterprises, section 6 discusses Hadoop, an open-source framework for big data decision making, and finally section 7 summarizes the paper.

2. Literature Survey

The unremitting increase in computational power has produced a tremendous flow of data in the past two decades. This tremendous flow of data is known as "big data"; it cannot be processed with the aid of existing tools or techniques and is more comprehensible to computers than to people. For example, a social network website like Facebook handles 570 billion web page accesses per month, accumulates 3 billion new photos every month, and deals with 25 billion pieces of content [4]. Facebook, Flickr, YouTube, LinkedIn, and Google's search [8] rely on a range of artificial-intelligence activities; they need to parse vast quantities of data and make decisions straight away.

"Big Data Research and Development Initiative" was announced by American government on March 29, 2012 and it becomes first time a national policy. It shows that major resources were allocated to hold the data severe operations which direct to high storage and information processing costs while daunting the challenges of big data. Cloud computing is one of the powerful architecture for big data that performs large scale and complex computing by combining the resources and presenting a single system view.

Another definition of big data, from a 2008 publication in the journal Science, is that it "represents the development of human cognitive methods, and generally includes massive data sets that are beyond the skill of present technology, process and theory to summarize, handle, and sort out within an acceptable elapsed time" [5]. Gartner's definition reads: "Big data refers to high-volume, high-velocity and high-variety data assets which demand novel forms of processing in order to enable improved decision making, insight discovery and process optimization" [6]. According to Wikipedia, "big data is a collection of large and complex data sets which become difficult to process using conventional database management tools".

Big data is the gathering of information from our daily life. Data from finance, the Internet, mobile devices, Radio-Frequency Identification (RFID), science, sensors, and streaming are the seven top data drivers. With the rapid rise in computing power and storage, organizations have adapted their capabilities to extract meaning from these massively large, diverse, noisy, and incomplete data sets gathered from a variety of sources. Researchers have not yet been able to agree on the vital features of big data. According to many organizations and researchers, big data is data that we are not able to deal with using conventional technology, approaches, and theory. It poses challenges for data management and analysis, and for the entire IT industry.

3. Characteristics of Big Data

Big data is described by the amplifying volume of the data, ranging up to zettabytes, and the velocity at which it is generated ([8], [9], [10], [11], [12], [13]). The difficulty of big data pertains to the capturing of data, storage, search, analytics, sharing, visualization, and so on [2]. The characteristics of big data, often known as the three Vs, are listed below; the fourth V refers to the uncertainty and ambiguity in the data, and all four are shown in Fig. 1.

• Volume – Massive data sets scale far beyond the data handled in conventional storage and analytical solutions. Think of data sizes in petabytes instead of terabytes.

• Variety – Structured, unstructured, semi-structured, variable, and complex data generated in different formats such as blogs, e-mail, images, sensor data, social media, usage data, etc.

• Velocity – The speed at which the data is generated, with real-time queries requiring significant information to be supplied on demand rather than in gathered batches.

• Veracity – The data is uncertain, inconsistent, and ambiguous (difficult to understand).

Fig. 1 Big Data Characteristics: Volume (terabytes, petabytes, exabytes, zettabytes, up to yottabytes), Variety (structured, semi-structured, and unstructured data such as text, video, and images), Velocity (analytics on data at rest versus stream processing of data in motion with millisecond responses), and Veracity (uncertain, inconsistent, ambiguous data).

3.1. Traditional Data Vs. Big Data

Big data management is characteristically dissimilar from the long-established relational model of data management. The difference between relational databases and big data is shown in Table 1. Although the difference is frequently described in terms of the data, "structured versus unstructured", this is not entirely accurate. Log data, a growing source of big data, has structure. The real distinction is that big data systems handle data in any structure and do not require up-front time and effort to create a model before capturing, processing, and examining the data.

Table 1. Differences between relational databases and big data

Type of Data
  Relational database: Stored data (data at rest).
  Big data: Data at rest (Big Insights) and data in motion (stream data).

Processing
  Relational database: Centralized processing; single-computer programs that scale with better CPUs.
  Big data: Parallel or distributed processing of large data sets; clusters of systems that scale to thousands of nodes.

Management
  Relational database: Centralized storage of data; relational databases (SQL).
  Big data: Distributed storage of data; supports structured, semi-structured, and unstructured data (NoSQL).

Analytics
  Relational database: Prediction or forecasting performed on a portion of the data; centralized and descriptive.
  Big data: Concurrent, distributed, predictive, and prescriptive.

4. Challenges of Big Data

Big data may be the most recent trend in business technology; however, it can also be a considerable headache. Whenever big data is discussed, CIOs focus on the positive side, such as cost reduction, adding tactical value to production, customer retention, and understanding user behavior. They are unlikely to talk about the storage, management, and analysis of the massive data. In particular, big data is not easy for an organization to adopt; moreover, it is a tool and not the solution itself.

The major challenges of big data are:

• Non-availability of data due to security reasons

• Massive storage space

• Adoption of new technology

• Significant processing power

All of the above can be costly and hard to justify within an average, tightly controlled IT budget. Big data is a fairly new development in IT operations, which amplifies the risk of exceeding budgets, missing project deadlines, and producing unsatisfactory results. This amplified risk can keep businesses from implementing big data methodologies.

Big Data Storage and Management: The technologies of existing data management systems are not able to meet the needs of big data, while the growth in storage capacity is much slower than the rate at which data is generated. A reconstruction of the information framework is therefore urgently required, such as a hierarchical storage architecture. Furthermore, earlier algorithms cannot effectively store data obtained directly from the real world because of its heterogeneity, although they perform outstandingly on homogeneous data. Hence, reorganization of data is a major problem in big data management. The use of virtual server technology can intensify the problem, raising the possibility of over-committed resources, especially when communication between the storage, server, and application administrators breaks down.

Big Data Computation and Analysis: Speed is a significant demand when processing big data queries [7]. Because a query cannot traverse all the related data in the entire database in a short time, an index is the best possible choice; however, indices in big data target only simple types of data, while big data itself is becoming more intricate. Combining the right indices for big data with up-to-date preprocessing technology will be key to these problems. Natural computational models for approaching big data problems are application parallelization and divide-and-conquer. However, acquiring additional computational resources is not always simple, and the usual sequential algorithms are inefficient for big data.
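As a minimal sketch of the divide-and-conquer model just mentioned (the input file name, chunk size, and word-count workload are illustrative assumptions, not details from the paper), the following Python snippet splits a large text file into chunks, processes the chunks in parallel, and merges the partial results:

```python
# Minimal divide-and-conquer sketch: split a large text file into chunks,
# count words in each chunk in parallel, then merge the partial counts.
# File name and chunk size are illustrative assumptions.
from collections import Counter
from multiprocessing import Pool


def count_words(chunk: str) -> Counter:
    """Conquer step: count word occurrences in one chunk of text."""
    return Counter(chunk.split())


def split_into_chunks(path: str, lines_per_chunk: int = 100_000):
    """Divide step: yield fixed-size blocks of lines from a large file."""
    with open(path, encoding="utf-8") as f:
        block = []
        for line in f:
            block.append(line)
            if len(block) == lines_per_chunk:
                yield "".join(block)
                block = []
        if block:
            yield "".join(block)


if __name__ == "__main__":
    total = Counter()
    with Pool() as pool:  # one worker process per CPU core
        for partial in pool.imap_unordered(count_words,
                                           split_into_chunks("big_input.txt")):
            total.update(partial)  # combine step: merge partial results
    print(total.most_common(10))
```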

Big Data Security: Using big data applications online helps many organizations reduce their IT expenditure. But because of the massive involvement of third-party infrastructures used to host vital data or run critical operations, security and privacy concerns affect the entire big data storage and processing pipeline. Major challenges such as dynamic data monitoring and security protection arise because of the scale of the data and the exponential growth of applications. Unlike conventional security approaches, security in big data is primarily a question of how to perform data extraction or mining without revealing sensitive information about users. Moreover, current privacy-protection technologies are mainly based on static data sets, whereas the data is continually and dynamically changing, including its patterns, variations in attributes, and the addition of new data. It is therefore a challenge to implement effective privacy protection in these circumstances.
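One small illustrative sketch of this kind of privacy-preserving processing is shown below; the keyed-hash (HMAC) pseudonymization scheme is an assumption for illustration and is not prescribed by the paper. User identifiers are replaced by stable tokens so that behavior can still be aggregated per user without exposing the raw identifier:

```python
# Illustrative sketch: pseudonymize user IDs with a keyed hash (HMAC) so that
# records can still be grouped per user without revealing the raw identifier.
# In practice the secret key would be held outside the analytics cluster.
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-key-kept-outside-the-cluster"  # assumption


def pseudonymize(user_id: str) -> str:
    """Return a stable, non-reversible token for a user identifier."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


record = {"user_id": "alice@example.com", "page": "/checkout", "ms": 1341}
safe_record = {**record, "user_id": pseudonymize(record["user_id"])}
print(safe_record)  # same structure, but the identifier is now a pseudonym
```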

High-Speed Networking: Transferring terabytes of data within a cluster can take an hour or a day even over a fast network connection. Bandwidth bottlenecks amplify the challenge of making efficient use of the computing and storage resources in the cluster.
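As a rough back-of-the-envelope check of this claim (the link speed is an assumed figure, not one given in the paper): moving 1 TB over a sustained 1 Gbit/s link takes about 8 × 10^12 bits ÷ 10^9 bits/s ≈ 8,000 seconds, roughly 2.2 hours, so moving tens of terabytes between nodes can indeed consume the better part of a day.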

5. Big Data as a Solution for Enterprises

Enterprises are eager to implement big data as a part of their information technology strategy. With the aid of big data technologies, organizations can analyze massive data sets that they were previously unable to analyze, such as data from industrial sensors, mobile devices, social media, and web logs. This kind of data does not fit properly within established organizational data repositories and tools, as it needs a large amount of storage space, is unstructured in nature, and arrives at very high velocity.

Big data skills support the analysis of previously untouched data. Big data technologies are intended to be more efficient at analyzing hefty volumes of unstructured data. Once the data is processed and aggregated, it can be incorporated within the existing enterprise infrastructure and used to better spot business trends, improve customer relations, target online marketing, and discover new sources of customers.

5.1. Three Major Areas Where Customers Leverage Big Data

A. Analyzing Customer Behavior – Social websites such as Facebook generate 25 terabytes of log data each day. Even smaller websites usually generate many gigabytes of web log data every day. The web logs contain valuable information about user or customer navigation behavior. Analysis of customer navigation behavior can help managers amplify sales by modifying web pages to match users' requirements. Analyzing this kind of data without big data systems is prohibitively expensive.

With the help of Hadoop, the most widely adopted big data analysis platform, organizations can store and analyze the data in a cost-effective way and insert the results into the enterprise data warehouse for conventional analysis in combination with sales and customer-relations data from established enterprise databases.
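As a small sketch of this kind of click-stream analysis (the log file name and Common Log Format layout are assumptions for illustration; at web scale this logic would run as a Hadoop job rather than a single process), the following Python snippet counts successful GET requests per page from a web server access log:

```python
# Sketch: count page requests per URL from a web server access log in
# Common Log Format. The log path and format are illustrative assumptions;
# at web scale this logic would run as a distributed Hadoop job.
import re
from collections import Counter

# host ident user [time] "METHOD /path HTTP/x.x" status bytes
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')


def page_hits(log_path: str) -> Counter:
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LOG_LINE.match(line)
            if m:
                method, path, status = m.group(3), m.group(4), m.group(5)
                if method == "GET" and status.startswith("2"):
                    hits[path] += 1  # count only successful page fetches
    return hits


if __name__ == "__main__":
    for path, count in page_hits("access.log").most_common(10):
        print(f"{count:8d}  {path}")
```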

B. Recommendation Systems – Recommendation systems are used to suggest items that visitors are likely to enjoy, or to link web pages to a visitor's current search based on their browsing behavior. Online retailers make purchase recommendations in this way: Netflix awarded a prize of US$1,000,000 for improving its recommendation system by 10 percent, and Zappos uses purchase history to suggest products to its visitors.

However, it is not so simple to make good recommendations. Many things have to be taken into consideration, such as purchase history, ratings of purchases, products viewed but not purchased, products friends bought, responses to previous recommendations, products purchased in a region, products bought by customers with related tastes, and so on. The possibilities are endless. Organizations can quickly develop and advance their recommendation algorithms by combining the data stored in their data warehouses with data stored in Hadoop and other NoSQL sources. This eliminates the need to copy the massive data, which reduces the cost of the IT infrastructure and helps deliver better recommendations to visitors on the websites.
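The "customers with related tastes" signal mentioned above can be sketched with a simple item co-occurrence count. The toy purchase histories below are invented for illustration and are not data from the paper; a production system would compute the co-occurrence counts over the full purchase history, typically as a distributed job:

```python
# Toy item-to-item recommendation sketch based on co-occurrence:
# items frequently bought together with what a customer already owns are
# recommended first. The purchase histories are invented for illustration.
from collections import Counter
from itertools import combinations

purchase_histories = [
    {"laptop", "mouse", "laptop_bag"},
    {"laptop", "mouse"},
    {"phone", "phone_case", "charger"},
    {"laptop", "laptop_bag", "charger"},
]

# Count how often each unordered pair of items appears in the same basket.
co_occurrence = Counter()
for basket in purchase_histories:
    for a, b in combinations(sorted(basket), 2):
        co_occurrence[(a, b)] += 1


def recommend(owned: set, top_n: int = 3) -> list:
    """Score every unowned item by how often it co-occurs with owned items."""
    scores = Counter()
    for (a, b), count in co_occurrence.items():
        if a in owned and b not in owned:
            scores[b] += count
        elif b in owned and a not in owned:
            scores[a] += count
    return [item for item, _ in scores.most_common(top_n)]


print(recommend({"laptop"}))  # e.g. ['laptop_bag', 'mouse', 'charger']
```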

C. Speeding Up the Business Insight Cycle – In order to speed up data processing and reduce the impact on production systems, scalable Hadoop clusters are used. Hadoop processes extracts from the transactional database and structured files within Hadoop itself, which takes far less time than traditional processing systems, where most of the time is spent loading and moving the data.

As data volumes and business requirements increase, more servers can be added to the Hadoop cluster to keep processing times low.

Many big data use cases deal with data that an organization already collects but cannot process in a cost-effective way. Big data technologies deliver new ways to process that data proficiently and find the value hidden in it.

6. Hadoop for Big Data

Hadoop, an open-source framework, employs a simple programming model that allows distributed processing of massive data sets on clusters of computers. The technology incorporates shared utilities, a distributed file system (DFS), analytics and information storage platforms, plus an application layer which manages activities like workflow, distributed processing, parallel computation, and configuration management. Besides offering high availability of data, Hadoop is more cost-effective for handling massive, complex, or heterogeneous data sets than traditional approaches, and it offers massive scalability and speed.

6.1 Working of Hadoop Distributed File System (HDFS)

The idea of Hadoop is to make use of a distributed file system (DFS) for processing the data. HDFS splits each file into blocks and allocates these blocks to the nodes of the Hadoop cluster. Input data is written to HDFS in a write-once fashion and processed by MapReduce, and the results are written back to HDFS. The data in HDFS is safeguarded by a replication mechanism among the nodes, which provides reliability and availability regardless of node failures. There are two types of HDFS nodes: the DataNode and the NameNode.

DataNode – In HDFS, it stores the data blocks of the files.

NameNode – It holds the metadata, including the record of blocks and a listing of the DataNodes in the cluster.

Two additional node types are the JobTracker, which schedules MapReduce jobs, and the Secondary NameNode.
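A minimal sketch of the write-once interaction with HDFS described above is given below, wrapped in Python for consistency with the other examples. The paths are illustrative assumptions; the hdfs dfs sub-commands (-mkdir, -put, -ls, -cat) are the standard HDFS shell operations, and a running cluster with a NameNode and DataNodes is assumed:

```python
# Sketch: load a local file into HDFS and read it back using the standard
# "hdfs dfs" shell commands. Paths are illustrative; a running HDFS cluster
# (NameNode + DataNodes) is assumed.
import subprocess


def hdfs(*args: str) -> None:
    """Run a single 'hdfs dfs' sub-command and fail loudly on errors."""
    subprocess.run(["hdfs", "dfs", *args], check=True)


hdfs("-mkdir", "-p", "/user/analytics/logs")                # create a directory in HDFS
hdfs("-put", "-f", "access.log", "/user/analytics/logs/")   # write-once upload
hdfs("-ls", "/user/analytics/logs")                         # list files tracked by the NameNode
hdfs("-cat", "/user/analytics/logs/access.log")             # stream the blocks back from DataNodes
```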

A. Hadoop – MapReduce

Hadoop MapReduce provides a mechanism for programmers to process data sets on a distributed system. It is divided into two separate phases: the Map phase and the Reduce phase.

• Map Phase: It divides the workload into smaller sub-workloads and assigns tasks to Mappers, each of which handles one block of data. The Mapper output is a sorted list of (key, value) pairs, which is passed (shuffled) to the next phase (Reduce).

• Reduce Phase: It examines and combines its input to yield the final output, which is written back to HDFS. A minimal sketch of both phases is shown below.
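To make the two phases concrete, here is a minimal word-count sketch written for Hadoop Streaming (using Python scripts with Streaming is an assumption made for illustration; native MapReduce jobs are usually written in Java). The mapper emits one (word, 1) pair per word, and the reducer sums the counts for each key, relying on the framework to sort and shuffle the pairs between the two phases.

```python
# mapper.py -- Map phase sketch for Hadoop Streaming (illustrative).
# Reads raw text lines from stdin and emits one tab-separated
# (key, value) pair per word: "<word>\t1".
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Reduce phase sketch for Hadoop Streaming (illustrative).
# Receives the shuffled pairs sorted by key, so all counts for one word
# arrive consecutively; it sums them and emits the final (word, total).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```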

The steps of MapReduce when processing a job are:

Input Step: Load the data into the Hadoop Distributed File System, split it into blocks, and distribute the blocks to the DataNodes of the cluster. The blocks are replicated for availability in case of failures. The NameNode keeps track of the blocks and the DataNodes.

Job Step: The JobTracker keeps track of the jobs submitted to MapReduce and their details.

First Step of the Job: The JobTracker interacts with the TaskTracker on every DataNode to schedule the MapReduce tasks.

Map Step: The Mapper processes the data blocks and produces a list of (key, value) pairs.

Sort Step: The Mapper sorts the list of (key, value) pairs.

Shuffle Step: The mapped output is handed over to the Reducers in a sorted fashion.

Reduce Step: The Reducers combine the lists of pairs to produce the final output.

Finally, the results are stored in HDFS and replicated according to the configured replication factor, and clients then read the results from HDFS.
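Tying the steps together, a job using the mapper and reducer sketched earlier could be submitted and its output read back roughly as follows. The streaming jar location and HDFS paths are assumptions; -files, -mapper, -reducer, -input, and -output are the standard Hadoop Streaming arguments, and the output directory must not already exist in HDFS.

```python
# Sketch: submit the word-count job via Hadoop Streaming and read the result.
# The streaming jar path, HDFS paths, and input data are illustrative
# assumptions; mapper.py / reducer.py are the scripts sketched above.
import subprocess

STREAMING_JAR = "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar"  # assumption

subprocess.run([
    "hadoop", "jar", STREAMING_JAR,
    "-files", "mapper.py,reducer.py",           # ship the scripts to every node
    "-mapper", "python3 mapper.py",
    "-reducer", "python3 reducer.py",
    "-input", "/user/analytics/logs/access.log",
    "-output", "/user/analytics/wordcount",     # must not already exist in HDFS
], check=True)

# Read the final, replicated output back from HDFS through the client.
subprocess.run(["hdfs", "dfs", "-cat", "/user/analytics/wordcount/part-*"], check=True)
```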

7. Conclusion

This paper has described big data from all possible aspects. We have also discussed the challenges of big data. Big data refers to huge data volumes that take a lot of time to process; therefore, this paper has also focused on a distributed approach for retrieving results in close to real time. The big data technique (Hadoop) makes use of Google's MapReduce technology to process structured, semi-structured, and unstructured data sets. Experimental work was demonstrated on a Hadoop cluster by extracting meaningful data from an access log file, which is basically a text file that takes a lot of time to understand and process, as the data generated in the log file grows at a faster rate than traditional systems can process it. Future work will focus on the challenges of big data that still need to be tackled.

[1] Sam Madden, "From Databases to Big Data", Massachusetts Institute of Technology, IEEE Internet Computing, www.computer.org/internet/

[2] Sachchidanand Singh, Nirmala Singh, "Big Data Analytics", 2012 International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, Mumbai, India

[3] Martín Díaz, Gonzalo Juan, "Big Data on the Internet of Things", 2012 Sixth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, 978-0-7695-4684-1/12, 2012 IEEE

[4] Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada, Keqiu Li, "Big Data Processing in Cloud Computing Environments", 2012 International Symposium on Pervasive Systems, Algorithms and Networks

[5] "Big Data: Science in the Petabyte Era", Nature 455 (7209): 1, 2008.

[6] Douglas and Laney, "The Importance of Big Data: A Definition", 2008.

[7] X. Zhou, J. Lu, C. Li, and X. Du, "Big Data Challenge in the Management Perspective", Communications of the CCF, vol. 8, pp. 16–20, 2012.

[8] R. E. Bryant, R. H. Katz, and E. D. Lazowska, "Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society", Dec. 2008

[9] "Challenges and Opportunities with Big Data", http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf

[10] "Drowning in numbers: Digital Data will Flood the Planet and Help Us Understand it Better", 2011.

[11] "Federal Government Big Data Rollout", 2012.

[12] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, "Big Data: The Next Frontier For Innovation, Competition, And Productivity", May 2011.

[13] S. Lohr, "The Age of Big Data", Feb. 2012.


