What Is Big Data


Every minute, Google receives over 2,000,000 search queries, users send over 100,000 tweets, Facebook users share 684,478 pieces of content and email users send more than 204,166,667 messages (VISUALNEWS.COM, 2012). However, this is only a fraction of the data internet users generate every minute of the day. There are many more social media platforms, web pages, mobile applications and sensors, to name a few, that produce enormous amounts of data. With the increasing number of internet users all over the world, it is also certain that the amount of data generated every minute will only increase. Data is not generated only through the internet; it is everywhere, and it is estimated that by 2015 we will be producing 5 zettabytes of data, where a zettabyte is about a trillion gigabytes (siliconangle.com, 2012). However, just how much data is considered big data? "Big" is a subjective word and different individuals have different interpretations. The most common classification of big data is by the 3 Vs: Volume, Velocity and Variety; recently a fourth V, Veracity, has been introduced. Data that is high in volume is sometimes not considered big data if it is only relational database data; the classification also takes into account the variety, velocity and veracity of the data (Big Data Puzzle).

Big Data Use Cases

Big data has been used in many sectors and has helped in understanding those sectors better, and with that a lot has been achieved. For example, customers who used IBM's big data solutions in the healthcare sector achieved a reduction in patient mortality of 20 percent; in the telecommunications sector, a great success was achieved when processing time was cut by 92 percent; and an even bigger outcome in the utilities sector was accomplished when the placing of energy generation resources improved in accuracy by 99 percent (IBM.com, 2013).

Fraud Detection

In the book Understanding Big Data, produced by IBM in 2012, one of the use cases mentioned is how big data is used to improve fraud detection. Fraud is most closely associated with the financial sector, but it can also be found in other services such as insurance and online transactions. Conventional systems have problems storing the data and processing it into information that can be used for fraud detection, and less than one fifth of the available data is actually used, because the cost of handling all of it is too high. IBM therefore introduced a new platform called BigInsights, which can analyse huge amounts of data at a lower cost and in a shorter time. A collaboration with a large credit card company proved the effectiveness of the system. What is more impressive is that the system can now predict fraud even before it happens (C. Eaton, 2012).

IBM Data Baby

The second example, IBM Data Baby, is one of the uses of big data in the health sector. The project's main objective is to help the health sector discover a better way to foresee the peculiar hospital-borne diseases that affect infants (Paul, Z., et al., 2013). Previously, the old-fashioned system monitored data only at certain intervals, hourly or half-hourly, and after three days the data was discarded. However, by taking advantage of IBM's new Streams technology, the University of Ontario Institute of Technology, together with the Hospital for Sick Children in Toronto, managed to produce an in-motion analytics platform from the analysis of thousands of readings gathered from each infant. The data gathered, which includes the baby's heart beat, respiration, oxygen level and much more, is then visualised with the help of The Ogilvy team, who transferred the data into a graphic simulation. This has helped doctors find new ways to detect life-threatening infections up to 24 hours sooner.
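To give a flavour of what in-motion analytics over vital-sign readings can look like, here is a minimal sketch in Python. It is not the actual IBM Streams platform; the window size, threshold and field names are illustrative assumptions.

```python
from statistics import mean

# Generic sketch of in-motion analytics on a stream of vital-sign readings.
# Not the IBM Streams platform; the threshold and window size are assumptions.
WINDOW = 3              # number of most recent readings kept per baby
HEART_RATE_LIMIT = 180  # assumed alert threshold, beats per minute

def monitor(readings):
    """Consume (baby_id, heart_rate) readings and yield alerts as they arrive,
    instead of storing hours of data and analysing it afterwards."""
    recent = {}
    for baby_id, heart_rate in readings:
        window = recent.setdefault(baby_id, [])
        window.append(heart_rate)
        if len(window) > WINDOW:
            window.pop(0)
        if mean(window) > HEART_RATE_LIMIT:
            yield (baby_id, "sustained elevated heart rate")

stream = [("baby-1", 150), ("baby-1", 185), ("baby-1", 192), ("baby-1", 199)]
print(list(monitor(stream)))  # -> [('baby-1', 'sustained elevated heart rate')]
```

The point of the sketch is the shape of the computation: each reading is examined as it arrives, so an alert can be raised immediately rather than after a batch of historical data has been stored and reprocessed.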

Another interesting use case of big data is in the government sector. The Defense Advanced Research Projects Agency (DARPA), an agency of the United States Department of Defense, has been engaging with big data to help it find solutions for warfighters. Nowadays, the dependency on data provided by virtual net sensors and communication systems is very high. Every decision made needs to be based on the obtained data to optimise the efficacy of war missions and strengthen the country's security. However, the systems currently adopted are not efficient at handling such large amounts of data.

"The sheer volume of information creates a background clutter…," said DARPA Acting Director, Kaigham J. Gabriel. "Let me put this in some context. The Atlantic Ocean is roughly 350 million cubic kilometers in volume, or nearly 100 billion, billon gallons of water. If each gallon of water represented a byte or character, the Atlantic Ocean would be able to store, just barely, all the data generated by the world in 2010. Looking for a specific message or page in a document would be the equivalent of searching the Atlantic Ocean for a single 55-gallon drum barrel.

The Department of Defense is taking a new approach to handling this massive amount of data by introducing the XDATA program. The program's aim is to produce new tools and techniques for analysing the data and thus help provide reliable information to warfighters.

Big Data Challenges

Big data is usually described as data that has these four characteristics: Volume, Velocity, Variety and the newly introduced Veracity, better known as the 4 Vs. The challenges in big data are closely connected to these four characteristics.

Volume and Velocity

Researchers and companies are having problems storing the data. Finding the most suitable way to store large volumes of data that are generated at different rates and in different formats is not an easy task. Sometimes the amount of information generated exceeds the available database storage. Take, for example, big data in the human genome process, of which it is said:

"For a single sequence of one individual human, that sequence operation consumes about two terabytes of data. But that’s just to start. A data scientist from a life sciences company told me that once you begin doing additional processing and analysis on that data, that two terabytes can quickly turn into seven terabytes."(FERGUSON, R.B., 2012).

In the case of the human genome project, the great volume of information comes not only from the sequencing operation itself, but also from the duration of storage: researchers want to keep the data for a long period of time so that better research can be conducted on its historical features.

To understand the challenges in handling data storage, especially at large scale, several aspects need to be considered: capacity, the number and size of files, shared volumes and backup facilities. The problem with capacity is that it is sometimes hard to estimate the volume needed to store the data; for example, in the genome project the initial data gathered is only two terabytes, but once it is analysed it multiplies into seven terabytes. As for the number and size of files, especially for unstructured data such as user-generated content (newsfeeds from social media and customer feedback, to name a few), it is difficult to determine the size of each file. Metadata is the part of the data system that describes the data's features, and it is usually determined by the number and size of files (Parkinson, J.). Metadata is important in making the data easy to access and retrieve, so that a faster and more dynamic database system can be implemented.
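As a rough illustration of why capacity is hard to estimate, the sketch below reuses the two-to-seven-terabyte growth quoted above; the cohort size and replication factor are assumptions made purely for the example.

```python
# Back-of-the-envelope storage estimate for a sequencing project.
# The 2 TB -> 7 TB growth comes from the quote above; the number of genomes
# and the replication factor are illustrative assumptions.
raw_tb_per_genome = 2        # raw sequence data
analysed_tb_per_genome = 7   # after additional processing and analysis
genomes = 1_000              # assumed cohort size
replicas = 3                 # assumed copies kept for backup and long-term research

total_tb = genomes * analysed_tb_per_genome * replicas
print(f"Raw data alone: {genomes * raw_tb_per_genome:,} TB")
print(f"After analysis and replication: {total_tb:,} TB (~{total_tb / 1024:.1f} PB)")
```

Even with modest assumptions, the planned capacity ends up several times larger than the raw data volume, which is exactly the estimation problem described above.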

Apart from the storage problem, there is the problem of transferring the data, whether raw or already analysed. For example, the biggest genomic institution, BGI, faces a problem in transferring data to its clients. The conventional method of using the internet as the transfer medium is no longer viable because the service cannot handle the volume and velocity of the data, so in the end they opted for a more manual approach, in which the storage discs are sent to clients via FedEx. Data generated by machines, for example mobile devices, sensors and computers, usually arrives very fast and at high velocity: large volumes of new or updated data enter the database system, and fast, reliable results need to be produced on those entries. This challenges the data scientists handling the transmission of the raw as well as the analysed data; they need an exceptionally fast and scalable stack layer and query processing layer.
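A rough calculation shows why shipping discs can beat the network; the link speed and courier time below are illustrative assumptions, not BGI's actual figures.

```python
# Back-of-the-envelope comparison: network transfer vs. shipping discs.
# All figures are illustrative assumptions, not measured values.
data_tb = 7                      # terabytes to deliver (the genome example above)
data_bits = data_tb * 8e12       # 1 TB ~= 10^12 bytes, 8 bits per byte

link_mbps = 100                  # assumed sustained upload speed, megabits per second
transfer_seconds = data_bits / (link_mbps * 1e6)
transfer_days = transfer_seconds / 86_400

shipping_days = 2                # assumed courier delivery time

print(f"Network transfer: ~{transfer_days:.1f} days at {link_mbps} Mbit/s")
print(f"Courier shipment: ~{shipping_days} days regardless of volume")
```

At these assumed figures the network takes roughly six to seven days for a single delivery, and the gap only widens as the data volume grows.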

Variety

In reality, data is generated from multiple sources, which causes it to arrive in a variety of formats and models. Data can be categorised into structured, semi-structured and unstructured. Structured data is the traditional relational data; semi-structured data is usually obtained from machines, sensors or web logs; and unstructured data is user-generated content from social media and web pages, to name a few. Each of these data types has its own analysis platforms and systems; however, data scientists face the challenge of integrating all of the data types into one platform for analysis. Unstructured data is even more challenging: it is almost impossible to process the raw data, so it needs to be transformed into something that can be understood. This process usually causes the data to lose some of its information, and an effective solution is needed.
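As a small illustration of that transformation step, the sketch below turns an unstructured web-log line into a structured record. It is a generic example; the log format and field names are assumptions, not tied to any particular product.

```python
import re

# Turn an unstructured web-log line into a structured record that could sit
# alongside relational data. The log format and field names are assumptions.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log_line(line):
    """Return a dict of named fields for a matching line, or None otherwise."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = '203.0.113.7 - - [02/Nov/2017:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5321'
print(parse_log_line(sample))
# Lines that do not match simply return None -- a small example of the
# information loss the transformation step can introduce.
```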

Veracity

Veracity, or data in doubt, is still a new term associated with big data, but it is something that cannot be overlooked when describing it. There are many reasons why data is classified as data in doubt, among them uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception and model approximations. With 80 percent of all data being unstructured, this problem keeps increasing. In addition, it is said that:

"1 in 3 business leaders don’t trust the information they use to make decisions" IBM.com

This makes it a challenge to put trust in the data as the number of sources and the variety of data increase. A reliable analysis of the available data is important; however, the visualisation of the results of that analysis is also crucial, since data that can be presented clearly is easier to interpret and trust.

Technology

In targeting the big data challenges, many approaches have been taken and new technologies have been implemented. For example, specialised databases and data warehouse appliances are some of the technologies introduced to tackle these challenges.

NoSQL

NoSQL is very handy when dealing with big data because it can be scaled out. Previously, most database administrators depended on scaling up the database storage, for example by getting a larger server. With NoSQL, however, they have the option to scale out the database thanks to its elastic scaling.

SQL, by contrast, is a language developed for managing relational database management systems. It is very good at handling relational data, but big data is often so large and varied that it cannot be handled by traditional databases and software alone.
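To make the scale-out idea concrete, here is a minimal sketch of spreading records across several nodes by hashing their keys. The node names are made up, and real NoSQL stores typically use consistent hashing rather than this simple modulo scheme, but the principle of adding capacity by adding machines is the same.

```python
import hashlib

# Minimal sketch of scale-out: records are spread across cheap nodes by
# hashing their key, so capacity grows by adding machines. Node names are
# illustrative; real stores usually use consistent hashing instead of modulo.
NODES = ["node-a", "node-b", "node-c"]

def node_for_key(key):
    """Pick the node responsible for a key via simple modulo sharding."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for user_id in ("user42", "user43", "user44"):
    print(user_id, "->", node_for_key(user_id))
```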

MapReduce and Hadoop

One of the popular approaches to handling non-relational data in big data is MapReduce, created by Google. MapReduce is actually a combination of two methods, map and reduce. In the map step, the input data at the master node is divided among smaller nodes; the input sometimes undergoes further subdivision until it reaches the worker nodes, which then compute the problem on their smaller pieces of data. In the reduce step, the processed data is sent back up through the preceding nodes until it reaches the master node, which combines all the partial results into the solution for the original input (http://readwrite.com/2011/08/12/from-big-data-to-nosql-the-rea-2).
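The classic illustration of this model is word counting. The sketch below runs both phases on a single machine purely to show the shape of the computation; a real MapReduce framework such as Hadoop would execute the map and reduce phases in parallel across many worker nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one slice of the input."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: combine the pairs emitted by all mappers into final counts."""
    counts = defaultdict(int)
    for word, value in pairs:
        counts[word] += value
    return dict(counts)

documents = ["big data is big", "data about data"]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)  # map step
print(reduce_phase(mapped))  # reduce step -> {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```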

Hadoop is one of the most well-known pieces of software used to handle big data. It is an open source project of The Apache Software Foundation and is a framework written in Java. It implements Google's MapReduce and the Google File System technology as its foundation. It handles massive amounts of data through parallelism: the data is sliced up and then quickly processed by multiple servers. It can handle data of any variety, whether structured, unstructured or semi-structured. It is known for its performance and reliability: it replicates its data across different computers, so if one goes down the work is passed to a computer holding a replica. Although Hadoop is very good for big data, it cannot be used for online transaction processing workloads, where structured data such as a relational database is accessed randomly, and it is also not meant for online analytical processing or decision support system workloads, where data is accessed sequentially to generate reports that provide business intelligence. Apart from that, it processes data in batches, so the response time is not immediate.

Complex Event Processing

Another technology introduced to handle big data is Complex Event Processing. There are two key features of this technology, the first being that it enables real-time processing with limited resources, according to NEC Corporation, which developed the technology.
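The essence of complex event processing is matching patterns over a stream of events as they arrive. The sketch below is a generic illustration, not NEC's product: it raises an alert when an assumed threshold of failed logins falls inside an assumed time window.

```python
from collections import deque

# Generic sketch of complex event processing: watch a stream of events and
# fire when a pattern appears within a time window. The event format,
# threshold and window length are illustrative assumptions.
WINDOW_SECONDS = 60
THRESHOLD = 3

def detect_bursts(events):
    """Yield an alert whenever THRESHOLD 'login_failed' events for the same
    user fall inside WINDOW_SECONDS. Each event is (timestamp, user, kind)."""
    recent = {}  # user -> deque of timestamps of failed logins
    for ts, user, kind in events:
        if kind != "login_failed":
            continue
        window = recent.setdefault(user, deque())
        window.append(ts)
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= THRESHOLD:
            yield (ts, user, "possible intrusion attempt")

stream = [(0, "alice", "login_failed"), (20, "alice", "login_failed"),
          (45, "alice", "login_failed"), (50, "bob", "login_ok")]
print(list(detect_bursts(stream)))  # -> [(45, 'alice', 'possible intrusion attempt')]
```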

Adaptive Locality-Aware Data Reallocation

Recently, Fujitsu Laboratories has developed a new technology called "Adaptive Locality-Aware Data Reallocation" that can handle data of high volume and high velocity (Fujitsu Technology Puts Big Data to Use in Minutes, JCN Newswire - Japan Corporate News Network, 5 April 2012). The method improves the efficiency of disc access by roughly 90 percent over the previous level; this is done by rearranging the data in storage to match the patterns in which it is accessed. The technology has reduced the time needed for data analysis from hours to minutes. The optimal reallocation of data is achieved in a few steps: first, the history of data accesses is recorded; then the data is grouped based on how frequently it is accessed, to obtain the optimal allocation; and finally each group is assigned a location on disc, and the data belonging to the group is placed at that location.
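The three steps above can be sketched in a few lines. This is a toy illustration of the idea, not Fujitsu's actual algorithm; the block names and access log are made up.

```python
from collections import Counter

# Toy illustration of locality-aware reallocation (not Fujitsu's algorithm):
# record which blocks are accessed, order blocks by access frequency, and
# assign the hottest blocks to adjacent disc locations.
access_log = ["b7", "b1", "b7", "b3", "b7", "b1", "b9", "b1", "b7"]

frequency = Counter(access_log)                          # step 1: access history
hot_first = [blk for blk, _ in frequency.most_common()]  # step 2: group by frequency

# step 3: lay the most frequently accessed blocks out in adjacent locations
layout = {block: location for location, block in enumerate(hot_first)}
print(layout)  # e.g. {'b7': 0, 'b1': 1, 'b3': 2, 'b9': 3}
```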

Around 80 percent of data is unstructured, and it is essential to be able to grasp this kind of data so that no information is lost.

Cloud Computing

One possible solution to the big data challenge of sharing data through a single data store is cloud computing. Cloud computing may be beneficial in reducing the total amount of storage needed to keep data, as many users can access the same information from a single source. However, looking at current technology and the rate at which data is growing, its effectiveness is still questionable. The main concern is the cloud computing bottleneck: the bandwidth of the data transfer from the client to the provider and vice versa. In the genome research case mentioned before, a single operation needs to transfer around two terabytes of data to the cloud; imagine doing that hundreds or thousands of times. Apart from that, it is also vital to be able to access the data as quickly as possible, and transferring the data into the cloud and accessing it back and forth may not be a workable system (Ferguson, R.B., 2012).

Conclusion

Big data has become one of the major influences changing the way sectors, companies and even individuals perceive information. However, the process of extracting useful information, and determining which information is useful, is not as easy as it seems. Companies such as IBM, Oracle and EMC are among the few pioneering big data analysis services. Challenges such as the volume, velocity, variety and veracity of big data need to be tackled in order to obtain useful information. Some of the technologies available today, such as Apache Hadoop, Complex Event Processing and Fujitsu's Adaptive Locality-Aware Data Reallocation, are not enough on their own and can still be improved to achieve better results from big data. Meanwhile, other issues such as the security and confidentiality of data have arisen: big data includes data generated from social media, for example, which sometimes contains confidential information. A system that can store and handle the volume, velocity, variety and veracity of big data efficiently, without crossing the boundary of users' confidential information, is what is desired.


