Big Data Use Cases


Every minute there are over 2,000,000 search queries, over 100,000 tweets are sent, Facebook users share 684,478 pieces of content, and email users send more than 204,166,667 messages (Visualnews, 2012). Yet this is only a fragment of the data Internet users generate every minute of the day. Many other sources, including social media platforms, web pages, mobile applications and sensors, also produce enormous amounts of data. With the growing number of Internet users around the world, the amount of data generated every minute will only increase. Since data is generated not only on the Internet but virtually everywhere, it is estimated that by 2015 we will be producing 5 zettabytes of data, where a zettabyte is equivalent to a trillion gigabytes (Siliconangle, 2012). This raises the question of just how much data counts as Big Data. The word "Big" is itself subjective, and different individuals interpret it differently. The most common classification of Big Data is by the three Vs, Volume, Velocity and Variety, with a fourth V, Veracity, introduced more recently. Data that is high in volume is sometimes not considered Big Data if it is only relational data, since the classification also takes into account the variety, velocity and veracity of the data (Courtney, M., 2013).

Big Data Use Cases

Big Data has been used in many sectors because it helps them understand their data far better, and with that better understanding a great deal has been achieved. For example, customers who used IBM's Big Data solutions in the healthcare sector reduced patient mortality by 20 percent; in the telecommunications sector, processing time was cut by 92 percent; and in the utilities sector, the accuracy of placing energy generation resources improved by 99 percent (IBM, 2013).

Fraud Detection

In the book Understanding Big Data, published by IBM in 2012, one of the use cases mentioned is the use of Big Data to improve fraud detection. Fraud is most often associated with the financial sector, although it also exists in other services such as insurance and online transactions. Conventional systems struggled to store the data and to process it into information usable for fraud detection; less than one fifth of the available data was actually used, because of the high cost of handling all of it. IBM therefore introduced a new platform called BigInsights, whose function is to analyse huge amounts of data at lower cost and in less time. A collaboration with a large credit card company demonstrated the effectiveness of the system, which can now even predict fraud before it happens (Eaton, C., et al. 2012).

IBM Data Baby

Another example is the IBM Data Baby project, a use of Big Data in the health sector. The main objective of this project is to help the health sector find a better way to foresee the peculiar hospital-borne infections that affect infants (Paul, Z., et al. 2013). The traditional system only recorded monitoring data at limited intervals, hourly or half-hourly, and discarded it after three days. By utilizing IBM's new Streams technology, however, a collaboration between the University of Ontario Institute of Technology and the Hospital for Sick Children in Toronto produced an in-motion analytics platform that analyses thousands of readings gathered from each infant. The data gathered, including the baby's heartbeat, respiratory rate, oxygen level and more, is then visualized with the help of the Ogilvy team, who turn it all into a graphic simulation (Ogilvy, 2010). This innovation helped doctors find new ways of detecting life-threatening infections up to 24 hours sooner (Bruno Alluci 2010).
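The published accounts do not include the underlying analytics, but the general idea of in-motion monitoring can be sketched as follows. The window size, threshold and alert condition here are assumptions made purely for illustration; they are not the hospital's actual model or the IBM Streams API.

```python
from collections import deque
from statistics import pstdev

# Illustrative only: window size and threshold are hypothetical, not the
# clinical rules used in the Data Baby project.
WINDOW_SIZE = 60              # number of most recent readings kept "in motion"
VARIABILITY_THRESHOLD = 2.0   # assumed alert level, in beats per minute

def monitor(readings):
    """Consume a stream of heart-rate readings and yield alerts on the fly,
    without ever storing the full history."""
    window = deque(maxlen=WINDOW_SIZE)
    for beat in readings:
        window.append(beat)
        if len(window) == WINDOW_SIZE and pstdev(window) < VARIABILITY_THRESHOLD:
            yield f"ALERT: unusually low heart-rate variability ({pstdev(window):.2f} bpm)"

# Example: a stream that becomes suspiciously steady part-way through.
stream = [120 + 5 * (i % 2) for i in range(80)] + [118] * 80
for alert in monitor(stream):
    print(alert)
    break
```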

Military

Another interesting use case of Big Data is in the government sector. The Defense Advanced Research Projects Agency (DARPA), an agency of the United States Department of Defense, has engaged with Big Data to help find solutions for war fighters. The dependency on data from networked sensors and communication systems is now very high: every decision needs to be based on the data obtained, to optimize the efficacy of military missions and to protect the country's security. However, the currently adopted systems are not efficient at handling such large amounts of data.

"The sheer volume of information creates a background clutter…," said DARPA Acting Director, Kaigham J. Gabriel. "Let me put this in some context. The Atlantic Ocean is roughly 350 million cubic kilometres in volume, or nearly 100 billion, billon gallons of water. If each gallon of water represented a byte or character, the Atlantic Ocean would be able to store, just barely, all the data generated by the world in 2010. Looking for a specific message or page in a document would be the equivalent of searching the Atlantic Ocean for a single 55-gallon drum barrel". (Darpa 2012).

The Department of Defense is taking a new approach to handling this massive amount of data by introducing the XDATA program. The program's aim is to produce new tools and techniques for analyzing the data, so as to provide more reliable information to war fighters.

Big Data Challenges

Big Data is usually described as data with four characteristics: Volume, Velocity, Variety and the newly introduced Veracity, better known as the four Vs. The challenges in Big Data are closely connected to these four characteristics.

Volume and Velocity

Researchers and companies have problems storing data because of its volume and velocity. Finding suitable ways to store large volumes of data that are generated at different rates and in different forms is not an easy task, and sometimes the amount of information generated exceeds the limits of the available database storage. Take, for example, Big Data in human genome sequencing; it is said that:

"For a single sequence of one individual human, that sequence operation consumes about two terabytes of data. But that’s just to start. A data scientist from a life sciences company told me that once you begin doing additional processing and analysis on that data, that two terabytes can quickly turn into seven terabytes."(Ferguson, R.B., 2012).

In the case of the human genome project, the great volume of information is generated not only by the sequencing and analysis itself but also by the duration of storage: researchers want to keep the data for long periods so that better research can be conducted on its historical features.

To understand the challenges of handling data storage, especially at large scale, several aspects need to be considered: capacity, the number and size of files, shared volumes and backup facilities (Reinert, J., 2013). The problem with capacity is that it is often hard to estimate the volume needed to store the data; in the genome example, the initial data gathered is only two terabytes, but once analysed it grows to seven terabytes. For the number and size of files, especially with unstructured data such as user-generated content, social media newsfeeds and customer feedback, it is difficult to determine the size of each file. Metadata is the part of the storage system that describes the data's features, and it is usually determined by the number and size of the files (Parkinson, J.). Metadata is important for making data easily accessible and retrievable, so that a faster and more dynamic database system can be implemented.
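As a back-of-envelope illustration of how quickly the required capacity grows, the sketch below multiplies the raw size by an expansion factor taken from the genome example above and by an assumed replication factor; both numbers are for illustration only.

```python
# Rough capacity estimate. The 3.5x expansion factor comes from the
# 2 TB -> 7 TB growth described above; the replication factor of 3 is an
# assumption for illustration (keeping several copies is common practice
# in distributed storage).
raw_tb = 2.0
expansion_factor = 7.0 / 2.0    # growth once analysis begins
replication_factor = 3          # assumed copies kept across machines

required_tb = raw_tb * expansion_factor * replication_factor
print(f"Storage to plan for: {required_tb:.0f} TB per sequenced individual")
# -> Storage to plan for: 21 TB per sequenced individual
```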

Apart from the storage problem, there is the problem of transferring the data, whether raw or analysed. For example, BGI, the largest genomics institute, faces a problem transferring data to its clients. The conventional method of using the Internet as the transfer medium is no longer workable because the service cannot handle the large volume and high velocity of the data, so in the end they opted for a more manual approach in which computer discs holding the data are sent to clients via FedEx (NYTimes, 2011). Data generated by devices such as mobile phones, sensors and computers usually arrives very quickly and at high velocity. Large volumes of new or updated data enter the database system, and fast, reliable results need to be produced on those entries. This challenges data scientists to handle the transmission of both the raw and the analysed data; they need an exceptionally fast and scalable storage stack and query processing layer.
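A rough calculation shows why shipping discs can beat the network; the 100 Mbit/s link speed below is an assumed figure used only for illustration.

```python
# How long would it take to move one 2 TB genome data set over the network?
# The 100 Mbit/s link speed is an assumption for the sake of the example.
data_bytes = 2 * 10**12            # 2 TB, as in the genome example above
link_bits_per_second = 100 * 10**6

seconds = data_bytes * 8 / link_bits_per_second
print(f"{seconds / 3600:.1f} hours ({seconds / 86400:.1f} days) per data set")
# -> 44.4 hours (1.9 days) per data set, before retries or congestion
```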

Variety

In reality, data is generated from multiple sources, which causes it to come in a variety of formats and models (Chen, J. et al, 2013). Data can be categorised as structured, semi-structured or unstructured. Structured data is traditional relational data, typically obtained from machines, sensors or web logs, while unstructured data is user-generated content from sources such as social media and web pages. Each of these data types has its own analysis platform and system; the challenge for data scientists is to integrate all of them into a single platform for analysis. Unstructured data is even more challenging, since the raw data is almost impossible to process directly; it must first be transformed into something understandable, but this transformation usually causes the data to lose information, which underlines the need for an effective solution, as the sketch below illustrates.
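A minimal sketch of that transformation step follows; the free-text feedback line, the field layout and the extraction patterns are all invented for the example.

```python
import re

# Hypothetical raw feedback line; the fields extracted below are illustrative only.
raw = "2013-05-02 @anna_k: The new checkout is SO slow, waited 4 minutes!!! #fail"

record = {
    "date": re.search(r"\d{4}-\d{2}-\d{2}", raw).group(),
    "user": re.search(r"@(\w+)", raw).group(1),
    "hashtags": re.findall(r"#(\w+)", raw),
    "sentiment": "negative" if re.search(r"\b(slow|fail|bad)\b", raw, re.I) else "unknown",
}
print(record)
# The tone, emphasis and exact wording of the complaint are discarded in this
# structured view -- precisely the information loss described above.
```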

Veracity

Although Veracity, or 'data in doubt', is still a relatively new term associated with Big Data, it cannot be overlooked when describing it. There are many reasons why data may be classified as data in doubt, for example uncertainty due to inconsistency and incompleteness, ambiguity, latency, deception and model approximation (Walker, M., 2012). Since around 80 percent of all data is unstructured, this problem keeps growing. In addition, it is said that:

"1 in 3 business leaders don’t trust the information they use to make decisions" (IBM, 2013)

This makes it harder to trust the data as the number of sources and the variety of data increase. Reliable analysis of the available data is certainly important, but visualization of the results of that analysis is also crucial. An article in Future magazine stated that:

"Collecting, storing and managing Big Data are increasingly necessary challenges, but these are only requisite steps to addressing the heart of the issue. Truly valuable Big Data solutions must allow for robust visualization of data that enables constructive analysis, whether it is in the analysis that powers trading decisions or the tools chat provide a clear picture of risk." (WAN, M., 2012)

Technology

Many approaches have been taken and new technologies implemented to overcome the Big Data challenges. Specialised databases and data warehouse appliances, for example, are among the technologies introduced to tackle them.

NoSQL, MapReduce and Hadoop

NoSQL is very handy when dealing with Big Data because it can be scaled out. Previously, most database administrators depended on scaling up the database storage, for example by buying a larger server; with NoSQL they have the option of scaling out, thanks to its elastic scaling. NoSQL is also common in Big Data work because of unstructured data: it can store non-conventional data that cannot be held in the traditional arrangement of columns and rows. MySQL, Greenplum and Microsoft SQL Server are among the relational database management systems that provide storage for conventional data (Henschen, D., 2011).
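The contrast can be sketched with a toy document store, using plain Python dictionaries to stand in for a real NoSQL product; this shows the schema-less idea only and is not the API of any particular database.

```python
# In a relational table every row must fit one fixed set of columns.
# A document store lets records in the same collection carry different fields,
# which suits user-generated and sensor data. Toy illustration only.
collection = []

collection.append({"id": 1, "type": "tweet", "user": "anna_k",
                   "text": "checkout is slow", "hashtags": ["fail"]})
collection.append({"id": 2, "type": "sensor", "device": "pump-7",
                   "temperature_c": 81.4, "readings_per_min": 600})

# Queries work over whatever fields happen to be present.
hot_devices = [doc for doc in collection if doc.get("temperature_c", 0) > 80]
print(hot_devices)
```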

One of the popular approaches to handling non-relational data in Big Data is MapReduce, which was created by Google. MapReduce is a combination of two steps, map and reduce. In the map step, the input data held by the master node is divided among smaller worker nodes, sometimes being subdivided further until it reaches the processing nodes, each of which computes its portion of the problem. In the reduce step, the processed results are passed back up through the preceding nodes until they reach the master node, which combines all the partial results into the solution for the original input (Finley, K., 2011).
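The classic word-count example makes the two steps concrete. The sketch below runs in a single process purely to show the shape of the computation; it is not the distributed Google or Hadoop implementation.

```python
from collections import defaultdict
from itertools import chain

documents = ["big data is big", "data in motion", "big volume big velocity"]

# Map step: each "worker" turns its slice of the input into (key, value) pairs.
def map_words(doc):
    return [(word, 1) for word in doc.split()]

mapped = chain.from_iterable(map_words(doc) for doc in documents)

# Shuffle: group the pairs by key, as the framework does between the phases.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce step: combine each group's values into the final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'big': 4, 'data': 2, 'is': 1, ...}
```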

Hadoop, one of the best-known pieces of software used to handle Big Data, is associated with NoSQL and MapReduce. It is an open source project of The Apache Software Foundation, a framework written in Java that implements Google's MapReduce and Google File System technologies as its foundation. It handles massive amounts of data through parallelism: the data is sliced up and quickly processed by multiple servers. It can handle data in all varieties of format, whether structured, unstructured or semi-structured, and it is known for its performance and reliability because it replicates its data across different computers; if one computer breaks down, the data is processed on one of the computers holding a replica.

Although Hadoop is well suited to Big Data, it cannot be used for online transaction processing workloads, which require random access to structured data such as a relational database, nor is it meant for online analytical processing or decision support system workloads, where data is accessed sequentially, including unstructured data, to generate reports that provide business intelligence. In addition, it processes data in batches, so the response time is not immediate (ReturnftheMus, 2012). Hadoop has been adopted by all sorts of companies, for example AOL, eBay, Facebook and JPMorgan Chase, and IBM has also engaged with Hadoop in developing an analytics platform called InfoSphere BigInsights (Henschen, D., 2011).

Complex Event Processing

Complex Event Processing (CEP) is a method used to predict upcoming events based on current streams of input data. It is very useful for managing Big Data because its purpose is to process data in motion (the velocity dimension). Input from the various streams of information is processed to find correlations between them, subject to the given rules and processes (Hurwitz, J., et al, 2013). However, the method has flaws, as noted in an article by JCN Newswire: loading the system with the assigned rules, and the wasteful transmission of information between servers caused by connections and dependencies among the rules, can lower processing speed by more than 90 percent and limit the scalability of the servers (Anonymous1, 2012).
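A minimal sketch of a single CEP-style rule, evaluated over events as they arrive, is shown below; the event fields, window length and threshold are invented for the example and do not come from any particular CEP engine.

```python
from collections import deque
import time

# One illustrative rule: alert if the same card is used in more than
# MAX_EVENTS transactions within WINDOW_SECONDS. All values are hypothetical.
WINDOW_SECONDS = 60
MAX_EVENTS = 3

recent = {}   # card_id -> deque of recent event timestamps

def on_event(card_id, timestamp):
    window = recent.setdefault(card_id, deque())
    window.append(timestamp)
    # Drop events that have fallen out of the sliding time window.
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_EVENTS:
        print(f"ALERT: card {card_id} hit {len(window)} transactions within a minute")

# Feed a stream of (card, timestamp) events as they arrive.
now = time.time()
for offset in (0, 5, 12, 20, 31):
    on_event("card-42", now + offset)
```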

To address these flaws, NEC Corporation has introduced a new technology, developed from complex event processing, with better processing time and greater efficiency. It has two key features. First, it enables real-time processing with limited resources: the technology allocates interconnected and interdependent processing rules to the same server, which reduces transmission time and removes unnecessary use of computing resources. Second, it can scale the system as the data expands. The efficiency of the method is demonstrated in the article, which reports that the technology managed real-time processing of 2.7 million events per second using 16 servers, 6 of them serving as input generators and 10 as processing units running 100,000 rules, or as the article puts it:

"This is the equivalent of taking just 20 seconds to send information on 50 million users from 100,000 stores for a service that provides store and coupon information to mobile phones." (Anonymous1, 2012)
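As a rough check, 50 million users' worth of information in 20 seconds works out to 50,000,000 / 20 = 2.5 million events per second, which is broadly consistent with the 2.7 million events per second reported for the 16-server trial.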

Although the results are promising, this new technology is still in its trial period and has not yet been released for commercial use.

Adaptive Locality-Aware Data Reallocation

Recently, Fujitsu Laboratories developed a new technology called "Adaptive Locality-Aware Data Reallocation" that can handle high-volume, high-velocity data (Anonymous2 2012). The method works by cutting the amount of disk access to roughly 90 percent below the previous level, which is achieved by rearranging the data in storage to match the patterns in which it is accessed. This technology has reduced the time needed for data analysis from hours to minutes. The optimal reallocation of data is achieved in a few steps: first the access history of the data is recorded, then the data is grouped according to how frequently it is accessed in order to find the optimal allocation, and finally each group is assigned a location on disc and the data belonging to that group is placed at that location. The article also stated that:

"This technology can perform analytic processing on Big Data using incremental processing while accepting data as quickly as it arrives, allowing for rapid analytic processing of current data." (Anonymous2 2012)

This technology therefore offers fast processing times while also handling large volumes of high-velocity data.
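Fujitsu has not published the implementation details, but the frequency-based grouping step can be sketched in general terms; the access log, grouping threshold and disc layout below are invented for illustration and are not Fujitsu's algorithm.

```python
from collections import Counter

# Hypothetical access log: which data blocks were read, and how often.
access_log = ["b7", "b2", "b7", "b9", "b7", "b2", "b5", "b7", "b2", "b9"]

frequency = Counter(access_log)

# Group blocks into "hot" and "cold" based on access frequency, then lay the
# hot group out contiguously at the front of the (simulated) disc so that
# frequently accessed data can be read with fewer seeks.
hot = [block for block, hits in frequency.most_common() if hits >= 3]
cold = [block for block in frequency if block not in hot]

disc_layout = {block: position for position, block in enumerate(hot + cold)}
print(disc_layout)   # {'b7': 0, 'b2': 1, 'b9': 2, 'b5': 3}
```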

Cloud Computing

Cloud computing is one possible solution to the Big Data challenge of sharing data from a single data store. It may help reduce the total amount of storage needed, since many users can access the same information from a single source. However, given current technology and the rate at which data is growing, its effectiveness is still questionable. The main concern is the cloud computing bottleneck: the bandwidth of the data transfer from the client to the provider and vice versa (Golden, B. 2009). In the genome research case mentioned earlier, a single operation involves transferring on the order of two terabytes of data to the cloud at one time; imagine doing that hundreds or thousands of times. It is also vital to be able to access the data as quickly as possible, and the process of moving the data into the cloud and retrieving it back and forth may not be up to that task (Ferguson, R.B., 2012).

Conclusion

Big Data has become one of the major influences changing the way sectors, companies and even individuals perceive information. However, extracting and identifying the useful information is not as easy as it may seem. Companies such as IBM, Oracle and EMC are among the few that pioneered Big Data analysis services. Challenges such as the volume, velocity, variety and veracity of Big Data need to be tackled in order to obtain useful information. Some of the technologies available today, such as Hadoop, Complex Event Processing and Adaptive Locality-Aware Data Reallocation, are essential but not sufficient, and there is still much room for improvement in what can be achieved with Big Data. Sometimes it is not a lack of technological progress that causes inefficiency in managing Big Data but a lack of expertise in the field (McAfee and Brynjolfsson 2012). Other issues, such as the security and privacy of data, have also arisen: Big Data includes data generated from social media, for example, which sometimes contains confidential information. A new system that can store and handle the volume, velocity, variety and veracity of Big Data efficiently, without crossing the boundaries of users' confidential information, is desired.


