Relationship Between Big Data And Hadoop


Dumitru Clim

A00202920

Hadoop report


Background

Motivation for Hadoop

Apache Hadoop is an open-source software framework supporting data-intensive distributed applications. Hadoop is modelled on Google's MapReduce programming framework and the Google File System (GFS); Hadoop's counterpart to GFS is HDFS (the Hadoop Distributed File System). Hadoop is written in Java and is capable of performing analysis over petabytes of data using distributed computing on clusters of inexpensive servers. It is:

Reliable: The software is fault tolerant; it expects and handles hardware and software failures.

Scalable: Designed for massive scale of processors, memory, and locally attached storage.

Distributed: Handles replication and offers a massively parallel programming model, MapReduce.

Organizations that deal with large volumes of data, or that depend heavily on analysing their data, rely on Hadoop. Companies such as Facebook and Twitter use Hadoop extensively, and the friend-suggestion features of social networking sites run on it. Amazon uses Hadoop for user sentiment analysis, user navigation-trend analysis, product recommendation and so on. Google's indexing algorithms were built on MapReduce, the framework on which Hadoop is modelled, and banking firms use Hadoop for fraud-detection systems.

Adaptation of existing approaches for evolving end user practices

With the explosion of data usage and the quest for knowledge from internet search engines, it became very clear that all of this data had to be stored on-location and on-demand, so that users could access it and download it to their devices in a split second.

For example, Yahoo deployed large-scale clusters in 2007 to cater for this explosion in data. Yahoo's website content optimization, search indexes, advertisement optimization and content-feed processing all benefit from using Hadoop to process data. Yahoo has over 20,000 machines running Hadoop, holding several petabytes of user data. Other users of Hadoop include IBM, Google, Amazon, Facebook and AOL.

Relationship between "Big Data" and Hadoop

Big Data consists of high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. More than 80% of the data captured today is unstructured: it comes from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, cell phone GPS signals, and so on. All of this unstructured data is Big Data.[2]

Organizations are discovering that important predictions can be made by sorting through and analysing Big Data. However, since 80% of this data is unstructured, it must be formatted (or structured) in a way that makes it suitable for data mining and subsequent analysis. Hadoop is the core platform for structuring Big Data, and it solves the problem of making it useful for analytics purposes.[2]

Hadoop Architecture

What are the components of the Hadoop software library?

HDFS is a highly fault-tolerant distributed file system designed to be deployed on low-cost commodity hardware, and it works best with huge files.

MapReduce, on the other hand, is a framework for dividing data sets into smaller independent problems and then merging the intermediate results into a final solution. The division phase is called Map, and the merge-and-solve phase is called Reduce.
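
To make the Map and Reduce phases concrete, the sketch below follows the standard word-count example from the Hadoop documentation, written against the Java MapReduce API. The class name and input/output paths are illustrative, not part of the report's sources.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure and submit the job; input and output are HDFS paths.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // classic constructor; newer releases use Job.getInstance(conf, ...)
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In practice the job is packaged into a JAR and submitted to the cluster with the hadoop jar command, with the input and output arguments pointing at HDFS directories.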

A typical Hadoop ecosystem looks like this:

Figure 1: Hadoop overview

Figure 2: Hadoop server roles

The three major categories of machine roles in a Hadoop deployment are Client machines, Master nodes, and Slave nodes.  The Master nodes oversee the two key functional pieces that make up Hadoop: storing lots of data (HDFS), and running parallel computations on all that data (Map Reduce).  The Name Node oversees and coordinates the data storage function (HDFS), while the Job Tracker oversees and coordinates the parallel processing of data using Map Reduce.  Slave Nodes make up the vast majority of machines and do all the dirty work of storing the data and running the computations.  Each slave runs both a Data Node and Task Tracker daemon that communicate with and receive instructions from their master nodes.  The Task Tracker daemon is a slave to the Job Tracker, the Data Node daemon a slave to the Name Node. [7]

Client machines have Hadoop installed with all the cluster settings, but are neither a Master nor a Slave.  Instead, the role of the Client machine is to load data into the cluster, submit Map Reduce jobs describing how that data should be processed and then retrieve or view the results of the job when it’s finished.  In smaller clusters (~40 nodes) you may have a single physical server playing multiple roles, such as both Job Tracker and Name Node.  With medium to large clusters you will often have each role operating on a single server machine. [7]
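
As a rough illustration of the client role described above, the following sketch shows how a client-side program points its configuration at the master daemons and loads data into the cluster. The host names, port numbers and file paths are assumptions for illustration only; the property names are those of the classic Hadoop 1.x / MRv1 configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientExample {
  public static void main(String[] args) throws Exception {
    // Point the client at the cluster's master daemons (hypothetical host names).
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020"); // Name Node (HDFS master)
    conf.set("mapred.job.tracker", "jobtracker.example.com:8021");   // Job Tracker (MapReduce master)

    // Load data into the cluster: copy a local file into HDFS.
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/tmp/web_logs.txt"), new Path("/user/demo/input/web_logs.txt"));
    fs.close();

    // A MapReduce job describing how to process this data would then be submitted
    // against the same Configuration, and its results retrieved when it finishes.
  }
}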

A multi-node Hadoop cluster

A small Hadoop cluster will include a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes, and compute-only worker nodes.

In a larger cluster, HDFS is managed through a dedicated NameNode server that hosts the file-system index, plus a secondary NameNode that can generate snapshots of the NameNode's memory structures, thereby reducing the risk of file-system corruption and loss of data. Similarly, a standalone JobTracker server can manage job scheduling.

File Systems

Hadoop Distributed File System. HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop instance typically has a single namenode, and a cluster of datanodes forms the HDFS cluster; this arrangement is typical but not required, since not every node has to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients use remote procedure calls (RPC) to communicate with it. HDFS stores large files (an ideal file size is a multiple of 64 MB[3]) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS was designed to handle very large files.[3]
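
As a rough sketch of how an application can control the block size and replication factor when writing a file, the snippet below uses the standard org.apache.hadoop.fs.FileSystem API; the path and parameter values are illustrative assumptions, not values taken from the report's sources.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/data/sample.txt");
    short replication = 3;                 // default factor: three copies of every block
    long blockSize = 64L * 1024 * 1024;    // 64 MB blocks, the classic HDFS default
    int bufferSize = 4096;

    // create(path, overwrite, bufferSize, replication, blockSize)
    try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
      out.writeUTF("hello HDFS");
    }
    fs.close();
  }
}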

HDFS has added high-availability capabilities, allowing the main metadata server (the namenode) to be failed over manually to a backup in the event of failure; automatic fail-over is being developed as well. Additionally, the file system includes what is called a secondary namenode, a name that misleads some people into thinking that when the primary namenode goes offline, the secondary namenode takes over. In fact, the secondary namenode regularly connects to the primary namenode and builds snapshots of the primary namenode's directory information, which are then saved to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions and then edit the log to create an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck when supporting a huge number of files, especially a large number of small files. HDFS Federation is a new addition that aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate namenodes. [1]

An advantage of using HDFS is data awareness between the job tracker and the task trackers: the job tracker schedules map or reduce jobs to task trackers with an awareness of the data location. For example, if node A contains data (x, y, z) and node B contains data (a, b, c), the job tracker will schedule node B to perform map or reduce tasks on (a, b, c) and node A to perform map or reduce tasks on (x, y, z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other file systems this advantage is not always available.[1]

This can have a significant impact on job-completion times, which has been demonstrated when running data-intensive jobs.[4]

HDFS was designed for mostly immutable files [5] and may not be suitable for systems requiring concurrent write operations.

Another limitation of HDFS is that it cannot be mounted directly by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can therefore be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem (for Linux and other Unix systems).[1]
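
Because HDFS cannot be mounted like an ordinary file system, data is usually moved in and out either with the hadoop fs shell commands or programmatically. The sketch below shows the programmatic route using the standard FileSystem API; the paths are illustrative assumptions (a typical reducer output file is used as an example).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Option 1: copy an HDFS file back to the local file system.
    fs.copyToLocalFile(new Path("/user/demo/output/part-r-00000"), new Path("/tmp/results.txt"));

    // Option 2: stream the file directly without copying it locally.
    try (FSDataInputStream in = fs.open(new Path("/user/demo/output/part-r-00000"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}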

MapReduce

MapReduce is a framework for performing distributed data processing using the MapReduce programming paradigm. In this paradigm, each job has a user-defined map phase, which is a parallel, shared-nothing processing of the input, followed by a user-defined reduce phase, in which the output of the map phase is aggregated. Typically, HDFS is the storage system for both the input and the output of MapReduce jobs.

The main components of MapReduce are:

JobTracker is the master of the system, which manages the jobs and resources in the cluster (the TaskTrackers). The JobTracker tries to schedule each map task as close as possible to the actual data being processed, i.e. on the TaskTracker running on the same DataNode as the underlying block.

TaskTrackers are the slaves which are deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the JobTracker.

JobHistoryServer is a daemon that serves information about completed applications, so that the JobTracker does not need to track them. It is typically run as part of the JobTracker itself, but it is recommended to run it as a separate daemon.

The following illustration provides details of the core components for the Hadoop stack.

Figure 3: Apache Hadoop components

Analysing Hadoop Performance

The performance improvement achievable using Hadoop in comparison to traditional SQL-based approaches

For our comparison we use an experimental evaluation of Hadoop-based MapReduce in an applicable scenario: the aggregate billing process in the billing gateway of a cellular operator.[6]

Figure 4: An aggregate billing system architecture

A relational database matches data by using common characteristics found within the data set. The resulting groups of data are organized and much easier for people to understand. A software package is therefore needed to facilitate the creation and maintenance of such a computerized database: a database management system (DBMS).

Because of the complex charging models of the new mobile-service business, the operator expects a more efficient way to calculate, or sum up, the content and service fees for users. A MapReduce-based solution built on the cloud-computing concept may be a good alternative to the traditional SQL Server based solution.[6] For the simulation, a billing data set comprising different kinds of service fees, such as ADSL and WiMAX access fees, is generated to evaluate the performance of queries for aggregate billing calculation (e.g. the total charge for a specified user).
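
The aggregate billing query maps naturally onto MapReduce: group billing records by user and sum the fees. The sketch below follows the same pattern as the word-count example earlier; the comma-separated record layout (userId,serviceType,fee) is an assumed format for illustration and is not taken from [6].

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BillingAggregation {

  // Map phase: parse each billing record "userId,serviceType,fee" and emit (userId, fee).
  public static class BillingMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 3) {
        context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[2])));
      }
    }
  }

  // Reduce phase: sum all fees for a user to obtain the aggregate bill.
  public static class BillingReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable fee : values) {
        total += fee.get();
      }
      context.write(key, new DoubleWritable(total));
    }
  }

  // The driver would be configured exactly like the word-count example above,
  // with these classes set as the mapper and reducer.
}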

Figure 5: Comparison of SQL Server and Hadoop query times for aggregate billing

When there are only a few transactions in the database, SQL Server can complete the query in a very short time compared to MapReduce on Hadoop. As the data set grows, however, Hadoop outperforms SQL Server significantly: with 10,000,000 transactions, SQL Server takes almost 2 minutes to complete the query, whereas Hadoop takes only 80 seconds[6]. A query time of that length may hinder real-time billing information queries. Overall, the relative performance of Hadoop improves as the size of the dataset increases. Using Hadoop therefore saves time and computational effort, so that other computation tasks can be added without increasing the load on the machines.[6]

The comparison between MapReduce on Hadoop and SQL Server therefore shows that Hadoop completes the query in much less time than SQL Server in this simulation of aggregate billing calculation.

Hadoop and the Telecommunication Industry

Telecommunications service providers are some of the biggest collectors of data today. Cell phones have evolved into mobile devices that serve as consumers' personal assistants and sidekicks. Each of their many functions generates its own constant stream of multi-structured data, and all of that data must be efficiently captured, processed and analysed.

Telecommunications providers are turning to Apache Hadoop to keep pace with these massive volumes of data by deploying Hadoop in mission-critical applications. With Hadoop, companies can reduce their infrastructure costs while maximizing average revenue per user and preventing churn. In traditional environments, customer information is captured in different systems, and combining it quickly and efficiently for analysis is very difficult. Telecoms have therefore turned to Hadoop to help deploy Big Data systems that enable large-scale data processing and analysis.[8]

With network data volumes on the rise, it is imperative for telecom companies to keep a close watch on their networks to keep them functioning at peak performance, which is the key to retaining customers. Relying on data samples or aggregations isn't an option. Hadoop is a system that can scale to accommodate these data volumes at a reasonable cost.

Operators are used to managing petabytes of data. However, converting this data into information and knowledge is the next step towards monetizing it. At the moment, Big Data solutions focus on storing, manipulating and reporting large volumes of data; the Big Data revolution is only just starting, and there is a need for Big Data applications, app stores, tools, and so on.

Today's mobile devices produce a steady and voluminous stream of machine-generated data, which traditional R&D infrastructures cannot effectively capture and process. Hadoop's scalable, massively parallel infrastructure is well suited to this workload, and it is highly cost-effective because Hadoop runs on industry-standard hardware.


