The Importance of Big Data Computing


Abstract: The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bioinformaticists, sociologists and many others are clamoring for access to the massive quantities of information produced by and about people, things and their interactions. Diverse groups argue about the potential benefits and costs of analysing information from Twitter, Google, Verizon, 23andMe, Facebook, Wikipedia and every space where large groups of people leave digital traces and deposit data. Given the rise of Big Data as both a phenomenon and a methodological persuasion, it is time to start critically interrogating this phenomenon, its assumptions and its biases. Big data, which refers to data sets that are too big to be handled using existing database management tools, is emerging in many important applications, such as Internet search, business informatics, social networks, social media, genomics, and meteorology. Big data presents a grand challenge for database and data analytics research. Storage techniques and file system design for the Hadoop Distributed File System (HDFS), together with the associated implementation tradeoffs, are also discussed. This paper is a blend of non-technical and introductory-level technical detail, ideal for the novice. We conclude with some technical challenges, as well as solutions that can be applied to these challenges in social network data.

Key Words: data sets, database management tools, database management applications.

1. INTRODUCTION

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data. Big data is a popular term used to describe the exponential growth, availability and use of information, both structured and unstructured. Much has been written on the big data trend and how it can serve as the basis for innovation, differentiation and growth. According to IDC, it is imperative that organizations and IT leaders focus on the ever-increasing volume, variety and velocity of information that forms big data.

2. REQUIREMENT

We believe that organizations need to embrace and understand data to make better sense of the world. Big data matters for several reasons.

The world is increasingly awash in sensors that create more data: both explicit sensors, such as point-of-sale scanners and RFID tags, and implicit sensors, such as cell phones with GPS and search activity. Harnessing both explicit and implicit human contribution leads to far more profound and powerful insights than traditional data analysis alone. For example: 1) Google can detect regional flu outbreaks seven to ten days faster than the Centers for Disease Control and Prevention by monitoring increased search activity for phrases associated with flu symptoms. 2) MIT researchers were able to predict location and social interactions by analyzing patterns in geospatial and proximity data collected from students carrying GPS-enabled cell phones for a semester. 3) IMMI captures media rating data by giving participants special cell phones that monitor ambient noise and identify where and what media (e.g., TV, radio, music, video games) a person is watching, listening to, or playing.

Competitive advantage comes from capturing data more quickly and building systems that respond automatically to that data. The practice of sensing, processing, and responding is arguably the hallmark of living things. We are now starting to build computers that work the same way, and we are building enterprises around this new kind of sense-and-respond computing infrastructure. As our aggregate behavior is measured and monitored, it becomes feedback that improves the overall intelligence of the system, a phenomenon Tim O'Reilly refers to as harnessing collective intelligence.

With more data becoming publicly available (from the Web, from public data-sharing sites such as Infochimps, Swivel, and IBM's Many Eyes, from increasingly transparent government sources, from science organizations, from data analysis contests such as Netflix's, and so on), there are more opportunities for mashing data together and open-sourcing analysis. Bringing disparate data sources together can provide context and deeper insights than what is available from the data in any one organization.

Finally, experimentation and models drive the analysis culture. At Google, the search quality team has the authority and mandate to fine-tune search rankings and results. To boost search quality and relevancy, they focus on tweaking the algorithms, not analyzing the data. Models improve as more data becomes available; for example, Google's automatic language translation tools keep getting better over time as they absorb more data.

Figure: Hype cycle of big data, as of July 2012.

3. Dimensions

Big data describes data that is too large for existing systems to process. It is commonly categorised along the following dimensions. Volume. This original characteristic describes the size of the data relative to the processing capability. Today a large volume may be 10 terabytes; in 12 months, 50 terabytes may constitute big data if we follow Moore's Law. Overcoming the volume issue requires technologies that store vast amounts of data in a scalable fashion and provide distributed approaches to querying or finding that data. Two options exist today: Apache Hadoop based solutions and massively parallel processing databases such as Calpont, EMC Greenplum, EXASOL, HP Vertica, IBM Netezza, Kognitio, ParAccel, and Teradata Kickfire.
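To make the volume discussion concrete, the following is a minimal, self-contained Python sketch of the block-splitting and replication idea that distributed file systems such as HDFS use to spread a large file across a cluster. The block size, replication factor, and node names here are illustrative assumptions, not an implementation of HDFS itself.

    # Minimal sketch of the block-splitting and replication idea behind HDFS-style storage.
    # Block size, replication factor, and node names are illustrative assumptions.
    import itertools

    BLOCK_SIZE = 128 * 1024 * 1024                 # 128 MB per block (assumed)
    REPLICATION = 3                                # each block stored on 3 nodes (assumed)
    NODES = [f"node-{i:02d}" for i in range(12)]   # hypothetical cluster

    def place_blocks(file_size_bytes):
        """Split a file into fixed-size blocks and assign each block to REPLICATION nodes."""
        n_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
        node_cycle = itertools.cycle(NODES)
        placement = {}
        for b in range(n_blocks):
            placement[b] = [next(node_cycle) for _ in range(REPLICATION)]
        return placement

    if __name__ == "__main__":
        one_tb = 10**12
        plan = place_blocks(one_tb)
        print(f"1 TB file -> {len(plan)} blocks, "
              f"{len(plan) * REPLICATION} block replicas across {len(NODES)} nodes")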

Velocity. Velocity describes the frequency at which data is generated, captured, and shared. The growth in sensor data from devices and in web-based clickstream analysis now creates requirements for more real-time use cases. The velocity of large data streams powers the ability to parse text, detect sentiment, and identify new patterns. Real-time offers in a world of engagement require fast matching and immediate feedback loops so that promotions align with geolocation data, customer purchase history, and current sentiment. Key technologies that address velocity include stream processing and complex event processing. NoSQL databases are used when relational approaches no longer make sense. In addition, the use of in-memory databases (IMDBs), columnar databases, and key-value stores helps improve retrieval of pre-calculated data.
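As a concrete illustration of velocity, the sketch below shows the kind of sliding-window aggregation that stream-processing and complex event processing engines perform: counting events per key over a rolling time window. The event format and window length are illustrative assumptions, not the API of any particular streaming product.

    # Sliding-window event counter: count events per key over the last WINDOW_SECONDS.
    # Event format and window length are illustrative assumptions.
    from collections import deque, Counter

    WINDOW_SECONDS = 60   # assumed window length

    class SlidingWindowCounter:
        def __init__(self, window=WINDOW_SECONDS):
            self.window = window
            self.events = deque()     # (timestamp, key) pairs in arrival order
            self.counts = Counter()   # current per-key counts inside the window

        def add(self, timestamp, key):
            self.events.append((timestamp, key))
            self.counts[key] += 1
            self._expire(timestamp)

        def _expire(self, now):
            while self.events and self.events[0][0] <= now - self.window:
                _, old_key = self.events.popleft()
                self.counts[old_key] -= 1
                if self.counts[old_key] == 0:
                    del self.counts[old_key]

    # Example: clickstream events as (timestamp_seconds, page) pairs
    w = SlidingWindowCounter()
    for ts, page in [(1, "/home"), (5, "/offer"), (30, "/offer"), (70, "/offer")]:
        w.add(ts, page)
    print(w.counts)   # events older than 60 s have been expired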

Variety. A proliferation of data types from social, machine-to-machine, and mobile sources adds new data types to traditional transactional data. Data no longer fits into neat, easy-to-consume structures. New types include content, geospatial data, hardware data points, location-based data, log data, machine data, metrics, mobile data, physical data points, process data, RFID readings, search, sentiment, streaming data, social data, text, and web data. The addition of unstructured data such as speech, text, and language increasingly complicates the ability to categorize data. Technologies that deal with unstructured data include data mining, text analytics, and noisy-text analytics.
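To illustrate variety, the sketch below turns a piece of unstructured text into a structured record by tokenizing it, counting terms, and attaching a crude lexicon-based sentiment score. The lexicon and scoring rule are toy assumptions intended only to show the shape of a text-analytics step, not a real product.

    # Turn unstructured text into a structured record: tokens, term counts, crude sentiment.
    # The lexicon and scoring rule are toy assumptions.
    import re
    from collections import Counter

    POSITIVE = {"great", "love", "fast"}     # toy lexicon (assumed)
    NEGATIVE = {"slow", "broken", "hate"}

    def to_record(doc_id, text):
        tokens = re.findall(r"[a-z']+", text.lower())
        score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        return {"id": doc_id, "terms": Counter(tokens), "sentiment": score}

    print(to_record(1, "Love the new phone, but the app is slow."))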

In an age where we shift from transactions to engagement and then to experience, the forces of social, mobile, cloud, and unified communications add two more big data characteristics that should be considered when seeking insights. These characteristics highlight the importance, and the complexity, of establishing context in big data.

Viscosity. Viscosity measures the resistance to flow in the volume of data. This resistance can come from different data sources, friction from integration flow rates, and the processing required to turn the data into insight. Technologies that deal with viscosity include improved streaming, agile integration buses, and complex event processing.

Virality. Virality describes how quickly information is dispersed across people-to-people (P2P) networks. It measures how quickly data is spread and shared to each unique node. Time is a determining factor, along with the rate of spread.

4. TECHNOLOGIES

The rising importance of big-data computing stems from advances in many different technologies:

Sensors: Digital data are being generated by many different sources, including digital imagers (telescopes, video cameras, MRI machines), chemical and biological sensors (microarrays, environmental monitors), and even the millions of individuals and organizations generating web pages.

Computer networks: Data from the many different sources can be collected into massive data sets via localized sensor networks, as well as the Internet.

Data storage: Advances in magnetic disk technology have dramatically decreased the cost of storing data. For example, a one-terabyte disk drive, holding one trillion bytes of data, costs around $100. As a reference, it is estimated that if all of the text in all of the books in the Library of Congress could be converted to digital form, it would add up to only around 20 terabytes.

Cluster computer systems: A new form of computer system, consisting of thousands of "nodes," each having several processors and disks, connected by high-speed local-area networks, has become the chosen hardware configuration for data-intensive computing systems. These clusters provide both the storage capacity for large data sets and the computing power to organize the data, to analyze it, and to respond to queries about the data from remote users. Compared with traditional high-performance computing (e.g., supercomputers), where the focus is on maximizing the raw computing power of a system, cluster computers are designed to maximize the reliability and efficiency with which they can manage and analyze very large data sets. The "trick" is in the software algorithms: cluster computer systems are composed of huge numbers of cheap commodity hardware parts, with scalability, reliability, and programmability achieved by new software paradigms.

Cloud computing facilities: The rise of large data centers and cluster computers has created a new business model, where businesses and individuals can rent storage and computing capacity, rather than making the large capital investments needed to construct and provision large-scale computer installations. For example, Amazon Web Services (AWS) provides both network-accessible storage priced by the gigabyte-month and computing cycles priced by the CPU-hour. Just as few organizations operate their own power plants, we can foresee an era where data storage and computing become utilities that are ubiquitously available.
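The rent-rather-than-buy model can be made concrete with a back-of-the-envelope calculation in the style of the gigabyte-month and CPU-hour pricing described above. The rates used here are illustrative assumptions, not current AWS prices.

    # Back-of-the-envelope rent-vs-own comparison; all prices are illustrative assumptions.
    STORAGE_PER_GB_MONTH = 0.10   # $/GB-month (assumed)
    COMPUTE_PER_CPU_HOUR = 0.10   # $/CPU-hour (assumed)

    def monthly_cloud_cost(storage_gb, cpu_hours):
        """Monthly rental cost for a given amount of storage and compute."""
        return storage_gb * STORAGE_PER_GB_MONTH + cpu_hours * COMPUTE_PER_CPU_HOUR

    # e.g., keeping 10 TB of data plus 5,000 CPU-hours of analysis in one month
    print(f"${monthly_cloud_cost(10_000, 5_000):,.2f} per month")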

Data analysis algorithms: The enormous volumes of data require automated or semi-automated analysis: techniques to detect patterns, identify anomalies, and extract knowledge. Again, the "trick" is in the software algorithms: new forms of computation, combining statistical analysis, optimization, and artificial intelligence, are able to construct statistical models from large collections of data and to infer how the system should respond to new data. For example, Netflix uses machine learning in its recommendation system, predicting the interests of a customer by comparing her movie viewing history to a statistical model generated from the collective viewing habits of millions of other customers.
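The paragraph above describes building statistical models from collective behavior. The sketch below shows the general idea with a tiny item co-occurrence recommender: titles frequently watched by the same customers are suggested to viewers who have seen one of them. The data and the scoring are illustrative assumptions, not Netflix's actual algorithm.

    # Tiny item co-occurrence recommender; data and scoring are illustrative assumptions.
    from collections import Counter
    from itertools import combinations

    histories = {                  # hypothetical viewing histories
        "u1": {"A", "B", "C"},
        "u2": {"A", "C", "D"},
        "u3": {"B", "C", "D"},
    }

    # Count how often each pair of titles is watched by the same customer.
    co_counts = Counter()
    for watched in histories.values():
        for a, b in combinations(sorted(watched), 2):
            co_counts[(a, b)] += 1
            co_counts[(b, a)] += 1

    def recommend(seen, top_n=2):
        """Score unseen titles by how often they co-occur with titles already seen."""
        scores = Counter()
        for title in seen:
            for (a, b), c in co_counts.items():
                if a == title and b not in seen:
                    scores[b] += c
        return [t for t, _ in scores.most_common(top_n)]

    print(recommend({"A"}))   # titles most often co-watched with A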

5. Uses

So the real issue is not that you are acquiring large amounts of data, because we are clearly already in the era of big data. It is what you do with your big data that matters. The hopeful vision for big data is that organizations will be able to harness relevant data and use it to make the best decisions. Technologies today not only support the collection and storage of large amounts of data, they provide the ability to understand and take advantage of its full value, which helps organizations run more efficiently and profitably. For instance, with big data and big data analytics, it is possible to: analyse millions of SKUs to determine optimal prices that maximize profit and clear inventory; recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk; mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers; quickly identify the customers who matter the most; generate retail coupons at the point of sale based on the customer's current and past purchases, ensuring a higher redemption rate; send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers; analyse data from social media to detect new market trends and changes in demand; use clickstream analysis and data mining to detect fraudulent behaviour; and determine the root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.

Figure: At multiple terabytes in size, the text and images of Wikipedia are a classic example of big data (IBM visualization of Wikipedia edits).

6. Examples

RFID (radio frequency ID) systems generate up to 1,000 times the data of conventional bar code systems.

10,000 payment card transactions are made every second around the world. Wal-Mart handles more than 1 million customer transactions an hour. 340 million tweets are sent per day. That's nearly 4,000 tweets per second. Facebook has more than 901 million active users generating social interaction data. More than 5 billion people are calling, texting, tweeting and browsing websites on mobile phones.

7. Challenges

Much of the technology required for big-data computing is developing at a satisfactory rate due to market forces and technological evolution. For example, disk drive capacity is increasing and prices are dropping due to the ongoing progress of magnetic storage technology and the large economies of scale provided by both personal computers and large data centers. Other aspects require more focused attention, including:

High-speed networking: Although one terabyte can be stored on disk for just $100, transferring that much data requires an hour or more within a cluster and roughly a day over a typical "high-speed" Internet connection. (Curiously, the most practical method for transferring bulk data from one site to another is to ship a disk drive via Federal Express.) These bandwidth limitations increase the challenge of making efficient use of the computing and storage resources in a cluster. They also limit the ability to link geographically dispersed clusters and to transfer data between a cluster and an end user. The disparity between the amount of data that it is practical to store and the amount that it is practical to communicate will continue to increase. We need a "Moore's Law" for networking, in which declining costs for networking infrastructure combine with increasing bandwidth.
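The transfer-time figures above follow from simple bandwidth arithmetic, sketched below. The assumed link speeds (1 Gb/s inside a cluster, roughly 100 Mb/s for a "high-speed" Internet connection) are illustrative; different assumptions give proportionally different times.

    # Bandwidth arithmetic behind the transfer-time estimates above.
    # Link speeds are illustrative assumptions.
    TERABYTE_BITS = 8 * 10**12   # one terabyte expressed in bits

    def transfer_hours(bits_per_second):
        """Hours needed to move one terabyte at the given sustained bandwidth."""
        return TERABYTE_BITS / bits_per_second / 3600

    print(f"1 Gb/s cluster link:    {transfer_hours(10**9):.1f} hours")        # about 2.2 hours
    print(f"100 Mb/s Internet link: {transfer_hours(100 * 10**6):.1f} hours")  # about 22 hours, roughly a day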

Cluster computer programming: Programming large-scale, distributed computer systems is a longstanding challenge that becomes essential for processing very large data sets in reasonable amounts of time. The software must distribute the data and computation across the nodes in a cluster, and detect and remediate the inevitable hardware and software errors that occur in systems of this scale. Major innovations have been made in methods to organize and program such systems, including the MapReduce programming framework introduced by Google. Much more powerful and general techniques must be developed to fully realize the power of big-data computing across multiple domains.
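As an illustration of the MapReduce idea mentioned above, the sketch below implements word count in the style of Hadoop Streaming, where the mapper and reducer are ordinary programs that read standard input and emit tab-separated key/value lines. It runs locally as a single process and is not a production Hadoop job.

    # Word-count sketch in the style of Hadoop Streaming; an illustration of the
    # MapReduce idea, not a production Hadoop job.
    import sys
    from itertools import groupby

    def mapper(lines):
        """Map phase: emit (word, 1) for every word in the input."""
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        """Reduce phase: pairs arrive grouped (sorted) by key; sum the counts per word."""
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # Locally we chain the two phases directly; on a cluster the framework
        # distributes the map tasks, shuffles/sorts by key, and runs the reducers.
        for word, total in reducer(mapper(sys.stdin)):
            print(f"{word}\t{total}")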

Extending the reach of cloud computing: Although Amazon is making good money with AWS, technological limitations, especially communication bandwidth, make AWS unsuitable for tasks that require extensive computation over large amounts of data. In addition, the bandwidth limitations of getting data in and out of a cloud facility incur considerable time and expense. In an ideal world, cloud systems would be geographically dispersed to reduce their vulnerability to earthquakes and other catastrophes, but this requires much greater levels of interoperability and data mobility. The Open Cirrus project is pointed in this direction, setting up an international testbed to allow experiments on interlinked cluster systems. On the administrative side, organizations must adjust to a new costing model. For example, government contracts to universities do not charge overhead for capital costs (e.g., buying a large machine), but they do for operating costs (e.g., renting from AWS). Over time, we can envision an entire ecology of cloud facilities, some providing generic computing capabilities and others targeted toward specific services or holding specialized data sets.

Machine learning and other data analysis techniques: As a scientific discipline, machine learning is still in its early stages of development. Many algorithms do not scale beyond data sets of a few million elements or cannot tolerate the statistical noise and gaps found in real-world data. Further research is required to develop algorithms that apply in real-world situations and on data sets of trillions of elements. The automated or semi-automated analysis of enormous volumes of data lies at the heart of big-data computing for all application domains.
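One common response to the scaling limits noted above is to prefer single-pass, constant-memory ("streaming") algorithms. The sketch below uses Welford's online update for the mean and variance as a minimal example: it never holds the data set in memory, so in principle it scales to arbitrarily many elements.

    # Welford's online mean/variance: a single-pass, constant-memory computation
    # given here as a minimal example of a streaming-friendly algorithm.
    class RunningStats:
        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0   # sum of squared deviations from the running mean

        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        @property
        def variance(self):
            return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    stats = RunningStats()
    for x in (5.0, 7.0, 3.0, 9.0):   # stand-in for a stream of trillions of values
        stats.update(x)
    print(stats.n, stats.mean, stats.variance)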

Widespread deployment: Until recently, the main innovators in this domain have been companies with Internet-enabled businesses, such as search engines, online retailers, and social networking sites. Only now are technologists in other organizations (including universities) becoming familiar with the capabilities and tools. Although many organizations are collecting large amounts of data, only a handful are making full use of the insights that this data can provide. We expect "big-data science", often referred to as e-science, to be pervasive, with far broader reach and impact even than previous-generation computational science.

Security and privacy: Data sets consisting of so much possibly sensitive data, and the tools to extract and make use of this information, give rise to many possibilities for unauthorized access and use. Much of our preservation of privacy in society relies on current inefficiencies. For example, people are monitored by video cameras in many locations: ATMs, convenience stores, airport security lines, and urban intersections. Once these sources are networked together, and sophisticated computing technology makes it possible to correlate and analyze these data streams, the prospect for abuse becomes significant. In addition, cloud facilities become a cost-effective platform for malicious agents, e.g., to launch a botnet or to apply massive parallelism to break a cryptosystem. Along with developing this technology to enable useful capabilities, we must create safeguards to prevent abuse.

8. Solutions

Specific actions that the federal government could take include the following.

Give the NSF a large enough budget increase that it can foster efforts in big-data computing without having to cut back on other programs. Research thrusts within computing must cover a wide range of topics, including hardware and system software design; data-parallel programming and algorithms; automatic tuning, diagnosis and repair in the presence of faults; scalable machine learning algorithms; security and privacy; and applications such as language translation and computer vision. Interdisciplinary programs should marry technologists with applications experts who have access to extremely large datasets in other fields of science, medicine, and engineering.

Reconsider the current plans to construct special-purpose data centers for the major science programs. Possible economies of scale could be realized by consolidating these into a small number of "super data centers" provisioned as cloud computing facilities. This approach would provide opportunities for technologists to interact with and support domain scientists more effectively. These efforts should be coupled with large-scale networking research projects.

Renew the role of DARPA in driving innovations in computing technology and its applications. Projects could include applications of interest to the DoD, including language understanding, image and video analysis, and sensor networks. In addition, DARPA should be driving the fundamental technology required to address problems at the scale faced by the DoD. Both the systems and the data analysis technologies are clearly "dual use."

Sensitize the DoD to the potential for technological surprise. An adversary with very modest financial resources could have access to supercomputer-class computing facilities: $100 buys 1,000 processors for 1 hour on AWS, and $100M (considerably less than the cost of a single modern strike fighter) buys one billion processor-hours. Cloud computing must be considered a strategic resource, and it is essential that the US stays in the lead in the evolution and application of this technology.

Get the DoE to look beyond traditional high-performance computing in carrying out its energy and nuclear weapons missions. Many of its needs could be addressed better and more cost-effectively by cluster computing systems, possibly making use of cloud facilities.

Encourage the deployment and application of big-data computing in all facets of government, ranging from the IRS (tax fraud detection) to the intelligence agencies (multimedia information fusion), the CDC (temporal and geographic tracking of disease outbreaks), and the Census Bureau (population trends).

Make fundamental investments in our networking infrastructure to provide ubiquitous, broadband access to end users and to cloud facilities.

Big-data computing is perhaps the biggest innovation in computing in the last decade. We have only begun to see its potential to collect, organize, and process data in all walks of life. A modest investment by the federal government could greatly accelerate its development and deployment.

9. Conclusions

Many people view "big data" as an over-hyped buzzword. It is, however, a useful term because it highlights new data management and data analysis technologies that enable organizations to analyze certain types of data and handle certain types of workload that were not previously possible. The actual technologies used will depend on the volume of data, the variety of data, the complexity of the analytical processing workloads involved, and the responsiveness required by the business. It will also depend on the capabilities provided by vendors for managing, administering, and governing the enhanced environment. These capabilities are important selection criteria for product evaluation. Big data, however, involves more than simply implementing new technologies. It requires senior management to understand the benefits of smarter and more timely decision making. It also requires business users to make pragmatic decisions about agility requirements for analyzing data and producing analytics, given tight IT budgets. The good news is that many of the technologies outlined in this article not only support smarter decision making, but also provide faster time to value.


