What Is Data Analysis


Getting Started

In this chapter we discuss the principles of data analysis and why it is important for business. We look into the nature of data, both structured (databases, logs, and reports) and unstructured (image collections, social networks, and text mining), and at how visualizing this data can help us.

What is Data Analysis?

Data analysis is the process in which raw data is ordered and organized to be used in methods that help explain the past and predict the future. Data analysis is not about the numbers; it is about asking questions, developing explanations, and testing hypotheses. Data analysis is a multidisciplinary field that combines Computer Science, Artificial Intelligence, Machine Learning, Statistics, Mathematics, and Knowledge Domain, as shown in the image below:

[Figure: Data analysis as a multidisciplinary field combining Computer Science, Artificial Intelligence, Machine Learning, Statistics, Mathematics, and Knowledge Domain]

Computer Science

Computer Science creates the tools for data analysis. The vast amount of data generated has made computational analysis critical and has increased the demand for skills such as programming, database administration, network administration, and high-performance computing. Some programming experience in Python (or any high-level programming language) is needed to follow the chapters.

Artificial Intelligence (AI)

According to Stuart Russell and Peter Norvig,

"Artificial intelligence has to do with smart programs, so let's get on and write some".

In other words, AI studies algorithms that can simulate intelligent behavior. In data analysis we use AI to perform activities that require intelligence, such as inference, similarity search, or unsupervised classification.

Machine Learning

Machine Learning (ML) is the study of computer algorithms that learn how to react in a certain situation or how to recognize patterns. According to Arthur Samuel (1959),

"Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed".

ML offers a large number of algorithms, generally split into three groups according to how they are trained:

Supervised learning

Unsupervised learning

Reinforcement learning

A relevant number of these algorithms is used throughout the book, combined with practical examples that lead the reader from a data problem to its programming solution.
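To make the distinction between the first two groups concrete, here is a minimal sketch in Python. It uses scikit-learn, a library that is not part of this book's toolset and is chosen here purely for illustration; the toy data points are invented.

# A supervised classifier learns from labeled examples; an
# unsupervised algorithm finds structure in unlabeled data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]  # toy feature vectors
y = [0, 0, 1, 1]                      # labels, used only by the supervised model

# Supervised: fit with labels, then predict the class of a new point
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[2, 1]]))          # -> [0]

# Unsupervised: no labels given; the algorithm groups the points itself
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                     # the two clusters it discovered

Reinforcement learning is different in kind: the algorithm learns by acting and receiving rewards, so it does not reduce to a single fit call like the two sketches above.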

Statistics

In January 2009, Google's Chief Economist Hal Varian said

"I keep saying the sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?"

Statistics is the development and application of methods to collect, analyse and interpret data. Data analysis encompasses a variety of statistical techniques such as simulation, Bayesian methods, forecasting, regression, time-series analysis and clustering.

Mathematics

Data analysis makes use of many mathematical techniques, such as linear algebra (vectors and matrices, factorization, eigenvalues), numerical methods, and conditional probability. In this book all the chapters are self-contained and include the necessary math involved.
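As a quick illustration of the conditional probability many of these algorithms rely on, here is Bayes' rule computed in a few lines of Python; the spam-filter numbers are invented for the example.

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
# Invented example: probability that an email is spam given it contains "free"
p_spam = 0.2                 # P(spam), the prior
p_free_given_spam = 0.6      # P("free" | spam)
p_free_given_ham = 0.05      # P("free" | not spam)

# Total probability of seeing "free" in any email
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))   # -> 0.75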

Knowledge Domain

One of the most important activities in data analysis is asking questions, and a good understanding of the knowledge domain gives you the expertise and intuition needed to ask good ones. Data analysis is used in almost every domain, including Finance, Administration, Business, Social Media, Government, and Science.

The Data Analysis Process

When you have a good understanding of a phenomenon, it is possible to make predictions about it. Data analysis helps us make this possible by exploring the past and creating predictive models.

The data analysis process is composed of the following steps:

Problem definition

Obtain your data

Clean the data

Normalize the data

Transform the data

Exploratory Statistics

Exploratory Visualization

Predictive Modeling

Validate your model

Visualize and Interpret your results

Deploy your solution

All these activities can be grouped as shown in the image below:

[Figure: The data analysis process]

The Problem

The problem definition starts with high-level questions, such as how to track differences in behavior between groups of customers, or what the gold price will be next month. Understanding the objectives and requirements from a domain perspective is the key to a successful data analysis project.

Types of data analysis questions:

Inferential

Predictive

Descriptive

Exploratory

Causal

Correlational

Data Preparation

Data preparation is about how to obtain, clean, normalize, and transform the data into an optimal dataset, trying to avoid potential data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take a lot of your time. In Chapter 2 we go into more detail about working with data, using OpenRefine to address the complicated tasks. Analyzing data that has not been carefully prepared can lead you to highly misleading results.
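As a small illustration of cleaning and normalizing, the sketch below uses pandas (a Python package mentioned later in this chapter, not the OpenRefine workflow of Chapter 2); the column names and the invalid values are invented.

import pandas as pd

# Invented sample: -5 is out-of-range, None is a missing value
raw = pd.DataFrame({"name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
                    "age": [34, -5, 51, None, 28]})

clean = raw.copy()
# Treat out-of-range ages as missing, then drop rows with no valid age
clean.loc[(clean["age"] < 0) | (clean["age"] > 120), "age"] = None
clean = clean.dropna(subset=["age"])

# Normalize age to the [0, 1] interval (min-max scaling)
span = clean["age"].max() - clean["age"].min()
clean["age_norm"] = (clean["age"] - clean["age"].min()) / span
print(clean)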

The characteristics of good data are listed as follows:

Complete

Coherent

Ambiguity elimination

Countable

Correct

Standardized

Redundancy elimination

Data Exploration

Data exploration is essentially looking at the data in a graphical or statistical form, trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found. In Chapter 3 we present a visualization framework (D3.js) and implement some examples of how to use visualization as a data exploration tool.

Predictive Modeling

Predictive modeling is a process used in data analysis to create or choose a statistical model that best predicts the probability of an outcome. In this book we use a variety of such models, grouped into three categories based on their outcome:

Categorical outcome (Classification):

Chapter 4: Naïve Bayes Classifier
Chapter 9: Decision Tree Learning
Chapter 11: Natural Language Toolkit + Naïve Bayes Classifier

Numerical outcome (Regression):

Chapter 6: Random Walk
Chapter 7: Support Vector Machines
Chapter 8: Distance-Based Approach + k-Nearest Neighbor

Descriptive modeling (Clustering):

Chapter 5: Fast Dynamic Time Warping (FDTW) + Distance Metrics
Chapter 10: Multidimensional Scaling + K-Means

Another important task we need to accomplish in this step is evaluating whether the model we chose is optimal for the particular problem. The No Free Lunch theorem, proposed by Wolpert in 1996, states:

"No Free Lunch theorems have shown that learning algorithms cannot be universally good".

Model evaluation helps us ensure that our analysis is not overoptimistic or overfitted. In this book we are going to present two different ways to validate the model:

k-fold cross-validation: we divide the data into k subsets of equal size. We build models k times, each time leaving out one of the subsets from training and using it as the test set. If k equals the sample size, this is called "leave-one-out" (a short sketch of the splitting follows this list).

Hold-out: used mostly for large datasets; the data is randomly divided into three subsets: a training set, a validation set, and a test set.
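The splitting logic behind k-fold cross-validation is simple enough to sketch in plain Python. This only shows how the folds are produced; training and scoring a model on each fold is left out.

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

# 6 samples, 3 folds: every sample appears in exactly one test set
for train, test in k_fold_indices(6, 3):
    print("train:", train, "test:", test)

With k equal to the sample size, this produces the leave-one-out scheme described above.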

Visualization of Results

This is the final step in our analysis process, and we need to answer two questions:

How are the results going to be presented?

Where are they going to be deployed?

In this book we chose the D3.js visualization framework in order to deploy the results directly on a web server.

Data, Information and Knowledge

Data are facts of the world. For example, financial transactions, age, temperature, or the number of steps from my house to my office are simply numbers. Information appears when we work with those numbers and find value and meaning in them. Information can help us make informed decisions.

We can talk about knowledge when the data and the information turn into a set of rules that assist decisions. In fact, we can't store knowledge, because it implies a theoretical or practical understanding of a subject. However, using predictive analytics we can simulate intelligent behavior and provide a good approximation. The image below shows an example of how to turn data into knowledge:

[Figure: An example of turning data into knowledge]

The Nature of Data

Data is the plural of datum, so it is traditionally treated as plural. We can find data in every situation of the world around us, structured or unstructured, continuous or discrete: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material of any kind of human activity. According to the Oxford English Dictionary, data are

"known facts or things used as basis for inference or reckoning".

As shown in the image below, we can see data in two distinct ways: categorical and numerical.

[Figure: Categorical and numerical data]

Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical values: nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, housing is a categorical variable with two categories (own and rent). An ordinal variable has an established ordering. For example, age can be a variable with three ordered categories (young, adult, and elder).

Numerical data are values or observations that can be measured. There are two kinds of numerical values: discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate, for example, the number of lines in a piece of code. Continuous data are values or observations that may take any value within a finite or infinite interval, for example, an economic time series such as historic gold prices.

The kinds of datasets used in this book are:

Emails (unstructured, discrete)

Digital Images (unstructured, discrete)

Stock Market Logs (structured, continuous)

Historic Gold Prices (structured, continuous)

Credit Approval records (structured, discrete)

Social Media Friends Relationships (unstructured, discrete)

Tweets and Trending Topics (unstructured, continuous)

Sales Records (structured, continuous)

For each of the projects in this book we try to use a different kind of data, aiming to give the reader the ability to address different kinds of data problems.

Social Networks Analysis

Formally, Social Network Analysis (SNA) is the analysis of social relationships in terms of network theory, with nodes representing individuals and ties representing relationships between them, as we can see in the image below. A social network creates groups of related individuals (friendships) based on different aspects of their interaction. We can extract important information, such as hobbies (useful for product recommendation) or who holds the most influential opinion in the group (centrality). In Chapter 10 we present a project, "Who is your closest friend?", and show a solution for Twitter clustering.

[Figure: A social network, with nodes representing individuals and ties representing relationships]

Social networks are strongly connected, and these connections are often not symmetric. This makes SNA computationally expensive; it needs to be addressed with high-performance solutions that are less statistical and more algorithmic.
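As a small taste of this kind of analysis, here is a sketch using NetworkX, a Python graph library that is not part of this book's toolset; the friendship ties are invented.

import networkx as nx

# Invented friendship ties between five people (undirected for simplicity)
G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Ann", "Eve"), ("Ann", "Dan"),
                  ("Bob", "Eve"), ("Dan", "Kim")])

# Degree centrality: who is connected to the largest share of the group?
centrality = nx.degree_centrality(G)
print(max(centrality, key=centrality.get))   # -> Ann

A directed graph (nx.DiGraph) would capture the asymmetric ties mentioned above, at the cost of heavier computation on large networks.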

Visualizing a social network can give us good insight into how people are connected. The graph is explored by displaying nodes and ties in various colors, sizes, and distributions. D3.js has animation capabilities that let us visualize the social graph with interactive animation, which helps us simulate behaviors such as information diffusion or distance between nodes.

Facebook processes more than 500 TB of data daily (images, text, video, likes, and relationships). This amount of data needs non-conventional treatment, such as NoSQL databases and MapReduce frameworks. In this book we work with MongoDB, a document-based NoSQL database, which also has great functions for aggregation and MapReduce processing.

Sensors and Cameras

Interacting with the outside world is highly important in data analysis. Using sensors like RFID (Radio-Frequency Identification), or a smartphone to scan a QR code (Quick Response code), is an easy way to interact directly with the customer, make recommendations, and analyze consumer trends.

On the other hand, people are using their smartphones all the time, with the camera as a tool. In Chapter 5 we use such digital images to perform search by image. This can be used, for example, in face recognition, or to find recommendations for a restaurant just by taking a picture of its front door.

The interaction with the real world can give you a competitive advantage and a real-time data source coming directly from the customer.

What about Big Data?

Big Data is a term used when the data exceeds the processing capacity of a typical database. We need big data analytics when data grows quickly and we need to uncover hidden patterns, unknown correlations, and other useful information.

There are three main features in big data:

Volume: Large amounts of data.

Variety: Different types of structured, unstructured and multi-structured data.

Velocity: Needs to be analyzed quickly.

The image below shows the interaction between the three Vs.

[Figure: The three Vs of big data: Volume, Variety, and Velocity]

Big Data is an opportunity for any company to gain advantages from data aggregation, data exhaust, and metadata. This makes Big Data a useful business analytics tool, but there is also a common misunderstanding about what Big Data actually is.

Apache Hadoop is the most popular implementation of MapReduce, used to solve large-scale distributed data storage, analysis, and retrieval tasks. However, MapReduce is just one of three classes of technologies for storing and managing Big Data; the other two are NoSQL and Massively Parallel Processing (MPP) data stores. In this book we implement MapReduce functions and NoSQL storage through MongoDB; see Chapter 12 and Chapter 13.

MongoDB provides us with document-oriented storage, high availability, and flexible Map/Reduce aggregation for data processing.
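To show the idea behind the MapReduce model itself, here is a minimal word-count sketch in plain Python; frameworks such as Hadoop or MongoDB run the same phases distributed across many machines. The toy documents are invented.

from itertools import groupby

documents = ["big data big ideas", "data beats ideas"]  # toy input

# Map phase: emit a (word, 1) pair for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: bring pairs with the same key (word) together
mapped.sort(key=lambda pair: pair[0])

# Reduce phase: sum the counts for each word
counts = {word: sum(count for _, count in group)
          for word, group in groupby(mapped, key=lambda pair: pair[0])}
print(counts)   # {'beats': 1, 'big': 2, 'data': 2, 'ideas': 2}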

A paper published by IEEE in 2009, "The Unreasonable Effectiveness of Data", states that "invariably, simple models and a lot of data trump more elaborate models based on less data". This is a fundamental idea in Big Data (you can find the full paper at http://bit.ly/RVeE8g). The trouble with real-world data is that the probability of finding false correlations is high, and it gets higher as the dataset grows. That is why in this book we focus on meaningful data rather than big data.

One of the main challenges of big data is how to store, protect, back up, organize, and catalog data at petabyte scale.

[Figure: Storing and managing data at petabyte scale]

Quantitative vs. Qualitative Data Analysis

Quantitative data consists of numerical measurements, expressed in terms of numbers.

Qualitative data consists of categorical measurements, expressed in terms of natural-language descriptions.

The image below shows the differences between quantitative and qualitative analysis.

[Figure: Quantitative versus qualitative analysis]

Quantitative analytics involves the analysis of numerical data. The level of measurement can influence the type of analysis we can use. There are four levels of measurement:

Nominal data has no logical order and is used as classification data.

Ordinal data has a logical order and differences between values are not constant.

Interval data is continuous and has a logical order. The data has standardized differences between values, but does not include a true zero.

Ratio data is continuous and ordered, and has standardized differences between values and a true zero.

Qualitative analysis can explore the complexity and meaning of social phenomena. Data for qualitative study may include written texts (e.g., documents or email) and/or audible and visual data (digital images or sounds). In Chapter 11 we present a sentiment analysis of Twitter data as an example of qualitative analysis.

Why data visualization matters

The goal of data visualization is to expose something new about the underlying patterns and relationships contained within the data. A visualization not only needs to be beautiful but also meaningful, in order to help organizations make better decisions. Visualization is an easy way to jump into a complex dataset (small or big) and to describe and explore the data efficiently. Many kinds of data visualization are available: bar charts, histograms, line charts, pie charts, heat maps, frequency Wordles (as shown in the image below), and so on, for one, two, or many variables, in one, two, or three dimensions.

[Figure: A frequency Wordle]

Data visualization is an important part of our data analysis process, because it is a fast and easy way to perform exploratory data analysis by summarizing a dataset's main characteristics in a visual graph.
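As a minimal example of this kind of exploratory look at a dataset, the sketch below draws a histogram with matplotlib, a common Python plotting library (distinct from the D3.js framework this book deploys on the web); the sample data is randomly generated.

import random
import matplotlib.pyplot as plt

# Randomly generated sample: 1,000 values from a normal distribution
data = [random.gauss(100, 15) for _ in range(1000)]

# A histogram summarizes the shape of the distribution at a glance
plt.hist(data, bins=30)
plt.xlabel("value")
plt.ylabel("frequency")
plt.title("Exploratory histogram of a sample")
plt.show()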

The goals of exploratory data analysis are listed as follows:

Detection of data errors.

Checking of assumptions.

Finding hidden patterns (such as trends).

Preliminary selection of appropriate models.

Determining relationships between the variables.

We will get into more details about data visualization and exploratory data analysis in Chapter 3.

Tools and Toys for this book

The main goal of this book is to provide the reader with self-contained projects that are ready to deploy. To do this, as you go through the book we will use and implement tools such as Python, D3.js, and MongoDB. These tools will help you program and deploy the projects. You can also download all the code from the author's GitHub repository:

https://github.com/hmcuesta

You can find a detailed installation and setup process for all the tools in Appendix A.

Why Python?

Python is a "scripting language": an interpreted language with its own built-in memory management and good facilities for calling and cooperating with other programs. There are two popular versions, 2.7 and 3.x; in this book we focus on the 3.x version, because it is under active development and has already seen over two years of stable releases.

Python is multi-platform: it runs on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines. Python has a powerful standard library and a wealth of third-party packages such as NumPy, SciPy, pandas, scikit-learn, and mlpy.

Python is excellent for beginners yet great for experts, and it is highly scalable: suitable for large projects as well as small ones. It is also easily extensible and object-oriented.

Python is widely used by organizations such as Google, Yahoo Maps, NASA, Red Hat, Raspberry Pi, and IBM:

http://wiki.python.org/moin/OrganizationsUsingPython

Python has excellent documentation and examples:

http://docs.python.org/3/

Python is free to use, even for commercial products, and can be downloaded for free from:

http://python.org/

Why mlpy?

mlpy (Machine Learning Python) is a module built on top of NumPy, SciPy, and the GNU Scientific Library. It is open source and supports Python 3.x. mlpy offers a large number of machine learning algorithms for supervised and unsupervised problems.

Some of the features of mlpy that will be used in this book are:

Regression: Support Vector Machines (SVM).

Classification: Support Vector Machines (SVM), k-Nearest-Neighbor (kNN), Classification Tree.

Clustering: k-means, Multidimensional Scaling.

Dimensionality Reduction: Principal Component Analysis (PCA).

Misc.: Dynamic Time Warping (DTW) distance.

We can download the latest version of mlpy from:

http://mlpy.sourceforge.net/

Reference: D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, C. Furlanello. mlpy: Machine Learning Python, 2012.

http://arxiv.org/abs/1202.6548

Why D3.js?

D3.js (Data-Driven Documents), developed by Mike Bostock, is a JavaScript library for visualizing data and manipulating the document object model that runs in a browser without a plugin. In D3.js you can manipulate all the elements of the DOM (Document Object Model); it is as flexible as the client-side web technology stack (HTML, CSS, and SVG).

D3.js supports large datasets and includes animation capabilities that make it a really good choice for web visualization.

D3 has excellent documentation, examples, and community:

https://github.com/mbostock/d3/wiki/Gallery

https://github.com/mbostock/d3/wiki

We can download the latest version of D3.js from:

http://d3js.org/d3.v3.zip

Why MongoDB?

NoSQL (Not Only SQL) is a term that covers different types of data storage technologies, used when you can't fit your business model into a classical relational data model. NoSQL is mainly used in Web 2.0 and social media applications.

MongoDB is a document-based database. This means that MongoDB stores and organizes data as collections of documents, which gives you the possibility of storing your view models almost exactly as you model them in the application. You can also perform complex searches for data and elementary data mining with MapReduce.

MongoDB is highly scalable and robust, and is a perfect fit for JavaScript-based web applications, because you can store your data as JSON (JavaScript Object Notation) documents, and it implements a flexible schema, which makes it well suited to unstructured data.
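Here is a minimal sketch of this document model using pymongo, the Python driver for MongoDB, assuming a MongoDB server running locally on the default port; the collection and documents are invented.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # assumes a local MongoDB server
db = client["test_db"]

# Documents are flexible JSON-like structures; no fixed schema is required
db.customers.insert_one({"name": "Ann", "city": "Leeds", "orders": 3})
db.customers.insert_one({"name": "Bob", "city": "Leeds"})  # a missing field is fine

# A simple aggregation: total orders per city
pipeline = [{"$group": {"_id": "$city", "total": {"$sum": "$orders"}}}]
for row in db.customers.aggregate(pipeline):
    print(row)   # -> {'_id': 'Leeds', 'total': 3}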

MongoDB is used by highly recognized corporations such as Foursquare, Craigslist, Firebase, SAP, and Forbes; a detailed list is available at:

http://www.mongodb.org/about/production-deployments/

MongoDB has a big and active community and well-written documentation:

http://docs.mongodb.org/manual/

MongoDB is easy to learn and it's free. We can download MongoDB from:

http://www.mongodb.org/downloads


