Specific Elements For A Smart City


In our age, the management of big data to produce knowledge and innovation is the element that will determine the growth of productivity.

In (Diriks, 2009) it is highlighted that every day we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is called big data.

Knowledge and innovation are determined by investment in data management, research, education, development, creativity and transmission.

The efficient management of big data is a solution for innovation, competition and productivity.

From big data, which offers us a lot of information, we can select, using decisions based on our knowledge, the useful solutions, and we can produce new knowledge and innovation (Figure 18).

Figure – Knowledge and innovation

The evolution and development of our society would in fact not exist without innovation. We can say that: "It is all about knowledge and innovation".

All solutions that we have are the result of someone inventing a better way of doing something, a better tool (Franklin, 2009).

In an economic crisis the need for knowledge and innovation is higher than ever, and the analysis of new solutions for managing data, such as open data, is very important.

Open data is the concept that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.

Open data offers new possibilities to analyze and visualize data from different sources.

Open data can make the world work better, and this is no exaggeration, because information is a crucial driving force of innovation. Most of the data in our world is generated by the public sector.

Moreover, in the Visby Declaration (Presidency of the European Council, 2009) the European Council stated that "European Union (EU) member states should seek to make data freely accessible in open machine-readable formats and stimulate the reuse of public sector information using open data".

In the European eGovernment Action Plan 2011-2015, the European Commission and the EU member states committed to "maximizing the value of re-use of public sector information (PSI), by making raw data and documents available for re-use in a wide variety of formats (including machine-readable ones) and languages and by setting up PSI portals" [38]. This highlights the necessity of using open data in every part of our cities.


Open data for public sector focuses on the following areas (Figure 19):

Transparency and accountability;

Participation in the decisions;

Decisions;

Communication with entrepreneurs and citizens;

Participation and citizen engagement;

Internal and external collaboration;

Innovation.

Intelligent processing of data is essential for addressing societal challenges. Data can for example be used to enhance the sustainability of national health care systems.

Over 30% of all data stored on earth is medical data, and it is growing rapidly; managing this volume is a challenge for the medical sector [40].

For the moment, open data in medicine means access only for medical professionals and patients, protected by laws, rules and regulations.

Open Government Data Initiative (OGDI) [39] from Microsoft is a cloud-based collection of open government software assets that enables publicly available government data to be easily accessible.

Using open standards and application programming interfaces (API), developers and government agencies can retrieve the data programmatically for use in new and innovative online applications.

The Microsoft solutions can [39]:

Encourage citizens and communities to participate with governments;

Enhance collaboration between government agencies and private organizations;

Increase government transparency;

Provide unique insight into data trends and analysis.

OGDI promotes the use of this data by capturing and publishing re-usable software assets, patterns, and practices. OGDI data is hosted in Windows Azure. It is accessible through open, standards-based web services from a variety of development environments, including Microsoft .NET, JavaScript, Adobe Flash, PHP, Ruby, Python, and others.
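As an illustration of how such standards-based services can be consumed, the following sketch retrieves a dataset from a hypothetical OGDI-style JSON endpoint; the URL, dataset name and field names are assumptions made for the example, not part of the actual OGDI catalogue.

```python
import json
import urllib.request

# Hypothetical OGDI-style endpoint; a real deployment exposes its own dataset URLs.
SERVICE_URL = "https://ogdi.example.gov/v1/datasets/TrafficSensors?format=json"

def fetch_open_data(url):
    """Retrieve a JSON dataset from a standards-based open data web service."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))

if __name__ == "__main__":
    # Assumes the service returns a JSON array of records.
    records = fetch_open_data(SERVICE_URL)
    # Field names ('sensorId', 'vehicleCount') are illustrative only.
    for record in records[:10]:
        print(record.get("sensorId"), record.get("vehicleCount"))
```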

IBM's initiative in this direction is based on Real Web 2.0 Linking Open Data (LOD) [2], a community initiative for moving the Web from the idea of separate documents to a wide information space of data.

The key principles of LOD are that it is simple, readily adaptable by Web developers, and complements many other popular Web trends.

Making data components easier to discover, more valuable, and easier for people to reuse increases the chances that the data will be used more widely, in ways the publisher might not anticipate [36].

A look back at recent history shows that data storing capacity is continuously increasing, while the cost per GB stored is decreasing (figure 1). The first hard disk came from IBM in 1956; it was called the IBM 350 Disk Storage and had a capacity of 5 MB (IBM, 2010). In 1980, the IBM 3380 broke the gigabyte-capacity limit, providing storage for 2.52 GB. After 27 years, Hitachi GST, which acquired IBM's drive division in 2003, delivered the first terabyte hard drive. Only two years later, in 2009, Western Digital launched the industry's first two-terabyte hard drive, and in 2011 Seagate introduced the world's first 4 TB hard drive [11].

Figure Average HDD capacity (based on (Smith, 2008))

In terms of price (Fern, 2011), the cost per gigabyte decreased from an average of $300,000 to merely an average of 11 cents over the last 30 years (figure 2). In fact, in 1981 one would have had to use 200 Seagate units, each with a five-megabyte capacity and costing $1,700, to store one gigabyte of data.

Figure Average $ cost per GB (based on (Smith, 2008))
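These figures can be verified with a few lines of arithmetic; the short sketch below reproduces the 1981 cost per gigabyte from the drive price and capacity quoted above (the result, roughly $340,000 per GB, is of the same order as the quoted $300,000 average).

```python
# 1981: a Seagate drive with 5 MB capacity priced at $1,700 (figures from the text).
drive_capacity_mb = 5
drive_price_usd = 1_700

drives_per_gb = 1_000 / drive_capacity_mb            # ~200 drives to hold one gigabyte
cost_per_gb_1981 = drives_per_gb * drive_price_usd   # ~$340,000 per gigabyte

print(f"{drives_per_gb:.0f} drives, about ${cost_per_gb_1981:,.0f} per GB in 1981")
print("versus roughly $0.11 per GB thirty years later")
```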

The reduced cost of data storage has been the main premise of the current data age, in which it is possible to record and store almost everything, from business to personal data. As an example, in the field of digital photos, it is estimated that over 3.5 trillion photos have been taken around the world and that around 360 billion new snapshots were made in 2011 alone (Good, 2011). Also, the availability and speed of Internet connections around the world have generated increased data traffic from a wide range of mobile devices and desktop computers. For mobile data alone, Cisco expects traffic to grow from 0.6 exabytes (EB) in 2011 to 6.3 EB in 2015 (Cisco, 2010).

The expansion of data communications has turned different data services, social, economic or scientific, into central nodes for storing and distributing large amounts of data. For example, the Facebook social network hosts more than 140 billion photos, more than double the 60 billion pictures at the end of 2010; in terms of storage, all these snapshots take up more than 14 petabytes. In other fields, like research, the Large Hadron Collider particle accelerator near Geneva produces about 15 petabytes of data per year, the SETI project records around 30 terabytes of data each month, processed by over 250,000 computers each day (SETI project, 2010), and the supercomputer of the German Climate Computing Center (DKRZ) has a storage capacity of 60 petabytes of climate data. In the financial sector, records of everyday financial operations generate huge amounts of data; the New York Stock Exchange alone records about one terabyte of trade data per day (Rajaraman, 2008).

Despite this spectacular evolution of storage capacities and repository sizes, the problem that arises is being able to process the data. This issue is generated by the available computing power, algorithm complexity and access speeds. This paper surveys different technologies used to manage and process large data volumes and proposes a distributed and parallel architecture used to acquire, store and process large datasets. The objective of the proposed architecture is to implement a cluster analysis model.

As professor Anand Rajaraman observed, more data usually beats a better algorithm [15]. The statement was used to highlight the efficiency of a proposed algorithm for the Netflix Challenge (Bennett, 2007). Although still debatable, it brings up a valid point: a given data mining algorithm yields better results with more data and can reach the same accuracy as a better or more complex algorithm. In the end, the objective of a data analysis and mining system is to process more data with better algorithms (Rajaraman, 2008).

In many fields more data is important because it provides a more accurate description of the analyzed phenomenon. With more data, data mining algorithms are able to extract a wider group of influence factors and more subtle influences.

Today, large datasets mean volumes of hundreds of terabytes or petabytes, and these are real scenarios. The problem of storing such large datasets comes from the impossibility of having a single drive of that size and, more importantly, from the large amount of time required to access it.

The access speed of large data volumes is affected by disk performance characteristics [22] (table 1): internal data transfer rate, external data transfer rate, cache memory, access time and rotational latency, all of which generate delays and bottlenecks.

Table Example of disk drive performance

HDD interface   | Spindle [rpm]   | Avg. rotational latency [ms] | Internal transfer [Mbps] | External transfer [MBps] | Cache [MB]
SATA            | 7,200           | 11                           | 1030                     | 300                      | 8 – 32
SCSI            | 10,000          | 4.7 – 5.3                    | 944                      | 320                      | 8
High-end SCSI   | 15,000          | 3.6 – 4.0                    | 1142                     | 320                      | 8 – 16
SAS             | 10,000 / 15,000 | 2.9 – 4.4                    | 1142                     | 300                      | 16

Source: Seagate

Despite the rapid evolution of drive capacity described in figure 1, large datasets of up to one petabyte can only be stored on multiple disks: using 4 TB drives requires 250 of them to store 1 PB of data. Once the storage problem is solved another question arises, regarding how easy or hard it is to read that data. Considering an optimal transfer rate of 300 MB/s, the entire dataset is read in 38.5 days. A simple solution is to read from all the disks at once; in this way, the entire dataset is read in only 3.7 hours.
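The read-time estimates above follow from simple arithmetic, sketched below for 1 PB at the assumed optimal transfer rate of 300 MB/s, first serially and then across 250 drives read in parallel.

```python
# Time to read 1 PB at 300 MB/s, serially and in parallel over 250 drives.
transfer_mb_per_s = 300
disks = 250                      # 250 x 4 TB drives hold 1 PB

dataset_mb = 1_000_000_000       # 1 PB expressed in MB (decimal units)
serial_days = dataset_mb / transfer_mb_per_s / 3600 / 24
parallel_hours = dataset_mb / (transfer_mb_per_s * disks) / 3600

print(f"serial read:   {serial_days:.1f} days")     # ~38.6 days
print(f"parallel read: {parallel_hours:.1f} hours")  # ~3.7 hours
```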

If we take into consideration the communication channel, then other bottlenecks are generated by the available bandwidth. In the end, the performance of the solution is reduced to the speed of the slowest component.

Other characteristics of large data sets add supplementary levels of difficulty:

many input sources; in different economic and social fields there are multiple sources of information;

redundancy, as the same data can be provided by different sources;

lack of normalization or data representation standards; data can have different formats, unique IDs, measurement units;

different degrees of integrity and consistency; data that describes the same phenomenon can vary in terms of measured characteristics, measuring units, time of the record, methods used.

For limited datasets the efficient data management solution is given by relational SQL databases, (Pavlo, 2009), but for large datasets some of their founding principles are eroded (Helland, 2011), (Pavlo, 2009), (Tamer, 1996).

The Extract, Transform and Load (ETL) process provides an intermediary transformations layer between outside sources and the end target database.

In the literature [29], [30] two approaches are identified: ETL and ELT. ETL (extract, transform and load) starts with applications that perform data transformations outside of a database, on a row-by-row basis, before loading; ELT (extract, load and transform) first loads the source data into a relational database and only then performs the transformations into target data.

The ETL process (Davenport, 2009) is based on three elements:

Extract – The process in which the data is read from multiple source systems into a single format. In this process data is extracted from the data source;

Transform – In this step, the source data is transformed into a format relevant to the solution. The process transforms the data from the various systems and makes it consistent;

Load – The transformed data is now written into the warehouse.

Usually the systems that acquire data are optimized so that the data is stored as fast as possible. Most of the time, comprehensive analyses require access to multiple sources of data, and it is common that those sources store raw data that yields minimal information unless properly processed.

This is where the ETL or ELT processes come into play. An ETL process will take the data, stored in multiple sources, transform it, so that the metrics and KPIs are readily accessible, and load it in an environment that has been modeled so that the analysis queries are more efficient (Vassiliadis, 2009).
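A minimal sketch of such an ETL flow is shown below; the CSV source files, column names and the SQLite target are illustrative assumptions, standing in for the commercial tools and the parallel DBMS discussed later.

```python
import csv
import sqlite3

def extract(paths):
    """Extract: read raw rows from multiple CSV sources into a single stream."""
    for path in paths:
        with open(path, newline="") as source:
            yield from csv.DictReader(source)

def transform(rows):
    """Transform: bring rows to a common, consistent format (keys, types, units)."""
    for row in rows:
        yield {
            "country": row["country"].strip().upper(),
            "year": int(row["year"]),
            "value_eur": round(float(row["value"]), 2),  # normalize to one currency/unit
        }

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows into the target store."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS indicators (country TEXT, year INTEGER, value_eur REAL)"
    )
    con.executemany("INSERT INTO indicators VALUES (:country, :year, :value_eur)", rows)
    con.commit()
    con.close()

# Hypothetical source files produced by the data crawler and harvester.
load(transform(extract(["eurostat.csv", "imf.csv"])))
```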

The ETL advantages are [47], (Dumey, 2007), (Albrecht, 2009):

save time and costs when developing and maintaining data migration tasks;

use for complex processes to extract, transform, and load heterogeneous data into a data warehouse or to perform other data migration tasks;

in larger organizations, ETL processes for different data integration and warehouse projects accumulate;

such processes encompass common sub-processes, shared data sources and targets, and same or similar operations;

ETL tools support all common databases, file formats and data transformations, simplify the reuse of already created (sub-)processes due to a collaborative development platform and provide central scheduling.

When developing the ETL process, there are two options: either take advantage of an existing ETL tool (some key players in this domain are IBM DataStage, Ab Initio and Informatica) or custom code it. Both approaches have benefits and pitfalls that need to be carefully considered when selecting what best fits the specific environment.

MapReduce is a programming model and an associated implementation for processing and generating large data sets (Dean, 2008). The model was developed by Jeffrey Dean and Sanjay Ghemawat at Google. The foundations of the MapReduce model are a map function used to process key-value pairs and a reduce function that merges all intermediate values associated with the same key.

The large data set is split in smaller subsets which are processed in parallel by a large cluster of commodity machines.

The Map function (Dean, 2004) takes input data and produces a set of intermediate subsets. The MapReduce library groups together all intermediate subsets associated with the same intermediate key and sends them to the Reduce function.

The Reduce function also accepts an intermediate key and its subsets. It merges these subsets to form a possibly smaller set of values; normally, zero or one output value is produced per Reduce invocation.
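A minimal, single-machine sketch of this map/shuffle/reduce flow, using the classic word-count example, is given below; real MapReduce implementations run the same two user-defined functions distributed over a cluster.

```python
from collections import defaultdict

def map_function(document):
    """Map: emit an intermediate (key, value) pair for every word in the input."""
    for word in document.split():
        yield word.lower(), 1

def reduce_function(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)

def map_reduce(documents):
    # Shuffle: group intermediate pairs by key before handing them to reduce.
    groups = defaultdict(list)
    for document in documents:
        for key, value in map_function(document):
            groups[key].append(value)
    return dict(reduce_function(key, values) for key, values in groups.items())

print(map_reduce(["big data beats better algorithms", "more data more knowledge"]))
```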

In [26] it is highlighted that many real-world tasks can be expressed with the MapReduce model. The model is used for web search services, for sorting and processing data, for data mining, for machine learning and for a large number of other systems.

A general MapReduce architecture can be illustrated as Figure 22.

Figure MapReduce architecture (based on (Yu, 2008))

The entire framework manages how data is split among nodes and how intermediary query results are aggregated.

The MapReduce advantages are (White, 2009), (Dean, 2010), (Carroll, 2008), (Dean, 2008), (Pavlo, 2008), (Stonebreaker, 2010), (Dean, 2004):

the model is easy to use, even for programmers without experience with parallel and distributed systems;

storage-system independence, as it does not require proprietary database file systems or predefined data models; data is stored in plain text files and is not required to respect relational data schemes or any structure; in fact the architecture can use data that has an arbitrary format;

fault tolerance;

the framework is available from high level programming languages; one such solution is the open-source Apache Hadoop project which is implemented in Java;

the query language allows record-level manipulation;

projects such as Pig and Hive (Yu, 2008) provide a rich interface that allows programmers to join datasets without repeating simple MapReduce code fragments.

Hadoop is a distributed computing platform, an open source implementation of the MapReduce framework proposed by Google (Yu, 2008). It is based on Java and uses the Hadoop Distributed File System (HDFS). HDFS is the primary storage system used by Hadoop applications; it creates multiple replicas of data blocks for reliability, distributes them around the cluster and splits tasks into small blocks. The relationship between Hadoop, HBase and HDFS is illustrated in Figure 23.


Figure The relationship between Hadoop, HBase and HDFS (based on (Yu, 2008))

HBase is a distributed database: an open source, distributed, versioned, column-oriented database modeled after Google's Bigtable [46].

Some of the features of HBase, as listed at [46], are:

convenient base classes for backing Hadoop MapReduce jobs with HBase tables including cascading, hive and pig source and sink modules;

query predicate push down via server side scan and get filters;

optimizations for real time queries;

a Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options;

extensible JRuby-based (JIRB) shell;

support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.

The HBase database stores data in labeled tables. In this context (Yu, 2008), a table is designed to have a sparse structure: data is stored in table rows, and each row has a unique key and an arbitrary number of columns.
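As a small illustration of this row-key/column model, the sketch below uses the Python happybase client against the Thrift gateway mentioned above; the host, table and column names are assumptions for the example, not part of any particular deployment.

```python
import happybase  # assumes a reachable HBase Thrift gateway and the happybase package

connection = happybase.Connection("hbase-host")           # hypothetical host name
connection.create_table("sensor_readings", {"cf": {}})    # one column family, default options
table = connection.table("sensor_readings")

# Each row has a unique key and an arbitrary set of columns inside the family.
table.put(b"sensor-42#2017-11-02T10:00", {b"cf:temperature": b"21.5", b"cf:unit": b"C"})
table.put(b"sensor-42#2017-11-02T10:05", {b"cf:temperature": b"21.7"})

print(table.row(b"sensor-42#2017-11-02T10:00"))
```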

A distributed database (DDB) is a collection of multiple, logically interconnected databases distributed over a computer network. A distributed database management system (distributed DBMS) is the software system that permits the management of the distributed database and makes the distribution transparent to the users. A parallel DBMS is a DBMS implemented on a multiprocessor computer (Özsu, 1996). Based on the above definitions, we can conclude that parallel database systems improve performance of data processing by parallelizing loading, indexing and querying data. In distributed database systems, data is stored in different DBMSs that can function independently. Because parallel database systems may distribute data to increase the architecture performance, there is a fine line that separates the two concepts in real implementations.

Despite the differences between parallel and distributed DBMSs, most of their advantages are common to a simple DBMS, (Pavlo, 2009):

stored data conforms to a well-defined schema; this validates the data and provides data integrity;

data is structured in a relational paradigm of rows and columns;

SQL queries are fast;

the SQL query language is flexible, easy to learn and read and allows programmers to implement complex operations with ease;

use hash or B-tree indexes to speed up access to data;

can efficiently process datasets up to two petabytes of data.

Known commercial parallel databases such as Teradata, Aster Data, Netezza (Henschen, 2008), DATAllegro, Vertica, Greenplum, IBM DB2 and Oracle Exadata have proven successful because they:

allow linear scale-up and linear speed-up (Özsu, 1996);

implement inter-query, intra-query and intra-operation parallelism (Özsu, 1996);

reduce the implementation effort;

reduce the administration effort;

provide high availability.

In a massively parallel processing architecture (MPP), adding more hardware allows for more storage capacity and increases queries speeds. MPP architecture, implemented as a data warehouse appliance, reduces the implementation effort as the hardware and software are preinstalled and tested to work on the appliance, prior to the acquisition. It also reduces the administration effort as it comes as a single vendor out of the box solution. The data warehouse appliances offer high availability through built-in fail-over capabilities using data redundancy for each disk.

Ideally, each processing unit of the data warehouse appliance should process the same amount of data at any given time. To achieve that, the data should be distributed uniformly across the processing units. Data skew is a measure of how data is distributed across the processing units; a data skew of 0, meaning that the same number of records is placed on each processing unit, is ideal.

By having each processing unit do the same amount of work it ensures that all processing units finish their task about the same time, minimizing any waiting times.

Another aspect that has an important impact on query performance is having all the related data on the same processing unit. This way, the time required to transfer data between processing units is eliminated. For example, if the user requires a sales-by-country report, having both the sales data for a customer and his geographic information on the same processing unit ensures that the unit has all the information it needs and each processing unit is able to perform its tasks independently.
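The two ideas above, hash-distributing rows on a key so that related records land on the same processing unit, and measuring the resulting skew, can be sketched as follows; the customer rows, the number of units and the particular skew measure are illustrative assumptions.

```python
from collections import Counter
from zlib import crc32

UNITS = 4  # number of processing units in the hypothetical MPP appliance

def unit_for(distribution_key: str) -> int:
    """Hash-distribute on the key so all rows of one customer land on the same unit."""
    return crc32(distribution_key.encode()) % UNITS

# Illustrative sales rows: (customer id, amount); the customer id is the distribution key.
sales = [(f"cust-{i}", 100 + i) for i in range(10_000)]
placement = Counter(unit_for(customer) for customer, _ in sales)

ideal = len(sales) / UNITS
skew = max(abs(count - ideal) for count in placement.values()) / ideal  # 0 means perfectly even
print(placement, f"relative skew: {skew:.3%}")
```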

The proposed architecture is used to process large financial datasets. The results of the data mining analysis help economists to identify patterns in economic clusters that validate existing economic models or help to define new ones. The bottom-up approach is more efficient, but more difficult to implement, because it can suggest, based on real economic data, relations between economic factors that are specific to the cluster model. The difficulty comes from the large volume of economic data that needs to be analyzed.

The architecture, described in figure 24, has three layers:

the input layer implements data acquisition processes; it gets data from different sources, reports, data repositories and archives managed by governmental and public structures, economic agencies and institutions, and NGO projects; some global sources of statistical economic data are Eurostat, the International Monetary Fund and the World Bank; the problem with these sources is that they use independent data schemes, and bringing them to a common format is an intensive data processing stage; taking into consideration national data sources or crawling the Web for free data, the task becomes a very complex one;

the data layer stores and processes large datasets of economic and financial records; this layer implements distributed, parallel processing;

the user layer provides access to data and manages requests for analysis and reports.

Figure Proposed architecture (Boja et al, 2012)

The ETL intermediary layer, placed between the first two main layers, collects data from the data crawler and harvester component, converts it into a new form and loads it into the parallel DBMS data store. The ETL layer normalizes data, transforms it based on a predefined structure and discards unneeded or inconsistent information.

The ETL layer inserts data in the parallel distributed DBMS that implements the Hadoop and MapReduce framework. The objective of the layer is to normalize data and bring it to a common format, requested by the parallel DBMS.

Using an ETL process, data collected by the data crawler and harvester gets consolidated, transformed and loaded into the parallel DBMS, using a data model optimized for data retrieval and analysis.

It is important that the ETL server also supports parallel processing, allowing it to transform large data sets in a timely manner. ETL tools like Ab Initio, DataStage and Informatica have this capability built in. If the ETL server does not support parallel processing, then it should just define the transformations and push the processing to the target parallel DBMS.

The user will submit his inquiries through a front end application server, which will convert them into queries and submit them to the parallel DBMS for processing.

The front end application server will include a user-friendly metadata layer that allows the user to query the data ad hoc; it will also include canned reports and dashboards.

The objective of the proposed architecture is to separate the layers that are data processing intensive and to link them by ETL services that will act as data buffers or cache zones. Also the ETL will support the transformation effort.

For the end user the architecture is completely transparent. All he will experience is the look and feel of the front end application.

Processing large datasets obtained from multiple sources is a daunting task, as it requires tremendous storing and processing capacities. Processing and analyzing large volumes of data also becomes non-feasible using a traditional serial approach. Distributing the data across multiple processing units and processing it in parallel yields linearly improved processing speeds.

When distributing the data it is critical that each processing unit is allocated the same number of records and that all the related data sets reside on the same processing unit.

Using a multi-layer architecture to acquire, transform, load and analyze the data ensures that each layer can use the best of breed for its specific task. For the end user, the experience is transparent: despite the number of layers behind the scenes, all that is exposed to him is a user-friendly interface supplied by the front end application server.

In the end, once the storing and processing issues are solved, the real problem is to search for relationships between different types of data (Helland, 2011). Others have done this very successfully, like Google in Web search or Amazon in e-commerce.

A Quality Management Framework in the context of actual services has the following components: the on-line service as object (entity), the on-line service development and delivery process as process, business and consumers as users, specific service requests as requirements, and the evaluation and measurement of the e-service to determine its quality.

The on-line service quality management process uses surveys and questionnaires to evaluate the following qualitative aspects of the e-service:

Awareness – the degree to which users are aware of the on-line service's existence and its features.

Expectations – what users think that the on-line service offers.

Accessibility – the degree to which all individuals can access the service regardless of education, age, sex, culture, religion or the existence of any physical handicap.

Driving reasons for use – what made the user access the on-line service instead of using the traditional method.

Use preventing reasons – what prevents the user from using the service.

Feedback on additional features needed – what users are requesting in order to enhance their experience while using the on-line service.

User impact – how the on-line service changes the user's routine.

Overall satisfaction – how satisfied the user is with the on-line service, overall.

The qualitative characteristics of the on-line service help construct an image of the current level of quality. They do have the disadvantage, though, that they can't be used in calculations, they are difficult to compare or aggregate, they can't be used in trend analysis, and targets can't be set for them.

The quantitative metrics of the quality management framework eliminate the disadvantages of qualitative evaluation. The following are metrics that can be included in the QMF for on-line services:

Accuracy represents the percentage of times the on-line service has provided accurate results to users' requests.

The degree of satisfaction (Pocatilu, 2007: pp.122-125) can be computed as:

$DS = \frac{\sum_{i=1}^{p} DSR_i}{TR}$

where:

DSR_i – the degree of satisfaction for the requirement i;

TR – total number of requirements;

p – the number of requirements.

The degree of satisfaction for a requirement is a value from 0 (no satisfaction) to 1 (fully satisfied).

Repeat consumers represent the percentage of users that have used the same on-line service more than once.

Awareness represents the percentage of targeted users that are aware of the e-service existence and its features.

Cost represents the fee that has to be paid to access the service. It can be expressed as per-use cost or per-membership cost. Per-use cost implies that the user pays a fee every time he accesses the service, whereas per-membership cost implies that the user pays a fee once per period, usually in advance, and gets access to the e-service for that period.

In (Pocatilu, 2007: pp.122-125) the cost of resources takes into account the category of resources and the cost per unit for each category:

$C = \sum_{i} NR_i \cdot p_i \cdot d_i$

where:

NR_i – number of resources from the category i;

p_i – price per unit for the resource category i;

d_i – units of usage for the resource category i.

The total cost of the on-line service can be defined as:

$TC = \sum_{i=1}^{k} c_i$

where:

k – the number of project phases;

c_i – the cost of all resources from the phase i.

Request satisfaction based on time represents the time consumed to access the on-line service. Depending on the e-service nature, it can be expressed in seconds, minutes, hours, days, months and even years.

where:

T – period of time

Oi – the output i (deliverables, results)

At a national level the following indexes are used for comparative assessment of the states’ ability to deliver on-line services and products to their citizens.

The web measure index is based on a five-stage model (emerging, enhanced, interactive, transactional and connected) and ranks countries based on their progress through the various stages.

Telecommunication Infrastructure Index is a composite index of five primary indices, each weighting 20% in the total value of the index:

Internet users / 100 persons

PCs / 100 persons

Main telephone lines / 100 persons

Cellular telephones / 100 persons

Broadbanding / 100 persons

Human Capital Index is a composite index of the adult literacy rate and gross enrollment ratio. Adult literacy rate is weighted 67% and gross enrollment ratio is weighted 33%.

Readiness index is a composite index comprising the web measure index, the telecommunication infrastructure index and the human capital index.
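The two composite indices whose weights are fully specified above can be computed directly; the sketch below applies the stated weights (20% per component for the Telecommunication Infrastructure Index, 67%/33% for the Human Capital Index) to hypothetical, already normalized component scores.

```python
def telecom_infrastructure_index(internet, pcs, main_lines, cellular, broadband):
    """Five primary indices, each weighting 20% of the total (normalized 0..1 inputs)."""
    return 0.2 * (internet + pcs + main_lines + cellular + broadband)

def human_capital_index(adult_literacy, gross_enrollment):
    """Adult literacy weighted 67%, gross enrollment ratio weighted 33%."""
    return 0.67 * adult_literacy + 0.33 * gross_enrollment

# Hypothetical normalized component scores for one country.
tii = telecom_infrastructure_index(0.55, 0.40, 0.30, 0.90, 0.25)
hci = human_capital_index(0.97, 0.79)
print(f"TII = {tii:.4f}, HCI = {hci:.4f}")
```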

Table E-Government Readiness for Eastern Europe

Country        | Indicator 2008 | Indicator 2010 | Position 2008 | Position 2010
Czech Republic | 0.6696         | 0.6060         | 25            | 33
Hungary        | 0.6494         | 0.6315         | 30            | 27
Poland         | 0.6134         | 0.5582         | 33            | 45
Slovakia       | 0.5889         | 0.5639         | 38            | 43
Ukraine        | 0.5728         | 0.5181         | 41            | 54
Bulgaria       | 0.5719         | 0.5590         | 43            | 44
Romania        | 0.5383         | 0.5479         | 51            | 47
Belarus        | 0.5213         | 0.4900         | 56            | 65
Russia         | 0.5120         | 0.5136         | 60            | 59
Moldavia       | 0.4510         | 0.4611         | 93            | 80

As we can see in the table above, Romania ranks 6th in Eastern Europe and 47th in the world in the United Nations e-Government Survey 2010 report.


