The Temporal Evolution Of A System

02 Nov 2017


1. Introduction

The human brain is the most complex system we know of, and it has been a topic of research interest for a long time. It is a large network of approximately 100 billion cells that are connected by about 100 trillion synapses. For as long as researchers have tried to understand how the human brain works, they have tried to model or rebuild it in artificial neural networks. Today, artificial neural networks are used in a variety of tasks, e.g. in learning and controlling robot behaviors [Ito04, RS09, HBO+12], stock market prediction [LYS11, KAYT90] or landslide detection [CL04]. All these studies use a specific type of neural network called a Recurrent neural network³. These networks allow loops between neurons and can be described as dynamical systems, which are mathematical concepts that describe the development of a system over time. While Recurrent neural networks are widely used, it is not well understood how the internal processes inside a recurrent network behave. We want to offer a general approach to measure information storage in Recurrent neural networks and investigate the internal behavior for one specific type of them.

1.1. Motivation

Echo state networks (ESN) [Jae01], the particular type of neural network we want to analyze in this thesis, are an instance of input-driven dynamical systems. The theory behind ESNs is a new approach to train Recurrent neural networks (RNN) by adding two main features:

1. The hidden layer (here called the Reservoir) is randomly initialized and only the output weights are trained.

2. The hidden layer should have a fading memory property to keep the network stable.

ESNs and some other recent developments in neural network theory are summarized under the term Reservoir computing (RC). ESNs are able to achieve excellent performance using just a linear readout from the reservoir [JH04]. Previous studies [NBL05] also discovered that the best task performance occurs around the so-called Edge of Chaos. Here, Edge of Chaos describes the state of a dynamical system when it is in the phase transition between the ordered and the chaotic phase. However, reservoirs are still randomly initialized, which can lead to networks with bad performance. This problem occurs because in RC we do not train the reservoir itself, so once we generate a bad reservoir, its performance will most likely remain suboptimal. To resolve this problem it is necessary to find principles for generating reservoirs that are optimally suited for a certain task.

To find these principles we want to focus on knowing and understanding the reservoir's internal computational capabilities such as information storage, transfer and modification. The computational capabilities are certainly not the only thing to consider, but they are very important. There have been several approaches to measure computational capabilities inside neural networks, for example a recently published framework by Lizier [LPZ08a]. He suggests measuring computational capabilities in three atomic measures: information storage, transfer and modification. While Lizier's framework has been successfully used to analyze different dynamical systems such as cellular automata or random Boolean networks [LPZ08b], focusing on input-driven dynamical systems we discovered that these measures do not take input into account explicitly, which may lead to distorted results.

³ The terms highlighted by an italic font will be explained in detail in section 2.

1.2. Problem Statement

We extract two main problems from the motivation.

1. How can we measure the computational capabilities inside an input-driven dynamical system?

2. How do the computational capabilities (information storage, transfer and modification) correlate with the performance inside an ESN, and what kind of principles can we derive to generate reservoirs better suited for a certain task?

1.3. Contributions

First we introduce a new measure for input-driven dynamical systems, which takes the input into account explicitly and measures information storage in a certain neuron. We show with a simple example why we need a new measure before we test it on a more complex task. Afterwards we compare the results to other information-theoretical measures. For the second part we examine how the task performance correlates with the internal computational capabilities. It will be shown that with the new measures the computational capabilities can explain the performance better than was possible in previous studies with a similar aim [BOL+12]. We also found that the probability of positive information storage for one neuron at any one time step and the probability that an information modification event occurs are both maximized around the edge of chaos.

1.4. Structure of the Thesis

The rest of the thesis is organized as follows.

The basic methods used in the thesis are described in the next section. We explain the terms used in the introduction in detail before we describe the basic ideas of Echo State Networks and explain the most important equations. Quantifying information inside an input-driven dynamical system is another important tool for this work. We review the basic ideas of Shannon's information theory and trace the path to Active Information Storage, Transfer Entropy and information modification. All three measures are explained in detail in this section as well.

In the third section we introduce the most relevant related work. First we review the history of information dynamics in complex systems, highlighting the most relevant research results. Afterwards we explain in detail the work by Boedecker et al. on information dynamics in neural networks, which is the study preceding this thesis.

Section 4 presents our new approach to measure information storage in input-driven dynamical systems. The new measure takes the input into account explicitly and enables us to measure information storage in networks where the input is the driving factor for the dynamics. We compare the new measure with the already existing ones and show with two simple examples where problems occur when using the existing measures. In the last part the new measure is related to interaction information and Partial Information Decomposition.

We use the measures from sections 2 and 4 to quantify the internal computational capabilities inside an Echo State Network. The experimental setup and the approach to estimate criticality are introduced before presenting the results for task performance, information storage, transfer and modification. We show that the problems of measuring information storage in input-driven systems occur not only in theoretical examples but also in a non-trivial complex system.

In section 5 we discuss the results from the two previous sections and point out some possible topics for future work.

2. Basic Methods

In this section we want to explain the basic concepts and approaches we used during our

studies.

2.1. Disambiguation

We mentioned some terms in the introduction; these and a few other terms used throughout this thesis have to be clarified.

Figure 1: Model of a Feed Forward Network (left). Every input gets forwarded throughout the network. Recurrent neural networks (right) allow directed cycles between units. The network has an internal state space which allows it to exhibit dynamic temporal behavior. [Jae05]

Neural networks consist in most cases of an input layer, a hidden layer and an output layer. There exist two major types: feed-forward and recurrent neural networks. Feed-forward neural networks (FNN) forward activation from the input through the hidden layer to the output (see figure 1, left). They are an interconnection of perceptrons where the data flows only in one direction. From a mathematical point of view a FNN can be described as an input-output mapping. The most popular training algorithm for FNNs is the Backpropagation algorithm, developed by David Rumelhart in 1986 [RHW88]. The backpropagation algorithm can be explained in three phases:

• An input pattern is propagated through the network and the output is calculated.

• The actual output is compared to the desired output; the error is the difference between them.

• The error is backpropagated from the output to the input, adjusting the weights between the neurons depending on their influence on the error. This should lead to an approximation of the desired result.
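The three phases can be sketched for a tiny sigmoid network. The 2-3-1 architecture, the appended bias input, the learning rate and the XOR task below are illustrative choices of ours, not taken from [RHW88]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-3-1 sigmoid network (a constant 1 is appended to the
# input to act as a bias); all sizes and constants are illustrative.
W1 = rng.normal(size=(3, 3))   # (input + bias) -> hidden weights
W2 = rng.normal(size=(1, 3))   # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, eta=0.3):
    global W1, W2
    # Phase 1: propagate the input pattern and calculate the output.
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    # Phase 2: the error is the difference between actual and desired output.
    error = y - target
    # Phase 3: backpropagate the error, adjusting each weight according to
    # its influence on the error.
    delta_out = error * y * (1 - y)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    W2 -= eta * np.outer(delta_out, h)
    W1 -= eta * np.outer(delta_hid, x)
    return float(error @ error)

patterns = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

def epoch():
    return sum(backprop_step(np.array(x + [1.0]), np.array(t, float))
               for x, t in patterns)

initial_error = epoch()
for _ in range(5000):
    epoch()
final_error = epoch()   # repeated weight updates shrink the squared error
```

Training on XOR here is only a smoke test of the three phases; any differentiable network and loss would do.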

Recurrent neural networks (RNN) are a closer model of biological neural networks. Similar to a biological neural network, they allow connections between the units to form directed cycles (see figure 1, right). This allows RNNs to operate on an internal state space in addition to the input state space FNNs operate on. The internal state space is a trace of what has already been processed by the network. This makes them applicable to tasks such as the learning of many behaviors or sequence processing - tasks FNNs are not able to solve. Described mathematically, an RNN can be considered a dynamical system - which we explain in detail later on. One of the main issues of "classical" RNNs is the relatively difficult training. The most used methods are Backpropagation Through Time (BPTT) [Wer90] and Real-Time Recurrent Learning (RTRL) [WZ89]. Both of them are computationally complex and expensive. Nevertheless, RNNs gained popularity in the last couple of years, which has mainly to do with new, more efficient training methods. There are several other approaches like Long Short-Term Memory [HS97] and Evolino [SWGG07], but in this thesis we focus on an approach called Reservoir Computing (RC) [SVC07].

Ten years ago a new idea for training Recurrent Neural Networks was proposed independently by two researchers. One is called Echo State Network [Jae01, JH04], developed by Jaeger; the other is called Liquid State Machine [MNM02] and was introduced by Maass. Both ideas share the same principle of using large, randomly initialized hidden layers and training only the output layer instead of every connection inside the hidden layer. This makes training and using RNNs a lot easier and less cost-intensive in terms of computing power. These approaches and the more recently explored Backpropagation-Decorrelation learning rule [SS05] are subsumed under the name of Reservoir Computing [Jae07]. Reservoir is the key term behind these approaches: the dynamical system behaves similarly to a water-filled reservoir. If we put a drop of water into the reservoir, the information about the action fades out in waves (see figure 2). Instead of water-filled reservoirs, ESNs use RNNs.

A dynamical system is a mathematical concept which describes the development of a system over time. In [Mei07] it is defined as follows:

A dynamical system is a state space S, a set of times T and a rule R for evolution, R : S × T → S, that gives the consequent state s' of a state s ∈ S. A dynamical system can be considered to be a model describing the temporal evolution of a system.

The system is pictured by a state vector whose coordinates describe the phase/state space at any instance. The state space can be discrete or continuous. In a discrete dynamical system the state changes in equidistant time steps, while a continuous system changes its state in infinitesimal time steps. For our case the distinction between autonomous and input-driven systems is important as well. An autonomous system (e.g. a cellular automaton) develops its dynamics depending only on its previous states. Input-driven systems (e.g. Echo State Networks) on the other hand receive input data, which is one of the driving factors determining the following state. Every dynamical system has a parameter that controls the phase the system is currently in. The system can be in the stable phase, the chaotic phase or the phase transition.
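The autonomous/input-driven distinction can be illustrated with a minimal one-dimensional discrete-time system; the tanh update rule and the weights below are hypothetical choices of ours:

```python
import numpy as np

def step(x, u, w=0.5, w_in=1.0):
    # Rule R: the next state depends on the previous state and, for an
    # input-driven system, on the external input u as well.
    return np.tanh(w * x + w_in * u)

def run(inputs, x0):
    states, x = [], x0
    for u in inputs:
        x = step(x, u)
        states.append(float(x))
    return states

# Autonomous case (zero input): the dynamics depend only on previous
# states and, with |w| < 1, contract toward a fixed point.
autonomous = run([0.0] * 50, x0=0.9)
# Input-driven case: the same rule, but the input keeps driving the state.
driven = run(list(np.sin(0.5 * np.arange(50))), x0=0.9)
```

The autonomous trajectory dies out while the driven one keeps responding to its input, which is exactly the distinction used in the text.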


Figure 2: The idea behind Reservoir computing, illustrated by a water reservoir. The information about the drop falling into a reservoir of water fades out over time.

Figure 3: Basic model of an Echo State Network, with input (weights W^in), reservoir (weights W) and output (weights W^out). In ESNs usually just the connections represented by the dashed lines (the output weights) are trained. The other connections are initialized randomly and stay fixed. The reservoir has a fading memory property.

2.2. Echo State Networks

Working with input-driven dynamical systems, we decided to use Echo State Networks, developed by Jaeger in 2001 [Jae01], for our experiments. ESNs are an approach to train and design RNNs whose basic ideas are similar to Liquid State Machines, which were invented simultaneously by Maass [MNM02]. A basic model of an Echo State Network is shown in figure 3.

An ESN is a recurrent neural network; it consists of an input u, a reservoir x and an output o. The connections between these nodes are stored in the matrices W^in, W and W^out for input-to-reservoir, reservoir-to-reservoir and reservoir-to-output connections. Activations for k input units at time step n are u(n) = (u_1(n), u_2(n), ..., u_k(n)). The activations of internal units are updated according to:

x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n))    (1)

where f is the internal activation function (usually a logistic or tanh function) for a neuron. W^back stores the weights from output to reservoir in case output feedback is used. The system's output is updated as follows:

o(n+1) = f^out(W^out(u(n+1), x(n+1), o(n)))    (2)
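Equations (1) and (2) can be sketched as follows. The sizes, weight ranges, and the omission of the optional output-feedback term are our own simplifications, not prescriptions from [Jae01]:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, L = 20, 1, 1   # reservoir, input and output sizes (chosen arbitrarily)

W_in = rng.uniform(-0.5, 0.5, size=(N, K))
W = rng.uniform(-0.5, 0.5, size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep the reservoir tame
W_out = rng.uniform(-0.5, 0.5, size=(L, N))       # the only part that gets trained

def update(x, u):
    # Eq. (1) without the optional output-feedback term W_back y(n):
    # x(n+1) = f(W_in u(n+1) + W x(n)), with f = tanh.
    return np.tanh(W_in @ u + W @ x)

def readout(x):
    # A simplified Eq. (2): a linear readout from the reservoir state alone.
    return W_out @ x

x = np.zeros(N)
for n in range(100):
    x = update(x, np.array([np.sin(0.2 * n)]))
o = readout(x)
```

The tanh nonlinearity keeps every reservoir activation strictly inside (-1, 1), which is what makes the simple linear readout workable.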

One of the main issues of "normal" RNNs is slow convergence. To solve this, ESNs introduce two new features:

1. The hidden layer should have a fading memory property to keep the network stable.

2. The reservoir is randomly initialized and only the output weights are trained.

To ensure that the network does not amplify itself due to its random initialization, reservoirs are designed following the echo state property. In [BOL+12] Boedecker et al. define the

echo state property as follows. For a given time-discrete recursive function:

x_{t+1} = F(x_t, u_{t+1})    (3)

where x ∈ R^n, x_t is the internal state and u_{t+1} some input given into the system, they state:

"Assume an infinite stimulus sequence ū^∞ = u_0, u_1, ... and two random initial internal states of the system x_0 and y_0. From both initial states x_0 and y_0 the sequences x̄^∞ = x_0, x_1, ... and ȳ^∞ = y_0, y_1, ... can be derived from the update equation Eq. (3) for x_{t+1} and y_{t+1}. If, for all right-infinite input sequences ū^{+∞} = u_t, u_{t+1}, ... taken from some compact set U, for any (x_0, y_0) and all real values ε > 0, there exists a δ(ε) for which ||x_t − y_t|| ≤ ε for all t ≥ δ(ε) (where || · || is the Euclidean norm), the system F(·) will have the echo state property relative to the set U."

In simple words, the echo state property is fulfilled if the system converges for different initial states. In practice this is mostly achieved by dividing all randomly generated weights in W^res by the spectral radius (the largest absolute eigenvalue) of W^res. This fulfills the echo state property most of the time, but it does not guarantee it. In a network with this property, every input signal given into the system will trigger an internal response signal. An input given into the system is nonlinearly transformed by the reservoir. This information can be used through a linear readout by the output layer. To achieve the desired output the ESN has to be trained. There are two main approaches to train a network: offline and online training.
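The scaling recipe, together with an empirical convergence check in the spirit of the definition above, might look as follows. The reservoir size, the scaling factor 0.8 and the input sequence are arbitrary choices, and since scaling by the spectral radius does not guarantee the property, the check here is purely empirical:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 30

W_res = rng.normal(size=(N, N))
# Divide by the spectral radius, then shrink a bit below 1. This is the
# common recipe; it usually, but not provably, yields the echo state property.
W_res *= 0.8 / np.max(np.abs(np.linalg.eigvals(W_res)))
W_in = rng.normal(size=(N, 1))

def trajectory(x0, inputs):
    # Drive the reservoir from a given initial state with a fixed input stream.
    x = x0
    for u in inputs:
        x = np.tanh(W_in @ u + W_res @ x)
    return x

# Two different random initial states, the same input sequence: if the echo
# state property holds, the trajectories converge toward each other.
inputs = [np.array([np.sin(0.3 * n)]) for n in range(300)]
x_T = trajectory(rng.normal(size=N), inputs)
y_T = trajectory(rng.normal(size=N), inputs)
distance = float(np.linalg.norm(x_T - y_T))
```

A small final distance is evidence (not proof) that this particular reservoir forgets its initial state, i.e. has the fading memory the text describes.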

Offline training

After generating a new network and ensuring the echo state property is fulfilled, the network is initialized with either 0 or random values. In the next step we start to feed the teacher input into the reservoir. Since the network is randomly initialized we have to wash out the initial memory by running the network for n_0 steps. After ensuring the initial memory is gone, we store the internal reservoir states as well as the output states in the matrices R and OUT. The newly trained output layer is calculated as:

W^out = (Pseudoinverse(R) · OUT)^T    (4)
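Offline training per Eq. (4) can be sketched with a hypothetical teacher task (recalling the input delayed by one step). All sizes, the washout length and the task itself are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
W_in = rng.uniform(-0.5, 0.5, size=(N, 1))
W = rng.uniform(-0.5, 0.5, size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

T, washout = 500, 100                 # washout plays the role of n_0
u = rng.uniform(-0.5, 0.5, size=T)
target = np.roll(u, 1)                # teacher: recall the input one step back

x = np.zeros(N)
R_rows, OUT_rows = [], []
for n in range(T):
    x = np.tanh(W_in @ np.array([u[n]]) + W @ x)
    if n >= washout:                  # discard states before the initial
        R_rows.append(x.copy())       # memory has been washed out
        OUT_rows.append([target[n]])
R, OUT = np.array(R_rows), np.array(OUT_rows)

# Eq. (4): W_out = (Pseudoinverse(R) . OUT)^T
W_out = (np.linalg.pinv(R) @ OUT).T

pred = (W_out @ R.T).ravel()
mse = float(np.mean((pred - OUT.ravel()) ** 2))
```

Because the pseudoinverse gives the least-squares readout in one shot, no iterative weight updates inside the reservoir are needed.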

Online Training

There are several algorithms to train ESNs online. Here we describe an online training algorithm introduced in [Ven07]. The first few steps of online learning are the same as for offline learning. The difference is that after calculating the internal states, we no longer store them but calculate the output right away. The estimated output ŷ_{t+1} is compared to the actual output y_{t+1} and the error vector e_y is calculated:

e_y = ŷ_{t+1} − y_{t+1}    (5)

Afterwards the output weights are updated according to equation (6):

W^out(n+1) = W^out(n) + η x(n+1)^T e_y(n+1) + γ x(n)^T e_y(n)    (6)

where η is the learning rate and γ is the momentum gain.
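A minimal sketch of one application of Eq. (6) follows; the function name and the hand-picked numbers are ours, and note that whether the correction is added or subtracted in practice depends on the sign convention chosen for e_y in Eq. (5):

```python
import numpy as np

def online_update(W_out, x_new, x_old, e_new, e_old, eta=0.01, gamma=0.001):
    # Eq. (6): W_out(n+1) = W_out(n) + eta x(n+1)^T e_y(n+1) + gamma x(n)^T e_y(n)
    # Written with outer products so the shapes line up: W_out is L x N,
    # the reservoir states x have length N, the error vectors e_y length L.
    return W_out + eta * np.outer(e_new, x_new) + gamma * np.outer(e_old, x_old)

# A single step with simple numbers, purely to illustrate the shapes.
W_out = np.zeros((1, 2))
W_next = online_update(W_out,
                       x_new=np.array([1.0, 2.0]), x_old=np.zeros(2),
                       e_new=np.array([0.5]), e_old=np.zeros(1),
                       eta=0.1, gamma=0.0)
```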

2.3. Quantifying information in dynamical systems

In the early 1990s it was suggested that the computational capabilities of dynamical systems are maximized around the order-chaos phase transition [Lan90] [Wol84]. Further studies discovered for input-driven dynamical systems, such as reservoir computing networks, that the likelihood of maximal computational performance is highest around the edge of chaos [LM07] [BSL10]. Here the order-chaos phase transition, also called the edge of chaos, describes the transition between the ordered phase of a network, where activities die out quickly, and the chaotic phase, where activities amplify themselves. However, it is not completely understood why we find increased performance around the edge of chaos and how reservoirs should be designed to compute a given task in an optimal way. To get a better understanding we want to look at the internal computational capabilities of an Echo State Network. A framework to describe information flows in complex systems can be found in information theory [SW49]. In the following we describe basic concepts of Shannon's information theory, as well as the measures of information storage, transfer and modification that are based on Shannon's theory.

2.3.1. Review Information theory

One of the basic concepts in Shannon's theory is entropy. Entropy describes the average uncertainty over a random variable X in bits and is defined as

H_X = −Σ_x p(x) log2 p(x)    (7)

The entropy is the average amount of information contained in the variable X.

The joint entropy for two (or more) variables quantifies the uncertainty of a joint distribution:

H_{X,Y} = −Σ_{x,y} p(x,y) log2 p(x,y)    (8)

The conditional entropy of X given Y is the average uncertainty of X in case Y is known:

H_{X|Y} = −Σ_{x,y} p(x,y) log2 p(x|y)    (9)
        = H_{X,Y} − H_Y    (10)

Mutual information is a way to measure the mutual dependence between two random variables X and Y. It measures the reduction of uncertainty about Y by learning the value of X, or vice versa:

I_{X,Y} = Σ_{x,y} p(x,y) log2 ( p(x,y) / (p(x) p(y)) )    (11)

Intuitively, the mutual information measures the information that X and Y share. Based on these basic concepts Lizier introduced a framework to quantify information in dynamical systems. In the following we describe the framework in detail.
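The quantities in Eqs. (7)-(11) can be estimated from observed frequencies with a simple plug-in estimator; the helper names below are ours:

```python
import math
from collections import Counter

def entropy(xs):
    # H_X = -sum_x p(x) log2 p(x), Eq. (7), with p estimated from frequencies.
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def joint_entropy(xs, ys):
    # H_{X,Y}, Eq. (8), computed over the paired samples (x, y).
    return entropy(list(zip(xs, ys)))

def conditional_entropy(xs, ys):
    # H_{X|Y} = H_{X,Y} - H_Y, Eq. (10).
    return joint_entropy(xs, ys) - entropy(ys)

def mutual_information(xs, ys):
    # I_{X,Y} = H_X - H_{X|Y}: the reduction of uncertainty about X given Y.
    return entropy(xs) - conditional_entropy(xs, ys)

coin = [0, 1] * 500    # a fair coin: one bit of uncertainty
const = [7] * 1000     # a constant: zero bits
```

A fair coin carries one bit, shares that bit fully with itself, and shares nothing with an independent constant - the sanity checks one would expect from Eqs. (7)-(11).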

2.3.2. Active Information Storage (AIS)

Figure 4: Active Information Storage for one node X. It measures the amount of information in the past (x_n^(k)) that is required to predict its own future (x_{n+1}). [Boe12]

In our studies we want to investigate information dynamics on a local level. Information storage describes the amount of information in a process's past that is required to predict its future. One attempt to measure information storage is Excess entropy [Gra86]. The Excess entropy is the mutual information between the semi-infinite past and the semi-infinite future of a system:

E_X(k) = I_{X^(k); X^(k+)}    (12)

Thereby Excess entropy measures the information in the past that is used at some point in the future. Because the dynamics of a computation are computed step by step, we want to know about the stored information that is actually used in the next computed step. Therefore we use a measure called Active Information Storage [LPZ12], which is defined as the mutual information between the semi-infinite past X^(k) and the next state X_{n+1}, compared to the probability that the events occur independently:

a_x(n+1) = lim_{k→∞} log2 [ p(x_n^(k), x_{n+1}) / (p(x_n^(k)) p(x_{n+1})) ]    (13)

A_X = ⟨a_x(n)⟩_n    (14)

Intuitively, the AIS measures how much information the history X^(k) provides that helps us predict the next step X_{n+1} (see also figure 4). AIS has been successfully used in several neural computation applications such as random Boolean networks [LPZ08c] or swarm research [WML+11].
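Eqs. (13)-(14) can be estimated for a binary time series with a plug-in estimator, using a finite history length k in place of the k → ∞ limit; the function name and test process are our illustrative choices:

```python
import math
from collections import Counter

def local_ais(series, k):
    """Plug-in estimate of the local AIS a_x(n+1) from Eq. (13),
    with a finite history length k instead of the limit."""
    pasts = [tuple(series[i - k:i]) for i in range(k, len(series))]
    nexts = series[k:]
    n = len(pasts)
    c_joint = Counter(zip(pasts, nexts))
    c_past = Counter(pasts)
    c_next = Counter(nexts)
    return [math.log2((c_joint[(p, x)] / n) / ((c_past[p] / n) * (c_next[x] / n)))
            for p, x in zip(pasts, nexts)]

# A period-2 process stores exactly the one bit needed to know its phase,
# so every local value comes out as 1 bit.
period2 = [0, 1] * 100
ais_vals = local_ais(period2, k=2)
A_X = sum(ais_vals) / len(ais_vals)   # Eq. (14): the average over n
```

The same estimator applied to an i.i.d. process would return values fluctuating around zero, since the past then provides no help in predicting the next step.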

2.3.3. Transfer Entropy

Transfer Entropy (TE) is a measure introduced by Schreiber [Sch00] which quantifies the information transferred between two variables X and Y. In our case we want to know how much information Y provides to reduce the uncertainty about X's next step that is not already stored in X's history (see figure 5). Transfer Entropy is defined as the mutual information between the next state of the destination x_{n+1} and the previous state of the source y_n, conditioned on the destination's history x_n^(k). The history is again semi-infinite:

t_{y→x}(n+1) = lim_{k→∞} log2 [ p(x_{n+1} | x_n^(k), y_n) / p(x_{n+1} | x_n^(k)) ]    (15)

T_{Y→X} = ⟨t_{y→x}(n)⟩_n    (16)

Information transfer is a phenomenon that occurs in complex systems. For example, it can be found in particles in cellular automata [HSC01, Mit96, Wue98].
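A plug-in estimate of Eqs. (15)-(16) with finite history k might look like this; the one-step copy process used to sanity-check it is a standard toy example, not taken from the thesis:

```python
import math
import random
from collections import Counter

def transfer_entropy(src, dst, k=1):
    """Plug-in estimate of the average TE T_{Y->X}, Eq. (16), with history k."""
    rows = [(tuple(dst[i - k:i]), src[i - 1], dst[i]) for i in range(k, len(dst))]
    total = len(rows)
    c_hyz = Counter(rows)
    c_hy = Counter((h, y) for h, y, _ in rows)
    c_hz = Counter((h, z) for h, _, z in rows)
    c_h = Counter(h for h, _, _ in rows)
    te = 0.0
    for (h, y, z), c in c_hyz.items():
        # local t = log2 [ p(z | h, y) / p(z | h) ], Eq. (15), weighted by p
        te += (c / total) * math.log2((c / c_hy[(h, y)]) / (c_hz[(h, z)] / c_h[h]))
    return te

random.seed(0)
Y = [random.randint(0, 1) for _ in range(5000)]
X = [0] + Y[:-1]                      # x_{n+1} = y_n: a pure one-step copy
te_copy = transfer_entropy(Y, X, k=1)             # should approach 1 bit
te_const = transfer_entropy([0] * 5000, X, k=1)   # a constant transfers nothing
```

Copying one bit per step yields a TE near 1 bit, while a constant source transfers exactly zero - matching the intuition behind Eq. (15).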

Figure 5: Transfer Entropy between two nodes X and Y computes the amount of information contributed from the source y_n to the destination x_{n+1} that is not already known from the destination's history x_n^(k). [Boe12]


2.3.4. Information modification

Information modification events, such as particle collisions in cellular automata, are an important operation in biological neural networks [Ati92, SMC02] and their artificial counterparts. Information modification is described as the interaction between information transfers and stored information, modifying one or the other. Thereby modification involves a non-trivial processing of information. A popular example of information modification concerns an observed glider inside a cellular automaton. Information modification occurs if the observer is surprised or misinformed about the next state of the glider. For example, given two gliders, as long as no interaction takes place we can predict their next states by observing them. In case of a collision our prediction would be wrong, because the next step was modified by the collision.

To identify these kinds of events, a tool called separable information [LPZ10] is used (see figure 6). It is defined as follows:

s_x(n+1) = lim_{k→∞} [ a_x(n+1) + Σ_{Y ∈ V_x \ X} t_{y→x}(n+1) ]    (17)

S_X = ⟨s_x(n)⟩_n    (18)

We consider information as separable in case the sum of AIS and all incoming transfers for one variable is positive. The interesting events are those where s_x(n) is negative, because we assume misinformation occurred, which should be caused by an information modification event. However, the measure cannot tell us exactly how much information is gained by the modification itself; it just indicates that a modification occurred.
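Eq. (17) just sums the local AIS and the local incoming transfers per time step. The sketch below combines plug-in versions of both; the XOR destination driven by two random sources is our illustrative stand-in for a "collision-like" computation, where neither storage nor any single transfer accounts for the outcome:

```python
import math
import random
from collections import Counter

def locals_ais(series, k):
    # local a_x(n+1) = log2 p(past, next) / (p(past) p(next)), Eq. (13)
    pasts = [tuple(series[i - k:i]) for i in range(k, len(series))]
    nexts = series[k:]
    t = len(pasts)
    cj, cp, cn = Counter(zip(pasts, nexts)), Counter(pasts), Counter(nexts)
    return [math.log2((cj[(p, x)] / t) / ((cp[p] / t) * (cn[x] / t)))
            for p, x in zip(pasts, nexts)]

def locals_te(src, dst, k):
    # local t_{y->x}(n+1) = log2 p(next | past, y) / p(next | past), Eq. (15)
    rows = [(tuple(dst[i - k:i]), src[i - 1], dst[i]) for i in range(k, len(dst))]
    cj = Counter(rows)
    chy = Counter((h, y) for h, y, _ in rows)
    chz = Counter((h, z) for h, _, z in rows)
    ch = Counter(h for h, _, _ in rows)
    return [math.log2((cj[r] / chy[(r[0], r[1])]) / (chz[(r[0], r[2])] / ch[r[0]]))
            for r in rows]

def separable_information(dst, sources, k=2):
    # Eq. (17): local AIS plus the sum of all incoming local transfers;
    # negative values flag candidate information-modification events.
    s = locals_ais(dst, k)
    for src in sources:
        s = [a + t for a, t in zip(s, locals_te(src, dst, k))]
    return s

random.seed(1)
y1 = [random.randint(0, 1) for _ in range(4000)]
y2 = [random.randint(0, 1) for _ in range(4000)]
x = [0] + [a ^ b for a, b in zip(y1, y2)][:-1]   # x_{n+1} = y1_n XOR y2_n
s = separable_information(x, [y1, y2], k=2)
```

For this XOR process the storage and each pairwise transfer are individually near zero, so the local values of s_x fluctuate around zero with negative excursions flagging the non-separable joint computation.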

Figure 6: Separable information between nodes tries to find non-trivial information modification events. These events occur in case the sum of Active Information Storage and all incoming transfers is negative. [Boe12]


3. Related Work

Complex systems have been a topic of interest in neural informatics for a long time. In 1969 Kauffman proposed a model of genetic regulatory networks called Random Boolean Networks (RBN) [Kau69], also known as Kauffman networks or NK-networks. RBNs consist of N binary-state nodes with K inputs per node. Each node typically has two states (on/off). The nodes inside an RBN are updated synchronously: the states at time step t+1 depend on the states at time step t. In several experiments [Kau93] Kauffman observed that at K = 2 a phase transition between the ordered and chaotic phase of the system occurs. These observations were later proven analytically by Derrida et al. [DP86].

In his 1990 paper [Lan90], Langton addresses the question under what conditions physical systems support the basic operations of information transmission, information storage and information modification required for computation. He used Cellular Automata (CA) as an abstraction of a physical system. CAs consist of a grid of cells, each with a number of finite states (e.g. on/off). After initializing the cells with a start state at t = 0, the following states are computed according to a fixed rule. Generally the fixed rule is a mathematical function that determines a cell's next state from the last state of the cell and all its neighbors. The results suggested that the optimal conditions for the support of information transmission, storage and modification are achieved in the vicinity of a phase transition.
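The fixed-rule update can be made concrete with an elementary (one-dimensional, binary, nearest-neighbor) CA; rule 110 and the single-seed initial state below are our illustrative choices:

```python
def ca_step(cells, rule):
    """One synchronous update of an elementary cellular automaton: each
    cell's next state is a fixed function (encoded in the 8 bits of `rule`)
    of its own last state and that of its two neighbors (periodic boundary)."""
    n = len(cells)
    out = []
    for i in range(n):
        neighborhood = (cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]
        out.append((rule >> neighborhood) & 1)
    return out

# Rule 110, often cited as lying in the "complex" regime, grown from one seed.
cells = [0] * 20 + [1] + [0] * 20
history = [cells]
for _ in range(10):
    cells = ca_step(cells, 110)
    history.append(cells)
```

Running the same function with a different `rule` value sweeps through Wolfram's ordered, chaotic and complex regimes without changing any other code.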

In another study Wolfram uses a parameterization of the possible CA rules to present a qualitative survey of the different dynamical regimes, along with the observation that the CAs exhibiting the most complex behaviors generally lie near the phase transition between highly ordered and highly disordered dynamics.

In [Pac88] Packard performed an experiment using genetic algorithms to evolve CAs. He interprets his results as evidence that the highest computational power can be found near the critical line. These results were criticized by Mitchell et al. in [MHC93]. They ran an experiment similar to Packard's, but could not find results that support Packard's conclusion. Rather, they suggested that:

"It (computation) is measured in terms of generic, structural computational elements such as memory, information production, information transfer, logical operations, and so on." [MHC93]

Information theory [SW49] and the framework of information dynamics [LPZ08d, LPZ10, Liz10], which was explained in detail in the previous section, provide the tools to quantify the elements of computation in complex systems using the basic operations of information transmission, storage and modification mentioned above. This framework has been used successfully in several studies. One example is the paper published in 2008 by Lizier et al. [LPZ08b], which analyzes the phase transition between the ordered and chaotic phase in Random Boolean Networks. As can be seen in figure 7, they found maximized Active Information Storage on the ordered side of the phase transition. Transfer Entropy, on the other hand, is maximized on the chaotic side of the phase transition.

Figure 7: Maximized Active Information Storage at the ordered side of the phase transition (green) and Transfer Entropy at the chaotic side of the phase transition (black). The phase transition is located at K̄ = 2.5. [LPZ08b]

Another study, by Wang [WML+12], analyzes information flow inside swarms to find evidence for the conjecture that information cascades in swarms occur in waves. Closely related to the problems we address in section 4, Lizier discovers in [LPZ12] that Active Information Storage might be misinformative when information transfer dominates the computation. Another study by Lizier [LPP11] identified information storage as the dominant factor in the stable regime of RBNs, while information transfer dominates the chaotic regime.

While the previously mentioned studies analyze non-input-driven dynamical systems, Natschlaeger et al. investigated the computational power of Echo State Networks in [NBL05].

Figure 8: The performance of the trained network as a function of µ and σ for the 3-bit (A, B) and 5-bit (C) parity tasks, measured as memory capacity (MC). The black line represents the critical line; memory capacity is represented by shades of grey. Darker parts indicate high MC and vice versa. [NBL05]

They developed a new complexity measure called network-mediated separation (NM-separation). NM-separation is the amount of state difference in a network, calculated as the differences in the input minus the differences in the initial network states. For their experiments they ran a 3-bit parity task on a network that was trained with linear regression. The results (see figure 8) show that the highest performance is achieved close to the critical line. Overall, the results "support the hypothesis that dynamics near the critical line are expected to be a general property of input-driven dynamical systems which support complex real-time computations" [NBL05].

3.1. Previous Work

Beside the related work presented above, previous work was done by Boedecker et al.; the results presented in this thesis are based on their studies. In [BOL+12] they investigate the information processing in Echo State Networks. Using the information-theoretical framework by Lizier mentioned above, they are able to quantify the computational capabilities between elements of these networks directly. The experimental set-up consists of an analog ESN trained on two different tasks. The first task is the Memory capacity task introduced by Jaeger in [Jae02]. It measures how long a single input given into a network can be stored inside it, by using several output units that are trained to different delays. The results for this task can be found in figure 9. The second task was the NARMA (nonlinear autoregressive moving average) task. It calculates the output by combining a window of past inputs in a highly nonlinear way.

Unfortunately, counting the probabilities for an analog ESN is quite cost-intensive in terms of computing power.

Figure 9: Memory capacity of an ESN plotted against the estimated Lyapunov exponent. [BOL+12]

To get proper results in a reasonable time they decided to

use a fixed kernel estimation of 0.2 and set the history size to k = 2. The criticality was estimated using the Lyapunov exponent [LS00]. Active Information Storage (AIS) and Transfer Entropy (TE) (see sections 2.3.2 and 2.3.3) are the measured computational capabilities. The results indicate almost nonexistent information storage and transfer in the stable regime. Approaching the edge of chaos, both measures show a big spike before dropping to a relatively low level, though not as low as in the stable regime (see figure 10). AIS and TE match the network's performance (figure 9) almost perfectly around the edge of chaos. However, neither AIS nor TE can explain why the performance is

non-zero in the stable regime. Here both indicate an almost nonexistent storage and

Figure 10: Active Information Storage (left) and Transfer Entropy (right) vs. the estimated Lyapunov exponent. [BOL+12]

transfer. Another problem the study faces is the short history size, caused by the trade-off between computational resources and history size. We address that problem in our study by using a binary ESN, which enables us to use longer history sizes. Furthermore, we use separable information as well as our new measure for information storage to better explain the performance inside the stable regime.


4. Input-Corrected Active Information Storage [5]

The previously shown measure of AIS has proven useful in the analysis of complex systems in general, such as cellular automata [LPZ12] or Random Boolean Networks [LPZ08c]. In input-driven dynamical systems, however, the input has a strong influence on the next state, especially in the stable regime. When using AIS we do not consider the input as external to the system; it is rather seen as a part of it. Looking at two simple examples (see figure 11) we want to show how AIS quantifies information storage in an input-driven system. In the first case we use a simple feed-forward unit that passes on the input. In the second case the unit computes its output from the stored last output and the new input by using an XOR-gate. Intuitively, we expect the information storage to be 0 for the forwarding unit and 1 for the XOR-unit. In fact, AIS measures nonzero information storage in the first case; as we will show, this information storage depends highly on the structure of the input. In the second case we will find an information storage of 0, even though we expect it to be 1 because the unit needs to store its previous state to compute the output.

Figure 11: a) shows a simple feed-forward unit that passes on the input. b) implements an XOR-neuron that stores the last state to compute the output.

For both examples we are going to use two different kinds of input, u1 and u2, where for

u1 we randomly draw 0 or 1 from a Bernoulli distribution with a probability of p = 0.5. In case of u2 we use a Markov condition to generate the input, where the previous value is repeated with a probability of p = 0.7, and with p = 0.3 the value switches from 0 to 1 and vice versa. In both examples we assume the history size k to be one, the input to be binary, and the time series to be infinite. For the time series u1 and u2 we have a probability of p(x_n = 0) = p(x_n = 1) = 0.5. The joint probability for two subsequent values is p(x_{n+1}, x_n^(k)) = 0.25 in case of u1, while in case of u2 we get a conditional probability of

[5] This section is mainly based on the paper Computing local Active Information Storage in input-driven systems [BOSA13].


p(x_{n+1} | x_n^(k)) = 0.7 if x_{n+1} = x_n^(k) and p(x_{n+1} | x_n^(k)) = 0.3 if x_{n+1} ≠ x_n^(k). Given these probabilities and a history size of one, the Active Information Storage is computed as A(X, 1) = log2(0.25 / (0.5 · 0.5)) = 0 for u1 and A(X, 1) = 0.7 · log2(0.7 / 0.5) + 0.3 · log2(0.3 / 0.5) ≈ 0.1 for u2. For random uniformly distributed input (u1) we get an AIS of 0 as expected, but for the structured input we find an AIS of ≈ 0.1, even though we know the unit has no storage.
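These values can be reproduced empirically. The following is a minimal sketch (the function and variable names are our own, not from the thesis code) that estimates the AIS with k = 1 from generated time series; since the forwarding unit simply passes the input on, the AIS of its output equals the AIS of the input series itself.

```python
import math
import random
from collections import Counter

def ais_k1(series):
    """Plug-in estimate of the average AIS for history size k = 1 (in bits)."""
    pairs = list(zip(series[:-1], series[1:]))
    joint = Counter(pairs)
    single = Counter(series)
    n_pairs, n = len(pairs), len(series)
    ais = 0.0
    for (prev, cur), c in joint.items():
        p_joint = c / n_pairs
        ais += p_joint * math.log2(p_joint / ((single[prev] / n) * (single[cur] / n)))
    return ais

random.seed(0)
T = 200_000
# u1: Bernoulli(0.5) input
u1 = [random.randint(0, 1) for _ in range(T)]
# u2: Markov input, previous value repeated with probability 0.7
u2 = [random.randint(0, 1)]
for _ in range(T - 1):
    u2.append(u2[-1] if random.random() < 0.7 else 1 - u2[-1])

# AIS of the forwarding unit's output: close to 0 bits for u1,
# roughly 0.1 bits for u2, even though the unit stores nothing.
```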

In the second example we analyze a neuron that calculates its output as an XOR between the input and its own last state (see figure 11 b). The probabilities p(x_{n+1}) and p(x_n^(k)) will be 0.5 due to the random uniform input. For p(x_{n+1}, x_n^(k)) we have 4 possibilities that are all equally likely, so the probability is p = 0.25. Against our expectations we get an AIS of A(X, 1) = log2(0.25 / (0.5 · 0.5)) = 0 for a history size of one. For u2 and a history size of 1 we likewise get a probability of 0.5 for p(x_{n+1}) and p(x_n^(k)), which leads to an AIS of 0 because p(x_{n+1}, x_n^(k)) = 0.25. It has to be stated that this only applies for a history size of 1, because in that case AIS is not able to find the structure that enters through the input. With a longer history size, AIS should be able to measure some information storage by identifying the input structure, but it will not necessarily be 1 as we assume for the XOR unit.

If we want to estimate the information storage correctly, we have to rethink our approach to measuring storage for input-driven systems. We propose to "condition out" the input given into the system and call the new measure local Input-Corrected Active Information Storage (icAIS). It measures the difference between the joint distribution of X_{n+1} and X_n^(k) and the case where X_{n+1} and X_n^(k) are generated by the input but are completely independent of each other. The icAIS at a time step n + 1 for a variable x and an input u is defined as:

a_X^U(n + 1) = lim_{k→∞} a_X^U(n + 1, k)   (19)

a_X^U(n + 1, k) = log [ p(x_n^(k), x_{n+1} | u_{n+1}) / ( p(x_n^(k) | u_{n+1}) p(x_{n+1} | u_{n+1}) ) ]   (20)

= log [ p(x_{n+1} | x_n^(k), u_{n+1}) / p(x_{n+1} | u_{n+1}) ]   (21)

The definition can be generalized to a process X_i in a system X:

a_X^U(i, n + 1) = lim_{k→∞} a_X^U(i, n + 1, k)   (22)

a_X^U(i, n + 1, k) = a_{X_i}^U(n + 1, k)   (23)

= log [ p(x_{i,n+1} | x_{i,n}^(k), u_{n+1}) / p(x_{i,n+1} | u_{n+1}) ]   (24)

We then have the input-corrected Active Information Storage A_X^U(i, k) = ⟨a_X^U(i, n, k)⟩_n.

For homogeneous processes we can again average over these, resulting in:

A_X^U(k) = ⟨a_X^U(i, n, k)⟩_{i,n}   (25)

Applying icAIS to our two previous examples, we again assume the history size k to be one, the input to be binary, and the time series to be infinite. In the first example, conditioning on the input leads to a probability of 1 for both p(x_{n+1} | x_n^(k), u_{n+1}) and p(x_{n+1} | u_{n+1}). Thus we get an icAIS of zero, which applies for random input as well as for structured input. The icAIS for a forwarding unit will be zero for all kinds of input, as we expect. In the second case of an XOR-neuron, conditioning on the input u_{n+1} and the history x_n^(k) leads to a probability of 1 for p(x_{n+1} | x_n^(k), u_{n+1}), while p(x_{n+1} | u_{n+1}) is 0.5 because of the missing information about the history x_n^(k). With these probabilities, A_X^U(1) = log2(1 / 0.5) = 1 for our second example, again independent of the input and as we would expect.
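A quick empirical check of this result for the XOR-neuron, using plug-in probability estimates of equation 21 with k = 1 (a sketch; the names are our own, not from the thesis code):

```python
import math
import random
from collections import Counter

def ic_ais_k1(x, u):
    """Plug-in estimate of icAIS (eq. 21) for history size k = 1, in bits."""
    triples = Counter(zip(x[:-1], u[1:], x[1:]))   # (x_n, u_{n+1}, x_{n+1})
    hist_in = Counter(zip(x[:-1], u[1:]))          # (x_n, u_{n+1})
    in_next = Counter(zip(u[1:], x[1:]))           # (u_{n+1}, x_{n+1})
    inp = Counter(u[1:])
    n = len(x) - 1
    total = 0.0
    for (xp, un, xn), c in triples.items():
        p_next_given_hist_in = c / hist_in[(xp, un)]    # p(x_{n+1} | x_n, u_{n+1})
        p_next_given_in = in_next[(un, xn)] / inp[un]   # p(x_{n+1} | u_{n+1})
        total += (c / n) * math.log2(p_next_given_hist_in / p_next_given_in)
    return total

random.seed(1)
u = [random.randint(0, 1) for _ in range(100_000)]
x = [0]
for t in range(1, len(u)):
    x.append(x[t - 1] ^ u[t])   # XOR of last state and current input

# ic_ais_k1(x, u) comes out close to 1 bit, as argued above.
```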

We have explained AIS as well as icAIS in detail; in the following we explain the relation between AIS, icAIS, and an already existing measure for shared information called interaction information.

4.1. Interaction Information and icAIS

Figure 12: Venn diagram that visualizes the interaction information between three vari-

ables. The shaded part represents the three-way redundancy and the dark

part the interaction information. [Bel03]

Interaction information [McG54] or co-information [Bel03] is a generalization of mutual information developed by McGill and Bell, respectively. It describes the information shared by k random variables, which can be either positive or negative. The part of interest for us is the information shared among all three variables, I(X,Y,Z). Here we want to demonstrate how this idea is related to icAIS. Interaction information for three variables is defined as follows:

I(X,Y,Z) = I(X,Y |Z) − I(X,Y)   (26)
= I(X,Z|Y) − I(X,Z)
= I(Y,Z|X) − I(Y,Z)

where I(X,Y |Z) and I(X,Y) are defined as

I(X,Y |Z) = log2 [ p(X,Y |Z) / ( p(X|Z) p(Y |Z) ) ]   (27)

I(X,Y) = log2 [ p(X,Y) / ( p(X) p(Y) ) ]   (28)

As mentioned before, interaction information can be either positive or negative for k ≥ 3, which can be interpreted as synergy and redundancy [Ley09]. If two sources contribute the same information to a destination, redundancy occurs; this overlap is represented by negative interaction information. In the opposite case of synergy and positive interaction information, two variables U and V contribute information that does not overlap. With regard to the previous examples, we find redundancy in the forwarding neuron and synergy in the XOR-neuron (see figure 12).
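The synergy claim for the XOR-neuron can be verified with exact probabilities: I(X;Y|Z) − I(X;Y) is positive because input and history together determine the output while neither does alone. A sketch (the helper names are our own):

```python
import math
from collections import defaultdict
from itertools import product

# Exact joint p(x, y, z) for the XOR unit: y = previous state, z = input,
# x = y XOR z, with y and z uniform and independent.
p = {(y ^ z, y, z): 0.25 for y, z in product((0, 1), repeat=2)}

def marg(dist, keep):
    """Marginalize a joint distribution, keeping the given key positions."""
    out = defaultdict(float)
    for key, pr in dist.items():
        out[tuple(key[i] for i in keep)] += pr
    return out

def mi_xy(dist):
    """I(X;Y) from a joint over (x, y, z), in bits."""
    pxy, px, py = marg(dist, (0, 1)), marg(dist, (0,)), marg(dist, (1,))
    return sum(pr * math.log2(pr / (px[(x,)] * py[(y,)]))
               for (x, y), pr in pxy.items())

def mi_xy_given_z(dist):
    """I(X;Y|Z) from a joint over (x, y, z), in bits."""
    pxz, pyz, pz = marg(dist, (0, 2)), marg(dist, (1, 2)), marg(dist, (2,))
    return sum(pr * math.log2(pr * pz[(z,)] / (pxz[(x, z)] * pyz[(y, z)]))
               for (x, y, z), pr in dist.items())

# Positive interaction information = synergy (1 bit for XOR).
interaction = mi_xy_given_z(p) - mi_xy(p)
```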

With icAIS we want to take redundancy and synergy explicitly into account. We can say we want to add the interaction that occurs between input and history to the AIS. We already see that I(X,Y) corresponds to AIS, while equation 29 shows that I(X,Y |Z) corresponds to icAIS.

I(X,Y |Z) = log [ p(X,Y |Z) / ( p(X|Z) p(Y |Z) ) ],   substituting X = x_{n+1}, Y = x_n^(k), Z = u_{n+1}:

= log [ p(x_{n+1}, x_n^(k) | u_{n+1}) / ( p(x_{n+1} | u_{n+1}) · p(x_n^(k) | u_{n+1}) ) ]

= log [ p(x_{n+1}, x_n^(k), u_{n+1}) / p(u_{n+1}) ] − log [ ( p(x_{n+1}, u_{n+1}) · p(x_n^(k), u_{n+1}) ) / ( p(u_{n+1}) · p(u_{n+1}) ) ]

= log [ ( p(x_{n+1}, x_n^(k), u_{n+1}) · p(u_{n+1}) ) / ( p(x_{n+1}, u_{n+1}) · p(x_n^(k), u_{n+1}) ) ]

= log [ p(x_{n+1} | u_{n+1}, x_n^(k)) / p(x_{n+1} | u_{n+1}) ]   (29)

Equation 29 shows that interaction information can be written as I = icAIS − AIS, which can be transformed to icAIS = AIS + I, matching the assumption we made before.


Besides interaction information, there is another, more recent framework for analyzing shared information between variables: Partial Information Decomposition. Its relation to AIS and icAIS is discussed in the following.

4.2. Partial Information Decomposition

Partial Information Decomposition (PID) is a recently developed method by Williams and Beer [WB10] to decompose the information several sources provide about a destination into the information-theoretically atomic concepts of redundant, unique and synergistic information. In contrast to interaction information, it does not measure negative information. The basic concept of PID is to decompose the information a vector R = {R1, R2, ..., Rn} provides about a variable S. If we consider the most basic example with three variables, PID measures how much information R = {R1, R2} provides about S and how R1 and R2 contribute to the total information. The information R = {R1, R2} provides about S can be measured as I(S; R1, R2). The contribution to the total information falls into three distinct possibilities. First, unique information describes the case where R1 provides information that R2 does not. If R1 and R2 provide the same information, it is called redundant information. The third possibility is synergy: here we have information that is only available as a combination of R1 and R2. The previously mentioned forward- and XOR-neuron are examples for redundant (forward) and synergistic (XOR) information. Figure 13 shows the structure of PID for three variables. If we relate that

Figure 13: Structure of Partial Information Decomposition for three variables (unique, redundant, and synergistic parts).

to the ideas of icAIS and AIS, we get the representation of AIS as seen in figure 14, where X_n^(k) is the history of the neuron and Y_{n+1} is the input. In [LFW13] Lizier states that if we consider x_n^(k) as the past state and y as another causal source, AIS measures the white part. In our case the other causal source is the input u, and AIS includes the history's unique information plus the redundant information between input and history. The problems caused by the redundant part have been discussed earlier. It can be argued that infinite history sizes solve the problem for some cases, but on the one hand this is only true for input that has no structure, and on the other hand it is hardly possible to use infinite history sizes in practice because of computational limitations.

Figure 14: AIS represented by PID with x_n^(k) as history and y_{n+1} as input.

Knowing that AIS may give us distorted results, there are three possible variants to measure information storage.

Variant 1: input-corrected Information Storage

If we want to measure the history's effect on the current step, it may suffice to measure the unique information the history provides (see figure 15). The redundant part can offer some information that is contained in the history, but it may become irrelevant for increasing k and longer time series. However, this leads to the same problem we had for AIS: an infinite history size is not possible in practical applications. Beside that, another practical problem is the lack of an exact measure for unique information. There is an approach to measure redundant information by Williams and Beer called Imin [WB10], but it suffers from problems that have been discussed in [LFW13].

Variant 2: input-corrected Information Storage

The approach we use to measure information storage is represented by figure 16. By using conditional mutual information we exclude the redundant information but also take the synergistic information into account [LFW13]. It can be argued that the redundant information {X_n^(k)}{Y_n} should not be removed, because we remove this information completely, not just its redundant part. It might also be criticized that the synergy is not a part of the system, because the interaction that occurs can be seen as external rather than as part of the system itself. We claim that the synergistic effects appear due to the processing of the system, which is an essential part of a neural network and should not be treated as external. For redundant information we know that keeping it as part of information storage may distort the results; on the other hand, we cannot be sure not to miss information by excluding it. Until now there is no trivial way to estimate the information storage and only use the relevant parts of the redundant information. Both arguments have

Figure 15: icAIS variant 1 represented by PID. Identifying the unique information (white part) contained in the system for an input Y.

valid pros and cons, and whether our approach to measuring information storage is correct might be a topic for further investigation.

Figure 16: icAIS variant 2 represented by PID with x_n^(k) as history and y_{n+1} as input. Identifying the unique information (white part) plus the synergistic information contained in the system.

Variant 3: input-corrected Information Storage

As we pointed out in variant 2 already, we consider synergistic information an essential part of the system. Also, we cannot be sure that we will not miss information in some cases if we exclude the redundant part. The third variant describes information storage as all information coming into the system except the information that is uniquely from the input (see figure 17). However, it would still suffer from the same problems with redundant information that we showed earlier for AIS.

Figure 17: icAIS variant 3 represented by PID. Identifying all information (white part) except the unique information from the input contained in the system for an input Y.

5. Experiments and Results

In chapter 4 we discussed the necessity of a new approach to measure information storage in dynamical systems. However, while the examples presented in chapter 4 were illustrative, they were also trivial. In this section we test the new measure on a more realistic example, by measuring information storage inside an Echo State Network (see section 2.2) at the order-chaos phase transition.

5.1. Experiment Settings

A basic binary Echo State Network was implemented in Python (see figure 18) using the scipy[6] and numpy[7] packages. The state update equations are:

x_{n+1} = f(W_in u(n + 1) + W x(n))   (30)

y_n = f(W_out x(n))   (31)

where u is the input, x the activation of the neurons, and y the output. Output-feedback is not used. The internal activation function f and the output function are represented by a step function with f(x) = 1 for x ≥ 0.5 and f(x) = 0 for x < 0.5. W_in, W and W_out are the weight matrices, where W_in is initialized with ones, while the weights for W and W_out are randomly drawn from a Gaussian distribution. The system has one input, 250 reservoir units, and 9 outputs. Every output computes a 5-bit parity task with a certain delay τ; e.g., the first output checks parity for the last 5 inputs, the second output checks parity between the sixth- and the second-last input, and so on. The

[6] www.scipy.org
[7] numpy.scipy.org

Figure 18: Settings of the ESN used in the 5-bit parity task experiments.

output-function is defined as follows:

y^τ_{t+1} = PARITY(u_{t−τ}, u_{t−τ−1}, u_{t−τ−2}, u_{t−τ−3}, u_{t−τ−4})   (32)

The network is sparsely connected by setting the in-degree K of every reservoir unit to 20. Weights are trained offline by pseudo-inverting the matrices. As input we use the previously mentioned inputs u1 and u2, where u1 is drawn from a Bernoulli distribution. For u2 we also draw binary random values, but impose a Markov condition where with a probability of 0.7 the last value is repeated, and with a probability of 0.3 the value is changed from 0 to 1 or vice versa. We set the history size to k = 7, which was recommended by Kashima [Kas10] in a previous work on this subject. He found that a network with the parameters given above has hardly any information storage after seven steps.
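A minimal sketch of the reservoir update in this set-up (equations 30 and 31) may clarify the settings; this is an illustration, not the thesis code, and details such as the seed and the exact sparse-wiring scheme are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 250, 20                        # reservoir size and in-degree

# Sparse reservoir matrix: K Gaussian incoming weights per unit.
W = np.zeros((N, N))
for i in range(N):
    idx = rng.choice(N, size=K, replace=False)
    W[i, idx] = rng.normal(0.0, 1.0, size=K)
W_in = np.ones((N, 1))                # input weights initialized with ones

def step(z):
    """Binary step activation: 1 for z >= 0.5, else 0."""
    return (z >= 0.5).astype(float)

x = np.zeros((N, 1))
u = rng.integers(0, 2, size=100).astype(float)
states = []
for u_t in u:                         # eq. 30: x_{n+1} = f(W_in u + W x)
    x = step(W_in * u_t + W @ x)
    states.append(x.ravel().copy())
```

The trained readout (equation 31) would then map the collected `states` to the parity targets via the pseudo-inverse.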

5.2. Measuring criticality

In our experiments we want to compare the internal computational capabilities to a system's performance. Therefore it is essential to know whether a system is in the ordered or the chaotic phase. One technique to estimate the state of an ESN is to look at its sensitivity to perturbations given into the system. The idea behind this is that if a small perturbation given into a system dies out within a certain amount of time, the system should be in the ordered phase; in the other case the perturbation amplifies and the network should be in the chaotic phase (see figure 19). We mentioned several conjectures about maximized computational capabilities in the related work section. These conjectures have been analytically proven by Derrida and others [DW86, DP86]. Based on their theories, Bertschinger and Natschlaeger developed a method to find the critical line in discrete dynamical systems [BN04] such as the binary ESN we use for our experiments. The approach is closely related to the Lyapunov exponent [LS00], which is used to find criticality in continuous or autonomous dynamical systems. Two initial states are given into a system, mapped to their corresponding states, and then the time

Figure 19: Dynamics of a binary network for K = 4, N = 250 and variable ū and σ². The plots show a stable, critical and chaotic network from left to right. [BN04]

evolution of the hamming distance between the two states is observed. The equation for the time evolution of the hamming distance is:

d_fade(t + 1) = FADE(d(t)) = r · d(t + 1; ū + 1) + (1 − r) · d(t + 1; ū − 1)   (33)

where the input takes the values ū + 1 and ū − 1 with probabilities r and 1 − r. An increasing (decreasing) hamming distance is a sign of a chaotic (stable) system. The developed theory provides a measure α that analyzes the slope of the Derrida plots; hereby the Derrida plots show d_fade(t + 1) = FADE(d(t)) plotted versus d(t). Finally, the criticality measure α only

depends on static values and is defined as follows:

α = ∂d_fade(t + 1) / ∂d(t) |_{d(t)=0} = K · ( r · P_BF(1, ū + 1) + (1 − r) · P_BF(1, ū − 1) )   (34)

P_BF(1, u) = ∫_{−∞}^{∞} φ(b, 0, σ²) [ ∫_{min{b−u, −b−u}}^{max{b−u, −b−u}} φ(a, 0, (K − 1)σ²) da ] db   (35)

φ(x, µ, σ²) = (1 / √(2πσ²)) · e^{−(x−µ)² / (2σ²)}   (36)

where K is the in-degree of the units and P_BF(c, u) is the probability that exactly c bit-flips out of the K inputs to a unit will cause a different output.

Figure 20: The Derrida plots show the dependence of d_fade(t + 1) = FADE(d(t)) on d(t). The plots are generated with ū = 0, σ = 1, r = 0.5 and three different values of k. The numerical fixpoint d_fade(t + 1) = FADE(d(t)) is given for each plot. [BN04]

P_BF can be calculated as an integral over a and b, where a and b are randomly drawn weights which are normally distributed with mean 0 and variance σ². We can compute α in dependence of K, σ, r and µ (for a more detailed explanation of how to measure criticality, see [BN04]). In all our further experiments, the critical line where the phase transition occurs is at α = 1. At that point the hamming distance does not change anymore, and a disruption that still exists in the network stops amplifying by itself.
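As a sketch, the criticality measure α of equations 34-36 can be evaluated by numerical integration; the function names and integration details below are our own choices, not taken from the thesis code:

```python
import math
from scipy import integrate

def phi(x, mu, var):
    """Gaussian density with mean mu and variance var (eq. 36)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p_bf(u, K, sigma2):
    """Probability that a single bit-flip changes a unit's output (eq. 35)."""
    def inner(b):
        lo = min(b - u, -b - u)
        hi = max(b - u, -b - u)
        val, _ = integrate.quad(phi, lo, hi, args=(0.0, (K - 1) * sigma2))
        return phi(b, 0.0, sigma2) * val
    val, _ = integrate.quad(inner, -math.inf, math.inf)
    return val

def alpha(K, sigma2, r, u_mean):
    """Criticality measure alpha (eq. 34); alpha = 1 marks the critical line."""
    return K * (r * p_bf(u_mean + 1, K, sigma2)
                + (1 - r) * p_bf(u_mean - 1, K, sigma2))
```

With small weight variance the measure stays well below 1 (ordered regime) and grows as the variance increases, which is the behavior the Derrida-plot analysis predicts.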

5.3. Estimation of required test steps

In terms of computational costs, measuring the computational capabilities of a dynamical system is very expensive. To measure the information dynamics we count how often a given pattern occurs in a time series, but to get meaningful results it is necessary that every possible pattern occurs at least a few times. To keep the computational costs low and still get reliable results, a tradeoff between the length of the time series and the number of times a pattern has to occur is required. Boedecker et al. [Boe11] used a kernel-estimation approach to resolve the problem. They suggested the following equation to find the optimal number of test steps:

( l / (2r) )^{k+2} = N / n   (37)

On the left we have the maximum number of hyperspheres that can occur in the state space; on the right, the required number of test steps N divided by the minimum number of times a pattern has to occur, n. Here l is the range between the smallest and the largest possible activation given into the network, r is the kernel radius, and k the history size. To still get meaningful results, it was recommended in [LS06] to set n = 3. In section 5.1 we defined our network to be binary with an interval [−1, 1], which leads to l = 2, and because we use binary units the radius will be 0.5 to cover the full range l with two kernels. Transfer Entropy requires considering the highest number of dimensions in state space, with one dimension for the current state of the destination and one for the previous state of the source, plus k dimensions for the history of the destination. For the used history size of 7 we get 9 dimensions. Transforming the equation we get

N = ( 2 / (2 · 0.5) )^{7+2} · 3 = 1536   (38)

We decided to use a time series with a length of 1600 test steps to get reliable results.
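The calculation in equations 37 and 38 can be sketched as follows (the variable names follow the text; the function itself is illustrative):

```python
def required_steps(l=2.0, r=0.5, k=7, n=3):
    """Number of test steps N so that every pattern occurs at least n times,
    from (l / (2r))^(k+2) = N / n (eq. 37); defaults reproduce eq. 38."""
    return (l / (2 * r)) ** (k + 2) * n

# With l = 2, r = 0.5, k = 7 and n = 3 this yields N = 1536, which we
# round up to the 1600 test steps used in the experiments.
```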

5.4. Task-performance

Figure 21: The performance in bits for every α, measured using equation 39. Left: random uniformly distributed input; right: semi-structured input generated by a Markov condition.

In order to compare the results of the internal computational capabilities to the actual system performance, we use a performance measure similar to the one Jaeger used for his memory capacity task in [Jae02]. We draw weights from a Gaussian distribution with a variance σ² to get results for different dynamical regimes. As shown in section 5.2, the criticality depends on σ. We run the experiments for −1.1 ≤ σ ≤ −0.3 and increase σ by 0.002 in every run. That leads to a total of 400 runs in different regimes between α = 0 and α ≈ 2.5. The performance is calculated as the sum of the mutual information between the calculated output and the desired output over all τ, and is measured in bits (see equation 39).

Σ_{τ=0}^{∞} MI(o_τ^target, o_τ)   (39)
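Equation 39 can be sketched with a plug-in estimator for binary outputs (a minimal illustration; the names are ours, not the thesis code):

```python
import math
from collections import Counter

def mi_bits(a, b):
    """Plug-in mutual information in bits between two equal-length sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def performance(targets, outputs):
    """Sum MI(o_tau_target, o_tau) over all trained delays tau (eq. 39)."""
    return sum(mi_bits(t, o) for t, o in zip(targets, outputs))
```

A perfectly solved binary parity output contributes 1 bit per delay, so nine perfect outputs would give a performance of 9 bits.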

The plot in figure 21 shows the performance as one point for every run against α. For better visualization we plot the results of three different runs for every α; this will be the case for all the following plots as well. As one can easily see, the results match the conjecture that the performance is maximized between the ordered and the chaotic phase at α ≈ 1. Besides that, we found a very low performance for 0 < α < 0.5 and a rapid increase when getting closer to the edge of chaos at 0.5 < α < 1. After stepping into the chaotic regime the performance starts to drop very quickly. These observations hold for both kinds of inputs, u1 and u2.

In the following we present the results for information storage, Transfer Entropy and local separable information, and show how the performance correlates with these internal computational capabilities.

5.5. AIS and icAIS

In section 4 we showed with two simple examples that AIS can produce unexpected results for input-driven systems, caused by the influence of the input. The results of the previously described experiments (see section 5.1) support the view that the AIS is highly influenced by the input. AIS and icAIS have been measured in bits, as well as the probabilities that the storage is positive on a local level. Hereby "local level" describes the AIS/icAIS for one neuron at any one time step. Figure 23 shows the results for the average AIS over all neurons and all time steps. Every dot represents the average AIS of the network for one α. We easily see in figure 23 that the assumptions made in section 4, that the input can affect the AIS in a direct way, are valid for real systems as well. While the AIS for u1 slowly increases in the stable regime, reaching its maximum around α = 1.7, in case of u2 the maximum is reached right at the beginning and the AIS decreases from there. Interesting to note is the overall higher AIS in case of semi-structured input.

Analyzing the difference between the AIS for random input and the AIS for semi-structured input (see figure 22), the difference is highest in the beginning and decreases with increasing α. Far into the chaotic regime, around α = 2.5, the difference becomes insignificant. We assume that this phenomenon occurs because in the chaotic regime the system is mostly driven by its own dynamics and barely by the input.


Figure 22: The difference between the average AIS values for random uniform input and semi-structured input, measured in bits for every α.

Apart from measuring the average AIS, we also investigated the chance of getting a positive local AIS for one neuron at any one time step. The results shown in figure 24 are similar to the results we got for the average AIS. Again, for random uniform input we get a very slow increase, reaching the maximum probability at α ≈ 1.7, while for semi-structured input the maximum is reached at the beginning and the probability keeps decreasing. The distance between the probabilities decreases with increasing α in a similar way as it does for the average information storage. For u1 and u2 the differ


