Advanced Research Centralized Database Server


With XML becoming more and more widely adopted in the avionics world, new standards are emerging from ARINC working groups that define XML schemas to leverage the extensible markup language as a vehicle for structured data exchange. While ARINC 424A aims to redefine the well-known ARINC 424 standard to use XML instead of plain text files, other working groups are defining new schemas to allow the exchange of additional data such as terrain, cultural, and obstacle data used in modern aviation products.

However, with all the obvious advantages of XML also come some significant drawbacks: the format is very verbose and adds significant structural overhead to the transported payload. In addition, the flexible structure requires more complex parsing of the data and prohibits direct access to the contained data entities.

Although the technical capabilities of modern avionic systems have increased, the safety-related design constraints of avionic architectures still impose serious limitations on these systems.

The verbosity of XML data and the increased computational effort required to process this data are the driving factors for this research effort, since the bandwidth of the communication channels used to transport this data is limited and will stay limited in the foreseeable future. In-flight data updates over the air are slow and expensive, and the networked communication defined by AFDX also provides only limited bandwidth to the attached systems. This project focused on ways to overcome the massive increase in data size through compression.

Scope

This project focused on the efficient compression of the transported XML data. While simple extraction of a particular data schema into raw records could be one potential (although quite limited) option, the project focused on a more versatile approach that would allow compressing any kind of XML-encoded data.

The objective of this effort was to define the cornerstones for a new common standard that would allow further improvements and customizations by Jeppesen to provide better compression rates and other advantages to its customers.

The standard shall allow customizations so that Jeppesen can provide advantages to its customers that cannot be utilized without proprietary knowledge of the applied extensions, which remain a trade secret or otherwise protected knowledge of Jeppesen.

Approach

During the project several compression strategies were evaluated and compared. These results were taken into account for the development of a custom compressor intended to achieve a better compression ratio than all the other strategies evaluated before. This prototypical implementation will be referred to as the NavZip compressor.

The lessons learned from this implementation activity were then taken into account to define a strategy for Jeppesen regarding standardisation and differentiation from its competitors.

Jeppesen / Boeing / Industry activities related to EXI

(as of April 6, 2012)

Industry

The W3C is leading the EXI specification, which was published in March 2011 at http://www.w3.org/TR/exi/ .

Several implementations of EXI are available on the market, most of them from members of the W3C group:

AgileDelta: http://www.agiledelta.com/: commercial implementation; offers SAX/DOM/JAXP access APIs to EXI encoded files; free trial version available; company is headed by the chair of the EXI group

Exificient: http://exificient.sourceforge.net/: Java, supported by Siemens; GPL license; used for the testing and AR benchmarking below

GUI: http://www.movesinstitute.org/exi/ExiGui.tar.gz

OpenEXI: http://openexi.sourceforge.net/ , http://sourceforge.net/projects/openexi/: Java, Apache License 2.0; some good tutorials to EXI; does not support the self-contained option (required for random-access to EXI files); milinda.zip includes a SAX-EXI-Reader

Papers:

Sheldon Snyder: Efficient XML Interchange (EXI) Compression and Performance Benefits - Development, Implementation and Evaluation, 2010, www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA518679

Boeing

Two persons from Boeing are/were involved:

Terry Lammers (ATF, Boeing Defense Solutions, [email protected]): Boeing representative in the W3C XML Binary Characterization Working Group which was the predecessor of the EXI Standard; considers AgileDelta as the technology leader

Ann Bassetti (ATF, Boeing Information Technology, Global Collaboration Technologies, [email protected]): Liaison person to W3C

Although they are not directly involved in the current EXI specification, they offered to establish contact with the W3C and/or the chair of the group, John Schneider (CEO of AgileDelta).

Advanced Research

AR did this initial study of XML encoding and compression methods with upcoming standards like NDBX (ARINC 424A) or Terrain & Obstacles in mind. The focus is on optimum compression of the files being uploaded to an aircraft, with two profiles: using XML as a database file, which requires random access, and using it as input for a centralized database server, which allows streaming decompression (e.g., gzip).

In addition, options to further optimize regular EXI for a Jeppesen proprietary version are investigated.

Contact:

Christian Pschierer <[email protected]>

Marine

More and more XML data such as AIS-ASM or S-100 is used in the marine world. The Marine division is starting to investigate how to optimize these data streams.

Contact:

Vadim Gaiduk-RUS <[email protected]>

Raphael Malyankar-EXT <[email protected]>

David D'Aquino <[email protected]>

Eivind Mong <[email protected]>

Geir Olsen-NOR <[email protected]>

John Klippen-NOR <[email protected]>

Shrinking XML Data

Data compression in general describes algorithms to reduce redundancy in a stream of information. Compression can be either symmetric or asymmetric: symmetric compression allows the complete reconstruction of the original data, whereas asymmetric algorithms eliminate parts of the inbound data that are considered irrelevant. Asymmetric compression will never allow fully reconstructing the original data. These compression types are also known as lossless / lossy compression, with JPEG and MP3 being among the most prominent examples of lossy compression.

The particular characteristics of the XML input format lead to a set of requirements for the compression algorithm to be developed:

The algorithm shall not alter the payload data of the compressed document.

The algorithm may impose requirements on the ingested data that may improve its compression ratio or even enable its operation.

These requirements lead to a symmetric compression algorithm; the elimination of portions of the ingested data is instead covered by the requirements imposed on the accepted input data. This way the algorithm allows digest checksums to be used to validate the integrity of the received data. The necessary amendments of the input data to comply with the format requirements of the compressor will need to be addressed during data production and the associated DO-200A quality assurance process.

Data Characteristics

In order to determine a suitable compression method, the characteristics of the data to be compressed must be understood.

XML documents, and especially data deliveries created from relational database systems, tend to be mainly XMLified record dumps in which large structural parts of the document are repeated. With the limited matching logic of pattern-match-based dictionaries, these structural similarities cannot be identified in a way that would allow them to be fully eliminated.

<NDBXRecord xsi:type="ndbx:NavaidSystemType" codeICAO="FA" area="AFR" identifier="1000000000">

<NavaidSystemTimeslice name="ALEXANDER BAY" type="VOR/DME" designator="ABV">

<uses_NavaidComponent xsi:type="ndbx:DMEType" stationDeclination="-18.0" figureOfMerit="2">

<hasLocation_ElevatedPoint elevation="98" position="28.570694 16.533889"/>

</uses_NavaidComponent>

<uses_NavaidComponent xsi:type="ndbx:VORType" figureOfMerit="2" additionalInfo="W" usage="H" frequency="112.1">

<hasLocation_ElevatedPoint position="28.570694 16.533889"/>

</uses_NavaidComponent>

</NavaidSystemTimeslice>

</NDBXRecord>

The above XML fragment depicts the verbosity of XML data; the required markup significantly increases the size of the data. The actual payload contained in the above example can be found below:

ndbx:NavaidSystemType FA AFR 1000000000

ALEXANDER BAY VOR/DME ABV

ndbx:DMEType -18.0 2

98 28.570694 16.533889

ndbx:VORType 2 W H 112.1

28.570694 16.533889

The original message consisted of 663 characters, whereas the new message, including whitespace, is only 155 characters long. Without re-encoding single values into specialized encodings (like turning literal values like ‘true’ and ‘false’ into bit values), this is already a reduction of more than 75% of the original size.

Evaluation of existing compression strategies

In order to judge any new compression method, some of the common compression algorithms and also specialized compression methods have been evaluated to define a benchmark.

While a huge variety of different compression tools is available for different purposes, we focused on dictionary-based compression tools and specialized XML compression tools. With the advent of digital audio and video broadcasting, many technologies for the compression of these particular data streams have been invented, with impressive compression ratios. However, almost all of these technologies rely on the reduction of quality and the removal of irrelevant (inaudible or invisible) information.

Dictionary compression

The basic function of a dictionary based compression is to re-encode tokens identified in the compressed data stream as identifiers that can be decoded by looking up their original value in a dictionary. Since the dictionary in a data stream with maximum entropy would be the original data stream, compressors will dynamically switch between dictionary lookups and direct encoding of the processed data.

The most widely used algorithm, defined by Lempel-Ziv, uses a sliding window that contains all processed characters, searches that buffer for repetitive values, and encodes any matches by giving the offset and length in the window buffer. It is clear that the size of the dictionary is important for the effectiveness of the compressor. On the other hand, an increased size requires longer offset encodings and also consumes more memory, since the dictionary needs to be kept in memory for encoding and decoding. The following example, taken from Wikipedia, shows a simple compression of the word ANANAS with a window size of eight characters:

Window position:   0 1 2 3 4 5 6 7

Backlog (window)    Preview   Message
- - - - - - - -     ANANAS    (0,0,A)
- - - - - - - A     NANAS     (0,0,N)
- - - - - - A N     ANAS      (6,2,A)
- - - A N A N A     S         (0,0,S)

The first two characters find no match in the backlog and are emitted as literal triples (0,0,character). The repeated "AN" is then found at window positions 6-7 and encoded as the triple (6,2,A), which covers "ANA" in one message; the final S is again emitted as a literal.

In addition to the encoding of offsets, the probability of particular patterns could be used to create shorter encodings by applying Huffman encoding.
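The sliding-window matching above can be sketched in a few lines of code. The following is a minimal illustration only, not a production compressor: it emits (offset, length, next character) triples as in the ANANAS example, and for inputs that fit into the window it reports the absolute window position as the offset.

import java.util.ArrayList;
import java.util.List;

public class Lz77Sketch {

    // (offset, length, next) triple; literals are encoded as (0, 0, next)
    record Token(int offset, int length, char next) {
        @Override public String toString() { return "(" + offset + "," + length + "," + next + ")"; }
    }

    static List<Token> compress(String input, int windowSize) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0, bestOff = 0;
            // search the backlog for the longest match of the upcoming characters
            for (int i = Math.max(0, pos - windowSize); i < pos; i++) {
                int len = 0;
                while (i + len < pos && pos + len < input.length() - 1
                        && input.charAt(i + len) == input.charAt(pos + len)) {
                    len++;
                }
                if (len > bestLen) {
                    bestLen = len;
                    bestOff = windowSize - (pos - i); // absolute position in a right-aligned window
                }
            }
            tokens.add(new Token(bestOff, bestLen, input.charAt(pos + bestLen)));
            pos += bestLen + 1;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [(0,0,A), (0,0,N), (6,2,A), (0,0,S)]
        System.out.println(compress("ANANAS", 8));
    }
}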

ZIP

Many are familiar with the various ZIP formats. All of these are partially based on dictionary-type compression algorithms and have their roots in the Lempel-Ziv (LZ77) algorithm and entropy encoding using Huffman trees.

BZip2

The BZIP2 compressor uses the lossless Burrows-Wheeler transform (BWT). The BWT groups identical symbols together via sorted rotations; for example, the text ".BANANA_" becomes "BNN.AA_A", where the ‘_’ signifies an EOF symbol. The BWT exposes the true entropy of the original file at lower Markov orders. Given that the BWT was designed for textual data, BZIP2 can be expected to perform well on XML data, although the BWT is also effective on data other than plain text.

7Zip

This toolkit is based on the LZMA algorithm, which is an improved version of the Lempel-Ziv algorithm combined with arithmetic encoding, Huffman encoding, and other entropy encoders.

GZIP

The GZIP compressor is a variation of the LZ77 algorithm. While symbol combination match lengths are stored in one Huffman tree, a separate Huffman tree stores match distances. The compression setting used at runtime controls the examined match length: a higher compression setting looks for longer matches, but takes longer to execute. The maximum match length in GZIP is 258 bytes and the maximum match distance is 32 KB.

XML Compression Tools

OGC Binary XML (BXML)

OGC Binary XML (BXML) is a binary representation of XML. It was developed by the Open Geospatial Consortium (OGC) in 2006 as part of the OGC Web Services 2 Initiative (OWS) as a standard to allow XML documents to be transmitted in a compact manner over mobile networks.

The OGC Web Services Initiative is part of the OGC's Interoperability Program: a global, collaborative, hands-on engineering and testing program designed to deliver prototype technologies and proven candidate specifications into the OGC's Specification Development Program. In OGC Interoperability Initiatives, international teams of technology providers work together to solve specific geo-processing interoperability problems posed by Initiative sponsors.

Efficient XML (EXI)

EXI is a knowledge-based encoding that uses a set of grammars to determine which events are most likely to occur at any given point in an EXI stream and encodes the most likely alternatives in fewer bits. It does this by mapping the stream of events to a lower-entropy set of representative values and encoding those values using a set of simple variable-length codes or an EXI compression algorithm.

The result is a very simple, small algorithm that uniformly handles schema-less encoding, schema-informed encoding, schema deviations, and any combination thereof in EXI streams. These variations do not require different algorithms or different parsers; they are simply informed by different combinations of grammars.

Since EXI will become more important towards the end of this document, this section describes the general function of EXI in more detail. Consider the following example XML document taken from the EXI primer document [1] :

<?xml version="1.0" encoding="UTF-8"?>

<notebook date="2007-09-12">

<note category="EXI" date="2007-07-23">

<subject>EXI</subject>

<body>Do not forget it!</body>

</note>

<note date="2007-09-12">

<subject>Shopping List</subject>

<body>milk, honey</body>

</note>

</notebook>

The EXI compressor will separate the payload in the document from its structure and will encode each separately.

Since each content value can be encoded with a different encoder, the encoded content can have very low entropy. With compression enabled, the content is rearranged to allow dictionary-based compression routines to reduce the size of the compressed data stream even more than without reordering.

EXI applies basically three concepts to achieve its compression ratios:

Separate streams for the encoded values of different types.

Probability-driven encoding of the different event codes (tokens) to allow minimal message lengths.

Use of only string and integer encodings, with separate encoding strategies depending on the value ranges of the individual types (varying bit lengths for integers and reduced character sets for strings).
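To make this concrete, the sketch below shows how a document is typically encoded with the Exificient library that was used for the benchmarking in this report. This is only a sketch assuming Exificient's com.siemens.ct.exi SAX API; package and method names may differ between versions, and the file names are placeholders.

import java.io.FileOutputStream;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

import com.siemens.ct.exi.CodingMode;
import com.siemens.ct.exi.EXIFactory;
import com.siemens.ct.exi.GrammarFactory;
import com.siemens.ct.exi.api.sax.EXIResult;
import com.siemens.ct.exi.helpers.DefaultEXIFactory;

public class ExiEncodeSketch {
    public static void main(String[] args) throws Exception {
        EXIFactory exiFactory = DefaultEXIFactory.newInstance();
        // schema-informed grammars ("EXI+XSD" in the measurements below)
        exiFactory.setGrammars(GrammarFactory.newInstance().createGrammars("notebook.xsd"));
        // additional channel compression ("EXI+XSD+comp." in the measurements below)
        exiFactory.setCodingMode(CodingMode.COMPRESSION);

        EXIResult exiResult = new EXIResult(exiFactory);
        exiResult.setOutputStream(new FileOutputStream("notebook.xml.exi"));

        // stream the XML document through a standard SAX parser into the EXI encoder
        XMLReader xmlReader = XMLReaderFactory.createXMLReader();
        xmlReader.setContentHandler(exiResult.getHandler());
        xmlReader.parse(new InputSource("notebook.xml"));
    }
}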

Measurements

The following comparison numbers were collected from different XML data sources compressed with different XML compression tools:

Source          LFPG¹       %        WSSS¹       %        SAEZ¹     %        aip_yyyyy_ndbx   %
XML             3.723.640   100,00   2.947.233   100,00   881.778   100,00   24.104.615       100,00
BXML            1.884.391   50,61    1.644.394   55,79    487.589   55,30    8.796.351        36,49
EXI             1.517.161   40,74    1.361.592   46,20    427.001   48,43    3.787.606        15,71
EXI+XSD         875.967     23,52    799.286     27,12    253.916   28,80    3.845.578        15,95
EXI+XSD+comp.   555.087     14,91    514.747     17,47    163.037   18,49    1.164.105        4,83

Source          katl_11255²   %        ktus_11255²   %        eham_11283²   %        ksea_11255²   %
XML             2.345.944     100,00   1.003.904     100,00   2.936.236     100,00   1.094.903     100,00
BXML            765.002       32,61    314.082       31,29    954.168       32,50    342.386       31,27
EXI             1.284.246     54,74    554.125       55,20    1.568.598     53,42    580.995       53,06
EXI+XSD         711.920       30,35    304.201       30,30    875.992       29,83    329.330       30,08
EXI+XSD+comp.   469.258       20,00    197.261       19,65    576.553       19,64    208.499       19,04

Source          Point Obst.³   %        Line Obst.³   %        Poly. Obst.³   %        Terrain³    %
XML             1.014.398      100,00   2.188.846     100,00   24.959.412     100,00   3.337.202   100,00
BXML            758.225        74,75    322.428       14,73    6.976.632      27,95    NA
EXI             271.332        26,75    121.433       5,55     2.046.298      8,20     3.261.515   91,7
EXI+XSD         191.484        18,88    91.967        4,20     2.267.442      9,08     1.722.850   51,63
EXI+XSD+comp.   42.543         4,19     27.272        1,25     703.420        2,82     482.078     14,45

¹ Sample AMDB data from 2006, also using the standard BXML encoder from cwxml (schema-less).

² Production AMDBs from 2011, using the Jeppesen ARINC 816 BXML encoder (using XSDs).

³ Data from the PILAS research project (using XSDs).

Source           AMDB        %        NDBX         %        Obstacles   %        Terrain     %
XML              2.936.236   100,00   24.104.615   100,00   1.014.398   100,00   3.337.202   100,00
BXML             954.168     32,50    8.796.351    36,49    758.225     74,75    NA
EXI              875.992     29,83    3.845.578    15,95    191.484     18,88    1.722.850   51,63
EXI compressed   576.553     19,64    1.164.105    4,83     42.543      4,19     482.078     14,45

A similar evaluation on smaller message sizes has been performed in a study by the IETF: http://tools.ietf.org/html/draft-shelby-6lowapp-encoding-00

Encoding   Complexity   RDF Test     SE Test      SensorML Test
XML        Med          206B         409B         300B
EXI        Low          6B (3%)      13B (3%)     57B (19%)
BXML       Med          177B (86%)   210B (51%)   177B (59%)
FI         Med          143B (69%)   200B (49%)   185B (62%)

A comparison between EXI and GZIP can be found here: http://www.w3.org/TR/2009/WD-exi-evaluation-20090407/ . The result there is also that EXI is at least as good as GZIP under all scenarios.

Custom Compression: NavZip

The prototype compressor developed as part of the project tried to achieve better compression ratios by combining several strategies to reduce the entropy of the data:

Externalized Dictionary: Leverage an external dictionary that is shared between the source compressing the data and the target system inflating the provided data. With the dictionary being externalized, this information does not need to be encoded with the compressed data and thereby reduces the overhead.

Strict Lexical Rules: Enforce general lexical rules to allow correct reconstruction of the data. These rules can be understood as part of the externalized dictionary. Since these rules mainly address structural information and whitespace, they are not explicitly defined in the dictionary.

Domain Knowledge-Based Encoding: Utilize special encoders for particular values that reduce the required bandwidth for the lossless transport of the data. Knowledge about probabilities, value ranges, and other specifics of the encoded data is used to optimize the encoding.

Extended Structural Matching: Unlike the lookback pattern-matching strategies of typical dictionary-based compressors, the new compression algorithm uses structural patterns to separate the payload from the structural overhead of XML.

Value Channels: Values of the same domain are expected to share the same entropy. In order to allow better compression, it makes sense to collect the values that are distributed across the compressed document in central buckets. Since each value is identified by its type, each type has an associated channel to store its values in.

Parsing and reconstructing XML data

In order to accomplish the correct reconstruction of the compressed XML data at the target without providing each single piece of data in the file, a few lexical rules are defined to allow unambiguous document construction:

Documents are UTF-8 encoded

Whitespaces are ignored except in attribute values

The following particle types are forbidden:

Processing instructions

Comments

CDATA fragments

Empty attributes

Elements must not have mixed content

Elements without children are finished without a separate closing tag

The example below is parsed by the compressor and converted into a token stream that in its raw form can be serialized to an output stream.

<ndbx:A829Database xmlns:xlink="http://www.w3.org/1999/xlink"

xmlns:ndbx="http://www.arinc.com/aeec/xmlschema/xxx"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<startOfValidity>2008-11-20Z</startOfValidity>

<endOfValidity>2008-12-17Z</endOfValidity>

<NDBXRecord xsi:type="ndbx:NavaidSystemType" codeICAO="FA" area="AFR" identifier="1000000000">

<NavaidSystemTimeslice name="ALEXANDER BAY" type="VOR/DME" designator="ABV">

<uses_NavaidComponent xsi:type="ndbx:DMEType" stationDeclination="-18.0" figureOfMerit="2">

<hasLocation_ElevatedPoint elevation="98" position="28.570694 16.533889"/>

</uses_NavaidComponent>

<uses_NavaidComponent xsi:type="ndbx:VORType" figureOfMerit="2" additionalInfo="W" usage="H" frequency="112.1">

<hasLocation_ElevatedPoint position="28.570694 16.533889"/>

</uses_NavaidComponent>

</NavaidSystemTimeslice>

</NDBXRecord>

With these restrictions in place, the number of potential tokens created by a parser can be reduced to four basic tokens:

S: Start Element or Start Document

A: Attribute name and value

C: Character data

E: End Element or End Document

A standard XML parser parsing the input document can fire these token events. The token stream can be used to reconstruct the original document in exactly the same lexical form as the input document. The token stream can also be used as input for more complex processing.
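A standard SAX ContentHandler maps almost directly onto these four token types. The sketch below is a minimal illustration (a hypothetical class, not the project's actual parser; the dictionary ids shown in the streams below are omitted):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TokenizingHandler extends DefaultHandler {

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        System.out.println("S[" + qName + "]");                 // S: start element
        for (int i = 0; i < atts.getLength(); i++) {            // A: one token per attribute
            System.out.println("A[" + atts.getQName(i) + "][" + atts.getValue(i) + "]");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();     // whitespace is ignored (see rules above)
        if (!text.isEmpty()) {
            System.out.println("C[" + text + "]");              // C: character data
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        System.out.println("E");                                // E: end element
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new java.io.File(args[0]), new TokenizingHandler());
    }
}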

The XML fragment from above is converted into the following token stream:

S[ndbx:A829Database][0]

S[startOfValidity][1]

C[2008-11-20Z]

E

S[endOfValidity][2]

C[2008-12-17Z]

E

S[NDBXRecord][3]

A[xsi:type][4][ndbx:NavaidSystemType]

A[codeICAO][5][FA]

A[area][6][AFR]

A[identifier][7][1000000000]

S[NavaidSystemTimeslice][8]

A[name][9][ALEXANDER BAY]

A[type][10][VOR/DME]

A[designator][11][ABV]

S[uses_NavaidComponent][12]

A[xsi:type][4][ndbx:DMEType]

A[stationDeclination][13][-18.0]

A[figureOfMerit][14][2]

S[hasLocation_ElevatedPoint][15]

A[elevation][16][98]

A[position][17][28.570694 16.533889]

E

E

S[uses_NavaidComponent][12]

A[xsi:type][4][ndbx:VORType]

A[figureOfMerit][14][2]

A[additionalInfo][18][W]

A[usage][19][H]

A[frequency][20][112.1]

S[hasLocation_ElevatedPoint][15]

A[position][17][28.570694 16.533889]

E

E

E

E

…

The individual token events are encoded using two different bit lengths. Since the ultimate objective is to replace most tokens by a higher-level encoding, another virtual token was defined to indicate a complex structure. This is expected to be the majority of entries in the stream, so it got the shortest encoding (1 bit), whereas all other tokens are encoded using three bits:

Bit 0   Bit 1   Bit 2   Length   Token   Meaning      Payload
1       –       –       1        M       Match        Match-Id
0       0       0       3        A       Attribute    Name, Value
0       0       1       3        C       Characters   Value
0       1       0       3        S       Start        Name
0       1       1       3        E       End          –

The fifth token M is used to express complex data types. These complex types are identified by matching fragments in the token stream against predefined patterns, which in turn can be used to extract the variant values and encode them using specialized encoders that utilize domain knowledge to reduce the redundancy of the textual representation found in the XML document.

Given the following simple pattern:

<complexToken tokenId="4">

<match>

<startOfValidity>$1</startOfValidity>

</match>

<encoding>

<token placeholder="1" encoder="dateEncoder" />

</encoding>

</complexToken>

Applied to the token stream from the initial XML example, this leads to a modified token stream in which the following token sequence was matched and the payload value ‘2008-11-20Z’ was extracted:

S[startOfValidity][1]

C[2008-11-20Z]

E

The unique identifier of this match is associated with the match token and written to the final token stream, followed by all values extracted into their associated placeholders:

S[ndbx:A829Database][0]

M[4]

V[2008-11-20Z][dateEncoder]

M[5]

V[2008-12-17Z][dateEncoder]

…

The new ‘virtual’ token V introduced here is used to carry the combination of encoder and the value to be encoded. Once the token stream is written out to the target output stream, the encoder converts the provided value into the specialized internal representation.

Since the particular match token is associated with a fixed set of encoders, the tokenId of the match is sufficient for the extraction algorithm to determine the correct extraction strategy.


The match allows storing data very efficiently since no structural information needs to be provided except the tokenId.

Value Encoders

Since the data domain of the compressed documents is well known, we can apply specialized tools to encode the individual values found in an XML document.

In order to select the best encoder several methods were evaluated:

Type Inference using XML Schema

It is safe to assume that any data processed by the avionic compressor is defined through an XML schema that can be used to determine the data type of each single particle found in the processed document.

The following fragment from an XML schema file describes different types for a decimal angle datum:

<simpleType name="valAbstractAngle">

<restriction base="xsd:decimal">

<minInclusive value="-360"/>

<maxExclusive value="360"/>

</restriction>

</simpleType>

<simpleType name="valMagVar">

<annotation>

<documentation/>

</annotation>

<restriction base="ndbx:valAbstractAngle">

<minInclusive value="-180"/>

<maxExclusive value="180"/>

<totalDigits value="4"/>

<fractionDigits value="1"/>

</restriction>

</simpleType>

The example above shows the type valAbstractAngle that inherits all features from the system datatype xsd:decimal and reduces the value range to [-360;360). The second type valMagVar inherits from the valAbstractAngle type but further reduces the domain to [-180,0;180,0).

Whenever a particle with one of these types is detected, the reduced domain ranges can be used to reduce the number of bits written to the target stream.

However, this method is limited, since the expression capabilities of XML Schema do not allow expressing all special characteristics of a value range. The valFrequency type below is described merely as a decimal number.

<simpleType name="valFrequency">

<annotation>

<documentation/>

</annotation>

<restriction base="xsd:decimal"/>

</simpleType>

In fact, frequency values can only range between 108.00 and 135.00 MHz in 0.025 MHz steps. This way the frequency values can be expressed as f = 108 + x * 0.025, with x being an 11-bit value covering the possible values from 0 to 1080. This special scaling and transforming of the original values cannot be expressed in XML Schema.

In these cases, the type inference should account for domain-specific encoders that override the default encoders provided for the default schema types.

Pattern Detection with explicit encoding

Besides simpleType definitions, XML Schema allows the definition of complexType particles that combine simple types and complex types and define references and cardinalities.

It is desirable to find complex types that can be fully reduced to the contained payload data without losing the ability to reconstruct the original XML fragment from the encoded payload data. Initial development of a fully dynamic type inference for complex types was started, but due to the complexity of XML Schema and the resulting ambiguities it was cancelled, and a less complex pattern-matching process was implemented.

An example for such a pattern-based encoder can be found below:

<complexToken tokenId="1">

<match>

<boundedBy>

<lowerLeft>

<latitude>$1</latitude>

<longitude>$2</longitude>

</lowerLeft>

<upperRight>

<latitude>$3</latitude>

<longitude>$4</longitude>

</upperRight>

</boundedBy>

</match>

<encoding>

<token placeholder="1" encoder="latitudeEncoder" />

<token placeholder="2" encoder="longitudeEncoder" />

<token placeholder="3" encoder="latitudeEncoder" />

<token placeholder="4" encoder="longitudeEncoder" />

</encoding>

</complexToken>

The fragment inside the match element is the XML data to be matched. The placeholders are numbered and prefixed with a $-sign. Once the pattern has been found, the placeholders are associated with the values found at their positions. These values are then encoded using the related encoders defined in the encoding element. Since the fragment is well known, the types of the found values are also well known and can be encoded using the most suitable encoder.

<Data_Block xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns3="http://www.w3.org/2001/XMLSchema">

<boundedBy>

<lowerLeft>

<latitude>4775E-2</latitude>

<longitude>1025E-2</longitude>

</lowerLeft>

<upperRight>

<latitude>480E-1</latitude>

<longitude>105E-1</longitude>

</upperRight>

</boundedBy>

The XML fragment above is taken from a terrain data file. The boundedBy element is matched by the complexToken definition above and converted into a match token followed by four encoded values:

S[Data_Block][0]

M[1]

V[4775E-2][latitudeEncoder]

V[1025E-2][longitudeEncoder]

V[480E-1][latitudeEncoder]

V[105E-1][longitudeEncoder]

These tokens are then serialized into their associated value channels.

Value Channels

Without separate channels, all the tokens coming from the parser and the follow-up matching toolkit would be serialized into one binary output stream:

Token-Stream                   Header-Bits   Data
S[Data_Block][0]               0 1 0         [10][Data_Block]
M[1]                           1             [1]
V[4775E-2][latitudeEncoder]    –             [4775E-2]
V[1025E-2][longitudeEncoder]   –             [1025E-2]
V[480E-1][latitudeEncoder]     –             [480E-1]
V[105E-1][longitudeEncoder]    –             [105E-1]
…                              …             …

If two encoders generated serialized values that could easily be reduced by an entropy encoder, this would not work in a single stream, since values from other encoders interleaved between them would prevent the compression from working efficiently.

To overcome this limitation, the output of each encoder is directed to its own associated output stream:

Token-Stream                   Main Token Stream           Stream              Stream
                               Header-Bits   Data          "latitudeEncoder"   "longitudeEncoder"
S[Data_Block][0]               0 1 0         [10]          –                   –
M[1]                           1             [1]           –                   –
V[4775E-2][latitudeEncoder]    –             –             [4775E-2]           –
V[1025E-2][longitudeEncoder]   –             –             –                   [1025E-2]
V[480E-1][latitudeEncoder]     –             –             [480E-1]            –
V[105E-1][longitudeEncoder]    –             –             –                   [105E-1]
…                              …             …             …                   …

The above example shows how the contextual information is removed through the existence of the M[1] token. The distribution of the values across different streams is just an enhancement for follow-up compressors after the token stream has been written out.
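The channel bookkeeping itself is simple; the sketch below (hypothetical names, not the project's code) lazily creates one buffer per encoder name, so that each encoder's output stays contiguous for the follow-up compressor:

import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class ValueChannels {

    // one buffer per encoder name, created on first use; insertion order is kept
    // so the decoder can locate the channels deterministically
    private final Map<String, ByteArrayOutputStream> channels = new LinkedHashMap<>();

    ByteArrayOutputStream channel(String encoderName) {
        return channels.computeIfAbsent(encoderName, name -> new ByteArrayOutputStream());
    }

    Map<String, ByteArrayOutputStream> all() {
        return channels;
    }
}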

Encoders

String Encoder

One of the most important encoders is the string encoder. Basically, encoding a string breaks down to converting the provided string into a set of byte values and storing the number of bytes along with the actual bytes in the target stream.

Some experiments have been done to determine often-used character subsets. One subset that was identified is the set of alphabet letters and some punctuation characters. With this limited character set, many strings can be encoded using five bits per character, allowing a total of 32 different characters.

These specialized strings are written to their own stream and an indicator is stored with the string length.
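A minimal sketch of such a restricted-alphabet encoding follows; the concrete 32-character set shown here is an assumption, not the set actually chosen in the project:

public class FiveBitStringEncoder {

    // assumed 32-character alphabet: 26 letters plus a few punctuation characters
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ .,-/'";

    static String encode(String s) {
        StringBuilder bits = new StringBuilder();
        for (char c : s.toCharArray()) {
            int code = ALPHABET.indexOf(c);            // 0..31, fits into 5 bits
            if (code < 0) {
                throw new IllegalArgumentException("not encodable: " + c);
            }
            for (int b = 4; b >= 0; b--) {
                bits.append((code >> b) & 1);
            }
        }
        return bits.toString();
    }
}

Encoding "ALEXANDER BAY" this way takes 13 × 5 = 65 bits instead of 13 × 8 = 104 bits, even before any follow-up compression.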

Integer Encoder

The integer encoder tries to reduce the number of bits used for encoding by splitting the total number of bits into chunks with a continuation flag and a separate sign flag:

Value: 685
Bit representation: 00000000000000000000001010101101 (32 bits)

Window size: 7 bits

Encoded value:
0        [sign]       1 bit
1        [continue]   1 bit
0101101  [data]       7 bits
0        [continue]   1 bit
0000101  [data]       7 bits
                      (17 bits total)

Since the best encoding would have no additional flags and only the number of bits required for the value range of the encoded values, the integer encoder can be customized. On the one hand, the window sizes for the first window and for the following windows can be defined separately. This allows capturing most of the values in the first window without the need for a second window that carries just a few additional bits. The same value as above with a window size of 10 bits would have been encoded like this:

Window size: 10 bits     Encoded value: 0 0 1010101101 (12 bits)

In addition to the window sizes, the encoder can also be customized with the value range. If, for example, the minimum value were -30 and the maximum value 30, the encoder would determine that the width of the range fits into an unsigned integer from 0 to 60, requiring only six bits for encoding and no sign bit. If the window size is larger than or equal to that, the encoder would not even write out any continuation bits, and the window size would be reduced to the number of bits required.

If the value above had a range from 0 to 700, it would fit into 10 bits, would have no sign bit, and the encoding would look like this:

Window size: 10 bits     Encoded value: 1010101101 (10 bits)
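The windowed encoding can be sketched as follows (a hypothetical stand-alone helper; bit strings are produced instead of a real bit-level output stream, purely for readability):

public class WindowedIntEncoder {

    // sign bit, then (continuation bit + data window) groups, least significant window first
    static String encode(int value, int firstWindow, int nextWindow) {
        StringBuilder out = new StringBuilder();
        out.append(value < 0 ? '1' : '0');            // sign flag
        long v = Math.abs((long) value);
        int window = firstWindow;
        do {
            long chunk = v & ((1L << window) - 1);    // lowest 'window' bits
            v >>>= window;
            out.append(v != 0 ? '1' : '0');           // continuation flag
            out.append(toBits(chunk, window));
            window = nextWindow;
        } while (v != 0);
        return out.toString();
    }

    static String toBits(long v, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = width - 1; i >= 0; i--) sb.append((v >> i) & 1);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(685, 7, 7));   // 0 1 0101101 0 0000101 -> 17 bits
        System.out.println(encode(685, 10, 10)); // 0 0 1010101101       -> 12 bits
    }
}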

Enumeration Encoder

One of the often-used features in XML schema definitions is using an enumeration to restrict the value domain of a simple type. With a limited set of values, the encoding of these values is reduced to mapping the enumeration values to integer values that can be stored instead of the enumeration values.

<encoder name="areaEncoder" class="com.jeppesen.ar.airzip.encode.simple.EnumerationEncoder">

<parameter name="bitLength">4</parameter>

<parameter name="mappings">

<map>

<value key="0">EUR</value>

<value key="1">EEU</value>

<value key="2">AFR</value>

<value key="3">MES</value>

<value key="4">SPA</value>

<value key="5">PAC</value>

<value key="6">CAN</value>

<value key="7">USA</value>

<value key="8">LAM</value>

<value key="9">SAM</value>

</map>

</parameter>

</encoder>

Besides the mapping table, the encoder needs the desired number of bits to encode the index values from the mapping table.
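At runtime the enumeration encoder reduces to a table lookup; a minimal sketch (hypothetical class mirroring the configuration above):

public class EnumerationEncoderSketch {

    private final String[] values;   // index in the array corresponds to the integer key

    EnumerationEncoderSketch(String... values) {
        this.values = values;
    }

    // returns the key that is then written out using the configured bitLength
    int encode(String value) {
        for (int i = 0; i < values.length; i++) {
            if (values[i].equals(value)) return i;
        }
        throw new IllegalArgumentException("value not in enumeration: " + value);
    }

    String decode(int key) {
        return values[key];
    }
}

With the mapping above, the value "AFR" is encoded as the key 2 and written out using the configured four bits.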

Floating point Encoder

During the project different encoding strategies for floating-point values have been tested. The most straightforward approach would have been to separate the three components of an IEEE 754 floating-point value:

sign

exponent

mantissa

Depending on the type of floating-point data, these components would have required 32 bits for float and 64 bits for double values in total. Since mantissa and exponent are encoded as powers of two, encoding values from the decimal system this way usually eliminates the specific characteristics of the decimal value range and cannot be used to enhance the binary encoding.

The second approach was to apply a fixed number of digits to allow the values to be expressed as integer values.

The third and finally chosen approach was to split the numbers into the integer portion (with sign) and keep the fractional part as an integer of its own. These two values were then stored as two integer values. Since the same rules that were applicable to integer numbers could be applied to these two parts of a floating-point value, this encoding strategy turned out to be the most efficient.
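A sketch of this splitting, operating on the textual value (a hypothetical helper; for brevity it ignores the sign handling needed for values between -1 and 0):

public class DecimalSplitSketch {

    // splits e.g. "-18.0" into {-18, 0, 1}: integer part, fractional part, fraction digit count
    static int[] splitDecimal(String literal) {
        String[] parts = literal.split("\\.");
        int intPart = Integer.parseInt(parts[0]);                   // carries the sign
        int fracPart = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        int fracDigits = parts.length > 1 ? parts[1].length() : 0;  // distinguishes ".05" from ".5"
        // caveat: "-0.5" loses its sign here; a real encoder needs a separate sign flag
        return new int[] { intPart, fracPart, fracDigits };
    }
}

Both parts can then be fed through the customizable integer encoder described above.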

Frequency Encoder

The frequency encoder is a specific integer encoder that takes the original value and passes it into the following formula to determine the actual value to store:

x = (f - 108.00) / 0.025

Since the frequency values are in the range between 108.00 and 135.00 MHz, this leads to values between 0 and 1080, which can be stored within 11 bits.
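A sketch of this mapping (assuming MHz values with the 25 kHz channel spacing stated above):

public class FrequencyEncoderSketch {

    // maps 108.00..135.00 MHz to 0..1080, which fits into 11 bits
    static int encode(double mhz) {
        return (int) Math.round((mhz - 108.00) / 0.025);
    }

    static double decode(int x) {
        return 108.00 + x * 0.025;
    }

    public static void main(String[] args) {
        System.out.println(encode(112.1)); // 164
        System.out.println(decode(164));   // 112.1 (up to floating-point rounding)
    }
}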

Results

While the current version of the NavZip compressor already achieves the best compression ratios in the field, it is worth questioning whether this approach should be used to define a common standard for compressed XML data, or whether other approaches are comparably good. In addition, other requirements on the algorithm need to be taken into account, like processing speed, memory consumption, or direct access to the encoded data.

Benchmark Comparison

The tools used for compression can be divided into four main categories:

Standard Compression Tools: These are the common compression tools available on almost any platform. We chose ZIP as the best-known compression tool and three others to see if there are significant differences between the deployed techniques.

Binary Serialization: As an opposite approach to structural analysis and full recovery of the original XML, the data contained in the document was loaded into memory as Java objects and then serialized to disk. The required class definitions were created using the JAXB code generator. All created classes implemented the standard interface Serializable for persistence. In the table below the sizes for the raw serialized data as well as its ZIP-compressed equivalent are listed.

Binary XML: The currently deployed BXML compression technique applied in some ARINC standards.

Efficient XML Interchange: The upcoming new standard defined by the W3C consortium. Several options can be used for the compressor, which have been used in combination to determine the different resulting ratios.

EXI: Plain encoding with the EXI grammar encoder

EXI+XSD: Encoding with the XML schema of the encoded file provided.

EXI+XSD+comp: Same as above with additional gzip compression

NavZip: This is the compression tool created as part of the evaluation project. Since this is a prototypical implementation of different compression and encoding techniques, we have also evaluated different options:

NavZip: This is the typical approach that facilitates the XML schema and an external dictionary.

NavZip (5bit, aligned): This is basically the typical NavZip approach from above, but it detects special strings that use only a limited character set of 32 characters and encodes these using 5-bit numbers into a separate stream. These 5-bit tuples are aligned on byte boundaries to allow the follow-up compression to have an effective lookback-buffer matching (which is usually byte-aligned).

NavZip (PLAIN): This is the NavZip compressor without a schema or a dictionary.

For the comparison of the current NavZip incarnation we chose the largest file, aip_yyyyy_ndbx.xml, with a size of about 24 MB. This file is a sample NDBX file according to the spec from 2009, containing only a few datatypes like radio navaids, runways, and airports/helipads.

The table below shows the resulting sizes as well as the relative size of the compressed data compared to the original size of the document:

File: aip_yyyyy_ndbx            Size in bytes   Relative size in %
XML                             24.104.615      100,00

Standard Compression Tools
7Zip                            1.375.042       5,70
BZip2                           1.142.933       4,74
ZIP                             1.800.286       7,47
Gzip                            1.821.623       7,56

Binary Serialization
Serializer                      17.502.321      72,61
Serializer+ZIP                  2.031.731       8,43

Binary XML
BXML                            8.796.351       36,49

Efficient XML Interchange
EXI                             3.787.606       15,71
EXI+XSD                         3.845.578       15,95
EXI+XSD+comp.                   1.075.907       4,46

NavZip
NavZip (Uncompressed)           1.777.138       7,37
NavZip (Compressed)             1.106.141       4,59
NavZip (5bit, aligned)          996.830         4,14
NavZip (PLAIN)                  1.461.059       6,06

In addition to the navigation data above, a brief test was conducted with terrain data providing a grid of elevation posts encoded in an XML document:

Source           MR_477501025   %        MR_487501100   %
XML              3.535.827      100,00   244.200        100,00
EXI+adb+strict   482.078        23,52    799.286        27,12
NavZip           891.909        23,95    810.776        27,51

While the compression ratio of the NavZip algorithm appears to be the best in the field, the EXI compressor is not too far off to be regarded as a good alternative. Since Efficient XML Interchange is an official W3C standard, it should be considered whether the results from the NavZip effort could be used to extend the EXI standard to provide a standards-based data delivery with Jeppesen-specific enhancements.

Enhancements for NavZip

Enhanced Encoders

The current encoders provide full symmetric encoding. With deeper knowledge about the requirements on the data quality (e.g., a resolution of 10^-5 for geo coordinate values), the encoders could be further enhanced. Also, if a particular distribution of individual values exists in a type's value domain, the encoding of these values could utilize this knowledge to generate variable-length encodings that would ultimately decrease the average bit length used to encode values of these types.

Statistics for better encoding

The following table shows a part of the statistical data that was collected compressing the reference document aip_yyyyy_ndbx.xml used for all the measurements above.

The Count column shows the number of times the encoder was used, and Encoded Bits shows how many bits this encoder produced. The average number of bits per use is shown in the next column. These numbers are the current status quo. The data in the remaining columns should give some insight into potential improvements that could be gained by applying statistical encoding.

The ‘Count distinct values’ column is the number of distinct values that have been encoded. A value of one indicates no variation; such a value can be regarded as a constant that would not require any encoding at all.

The next column, ‘Ø count / value’, shows the average number of occurrences of each distinct value. This number is only approximate; in reality it can be massively biased towards only a few distinct values.

The ‘Bits for encoding’ value indicates the number of bits required to encode the distinct values if they were kept in a lookup table. This encoding assumes an equal distribution of the value probabilities. If a Huffman encoding were applied, the average bit length could be significantly smaller.
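The ‘Bits for encoding’ column in the table below is consistent with ceil(log2(n + 1)) bits for n distinct values, i.e. the smallest width that can hold the keys 0..n (for example, 4 distinct values yield 3 bits, 3534 distinct values yield 12 bits). A one-method sketch of this calculation:

public class BitsForEncoding {

    // ceil(log2(n + 1)): smallest bit width that can hold the keys 0..n
    static int bitsForEncoding(int distinctValues) {
        return 32 - Integer.numberOfLeadingZeros(distinctValues);
    }

    public static void main(String[] args) {
        System.out.println(bitsForEncoding(3534)); // 12, as in the table below
        System.out.println(bitsForEncoding(92));   // 7
    }
}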

Encoder-Name                         Count   Encoded Bits   Ø Bits/Item   Count distinct values   Ø count/value   Bits for encoding   Encoded Size
elevationEncoder                     24541   417.197        17,00         3534                    6,94            12                  294492
frequencyEncoder                     1416    15.576         11,00         92                      15,39           7                   9912
booleanEncoder                       16207   16.207         1,00          1                       16207,00        1                   16207
dateEncoder                          2       196            98,00         2                       1,00            2                   4
angleEncoder                         11374   125.816        11,06         454                     25,05           9                   102366
figureOfMeritEncoder                 3580    10.740         3,00          3                       1193,33         2                   7160
runwayIdentifierEncoder              7793    56.990         7,31          129                     60,41           8                   62344
codeAngleTypeEncoder                 22377   22.377         1,00          1                       22377,00        1                   22377
navaidUsageEncoder                   1416    2.832          2,00          3                       472,00          2                   2832
navaidAddtlInfoEncoder               3565    10.695         3,00          4                       891,25          3                   10695
magVarEncoder                        14584   145.840        10,00         53                      275,17          6                   87504
codeNavaidSystemTypeEncoder          3681    18.405         5,00          6                       613,50          3                   11043
integerEncoder                       15586   202.908        13,02         2679                    5,82            12                  187032
areaEncoder                          59417   237.668        4,00          9                       6601,89         4                   237668
codePublicMilitaryIndicatorEncoder   14584   29.168         2,00          3                       4861,33         2                   29168
codeLandingAreaEncoder               7793    15.586         2,00          1                       7793,00         1                   7793
unsignedIntegerEncoder               59172   848.273        14,34         1896                    31,21           11                  650892
rangePowerEncoder                    2149    4.298          2,00          2                       1074,50         2                   4298
stringEncoder                        99147   6.039.318      60,91         72437                   1,37            17                  1685499
navaidXSDTypeEncoder                 5729    11.458         2,00          4                       1432,25         3                   17187
geoPositionEncoder                   60769   4.128.767      67,94         59242                   1,03            16                  972304
identifierEncoder                    59417   721.664        12,15         59417                   1,00            16                  950672
floatEncoder                         2186    82.971         37,96         288                     7,59            9                   19674
runwayGradientEncoder                31      186            6,00          15                      2,07            4                   124
icaoEncoder                          51624   412.992        8,00          71                      727,10          7                   361368
xpointerEncoder                      9038    117.092        12,96         2668                    3,39            12                  108456

Total Encoded Size                           13.695.220                                                                               5.859.071
Percent                                      100 %                                                                                    42,78 %

With these approximate values, the resulting size could be massively reduced: the encoded stream could be reduced by roughly half of the original size (without compression applied). Huffman encoding would reduce the total size even further [2] . Depending on the chosen approach, there would be a remaining penalty added to the total size if the value mappings needed to be provided as part of the data stream to inform the decoder of the correct lookup values and key lengths.

Stream grouping

Currently each stream is compressed separately, and each stream has its own set of zip-specific adornments that may add significant overhead. As in the EXI standard, it would make sense to group smaller streams together to avoid this penalty.

Summary

Recommendations

The above results indicate that the NavZip algorithm still has sufficient potential to further improve the current compression ratios. Compared to the EXI standard, however, the gain will most likely be minimal. The question that will make the difference is how the encoded data will be used on the target system. While database servers will prefer the compact compressed format, other systems may require direct access to the data without the overhead of inflating the whole file first.

The EXI standard addresses both use cases, well aware that direct access is hard to achieve with compressed data. EXI defines relatively fine-grained control over the encoding process without making the process of decoding overly complex. However, it is interesting to see how close the two approaches are. With a significant amount of collective thought and effort having been spent on this W3C standard, the EXI encoding is very elaborate.

That encoding is certainly more powerful than the pattern-based encoding approach taken in NavZip. The domain-specific know-how built into NavZip, on the other hand, cannot be found in EXI, since EXI is defined to be generic in order to serve any type of XML data.

Especially with EXI being the new W3C standard for compression, there will be no real use for a second XML compression standard unless it is dramatically more efficient than EXI. The next section discusses potential approaches for Jeppesen to leverage this standard without sacrificing the ability to provide specific extensions that allow differentiation from its competitors.

Differentiation

Custom Encoders

The EXI specification does not allow replacing complete complex structures like the NavZip compressor does. However, through a feature called datatypeRepresentationMap, the EXI specification provides a mechanism to replace simple types with custom type implementations and to exclude them from the fairly simple EXI type inference.

The results show that even with a relatively simple encoder an improved compression ratio can be achieved. Since none of the freely available implementations of EXI compressors supports the option to apply datatypeRepresentationMap changes, the source of the OpenEXI project was modified to support at least one custom encoder:

File: aip_yyyyy_ndbx                 Size in bytes   Relative size in %
XML                                  24.104.615      100,00

Efficient XML Interchange
EXI+XSD+comp.                        1.075.907       4,46
EXI+XSD+comp. + Frequency-Encoder    1.073.700       4,45

NavZip
NavZip (5bit, aligned)               996.830         4,14

Although the improvement in the above example is just about 2.200 bytes, it shows that the standard encoding of EXI is not the most efficient. Looking at the statistical data previously collected by NavZip for the same file, it becomes evident that the expected improvement could never have been large, since the total number of frequency values is not very high.

Encoder-Name       Count   Encoded Bits   Ø Bits/Item   Count distinct values   Ø count/value   Bits for encoding   Encoded Size
elevationEncoder   24541   417.197        17,00         3534                    6,94            12                  294492
frequencyEncoder   1416    15.576         11,00         92                      15,39           7                   9912

If there were more elements to encode, the savings would obviously be bigger than in this simple example. If, for example, the elevation values could be encoded more efficiently, the savings would eventually be much higher than indicated in the table above.

But there is another advantage of custom encoders: their functional description is not part of the transferred data. It is a functional contract between the encoder and the decoder and can be treated as a secret, allowing other decoders to be excluded.

Probability Driven Encoders

Along with the use of custom encoders, the use of adaptive encodings with varying bit lengths will make a huge difference. A size reduction of 50 % or even more, as estimated in section 3.2.2, could be achieved by applying statistical data and adaptive encoding.

EXI explicitly allows user-defined metadata provided with the document to be used for the interpretation of the contained data:

"The user defined meta-data conveys auxiliary information that applications may use to facilitate interpretation of the EXI stream. The user defined meta-data MUST NOT be interpreted in a way that alters or extends the EXI data format defined in this specification. User defined meta-data may be added to an EXI Options document just prior to the alignment option."

Direct Access

One big advantage of EXI over the NavZip approach is the ability to index the data stream for direct access to the encoded elements. The feature called ‘self-contained elements’ ensures that the decoding of individual elements can be done without any previously built contextual information, which is usually built up while the stream of data is being decoded. The external indexing itself is not further specified. Jeppesen can take advantage of this and define an extended packaging format for the encoded EXI stream and the associated index information required for direct access.

However, it must be understood that self-contained elements only work in uncompressed EXI streams. The NavZip compressor could be modified in a way that it would support compressed fragments that can be accessed and expanded independently, without the whole stream being expanded.

Summary

Jeppesen should build a set of tools on top of the EXI standard that allow custom encodings, which may be just smart encodings, encrypted, or statistically optimized.

The potential of self-contained elements must be evaluated to be fully understood, and the impact on encoded documents must be measured to see the additional payload that would be created by the indexing. Since EXI is a fairly young standard, it is hard to find reference implementations that could easily be used for these evaluations, especially since the extensions mentioned in section 4.2 are not implemented in any of the tools.

EXI will enable all typical use-cases having either compressed data with low memory footprint or uncompressed data with random access and fast parsing.

The NavZip prototype must be regarded as a pool of ideas that could be used to enhance the encoding of navigational data provided by Jeppesen. However, since EXI is so prominent and even had Boeing employees helping with the development of the standard, it will be hard to get around it.

Statistics for the NavData Sample

Terrain Encoder Summary

First line for each Encoder:

Number of occurrences

Number of bits coded

Number of distinct values

Following lines:

Number & Percentage that can be coded by x bits


