Increasing Requirements For Storage Space


\section{Introduction}

With increasing requirements for storage space and growing volumes of transmitted data in many fields of communication, such as the Internet and telecommunications, data compression is used to make the best use of limited space and bandwidth.

Nowadays, data compression affects many of our daily activities, whether we are listening to music (often as MP3 files), taking pictures, or using a computer. It has reduced the amount of space devices need to store data, which in turn has had a direct effect on the size of those devices (although the shrinking of storage components themselves also plays a part). It is not only everyday life that is affected in such a direct way: scientific research uses compression to reduce the amount of data that needs to be processed. Spacecraft sent to the outer reaches of the solar system must transmit large amounts of data over long distances, a process that takes time given how far away the craft is, so data compression allows more information to be carried in a single transmission than the raw bandwidth would otherwise permit.

Nonetheless, the use of data compression has its weaknesses. In some cases data is lost, which is common in image, sound and video compression. This is known as lossy compression: some of the data is discarded, but a better compression factor can be achieved.

Another limitation is the need for decompression. Unlike lossy compression, where a decoder can often work directly on the compressed stream, lossless compression requires a full decompression step to recover the original data. Decompression requires additional processing, so the computational cost is higher.

Compression algorithms are closely related to information theory, which was developed by Claude E. Shannon in 1948 while he was researching the fundamentals of signal processing; his work forms the basis of modern data compression methods. Data compression is fundamentally bit reduction, representing data with fewer bits than the original. Based on information theory, bit reduction can be achieved using the different algorithms discussed in this report.

\section{Data Compression Algorithms Theory}

\subsection{History}

In 1948 Claude E. Shannon published a paper on information theory. In it he formulated the theory of both lossless and lossy data compression and established that there is a fundamental limit to lossless compression, called the entropy. For lossy compression Shannon developed rate-distortion theory, in which some amount of data loss is acceptable. The information theory Shannon developed set the fundamentals for future data compression algorithms.

\subsection{Source Modelling}

Modelling is the first step before entropy coding. In this stage a model of the sequence to be compressed is built in order to estimate the probability of occurrence of each symbol. There are two ways this can be done. The first is to build a frequency table based on the number of times each symbol appears in the data; from this a probability list can be derived, although the frequency list itself is often enough. The second is to rely on prior knowledge of symbol probabilities. For example, in English text the probability of one letter following another is well known, which makes it possible to build the table without reading the data first. The main drawback of this method is that a probability table for English text will not suit other languages, let alone arbitrary symbols appearing in non-text data.

Therefore it is usually better to use the first method. It is less efficient in processing time, since the data has to be read more than once, but the end result is more effective. Nonetheless there is another approach in which the table is built while the data is being read; this is called the adaptive approach and is discussed later.
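As an illustration of the two-pass modelling step, the short Python sketch below builds a frequency table in one pass and derives the probability list from it (the function and variable names here are purely illustrative, not taken from any particular compression library):

\begin{verbatim}
from collections import Counter

def build_model(data: bytes):
    """Count symbol occurrences and derive their probabilities.

    A minimal sketch of the two-pass modelling approach: the data is
    read once just to build the frequency table; coding happens later.
    """
    freq = Counter(data)                       # symbol -> number of occurrences
    total = sum(freq.values())                 # total number of symbols read
    prob = {s: f / total for s, f in freq.items()}
    return freq, prob

# The frequency table alone is enough to build a code tree later on.
freq, prob = build_model(b"abracadabra")
print(freq[ord("a")], round(prob[ord("a")], 2))    # 5 0.45
\end{verbatim}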

\subsection{Entropy Coding}

\subsection{Dictionary Coding}

\subsection{Lossless and Lossy Compression}

\subsubsection{Lossless Compression}

Lossless data compression means that the data retrieved after decompression is exactly the same as the data before compression; no information is lost. This is essential when any loss of data would make it unusable. Considering the spacecraft example from the introduction, if the data collected by the craft arrived on Earth with parts missing, it could not be relied on for research, as it might be missing vital information or might not be readable at all.

Not losing any data comes at a price: the compression factor cannot be controlled and depends solely on the algorithm used to compress the data. Furthermore, the computational cost of decompression is higher than for lossy compression, since the complete data set must be reconstructed. Some algorithms also require information about how the compression was performed to be stored in the file.

Lossless data compression typically achieves at most around a 2:1 ratio ($50\%$), and even this is hard to reach for large amounts of data. To achieve a higher ratio, lossy compression must be used.

\subsubsection{Lossy Compression}

As the name suggests, lossy compression involves loss of data. This data is not recoverable, so the amount and type of data that is discarded must be considered. Lossy compression uses the same redundancy-removal (bit-reduction) methods as lossless compression and, in addition, exploits the properties of human perception. One such property is our limited range of hearing: a piece of music covers a wide range of frequencies, some of which cannot be perceived by the human ear, so removing them does not change what we hear. This method is used in music compression, for example in MP3 files. By removing such data a much greater compression factor can be achieved, which is why lossy compression is widely used for media data such as video, audio and images.

Lossy compression has its disadvantages. Removed data cannot be retrieved, and the amount of data removed, which is set by user parameters, can leave the result unusable. In an image, for example, removing too much information can leave a picture that is no longer viewable or conveys little of the original, so the user's control over the compression rate can itself be a drawback.

There are also clear advantages over lossless compression: a higher compression rate, user control, and, although an appropriate decoder is needed to read the data, lower computational cost, since the data is already usable in its compressed form. Unlike lossless data, lossy-compressed data can be streamed and decoded without first collecting the whole file, a method used, for example, in mobile phone calls, where the listener does not wait for the speaker's whole sentence to be received and decoded before hearing it.

\subsection{Algorithms}

\subsubsection{Shannon-Fano}

The Shannon-Fano (SF) algorithm, developed by Claude E. Shannon and Robert Fano in 1948, generates a binary tree based on a frequency table.

This frequency table lists each symbol in the file together with the number of times it appears. From the frequency table a probability list, which is used to build the tree, can be constructed:

As an example, let a frequency table be $$\{(s_1, f_1), \dots, (s_n, f_n)\}$$ where $s_i$ is a symbol and $f_i$ is the frequency of that symbol.

Let $i \in \{1, \dots, n\}$ and let $x$ be the total sum of frequencies, $$x = \sum_{i=1}^{n} f_i$$

Then the probability is $$P(s_i) = \frac{f_i}{x}$$

(One can build a tree based only on the frequency list)

The Shannon-Fano tree is built top-down, recursively, from the frequency or probability list using the following steps (a code sketch follows the list):

\begin{itemize}

\item

$1^{st}$ sort the list in descending order, with the highest frequency on the left and the lowest on the right.

\item

$2^{nd}$ divide the list into two lists such that the sums of their frequencies are approximately equal.

\item

$3^{rd}$ assign 0 to the left list and 1 to the right list.

\item

$4^{th}$ apply the second and third steps recursively to each list.

The base case of the recursion is a list of size 1 (a leaf).

\end{itemize}
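The Python sketch below follows these steps: it sorts the frequency list, splits it where the two halves' frequency sums are closest, and assigns 0/1 recursively. It is only a minimal illustration of the procedure described above; the split heuristic and the names are assumptions of this sketch rather than part of any standard implementation.

\begin{verbatim}
def shannon_fano(freqs):
    """Assign Shannon-Fano prefix codes to (symbol, frequency) pairs."""
    codes = {}

    def split(items, prefix):
        if len(items) == 1:                     # base case: a leaf
            codes[items[0][0]] = prefix or "0"  # a lone symbol still needs one bit
            return
        total = sum(f for _, f in items)
        running, cut, best = 0, 1, None
        # choose the split point where the left sum is closest to half the total
        for i in range(1, len(items)):
            running += items[i - 1][1]
            diff = abs(total - 2 * running)
            if best is None or diff < best:
                best, cut = diff, i
        split(items[:cut], prefix + "0")        # left half is assigned 0
        split(items[cut:], prefix + "1")        # right half is assigned 1

    split(sorted(freqs, key=lambda sf: sf[1], reverse=True), "")
    return codes

# {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}
print(shannon_fano([("a", 15), ("b", 7), ("c", 6), ("d", 6), ("e", 5)]))
\end{verbatim}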

This leaves a binary tree in which each symbol has a binary representation based on the route taken to reach it. Only the leaves of the tree are symbols; internal nodes carry no symbols, so a route that ends at an internal node is invalid.

Using the 0s and 1s, a binary code can be created for each symbol. Since 0 means left and 1 means right, one follows the route until a leaf is reached and stores the route as a sequence of 0s and 1s. For example, going left twice and then right to reach the symbol ‘a’ gives ‘a’ the representation ‘001’. This code is unique and is called a prefix code, as no code is a prefix of another. Hence, taking the code ‘a’ = ‘001’ from the previous example, no other symbol in the tree will have a code beginning with ‘001’. The list of codes for all symbols is called a dictionary, and it is essential for the compression process discussed later. Because of how the routes are formed, symbols with higher frequencies get shorter code representations than symbols with lower frequencies.

Furthermore, low-frequency symbols may end up with codes longer than their original representation, but this matters little given the savings on high-frequency symbols.

Although efficient, this construction method does not guarantee an optimal tree, but it does guarantee that every code word length is within one bit of its theoretical ideal $-\log_2 P(s)$.

The actual compression process is simple once the dictionary has been constructed. The file is rewritten by replacing each symbol with its code and outputting the resulting sequence of 0 and 1 bits. The bits are packed into new bytes, so the resulting data is unreadable and unusable without a decompression step.
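A hypothetical sketch of this rewriting step is shown below: each symbol is replaced by its code from the dictionary and the resulting bit string is packed into bytes. A real encoder would also have to store the padding length and the code tree, which is omitted here; the names are illustrative only.

\begin{verbatim}
def pack(data, codes):
    """Rewrite data as a bit string using a prefix-code dictionary,
    then pack the bits into bytes (padded with zeros to a full byte)."""
    bits = "".join(codes[s] for s in data)
    padding = (8 - len(bits) % 8) % 8          # bits needed to reach a byte boundary
    bits += "0" * padding
    packed = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return packed, padding

packed, padding = pack("abad", {"a": "00", "b": "01", "c": "10", "d": "110"})
print(packed, padding)                         # b'\x13\x00' 7
\end{verbatim}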

\subsubsection{Huffman}

The Huffman algorithm was developed by David A. Huffman in 1951, as part of a term paper set by Professor Robert Fano.

In essence, the Huffman algorithm is the same as the Shannon-Fano algorithm. The main difference is the way the tree is built: unlike the Shannon-Fano tree, the Huffman tree is built bottom-up. By doing so Huffman overcame the flaw in Shannon-Fano coding, which does not always produce an optimal tree, although it does not come with Shannon-Fano's explicit per-symbol guarantee relative to $-\log_2 P(s)$.

As in Shannon-Fano, a frequency or probability list is required in order to build the tree and the dictionary.

Because the tree is built bottom-up, the recursive construction is different (a code sketch follows the list):

\begin{itemize}

\item

$1^{st}$, sort the frequency list in ascending order, with the lowest frequency at the beginning and the highest at the end. Initially each symbol with its frequency is a leaf.

\item

$2^{nd}$, remove the two lowest frequencies from the list and construct a node from them, with the left child being the lower frequency and the right child the higher one. Allocate 0 to the left child and 1 to the right child. The node's frequency is the sum of its children's frequencies, which, unlike in Shannon-Fano, is crucial for the next step.

\item

$3^{rd}$, insert the newly constructed node back into the frequency list, noting that the new node may itself be a tree.

\item

$4^{th}$, repeat steps 2 and 3 with the updated frequency list until the list contains a single tree.

\end{itemize}
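The sketch below follows these steps in Python, using a heap to keep the list ordered as nodes are merged. It is an illustrative implementation, not the code of any particular utility.

\begin{verbatim}
import heapq
from itertools import count

def huffman(freqs):
    """Build Huffman codes bottom-up from (symbol, frequency) pairs."""
    tick = count()                 # tie-breaker so equal frequencies never compare trees
    heap = [(f, next(tick), (s, None, None)) for s, f in freqs]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)      # lowest frequency -> left child (bit 0)
        f2, _, right = heapq.heappop(heap)     # next lowest      -> right child (bit 1)
        heapq.heappush(heap, (f1 + f2, next(tick), (None, left, right)))

    codes = {}
    def walk(node, prefix):
        symbol, left, right = node
        if symbol is not None:                 # a leaf: record the accumulated route
            codes[symbol] = prefix or "0"
        else:
            walk(left, prefix + "0")
            walk(right, prefix + "1")
    walk(heap[0][2], "")
    return codes

# {'a': '0', 'e': '100', 'c': '101', 'd': '110', 'b': '111'}
print(huffman([("a", 15), ("b", 7), ("c", 6), ("d", 6), ("e", 5)]))
\end{verbatim}

On the same frequency list as before, these Huffman codes cost $15 \cdot 1 + (7 + 6 + 6 + 5) \cdot 3 = 87$ bits, compared with $89$ bits for the Shannon-Fano codes from the earlier sketch, illustrating the difference in optimality.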

The code generated from the tree is again a prefix code, unique for each symbol. The running time is efficient: constructing the tree takes $O(n \log n)$ operations, where $n$ is the number of symbols, and a tree with $n$ leaves has $2n - 1$ nodes.

The compression procedure is the same as for Shannon-Fano. One major disadvantage of both algorithms is the need to store the code tree in the file, which takes space; this usually has little effect on the compression factor unless small files are being compressed.

For Huffman there are more efficient ways to store this information, such as the canonical Huffman representation, which is discussed in a different section.

The decompression process is the same for both algorithms: read the bits one at a time and follow a route on the tree, going left on 0 and right on 1. When a leaf is reached, the sequence terminates and the leaf's symbol is output. Decoding then restarts from the next bit after the last one in the previous sequence, yielding another symbol when a leaf is reached, and so on. Because the code is a prefix code, each bit sequence is unique to one symbol, so the first leaf the route leads to is always the correct one.
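A minimal Python sketch of this traversal is given below. It rebuilds the tree from the code dictionary (in practice the stored tree would be used directly) and then walks it bit by bit; the names and the input format are assumptions of the sketch.

\begin{verbatim}
def decode(bits, codes):
    """Decode a '0'/'1' string by walking the prefix-code tree:
    0 goes left, 1 goes right, and reaching a leaf emits its symbol."""
    # rebuild the tree from the dictionary: inner nodes are dicts, leaves are symbols
    root = {}
    for symbol, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = symbol

    out, node = [], root
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):         # reached a leaf: emit, restart at the root
            out.append(node)
            node = root
    return "".join(out)

codes = {"a": "00", "b": "01", "c": "10", "d": "110"}
print(decode("000100110", codes))              # abad
\end{verbatim}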

\subsubsection{Comparison Between Shannon-Fano and Huffman}

Although both compression algorithms are based on the same idea, Huffman coding is slightly more efficient, as it yields an optimal tree. This made it the more popular of the two; it has received many variations and has been implemented in several compression utilities.

Huffman and Shannon-Fano coding are not suitable for lossy compression, as they rewrite the data rather than remove the unnecessary detail that lossy compression relies on. They are efficient for lossless compression, particularly symbol-by-symbol compression, when the probabilities are known or easy to calculate. When larger files are to be compressed or the probabilities are unknown, other compression algorithms can be used.

\subsubsection{Lempel-Ziv LZ77}

The Lempel-Ziv compression algorithms LZ77 and LZ78 were published in 1977 and 1978 respectively by Abraham Lempel and Jacob Ziv. Unlike SF and Huffman coding, these algorithms are dictionary coders that work on strings of characters, which can be more efficient than working with individual characters. The two algorithms represent two different approaches to building an adaptive dictionary, that is, a dictionary built while the file is being read.

LZ77, also called sliding-window compression, uses two buffers. One is the search buffer, which acts as the dictionary; the other is the look-ahead buffer, which holds the next portion of data to be encoded. The sizes of these two buffers depend on the implementation.

The dictionary is built by comparing the two buffers: strings of characters are added to the dictionary, or referenced within it when they already exist in the search buffer. The output for a match is the index of the first character of the matching string, the length of the match (the number of matched characters) and the extra character at the end that was not part of the match (NEED TO PUT A FIGURE). A non-matching individual character is output with zero as the index, zero as the length and the character itself (NEED TO PUT A FIGURE); although the length is zero, it is the decoder's job to interpret this as a single character.

Step by step, the algorithm proceeds as follows (a code sketch follows the list):

\begin{itemize}

\item First check that the look-ahead buffer is not empty.

\item Then search the search buffer for the longest match with the look-ahead buffer.

\item If a match is found, output (index of the first character of the match in the search buffer, length of the matched string, the next character in the look-ahead buffer that is not part of the match) and move the window forward past the match and that character.

\item If no match is found, output (zero as index, zero as length, the unmatched character) and move the window one step forward.

\item Continue until the look-ahead buffer is empty.

\end{itemize}
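A toy Python encoder following these steps is sketched below. The buffer sizes are deliberately tiny so the sliding window is visible, and the match is reported as a backward offset from the current position, an equivalent convention to the index described above; both the sizes and the names are assumptions of this sketch.

\begin{verbatim}
def lz77_encode(data, search_size=6, lookahead_size=4):
    """Emit (offset, length, next_char) triples; offset 0 and length 0
    signal a literal character with no match in the search buffer."""
    pos, out = 0, []
    while pos < len(data):                            # look-ahead buffer not empty
        start = max(0, pos - search_size)
        best_len, best_off = 0, 0
        for candidate in range(start, pos):           # search for the longest match
            length = 0
            while (length < lookahead_size - 1
                   and pos + length < len(data)
                   and data[candidate + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - candidate
        next_char = data[pos + best_len] if pos + best_len < len(data) else ""
        out.append((best_off, best_len, next_char))
        pos += best_len + 1                           # slide past the match + literal
    return out

# [(0, 0, 'a'), (0, 0, 'b'), (2, 3, 'b'), (0, 0, 'c')]
print(lz77_encode("abababc"))
\end{verbatim}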

The choice of sliding-window size has direct consequences for compression efficiency. The approach assumes that identical sequences occur near each other, so if a sequence that was matched earlier has already slid out of the search buffer by the time it recurs, it must be encoded again as a new match. The sliding window is usually large enough that such cases have little impact on the compression ratio. The window size also affects the compression rate in another way: the bit cost of each output triple is $$\lceil\log_2{S}\rceil+\lceil\log_2(S+L)\rceil+\lceil\log_2{A}\rceil$$

where $S$ is the size of the search buffer, $L$ is the size of the look ahead buffer and $A$ is the size of the alphabet.
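As a purely illustrative example, with a search buffer of $S = 4096$, a look-ahead buffer of $L = 16$ and an alphabet of $A = 256$ byte values, each triple costs $\lceil\log_2 4096\rceil + \lceil\log_2 4112\rceil + \lceil\log_2 256\rceil = 12 + 13 + 8 = 33$ bits, so a triple only pays for itself once the match plus its trailing literal replace at least five bytes of raw 8-bit input.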

\subsubsection{Lempel-Ziv LZ78}

LZ78 was published as a solution to the above problem with LZ77. It uses a dictionary that grows as matches are found and added to it, so the search window of LZ78 is effectively the whole dictionary. Of course, this has its own drawbacks: there is no limit on the size of the dictionary, which can strain the machine's memory, and searching the dictionary can be time consuming. A few optimisations have been suggested to solve these problems: one is to limit the size of the dictionary, or to erase it completely and start a new one; for search optimisation, a hash table can speed up lookups. These drawbacks are discussed in detail in the implementation issues section.

The dictionary is built in a fairly simple way, storing a codeword and a word. The codeword is the index of the matching string in the dictionary, or zero if there is no match; the word is that string plus the next character, the one not included in the matched string at the codeword's index (NEED TO PUT FIGURE). For a string of length 1 (a single character) the codeword is 0, since there is no match in the dictionary (NEED TO PUT FIGURE). The output, on the other hand, is only the codeword and the last character of the string (NEED TO PUT FIGURE). The number of bits used to represent the output codeword depends on its index in the dictionary. Let $n$ be the number of bits required to represent the codeword with index $i$; then

\begin{equation}
n = \begin{cases}
1 & \text{if } i = 0 \\
\log_2{i} & \text{if } i > 1
\end{cases}
\end{equation}

(NEED TO PUT FIGURE)

Step by step, the process is as follows (a code sketch follows the list):

\begin{itemize}

\item The dictionary is initialised with its first entry set to null; this ensures $i$ is always greater than or equal to 1, as the algorithm would fail if $i$ could be zero. A working string is initialised to the empty string.

\item As long as there is input, read it character by character.

\item If string + character is in the dictionary, set string to string + character.

\item If it is not in the dictionary, then if the string is empty set the codeword to 0 (it is a single character); otherwise the codeword is the index of the string in the dictionary.

\item Output (codeword, character) and add (next free index, string + character) to the dictionary.

\item Reset the string to empty.

\item At the end of the input, if the string is not empty, output (codeword for string, null).

\end{itemize}
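A toy Python version of these steps is sketched below; the pair format and names are illustrative assumptions, and the bit-level packing of the codewords described above is omitted.

\begin{verbatim}
def lz78_encode(data):
    """Emit (codeword, character) pairs; codeword 0 means no previous match.
    The dictionary maps strings to indices and grows as the input is read."""
    dictionary = {}                          # string -> index, indices starting at 1
    out, string = [], ""
    for char in data:
        if string + char in dictionary:
            string += char                   # keep extending the current match
        else:
            out.append((dictionary.get(string, 0), char))
            dictionary[string + char] = len(dictionary) + 1
            string = ""
    if string:                               # leftover match at the end of the input
        out.append((dictionary[string], None))
    return out

# [(0, 'A'), (0, 'B'), (1, 'B'), (1, None)]
print(lz78_encode("ABABA"))
\end{verbatim}

For example, on the input ABABA the encoder outputs (0, A), (0, B), (1, B) and finally (1, null), while the dictionary grows to hold A, B and AB.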

The decoding process is similar to the encoding process: the dictionary is gradually rebuilt from the input, only here what is needed is the number of bits representing each codeword, plus the 8 bits that follow it, which hold the accompanying character. In the same way the dictionary starts from index 1 rather than 0. Since the first codeword is known to be 0 and is represented with 1 bit, the first output is simply the character stored after it, which becomes the dictionary's first entry. From then on, the number of bits representing a codeword follows the same calculation as in compression, namely $$n = \log_2{i}$$

When a non-zero codeword is encountered, the output is the string stored at the index the codeword points to, plus the character held in the following 8 bits; that combined string is also the next entry added to the dictionary.
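A matching toy decoder is sketched below, again skipping the bit-level reading and assuming the (codeword, character) pairs are already available; it rebuilds the dictionary on the fly exactly as the encoder did.

\begin{verbatim}
def lz78_decode(pairs):
    """Rebuild the text from (codeword, character) pairs, reconstructing
    the dictionary on the fly so no code table needs to be stored."""
    dictionary = {0: ""}                      # codeword 0 stands for the empty string
    out = []
    for codeword, char in pairs:
        string = dictionary[codeword] + (char or "")
        out.append(string)
        dictionary[len(dictionary)] = string  # next free index gets the new entry
    return "".join(out)

print(lz78_decode([(0, "A"), (0, "B"), (1, "B"), (1, None)]))   # ABABA
\end{verbatim}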

There are a few advantages of LZ coding over SF and Huffman coding. Chiefly, there is no need to store a description of how the compression was done, which saves space and avoids the complexity of storing such data efficiently. Furthermore, unlike Huffman coding, which requires two passes over the data, LZ requires only one. This single-pass property is important for saving disk space and for encoding and decoding real-time communications.


