A Data Deduplication Framework For Cloud Environments


Abstract- Cloud storage has been widely adopted because it provides seemingly unlimited storage space and flexible access, but storage costs become a concern because data must be maintained for a long time. In this paper, we examine the benefits and overhead of applying deduplication to cloud storage, and we propose a deduplication framework for cloud environments. Our framework consists of three major components: a front-end deduplication application, an efficient indexing mechanism, and the Hadoop Distributed File System (HDFS). The deduplication application divides a given file into smaller chunks, searches an index table of chunk hash values to identify duplicates, and stores only non-duplicate chunks in HDFS. We further conduct a series of experiments on the framework, which demonstrate that it improves storage space utilization and reduces bandwidth consumption.

Keywords: Cloud storage, deduplication, hash value, efficiency, bandwidth

I. INTRODUCTION

Cloud storage has become an effective way to store massive data due to its high reliability and high scalability [1]. However, as the amount of data grows, storage and communication costs grow rapidly, so it is important to find effective storage methods that improve the utilization of storage space and accelerate storage. Data deduplication is a technology that detects and removes duplicate data at either the file level or the block level [2]. Deduplication can be applied to a single user's data, where redundancy within that user's data is identified and removed, but single-user deduplication is not very practical or cost-effective. In practice, cross-user deduplication is used to maximize the benefits: it identifies redundant data among different users, removes the redundancy, keeps a single copy of the unique data, and redirects the access requests of different users to this copy. It can therefore be applied to cloud storage to save storage space.

Cloud storage serves a large number of users and holds much redundant data. According to [3], an average of 60% of data can be deduplicated. Using deduplication to remove this redundant data not only achieves higher storage efficiency, saving costs for cloud storage providers, but also reduces the traffic used by subscribers and thus their storage costs.

This paper presents a framework for cloud storage deduplication (DDSF), which adds a deduplication mechanism to remove duplicates without changing the original cloud storage architecture. It addresses the problem of low storage space utilization in existing cloud storage.

In the framework, we use the MD5 and SHA-1 hash algorithms to compute the fingerprint of each data block, which greatly reduces the chance of accidental collision [4][5]. When storing massive amounts of data, the number of fingerprints grows with the stored data, so searching the index table that stores these fingerprints becomes time-consuming. We design an efficient indexing mechanism based on a Bloom Filter [8], which effectively improves retrieval efficiency.

We also develop a hash file system, which is the center of metadata management and access. The hash file system uses a tree structure to organize directories and maintains the entire directory structure in memory. A file no longer contains the real data; it contains the positions, in the fingerprint table, of all chunks that belong to the file, and the fingerprint table stores all fingerprints together with their addresses in the storage system. Because file metadata and real data are stored separately, the hash file system can be expanded simply by increasing the number of storage nodes, without changing the entire system. The hash file system uses a system image and an operation log to enhance persistence and robustness. In our framework, we use HDFS (Hadoop Distributed File System) as the storage system [6]. HDFS is a distributed file system that runs on commodity hardware and was developed by Apache for managing massive data. Its advantage is that it suits high-throughput, large-dataset environments.
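As a concrete illustration, a chunk fingerprint combining MD5 and SHA-1 could be computed as in the following sketch; the way the two digests are combined is our assumption, since the paper does not fix a particular encoding.

```python
import hashlib

def chunk_fingerprint(chunk: bytes) -> str:
    """Illustrative fingerprint for one data block.

    The MD5 digest serves as the 128-bit main index value used later by the
    Bloom Filter; appending the SHA-1 digest further reduces the chance that
    two different blocks share a fingerprint. The concatenation format itself
    is an assumption, not part of the paper.
    """
    return hashlib.md5(chunk).hexdigest() + hashlib.sha1(chunk).hexdigest()
```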

The rest of this paper is organized as follows. Section II discusses related work. Section III explains the proposed framework architecture and its components in detail. Section IV presents our experiments. Section V concludes the paper.

II. RELATED WORKS

A key issue in cloud storage deduplication is that, as the stored data grows, searching the index table of chunk fingerprints becomes very time-consuming. In [7], the authors describe a Data Domain deduplication system that uses multiple indexes and searches the index table step by step to improve efficiency. In this paper, we develop a slice indexing mechanism based on the Bloom Filter. A Bloom Filter uses memory space efficiently: it maintains a binary bit-vector in memory and can quickly judge whether an element belongs to a set. This judgment can be wrong only for elements reported as being in the set, never for elements reported as absent, so each test has only two outcomes: "possibly in the set" (which may be a false positive) and "definitely not in the set". For fingerprints that may exist, we perform a second-level retrieval, which speeds up the overall lookup.
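To make this false-positive-only behaviour concrete, the following minimal sketch shows a generic Bloom Filter; the number of hash functions and the salted-MD5 construction are illustrative choices, not the exact filter used in DDSF (which is described in Section III).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership tests can return false positives,
    but never false negatives."""

    def __init__(self, num_bits: int = 2 ** 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)      # bit vector kept in memory

    def _positions(self, item: bytes):
        # Derive several bit positions by salting MD5; illustrative only.
        for salt in range(self.num_hashes):
            digest = hashlib.md5(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # True  -> "possibly in the set": confirm against the on-disk index.
        # False -> "definitely not in the set": the fingerprint is new.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```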

Several deduplication storage systems have been designed previously, including Venti [9], Extreme Binning [10], DeDe [11], and DeDu [12]. Venti is a network archiving system; it uses unique hash values to identify data blocks and an append-only scheme to store data, and it lacks support for deleting data, so it is not suitable for a cloud storage system. Extreme Binning stores a complete file on a separate backup node; each backup node is autonomous and manages its own index table and data without sharing information with, or knowing about, the others. It deduplicates only within its own node, so it is not suitable for global deduplication of massive data. DeDe is a block-level deduplication cluster file system without central coordination; however, it is only suitable for deduplicating virtual machine images in a virtual machine environment, not for a general data storage system. DeDu is a file-level deduplication system that performs file similarity detection on the client, but the repetition rate at the file level is not very high, so its deduplication rate is relatively low.

In summary, this paper proposes a block-level data deduplication framework. File chunking and data block fingerprint calculation are performed on the client, which reduces the computation pressure on the server. At the same time, duplicate data is identified before transmission, which reduces the amount of data transferred and the bandwidth consumed.

III. FRAMEWORK ARCHITECTURE

The architecture of the proposed framework, named DDSF, is shown in Figure 1. DDSF is composed of three components: the client component, the hash server component, and the storage component. Their respective functions are as follows.

● Client Component: In this component, the file is divided into chunks and the hash value (fingerprint) of every block is calculated. The fingerprints are sent to the hash server component, which determines whether each data block is a duplicate and returns the result. The client component then decides whether to send a data block based on this result.

● Hash Server Component: This component contains a hash file system with two main functions: directory management and index table management. Directory management handles user directories and operations on files. Index table management maintains the fingerprint index table of all chunks and provides efficient access to it; each index table entry contains a block fingerprint and the actual address of the data block in the storage system.

● Storage Component: We use HDFS as the storage system; it is where the data is actually stored.

Fig. 1. Framework architecture

This framework achieves online deduplication: by interacting with the hash server component, the client component learns which data blocks are duplicates before sending any data, so it saves storage space and reduces bandwidth consumption by sending only non-duplicate data to HDFS. The main functional components of the framework are the client component and the hash server component.
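The client side of this online deduplication flow could be sketched as follows; `hash_server` and `hdfs` are hypothetical stand-ins for the hash server and HDFS interfaces, which the paper does not specify.

```python
import hashlib

def upload_file(chunks, hash_server, hdfs):
    """Illustrative online-deduplication upload flow (hypothetical interfaces).

    `hash_server.find_missing` returns the fingerprints not yet in the index
    table; `hdfs.put_block` stores one block and returns its address;
    `hash_server.register` adds a (fingerprint, address) entry.
    """
    fingerprints = [hashlib.md5(c).hexdigest() for c in chunks]
    missing = set(hash_server.find_missing(fingerprints))
    for fp, chunk in zip(fingerprints, chunks):
        if fp in missing:
            address = hdfs.put_block(chunk)       # store only non-duplicate blocks
            hash_server.register(fp, address)     # record the new index entry
    return fingerprints                           # the file is recorded as this chunk list
```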

A. Hash Server Component

The main function of this component is realized by a hash file system. The hash file system is the center of metadata management and access; all operations on files must go through it. The metadata includes the file attributes and the IDs of all data blocks that belong to the file, where an ID represents the position of the data block in the index table.
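For illustration, the per-file metadata described above might be modelled as in the sketch below; the field names are assumptions, since the paper does not give a concrete format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FileMetadata:
    """Metadata kept by the hash file system for one file (illustrative only).

    `block_ids` are positions in the fingerprint index table; the real data
    blocks live in HDFS and are reached through that table.
    """
    name: str
    owner: str
    size: int                                            # logical file size in bytes
    block_ids: List[int] = field(default_factory=list)   # ordered chunk positions
```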

The hash file system provides user directory management and index table management. User directory management uses a tree structure; each user corresponds to a child node and can only access the files in his or her own directory. Index table management is the key part of the hash file system.

The hash file system maintains a hash index table that stores the addresses and fingerprints of all data blocks. As the number of data blocks grows, both storing and searching the index table become expensive, so we optimize it with a Bloom Filter and a slice indexing technique. The multi-level index mechanism is shown in Figure 2: the primary index is handled by the Bloom Filter, and the secondary index is handled by the slice indexing mechanism.

Fig. 2. The multi-level index mechanism

The Bloom Filter table can reside in memory, so it offers high query efficiency. A bit in the Bloom Filter represents whether a fingerprint exists in the index table. We use the fingerprint calculated by the MD5 algorithm as the main index value; its length is 128 bits. Obviously, if this fingerprint were used directly to address a bit in the Bloom Filter table, the table would need 2^128 bits, far too large to load into memory. We therefore design a compressed fingerprint indexing technique: the 128-bit fingerprint is divided into eight 16-bit parts, the eight parts are XOR-ed together to obtain a new 16-bit value, and this 16-bit value addresses a bit in the Bloom Filter table. The Bloom Filter table size is then 2^16 bits (64 Kbit), so it can easily be loaded into memory.
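A minimal sketch of this folding step, assuming the fingerprint is available as the 16 raw bytes of the MD5 digest:

```python
import hashlib

def bloom_index(md5_fingerprint: bytes) -> int:
    """Fold a 128-bit MD5 digest into a 16-bit position in the Bloom Filter table."""
    assert len(md5_fingerprint) == 16          # raw MD5 digest, not the hex string
    index = 0
    for i in range(0, 16, 2):                  # eight 16-bit words
        index ^= int.from_bytes(md5_fingerprint[i:i + 2], "big")
    return index                               # value in [0, 2**16)

# Example usage: position = bloom_index(hashlib.md5(chunk).digest())
```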

When the server component receives a fingerprint sent by the client component, it checks the status of the bit in the Bloom Filter table that the fingerprint maps to. If the bit indicates that the fingerprint does not exist, the data block is stored, its address and fingerprint are added as an entry to the index table, and the status bit is set. If the bit indicates that the fingerprint may exist, the next level of indexing is performed.
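Putting the two cases together, the server-side handling of one fingerprint could look like the following sketch, where `bit_pos` is the folded 16-bit value from the previous sketch, `slice_table` stands in for the on-disk slice that may hold the fingerprint, and `store_block` stands in for writing the block to HDFS; all three are hypothetical simplifications.

```python
def check_and_register(fp: bytes, bit_pos: int, bloom_bits: bytearray,
                       slice_table: dict, store_block):
    """Handle one fingerprint on the hash server (illustrative stand-ins only)."""
    if bloom_bits[bit_pos // 8] & (1 << (bit_pos % 8)):   # bit set: possibly a duplicate
        if fp in slice_table:                             # confirmed by the second-level index
            return slice_table[fp]                        # duplicate: reuse the stored address
    address = store_block()                               # new block (or a false positive)
    slice_table[fp] = address                             # add the (fingerprint, address) entry
    bloom_bits[bit_pos // 8] |= 1 << (bit_pos % 8)        # set the status bit
    return address
```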

In the second-level index, the fingerprint index table is divided into multiple parts (slices) that are stored separately on disk. During indexing, the slice to which the fingerprint may belong is loaded into memory, which keeps the in-memory portion of the index table small while preserving indexing performance. If the fingerprint does not exist in the slice, it is judged to be new: the data block is stored in HDFS, and the storage address and fingerprint are added to the slice.

The fingerprint index table is divided into several fixed-interval slices according to MD5 range; each slice is an index table. To prevent a single index table from growing too large, we set a maximum size, such as 16 MB. When an index table exceeds this maximum, a new index file is created. This keeps the index tables at an appropriate size and makes it easy to determine in which index table a fingerprint may exist.
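One possible realization of the fixed-interval slicing and the 16 MB rollover is sketched below; the number of slices and the file-naming scheme are assumptions.

```python
import os

MAX_SLICE_BYTES = 16 * 1024 * 1024    # the 16 MB cap mentioned above
NUM_SLICES = 256                       # number of fixed MD5 ranges (an assumed value)

def slice_file(md5_fingerprint: bytes, index_dir: str = "index") -> str:
    """Return the index file that should hold this fingerprint.

    The fingerprint's MD5 range (here simply its first byte) selects the slice;
    when the current file for that slice exceeds MAX_SLICE_BYTES, a new
    generation of the file is started instead of letting it grow further.
    """
    slice_id = md5_fingerprint[0] % NUM_SLICES
    generation = 0
    while True:
        path = os.path.join(index_dir, f"slice_{slice_id:03d}_{generation:04d}.idx")
        if not os.path.exists(path) or os.path.getsize(path) < MAX_SLICE_BYTES:
            return path
        generation += 1
```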

We add a system image and an operation log file to the hash file system to improve its robustness and allow it to recover from failures. The system image saves the structure of the file directory; the operation log file records users' operation information.

On start-up, the system reads the system image and the operation log from disk, combines the two files to generate a new system image, and builds the directory structure of the file system in memory. Finally, it deletes the old operation log and creates a new blank log file.
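A minimal recovery sketch under assumed on-disk formats (a JSON directory image, modelled here as a flat path-to-metadata map, and one JSON operation per log line; the paper does not specify these formats):

```python
import json
import os

def recover(image_path: str, log_path: str) -> dict:
    """Replay the operation log onto the saved system image, persist the merged
    image, rebuild the in-memory directory structure, and start a fresh log."""
    with open(image_path) as f:
        tree = json.load(f)                                  # last saved directory structure
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                op = json.loads(line)
                if op["action"] == "create":
                    tree[op["path"]] = op.get("meta", {})    # replay a create operation
                elif op["action"] == "delete":
                    tree.pop(op["path"], None)               # replay a delete operation
    with open(image_path, "w") as f:
        json.dump(tree, f)                                   # write the merged system image
    open(log_path, "w").close()                              # recreate an empty operation log
    return tree                                              # in-memory directory structure
```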

To handle the concurrency issue caused by multiple users writing to the log file simultaneously, we set up two lockable data buffers and allow only one user to access a buffer at any moment. When a user accesses the first buffer to record operation information, the buffer is locked; the first buffer then passes the operation information to the second buffer and is unlocked, so another user can access it while the second buffer writes the information to the log file. Because the I/O is performed on the second buffer, the next user can write an operation record as soon as the previous user leaves the first buffer, which greatly improves the response speed when multiple users write to the log file.
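The double-buffered logging scheme could be sketched as follows; using a queue as the second buffer and a background thread for the disk I/O are implementation assumptions.

```python
import queue
import threading

class LogWriter:
    """Two-stage log writer modeled on the description above: writers briefly
    hold the front buffer, then a background thread flushes records to disk,
    so disk I/O never blocks the next writer."""

    def __init__(self, log_path: str):
        self._front_lock = threading.Lock()        # only one user in the front buffer
        self._back_buffer = queue.Queue()          # records waiting to be written
        self._log_path = log_path
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def record(self, operation: str) -> None:
        with self._front_lock:                     # front buffer locked while occupied
            self._back_buffer.put(operation)       # hand off to the second buffer
        # lock released: the next user can record immediately

    def _flush_loop(self) -> None:
        with open(self._log_path, "a") as log:
            while True:
                op = self._back_buffer.get()       # disk I/O happens only here
                log.write(op + "\n")
                log.flush()
```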

B. Client Component

The client component is the tool through which users access the cloud storage system; it consists of a file chunking component and a chunk merging component.

● File Chunking Component: In this component, a file is broken into fixed-size chunks, and the fingerprint of each chunk is calculated using both the MD5 and SHA-1 algorithms; the component then sends a store-file command and the file's fingerprint list to the hash server component. The hash server component creates a new file in the hash file system and searches the fingerprint index table to determine which fingerprints in the list already exist. If all fingerprints exist, it stores all chunk IDs in the new file and informs the client component not to send any chunks; conversely, if some fingerprints do not exist, it asks the client component to send those chunks. (A chunking sketch is given at the end of this subsection.)

● Chunk Merging Component: When a user downloads a file, this component asks the hash server component for the addresses of all the file's chunks, sequentially reads the corresponding data blocks from HDFS, and finally merges the blocks into the complete file. (A merging sketch is also given at the end of this subsection.)

In short, the client component knows whether a chunk is a duplicate before transmission and sends only non-duplicate chunks, which reduces traffic and bandwidth consumption.
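A minimal sketch of the file chunking step, using for example the 1 MB block size from the consistency experiments in Section IV; the generator interface is our own choice.

```python
import hashlib

CHUNK_SIZE = 1 * 1024 * 1024          # fixed chunk size; 1 MB as in Section IV-A

def chunk_file(path: str):
    """Yield (chunk, md5_hex, sha1_hex) for each fixed-size chunk of a file,
    as the file chunking component does before contacting the hash server."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk, hashlib.md5(chunk).hexdigest(), hashlib.sha1(chunk).hexdigest()
```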
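And a corresponding sketch of the chunk merging step; `hash_server.chunk_addresses` and `hdfs.get_block` are hypothetical stand-ins for the hash server lookup and the HDFS read described above.

```python
def download_file(file_name: str, destination: str, hash_server, hdfs) -> None:
    """Re-assemble a file from its stored chunks (illustrative interfaces)."""
    addresses = hash_server.chunk_addresses(file_name)   # ordered chunk addresses for the file
    with open(destination, "wb") as out:
        for address in addresses:
            out.write(hdfs.get_block(address))           # fetch blocks sequentially and append
```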

IV. EXPERIMENTS

Our experiment platform was set up on six machines. Four machines form the HDFS cloud storage system, including one NameNode and three DataNodes. The fifth machine serves as the hash server component, and the last one serves as the client component. The detailed configuration is listed in Table 1.

Table 1. Configuration of six machines

A. File Consistency

In this experiment, we uploaded a set of files into DDSF, then downloaded them and compared them with the source files to check whether the files remained consistent after deduplication. We ran three tests with the data block size set to 1 MB. The source file sizes, the space used after deduplication, and the file consistency results are given in Table 2.

Table 2. File consistency

According to the results in Table 2, the three tests differ in number of files, file type, file size, and repetition rate, and all files remained consistent in DDSF. Therefore our framework maintains file consistency and completeness regardless of file type.

B. Writing Efficiency

In this part, we compare the writing efficiency with and without DDSF. We used six data sets of 2.3 GB each, with a chunk size of 10 MB and repetition rates of 0%, 20%, 40%, 60%, 80%, and 100%, respectively. We uploaded each data set and measured its upload time with and without DDSF. The times taken to upload the files are shown in Figure 3.

Fig. 3. Time taken to save files with and without DDSF

In Figure 3, the data repetition rate is plotted on the x-axis and the time taken to save the files on the y-axis. As shown in the figure, as the data repetition rate increases, the time taken to save files with DDSF decreases rapidly. When the data repetition rate is 5%, the time taken to save the same files with DDSF and without DDSF is the same.

C. Reading Efficiency

This part presents the reading efficiency in two situations, with and without DDSF. We downloaded the previously stored data sets and measured each download time; the results are shown in Figure 4.

Fig. 4. Time taken to read files with and without DDSF

As shown in Figure 4, the average time taken to read files without DDSF is 253 s, and the average time with DDSF is 260 s; the reading time increases by only 2.4%. As mentioned before, an average of 60% of the data in services such as Dropbox is duplicated. Using DDSF, we can therefore save nearly 60% of the storage space; although the time taken to read files increases by 2.4%, the time taken to upload files decreases by 56.7%. In summary, the performance of the entire storage system can be largely improved by using DDSF.

V. CONCLUSIONS

In this paper, we propose a deduplicated cloud storage framework (DDSF), designed to save storage space and reduce bandwidth consumption. The client component consults the hash server component to determine whether a chunk is a duplicate before sending it, and it sends only non-duplicate chunks to the underlying storage system, which reduces the amount of communication and improves storage space utilization. As the amount of data in the storage system grows, the size of the hash table also grows, so this paper presents an efficient two-level indexing mechanism to improve the retrieval efficiency of the hash table: it slices the hash table into multiple parts, keeps some slices in memory, and uses an LRU policy to replace slices, which further accelerates retrieval and improves efficiency. This paper also introduces a system image and an operation log; the system image preserves the directory structure of the entire system, the operation log records user operations, and together they enhance the robustness of the hash file system. The retrieval of hash tables remains the system bottleneck. In future work, we will develop a more efficient indexing mechanism to reduce indexing time and further optimize performance.

ACKNOWLEDGMENT

The author would like to thank Wenwei Zhang for his assistance and helpful comments.


