1. ENERGY EFFICIENT REPLICATION
Cloud computing is, in essence, a collection of data centers; in other words, data centers are what is commonly described as the cloud. Cloud computing aims at providing three basic resources, network, storage and computation, as services over the internet. Replication is used as a means of enhancing the availability of these services, and hence energy efficiency must be addressed across all three domains, as discussed below:
1.1. STORAGE:
The growth in cloud computing has resulted in the emergence of large-scale cloud-based applications. These applications generate massive amounts of data, posing a need for efficient data storage in data centers. For instance, Facebook reported holding 2.5 petabytes of user data, growing at a rate of 15 terabytes per day [1]. Moreover, data centers replicate data to ensure high availability and fast access for users. An alarming issue, however, is the high energy consumption caused by the large-scale use of storage devices. A recent study reports that storage devices alone account for 27% of the total energy consumed by a data center [10].
Drilling down to the hardware level, a data center storage system consists of arrays of disks that rely on RAID (Redundant Array of Independent Disks) to store data, and hence a significant amount of energy is consumed in operating these redundant disks. Greater energy consumption results in greater heat dissipation and, in turn, greater cost for cooling equipment.
Therefore, it becomes equally important to manage both the data replicas stored in the data center and the number of powered-on disks in order to bring down the energy consumed in the process of replication. We now study these two aspects one by one:
1.1.1. DATA REPLICATION:
In cloud computing, striping is one of the most commonly used techniques, whereby data files are divided into blocks or chunks and stored on different nodes. This technique is used in the Google File System (GFS) [3], the Hadoop Distributed File System (HDFS) [4] and the Amazon Simple Storage Service (S3) [5].
To understand striping better, let us discuss how it works in GFS and HDFS. The working mechanisms of GFS and HDFS are similar, except that HDFS is more lightweight. An HDFS cluster has two main components: a single NameNode, which acts as the master server, and a number of DataNodes. The NameNode manages the file system namespace and regulates access to files by clients. The DataNodes, usually one per node, manage the storage attached to the nodes they run on. A file is internally striped into blocks that are stored on the DataNodes. The DataNodes serve clients' read and write requests and also perform operations such as block creation, deletion and replication upon instruction from the NameNode. The NameNode is responsible for namespace operations such as opening, closing and renaming files, and it also determines the mapping of blocks to DataNodes [6].
Fig. 1. Architecture of HDFS
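As a rough illustration of this striping and replication scheme, the following Python sketch splits a file into fixed-size blocks and builds a NameNode-style mapping from each block to the DataNodes holding its replicas. The block size, replication factor, node names and round-robin placement are illustrative assumptions, not HDFS defaults.

```python
import itertools

# Purely illustrative values; HDFS's real block size and replication
# factor are configuration parameters, not the tiny numbers used here.
BLOCK_SIZE = 8          # bytes per block
REPLICATION_FACTOR = 2  # replicas per block
DATA_NODES = ["dn1", "dn2", "dn3", "dn4"]

def stripe_file(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's contents into fixed-size blocks (striping)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=DATA_NODES, replicas=REPLICATION_FACTOR):
    """Build a NameNode-style map: block index -> DataNodes holding a replica."""
    ring = itertools.cycle(nodes)  # simple round-robin placement
    return {idx: [next(ring) for _ in range(replicas)] for idx in range(len(blocks))}

blocks = stripe_file(b"energy efficient replication in the cloud")
for idx, holders in place_blocks(blocks).items():
    print(f"block {idx} -> replicas on {holders}")
```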
In the case of GFS, the three major components are multiple clients, a single master server and multiple chunk servers. Files are sliced into equal-sized chunks and stored in the data centers managed by the chunk servers. The master server keeps the metadata associated with the file system, which includes the namespace, access control information, the mapping from files to chunks, and the locations of chunks. Clients interact with the master server only for metadata operations; all data-bearing communication is handled directly by the chunk servers [3].
Fig. 2. Architecture of GFS
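The separation between metadata traffic and data traffic can be sketched as follows; the class and method names are hypothetical stand-ins used only to illustrate the read path (metadata from the master, bytes from a chunk server), not the real GFS interfaces.

```python
class Master:
    """Holds only metadata: file name -> list of (chunk_id, replica locations)."""
    def __init__(self):
        self.chunk_table = {
            "/logs/day1": [("c-001", ["cs1", "cs2"]), ("c-002", ["cs2", "cs3"])],
        }

    def lookup(self, filename, chunk_index):
        # Metadata-only operation: no file data flows through the master.
        return self.chunk_table[filename][chunk_index]


class ChunkServer:
    """Stores the actual chunk contents and serves reads directly to clients."""
    def __init__(self, store):
        self.store = store  # chunk_id -> bytes

    def read(self, chunk_id):
        return self.store[chunk_id]


master = Master()
chunkserver = ChunkServer({"c-001": b"first chunk", "c-002": b"second chunk"})

chunk_id, locations = master.lookup("/logs/day1", 0)  # 1) ask the master where the chunk lives
data = chunkserver.read(chunk_id)                     # 2) fetch the bytes from a chunk server
print(chunk_id, locations, data)
```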
The above discussion explains how the blocks of a file are stored on different data nodes in a cloud storage system. In such a system, where failure is the norm rather than the exception, the failure of a single node due to a node crash, network failure or limited bandwidth will render the entire file unavailable. This necessitates replicating the blocks of files onto different data nodes to increase availability. However, the greater the number of data replicas, the greater the resource consumption in terms of storage and the greater the amount of power required to manage these resources. Thus, managing data replicas becomes of utmost importance to conserve energy. The management of data replicas involves three decisions:
Determining exactly what data to replicate.
Determining the number of data replicas.
Determining where to place the replicas.
1.1.1.1. DETERMINING WHAT DATA TO REPLICATE
The first step towards energy-efficient replication is to determine what data to replicate and when to replicate it. Replication based on content popularity is gaining importance. A large number of files are generated on a daily basis, but these files differ from one another in the number and intensity of accesses made to them. A popular data file is one with a greater number of accesses than its counterparts. Energy-efficient replication aims either to replicate only popular files or to create more replicas for popular files than for unpopular ones. The underlying assumption, known as temporal locality, is that files that were popular in the past will be accessed more than others in the future. In order to determine which files to replicate, it is important to determine the access patterns to the files. This is done by analyzing historical usage patterns, online predictors based on the recent past, and information about the jobs submitted for execution. The authors of [7] analyzed logs from a large production cluster that supports Microsoft Bing over a period of six months and concluded that the 12% most popular data is accessed over ten times more than the bottom third of the data. Moreover, they report that data popularity changes over time, with only 50% of the files accessed on a given day also being accessed on the next or previous day, showing that recent logs are good indicators of future access patterns. A similar pattern is found in [8], which studied Hadoop logs from Yahoo and observed that 80% of the accesses to a file occur during its first day, giving more weight to recent data.
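A minimal sketch of this kind of popularity analysis is shown below: it counts accesses per file over a recent window of a hypothetical log and flags files whose counts pass a threshold. The log format, window and threshold are assumptions made for illustration, not parameters from [7] or [8].

```python
from collections import Counter

# Each entry is (timestamp, file_id); the log format, window and threshold
# below are illustrative assumptions, not values from the cited studies.
access_log = [
    (1, "A"), (2, "A"), (2, "B"), (3, "A"), (3, "C"), (4, "A"), (5, "B"),
]

RECENT_WINDOW_START = 2    # only count accesses from this timestamp onward
POPULARITY_THRESHOLD = 3   # "popular" = at least this many recent accesses

recent_counts = Counter(fid for ts, fid in access_log if ts >= RECENT_WINDOW_START)
popular_files = [fid for fid, n in recent_counts.items() if n >= POPULARITY_THRESHOLD]

print(recent_counts)   # Counter({'A': 3, 'B': 2, 'C': 1})
print(popular_files)   # ['A'] -> candidates for extra replicas
```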
Apart from using popularity as a criterion for replication, other works focus on the high availability of scientific data [9]. The aim is to create more replicas of such data so as to circulate it among the scientific community around the world. Scientific projects in genetics (the Human Genome Project [10]), astronomy (the Sloan Digital Sky Survey [11]) and collaborative projects in computer graphics, molecular biology, digital optical microscopy, and modelling and control theory (the Human Brain Project [12]) generate large data sets that are of great importance to the scientific world, posing a great need to enhance their availability and preserve them for the long term. One obvious way of doing so is to replicate such data.
Now, the next important question is how to determine which file is popular and when to initiate the replication process.
A hierarchical architecture that supports a dynamic replication mechanism for a grid of clusters is presented in [13]. Each cluster consists of a cluster head and a group of sites, and the cluster head manages the site information in its cluster. Each site maintains a detailed record of each file access in the form <timestamp, file_id, cluster_id>, where file_id identifies the file accessed by the site in the cluster denoted by cluster_id at the given timestamp. The records collected by the various sites in a cluster are regularly aggregated and summed at the cluster head in the format <file_id, cluster_id, number>, which indicates that the file with identifier file_id was accessed number times in the cluster denoted by cluster_id.
Each cluster head sends the information about the files accessed in its cluster (file_id) and the number of times each file was accessed (number) to the central policy maker. The policy maker, on collecting this information from the different clusters, decides which files are popular by counting the total number of accesses made to each file. This process is illustrated below:
Fig. 3. The Hierarchical Architecture and Aggregation of Records
In the example above, file A is accessed twice in cluster 1, with 9 and 10 accesses, and once in cluster 4, with 7 accesses. The aggregated sum (9 + 10 + 7 = 26) is passed on to the central policy maker. Accesses to the other files are aggregated and recorded by the policy maker in the same way.
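The aggregation step can be sketched as follows, assuming the cluster heads have already summarized their sites' records into the <file_id, cluster_id, number> form; the file A values simply reproduce the 9 + 10 + 7 = 26 example above, and the rest are illustrative.

```python
from collections import defaultdict

# Aggregated <file_id, cluster_id, number> records forwarded by the cluster
# heads; file A reproduces the example above, the other entries are made up.
cluster_records = [
    ("A", 1, 9), ("A", 1, 10), ("A", 4, 7),
    ("B", 2, 12), ("B", 3, 5),
    ("C", 1, 4),
]

def total_accesses(records):
    """Policy-maker side: sum the per-cluster counts for each file."""
    totals = defaultdict(int)
    for file_id, _cluster_id, number in records:
        totals[file_id] += number
    return dict(totals)

totals = total_accesses(cluster_records)
print(totals)                          # {'A': 26, 'B': 17, 'C': 4}
print(max(totals, key=totals.get))     # 'A' is the most popular file
```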
This architecture is further extended in [13][14] to design the Latest-Access-Largest-Weight (LALW) algorithm, which is based on the concept of half-life: an entity decays to half of its initial value over a period of time. Following this idea, LALW assigns weights to the records such that older history records carry smaller weights than recent ones. As a record ages over the passing time intervals $T_1, T_2, \ldots, T_n$, its weight decays accordingly: the record from interval $T_1$ is weighted $2^{-(n-1)}$, the record from $T_2$ is weighted $2^{-(n-2)}$, and so on.
Based on this, the access frequency is calculated as shown in Equation (1), which reflects the importance of the access history in different time intervals. If $N$ is the number of time intervals passed, $F$ is the set of files that have been requested and $a_f^k$ is the number of accesses to file $f$ in time interval $k$, then the access frequency of a particular file is calculated as:
$$AF(f) = \sum_{k=1}^{N} a_f^k \cdot 2^{-(N-k)}, \quad f \in F \qquad (1)$$
For instance, suppose the file A considered earlier is accessed 9 and 10 times in the first and second time intervals, respectively. With N = 2, the access frequency is calculated as:
AF(A) = (9 × 2^{-(2-1)}) + (10 × 2^{-(2-2)})
      = (9 × 2^{-1}) + (10 × 2^{0})
      = 4.5 + 10
      = 14.5
In a similar way, the access frequencies of all files across the different time intervals are calculated and compared. The file with the largest access frequency is selected as the popular file.
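A small sketch of this weighting scheme, assuming the per-interval access counts are already available, reproduces the calculation for file A; the counts for files B and C are illustrative.

```python
def lalw_access_frequency(access_counts):
    """
    Latest-Access-Largest-Weight: given per-interval access counts
    [a_1, ..., a_N] (oldest first), weight interval k by 2^-(N-k) so that
    recent intervals dominate, as in Equation (1).
    """
    n = len(access_counts)
    return sum(a * 2 ** (-(n - k)) for k, a in enumerate(access_counts, start=1))

# File A from the worked example: 9 accesses in interval 1, 10 in interval 2.
print(lalw_access_frequency([9, 10]))   # 14.5

# Comparing several files, the largest weighted frequency marks the popular one.
files = {"A": [9, 10], "B": [20, 2], "C": [3, 12]}
print(max(files, key=lambda f: lalw_access_frequency(files[f])))   # 'A'
```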
Building on the same observation that the importance of an entity decays over time, [15] designed a time-based forgetting function $\varphi$ taking values in the interval $[0,1]$, shown in Equation (2):
$$\varphi(t_p, t_s) = e^{-(\Delta t)^k} = e^{-(t_p - t_s)^k} \qquad (2)$$
where $t_p$ is the present time and $t_s$ is the start time.
This function is then used to calculate the popularity degree $pd_k$ (Equation (3)) of the file block $b_k$ at the present time $t_p$, relative to the access history since the start time $t_s$:
$$pd_k = \sum_{t_i = t_s}^{t_p} an_k(t_i, t_{i+1}) \cdot \varphi(t_i, t_p) \qquad (3)$$
where $an_k(t_i, t_{i+1})$ is the number of accesses during the interval $t_i$ to $t_{i+1}$.
Based on the popularity degree of the file, the replica factor is calculated as shown in Equation (4):
$$RF_i = \frac{PD_i}{RN_i \cdot FS_i} \qquad (4)$$
where $PD_i$, $RN_i$ and $FS_i$ are the popularity degree, the number of replicas and the file size of data file $f_i$, respectively. The replica factor is then compared with the threshold $\min\big((1+\omega) \cdot RF_{sys},\; \max_{k \in \{1,2,\ldots,l\}} RF_k\big)$, where $\omega \in [0,1]$ is a parameter that can be adjusted according to the desired system performance. If $RF_i$ is greater than or equal to this threshold, the file is replicated; otherwise it is not.
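Putting Equations (2)-(4) together, the following sketch computes a popularity degree, derives a replica factor and applies the threshold test; all numeric inputs (access counts, interval starts, replica count, file size, the competing RF values and ω) are illustrative assumptions, not values from [15].

```python
import math

def forgetting(t_p, t_s, k=1.0):
    """Time-based forgetting function of Eq. (2): e^{-(t_p - t_s)^k}, in [0, 1]."""
    return math.exp(-((t_p - t_s) ** k))

def popularity_degree(interval_accesses, t_p, k=1.0):
    """
    Popularity degree of Eq. (3): each interval's access count an_k(t_i, t_{i+1})
    is discounted by how long ago that interval started.
    `interval_accesses` is a list of (t_i, count) pairs.
    """
    return sum(count * forgetting(t_p, t_i, k) for t_i, count in interval_accesses)

def replica_factor(pd, replicas, file_size):
    """Replica factor of Eq. (4): RF_i = PD_i / (RN_i * FS_i)."""
    return pd / (replicas * file_size)

def should_replicate(rf_i, rf_all, rf_sys, omega=0.5):
    """Replicate when RF_i meets the threshold min((1 + omega)*RF_sys, max_k RF_k)."""
    threshold = min((1 + omega) * rf_sys, max(rf_all))
    return rf_i >= threshold

# Illustrative numbers only: a block accessed 8, 5 and 2 times in intervals
# starting at t = 1, 2, 3, evaluated at present time t_p = 4.
pd = popularity_degree([(1, 8), (2, 5), (3, 2)], t_p=4)
rf = replica_factor(pd, replicas=2, file_size=1.5)
print(round(pd, 3), round(rf, 3))                               # 1.811 0.604
print(should_replicate(rf, rf_all=[rf, 0.4, 0.9], rf_sys=0.6))  # False
```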
Furthermore, [16] discusses a replication strategy called best client, which marks a file as popular if the number of accesses to that file exceeds a threshold. All of these approaches lead to two conclusions: firstly, a file is marked popular if its access count passes a certain threshold value; and secondly, a popular file can be identified by comparing access frequencies over time and selecting the file with the largest access frequency.