The rapid growth of outsourced data processing and storage on distributed computing frameworks, together with the mining of large and complex data collections, has made useful data available to organizations across modern society at an exponential rate. However, data privacy is a principal concern when managing huge datasets in a cloud environment, since the dataset owner no longer has physical control over the data, as noted by the Cloud Security Alliance (CSA). For instance, a remote body-sensor monitoring system contains sensitive data, such as patient identity and address, whose disclosure can have harmful consequences. Therefore MapReduce, a computing framework that processes and generates large distributed datasets, must be strongly secured to preserve data privacy in the cloud. Security assurance issues related to MapReduce and the cloud have begun to draw serious attention. Puttaswamy et al. presented Silverline, a set of tools that isolates all functionally encryptable data from other cloud application data to guarantee data protection. Likewise, Zhang et al. proposed a solution based on an upper bound on privacy leakage, which addresses data privacy by encrypting only part of the data available in the cloud. Roy et al. proposed Airavat, a framework that enforces mandatory access control via differential privacy techniques. Blass et al. proposed PRISM, a data privacy model for the MapReduce framework in the cloud that performs parallel word search over encrypted data collections. Ko et al. proposed the HybrEx MapReduce model, which processes very sensitive and private information in a private cloud, while other data can be safely processed in the public cloud. Zhang et al. proposed Sedic, a system that partitions MapReduce computing tasks according to the security labels of the data.
Thereby, these schemes assign the computation that involves no sensitive data to a public cloud. However, conventional data encryption, which is considered the essential technique for addressing security issues in the cloud, cannot be implemented straightforwardly in the MapReduce framework. A new cryptosystem, homomorphic encryption, which can process encrypted data, was therefore introduced to provide an effective solution to this challenge. Fully homomorphic encryption (FHE) can compute an arbitrary function on encrypted data without using the secret key. However, FHE raises two major challenges regarding its implementation and computation cost. First, the number of FHE schemes available in the literature is very limited. Second, the efficiency of fully homomorphic encryption is the biggest challenge to address; it has been the key problem since FHE's invention and hinders its myriad potential applications, such as private cloud computing, in practice. In particular, the keys in FHE schemes are big: an FHE scheme based on Learning with Errors (LWE) includes not only a public key and a private key but also several evaluation keys. An L-leveled FHE scheme has L evaluation keys, each of which is a matrix, and the public key is at least one matrix as well. These matrices are high-dimensional, so they not only need a lot of space to store but also hurt computational efficiency. In 2009, an updated construction by Gentry appeared to be an effective
alternative for addressing the data privacy assurance issue, which the information security community considers a paramount topic. Schemes such as DGHV and Gen 10 were thereby introduced to compute securely over data through homomorphic approaches in the MapReduce environment. Unfortunately, those solutions suffer from some basic security shortcomings. The DGHV scheme uses many of the tools from Gentry's construction but does not require ideal lattices; its authors prove that the somewhat homomorphic component of Gentry's ideal-lattice-based scheme can be replaced with a very simple somewhat homomorphic scheme over the integers. Their model is therefore conceptually simpler, but the private key must be transferred to the cloud server, which is very insecure. Gentry's Gen 10 homomorphic encryption scheme is applicable in the cloud environment and conceptually simple: the encryption function is homomorphic with respect to addition, subtraction, and multiplication, and the relationship between a ciphertext c and its plaintext m is that m is the residue of c modulo the secret p, that is, m = c mod p. To retrieve the ciphertext, Gen 10 does not expose the private key but uses a random parameter q instead, which seems more secure. Yet the parameter q is sent to the server, so the plaintext m may leak out, and this scheme suffers from a security weakness too. The goal of this paper is to construct an efficient FHE scheme with a better key size. In this work, we therefore introduce an optimized scalar homomorphic scheme (Op_FHE_SHCR) that addresses the above data privacy protection shortcomings through efficient operations over ciphertexts, without compromising the cryptosystem as the existing models do. Furthermore, we demonstrate how fast our proposed solution retrieves ciphertexts at the Reduce stage in an optimized and secure way.
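The integer-based approach can be illustrated with a toy, bit-level sketch in the DGHV style (the parameters below are tiny and purely illustrative, not the actual scheme or secure values): a bit m is hidden as c = m + 2r + pq with small noise r, and decryption recovers (c mod p) mod 2, so addition and multiplication of ciphertexts act as XOR and AND of the underlying bits.

```python
import random

# Toy integer somewhat-homomorphic sketch (DGHV-style, bits only).
# p is the secret odd modulus; parameters are far too small to be secure.
p = 10007

def enc(m, p):                 # m in {0, 1}
    q = random.randint(1, 50)  # multiplier hiding m behind a multiple of p
    r = random.randint(0, 5)   # small noise, so 2r + m stays well below p
    return m + 2 * r + p * q

def dec(c, p):
    return (c % p) % 2

a, b = enc(1, p), enc(0, p)
assert dec(a + b, p) == 1      # homomorphic XOR (bit addition)
assert dec(a * b, p) == 0      # homomorphic AND (bit multiplication)
```

Decryption stays correct only while the accumulated noise remains below p, which is exactly the "somewhat" limitation that bootstrapping addresses in full FHE.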
The remainder of this paper is organized as follows. Section 2 presents the programming and cryptographic primitives. Section 3 briefly discusses related work. Section 4 gives the details of the concrete construction and the security analysis of our proposed solution. Finally, Section 5 concludes this work.
2.1. MapReduce
MapReduce is defined as a computation framework for processing and generating large datasets. In this programming environment, users specify a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key. A great variety of problems are expressible in MapReduce, such as counting the number of occurrences of each term in a large dataset. The MapReduce architecture is shown below (Figure 1).
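The word-count problem just mentioned can be sketched in the Map/Reduce style as a minimal single-process example (a real cluster would shuffle the intermediate pairs across workers):

```python
from collections import defaultdict

# Minimal word count in the MapReduce style: Map emits intermediate
# key/value pairs, the shuffle groups them by key, Reduce merges values.
def map_fn(_, text):
    for word in text.split():
        yield word, 1            # intermediate (word, 1) pairs

def reduce_fn(word, counts):
    return word, sum(counts)     # merge all values sharing the same key

def mapreduce(records):
    groups = defaultdict(list)
    for key, value in records.items():
        for k, v in map_fn(key, value):
            groups[k].append(v)  # shuffle: group by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

result = mapreduce({"doc1": "to be or not to be"})
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```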
2.2. Homomorphic Encryption Scheme
Figure 1. MapReduce execution overview.
A homomorphism between two algebras A and B over a field (or ring) K is a map F: A → B such that for all k in K and x, y in A: F(x + y) = F(x) + F(y), F(xy) = F(x)F(y), and F(kx) = kF(x).
If F is bijective, then F is said to be an isomorphism between A and B.
• Homomorphic Encryption
As an ever-increasing amount of data is outsourced to distributed storage, often unencrypted, considerable trust is required in the cloud providers. The CSA lists data breach as the top issue for cloud security. Encrypting the data with conventional encryption addresses this issue, but then the end user cannot work on the encrypted data and must download it and perform the decryption locally. There is therefore a need to allow the public cloud server to perform computations on behalf of end users and return only the encrypted result. The development of homomorphic encryption is a very impressive advance in this direction, greatly amplifying the range of computation that can be applied to encrypted data. The enthusiasm of the research community is justified by the many real-world applications of this theme, such as medical applications, consumer privacy in advertising, data mining, and financial privacy.
Homomorphic encryption allows complex mathematical operations to be performed on encrypted data without compromising the encryption. It is expected to play an important part in cloud computing, allowing companies to store and process encrypted data in a public cloud and take advantage of the cloud provider's analytic services. It was first proposed in 1978 by Rivest et al. and refined by the research community; Craig Gentry later showed theoretically that such an encryption scheme can be realized, and Jaydip Sen concisely represents homomorphic encryption as a quadruple. However, homomorphic ciphers typically do not, in and of themselves, provide verifiable computing, and some variants are not semantically secure. Furthermore, poor performance is the big disadvantage of this approach: ciphertexts are much larger than plaintexts, so communication requirements go up, and computations on these large ciphertexts are typically slower than performing the computation on the plaintext itself. For this reason, the outsourcing computation model typically requires that encrypting inputs and decrypting outputs be faster than performing the computation itself. There is therefore a strong need to optimize data processing in order to reduce computation and communication costs efficiently.
3. Related Work
Security protection issues in the MapReduce framework have begun to draw intensive attention. Data confidentiality protection has been widely examined, and productive progress has been accomplished by security practitioners. We briefly review a few existing models for privacy protection in the MapReduce framework.
3.1. The Xu Chen and Qiming Huang Scheme
The authors present a data privacy protection scheme for MapReduce using homomorphic encryption. It is a modified MapReduce model that guarantees data secrecy while processing the data in encrypted form. They pick two large prime numbers A and B and compute P = A * B; A serves as the private key, and B must likewise be kept confidential.
To apply homomorphic encryption in this scheme, the authors make a few modifications to the ciphertexts so that the Reduce function can find identical keys and then group them, as follows:
For each ciphertext, the authors derive a modified ciphertext using a random positive integer R.
Then the authors compare these modified ciphertexts, rather than the original ciphertexts C, to retrieve identical keys. Their proposed solution therefore needs an additional computation at the Reduce phase, which is costly, in order to obtain a probabilistic homomorphic cryptosystem. Recall that homomorphic cryptosystems are already exceptionally expensive; their model is therefore inefficient.
3.2. The FHE_SHCR Scheme
As discussed above, the related work requires additional computation at the Reduce stage, and the DGHV and Gen 10 schemes send, respectively, their private key and sensitive security parameters to an unreliable public cloud server, compromising the cryptosystem. Moreover, none of these models addresses the security and efficiency of ciphertext retrieval. The authors therefore present FHE_SHCR, which builds on earlier schemes to address these privacy shortcomings. The main objective of this model is to retrieve ciphertexts securely at the Reduce stage and to enhance the accuracy of the retrieval algorithm without revealing any information about the content of the intermediate searchable ciphertexts. Through a strong hybrid encryption, the FHE_SHCR scheme is an efficient candidate for preserving data privacy in the cloud.
Note that the improvement in this paper lies mainly in optimizing the input-file decomposition (Map phase) and the ciphertext retrieval algorithm (Reduce phase) by addressing metadata dynamics and the authentication path through a logical Merkle tree repository structure with optimized space-time cost.
4. The Optimized FHE_SHCR (Op_FHE_SHCR)
As the research community has clearly shown, homomorphic encryption can effectively carry out some operations over encrypted data, but it is a very expensive scheme in terms of computation and communication costs. We therefore introduce a new logical agent, the Anonymiser (with its three components: the Decomposition Table, Query Processing, and the Metadata Repository), at the user side under the control of the master program. In this way, the user program can efficiently send the master program the optimal decomposition (splitting into key/value pairs) of a given input file before the data is encrypted and shuffled. We use our optimized ternary search trie (TST) in a logical Merkle tree structure to optimally address metadata authentication and dynamics through the Metadata Repository component. The architecture of the proposed solution is depicted in Figure 2.
Thus, in successful experiments (see the section below), our optimized algorithm (Op_FHE_SHCR) performs three times faster than the original FHE_SHCR scheme and effectively addresses the metadata dynamics and authentication issues through a secure and efficient metadata repository (an optimized ternary search trie that addresses the space-time constraints). Furthermore, we speed up the ciphertext retrieval algorithm (in accuracy and efficiency) at the Reduce phase by using the optimum number m* obtained by the Lagrange-multiplier method (see the section below). Note that the implementation of this Anonymiser, our Trusted Front End database management agent (TFE), enhances the security and speeds up the data processing of our proposed solution and represents the key point of this extended work.
4.1. Optimized Metadata Authentication and Dynamics
As mentioned in the previous section, our proposed scheme further addresses the metadata authentication and dynamics issue for strong data privacy protection. We therefore introduce a logical agent, the Anonymiser, in the master control program.
Figure 2. Architecture Overview of optimized FHE_SHCR.
The Anonymiser has three components, namely the Decomposition Table, Query Processing, and the Metadata Repository. Their functions can be briefly described as follows:
• Decomposition Table: it is responsible for defining the exact set of attributes (A) for a particular input file, in the optimal number.
• Query Processing: it filters the map workers' query requests generated by the master program to produce anonymous query-based requests for data locations.
• Metadata Repository: it keeps the data decompositions produced by the Decomposition Table and forwards them to the Query Processing unit to generate new anonymous query requests. For the efficiency of the proposed scheme, we use a Merkle hash tree structure to handle metadata authentication and dynamics. The master program then assigns a particular input-file decomposition table (A) to map workers for processing, logically as below:
All input files are transformed into a set of symbols (A), as depicted in Figure 3. Data matching and authentication start from the root node in a top-down manner, and the dynamics of the process can be described as follows.
File uploading: Suppose that a data owner wants to process a file F, identified by a leaf node, with a public cloud server whose attributes satisfy an access policy defined by the private cloud server. Assume that the
Figure 3. Efficient Symbol-based Tree Retrieval.
file F consists of a keyword set W. The owner randomly chooses a symmetric key from the key space and encrypts the file F with it to obtain a ciphertext. Subsequently, the data owner runs the Encrypt algorithm under the access policy AP to obtain the ciphertext of the symmetric key with respect to AP. The owner uploads both ciphertexts to the public cloud. Furthermore, to generate the trapdoors for the keywords in W, the owner also sends the access policy AP and the keyword set W to the private cloud. Upon receiving them, the private cloud transforms the access policy AP into a set of privileges and computes, for each privilege and each keyword, the corresponding trapdoor using the associated symmetric key. Finally, the private cloud sends these trapdoors to the public cloud as well.
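The uploading flow can be sketched end to end in toy form. This is only an illustration of the hybrid pattern, not the paper's actual cryptosystem: the XOR keystream below stands in for a real symmetric cipher, and the ABE encryption of the session key is reduced to a placeholder string.

```python
import hashlib, os

# Toy hybrid-encryption sketch: a random session key k encrypts the file;
# in the real scheme, k itself would then be encrypted under the access
# policy with ABE. The keystream cipher here is NOT secure; it only
# illustrates that decryption with the same key inverts encryption.
def keystream(key, n):
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def sym_encrypt(key, data):
    # XOR with a key-derived stream; applying it twice decrypts.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

file_F = b"patient record: sensitive"
k = os.urandom(16)                    # session key chosen from the key space
c_F = sym_encrypt(k, file_F)          # ciphertext of the file
c_k = "Encrypt(AP, k)"                # placeholder for the ABE-encrypted key

assert sym_encrypt(k, c_F) == file_F  # decryption inverts encryption
```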
To enhance searching efficiency, a symbol-based tree is used to build an index stored in the private cloud (the metadata repository). More precisely, the output of a one-way function f is divided into l parts, and a set is predefined consisting of all the possible values of each part (an example of such a tree is shown in Figure 3). Initially, the index based on the symbol-based tree has only a root node (denoted node 0), which consists of ∅ (the empty set). The search process in a symbol-based tree is a depth-first search. The tree can be updated and searched as follows.
Update: Assume the data owner wants to outsource a file F with keyword set W. The public cloud receives the file's ciphertexts from the data owner and the corresponding trapdoors, one per privilege and keyword in W, from the private cloud. For each trapdoor, the public cloud adds it to the trie index using the following steps.
i) Step 1: The public cloud parses the trapdoor as a sequence of symbols.
ii) Step 2: The public cloud starts at the root node of the tree: it scans all the children of the root and checks whether some child contains the first symbol. This matching proceeds in a top-down manner. In general, assuming a prefix of the symbol sequence has been matched, the public cloud examines all the children of the current node and attempts to find one containing the next symbol. If such a node exists, it becomes the current node and the following symbol becomes the next matching object; otherwise, jump to Step 3.
iii) Step 3: If the current node has no child matching the next symbol, the public cloud builds nodes for all the remaining symbols and links them as a node list appended to the current node. Finally, it adds a node carrying the file identifier as the leaf appended at the end.
Search: Assume a legitimate user wants to search the outsourced files with a keyword w and a set of privileges; the public cloud receives the corresponding trapdoors from the private cloud. For each trapdoor, the public cloud performs actions similar to the three steps described above, with one exception: if matching fails (i.e., the current node has no child matching the next symbol), the search for that trapdoor is aborted. Otherwise, the corresponding ciphertext is retrieved through the identifier in the leaf node.
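The update and search steps above can be sketched as a small trie; the symbols, identifiers, and payloads here are illustrative placeholders, not the scheme's actual trapdoor encoding.

```python
# Sketch of the symbol-based index from steps 1-3: each trapdoor is parsed
# as a symbol sequence and inserted top-down; search follows the same path
# and aborts on the first mismatch.
class Node:
    def __init__(self):
        self.children = {}   # symbol -> child Node
        self.leaf = None     # file identifier stored at the leaf

def trie_update(root, symbols, file_id):
    node = root
    for s in symbols:        # steps 1-2: descend along matching children,
        node = node.children.setdefault(s, Node())  # step 3: build the rest
    node.leaf = file_id      # append the leaf carrying the identifier

def trie_search(root, symbols):
    node = root
    for s in symbols:
        if s not in node.children:
            return None      # matching fails: abort this search
        node = node.children[s]
    return node.leaf

root = Node()
trie_update(root, ["a3", "f1", "0c"], "F_17")
assert trie_search(root, ["a3", "f1", "0c"]) == "F_17"
assert trie_search(root, ["a3", "ff"]) is None
```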
To address the Merkle tree traversal problem, our scheme borrows tools from an efficient algorithm in the literature to overcome the space-time issue. Furthermore, to optimize the space-time constraints of the Merkle hash tree traversal process, we designed an optimized ternary search trie (TST), whose construction blends quicksort and radix sort. It is competitive with the best known C sort codes, and it is faster than traditional hashing and other commonly used search methods, as shown below (Figure 4):
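A minimal TST in this style can be sketched as follows (a generic Bentley-Sedgewick ternary search trie, not the paper's exact implementation): each node holds one character plus low/equal/high links, giving radix-search behavior with quicksort-like three-way partitioning.

```python
# Minimal ternary search trie: put/get over string keys.
class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None   # three-way links
        self.value = None                    # payload at end of a key

def tst_put(node, key, value, d=0):
    ch = key[d]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_put(node.lo, key, value, d)
    elif ch > node.ch:
        node.hi = tst_put(node.hi, key, value, d)
    elif d < len(key) - 1:
        node.eq = tst_put(node.eq, key, value, d + 1)
    else:
        node.value = value
    return node

def tst_get(node, key, d=0):
    while node is not None:
        ch = key[d]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif d < len(key) - 1:
            node, d = node.eq, d + 1
        else:
            return node.value
    return None

root = None
for i, k in enumerate(["sea", "shells", "she", "by"]):
    root = tst_put(root, k, i)
assert tst_get(root, "she") == 2
assert tst_get(root, "shore") is None
```

Space stays proportional to the stored strings because common prefixes share nodes, which is why TST size grows with the number of strings N, as discussed next.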
The TST is space-efficient, but its size grows with the number of strings N. The traversal problem is therefore to calculate efficiently the authentication path of every leaf, one after another from the first leaf to the last, at a minimum space-time cost. This requires analyzing the optimal distribution of singleton attributes to enhance the efficiency of the proposed solution, that is, finding the optimal number of strings or attributes N to populate the tree. In this work we use the Karush-Kuhn-Tucker (KKT) conditions for constrained optimization to solve this issue in the section below. Practically, we design our solution using mathematical tools
Figure 4. String symbol table implementation cost summary from  .
from an earlier scheme to find the minimum number of singleton quasi-identifiers that gives the optimal security level for the efficiency of the proposed traversal algorithm.
Let the columns of the Anonymiser decomposition table supply the values that populate the respective level sets of the Merkle tree, and let N be the total number of distinct values taken across these level sets. We assume that each of the n entries of the decomposition table takes value i with probability p_i, where the p_i sum to 1.
Therefore, the probability that the i-th value is a singleton in the universal decomposition table, i.e., that exactly one of the n entries takes it, is n p_i (1 − p_i)^(n−1).
Let X_i be the indicator variable for whether the i-th value is a singleton; its expectation is E[X_i] = Pr[X_i = 1] = n p_i (1 − p_i)^(n−1).
Let X = X_1 + ... + X_N be the variable that counts the number of singletons; its expectation is E[X] = Σ_{i=1..N} n p_i (1 − p_i)^(n−1).
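The singleton expectation E[X] = Σ_i n p_i (1 − p_i)^(n−1) can be checked exhaustively on a tiny instance; this sketch assumes the n entries choose their values independently, which is the setting the counting argument describes.

```python
from itertools import product

# Exhaustive check of the singleton expectation for a tiny case:
# n = 3 entries, N = 2 equally likely values (p_i = 1/2 each).
n, N = 3, 2
p = [0.5, 0.5]

# Closed-form expectation: sum_i n * p_i * (1 - p_i)^(n-1)
formula = sum(n * pi * (1 - pi) ** (n - 1) for pi in p)

# Brute force: enumerate every assignment of values to the n entries,
# count values occurring exactly once, and weight by the probability.
expected = 0.0
for assignment in product(range(N), repeat=n):
    prob = 1.0
    for v in assignment:
        prob *= p[v]
    singletons = sum(assignment.count(v) == 1 for v in range(N))
    expected += prob * singletons

assert abs(expected - formula) < 1e-12
print(expected)  # 0.75
```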
We aim to find the smallest number of singletons needed to populate the Merkle tree in the metadata repository efficiently.
This amounts to minimizing E[X] = Σ_{i=1..N} n p_i (1 − p_i)^(n−1), subject to Σ_{i=1..N} p_i = 1, with 0 ≤ p_i ≤ 1 for each i.
We therefore obtain the optimized number of singletons by rewriting the above distribution as a constrained optimization problem. In doing so, we solve the dual of the primal problem, which is fast and reduces the space-time costs, as follows:
KKT conditions (primal variable x; Lagrange multiplier λ ≥ 0). Consider an optimization problem of the form: minimize f(x) subject to g(x) ≤ 0. Using the Lagrange multiplier and the duality theorem, the solution of the primal problem (P) is characterized as follows. The Lagrangian L(x, λ) = f(x) + λ g(x) is a smooth function, so its gradient vanishes at the optimum x*:
∇f(x*) + λ* ∇g(x*) = 0,
together with primal feasibility g(x*) ≤ 0, dual feasibility λ* ≥ 0, and complementary slackness λ* g(x*) = 0. Solving these conditions, or the dual problem they induce, yields the optimum.
Finally, the optimal number of singleton quasi-identifiers for a decomposition table of n entries with a maximum of N distinct values is the optimum m* determined by these conditions. Using this optimum number m* to populate the ternary search tree, we improve the performance of the Merkle tree traversal algorithm by addressing the space-time cost issue. The overall ciphertext retrieval time at the Reduce phase for our optimized algorithm Op_FHE_SHCR is thus almost three times less than that of the existing FHE_SHCR; refer to Figure 5.
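The KKT machinery invoked above can be illustrated on a minimal one-dimensional problem; this is a generic example (not the paper's actual objective, which depends on the decomposition-table statistics): minimize f(x) = x² subject to g(x) = 1 − x ≤ 0, whose optimum is x* = 1 with multiplier λ* = 2.

```python
# Numeric check of the KKT conditions on a toy constrained problem:
# minimize f(x) = x^2 subject to g(x) = 1 - x <= 0.
def f(x): return x * x
def g(x): return 1.0 - x

def grad(fn, x, h=1e-6):
    # central finite-difference gradient
    return (fn(x + h) - fn(x - h)) / (2 * h)

x_star = 1.0                               # the constraint is active here
lam = grad(f, x_star) / -grad(g, x_star)   # stationarity: f' + lam * g' = 0

stationarity = abs(grad(f, x_star) + lam * grad(g, x_star))
assert stationarity < 1e-4                 # stationarity holds
assert lam >= 0                            # dual feasibility
assert g(x_star) <= 1e-9                   # primal feasibility
assert abs(lam * g(x_star)) < 1e-9         # complementary slackness
print(f"x* = {x_star}, lambda* = {lam:.4f}")  # prints: x* = 1.0, lambda* = 2.0000
```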
4.2. Op_FHE_SHCR Efficiency and Implementation Analysis
Despite the advances in integrating wireless sensor networks (WSNs) and control systems into the cloud, there are still enormous challenges in terms of security protection over
Figure 5. Overview performance comparison of cryptosystems.
outsourced data processing and storage. For experimental purposes, we therefore work with a trained sensed dataset from a cancer pattern monitoring project. Our main goal in this paper is to securely optimize the Map phase (input-file decomposition) and the ciphertext retrieval process (Reduce phase). We implement an optimized scalar homomorphic MapReduce scheme, Op_FHE_SHCR, which contains four algorithms: KeyGen, Encrypt, Decrypt, and Retrieval. Note that these four algorithms are quite similar to the existing ones, with a modified mapping and hashing algorithm as below (Algorithm 1).
This algorithm initializes the selected feature subset (splitting the input file into subsets) with the empty set. A candidate feature subset is produced by adding one feature to the selected subset, and its hashing index is computed by Algorithm 2. The algorithm then removes this feature and adds another to generate a new candidate; the new feature in the chosen candidate is added to the selected feature subset. The algorithm thus iteratively adds one feature (or a fixed number of features, if the floating strategy is used) to grow the selected feature subset until the threshold is met. It should be pointed out that the main difference between the proposed algorithm and the existing ones in the literature is that our algorithm produces highly correlated data subsets based on the hashing index value, so the ciphertext retrieval process at the Reduce stage is faster.
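A hedged sketch of this forward-selection loop is below. The paper's actual correlation measure, threshold, and the hashing of Algorithm 2 are not reproduced here, so `score`, `threshold`, and `hash_index` are illustrative placeholders.

```python
import hashlib

def hash_index(subset):
    # Stand-in for Algorithm 2: a stable hashing index over a feature subset.
    payload = "|".join(sorted(subset)).encode()
    return int.from_bytes(hashlib.sha256(payload).digest()[:4], "big")

def select_features(features, score, threshold):
    # Greedy forward selection: start from the empty subset and repeatedly
    # add the feature whose candidate subset scores highest, until the
    # selected subset reaches the threshold.
    selected = set()
    remaining = list(features)
    while remaining and score(selected) < threshold:
        best_f = max(remaining, key=lambda f: score(selected | {f}))
        selected.add(best_f)
        remaining.remove(best_f)
    return selected, hash_index(selected)

# Illustrative run: with subset size as the (placeholder) score and a
# threshold of 2, exactly two features are selected.
subset, idx = select_features(["age", "id", "scan"],
                              score=len, threshold=2)
assert len(subset) == 2
```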
The design of our Op_FHE_SHCR cryptosystem uses the HElib-master-2015.03 library in the Dev C++ IDE. We use the WDBC (Wisconsin Diagnostic Breast Cancer) training dataset from a cancer management project. Our security algorithm is implemented in four steps using the Gentry cryptosystem.
The efficiency of the candidate solution is demonstrated by experimental results comparing the existing blinding fully homomorphic FHE_DFI_LM algorithm, our previous FHE_SHCR, and the new optimized Op_FHE_SHCR algorithm. Recall that the improvement in this paper lies mainly in optimizing the ciphertext retrieval time and the metadata dynamics and authentication path in the logical Merkle tree repository (optimized space-time cost).
Table 1 and Figure 5 show the average performance of our proposed solution (Op_FHE_SHCR) in comparison with the related works (FHE_DFI_LM and FHE_SHCR). Recall that the experimental requirements are to optimize outsourced data processing at the Map stage and to prevent intermediate data disclosure at the Reduce phase, in order to reinforce data privacy in the MapReduce framework. Our Op_FHE_SHCR processes the data at the Map stage (the setup phase; refer to Table 1 and Figure 5) in 5932 ms, roughly 2.2 times faster than FHE_DFI_LM (13078 ms) and nearly twice as fast as our previous FHE_SHCR (11684 ms). This result is obtained through an optimized selection of map workers using the optimum number m* for a given decomposition table of n entries at the splitting step. For a given dataset of size N, our algorithm calculates in advance the exact optimal number of subsets (feature selection) and map workers, which speeds up the splitting and data allocation process at the Map stage. Furthermore, each element (feature) of a subset is selected by an efficient feature selection algorithm (refer to Algorithm 1).
Table 1. Average performance comparison.
Based on the results of the experiments conducted, it is clear that the proposed optimized scheme Op_FHE_SHCR speeds up the setup (input-file decomposition) and the ciphertext retrieval time without compromising the cryptosystem. Figure 5 shows that the proposed alternative is more efficient in terms of ciphertext retrieval and computation cost reduction. Recall that homomorphic cryptosystems are extremely expensive; our contribution is therefore a reliable alternative for minimizing the overall computation and communication costs.
4.3. Op_FHE_SHCR Security Analysis
Two security requirements are to be achieved: data confidentiality and integrity. Our security scheme is therefore based on a hybrid encryption design. Since the file is encrypted with a hybrid encryption, an adversary must first decrypt the symmetric ciphertext; however, the session key is protected by the ABE scheme, so data confidentiality reduces to the confidentiality of ABE. Moreover, the keywords that must be protected against the public cloud are encrypted with a one-way trapdoor function. The underlying ABE scheme used in this work is known to be semantically secure; its private key is covered by the security proof of the ABE scheme, which reduces to the bilinear decisional Diffie-Hellman (BDDH) problem. The BDDH problem is as follows: given g, g^x, g^y, g^z for unknown random values x, y, z, and an element T of the target group, decide whether T = e(g, g)^{xyz}; distinguishing e(g, g)^{xyz} from a random element of the target group is assumed to be hard. We say that the (t, ϵ)-BDDH assumption holds in G if no t-time algorithm has advantage at least ϵ in solving the BDDH problem, for non-negligible ϵ.
Recall that this paper builds on our earlier scheme, and we therefore use the same cryptosystem to address the data privacy preservation challenge. Furthermore, to enhance the security of our proposed solution in terms of data authentication paths and dynamics, we adopt the Merkle hash tree to store the metadata decomposition; its security is free of any number-theoretic conjectures. Indeed, Merkle hash trees are very useful because they allow efficient and secure verification of the contents of large data structures. This security rests mainly on two properties of hash functions:
i) Pre-image resistance: given a hash value h, it is difficult to find a message m such that h = hash(m).
ii) Collision resistance: it is difficult to find two messages m1 ≠ m2 such that hash(m1) = hash(m2).
This data structure is a complete binary tree with an n-bit hash value associated with each node, where each internal node's value is the hash of its children's values. Merkle trees are designed so that a leaf value can be verified with respect to a publicly known root value (A), given the authentication path of the respective leaf; refer to Figure 3. An attacker holding a single hash value and trying to reconstruct (A) needs additional values, called Auxiliary Authentication Information (AAI), which are kept secret by the metadata repository administrator under the supervision of the Anonymiser query system. It is therefore very hard for the public cloud server or an outside attacker to reconstruct the input files from the decomposition table (A).
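The verification just described can be sketched with a small binary Merkle tree; the hash function, leaf contents, and the packaging of the sibling hashes (the AAI) are illustrative, with SHA-256 standing in for the scheme's hash.

```python
import hashlib

def H(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    # Hash the leaves, then repeatedly hash sibling pairs up to the root.
    level = [H(x) for x in leaves]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def auth_path(leaves, index):
    # Collect the sibling hash at each level (the AAI for this leaf).
    level = [H(x) for x in leaves]
    path = []
    while len(level) > 1:
        sib = index ^ 1
        path.append((level[sib], sib < index))  # (sibling hash, sibling-is-left)
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return path

def verify(leaf, path, root):
    # Recompute the root from the leaf and its authentication path.
    h = H(leaf)
    for sibling, is_left in path:
        h = H(sibling + h) if is_left else H(h + sibling)
    return h == root

leaves = [b"a1", b"a2", b"a3", b"a4"]
root = merkle_root(leaves)
assert verify(leaves[2], auth_path(leaves, 2), root)
assert not verify(b"tampered", auth_path(leaves, 2), root)
```

Without the sibling hashes, a single leaf hash is not enough to recompute the root, which is exactly why withholding the AAI keeps the decomposition table unreconstructable by the public cloud.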
To summarize the security analysis: by implementing a secure front-end database management agent (the Anonymiser) on top of the FHE_SHCR security mechanism, data privacy assurance is greatly reinforced in our proposed solution (the optimized FHE_SHCR).
5. Conclusion
In this paper, the requirements are to optimize outsourced data processing at the Map stage and to prevent intermediate data disclosure at the Reduce phase, in order to reinforce data privacy in the MapReduce framework. We therefore implement a secure front-end database management agent, the Anonymiser, with its three components (Decomposition Table, Query Processing, and Metadata Repository), to enhance the data security mechanism of our proposed solution. The cryptographic tool is a scalar homomorphic encryption that performs certain kinds of calculations over encrypted data in a more secure and optimized design. The experimental results show that the optimized cryptosystem Op_FHE_SHCR is an efficient candidate for reducing communication and computation costs. Practically, it takes as input an optimized decomposition table (for map workers) and improves the speed and accuracy of the ciphertext retrieval process (for reduce workers) in the MapReduce environment. Furthermore, we address the metadata dynamics and the space-time cost constraints of Merkle tree traversal in our metadata repository by applying an optimized ternary search trie (TST) algorithm.
Acknowledgements
This work has been supported by the MoE-CMCC (Ministry of Education of China - China Mobile Communications Corporation) Joint Science Fund under grant MCM20130661.