In the following we describe the protocol[25] to be used in RAIN in greater detail. The scheme depends on creating a small mix-net[26] in the cloud, together with a collection of autonomous cloud processing agents explained below (see Figure 4). Each autonomous agent retains a one-to-one relationship with a cloud storage provider. Each agent has a unique ID that is known to the C&C node, but this knowledge should in principle not allow the C&C node to locate the agent. Furthermore, we need to create a cloud service that can serve as a broadcast medium similar to an IRC channel; for simplicity we will call this the IRC node.
Protocol for storing data
First, we will describe briefly the main thrust of the protocol, with more details in the following subsections.
Let D be a piece of data to be split and stored in the clouds. The user U will send D to the C&C node, encrypted with that node’s public key:
Here, “auth” is an authentication token used to verify the user’s rights to the dataset. If this message is replayed, it will be ignored; the only way to re-use a data ID is to first delete the dataset that uses it.
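The replay handling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the message fields (`auth_ok`, `data_id`) and the in-memory set of active IDs are assumptions standing in for the C&C node's actual state.

```python
# Sketch of the C&C node's replay handling for store requests.
# Field names and the in-memory ID set are illustrative assumptions.
class StoreHandler:
    def __init__(self):
        self._active_ids = set()   # data IDs currently in use

    def handle_store(self, auth_ok: bool, data_id: str, data: bytes) -> str:
        if not auth_ok:
            return "rejected"
        if data_id in self._active_ids:
            return "ignored"       # replayed message or duplicate ID
        self._active_ids.add(data_id)
        return "stored"

    def handle_delete(self, auth_ok: bool, data_id: str) -> str:
        if auth_ok and data_id in self._active_ids:
            self._active_ids.discard(data_id)  # frees the ID for re-use
            return "deleted"
        return "rejected"
```

Deleting a dataset is thus the only path that frees its ID for a later store, matching the replay rule above.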
The C&C node performs the split[18] such that H can be represented as H = &lt;D_s, R_s&gt;, where D_s denotes the sequence of data segments and R_s specifies how the segments are related to each other; this knowledge is necessary for reassembly. The resulting sequence H can thus be written H = (d_1, d_2, …, d_n). We need to assign a unique ID (or pseudonym) to each data item, which we in the following refer to as ID_di.
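The split-and-label step can be sketched as follows. The paper leaves the split[18] and the relation R_s abstract; here, purely for illustration, the split is a fixed-size segmentation and R_s is realised as the ordered list of pseudonyms.

```python
import os

def split_data(data: bytes, n: int):
    """Split D into n segments d_1..d_n, each tagged with a random
    pseudonym ID_di. R_s is realised here as the ordered pseudonym
    list (one possible choice; the paper leaves R_s abstract)."""
    seg_len = -(-len(data) // n)          # ceiling division
    segments = [data[i * seg_len:(i + 1) * seg_len] for i in range(n)]
    ids = [os.urandom(16).hex() for _ in segments]   # unlinkable pseudonyms
    r_s = list(ids)                       # reassembly order
    return dict(zip(ids, segments)), r_s

def reassemble(items: dict, r_s: list) -> bytes:
    """Rebuild D from the pseudonym->segment map using R_s."""
    return b"".join(items[i] for i in r_s)
```

Because the pseudonyms are independent random values, nothing about an ID links it to its neighbours in the same file.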
The C&C node then distributes H among the cloud storage providers by assigning their respective identifiers (CS_x, …, CS_y) to the corresponding d_i; this is sent through the mix-net (see Section 4.2 for details) to the IRC node:
The C&C node needs to maintain a table mapping which data items have been sent to which cloud storage service. Furthermore, it is important that the pseudonyms are unique within a given cloud storage provider, but created in such a manner that it cannot be determined that two different pseudonyms refer to data items from the same file. As mentioned in Section 3.1, we assume that the traffic volume will help hide which items belong to which datasets, so although we are effectively broadcasting the mapping tables, this should not matter: an adversary can tell that the data item with pseudonym X is stored with cloud storage service Y, but this information is of little use if there is no way to tie the pseudonym to a dataset (or user). Furthermore, due to the use of the mix-net, the identity of the cloud storage provider is effectively a pseudonym as well.
The IRC node then publishes this information on its broadcast medium, where all the autonomous agents are listening. When an autonomous agent sees its own ID, it copies the associated d_i and stores it at its associated cloud storage provider.
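The agent-side filtering can be sketched in a few lines. The tuple message format `(cs_id, item_id, payload)` is an assumption for illustration; the paper does not fix a wire format.

```python
def agent_filter(messages, my_id):
    """A storage agent listens to the broadcast channel and keeps only
    items addressed to its own ID; everything else is discarded.
    Message format (cs_id, item_id, payload) is assumed."""
    stored = {}
    for cs_id, item_id, payload in messages:
        if cs_id == my_id:                 # addressed to this agent
            stored[item_id] = payload      # then pushed to its provider
    return stored
```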
Mix-net in the cloud
Chaum’s original mix-net idea[26] has been employed with some success in the Tor network[19, 27]. In the following, we will describe a simplified scheme tailored to the task at hand.
We assume we have a set of n autonomous agents m_1, m_2, …, m_n. The agents run as cloud web services, and their addresses and public keys are known to the C&C node.
By a slight paraphrasing of Chaum[26], when using a single mixer node m_1, communication from Bob to Alice via m_1 occurs as follows:
Here, R_a and R_m are nonces that are discarded upon decryption, and d_i is the item of data to be sent. The only purpose of the nonce here is to prevent repeated sending of identical plaintexts from generating the same ciphertext. The parameter CS_j identifies the storage agent, which is effectively a pseudonym for the cloud storage provider below. By recursively applying the scheme above, it is possible to extend it to an arbitrary number of mixer nodes.
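The recursive layering can be sketched as follows. A toy XOR stream cipher stands in for the public-key encryptions K_x(…) of Chaum's scheme (it is not secure, and nonces are omitted); the point is only the structure: one layer per mixer node, each node stripping exactly its own layer.

```python
import hashlib

def toy_encrypt(key: bytes, msg: bytes) -> bytes:
    """Toy XOR stream cipher standing in for public-key encryption.
    NOT secure; illustration of the layering only."""
    stream = b""
    counter = 0
    while len(stream) < len(msg):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(msg, stream))

toy_decrypt = toy_encrypt   # XOR is its own inverse

def mix_wrap(payload: bytes, node_keys: list) -> bytes:
    """Wrap the payload in one encryption layer per mixer node,
    innermost layer for the last node on the path."""
    for key in reversed(node_keys):
        payload = toy_encrypt(key, payload)
    return payload

def mix_unwrap(blob: bytes, node_keys: list) -> bytes:
    """Each node in turn strips its own layer."""
    for key in node_keys:
        blob = toy_decrypt(key, blob)
    return blob
```

With real public-key encryption, each node can remove only its own layer, so no single node sees both sender and final plaintext.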
In our case, the sender is the C&C node, and the final recipient is always the IRC node. The sole purpose of the mix-net is thus to hide the identity of the C&C node from the IRC node. Naturally, this only makes sense if there are multiple C&C nodes in the system as a whole. In the protocol descriptions, we will use the notation mix(…) to indicate that a message is sent through the mix-net.
IRC node
The IRC node receives a large number of data items from multiple C&C nodes, and for each data item, the parameter CS_j identifies which storage agent should handle the item. The IRC node then simply sends the following to all storage agents, using a fixed multicast address Rain:
The multicast traffic is UDP-based, so there is no acknowledgment or retransmission at the transport level. Although not explicitly shown here, an important feature is that each data item must be sent to multiple storage agents; this redundancy both duplicates storage and provides tolerance of bit errors in transmission.
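The redundancy step can be sketched as datagram construction (the actual UDP multicast send is omitted). The JSON payload layout and the `redundancy` parameter are illustrative assumptions, not from the paper.

```python
import json

def make_datagrams(item_id: str, payload_hex: str,
                   agent_ids: list, redundancy: int = 3):
    """Build the multicast datagrams for one data item, addressed to
    `redundancy` distinct storage agents. The target agent ID travels
    in the clear, since the agents filter on it. Format is assumed."""
    targets = agent_ids[:redundancy]
    return [json.dumps({"cs": a, "id": item_id, "d": payload_hex}).encode()
            for a in targets]
```

Sending the same item to several agents means a single lost or corrupted datagram does not lose the item.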
Storage agent
Each storage agent subscribes to the Rain multicast address, and thus receives all the data items, but discards all messages that are not addressed to it. Note that the storage agent ID (CS_j) can be viewed as a pseudonym, since it is never used as a return address in any way, and thus cannot be directly linked to the storage agent.
The storage agents need to maintain a record of data items and associated IDs; the IDs are not actually revealed to the cloud storage providers. However, the storage agents are not anonymous to the cloud storage providers, i.e., the cloud storage providers can log the real addresses of the storage agents, but they do not know their pseudonyms.
Data retrieval
The user may ask the C&C to retrieve a dataset:
When asked to retrieve a dataset, the C&C node will need to ask each storage service via the IRC to return their respective data items:
Note that we do not need to authenticate when retrieving individual data items in order to fulfill any security claims made by RAIN.
Unfortunately, simply running the storage process in reverse by asking for the data does not work, since an observer could then quickly link a storage agent’s pseudonym to its address. Instead, when we need to retrieve a data set, the C&C node will instruct the IRC node to issue a “call for data items”, listing the IDs of the data items. In order to complicate traffic analysis, the IRC node will also ask for some bogus IDs; these will simply be discarded.
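The “call for data items” can be sketched as below. The number of bogus IDs is an illustrative tuning knob; the paper only requires that some are present and later discarded.

```python
import secrets

def call_for_items(real_ids: list, n_bogus: int = 4):
    """Build the IRC node's 'call for data items': real pseudonyms
    mixed with bogus ones and shuffled, to complicate traffic
    analysis. n_bogus is an illustrative assumption."""
    bogus = [secrets.token_hex(16) for _ in range(n_bogus)]
    request = list(real_ids) + bogus
    secrets.SystemRandom().shuffle(request)
    return request, set(bogus)   # the issuer remembers which are bogus
```

An observer of the broadcast cannot tell which of the listed IDs correspond to real stored items.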
The storage agents that find matching data item IDs in their records will retrieve these from the storage providers. In addition, they will also retrieve some other random data, which will be discarded. Storage agents that find none of their stored data in the list will periodically retrieve random data and send it on as explained below.
The retrieved data is then sent back to the IRC node, but this time via the mix-net. The IRC node then sends each data item back to the C&C node, again via the mix-net. This operation is dependent on the C&C node providing the IRC node with an anonymous return address[26].
Each storage agent responds with its piece of the puzzle, and the IRC node forwards everything to the C&C node:
The C&C node then re-assembles the data, and either returns it to the user:
or sends it off to be processed as explained in the next section.
The complete picture is illustrated in Figure 5.
Processing data in the cloud
When the user wants to do something with the data, it will tell the C&C node:
Here, “operation” identifies what should be done, ID_D identifies the dataset, and N_u is a nonce chosen by the user.
The data will first have to be retrieved and re-assembled as explained above. The C&C node then selects an appropriate number of cloud processing providers, depending on the type of data and what is to be done with it. If the data is, e.g., a digital image, and the user wants to manipulate it using a Cloud-based image editor, then the complete data set typically needs to be sent to a single processing provider.
The data, the nonce chosen by the C&C node, a symmetric key K_PC for encrypting the response, and the remaining parameters are encrypted with the public key of the cloud processing provider, and sent through the mix-net.
Note that since the C&C node keeps track of requests to processing providers, the operations are idempotent; replayed responses are ignored, and in case of response failures, a new request will be sent, canceling the former.
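The idempotency bookkeeping can be sketched as a nonce-keyed table of outstanding requests. The method names and table layout are illustrative assumptions.

```python
class RequestTracker:
    """C&C-side bookkeeping making processing operations idempotent:
    each outstanding request is keyed by its nonce; replayed responses
    and responses to unknown nonces are ignored (sketch only)."""
    def __init__(self):
        self._outstanding = {}   # nonce -> operation

    def send(self, nonce: str, operation: str):
        self._outstanding[nonce] = operation

    def receive(self, nonce: str, result):
        if nonce not in self._outstanding:
            return None                    # replay or spurious: ignore
        del self._outstanding[nonce]       # closes the request
        return result

    def retry(self, old_nonce: str, new_nonce: str):
        """On response failure: cancel the former request and re-issue
        the same operation under a fresh nonce."""
        op = self._outstanding.pop(old_nonce)
        self._outstanding[new_nonce] = op
```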
The result is returned to the user:
Again, the user will reject any spurious responses with a nonce that doesn’t match that of an outstanding request.
If the result is a change in the dataset, it will either have to be re-stored or delivered to the user, depending on the user’s wishes. If data items need to be updated or deleted, the authentication mechanism comes into play again. In any case, a confirmation is sent to the user, closing the outstanding request.
An example of an editing operation is shown in Figure 6. In this case, an image of a rodent (Figure 6a) is to be modified to become a feline (Figure 6d). This example also highlights an optimization opportunity: Figures 6b and 6c identify the modified areas of the image, and on completion only these parts need to be re-stored. The exact mechanisms for determining which parts have changed are, however, beyond the scope of this article.
Implementation considerations
Space does not permit a full implementation specification, but in the following we will illustrate in a little more detail how the actual storage and retrieval process may be realized from the C&C node’s point of view.
Although we do not go into specifics here, it is clear that the actual splitting must depend on the type of document. The process is illustrated in Figure 7. A user (or a client running, e.g., in a cloud environment) initiates writing of content to the system. The user can configure which storage providers to use for certain file types or content. Part of the configuration contains information on how each of the storage providers can be used, i.e., a description of how to access, write and read content; typically this will be a proprietary web API. Based on the selection of available storage providers and the content type, a recipe is generated. The recipe states the size of the blocks the original file is to be split into, and a sequence for writing the blocks to the various storage providers. Based on the recipe, the content is divided into blocks, each of which is stored at a storage provider. The recipe itself is stored; using the recipe, the content can later be retrieved from the storage providers and reassembled. The fileID is returned to the initiating party.
The retrieval process is illustrated in Figure 8. A user (or a client running, e.g., in a cloud environment) initiates reading of content from the system. The recipe is retrieved based on the fileID; based on the recipe, the file is read from the storage providers and reassembled. The assembled file is returned to the initiating party.
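Both paths can be sketched together. The recipe here is a dict with a block size and a round-robin provider sequence, and the provider APIs are abstracted as an in-memory dict; all names are illustrative assumptions, not the paper's specification.

```python
def store_with_recipe(content: bytes, providers: list, block_size: int = 4):
    """Write path sketch: generate a recipe (block size plus provider
    sequence), split the content accordingly, and return the recipe
    together with the per-provider blocks. Providers are round-robin,
    and their APIs are abstracted as an in-memory dict."""
    blocks = [content[i:i + block_size]
              for i in range(0, len(content), block_size)]
    sequence = [providers[i % len(providers)] for i in range(len(blocks))]
    recipe = {"block_size": block_size, "sequence": sequence}
    stores = {}                      # provider -> list of (index, block)
    for idx, (blk, prov) in enumerate(zip(blocks, sequence)):
        stores.setdefault(prov, []).append((idx, blk))
    return recipe, stores

def read_with_recipe(recipe: dict, stores: dict) -> bytes:
    """Read path sketch: the recipe's provider sequence says where each
    block lives; blocks are fetched and reassembled in index order."""
    out = []
    for idx, prov in enumerate(recipe["sequence"]):
        block = dict(stores[prov])[idx]    # look up block idx at provider
        out.append(block)
    return b"".join(out)
```

Without the recipe, the blocks held by any single provider are just fragments with no stated order or completeness.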