Implementation of a secure genome sequence search platform on public cloud-leveraging open source solutions
© Saxena et al.; licensee Springer. 2012
Received: 30 January 2012
Accepted: 20 June 2012
Published: 19 July 2012
With looming patent cliffs leaving several blockbuster drugs without patent protection, many life sciences organizations are looking for ways to reduce the cost of drug discovery. They are moving from business models in which all drug discovery activities are done in-house to a more economical collaborative innovation model, forming ecosystems through consortiums and alliances with other partners to collaborate, especially in the pre-competitive areas of drug discovery. They are considering cloud computing platforms as the basis for the collaborative drug discovery platforms needed to support these new models. Another area of focus is improving the success rate of drug discovery by creating more complex computer models and performing more data-intensive simulations. Next-generation sequencers are also providing unprecedented amounts of data to work with. Cloud computing has proven to be scalable and capable of meeting the computation needs of the life sciences domain, but security concerns have been a key inhibitor. This paper extends an earlier paper of ours that describes how to use a public cloud to build a scalable genome sequence search platform that enables secure collaboration among multiple partners. It describes a few additional techniques and open source solutions that can be used to address security concerns when using public cloud platforms for collaborative drug discovery activities.
Keywords: Genome sequence search; BLAST; Ensembl; Cloud security; Encryption; Federated identity; SAML; OpenVPN; ACL; Hadoop; Hadoop security
Several blockbuster drugs will go off patent protection by 2015. As a result, several life sciences companies will face cost pressures and will be looking for ways to reduce costs. Current drug discovery business models involve significant redundancies among the various life sciences organizations: they all duplicate effort in the early stages of drug discovery, which are considered pre-competitive and non-differentiating. With increasing cost pressures, several life sciences organizations have come together through industry alliances such as the Pistoia Alliance to look at ways to increase collaboration in the pre-competitive areas of drug discovery and so reduce costs. As part of one such initiative, the members of the Pistoia Alliance have agreed to define and document the standards for a secured sequence service. Ensembl was chosen as one of the relevant public sequence services to be made available as a secure service that multiple life sciences organizations can use instead of each deploying it in-house, maintaining it and keeping it in sync with the constant new releases. Ensembl is a joint project between EMBL-EBI and the Sanger Centre. It produces genome databases for vertebrates and other eukaryotic species, makes them freely available over the internet and enables search using the BLAST algorithm. Although Ensembl is freely available over the internet, several life sciences organizations are not able to use it because of security concerns. Ensembl does not currently offer adequate security for search operations, so there are concerns that competitors could eavesdrop on the sequence searches being performed by an organization’s scientists and use them to infer confidential and proprietary information. Another challenge with the use of Ensembl is the lack of SLAs around performance and support.
The response times are not predictable and depend on the number of users currently performing searches and the complexity of the search operations they are performing. This can waste scientists’ time. As a result, most life sciences organizations resort to hosting the Ensembl applications and datasets in-house, which increases costs because each organization has to invest separately in the infrastructure and people needed to keep it operational. This problem is not limited to Ensembl: several other popular life sciences applications and datasets are available in the public domain but are hosted and managed internally by most organizations, resulting in redundancies. Pistoia Alliance members therefore wanted a secure shared platform that offers several such applications used in the pre-competitive activities of drug discovery on demand with predictable SLAs. They wanted to evaluate public cloud platforms for this purpose, with Ensembl as the pilot application to be hosted and made available on demand with adequate security. Infosys is one of the IT vendors that were invited to implement a proof of concept for a secured sequence search solution. This paper and the one we published earlier are based on our experiences in implementing the proof of concept.
Our earlier paper explained how to implement a Secure Next Generation Sequence Services business cloud platform that is highly scalable and can be shared securely by multiple life sciences companies. It described a few techniques for securing web applications and data hosted on a public cloud such as Amazon AWS using open source security components, and how they were applied to secure the Ensembl solution. In this paper we expand on the earlier paper and describe a few additional security solutions and best practices that can be used in offering a secure life sciences business cloud platform.
Background and related works
Life sciences organizations have been forming several alliances and consortiums to enable collaboration in the pre-competitive areas of drug discovery. The Pistoia Alliance (http://www.pistoiaalliance.org), Open Source Drug Discovery (http://www.osdd.net), the European Bioinformatics Institute (EBI) Industry Program, the Predictive Safety Testing Consortium, Sage Bionetworks and the Innovative Medicines Initiative (http://www.imi.europa.eu) are examples of such alliances. In the paper "Implementation of a Scalable Next Generation Sequencing Business Cloud Platform", the authors describe how to address the scalability of a Next Generation Sequencing solution and a strategy to port a pre-configured sequence search application such as BLAST onto a scalable storage and processing framework like Hadoop to address scalability and performance concerns. Our previous paper extended that work and focused on several security aspects of such a business cloud architecture. It gave an overview of the Amazon AWS Cloud, of what cloud security is about and of a few open source security products such as OpenAM and Truecrypt. It then described how to implement firewalls in the AWS cloud, techniques to address the security of data at rest and in transit, and how to implement Federated Identity and form Circle-Of-Trusts to enable collaboration using OpenAM and WS-Federation standards. This paper is an extension of the previous papers [5, 7], so the authors assume that the reader has gone through them to understand the context, the solution overview and the solution components used.
Describe how to secure a Hadoop Cluster to prevent impersonation and unauthorized access.
Provide controlled access to data residing in Hadoop Cluster through Access Controlled Lists (ACLs).
Demonstrate how to implement a virtual private cloud with secured access to the machines on a public cloud through virtual private networks (VPNs).
Introduction to Hadoop and security concerns
It is free and open source.
Hadoop Distributed File System (HDFS) offers a highly scalable storage solution that has been proven to scale to petabytes of data, so it can be used to store the genome data.
Besides being scalable, HDFS also offers fault tolerance by replicating data across multiple machines, thereby addressing availability and reliability concerns.
The Hadoop MapReduce framework offers a highly scalable distributed processing solution that uses several commodity servers to parallelize processing. It therefore offers a good way to parallelize BLAST search.
Another paper  describes how BLAST search has been parallelized using Hadoop.
Impersonation: Hadoop does not have any built-in authentication mechanism of its own. Hence a malicious user can easily impersonate the superuser or any valid user of the Hadoop Cluster and can access the HDFS cluster from any machine.
Default permissions in HDFS file system: The default permissions on the HDFS file system are -rw-r--r-- for files and drwxr-xr-x for directories. This gives users sufficient privileges to view other users’ files and directories. In some cases, such as an infrastructure shared between competitors, this may not be desired.
Direct Access to Data Blocks: DataNodes do not enforce any access control on access made to the data blocks they are storing by default. This allows an unauthorized user to read a data block by supplying the blockid. This also allows an unauthorized user to write arbitrary data to data blocks on the DataNodes.
Introduction to OpenVPN
OpenVPN is an open source solution for implementing a virtual private network (VPN) that creates secure point-to-point connections in routed or bridged configurations for secured access to remote machines. It uses a custom security protocol that employs the SSL/TLS protocol suite for key exchange and is capable of traversing firewalls and network address translators. It supports authentication among peers by means of a pre-shared secret key, certificates, or user ID/password. When used in a multi-client server scenario, it allows the server to issue an authentication certificate for each client. It uses the OpenSSL encryption library and the SSLv3/TLSv1 protocol suite to provide multiple security and control features.
Introduction to Access Control List (ACL)-based security model
In an ACL-based security model, ACLs are defined and enforced to control the access of subjects to objects. The term subject can refer to a real user, system process or daemon, while the term object can refer to the resource(s) that a subject tries to access, such as files, directories or system ports. When a subject requests an operation on an object, the Access Control Manager (ACM) checks the ACL data for a matching ACL to decide whether the requested operation is permitted.
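The check performed by the ACM can be illustrated with a minimal sketch. The class names, the example subjects and the object paths below are hypothetical and chosen only for illustration; the sketch assumes a default-deny policy in which an operation is permitted only when an exactly matching entry exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    subject: str    # real user, system process or daemon
    obj: str        # resource such as a file, directory or port
    operation: str  # e.g. "read" or "write"


class AccessControlManager:
    """Hypothetical ACM: permits an operation only if a matching ACL entry exists."""

    def __init__(self, entries):
        self._entries = set(entries)

    def is_permitted(self, subject, obj, operation):
        # Default deny: a request without a matching entry is refused.
        return AclEntry(subject, obj, operation) in self._entries


acm = AccessControlManager([
    AclEntry("alice", "/genomes/public", "read"),
    AclEntry("alice", "/genomes/org_a", "read"),
    AclEntry("blast-daemon", "/genomes/public", "read"),
])

print(acm.is_permitted("alice", "/genomes/org_a", "read"))  # True
print(acm.is_permitted("alice", "/genomes/org_b", "read"))  # False
```

A production ACM would typically also support groups, wildcards and inheritance; those are left out here to keep the matching step visible.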
Given this background of Amazon AWS Cloud, OpenAM, Truecrypt, Hadoop Security, ACL-based security Model and OpenVPN we describe the security techniques in the next section.
In this section we describe how to implement a secure collaboration platform using a public cloud (Amazon AWS Cloud) and then describe how to migrate a popular genome sequence search application called Basic Local Alignment Search Tool (BLAST) and provide web based secure access to collaborating groups of life sciences organizations.
Next generation secure sequence search business cloud platform – solution design
The solution consists of three layers. At the bottom is the Cloud infrastructure layer, which provides scalable compute and storage capabilities. Above it is a Life Sciences Services platform, which provides domain-specific services such as sequence search and distributed annotations management through standards-based APIs. The top layer consists of domain-specific applications supporting business processes such as target management, genomics, clinical trial management and drug sales in business areas such as drug discovery, development and marketing.
A prototype of a scalable, elastic, pay-per-use genome sequence search platform, based on the BLAST algorithm over public genome datasets and with a web-based genome browser application, was developed on Amazon public cloud infrastructure to validate the concept. Hadoop was chosen as the scalable processing framework and BLAST processing was parallelized, as described in our earlier paper. Ensembl is a popular genome sequence search application and it was used for the prototype. Our earlier paper describes several security concerns with using public cloud infrastructure and some of the techniques we used to address those concerns.
The figure below shows various security layers of the “Next Generation Secure Sequencing Business Cloud Platform” solution:
The components in red color have been described in our previous paper . The components in blue color are the additions that will be described below.
Virtual Private Network (VPN): A Virtual Private Network provides secure access to instances on the Amazon AWS Cloud, especially for scenarios where they are accessed through mechanisms that are not based on HTTP. Secured direct access may be needed for desktop-based applications trying to access the services or data residing on the Hadoop Cluster in the Amazon Cloud. Secured access to the Hadoop Cluster may also be needed while uploading private data to it, and secured access to Amazon instances may be needed for maintenance, troubleshooting or upgrades. This helps in achieving security of data in transit.
Secured Hadoop Cluster: The Hadoop Cluster is used to store the genome data, so there is a need to protect the data from unauthorized access by malicious users. This is required to take care of Hadoop-related security issues such as impersonation (section II.A.1), unrestrictive file permissions on HDFS (section II.A.2) and direct access to data blocks (section II.A.3).
Access Control Lists (ACLs): ACLs add another level of protection for data at rest against unauthorized access (section II.C).
The subsequent sections explain the deployment architecture and implementation of various components described above.
Next generation secure sequence search business cloud platform – deployment architecture
Each Amazon Elastic Compute Cloud (EC2) node provisioned has the equivalent of 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage and a 64-bit platform.
Web Clusters: The VM has Apache web server and is load-balanced through the Amazon ELB.
Federation Server: This node runs OpenAM server deployed over Apache Tomcat. This machine additionally runs an OpenDS LDAP server which serves as OpenAM user store.
Hadoop Cluster: The cluster contains a master node and two datanodes.
Securing hadoop cluster
In order to address security issues with Hadoop default settings described earlier, the following security measures were implemented.
Prior to version 0.20, Hadoop had no built-in authentication mechanism. It trusted the underlying operating system for user authentication and used the `whoami` utility to identify users and `bash -c groups` to identify groups. This was the weakest link with respect to security, as anyone could write their own whoami or groups script and place it on their path to impersonate someone else, including the superuser. This is a major threat. From version 0.20 onwards, Hadoop supports the Kerberos protocol for authentication as a security measure against impersonation. The complete step-by-step procedure for securing a Hadoop Cluster is described in the Cloudera Security Guide.
One can run a Kerberos KDC and a realm local to the cluster, create all service principals in this realm and then set up one-way cross-realm trust from this realm to the Active Directory realm. If this approach is used, there is no need to create any additional service principals in Active Directory and Active Directory principals (users) can be authenticated against Hadoop.
Alternatively, one can use Active Directory as the Kerberos KDC, create all service principals in the Active Directory itself and configure Hadoop/Linux machines as Kerberos clients to use Active Directory directly.
Cloudera provides exhaustive documentation for both approaches.
It is easy to configure.
There is no overhead of maintaining master and slaves KDC (as compared to approach 1).
Maintenance of user accounts becomes much easier.
User tries to authenticate to the machine by supplying his/her credentials.
The machine contacts the Kerberos KDC (in our case the Active Directory server of the organization) for verification of credentials.
The KDC verifies the credentials and issues a Kerberos ticket to the user.
With the Kerberos ticket granted by the Kerberos KDC, the user can access Hadoop services for as long as the ticket remains valid.
For restrictive permissions on HDFS file system
Properties to restrict permissions on HDFS filesystem
Value to be set: <depends on requirements>
Description: umask value for HDFS file system
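The effect of a umask value on newly created files and directories can be worked through with a short sketch. It assumes POSIX-style masking, in which the umask bits are cleared from a base mode of 666 for files and 777 for directories; under that assumption a umask of 022 reproduces the default permissions quoted earlier (-rw-r--r-- and drwxr-xr-x), while a stricter value such as 077 removes all access for group and others. The function names are hypothetical.

```python
import stat


def effective_mode(base_mode, umask):
    """Clear the bits set in the umask from the base mode."""
    return base_mode & ~umask


def describe(umask):
    """Return the permission strings for a new file and a new directory."""
    file_mode = effective_mode(0o666, umask)  # assumed base mode for files
    dir_mode = effective_mode(0o777, umask)   # assumed base mode for directories
    return (stat.filemode(stat.S_IFREG | file_mode),
            stat.filemode(stat.S_IFDIR | dir_mode))


print(describe(0o022))  # ('-rw-r--r--', 'drwxr-xr-x')
print(describe(0o077))  # ('-rw-------', 'drwx------')
```

On a platform shared between competitors, a value like 077 would keep each user's files private by default, at the cost of having to grant any sharing explicitly.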
For controlling direct access to data blocks
Properties to control direct access to data blocks
Description: If "true", access tokens are used as capabilities for accessing datanodes. If "false", no access tokens are checked on accessing datanodes.
Value to be set: Depends upon your requirement, default is 600
Description: Interval in minutes at which namenode updates its access keys.
Value to be set: Depends upon your requirement, default is 600
Description: The lifetime of access tokens in minutes.
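The idea of an access token used as a capability can be sketched as follows. This is a toy model and not Hadoop's actual token format: the field names, the use of HMAC-SHA256 and the function names are assumptions made for illustration. A namenode-side function signs the user, block, access mode and expiry with a key shared with the datanodes, and a datanode-side function serves a request only if the signature verifies, the token matches the requested block and mode, and the token has not expired. The 600-minute lifetime mirrors the default quoted above.

```python
import hashlib
import hmac

TOKEN_LIFETIME_MIN = 600  # default token lifetime in minutes, as quoted above


def _sign(key, user, block_id, mode, expiry):
    payload = f"{user}|{block_id}|{mode}|{expiry}".encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def issue_token(key, user, block_id, mode, now_min, lifetime_min=TOKEN_LIFETIME_MIN):
    """Namenode side (hypothetical): bind user, block, mode and expiry under the shared key."""
    expiry = now_min + lifetime_min
    return {"user": user, "block_id": block_id, "mode": mode,
            "expiry": expiry, "tag": _sign(key, user, block_id, mode, expiry)}


def datanode_allows(key, token, block_id, mode, now_min):
    """Datanode side (hypothetical): verify the signature, then scope and expiry."""
    expected = _sign(key, token["user"], token["block_id"],
                     token["mode"], token["expiry"])
    return (hmac.compare_digest(expected, token["tag"])
            and token["block_id"] == block_id
            and token["mode"] == mode
            and now_min < token["expiry"])


shared_key = b"example-shared-key"  # illustrative only
token = issue_token(shared_key, "alice", "blk_1", "READ", now_min=0)
print(datanode_allows(shared_key, token, "blk_1", "READ", now_min=10))   # True
print(datanode_allows(shared_key, token, "blk_2", "READ", now_min=10))   # False
print(datanode_allows(shared_key, token, "blk_1", "READ", now_min=600))  # False
```

Rotating the shared key at a fixed interval, as the key update property does, limits how long a leaked key remains useful.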
Implementation of a Virtual Private Cloud using OpenVPN
The user authenticates to the VPN access gateway using web interface or OpenVPN Connect Client.
The request reaches the OpenVPN Access Server.
The OpenVPN server queries the authentication source (ADFS Server).
On successful authentication, a VPN tunnel is created between the infrastructure on Amazon Cloud and the User’s machine thus creating a Virtual Private Cloud.
Pros and cons of implementations of VPN over various topologies
OSI Layer 2
Pros:
· Most appropriate for smaller networks.
· Easy to configure.
· VPN clients receive their network properties from the same DHCP server as machines that are physically connected to the server-side LAN.
· Works well with application-layer protocols that depend on LAN broadcast resolution.
· Can tunnel non-IP protocols.
Cons:
· Because LAN broadcasts are propagated to all VPN clients, this topology doesn't scale well to LANs that have a larger amount of broadcast traffic.
· Doesn't scale well with larger numbers of concurrent VPN clients.
· Can only be used when the Access Server is connected to a LAN that provides DHCP services.
· Should only be used when the Access Server has a fixed IP address on a private LAN.
· Currently only works with Windows Clients.
OSI Layer 3
Pros:
· More efficient and scalable.
· Greater control over IP and routing configuration.
· Better fine-grained access control.
· Works on all client platforms that support OpenVPN.
Cons:
· More complex to configure.
· Doesn't work well with application-layer protocols that depend on broadcast resolution.
It is more efficient and scalable
Provides better control over IP and routing configuration
Provides more granular control
Supported on all platforms such as Windows, Mac and Linux.
Implementation of ACLs-based Security Model
Subject: The term subject refers to a real user, system process or daemon.
Object: The term object in an ACL based security model refers to the resource(s) that a subject tries to access.
ACL Server: This is the server which contains the list of permissions that controls the access of a subject over an object.
The purpose of using this model in our PoC was to ensure that a user is able to access only the data that he/she is authorized to access. A user is thus limited to the public data and to only that part of his/her organization’s private data that he/she is authorized to access.
The enforcement of ACLs was de-centralized so that there are multiple checks across various layers.
The decision logic of ACLs was centralized: The ACL data resided on a centralized ACL server that determines whether the user is authorized to access the selected data or not.
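The combination of de-centralized enforcement and a centralized decision can be sketched as below. All names (AclServer, the store names, the two layer functions) are hypothetical, and the BLAST job itself is replaced by a placeholder. The point of the sketch is that both the UI layer and the processing layer ask the same central component, so a request that bypasses the UI is still checked.

```python
PUBLIC_STORES = {"ensembl_human", "ensembl_mouse"}  # illustrative store names


class AclServer:
    """Hypothetical central ACL server: the single place where decisions are made."""

    def __init__(self, private_grants):
        self._private_grants = private_grants  # user -> set of private stores

    def is_authorized(self, user, store):
        if store in PUBLIC_STORES:
            return True
        return store in self._private_grants.get(user, set())


def ui_selectable_stores(acl, user, all_stores):
    """UI layer: offer only the stores the user may search."""
    return [s for s in all_stores if acl.is_authorized(user, s)]


def run_search(acl, user, query, stores):
    """Processing layer: re-check every store, even if the UI was bypassed."""
    denied = [s for s in stores if not acl.is_authorized(user, s)]
    if denied:
        raise PermissionError(f"{user} may not search: {denied}")
    return [(s, query) for s in stores]  # placeholder for the real BLAST job


acl = AclServer({"alice": {"org_a_private"}})
stores = ["ensembl_human", "org_a_private", "org_b_private"]
print(ui_selectable_stores(acl, "alice", stores))
# ['ensembl_human', 'org_a_private']
```

Calling run_search directly with "org_b_private" for this user raises PermissionError, which is the behavior the second enforcement point is meant to guarantee.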
Enable the user to input a genome sequence and select public data stores, and only those company-specific private data stores that he/she is authorized to view, in order to find matching genomes.
Present matching genome sequences with ranking that is combined across the public and private data stores
Enable providing company specific aliases to genome IDs so that it is easier to cross-link the genome annotations and information available in public domain with the confidential information available for the same genome but with a company specific internal ID.
Access control enforcement was first done at the UI layer by enabling selection of only relevant genome data stores.
Next, there was again access control enforcement in the processing layer during the execution of the BLAST search to restrict search to only those data stores that the user is authorized to view. This is to ensure that if the processing layer is reached through any other channel, access controls are still enforced.
An aliasing service was implemented that maintained a map between public gene IDs from the National Center for Biotechnology Information (NCBI) and those internal to each organization. The organization-specific aliases that the user is entitled to see are shown in the search result along with the NCBI IDs for the public genome data, as shown in the screenshot below (Figure 6).
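A minimal sketch of combining results across data stores and attaching organization-specific aliases is given below. The gene IDs, alias values, scores and the function name are all made up for illustration, and the sketch assumes that scores from different data stores are directly comparable, which may not hold without normalization.

```python
def merge_hits(hits_by_store, aliases, top_n=5):
    """Combine per-store hits into one ranking and attach organization aliases."""
    merged = []
    for store, hits in hits_by_store.items():
        for gene_id, score in hits:
            merged.append({
                "store": store,
                "gene_id": gene_id,
                "alias": aliases.get(gene_id),  # None when no alias is defined
                "score": score,
            })
    # Highest score first, regardless of which store the hit came from.
    merged.sort(key=lambda hit: hit["score"], reverse=True)
    return merged[:top_n]


hits_by_store = {
    "public": [("GENE_1", 210.0), ("GENE_2", 95.5)],
    "org_a_private": [("GENE_3", 150.0)],
}
aliases = {"GENE_1": "ORGA-0042"}  # public ID -> hypothetical internal ID

for hit in merge_hits(hits_by_store, aliases):
    print(hit["gene_id"], hit["alias"], hit["store"], hit["score"])
```

In a multi-tenant deployment the alias map passed in would be the one belonging to the requesting user's organization, so that aliases from other organizations never appear in the result.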
The experience report along with the earlier reports [5, 7] described how to use open source tools and solutions to create a secure drug discovery collaboration platform in a public cloud which provides several features like protection against possible Denial of Service attacks, security of data in transit, security of data at rest, implementation of Federated Identity and creation of secure Circle-Of-Trusts for collaboration, parallelization of processing using Hadoop and securing data stored in Hadoop, enabling secured access to Amazon AWS instances through VPNs, correlating public and private data sets and providing access controls. We believe our work can be leveraged by practitioners and researchers in life sciences domain who plan to use public clouds.
Our study addresses a few security aspects through PoCs. These are first steps in building a foundational business cloud platform for collaborative drug discovery. In this experience report, we have described solutions for securing data stored in Hadoop, enabling secure access to genome data and services through non-HTTP channels using VPNs, and securing private genome data through application-specific access control components. In future we look forward to expanding this work to address other public clouds and more cloud security vulnerabilities, to enable more drug discovery related applications on the foundational platform, and to extend the platform with additional layers of capabilities such as high performance computing, collaborative workflows and social collaboration workspaces, while addressing the security aspects of those components.
Shyam Kumar Doddavula
Shyam works as a Principal Technology Architect and heads the Cloud Centre of Excellence at Infosys Ltd.
Akansha is a Technology Lead at Cloud Centre of Excellence at Infosys. She has around 5.5 years of experience in Java, Spring, Hibernate, Cloud Computing and Hadoop.
Vikas is a Systems Engineer at Cloud Centre of Excellence at Infosys. He has around 2.5 years of experience in Web Security, Application Security, Network Security, Cloud Computing, Hadoop and Cloud Security.
BLAST: Basic Local Alignment Search Tool
ACL: Access Control List
VPN: Virtual Private Network
AWS: Amazon Web Services
NCBI: National Center for Biotechnology Information
PoC: Proof of Concept.
Authors would like to thank the Pistoia Alliance Sequence Service team members – Simon, Claude, Ralf, Cary, John, and Nick for their help while defining the solution. Authors also wish to thank the Infosys sponsors and project team members – Arun, Subhro, Rajiv, Shobha, Kirti, Krutin, Ankit, and Nandhini.
- Barnes MR, et al.: Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery; http://www.nature.com/nrd/journal/v8/n9/abs/nrd2944.html
- Simon T, Cary OD, Claus Stie K, John W: The Pistoia Alliance. The Sequence Service Project. G.I.T. Laboratory Journal, Trends in Drug Discovery Business 2011, 1–3.
- Ensembl Genome Browser; http://ensembl.org/index.html
- Basic Local Alignment Search Tool (NCBI); http://blast.ncbi.nlm.nih.gov/Blast.cgi
- Doddavula SK, Saxena V: Implementation of a Secure Genome Sequence Search Platform on Public Cloud: Leveraging Open Source Solutions. 2011 IEEE Third International Conference on Cloud Computing Technology and Science (IEEE CloudCom 2011).
- Cloudera Security Guide; https://ccp.cloudera.com/display/CDHDOC/CDH3+Security+Guide
- Doddavula SK, Rani M, Sarkar S, Vachhani HR, Jain A, Kaushik M, Ghosh A: Implementation of a Scalable Next Generation Sequencing Business Cloud Platform - An Experience Report. In: IEEE CLOUD. IEEE; 2011, S598-S605.
- Integrating Hadoop Security with Active Directory; https://ccp.cloudera.com/display/CDHDOC/Integrating+Hadoop+Security+with+Active+Directory
- Active Directory Winbind Howto; https://help.ubuntu.com/community/ActiveDirectoryWinbindHowto
- Linux-AD Integration with Windows Server 2008; https://ccp.cloudera.com/display/CDHDOC/Integrating+Hadoop+Security+with+Active+Directory
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.