Thursday, December 3, 2009

Clarifying the Cloud

In vacation weeks and those that follow, the group working on Data in the Cloud does not usually meet. But we are always working on something.

A couple of things of interest for this project are (1) what other services at ICPSR we have put in the cloud and (2) what resources Steve Burling is using to build the applications and images for this project. (Steve is our 70% man. We call him that because 70% of his time is dedicated to the project; it isn't meant to imply that he is only 70% of a man ... I think.)

First, a link to Bryan's ICPSR Tech blog about the other ICPSR services running in the cloud. Our ongoing experience with these services will be useful primarily for monitoring service performance. Bryan will keep us informed about problems or triumphs so we can evaluate other sources of information that will inform what we do. Synergy is fantastic, don't you think?

Second, Steve reports that these are the specific resources he has been using to build our ACI (remember, this is our fancy acronym for the customized analytic space we are creating for users) in the cloud. The primary source is Amazon's documentation on the AWS service. In particular, he has used the Getting Started Guide, the Developer Guide, and the Command Line Reference.

Tuesday, November 17, 2009

ACI Chooser --- what a terrible name

We met today to decide how to put together the place where we gather information from the contract holder so that we can configure their space. I think that's what Bryan means by "Chooser" (ugh). The inelegance of tech speak is legendary, but this one is a real winner. The restricted contracting portal will already provide us with a great deal of information about the principal investigator that we can use to "populate" (another elegant use of a term) the forms, but we need to confirm the identity of the principal investigator and his/her research team. We will do that with an entry screen that confirms the identities through a MyData registration for the PI and a list of research team members and their emails. We then need to identify their choice of stat package and OS. (The ongoing argument between Steve Burling, our 70% man, and Bryan is that almost 90% of the folks coming through this will choose Windows as an OS, which makes both the OS choice and Burling an anachronism.) We will also bring in information from the RCS to determine the structure of the security for the data system the PI and team are accessing.

The things left to research are (1) how to bring users through to a firewalled entry to the site so that they do not have to find and identify their IP addresses; (2) whether and how the PI would set up read/write permissions within the workspace for their affiliated researchers; (3) whether to allow content to be uploaded to the ACI without scrutiny. This research belongs to Steve.
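
To make item (2) a little more concrete, here is a rough sketch of what PI-managed permissions might look like. Nothing like this has been built or decided; the Workspace class, its method names, and the email addresses in the usage note are all made up for illustration. The only idea it captures is that the PI, and only the PI, hands out read or write access to team members.

```python
class Workspace:
    """Illustrative only: a PI-managed table of who may read or write."""

    def __init__(self, pi_email):
        self.pi_email = pi_email
        # Assumption: the PI starts with both permissions, everyone else with none.
        self.permissions = {pi_email: {"read", "write"}}

    def grant(self, granted_by, member_email, permission):
        """Give one permission to one team member; only the PI may do this."""
        if granted_by != self.pi_email:
            raise PermissionError("only the PI may change permissions")
        if permission not in ("read", "write"):
            raise ValueError(f"unknown permission: {permission}")
        self.permissions.setdefault(member_email, set()).add(permission)

    def can(self, email, permission):
        """Report whether this person currently holds this permission."""
        return permission in self.permissions.get(email, set())
```

For example, after ws = Workspace("pi@example.edu") and ws.grant("pi@example.edu", "ra@example.edu", "read"), the research assistant can read but not write.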

Thursday, November 5, 2009

Project Deliverables - Software

So, what do we need to build to start our science experiment of building a secure data enclave in a public computing utility cloud?

Felicia and I have been talking about the list of software deliverables for the project. These flow from the high-level architecture diagram from my earlier post; if the high-level diagram is at the 10,000 foot level, then these items are at the 1,000 foot level. (And the Cloud Developer we are recruiting will take them down to the 1 foot level.)

I think we'll need six systems:
  1. ACI Chooser - This is a public-facing webapp where the researcher selects options to configure the ACI. This might include the desired platform (Windows), desired analytic software (SAS), the allowable users (say a subset of people on the restricted-use contract), a zipped bundle of miscellaneous tools the researcher wants pre-loaded on the ACI, etc.
  2. ACI Pre-Launcher - This is an ICPSR-facing (command-line?) utility for taking configuration information supplied by the researcher + restrictions associated with the dataset(s), and launching a customized ACI in the cloud for the researcher.
  3. ACI Post-Launcher - This is also an ICPSR-facing utility for customizing the ACI, but performs post-launch configuration of the instance. We may decide to couple #2 and #3 into a single tool.
  4. ACI Watcher - This is a tool that monitors the availability and performance of each ACI. If the ACI is unavailable or sluggish, this will tell us.
  5. ACI Dashboard - This is a tool that aggregates views of all ACIs, giving an overall view of the cloud provider. Perhaps this would be public-facing if properly anonymized?
  6. ACI Waste Manager - This tool securely and completely cleans up an ACI once the research has been completed. This tool will have done its job if there is no trace of the research or the data left in the cloud once the ACI has been terminated.
There are undoubtedly other items we'll also need, but this is a good starter list.
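
As a rough illustration of the kind of information the ACI Chooser (#1) would hand to the Pre-Launcher (#2), here is a small sketch. It is not a design: the AciRequest class and its field names are hypothetical, and the software list beyond SAS is a placeholder, since the real options will depend on the contract and on what we are licensed to provide. The one rule it checks is the one mentioned above, that the allowable users must be a subset of the people on the restricted-use contract.

```python
from dataclasses import dataclass

# Hypothetical option lists. Windows, Linux, and SAS come from our notes;
# the other software names are placeholders for illustration only.
ALLOWED_PLATFORMS = {"Windows", "Linux"}
ALLOWED_SOFTWARE = {"SAS", "Stata", "SPSS", "R"}


@dataclass
class AciRequest:
    """What a researcher might submit through the ACI Chooser (illustrative)."""
    platform: str
    software: str
    users: set            # people who should be able to use this ACI
    contract_users: set   # everyone named on the restricted-use contract

    def problems(self):
        """Return a list of human-readable problems; empty means acceptable."""
        found = []
        if self.platform not in ALLOWED_PLATFORMS:
            found.append(f"unknown platform: {self.platform}")
        if self.software not in ALLOWED_SOFTWARE:
            found.append(f"unknown software: {self.software}")
        extra = self.users - self.contract_users
        if extra:
            found.append("not on contract: " + ", ".join(sorted(extra)))
        return found
```

A request would only move on to the launcher when problems() comes back empty; anything else goes back to the researcher to fix.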

Monday, November 2, 2009

High-Level System Architecture


Our high-level architecture for our Enclave in the Cloud starts with a researcher who is interested in using restricted-access data.

Step #1: The researcher uses ICPSR's contracting portal (which is nearing completion) to submit a request for access. This portal pulls together information from the researcher (who will be using the data, what's the research plan, institutional approval) and information from the dataset (licensing terms, data protection requirements).

Step #2: ICPSR reviews the application, and if everything is in order, approves access to the data.

Step #3: The researcher uses a (yet to be built) portal to configure choices about access: platform (Linux or Windows), required statistical software, etc. This portal also pulls in requirements from the contracting system which may influence available options.

Step #4: ICPSR uses this configuration as a template for a (yet to be built) utility that launches a virtual machine in the cloud. This system - an Analytic Computing Instance - contains all of the data and software that the researcher or research team will need, and is protected by firewalls and host-level security to prevent unauthorized access.

Step #5: The researchers download a copy of the Citrix client (if the ACI platform is Windows). This is the tool they will (likely) need to use to log in, and which can restrict functions such as cut and paste between the ACI and the local desktop. We'd like to make this download and install as easy as downloading Acrobat Reader.

Step #6: Research happens, and while it is happening....

Step #7: ICPSR monitors both the cloud provider and the ACI for performance and security. Some of the tools we'll use for this already exist because ICPSR uses Amazon's cloud for several extant portals and systems.

Step #8: The research has concluded and ICPSR destroys the ACI in a secure manner such that no trace of the research or the data lingers in the cloud.
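
One way to think about the eight steps is as a simple one-way lifecycle for each ACI. The sketch below is only that, a sketch: the Stage names and the advance function are invented here, and steps 5 through 7 are lumped into a single "in use" stage for simplicity. The point it illustrates is that an ACI only ever moves forward, and that once it has been destroyed there is nowhere left to go.

```python
from enum import Enum


class Stage(Enum):
    REQUESTED = 1    # Step 1: request submitted through the contracting portal
    APPROVED = 2     # Step 2: ICPSR approves access
    CONFIGURED = 3   # Step 3: researcher picks platform, software, etc.
    LAUNCHED = 4     # Step 4: ACI started in the cloud
    IN_USE = 5       # Steps 5-7: client installed, research under way, monitored
    DESTROYED = 6    # Step 8: ACI securely removed


def advance(stage):
    """Move to the next stage; a destroyed ACI cannot move any further."""
    if stage is Stage.DESTROYED:
        raise ValueError("ACI already destroyed")
    return Stage(stage.value + 1)
```

Keeping the stages strictly ordered like this would make it easy for a monitoring tool to tell at a glance where every ACI stands.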

Monday, October 26, 2009

Things we are thinking about

In the most recent meeting of the "Cloud" team, two issues came up that need further research and that take us into some of the gray areas of cloud computing. The first is software licensing in the cloud. Our goal is to provision the analytic instances with the software of choice (within reason) for analysts. The question is whether ICPSR's licenses apply to "virtual space" if we are in fact renting that space from Amazon. We are pursuing it. The second question, which requires some experimentation, is whether people who want to access the "Cloud" will come in via Windows Remote Desktop or our Citrix server. The tradeoff is security vs ease of use. Remote Desktop is built into Windows, whereas users would need to install some software to get access to the Citrix system. We are going to test this as we move forward. These two issues will be documented by the team once we have a clearer notion of how this will work.

Wednesday, October 21, 2009

First meeting summary and a link

We had our first substantive meeting of the "Cloud" team on 10/13/09 to try to start the engine on this project. I will let the technical people describe how we are to approach the tech side but I think it is useful to summarize the naive aspects of this experiment in order to keep some of the narrative in non-tech speak.

The pieces that need to be constructed for this to work are (1) the web app that gathers data from the user about how they want their analytic instance to look ---i.e. what software, etc. and (2) the "image" ---that is the application that instructs the cloud how to behave. We will develop the UNIX/Windows side of the image independently.

We also need to gather data set and user information from the Restricted Contracting System to both set the security conditions for the data and to pre-populate some of the web forms.

The other issues we discussed were (1) dealing with licensing issues for software that will be used in the analytic instances and (2) how the cloud data will be backed up.

Bryan provided us with an interesting article on cloud security. The great thing about this project is that it provides so many wonderful metaphors (which social science usually does not, frankly). This article is entitled something like "Hey You, Get Off Of My Cloud." If you are old enough, you will know that it is a Rolling Stones song from 1965. So, we now have a theme song for the grant as well. Who can ask for more?

Friday, October 9, 2009

Text of the Challenge Grant

We will have our first design meeting next Tuesday so it makes sense to post the text of the proposal as a shared document. This version contains the narrative for the grant. Two useful components of the document will guide the development. The first is Bryan's stick figure conceptualization of how we will put data in the cloud. The "Analytical Computing Instance" is a compromise name for what the user will configure and see in the cloud. The "instance" is terminology used to describe how clouds are used but seems a strange name. It may change as we get further along.

The second part of the document that will guide our work is, of course, the schedule. The first part of the grant period is for setting up design specifications. The two components to be built are the ACI compiler and the web user interface.

Thursday, October 1, 2009

Putting Confidential Data in the Clouds

This blog is designed to chronicle progress on a new project at the Inter-University Consortium for Political and Social Research funded by the National Institutes of Health, Office of the Director through the Challenge Grant Program. The primary goal of the grant is to test whether confidential data, that is, data distributed under license or contract, can be effectively and safely disseminated via the computing cloud. Currently, data licenses and contracts put the burden of securing data files on the user. This often involves elaborate data security plans that may require purchasing new technology or securing existing networks and machinery. This grant is to test whether we can dynamically configure temporary computing environments in the Cloud that will provide users with a secure environment in which to analyze confidential data. We will be building both the application that provisions this analytic instance and the web interface to help users navigate it. The experimental part of this project is to test cloud security and analysts' reactions to more distant analytic environments where they have less control. We have partners in this endeavor to help us recruit users to test the applications we build. They are at the Panel Study of Income Dynamics and the Los Angeles Family and Neighborhood Survey.

This is the first day of the project!