Tuesday, November 17, 2009

ACI Chooser: what a terrible name

We met today to decide how to put together the place where we gather information from the contract holder so that we can configure their space. I think that's what Bryan means by "Chooser" (ugh). The inelegance of tech speak is legendary, but this one is a real winner. The restricted contracting portal will already provide us with a great deal of information about the principal investigator (PI) that we can "populate" the forms with (another elegant use of a term), but we still need to confirm the identity of the PI and his/her research team. We will do that with an entry screen that confirms identities through a MyData registration for the PI and a list of research team members and their emails.

We then need to identify their choice of stat package and OS. (The ongoing argument between Steve Burling, our 70% man, and Bryan is that almost 90% of the folks coming through this will choose Windows as their OS, which makes both the OS choice and Burling an anachronism.) We will also bring in information from the RCS to determine the structure of the security for the data system the PI and team are accessing.

The things left to research are (1) how to bring users through to a firewalled entry to the site so that they do not have to find and identify their IP addresses; (2) whether and how the PI would set up read/write permissions within the workspace for their affiliated researchers; (3) whether to allow content to be uploaded to the ACI without scrutiny. This research belongs to Steve.
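
To make the entry screen a bit more concrete, here is a rough Python sketch of the record the Chooser might collect, plus a basic sanity check on it. Everything in it is an assumption for illustration: the `ChooserRequest` and `validate_request` names, the field names, and the option lists (only Windows, Linux, and SAS have actually come up in our discussions). Real identity confirmation would go through MyData, which this does not attempt to model.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical option lists; the real menus are still to be decided.
ALLOWED_OS = {"windows", "linux"}
ALLOWED_STAT_PACKAGES = {"sas", "stata", "spss", "r"}


@dataclass
class ChooserRequest:
    """What the entry screen gathers from the contract holder (assumed fields)."""
    pi_email: str
    team_emails: List[str]
    operating_system: str
    stat_package: str


def validate_request(req: ChooserRequest) -> List[str]:
    """Return a list of problems found; an empty list means the request looks sane."""
    problems = []
    # Very crude email check, only meant to catch obviously broken entries.
    if "@" not in req.pi_email:
        problems.append("PI email looks invalid: " + req.pi_email)
    for email in req.team_emails:
        if "@" not in email:
            problems.append("Team member email looks invalid: " + email)
    if req.operating_system.lower() not in ALLOWED_OS:
        problems.append("Unsupported OS: " + req.operating_system)
    if req.stat_package.lower() not in ALLOWED_STAT_PACKAGES:
        problems.append("Unsupported stat package: " + req.stat_package)
    return problems
```

Returning a list of problems rather than raising on the first one would let the form show the researcher everything that needs fixing in a single pass.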

Thursday, November 5, 2009

Project Deliverables - Software

So, what do we need to build to start our science experiment of building a secure data enclave in a public computing utility cloud?

Felicia and I have been talking about the list of software deliverables for the project. These flow from the high-level architecture diagram from my earlier post; if the high-level diagram is at the 10,000 foot level, then these items are at the 1,000 foot level. (And the Cloud Developer we are recruiting will take them down to the 1 foot level.)

I think we'll need six systems:
  1. ACI Chooser - This is a public-facing webapp where the researcher selects options to configure the ACI. This might include the desired platform (Windows), desired analytic software (SAS), the allowable users (say a subset of people on the restricted-use contract), a zipped bundle of miscellaneous tools the researcher wants pre-loaded on the ACI, etc.
  2. ACI Pre-Launcher - This is an ICPSR-facing (command-line?) utility for taking configuration information supplied by the researcher, plus restrictions associated with the dataset(s), and launching a customized ACI in the cloud for the researcher.
  3. ACI Post-Launcher - This is also an ICPSR-facing utility for customizing the ACI, but performs post-launch configuration of the instance. We may decide to couple #2 and #3 into a single tool.
  4. ACI Watcher - This is a tool that monitors the availability and performance of each ACI. If the ACI is unavailable or sluggish, this will tell us.
  5. ACI Dashboard - This is a tool that aggregates views of all ACIs, giving an overall view of the cloud provider. Perhaps this would be public-facing if properly anonymized?
  6. ACI Waste Manager - This tool securely and completely cleans up an ACI once the research has been completed. This tool will have done its job if there is no trace of the research or the data left in the cloud once the ACI has been terminated.
There are undoubtedly other items we'll also need, but this is a good starter list.
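
As one small illustration, the core of the ACI Watcher (item 4) could be as simple as timing a probe and sorting the result into "ok", "sluggish", or "unavailable". This is only a sketch in Python under assumptions of mine: the `check_aci` name, the two-second threshold, and the idea that a probe is any callable that raises an exception when the ACI cannot be reached are all made up for the example.

```python
import time
from typing import Callable

# Assumed threshold; what counts as "sluggish" is still to be decided.
SLUGGISH_SECONDS = 2.0


def check_aci(probe: Callable[[], None],
              sluggish_after: float = SLUGGISH_SECONDS,
              clock: Callable[[], float] = time.monotonic) -> str:
    """Run one probe against an ACI and classify the outcome.

    `probe` is any callable that raises an exception if the ACI
    cannot be reached. `clock` is injectable so the function can be
    tested without waiting in real time.
    """
    start = clock()
    try:
        probe()
    except Exception:
        # Any failure to complete the probe counts as unavailable.
        return "unavailable"
    elapsed = clock() - start
    return "sluggish" if elapsed > sluggish_after else "ok"
```

A real probe might open a connection to the ACI's remote-access port; the Dashboard (item 5) could then aggregate these per-ACI statuses.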

Monday, November 2, 2009

High-Level System Architecture

[ click the image at the left to navigate to a larger version ]

Our high-level architecture for our Enclave in the Cloud starts with a researcher who is interested in using restricted-access data.

Step #1: The researcher uses ICPSR's contracting portal (which is nearing completion) to submit a request for access. This portal pulls together information from the researcher (who will be using the data, what's the research plan, institutional approval) and information from the dataset (licensing terms, data protection requirements).

Step #2: ICPSR reviews the application, and if everything is in order, approves access to the data.

Step #3: The researcher uses a (yet to be built) portal to configure choices about access: platform (Linux or Windows), required statistical software, etc. This portal also pulls in requirements from the contracting system which may influence available options.
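
One way to picture how contract requirements might influence the available options is as a simple intersection: start from everything the portal could offer, then keep only what the contract permits. The Python sketch below assumes a data shape of my own invention (option names mapped to sets of values) and a hypothetical `available_options` function; the real contracting system may express restrictions quite differently.

```python
from typing import Dict, Set


def available_options(all_options: Dict[str, Set[str]],
                      contract_restrictions: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Narrow the portal's options to those the contract permits.

    `contract_restrictions` maps an option name to the set of values
    the contract allows. An option name that is absent from the
    mapping is treated as unrestricted (an assumption of this sketch).
    """
    result = {}
    for name, values in all_options.items():
        permitted = contract_restrictions.get(name)
        if permitted is None:
            result[name] = set(values)
        else:
            result[name] = set(values) & set(permitted)
    return result


# Example values are illustrative only.
everything = {"platform": {"linux", "windows"}, "stat_package": {"sas", "stata"}}
windows_only = {"platform": {"windows"}}
choices = available_options(everything, windows_only)
```

Here `choices["platform"]` would contain only `"windows"`, while the stat package choices are left untouched because the contract says nothing about them.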

Step #4: ICPSR uses this configuration as a template for a (yet to be built) utility that launches a virtual machine in the cloud. This system, an Analytic Computing Instance (ACI), contains all of the data and software that the researcher or research team will need, and is protected by firewalls and host-level security to prevent unauthorized access.

Step #5: The researchers download a copy of the Citrix client (if the ACI platform is Windows). This is the tool they will (likely) need to use to log in, and which can restrict functions such as cut and paste between the ACI and the local desktop. We'd like to make this download and install as easy as downloading Acrobat Reader.

Step #6: Research happens, and while it is happening....

Step #7: ICPSR monitors both the cloud provider and the ACI for performance and security. Some of the tools we'll use for this already exist because ICPSR uses Amazon's cloud for several extant portals and systems.

Step #8: The research has concluded and ICPSR destroys the ACI in a secure manner such that no trace of the research or the data lingers in the cloud.
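
The eight steps above amount to a lifecycle for each request and its ACI, which could be tracked as a small state machine. The Python sketch below is my own simplification, not a design we have settled on: the state names are invented, the client download in Step #5 is left out because it happens on the researcher's side, monitoring in Step #7 is treated as something that happens while the ACI is in use, and a rejected application is not modeled.

```python
from enum import Enum


class AciState(Enum):
    REQUESTED = "requested"    # Step 1: request submitted via the contracting portal
    APPROVED = "approved"      # Step 2: ICPSR approves access
    CONFIGURED = "configured"  # Step 3: researcher picks platform, software, etc.
    LAUNCHED = "launched"      # Step 4: virtual machine started in the cloud
    IN_USE = "in_use"          # Steps 6 and 7: research under way, ICPSR monitoring
    DESTROYED = "destroyed"    # Step 8: ACI securely destroyed


# Each state may only move forward to the next one in the workflow.
TRANSITIONS = {
    AciState.REQUESTED: {AciState.APPROVED},
    AciState.APPROVED: {AciState.CONFIGURED},
    AciState.CONFIGURED: {AciState.LAUNCHED},
    AciState.LAUNCHED: {AciState.IN_USE},
    AciState.IN_USE: {AciState.DESTROYED},
    AciState.DESTROYED: set(),
}


def advance(current: AciState, target: AciState) -> AciState:
    """Move to `target` if the workflow allows it, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(
            "cannot go from " + current.value + " to " + target.value
        )
    return target
```

Keeping the allowed transitions in one table would make it hard to, say, launch an ACI for a request that was never approved.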