During my time as a graduate researcher in Dr. Eric Sorin’s computational biophysics lab at Cal State Long Beach, I had the opportunity to work with the Folding@Home (F@H) distributed computing network. We used this platform to run data-intensive protein folding simulations that might otherwise take decades on a standard desktop computer.
The output of a time-stepped protein folding simulation is a series of files, one per time step, each recording the position of every atom of the protein in three-dimensional space. Nodes in the F@H network perform the computation needed to advance the simulation in time. Once the file for a time step has been generated, it is stored on our local lab nodes.
As you might imagine, simulating the movements of millions upon millions of interacting atoms produces very large datasets (> 20 GB per simulation in our case). Furthermore, the raw output data itself is rarely used during post-simulation analysis. Depending on the problem, aggregate or derived values (e.g. root-mean-square deviation, radius of gyration, etc.) are more useful when studying, say, drug binding. Thus, simulation data must be processed and stored as it arrives at our lab nodes from the global network.
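To make the aggregate values above concrete, here is a minimal sketch of how two of them could be computed from one time step's atomic coordinates. It assumes the coordinates have already been loaded into a NumPy array of shape (n_atoms, 3). The function names are hypothetical and are not taken from our platform, and the RMSD shown here skips the structural alignment step that is often applied first.

```python
import numpy as np


def rmsd(coords, reference):
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays.

    Assumption: both structures are already in the same frame of reference;
    no superposition (alignment) is performed here.
    """
    coords = np.asarray(coords, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Squared displacement of each atom, averaged over all atoms.
    sq_disp = ((coords - reference) ** 2).sum(axis=1)
    return float(np.sqrt(sq_disp.mean()))


def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of an (n_atoms, 3) coordinate array.

    If no masses are given, every atom is weighted equally.
    """
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))
    # Center of mass, then the weighted mean squared distance from it.
    center = np.average(coords, axis=0, weights=masses)
    sq_dist = ((coords - center) ** 2).sum(axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))
```

Each function reduces one frame, which may hold thousands of coordinates, to a single number, which is why storing these values is far cheaper than keeping the raw output.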
Given these challenges, a colleague and I decided to create a web platform that processes incoming simulation data in real time and ingests the results into a relational database. A dynamic web application then visualizes this data, along with the various parameters of interest, at every level of the simulation hierarchy. For example, a single simulation launches multiple runs in parallel, where each run starts from a different set of initial conditions.
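As an illustration of how such a hierarchy might map onto a relational database, the sketch below uses Python's built-in sqlite3 module. The table names, column names, and helper functions are all assumptions made for this example. They do not reflect the actual schema or database engine behind our platform, and only the simulation and run levels come from the description above.

```python
import sqlite3

# Hypothetical schema: a simulation has many runs, and each run has one row
# of derived metrics per time step.
SCHEMA = """
CREATE TABLE simulation (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE run (
    id            INTEGER PRIMARY KEY,
    simulation_id INTEGER NOT NULL REFERENCES simulation(id),
    run_index     INTEGER NOT NULL,
    UNIQUE (simulation_id, run_index)
);
CREATE TABLE frame_metric (
    run_id             INTEGER NOT NULL REFERENCES run(id),
    time_step          INTEGER NOT NULL,
    rmsd               REAL,
    radius_of_gyration REAL,
    PRIMARY KEY (run_id, time_step)
);
"""


def open_db(path=":memory:"):
    """Open a database, enable foreign-key checks, and create the tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_simulation(conn, name):
    cur = conn.execute("INSERT INTO simulation (name) VALUES (?)", (name,))
    return cur.lastrowid


def add_run(conn, simulation_id, run_index):
    cur = conn.execute(
        "INSERT INTO run (simulation_id, run_index) VALUES (?, ?)",
        (simulation_id, run_index),
    )
    return cur.lastrowid


def record_frame(conn, run_id, time_step, rmsd, rg):
    """Store the derived metrics for one time step of one run."""
    conn.execute(
        "INSERT OR REPLACE INTO frame_metric VALUES (?, ?, ?, ?)",
        (run_id, time_step, rmsd, rg),
    )


def mean_rmsd_per_run(conn, simulation_id):
    """Return (run_index, mean RMSD) pairs for every run of a simulation."""
    return conn.execute(
        """
        SELECT run.run_index, AVG(frame_metric.rmsd)
        FROM run
        JOIN frame_metric ON frame_metric.run_id = run.id
        WHERE run.simulation_id = ?
        GROUP BY run.run_index
        ORDER BY run.run_index
        """,
        (simulation_id,),
    ).fetchall()
```

Keying the metrics table on (run_id, time_step) means a time step that arrives twice overwrites its earlier row instead of creating a duplicate, and a query such as mean_rmsd_per_run can summarize every run of a simulation with a single join.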
The platform is now actively used by lab members. Every simulation the lab has launched on the Folding@Home network, whether ongoing or completed, is available in the platform.