NOAA Data Processing Pipeline

PUBLISHED ON SEP 1, 2018 — INDUSTRY, THE AEROSPACE CORPORATION

Purpose

One of my first contributions to The Aerospace Corporation’s cloud-native initiative was building a highly parallelized satellite telemetry data processing pipeline. The goal of this project was twofold. First, I needed to build a system that processed data an order of magnitude faster than a legacy system from the pre-cloud-native era. Processed data would then be streamed into a reporting tool for real-time abnormality detection, a capability that was not possible before these foundational changes. Second, I needed to demonstrate the benefits and challenges of a containerized, Kubernetes-orchestrated data processing pipeline.

Background

The telemetry data is generated by the National Oceanic and Atmospheric Administration’s (NOAA) Joint Polar Satellite System. It is collected at a ground station, rsynced into The Aerospace Corporation’s Cloud Technologies Lab, and stored as H5 files. These files are massive because they store billions of time-based readings across tens of thousands of measurands: sensors, batteries, orientation, etc. Once the data is stored, it is streamed into a processing pipeline with two key phases:

  • Transforming the H5 files into a readable format for the abnormality detection software
  • Merging data streams with varying sample rates

Finally, the processed data is fed into non-real-time abnormality detection software.
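
To give a feel for the raw input, here is a minimal sketch of how one might inspect the structure of such a file with h5py. The group and dataset layout of the actual NOAA files is not reproduced here; the filename is a hypothetical stand-in, and the script simply prints whatever the file contains.

```python
import h5py

def list_measurands(path):
    """Walk an H5 telemetry file and report every dataset it contains."""
    with h5py.File(path, "r") as f:
        def visitor(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
        f.visititems(visitor)

# Example: list_measurands("telemetry_granule.h5")  # hypothetical filename
```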

Parallelism and Containerization Save the Data

Initially, I was handed a legacy JAR binary that accomplished the above. For a number of reasons I had no access to the code base, so I played around with it against some test data until I figured out what the cat inside the black box was hiding. I then rewrote the entire thing in Python as a parallelized version.
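
The rewritten code isn't public, so the following is only a sketch of the general shape: each H5 file is independent, so the incoming files can be fanned out across a pool of worker processes. The directory path and the process_file stand-in are hypothetical.

```python
from multiprocessing import Pool
from pathlib import Path

def process_file(path):
    # Stand-in for the real per-file work: parse the H5 file, reformat the
    # measurand streams, and hand the result to the merge stage.
    return path

def run_pipeline(input_dir, workers=8):
    paths = sorted(Path(input_dir).glob("*.h5"))
    # Each file is processed end to end by one worker process.
    with Pool(processes=workers) as pool:
        for done in pool.imap_unordered(process_file, paths):
            print(f"finished {done}")

if __name__ == "__main__":
    run_pipeline("/data/telemetry/incoming")  # hypothetical path
```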

Dataflow Part 1: Processing Raw Data
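
As a rough illustration of this stage, the sketch below pulls a single measurand's time/value stream out of an H5 file and writes it to CSV, a format the downstream tooling can read. The dataset path and field names are hypothetical; the real files hold tens of thousands of measurands with their own internal layout.

```python
import csv
import h5py

def extract_measurand(h5_path, dataset_path, out_csv):
    """Dump one measurand's time/value stream to CSV."""
    with h5py.File(h5_path, "r") as f:
        records = f[dataset_path][:]       # hypothetical path, e.g. "battery/voltage"
        timestamps = records["timestamp"]  # assumes a compound dtype with these fields
        values = records["value"]
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["timestamp", "value"])
        writer.writerows(zip(timestamps, values))
```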

Dataflow Part 2: Merging Data
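
Merging streams with varying sample rates was the trickier half. One way to express it, and the way this sketch does, is an as-of join: align each reading of the faster stream with the most recent reading of the slower one within some tolerance. The column names and the five-second tolerance are illustrative, not the values the real pipeline used.

```python
import pandas as pd

def merge_streams(fast_csv, slow_csv, tolerance="5s"):
    """As-of join two time series sampled at different rates."""
    fast = pd.read_csv(fast_csv, parse_dates=["timestamp"]).sort_values("timestamp")
    slow = pd.read_csv(slow_csv, parse_dates=["timestamp"]).sort_values("timestamp")
    # For each fast-rate row, take the latest slow-rate reading that is not
    # newer than it and no further away than `tolerance`.
    return pd.merge_asof(
        fast,
        slow,
        on="timestamp",
        tolerance=pd.Timedelta(tolerance),
        direction="backward",
        suffixes=("_fast", "_slow"),
    )
```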

Once a local, parallelized version of the system was running successfully, I containerized everything and deployed it in a Kubernetes cluster. This involved persistent storage so that worker pods could read and store the telemetry data. It involved service discovery so that pods could stream event metadata to and from Kafka topics. Finally, it involved replica sets so that a large number of workers could crunch away at the incoming telemetry data in real time.
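
The post doesn't pin down which Kafka client the workers used; assuming the kafka-python package, a worker pod's event loop might look roughly like the sketch below. The topic names and the bootstrap address (resolved through a Kubernetes Service) are stand-ins.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Hypothetical Service DNS name and topic names; the real deployment differs.
BOOTSTRAP = "kafka.telemetry.svc.cluster.local:9092"

def run_worker(process_event):
    consumer = KafkaConsumer(
        "telemetry-events",
        bootstrap_servers=BOOTSTRAP,
        group_id="telemetry-workers",  # replicas in the same group share the work
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers=BOOTSTRAP,
        value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    )
    for msg in consumer:
        result = process_event(msg.value)  # e.g. process and merge one file's data
        producer.send("processed-events", result)
```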

Two Birds, One Stone

Rewriting the pipeline as a parallelized application allowed us to use a Kubernetes cluster to deploy highly available data processing workers. This combination enabled us to scale up as far as the underlying hardware would allow, and it gave the system a high degree of modularity and testability. As a result, processing time improved by an order of magnitude, and the customer was convinced of the power of orchestration and of cloud-native approaches in general.

Final Deployment

Technologies

  • Python
  • Docker
  • Kubernetes: PVs, PVCs, Services, DaemonSets, etc.
  • Kafka