Keystone: A Streaming Data Management Model for the Environmental Sciences
Computer Science and Engineering
In this thesis we explore a problem facing contemporary data management in the earth and environmental sciences: the effective production of uniform, high-quality data products at a rate that keeps pace with the volume and velocity of continuous data collection. The process of creating a quality data product is non-trivial, and this thesis examines in detail the knowledge required to automate this process for emerging mid-scale efforts, as well as prior attempts toward this goal. Furthermore, we propose a model that can be used to develop a mid-scale data product pipeline in the earth and environmental sciences: Keystone. Specifically, by automating Quality Assurance, Quality Control, and Data Repair processes, data products can be created at a rate that keeps pace with the production of the data itself. To demonstrate the effectiveness of this model, three software applications, each fulfilling one of the key roles suggested by the Keystone model, were conceived, implemented, and validated individually. These three applications are the NRDC Quality Assurance Application, the Near Real Time Autonomous Quality Control Application (NRAQC), and the Improved Robust and Sparse Fuzzy K Means (iRSFKM) imputation algorithm. Respectively, they provide metadata management and binding through a multi-platform mobile application; automated data quality control through a dynamic web application; and rapid data imputation for data repair. The latter leverages multi-GPU processing to significantly accelerate a high-accuracy algorithm. The NRDC Quality Assurance application was validated with the aid of a directed user survey disseminated among environmental scientist members of the Earth Science Information Partners organization.
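The kind of automated quality control described above can be illustrated with a minimal sketch. The function name, flag labels, and threshold parameters here are illustrative assumptions, not taken from the NRAQC implementation, which applies user-configured metrics per stream:

```python
import math

def qc_flags(values, lo, hi, max_repeats=3):
    """Flag each reading in a sensor stream as 'missing', 'out_of_range',
    'repeat', or 'ok'. Thresholds (lo, hi, max_repeats) stand in for the
    user-configured metrics a QC system like NRAQC would apply."""
    flags = []
    run = 0          # length of the current run of identical readings
    prev = object()  # sentinel that never equals a real reading
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            flags.append("missing")
            run, prev = 0, object()
        elif not (lo <= v <= hi):
            flags.append("out_of_range")
            run, prev = 0, object()
        else:
            run = run + 1 if v == prev else 1
            prev = v
            flags.append("repeat" if run > max_repeats else "ok")
    return flags

# e.g. a stuck sensor, a dropout, and a spike in one stream:
# qc_flags([10, 10, 10, 10, None, 500], lo=0, hi=100, max_repeats=3)
# -> ['ok', 'ok', 'ok', 'repeat', 'missing', 'out_of_range']
```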
An analysis of these surveys indicated that the NRDC Quality Assurance application addresses significant gaps in metadata creation and binding: many respondents still record metadata with pen and paper and take between hours and weeks to digitize it.

The efficacy of the NRAQC application was demonstrated through a case study in which nearly a million data points were batch tested against various user-configured metrics. The NRAQC system consistently flagged the same data points on the same streams over the course of five iterations, with an average testing time of 134.24 seconds per iteration. Specifically, in each iteration, the NRAQC system identified 1946 repeat values, 365 missing values, and 131 out-of-range values.

Finally, we demonstrated the effectiveness of our iRSFKM algorithm and implementation through multiple experiments clustering real environmental sensor data. These experiments showed that our improved multi-GPU implementation is significantly faster than sequential implementations, achieving a 185-fold speedup on eight GPUs. Experiments also indicated a greater than 300-fold increase in throughput with eight GPUs and 95% efficiency with two GPUs compared to one.

The overall model itself was validated through a discussion of how effectively these software solutions worked in tandem to produce a final data product in a primarily automated fashion.
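To give a concrete sense of cluster-based imputation for data repair, the following is a simplified, single-CPU sketch of fuzzy-c-means-style gap filling. It is not the iRSFKM algorithm itself (which adds robustness to outliers, sparsity handling, and multi-GPU parallelism); the function names, the cluster count `k`, and the fuzzifier `m` are illustrative choices:

```python
import numpy as np

def fuzzy_cmeans(X, k, m=2.0, iters=50, seed=0):
    """Plain fuzzy c-means on complete data X (n x d).
    Returns centroids C (k x d) and fuzzy memberships U (n x k)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)           # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted centroids
        D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U = 1.0 / D ** (2.0 / (m - 1.0))        # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return C, U

def impute_missing(X, k=3, m=2.0):
    """Fill NaNs: fit fuzzy c-means on the complete rows, then repair each
    incomplete row from the centroids, weighted by memberships computed on
    its observed features only."""
    complete = ~np.isnan(X).any(axis=1)
    C, _ = fuzzy_cmeans(X[complete], k, m)
    Y = X.copy()
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        d = np.linalg.norm(C[:, obs] - X[i, obs], axis=1) + 1e-12
        u = 1.0 / d ** (2.0 / (m - 1.0))
        u /= u.sum()                            # fuzzy membership of the row
        Y[i, ~obs] = (u @ C)[~obs]              # blend of centroid coordinates
    return Y
```

The per-row imputation loop is what iRSFKM parallelizes across GPUs: memberships and centroid blends for different rows are independent, so they can be computed concurrently.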