This project supports data-related analysis in a wide range of science and engineering applications. It will contribute to the development of scalable data-centric cyberinfrastructure capabilities to accelerate interdisciplinary and collaborative research. The exascale file system will serve as a catalyst for research in data storage architectures, and will enable new data-focused services and capabilities that advance scientific discoveries, collaborations, and innovations. The project addresses a major data challenge common to a range of communities such as social science, economics, and bioengineering, and will provide thorough training and collaborative research opportunities for project participants.
Both high-performance computing (HPC) clusters and Hadoop clusters rely on file systems. A Hadoop cluster uses the Hadoop Distributed File System (HDFS), which resides on the compute nodes, while an HPC cluster usually uses a remote storage system. Despite years of research and application development on HPC and Hadoop clusters, the file systems in both types of clusters still face the formidable challenge of achieving exascale computing capabilities. The centralized data indexing used in HDFS and HPC storage architectures cannot provide high scalability and reliability, and both architectures have shortcomings such as single points of failure and inefficient data access. This project builds scalable, high-availability data capabilities in data-centric cyberinfrastructure to overcome these shortcomings and create a highly scalable file system with new techniques for distributed load balancing, data replication, and consistency maintenance.
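To make the load-balancing goal concrete, the following is a minimal illustrative sketch and not the project's actual algorithm. It assumes each storage node's load can be summarized as a block count, and it computes the moves with a simple greedy loop in one place for brevity, whereas the project targets a distributed scheme. The function name `rebalance`, the `tolerance` parameter, and the node names are hypothetical.

```python
from typing import Dict, List, Tuple


def rebalance(loads: Dict[str, int], tolerance: int = 1) -> List[Tuple[str, str]]:
    """Plan single-block moves until node loads differ by at most `tolerance`.

    `loads` maps a (hypothetical) node name to the number of blocks it stores.
    Returns a list of (source_node, destination_node) moves. The input
    dictionary is not modified.
    """
    if tolerance < 1:
        # With a tolerance of 0 and an uneven total, the loop would never settle.
        raise ValueError("tolerance must be at least 1")
    if not loads:
        return []

    current = dict(loads)  # work on a copy
    moves: List[Tuple[str, str]] = []
    while True:
        heaviest = max(current, key=current.get)
        lightest = min(current, key=current.get)
        if current[heaviest] - current[lightest] <= tolerance:
            break
        # Shift one block from the most loaded node to the least loaded node.
        current[heaviest] -= 1
        current[lightest] += 1
        moves.append((heaviest, lightest))
    return moves


if __name__ == "__main__":
    plan = rebalance({"n1": 10, "n2": 2, "n3": 0})
    for source, destination in plan:
        print(f"move one block: {source} -> {destination}")
```

In this sketch every move takes a block from the currently most loaded node, so the plan only drains overloaded nodes. A real system would also have to account for replica placement, network cost, and the absence of a single coordinator.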
GitHub Package for Implementation of Adaptive File Replication in Hadoop MapReduce
GitHub Package for Implementation of Load Rebalancing in Hadoop MapReduce