Actual buildings (above) whose CAD model (below)
is highly compressible,

Compression-Aware Algorithms Project

This project is based in the Department of Computer Science at the University of Virginia, and is supported by the National Science Foundation under grant III-1117684,

The people involved in this research are professors Gabriel Robins (PI) and abhi Shelat (co-PI), and graduate students Nathan Brunelle, Michael Skalak, and Robbie Hott.

Motivation and Overview

A data avalanche is sweeping and permeating all sectors of society, including industry, government, and academia. Achieving the full transformative potential of this deluge of data requires addressing new and open questions, especially with respect to the scalability of data creation, storage, and processing. In particular, while much data is stored in compressed format, very few classic data processing algorithms are able to process compressed data. Thus the rate of growth of common datasets is far outpacing algorithmic capability.

To address this growing disconnect, this project is developing a general framework for compression-aware algorithms that operate directly on compressed massive datasets. Each algorithm's speed and memory space performance can dramatically improve with the input's compressibility (i.e., descriptive complexity), rather than depend solely on the input size. This improvement derives from leveraging highly repetitive or parametrically specified input structures, leading to much smaller inputs (and outputs), and enabling algorithms to manipulate very large composite objects while interacting only with their succinct descriptions. We are devising compression-aware algorithms in several classical domains, including for geometric, graph-based, and numerical applications such as sorting and searching.

This project also investigates algorithmically-aware compression schemes specifically designed to allow efficient computations directly on the compressed data. We are also devising new compression-aware data structures that can efficiently handle and store compressed inputs, while still allowing standard queries and access to that data. Our algorithmic contributions will be validated with experiments on both real and synthetic massive data sets. This principled compression-aware algorithms methodology has wide applicability across numerous diverse areas, including networks, bioinformatics, databases, computer graphics, artificial intelligence, geographic information systems, integrated circuit design, and computer-aided engineering.

The intellectual merit of this proposal stems from its broad conceptualization of new families of compression-aware algorithms that will operate directly on massive compressed datasets, coupled with a focus on new applications and practical considerations. This project will design more sophisticated and efficient algorithms, and will enable the processing and mining of very large datasets in numerous areas across academia, industry and government. This project is also training students, conducting outreach, and helping to produce key curricula components in the area of data mining, compression-aware algorithms, optimization, and approximation, and operations research.

The broader impact of the proposed research lies in increasing the efficacy and utility of future compression-aware algorithms and approximation schemes across a wide spectrum of fields, with clear benefits to science, society, and the economy. Data mining and computational sciences have attracted much attention in recent years, but runaway data volumes have swamped existing algorithms and stifled progress. The formulations, algorithms, codes, and theories being developed by this project will help usher in future efficient and practical algorithms and heuristics, and fundamentally change the way in which organizations generate, collect, process, utilize, and mine large datasets.

Preliminary Results

We obtained a series of interesting results on compression-aware algorithmic variants of classical problems in sorting and searching, including order-statistics (i.e., "selection" or finding the kth largest element). Our new compression-aware algorithms are able to sort and search lists compressed using either the standard Lempel-Ziv-77 compression scheme, unions of arithmetic progressions, or using context-free grammar -based compressions. In all cases our new algorithms have substantially faster run times than their classical counterparts.

Some of our findings thus far are summarized in the following draft:

Brunelle, N., Robins, G., and Shelat, a., Sorting a Compressed List, June, 2012

These preliminary results are very encouraging, and demonstrate that this general research area of compression-aware algorithms is ripe with new and useful results that await discovery. We are currently developing additonal compression-aware algorithms for other problem areas, and implementing and benchmarking our new algorithms. We will soon make our results, implementations, and benchmarks available to the larger research community.