The algorithm used by rechunker tries to satisfy several constraints simultaneously:
Respect memory limits. Rechunker’s algorithm guarantees that worker processes will not exceed a user-specified memory threshold.
Minimize the number of required tasks. Specifically, for N source chunks and M target chunks, the number of tasks is always less than N + M.
Be embarrassingly parallel. The task graph should be as simple as possible, to make it easy to execute using different task scheduling frameworks. This also means avoiding write locks, which are complex to manage, and inter-worker communication.
The algorithm we chose emerged via a lively discussion on the Pangeo Discourse Forum. We call it Push / Pull Consolidated.
A rough sketch of the algorithm is as follows
User inputs a source array with a specific shape, chunk structure and data type. Also specifies
target_chunks, the desired chunk structure of the output array and
max_mem, the maximum amount of memory each worker is allowed to use.
Determine the largest batch of data we can write by one worker given
max_mem. These are the
Determine the largest batch of data we can read by one worker given
max_mem, plus the additional constraint of trying to fit within write chunks if possible. These are the
write_chunks == read chunks, we can avoid creating an intermediate dataset and copy the data directly from source to target.
Otherwise, intermediate chunks are defined as the minimum of
read_chunksalong each axis. The source is copied first to the intermediate array and then from intermediate to target.