API

Rechunk Function

The main function exposed by rechunker is rechunker.rechunk().

rechunker.rechunk(source, target_chunks, max_mem, target_store, target_options=None, temp_store=None, temp_options=None, executor: Union[str, rechunker.types.CopySpecExecutor] = 'dask') → rechunker.api.Rechunked

Rechunk a Zarr Array or Group, a Dask Array, or an Xarray Dataset.

Parameters
source : zarr.Array, zarr.Group, dask.array.Array, or xarray.Dataset

Named dimensions in the Zarr arrays will be parsed according to the Xarray Zarr Encoding Specification.

target_chunks : tuple, dict, or None

The desired chunks of the array after rechunking. The structure depends on source.

  • For a single array source, target_chunks can be either a tuple (e.g. (20, 5, 3)) or a dictionary (e.g. {'time': 20, 'lat': 5, 'lon': 3}). Dictionary syntax requires that the dimension names be present in the Zarr Array attributes (see the Xarray Zarr Encoding Specification). A value of None means that the array will be copied with no change to its chunk structure.

  • For a group of arrays, a dict is required. The keys correspond to array names. The values are target_chunks arguments for the corresponding array. For example, {'foo': (20, 10), 'bar': {'x': 3, 'y': 5}, 'baz': None}. All arrays you want to rechunk must be explicitly named. Arrays that are not present in the target_chunks dict will be ignored.
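As a sketch of the forms described above (the helper below is purely illustrative and not part of rechunker's API):

```python
import math

# Equivalent target_chunks for a single 3-D array with dims (time, lat, lon).
# The dict form requires dimension names in the Zarr array attributes
# (per the Xarray Zarr Encoding Specification).
as_tuple = (20, 5, 3)
as_dict = {"time": 20, "lat": 5, "lon": 3}

# For a group, keys are array names and values are per-array target_chunks.
# Arrays absent from the dict are ignored; None copies chunks unchanged.
group_chunks = {"foo": (20, 10), "bar": {"x": 3, "y": 5}, "baz": None}

def n_chunks(shape, chunks):
    """Illustrative helper: how many chunks each dimension has after rechunking."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

print(n_chunks((100, 50, 30), as_tuple))  # (5, 10, 10)
```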

max_mem : str or int

The amount of memory (in bytes) that workers are allowed to use. A string (e.g. "100MB") can also be used.
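To make the string form concrete, here is a small hedged sketch of how such a value maps to a byte count. This parser is illustrative only; rechunker's actual parsing is an internal detail, and this sketch assumes SI units (1 MB = 10**6 bytes):

```python
import re

def parse_mem(max_mem):
    """Illustrative: interpret max_mem as either int bytes or a string like '100MB'."""
    if isinstance(max_mem, int):
        return max_mem
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMGT]?B)", max_mem.strip(), re.IGNORECASE)
    if not m:
        raise ValueError(f"cannot parse {max_mem!r}")
    scale = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
    return int(float(m.group(1)) * scale[m.group(2).upper()])

print(parse_mem("100MB"))   # 100000000
print(parse_mem(256000))    # 256000
```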

target_store : str, MutableMapping, or zarr.Store object

The location in which to store the final, rechunked result. Will be passed directly to zarr.creation.create().

target_options : dict, optional

Additional keyword arguments used to control array storage. If the source is an xarray.Dataset, these options are used to encode variables in the same manner as the encoding parameter in xarray.Dataset.to_zarr(). Otherwise, they are passed to zarr.creation.create(). The structure depends on source.

  • For a single array source, this should be a single dict such as {'compressor': zarr.Blosc(), 'order': 'F'}.

  • For a group of arrays, a nested dict is required with values like the above keyed by array name. For example, {'foo': {'compressor': zarr.Blosc(), 'order': 'F'}, 'bar': {'compressor': None}}.

temp_store : str, MutableMapping, or zarr.Store object, optional

Location of temporary store for intermediate data. Can be deleted once rechunking is complete.

temp_options : dict, optional

Options with the same semantics as target_options, but applied to temp_store rather than target_store. Defaults to target_options; has no effect when source is an xarray.Dataset.

executor : str or rechunker.types.CopySpecExecutor

Implementation of the execution engine for copying between zarr arrays. Supplying a custom executor is currently even more experimental than the rest of rechunker: we expect the interface to evolve as we add more executors, and we make no guarantees of backwards compatibility. The currently implemented executors are:

  • dask

  • beam

  • prefect

  • python

  • pywren
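A custom executor can be sketched from the two methods documented for the built-in executors below (prepare_plan and execute_plan). The toy class here is an assumption based on that interface: real copy specifications are rechunker.types objects rather than the bare callables used for illustration, and the interface is explicitly experimental:

```python
# Toy serial executor, illustrating the documented two-step interface:
# prepare_plan(specs) -> plan, then execute_plan(plan).
class SerialExecutor:
    def prepare_plan(self, specs):
        # Convert copy specifications into a plan; here, a list of callables.
        return [lambda spec=spec: spec() for spec in specs]

    def execute_plan(self, plan, **kwargs):
        # Execute a plan: run each copy task in order.
        for task in plan:
            task()

copied = []
executor = SerialExecutor()
plan = executor.prepare_plan([lambda: copied.append("chunk-0"),
                              lambda: copied.append("chunk-1")])
executor.execute_plan(plan)
print(copied)  # ['chunk-0', 'chunk-1']
```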

Returns
rechunked : Rechunked object

The Rechunked Object

rechunk returns a Rechunked object.

class rechunker.Rechunked(executor, plan, source, intermediate, target)

A delayed rechunked result.

This represents the rechunking plan, and when executed will perform the rechunking and return the rechunked array.

Examples

>>> source = zarr.ones((4, 4), chunks=(2, 2), store="source.zarr")
>>> intermediate = "intermediate.zarr"
>>> target = "target.zarr"
>>> rechunked = rechunk(source, target_chunks=(4, 1), target_store=target,
...                     max_mem=256000,
...                     temp_store=intermediate)
>>> rechunked
<Rechunked>
* Source      : <zarr.core.Array (4, 4) float64>
* Intermediate: dask.array<from-zarr, ... >
* Target      : <zarr.core.Array (4, 4) float64>
>>> rechunked.execute()
<zarr.core.Array (4, 4) float64>

Attributes
plan

Returns the executor-specific scheduling plan.

Methods

execute(**kwargs)

Execute the rechunking.

Note

You must call execute() on the Rechunked object in order to actually perform the rechunking operation.

Warning

You must manually delete the intermediate store when execute is finished.
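For an on-disk intermediate store, that manual cleanup amounts to removing a directory once execute() has finished. A minimal sketch (the paths are illustrative stand-ins for the store left behind):

```python
import pathlib
import shutil
import tempfile

# After rechunked.execute() completes, rechunker leaves the intermediate
# store in place; deleting it is the caller's responsibility.
tmp = pathlib.Path(tempfile.mkdtemp()) / "intermediate.zarr"
tmp.mkdir()         # stand-in for the store left behind by execute()
shutil.rmtree(tmp)  # manual cleanup once rechunking is complete
print(tmp.exists())  # False
```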

Executors

Rechunking plans can be executed on a variety of backends. The following table lists the current options.

rechunker.executors.beam.BeamExecutor(*args, ...)

An execution engine based on Apache Beam.

rechunker.executors.pywren.PywrenExecutor([...])

An execution engine based on Pywren.

class rechunker.executors.beam.BeamExecutor(*args, **kwds)

An execution engine based on Apache Beam.

Supports copying between any arrays that implement __getitem__ and __setitem__ for tuples of slice objects. Arrays must also be serializable by Beam (i.e., with pickle).

Execution plans for BeamExecutor are beam.PTransform objects.

Methods

execute_plan(plan, **kwargs)

Execute a plan.

prepare_plan(specs)

Convert copy specifications into a plan.

class rechunker.executors.pywren.PywrenExecutor(pywren_function_executor: Optional[pywren_ibm_cloud.executor.FunctionExecutor] = None)

An execution engine based on Pywren.

Supports zarr arrays as inputs. Outputs must be zarr arrays.

Any Pywren FunctionExecutor can be passed to the constructor. By default, a Pywren local_executor will be used.

Execution plans for PywrenExecutor are functions that accept no arguments.

Methods

execute_plan(plan, **kwargs)

Execute a plan.

prepare_plan(specs)

Convert copy specifications into a plan.