gedidb.GEDIDatabase

class gedidb.GEDIDatabase(config: Dict[str, Any], credentials: dict | None = None)

Manages creation and operation of global TileDB arrays for GEDI data storage.

Performance design decisions

  • Hilbert pre-sort is opt-in via config (‘hilbert_presort’: true). It improves compression and read locality, but costs an argsort plus a fancy-index copy per granule. Leave it disabled when write throughput matters more than read performance.

  • TileDBFilterPolicy defaults to fast-write mode (‘use_filters’: false). In fast mode only Zstd(1) is applied, with no ByteShuffle or BitWidthReduction pre-processors. Set ‘use_filters’: true in config to enable the full compression pipeline (ByteShuffle+Zstd for floats, BitWidthReduction+Zstd for narrow ints). DoubleDelta is kept on time/timestamp in both modes.

  • dtype coercion in _extract_variable_data is skipped when the source Series already matches the target dtype — avoids a full array copy for every attribute on every granule.

  • Spatial domain bounds and array domain metadata are cached on __init__ so write_granule / spatial_chunking never re-open the array just for config lookups.

  • allows_duplicates=True preserves all valid GEDI shots, including co-located shots within the same UTC day. The old drop_duplicates() silently discarded valid data.

  • write_batch() amortises the TileDB open/close cost across many granules. Prefer it over calling write_granule() in a loop for large ingestion jobs.

  • mark_granule_as_processed() now has retry logic, which the old version lacked.

  • timestamp_ns is stored as true int64 nanoseconds. The old version divided by 1000 (yielding microseconds), which broke nanosecond-precision deduplication.
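The Hilbert pre-sort mentioned in the first bullet can be illustrated with a small, standard-library-only sketch. It is not the gedidb implementation: the helper names hilbert_index and hilbert_order, the grid resolution, and the bounds layout are assumptions made for illustration, and the real code path presumably uses NumPy (argsort plus fancy indexing) rather than Python lists. The idea is to quantise each shot's coordinates to a grid cell, map the cell to its position along a Hilbert curve, and reorder the shots by that position so that spatial neighbours end up close together on disk.

```python
from typing import List, Sequence, Tuple


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along the Hilbert curve of an n x n grid.

    n must be a power of two; x and y must lie in [0, n).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(
    lons: Sequence[float],
    lats: Sequence[float],
    bounds: Tuple[float, float, float, float],
    n: int = 1 << 16,
) -> List[int]:
    """Return the permutation (an "argsort") that orders points along the curve.

    bounds is (lat_min, lat_max, lon_min, lon_max); this layout is an assumption.
    """
    lat_min, lat_max, lon_min, lon_max = bounds

    def cell(value: float, lo: float, hi: float) -> int:
        # Quantise a coordinate to a grid cell, clamping to the grid edges.
        frac = (value - lo) / (hi - lo)
        return min(n - 1, max(0, int(frac * n)))

    keys = [
        hilbert_index(n, cell(lon, lon_min, lon_max), cell(lat, lat_min, lat_max))
        for lon, lat in zip(lons, lats)
    ]
    return sorted(range(len(keys)), key=keys.__getitem__)


# Two clusters of shots, interleaved in acquisition order.
lons = [10.0, -170.0, 10.1, -169.9]
lats = [40.0, -50.0, 40.1, -50.1]
order = hilbert_order(lons, lats, bounds=(-90.0, 90.0, -180.0, 180.0))

# Applying the permutation is the "fancy-index copy" step.
sorted_lons = [lons[i] for i in order]
sorted_lats = [lats[i] for i in order]
print(order)
```

After sorting, the two shots of each cluster sit next to each other. The cost is the one the bullet names: one sort and one full copy of every column per granule.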

__init__(config: Dict[str, Any], credentials: dict | None = None)

Initialise GEDIDatabase.

Parameters:
  • config (dict) – Configuration dictionary.

  • credentials (dict, optional) – AWS/S3 credentials. Required when storage_type == ‘s3’.
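The rule that credentials are required for S3 storage can be checked before constructing the database. The sketch below is a hypothetical pre-flight helper, not part of gedidb. It assumes that storage_type, hilbert_presort and use_filters are top-level keys of the config dictionary, and that ‘local’ is a valid storage_type value. Neither assumption is confirmed by this page, and a real configuration will likely need further keys.

```python
from typing import Any, Dict, Optional


def require_s3_credentials(config: Dict[str, Any], credentials: Optional[dict]) -> None:
    """Hypothetical check mirroring the documented rule for `credentials`."""
    if config.get("storage_type") == "s3" and not credentials:
        raise ValueError("credentials are required when storage_type == 's3'")


# Assumed layout: the keys named on this page placed at the top level.
local_config = {
    "storage_type": "local",   # assumed value; only 's3' is named on this page
    "hilbert_presort": True,   # opt in to the Hilbert pre-sort
    "use_filters": False,      # fast-write mode (the documented default)
}

require_s3_credentials(local_config, None)  # passes: no credentials needed

s3_config = dict(local_config, storage_type="s3")
try:
    require_s3_credentials(s3_config, None)
except ValueError as exc:
    print(f"rejected: {exc}")
```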

Methods

__init__(config[, credentials])

Initialise GEDIDatabase.

check_granules_status(granule_ids[, full_only])

Check processed status for a list of granule IDs in a single metadata read.

consolidate_fragments([consolidation_type, ...])

Consolidate fragments, metadata, and commit logs.

mark_granule_as_processed(granule_key)

Mark a granule as processed in TileDB metadata (with retry).

mark_granules_as_processed_batch(granule_keys)

Mark multiple granules as processed in a single TileDB open/close.

spatial_chunking(dataset[, ...])

Yield ((lat_min, lat_max, lon_min, lon_max), view) pairs.

write_granule(granule_data)

Write the parsed GEDI granule data to the global TileDB arrays, filtering out shots that are outside the spatial domain.
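The summary for mark_granule_as_processed mentions retry behaviour without describing it. As a rough illustration of what retrying a metadata write can look like, here is a generic retry helper with exponential backoff. It is a sketch under assumptions, not the gedidb implementation: the name retry, the attempt count, the delays, and the choice of which exceptions to retry are all hypothetical.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `attempts` is exhausted.

    The delay doubles after each failure: base_delay, 2 * base_delay, ...
    The last failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Usage with a stand-in for a flaky metadata write (fails twice, then succeeds).
calls: List[int] = []


def flaky_write() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise OSError("transient storage error")
    return "ok"


delays: List[float] = []
result = retry(flaky_write, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, len(calls), delays)
```

Injecting the sleep function keeps the helper testable without real waiting. In production code it is usually worth restricting retry_on to errors that are known to be transient.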