Data Provider#

The gedidb.GEDIProvider class is gediDB's core interface for accessing structured GEDI data and metadata stored in a TileDB database. It runs spatial and temporal queries and retrieves only the variables you request, which keeps queries efficient and makes the large volume of data produced by the GEDI mission practical to use in geospatial analysis.

Key capabilities#

Spatial Queries: Query GEDI data based on specific spatial boundaries, enabling analyses within defined regions.
Temporal Queries: Filter data by date range to focus on specific time periods.
Variable Selection: Retrieve only the data variables needed for your analysis to optimize performance.
Quality Filters: Apply additional quality filters to refine data retrieval based on specific conditions.
Reference Point Query: Query GEDI data based on a reference point and get the nearest shots within a defined radius.
Flexible Output Formats: Export results as either xarray.Dataset for multi-dimensional data or pandas.DataFrame for tabular data.

Available variables#

The database includes a wide range of variables, covering spatial coordinates, elevation data, vegetation metrics, biomass estimates, and quality flags across multiple GEDI products (L2A, L2B, L4A, L4C). These variables enable detailed analyses of forest structure, canopy height, biomass density, and waveform complexity. Below is a table of some variables stored in the database:

Variable Descriptions#
Variable Name	Description	Units	Product
agbd	Aboveground biomass density	Mg/ha	L4A
cover	Total canopy cover	Percent	L2B
cover_z	Cumulative canopy cover vertical profile	Percent	L2B
rh	Relative height metrics at 1% interval	Meters	L2A
wsci	Waveform Structural Complexity Index	dimensionless	L4C
rh100	Height above ground of the received waveform signal start	cm	L2B
fhd_normal	Foliage Height Diversity	dimensionless	L2B
pai	Total Plant Area Index	m²/m²	L2B
pavd_z	Plant Area Volume Density profile	m²/m³	L2B
sensitivity	Maximum canopy cover that can be penetrated	dimensionless	L2A

Additional variables include elevation data, geolocation flags, and quality indicators, allowing for comprehensive assessments of forest ecosystems. Users can refer to the configuration file or use the gediDB package to query the full list of available variables.

Retrieving GEDI data with the GEDI provider#

The gedidb.GEDIProvider class is your main tool for querying GEDI data from the TileDB database. The following example shows how to configure the provider and retrieve data for a region and time range; quality filters for further refinement are covered later on this page.

Basic query example#

import geopandas as gpd
import gedidb as gdb

# Load region of interest
region_of_interest = gpd.read_file('./data/geojson/BR-Sa1.geojson')

# Instantiate the GEDIProvider
provider = gdb.GEDIProvider(storage_type='local',
                            local_path= "/path/to/your/database")

# Define the columns to query and additional parameters
variables = ["agbd", "rh"]

dataset = provider.get_data(variables = variables,
                            geometry = region_of_interest,
                            start_time = "2018-01-01",
                            end_time = "2024-12-31",
                            return_type= 'xarray')

Parameters for `get_data()`#

variables: List of variables (columns) to retrieve from the database. Profile variables (e.g. rh, cover_z, pai_z, pavd_z) return all values per shot as a list by default. To fetch a single element by label and save bandwidth, use the "variable:label" syntax, e.g. "rh:98" (98th-percentile canopy height only) or "cover_z:50" (cumulative cover at 50 m above ground only). The resulting column is renamed {variable}_p{label} (e.g. rh_p98).

geometry: (Optional) GeoPandas geometry for spatial filtering, for example the GeoDataFrame returned by gpd.read_file() in the example above.

start_time: (Optional) Start date for temporal filtering (format: “YYYY-MM-DD”).

end_time: (Optional) End date for temporal filtering (format: “YYYY-MM-DD”).

return_type: Specifies the format of the returned data, either xarray.Dataset (“xarray”) or pandas.DataFrame (“dataframe”). The default is “xarray”.

query_type: (Optional) Type of query to execute, either “nearest” or “bounding_box” (default: “bounding_box”). A “nearest” query also requires the point parameter.

point: (Optional) Reference point for nearest query, required if query_type is “nearest” (format: Tuple[longitude, latitude]).

num_shots: (Optional) Number of shots to retrieve if the query_type is “nearest” (default: 10).

radius: (Optional) Radius in degrees around the point if the query_type is “nearest” (default: 0.1).

quality_filters: (Optional) Additional quality filters to apply to the query.

The returned data is formatted according to the return_type parameter, making it ready for further analysis.
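The nearest-shot parameters listed above are easiest to read together. The sketch below only assembles the keyword arguments as a plain dictionary, so it runs without a database; the helpers km_to_degrees and nearest_query_kwargs are hypothetical and not part of gediDB, the coordinates are arbitrary placeholders, and the kilometre-to-degree conversion (about 111.32 km per degree) is a rough approximation that holds for latitude and, near the equator, for longitude.

```python
from datetime import datetime

KM_PER_DEGREE = 111.32  # assumption: approximate length of one degree of latitude


def km_to_degrees(km):
    """Roughly convert a distance in kilometres to degrees."""
    return km / KM_PER_DEGREE


def nearest_query_kwargs(variables, point, radius_km, num_shots=10,
                         start_time=None, end_time=None):
    """Assemble get_data() keyword arguments for a nearest query (hypothetical helper)."""
    lon, lat = point  # the documented order is (longitude, latitude)
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("point must be (longitude, latitude) in degrees")
    kwargs = {
        "variables": list(variables),
        "query_type": "nearest",
        "point": (lon, lat),
        "num_shots": num_shots,
        "radius": km_to_degrees(radius_km),  # radius is expressed in degrees
        "return_type": "dataframe",
    }
    for name, value in (("start_time", start_time), ("end_time", end_time)):
        if value is not None:
            # Raises ValueError if the string is not a valid year-month-day date.
            datetime.strptime(value, "%Y-%m-%d")
            kwargs[name] = value
    return kwargs


kwargs = nearest_query_kwargs(["agbd", "rh"], point=(-54.96, -2.86),
                              radius_km=5, num_shots=20,
                              start_time="2020-01-01", end_time="2020-12-31")
print(kwargs)
# With a configured provider: df = provider.get_data(**kwargs)
```

Building the arguments separately makes it easy to check the point order and the radius unit before the query is sent.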

Querying individual profile points#

Several GEDI variables are stored as per-shot vertical profiles:

Profile variables#
Variable	Profile labels	Label meaning	Length
`rh`, `rh_a1`, `rh_a2`, `rh_a5`	0, 1, 2, …, 100	Percentile	101
`cover_z`	0, 5, 10, …, 145	Height above ground (m)	30
`pai_z`	0, 5, 10, …, 145	Height above ground (m)	30
`pavd_z`	0, 5, 10, …, 145	Height above ground (m)	30
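When a full profile is returned as a list, the label grids in the table determine where each label sits in that list. The sketch below is a minimal illustration that assumes the list is ordered by ascending label; PROFILE_GRIDS, label_to_index and select_label are hypothetical names introduced here, not gediDB functions.

```python
# Label grids from the profile-variable table: (first label, step, length).
PROFILE_GRIDS = {
    "rh": (0, 1, 101),
    "rh_a1": (0, 1, 101),
    "rh_a2": (0, 1, 101),
    "rh_a5": (0, 1, 101),
    "cover_z": (0, 5, 30),
    "pai_z": (0, 5, 30),
    "pavd_z": (0, 5, 30),
}


def label_to_index(variable, label):
    """Return the list position of `label` in a full profile of `variable`."""
    start, step, length = PROFILE_GRIDS[variable]
    index, remainder = divmod(label - start, step)
    if remainder != 0 or not 0 <= index < length:
        raise ValueError(f"{label} is not a valid label for {variable}")
    return index


def select_label(profile, variable, label):
    """Pick one element out of a full profile list."""
    return profile[label_to_index(variable, label)]


print(label_to_index("rh", 98))       # percentile 98 of rh
print(label_to_index("cover_z", 50))  # the 50 m height bin of cover_z
```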

Requesting a full profile (e.g. "rh") returns all values as a list per shot. To fetch only a single element, which is useful when you need just one percentile or one height bin and want to minimise data transfer, use the "variable:label" syntax:

import geopandas as gpd
import gedidb as gdb

provider = gdb.GEDIProvider(storage_type='local',
                            local_path="/path/to/your/database")
region_of_interest = gpd.read_file('./data/geojson/BR-Sa1.geojson')

# Fetch only the 98th-percentile canopy height → column named "rh_p98"
df = provider.get_data(
    variables=["agbd", "rh:98"],
    geometry=region_of_interest,
    start_time="2020-01-01",
    end_time="2020-12-31",
    return_type="dataframe",
)

# Fetch cumulative cover at 50 m height → column named "cover_z_p50"
df = provider.get_data(
    variables=["agbd", "cover_z:50"],
    geometry=region_of_interest,
    start_time="2020-01-01",
    end_time="2020-12-31",
    return_type="dataframe",
)

# Request several single-label selections in one query
df = provider.get_data(
    variables=["rh:98", "rh:50", "pavd_z:25"],
    geometry=region_of_interest,
    start_time="2020-01-01",
    end_time="2020-12-31",
    return_type="dataframe",
)
# Columns: rh_p98, rh_p50, pavd_z_p25
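The renaming rule {variable}_p{label} can be applied up front to predict which columns a request will produce. The helper below is a small sketch of that rule only; output_column is a hypothetical name, and the check that labels are non-negative integers is an assumption based on the label grids listed in the profile table.

```python
def output_column(spec):
    """Return the column name expected for one entry of `variables`."""
    variable, sep, label = spec.partition(":")
    if not sep:
        return variable  # no label given: the name is unchanged
    if not label.isdigit():
        raise ValueError(f"label in {spec!r} must be a non-negative integer")
    return f"{variable}_p{label}"


requested = ["agbd", "rh:98", "rh:50", "pavd_z:25"]
columns = [output_column(spec) for spec in requested]
print(columns)
# ['agbd', 'rh_p98', 'rh_p50', 'pavd_z_p25']
```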

Applying additional quality filters#

You can further refine the data retrieval by specifying additional quality filters. This customization allows filtering based on specific conditions for selected variables. The filters are added as keyword arguments in the form of field-value conditions.

Example with additional quality filters#

In the following example, we define specific quality filters for the sensitivity and beam_type fields:

import geopandas as gpd
import gedidb as gdb

# Instantiate the GEDIProvider
provider = gdb.GEDIProvider(storage_type='local',
                            local_path= "/path/to/your/database")


# Load region of interest
region_of_interest = gpd.read_file('./data/geojson/BR-Sa1.geojson')

# Define the columns to query, additional parameters, and quality filters
variables = ["agbd", "rh"]
quality_filters = {
    'sensitivity': '>= 0.95 and <= 1.0',
    'beam_type': "== 'full'",
}

gedi_data = provider.get_data(variables = variables,
                              geometry = region_of_interest,
                              start_time = "2018-01-01",
                              end_time = "2024-12-31",
                              return_type = 'xarray',
                              **quality_filters)

Quality filters are passed as key-value pairs where the key is the variable name and the value is the condition (e.g., 'sensitivity': '>= 0.95 and <= 1.0'). This lets you restrict the query to shots that meet specific criteria, improving the relevance of the retrieved data.
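To make the meaning of such condition strings concrete, the sketch below evaluates them against a few hand-written records in plain Python. It illustrates how the conditions read and is not how gediDB applies them internally; the small grammar (comparison clauses joined by "and"), the helper names and the sample values are all assumptions made for this example.

```python
import operator
import re

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       "!=": operator.ne, ">": operator.gt, "<": operator.lt}
CLAUSE = re.compile(r"^\s*(>=|<=|==|!=|>|<)\s*(.+?)\s*$")


def parse_literal(text):
    """Turn "'full'" into the string full and "0.95" into the float 0.95."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    return float(text)


def matches(value, condition):
    """Check one value against a condition such as '>= 0.95 and <= 1.0'."""
    for clause in condition.split(" and "):
        m = CLAUSE.match(clause)
        if m is None:
            raise ValueError(f"cannot parse clause {clause!r}")
        op, literal = m.groups()
        if not OPS[op](value, parse_literal(literal)):
            return False
    return True


filters = {"sensitivity": ">= 0.95 and <= 1.0", "beam_type": "== 'full'"}
shots = [  # made-up records for illustration only
    {"sensitivity": 0.97, "beam_type": "full"},
    {"sensitivity": 0.91, "beam_type": "full"},
    {"sensitivity": 0.99, "beam_type": "coverage"},
]
kept = [s for s in shots if all(matches(s[k], c) for k, c in filters.items())]
print(kept)  # only the first record satisfies both conditions
```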

Supported output formats#

The gedidb.GEDIProvider supports the following output formats, allowing you to choose the structure that best suits your analysis:

xarray.Dataset: Ideal for multi-dimensional data that includes labeled dimensions, suitable for advanced numerical and geospatial analysis.
pandas.DataFrame: Perfect for tabular data and smaller datasets, allowing for quick manipulation and export to CSV or other formats.

Below is an example of how the dataset looks in the xarray.Dataset format:

<xarray.Dataset> Size: 291MB
Dimensions:         (shot_number: 660802, profile_points: 101)
Coordinates:
  * shot_number     (shot_number) uint64 5MB 84121100400504737 ... 8412110040...
  * profile_points  (profile_points) int64 808B 0 1 2 3 4 5 ... 96 97 98 99 100
    latitude        (shot_number) float64 5MB -1.044 -1.139 ... -14.85 -14.85
    longitude       (shot_number) float64 5MB -56.48 -56.38 ... -46.41 -46.41
    time            (shot_number) datetime64[ns] 5MB 2020-06-07 ... 2020-06-07
Data variables:
    agbd            (shot_number) float32 3MB 143.8 45.86 50.03 ... 6.885 11.16
    rh              (shot_number, profile_points) float32 267MB -1.53 ... 8.85

The dataset includes multiple dimensions and variables:

Dimensions: shot_number (unique ID for each shot) and profile_points (vertical profile points).
Coordinates: Metadata such as time, latitude, and longitude, describing each shot’s spatial and temporal context.
Data Variables: Core variables like rh (relative height) and agbd (aboveground biomass density) for ecological analysis.

Below is an example of how the dataset looks in the pandas.DataFrame format:

             latitude  longitude       time  ...  rh_99     rh_100     rh_101
0       -1.044146 -56.475181 2020-06-07  ...  25.59  26.040001  26.570000
1       -1.138822 -56.375156 2020-06-07  ...  15.30  15.680000  16.280001
2       -1.138396 -56.375457 2020-06-07  ...  14.48  14.740000  15.080000
3       -1.189413 -56.366139 2020-06-07  ...  16.48  16.809999  17.219999
4       -1.188570 -56.366732 2020-06-07  ...   9.97  10.200000  10.500000
          ...        ...        ...  ...    ...        ...        ...
660797 -14.849312 -46.408216 2020-06-07  ...   2.42   2.760000   3.580000
660798 -14.848904 -46.408533 2020-06-07  ...   4.14   4.970000   6.650000
660799 -14.848492 -46.408853 2020-06-07  ...   6.53   7.920000   9.790000
660800 -14.847665 -46.409496 2020-06-07  ...   4.97   6.500000   8.740000
660801 -14.848078 -46.409175 2020-06-07  ...   6.09   7.170000   8.850000

[660802 rows x 106 columns]
