Data Provider#

The gedidb.GEDIProvider class is gediDB's core interface for accessing structured GEDI data and metadata stored in a TileDB database. It lets you run spatial and temporal queries on GEDI data, retrieve only the variables you need, and return the results in a form suited to further geospatial analysis of the data generated by the GEDI mission.

Key capabilities#

  • Spatial Queries: Query GEDI data based on specific spatial boundaries, enabling analyses within defined regions.

  • Temporal Queries: Filter data by date range to focus on specific time periods.

  • Variable Selection: Retrieve only the data variables needed for your analysis to optimize performance.

  • Quality Filters: Apply additional quality filters to refine data retrieval based on specific conditions.

  • Reference Point Query: Query GEDI data based on a reference point and get the nearest shots within a defined radius.

  • Flexible Output Formats: Export results as either xarray.Dataset for multi-dimensional data or pandas.DataFrame for tabular data.

Potentially available variables#

The database includes a wide range of variables, covering spatial coordinates, elevation data, vegetation metrics, biomass estimates, and quality flags across multiple GEDI products (L2A, L2B, L4A, L4C). These variables enable detailed analyses of forest structure, canopy height, biomass density, and waveform complexity. Below is a table of some variables stored in the database:

Variable Descriptions#

Variable Name  Description                                                Units          Product
agbd           Aboveground biomass density                                Mg/ha          L4A
cover          Total canopy cover                                         Percent        L2B
cover_z        Cumulative canopy cover vertical profile                   Percent        L2B
rh             Relative height metrics at 1% interval                     Meters         L2A
wsci           Waveform Structural Complexity Index                       dimensionless  L4C
rh100          Height above ground of the received waveform signal start  Meters         L2B
fhd_normal     Foliage Height Diversity                                   dimensionless  L2B
pai            Total Plant Area Index                                     m²/m²          L2B
pavd_z         Plant Area Volume Density profile                          m²/m³          L2B
sensitivity    Maximum canopy cover that can be penetrated                dimensionless  L2A

Additional variables include elevation data, geolocation flags, and quality indicators, allowing for comprehensive assessments of forest ecosystems. Users can refer to the configuration file or use the gediDB package to query the full list of available variables.

Retrieving GEDI data with the GEDI provider#

The gedidb.GEDIProvider class is the main tool for querying GEDI data from the TileDB database. The following examples show how to configure the provider and retrieve data, optionally applying additional quality filters to refine the result.

Basic query example#

import geopandas as gpd
import gedidb as gdb

# Load region of interest
region_of_interest = gpd.read_file('./data/geojson/BR-Sa1.geojson')

# Instantiate the GEDIProvider
provider = gdb.GEDIProvider(storage_type='local',
                            local_path='/path/to/your/database')

# Define the variables (columns) to query
variables = ["agbd", "rh"]

# Query the region and time range, returning an xarray.Dataset
dataset = provider.get_data(variables=variables,
                            geometry=region_of_interest,
                            start_time="2018-01-01",
                            end_time="2024-12-31",
                            return_type='xarray')

Parameters for get_data()#

  • variables: List of variables (columns) to retrieve from the database.

  • geometry: (Optional) GeoPandas geometry used for spatial filtering, such as the GeoDataFrame returned by gpd.read_file() in the example above.

  • start_time: (Optional) Start date for temporal filtering (format: “YYYY-MM-DD”).

  • end_time: (Optional) End date for temporal filtering (format: “YYYY-MM-DD”).

  • return_type: Format of the returned data, either xarray.Dataset (“xarray”) or pandas.DataFrame (“dataframe”). The default is “xarray”.

  • query_type: (Optional) Type of query to execute, either “nearest” or “bounding_box” (default: “bounding_box”). A “nearest” query also requires the point parameter.

  • point: (Optional) Reference point for a nearest query, given as a (longitude, latitude) tuple. Required if query_type is “nearest”.

  • num_shots: (Optional) Number of shots to retrieve if the query_type is “nearest” (default: 10).

  • radius: (Optional) Radius in degrees around the point if the query_type is “nearest” (default: 0.1).

  • quality_filters: (Optional) Additional quality filters to apply to the query, passed as keyword arguments in which each name is a variable and each value is a condition string.

The returned data is formatted according to the return_type parameter, making it ready for further analysis.
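As a sketch of the nearest-shot parameters listed above, the helper below wraps get_data() for a point query. The names nearest_shots and km_to_degrees, the kilometre-based radius, and the 111.32 km-per-degree factor are assumptions introduced for illustration and are not part of gediDB; only the get_data() keyword arguments come from the parameter list. The conversion is a rough approximation that holds for latitude, since a degree of longitude gets shorter away from the equator.

```python
# Hypothetical helpers, not part of gediDB.
KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude


def km_to_degrees(radius_km):
    """Convert a radius in kilometres to an approximate radius in degrees."""
    return radius_km / KM_PER_DEGREE


def nearest_shots(provider, lon, lat, variables,
                  num_shots=10, radius_km=10.0, return_type="dataframe"):
    """Request the shots nearest to (lon, lat) from a GEDIProvider-like object."""
    return provider.get_data(
        variables=variables,
        query_type="nearest",
        point=(lon, lat),              # (longitude, latitude)
        num_shots=num_shots,
        radius=km_to_degrees(radius_km),
        return_type=return_type,
    )


# Usage, assuming a configured provider as in the basic query example
# (the coordinates are placeholders):
# shots = nearest_shots(provider, lon=-55.0, lat=-2.9, variables=["agbd", "rh"])
```

Keeping the provider as an argument means the helper can be exercised with a stand-in object before it is pointed at a real database.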

Applying additional quality filters#

You can further refine the data retrieval by specifying additional quality filters. This customization allows filtering based on specific conditions for selected variables. The filters are added as keyword arguments in the form of field-value conditions.

Example with additional quality filters#

In the following example, we define specific quality filters for the sensitivity and beam_type fields:

import geopandas as gpd
import gedidb as gdb

# Instantiate the GEDIProvider
provider = gdb.GEDIProvider(storage_type='local',
                            local_path='/path/to/your/database')

# Load region of interest
region_of_interest = gpd.read_file('./data/geojson/BR-Sa1.geojson')

# Define the variables to query and the quality filters
variables = ["agbd", "rh"]
quality_filters = {
    'sensitivity': '>= 0.95 and <= 1.0',
    'beam_type': "== 'full'",
}

# The ** operator passes each filter to get_data() as a keyword argument
gedi_data = provider.get_data(variables=variables,
                              geometry=region_of_interest,
                              start_time="2018-01-01",
                              end_time="2024-12-31",
                              return_type='xarray',
                              **quality_filters)

Each quality filter is a key-value pair in which the key is the variable name and the value is a condition string (e.g., ‘sensitivity’: ‘>= 0.95 and <= 1.0’). Unpacking the dictionary with ** passes each pair to get_data() as a keyword argument, so any number of conditions can be combined in a single query.
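Because the condition strings are plain text, they can also be assembled programmatically. The sketch below builds the same dictionary as the example above. The helper names between and equals are hypothetical and not part of gediDB, and only the two condition forms shown in this section are assumed to be accepted.

```python
# Hypothetical helpers, not part of gediDB.
def between(low, high):
    """Condition string for an inclusive numeric range."""
    if low > high:
        raise ValueError("low must not exceed high")
    return f">= {low} and <= {high}"


def equals(value):
    """Equality condition; string values are wrapped in single quotes."""
    if isinstance(value, str):
        return f"== {value!r}"
    return f"== {value}"


quality_filters = {
    "sensitivity": between(0.95, 1.0),
    "beam_type": equals("full"),
}

# The dictionary can then be unpacked into the query as before:
# provider.get_data(variables=["agbd", "rh"], **quality_filters)
```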

Supported output formats#

The gedidb.GEDIProvider supports the following output formats, allowing you to choose the structure that best suits your analysis:

  • xarray.Dataset: Ideal for multi-dimensional data that includes labeled dimensions, suitable for advanced numerical and geospatial analysis.

  • pandas.DataFrame: Perfect for tabular data and smaller datasets, allowing for quick manipulation and export to CSV or other formats.

Below is an example of how the dataset looks in the xarray.Dataset format:

<xarray.Dataset> Size: 291MB
Dimensions:         (shot_number: 660802, profile_points: 101)
Coordinates:
  * shot_number     (shot_number) uint64 5MB 84121100400504737 ... 8412110040...
  * profile_points  (profile_points) int64 808B 0 1 2 3 4 5 ... 96 97 98 99 100
    latitude        (shot_number) float64 5MB -1.044 -1.139 ... -14.85 -14.85
    longitude       (shot_number) float64 5MB -56.48 -56.38 ... -46.41 -46.41
    time            (shot_number) datetime64[ns] 5MB 2020-06-07 ... 2020-06-07
Data variables:
    agbd            (shot_number) float32 3MB 143.8 45.86 50.03 ... 6.885 11.16
    rh              (shot_number, profile_points) float32 267MB -1.53 ... 8.85

The dataset includes multiple dimensions and variables:

  • Dimensions: shot_number (unique ID for each shot) and profile_points (vertical profile points).

  • Coordinates: Metadata such as time, latitude, and longitude, describing each shot’s spatial and temporal context.

  • Data Variables: Core variables such as rh (relative height metrics) and agbd (aboveground biomass density) for ecological analysis.
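To illustrate how this layout can be used, the following sketch builds a tiny synthetic dataset with the same dimension and variable names, then extracts one relative height level and keeps only shots above a biomass threshold. All values are made up, the shot identifiers are simplified to small integers, and reading profile_points label 98 as the 98% relative height is an assumption based on the 0 to 100 labels shown above.

```python
import numpy as np
import xarray as xr

n_shots = 4
# Synthetic rh profile: 101 evenly spaced heights from 0 m to 25 m per shot
rh = np.tile(np.linspace(0.0, 25.0, 101, dtype="float32"), (n_shots, 1))

ds = xr.Dataset(
    data_vars={
        "agbd": ("shot_number", np.array([143.8, 45.86, 50.03, 6.885], dtype="float32")),
        "rh": (("shot_number", "profile_points"), rh),
    },
    coords={
        "shot_number": np.arange(n_shots),
        "profile_points": np.arange(101),
        "latitude": ("shot_number", np.array([-1.044, -1.139, -14.85, -14.85])),
        "longitude": ("shot_number", np.array([-56.48, -56.38, -46.41, -46.41])),
    },
)

# Select a single profile point by label: one value per shot
rh98 = ds["rh"].sel(profile_points=98)

# Keep only shots whose biomass density exceeds 50 Mg/ha
high_biomass = ds.where(ds["agbd"] > 50.0, drop=True)
```

The same label-based selection and masking should apply to a dataset returned by get_data(), provided it has the structure shown in the printed example.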

Below is an example of how the dataset looks in the pandas.DataFrame format:

             latitude  longitude       time  ...  rh_99     rh_100     rh_101
0       -1.044146 -56.475181 2020-06-07  ...  25.59  26.040001  26.570000
1       -1.138822 -56.375156 2020-06-07  ...  15.30  15.680000  16.280001
2       -1.138396 -56.375457 2020-06-07  ...  14.48  14.740000  15.080000
3       -1.189413 -56.366139 2020-06-07  ...  16.48  16.809999  17.219999
4       -1.188570 -56.366732 2020-06-07  ...   9.97  10.200000  10.500000
          ...        ...        ...  ...    ...        ...        ...
660797 -14.849312 -46.408216 2020-06-07  ...   2.42   2.760000   3.580000
660798 -14.848904 -46.408533 2020-06-07  ...   4.14   4.970000   6.650000
660799 -14.848492 -46.408853 2020-06-07  ...   6.53   7.920000   9.790000
660800 -14.847665 -46.409496 2020-06-07  ...   4.97   6.500000   8.740000
660801 -14.848078 -46.409175 2020-06-07  ...   6.09   7.170000   8.850000

[660802 rows x 106 columns]
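In the tabular form, the rh profile is spread across one column per profile point. The following is a minimal sketch, assuming the rh_<n> column naming seen in the output above and using a small made-up frame, of selecting those columns and exporting a subset to CSV text:

```python
import pandas as pd

# Made-up stand-in for a frame returned with return_type='dataframe'
df = pd.DataFrame({
    "latitude": [-1.044146, -1.138822],
    "longitude": [-56.475181, -56.375156],
    "agbd": [143.8, 45.86],
    "rh_99": [25.59, 15.30],
    "rh_100": [26.04, 15.68],
    "rh_101": [26.57, 16.28],
})

# Collect the relative height columns by their shared prefix
rh_columns = [name for name in df.columns if name.startswith("rh_")]

# Export the location, biomass, and profile columns as CSV text
csv_text = df[["latitude", "longitude", "agbd"] + rh_columns].to_csv(index=False)
```

Passing a file path to to_csv() writes the same content to disk instead of returning a string.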