Data Model & Outputs
Input contract, calibration conventions, and what functions return
Purpose
This page is the practical bridge between:
- Quick Start / Tutorials (how to run),
- User Guide (how to interpret), and
- Technical Docs (why the math and design choices work).
If you are unsure whether your data is “ready” for SPI/SPEI or what a function returns (DataArray vs Dataset vs DataFrame), start here.
Input Data Contract
Supported containers
- Primary: `xarray.DataArray` / `xarray.Dataset`
- Single-location analysis: also supports `pandas.Series` for run theory functions
Required coordinates and dimensions
Your gridded inputs must include:
- `time`
- `lat`
- `lon`
Dimension order can vary. Preferred is (time, lat, lon).
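Dimension order can be normalized up front; a minimal sketch with a toy array (dimension names as required above, data illustrative):

```python
import numpy as np
import xarray as xr

# Toy DataArray stored in (lon, lat, time) order
da = xr.DataArray(
    np.random.rand(4, 3, 24),
    dims=("lon", "lat", "time"),
)

# Reorder to the preferred (time, lat, lon)
da = da.transpose("time", "lat", "lon")
print(da.dims)  # ('time', 'lat', 'lon')
```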
Temporal resolution and periodicity
- Your data can be monthly or daily.
- Always set `periodicity` consistently with your time axis.
- Time scales (e.g., `scale=12`) represent aggregation at the chosen periodicity.
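For intuition, `scale=12` on monthly data corresponds to a 12-step rolling accumulation before the distribution fit; a rough stand-in for that aggregation step (not the library's internal code):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Constant 100 mm/month series, 3 years of monthly data
time = pd.date_range("2000-01-01", periods=36, freq="MS")
precip = xr.DataArray(np.full(36, 100.0), dims="time", coords={"time": time})

# 12-month accumulation: each value is the total of the trailing 12 months
precip_12 = precip.rolling(time=12).sum()
print(float(precip_12[-1]))  # 1200.0 for a constant 100 mm/month series
```

The first 11 steps are NaN because a full 12-month window is not yet available, which is why standardized indices at longer scales start later than the raw record.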
Variable requirements
| Index | Required inputs | Units (expected) |
|---|---|---|
| SPI | precipitation | mm/month or mm/day |
| SPEI | precipitation + PET or precipitation + temperature + latitude | precip: mm/*, PET: mm/*, temp: °C |
SPEI: Three input paths
SPEI supports three ways to provide evapotranspiration data:
- **Pre-computed PET**: pass `pet=your_pet_data` directly (any PET method)
- **Temperature + latitude (Thornthwaite)**: pass `temperature=` and `latitude=`; PET is computed internally using the Thornthwaite method (requires only mean temperature)
- **Temperature + Tmin/Tmax + latitude (Hargreaves)**: pass `temperature=`, `temp_min=`, `temp_max=`, and `latitude=`; PET is computed using the Hargreaves-Samani method (better for arid/semi-arid regions)
```python
# Path 1: Pre-computed PET
spei_12 = spei(precip, pet=pet, scale=12)

# Path 2: Auto-PET from temperature (Thornthwaite - default)
spei_12 = spei(precip, temperature=temp, latitude=lat, scale=12)

# Path 3: Auto-PET using Hargreaves (requires Tmin/Tmax)
spei_12 = spei(precip, temperature=temp, latitude=lat, scale=12,
               pet_method='hargreaves', temp_min=tmin, temp_max=tmax)
```

PET Method Selection
| Method | Data Requirements | Best For | Reference |
|---|---|---|---|
| Thornthwaite (default) | Tmean, latitude | Humid/temperate climates | Thornthwaite (1948) |
| Hargreaves-Samani | Tmean, Tmin, Tmax, latitude | Arid/semi-arid regions | Hargreaves & Samani (1985) |
If you have TerraClimate or ERA5-Land data with Tmin/Tmax available, Hargreaves often correlates better with physically-based PET (like Penman-Monteith). Testing on Bali showed Hargreaves correlation r=0.80 vs Thornthwaite r=0.75 when compared to TerraClimate PET.
Variable auto-detection
When you pass an xarray.Dataset (rather than a DataArray), the code searches for variables matching these name patterns:
| Variable type | Auto-detected names |
|---|---|
| Precipitation | precip, prcp, precipitation, pr, ppt, rainfall |
| PET | pet, eto, et, evap |
| Temperature (mean) | temp, tas, t2m, tmean |
| Temperature (min) | tmin, tasmin |
| Temperature (max) | tmax, tasmax |
You can always override auto-detection with explicit parameters: precip_var_name=, pet_var_name=, temp_var_name=.
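If you want to predict which variable would be picked up, the name matching can be emulated; a sketch of the lookup logic using the precipitation aliases from the table above (not the library's actual resolver):

```python
import numpy as np
import xarray as xr

# Precipitation aliases listed in the auto-detection table
PRECIP_NAMES = {"precip", "prcp", "precipitation", "pr", "ppt", "rainfall"}

def detect_precip_var(ds):
    """Return the first data variable matching a known precipitation alias, else None."""
    for name in ds.data_vars:
        if str(name).lower() in PRECIP_NAMES:
            return str(name)
    return None

ds = xr.Dataset({"pr": ("time", np.zeros(3)), "elevation": ("y", np.zeros(2))})
print(detect_precip_var(ds))  # 'pr'
```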
Missing values and data quality
- SPI/SPEI are only meaningful when the underlying record is reasonably complete.
- Recommended minimum for robust statistics: ~30 years of monthly data.
- Ensure basic QC: handle impossible negative values, unit mistakes, and time gaps before fitting.
Distribution fitting thresholds
The code enforces these minimums per grid cell during calibration. If a cell fails any check, its output is set to NaN:
| Check | Threshold | What happens if violated |
|---|---|---|
| Minimum valid values | 30 | Fitting skipped (NaN output) |
| Minimum non-zero values | 10 | Fitting skipped (NaN output) |
| Maximum zero proportion | 95% | Fitting skipped (NaN output) |
Problem: In deserts and semi-arid regions, many months have zero precipitation. If >95% of months are zero, fitting fails entirely (NaN output). Even with 80-95% zeros, fitting quality degrades.
Why this matters: Pearson Type III distribution requires at least 4 non-zero values for L-moments computation. When this fails, the code falls back to Method of Moments, then to a normal approximation — but results may be unreliable.
Diagnosis: Use diagnose_data() before running SPI/SPEI to check zero proportions:
```python
from distributions import diagnose_data

# Check a single location
ts = precip.isel(lat=50, lon=100).values
diag = diagnose_data(ts)
print(f"Zero proportion: {diag.zero_proportion:.1%}")
print(f"Recommendation: {diag.recommendation}")
```

Options for arid regions:
- Accept NaN — If a region is truly hyper-arid, SPI may not be meaningful there
- Use longer time scales — SPI-12 or SPI-24 aggregates more precipitation, reducing zero proportion
- Mask desert pixels — Exclude cells with mean annual precipitation < 50 mm
- Use Gamma distribution — More robust than Pearson III for high-zero data (but still limited)
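Option 3 above (masking desert pixels) is straightforward with `xarray.where`; a sketch assuming monthly precipitation in mm:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("1991-01-01", periods=360, freq="MS")
vals = np.full((360, 2, 2), 10.0)  # 10 mm/month everywhere (120 mm/yr)...
vals[:, 0, 0] = 1.0                # ...except one hyper-arid pixel (12 mm/yr)
precip = xr.DataArray(vals, dims=("time", "lat", "lon"), coords={"time": time})

# Mean annual precipitation per pixel = mean monthly value * 12
map_mm = precip.mean("time") * 12

# Mask pixels drier than 50 mm/year before computing SPI
precip_masked = precip.where(map_mm >= 50)
print(bool(precip_masked.isel(lat=0, lon=0).isnull().all()))  # True (masked out)
```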
A quick validation pass before fitting:

```python
import xarray as xr

precip = xr.open_dataset('input/precip.nc')['precip']

# 1. Dimensions
print(f"Dims: {precip.dims}")  # Should include time, lat, lon

# 2. Non-negative
print(f"Min: {precip.min().values}")  # Should be >= 0

# 3. Reasonable range
print(f"Max: {precip.max().values}")  # Should be < 1000 mm/month

# 4. Time span
years = precip.time.dt.year
print(f"Period: {int(years.min())}-{int(years.max())}")  # >= 30 years

# 5. Missing data
missing_pct = float(precip.isnull().mean() * 100)
print(f"Missing: {missing_pct:.1f}%")  # Should be < 5%
```

Calibration and Scale Conventions
Calibration period
- Typical default: 1991-2020 (WMO-style baseline)
- Calibration should overlap your data range
- Calibration is applied per grid cell (each pixel fitted independently)
- Update every 10 years when WMO revises the normal period
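Whatever parameter the API uses to set the baseline, it is worth confirming your record actually covers it; a generic check for a 1991-2020 window:

```python
import numpy as np
import pandas as pd
import xarray as xr

# 40 years of monthly data, 1985-2024
time = pd.date_range("1985-01-01", periods=480, freq="MS")
precip = xr.DataArray(np.ones(480), dims="time", coords={"time": time})

# Does the record fully cover the 1991-2020 calibration window?
calib = precip.sel(time=slice("1991-01-01", "2020-12-31"))
print(calib.sizes["time"])  # 360 months = 30 full years
```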
Distribution selection
Five probability distributions are supported via the distribution parameter:
| Distribution | Key | Parameters | Recommended for |
|---|---|---|---|
| Gamma | `'gamma'` | `alpha`, `beta`, `prob_zero` | SPI (default) |
| Pearson Type III | `'pearson3'` | `skew`, `loc`, `scale`, `prob_zero` | SPEI |
| Log-Logistic | `'log_logistic'` | `alpha`, `beta`, `prob_zero` | SPEI (alternative) |
| GEV | `'gev'` | `shape`, `loc`, `scale`, `prob_zero` | Extreme value analysis |
| Generalized Logistic | `'gen_logistic'` | `shape`, `loc`, `scale`, `prob_zero` | Heavy-tailed data |
Default: 'gamma' for both SPI and SPEI. Recommendation: Gamma for SPI, Pearson III or Log-Logistic for SPEI.
See Probability Distributions for fitting methods, goodness-of-fit tests, and detailed guidance.
Zero precipitation handling (SPI)
SPI uses a mixed distribution when zeros exist:
\[H(x) = q + (1-q) \cdot G(x)\]
Where \(q\) = proportion of zeros in the calibration period, \(G(x)\) = fitted continuous distribution for positive values. See Methodology for the full derivation.
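A worked example: with q = 0.3 and G(x) = 0.5 at some observed accumulation, H(x) = 0.3 + 0.7 · 0.5 = 0.65, and the index value is the standard normal quantile of that probability. A sketch using `scipy.stats`:

```python
from scipy.stats import norm

q = 0.3      # proportion of zero-precipitation months in calibration
G_x = 0.5    # fitted continuous CDF evaluated at the observed accumulation

# Mixed CDF: H(x) = q + (1 - q) * G(x)
H_x = q + (1 - q) * G_x          # 0.65

# SPI is the standard normal quantile of the mixed probability
spi_value = norm.ppf(H_x)
print(round(spi_value, 3))       # 0.385
```

Note that even the driest possible observation (x = 0, G(x) = 0) maps to H = q, so SPI in high-zero regions is bounded below, which is one reason hyper-arid cells are better masked out.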
Parameter reuse (recommended for operational workflows)
For repeated runs (e.g., adding new months), you can:
- Fit once on a calibration period
- Save fitted parameters
- Reuse them later to avoid refitting
```python
from indices import spi, save_fitting_params, load_fitting_params

# One-time: fit and save
spi_12, params = spi(precip, scale=12, return_params=True)
save_fitting_params(params, 'spi_12_params.nc',
                    scale=12, periodicity='monthly', index_type='spi')

# Operational: load and reuse (much faster)
params = load_fitting_params('spi_12_params.nc', scale=12, periodicity='monthly')
spi_12 = spi(new_precip, scale=12, fitting_params=params)
```

This is especially important for large domains.
Output Contract (What You Get Back)
SPI / SPEI single-scale outputs
Return type: xarray.DataArray
- Dimensions: `(time, lat, lon)`
- Values: standardized index (typically within [-3.09, +3.09], clipped at bounds)
- Variable name: follows the pattern `{index}_{distribution}_{scale}_{periodicity}`
- Examples: `spi_gamma_12_month`, `spei_pearson3_3_month`
Attributes on the DataArray:
| Attribute | Example value |
|---|---|
| `scale` | `12` |
| `distribution` | `'gamma'` |
| `calibration_start_year` | `1991` |
| `calibration_end_year` | `2020` |
Multi-scale outputs
Return type: xarray.Dataset with one variable per scale:
- `spi_gamma_1_month`, `spi_gamma_3_month`, `spi_gamma_12_month`, …
- `spei_pearson3_3_month`, `spei_pearson3_12_month`, …
NetCDF file encoding
When saved via save_index_to_netcdf() or .to_netcdf():
| Property | Value |
|---|---|
| Data type | float32 |
| Fill value | -9999.0 |
| Compression | zlib, level 4-5 |
| Conventions | CF-1.8 |
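If you save with plain `.to_netcdf()` rather than `save_index_to_netcdf()`, the same encoding can be reproduced explicitly; a sketch of the equivalent encoding dict (the write call is commented out because it needs a netCDF backend such as the netCDF4 package):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=12, freq="MS")
spi_12 = xr.DataArray(
    np.zeros((12, 2, 2), dtype="float32"),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="spi_gamma_12_month",
)

# Encoding matching the defaults above: float32, -9999.0 fill, zlib level 4
encoding = {spi_12.name: {
    "dtype": "float32",
    "_FillValue": -9999.0,
    "zlib": True,
    "complevel": 4,
}}

# spi_12.to_dataset().to_netcdf("spi_12.nc", encoding=encoding)
```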
Global attributes (customizable via global_attrs or DEFAULT_METADATA):
```python
from config import DEFAULT_METADATA

# Set once at script start — applies to all subsequent outputs
DEFAULT_METADATA['institution'] = 'My University'
DEFAULT_METADATA['references'] = 'McKee et al. (1993)'
DEFAULT_METADATA['comment'] = 'Regional drought monitoring project'
```

Available fields: `institution`, `source`, `Conventions`, `references`, `comment`. You can also add custom fields (e.g., `project`, `contact`, `license`) via the `global_attrs` parameter.
Run Theory Outputs (Schemas)
Run theory functions operate on a single-location time series (DataArray/Series) or on a gridded index for period statistics.
1) Event-based extraction: identify_events(...)
Returns: pandas.DataFrame, one row per event
| Column | Meaning |
|---|---|
| `event_id` | sequential event number |
| `start_date` / `end_date` | event boundaries |
| `duration` | number of time steps in event |
| `magnitude` | cumulative magnitude (total deviation from threshold) |
| `intensity` | magnitude / duration |
| `peak` | most extreme index value during event |
| `peak_date` | when peak occurs |
| `interarrival` | time since previous event start |
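Since the result is an ordinary DataFrame, standard pandas operations apply; a sketch using a hand-built table with the column names from the schema above:

```python
import pandas as pd

# Toy events table following the identify_events() schema
events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "duration": [2, 5, 8],
    "magnitude": [1.1, 4.0, 9.6],
})
events["intensity"] = events["magnitude"] / events["duration"]

# Keep only events lasting at least 3 steps, most severe first
major = events[events["duration"] >= 3].sort_values("magnitude", ascending=False)
print(major["event_id"].tolist())  # [3, 2]
```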
2) Monitoring time series: calculate_timeseries(...)
Returns: pandas.DataFrame, one row per time step
| Column | Meaning |
|---|---|
| `index_value` | SPI/SPEI value at this time step |
| `is_event` | whether threshold condition is active (True/False) |
| `event_id` | current event number (or 0 if no event) |
| `duration` | current duration so far (resets between events) |
| `magnitude_cumulative` | accumulated magnitude so far (always increases during event) |
| `magnitude_instantaneous` | current-step severity (rises, peaks, falls) |
| `intensity` | magnitude_cumulative / duration |
- `magnitude_cumulative` = total accumulated deviation; like debt, it always grows during an event
- `magnitude_instantaneous` = current month's severity; like crop NDVI, it shows event evolution
See Magnitude Explained for detailed comparison.
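The distinction can be reproduced from a raw index series; a sketch with a threshold of -1.0 (illustrative logic, not the library's implementation):

```python
import pandas as pd

spi = pd.Series([0.5, -1.2, -1.8, -0.4, 0.2])
threshold = -1.0

is_event = spi <= threshold
# Instantaneous magnitude: deviation below threshold at each step
mag_inst = (threshold - spi).where(is_event, 0.0)
# Cumulative magnitude: running total that resets when the event ends
event_group = (~is_event).cumsum()
mag_cum = mag_inst.groupby(event_group).cumsum()
print([round(v, 2) for v in mag_cum])  # [0.0, 0.2, 1.0, 0.0, 0.0]
```

The instantaneous column rises and falls with the index, while the cumulative column only grows until the event ends, which is the "debt" behavior described above.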
3) Gridded period statistics: calculate_period_statistics(...)
Returns: xarray.Dataset with dimensions (lat, lon), containing:
| Variable | Description | Units |
|---|---|---|
| `num_events` | Number of dry/wet events | count |
| `total_event_months` | Total months spent in events | months |
| `total_magnitude` | Sum of all event magnitudes | index units |
| `mean_magnitude` | Average magnitude per event | index units |
| `max_magnitude` | Largest single event magnitude | index units |
| `worst_peak` | Most extreme SPI/SPEI value | index units |
| `mean_intensity` | Average intensity across events | index units/month |
| `max_intensity` | Maximum intensity from any event | index units/month |
| `pct_time_in_event` | Percentage of period spent in events | % |
Where to go next
- Want “how-to”? → Quick Start and Tutorials
- Want meaning/interpretation? → User Guide
- Want scientific detail? → Methodology
- Want distribution theory? → Distributions
See Also
- Installation - Setup and dependencies
- Quick Start - First SPI/SPEI calculation
- Configuration - Customize settings
- Examples - Common use cases and code snippets