Data Model & Outputs
Input contract, calibration conventions, and what functions return
Purpose
This page is the practical bridge between:
- Quick Start / Tutorials (how to run),
- User Guide (how to interpret), and
- Technical Docs (why the math and design choices work).
If you are unsure whether your data is “ready” for SPI/SPEI or what a function returns (DataArray vs Dataset vs DataFrame), start here.
Input Data Contract
Supported containers
- Primary: `xarray.DataArray` / `xarray.Dataset`
- Single-location analysis: also supports `pandas.Series` for run theory functions
Required coordinates and dimensions
Your gridded inputs must include:
- `time`
- `lat`
- `lon`
Dimension order can vary. Preferred is (time, lat, lon).
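Dimension order can be normalized up front; a minimal sketch with a toy array (dimension names as required above, data illustrative):

```python
import numpy as np
import xarray as xr

# Toy DataArray stored in (lon, lat, time) order
da = xr.DataArray(
    np.random.rand(4, 3, 24),
    dims=("lon", "lat", "time"),
)

# Reorder to the preferred (time, lat, lon)
da = da.transpose("time", "lat", "lon")
print(da.dims)  # ('time', 'lat', 'lon')
```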
Temporal resolution and periodicity
- Your data can be monthly or daily.
- Always set `periodicity` consistently with your time axis.
- Time scales (e.g., `scale=12`) represent aggregation at the chosen periodicity.
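For intuition, `scale=12` on monthly data corresponds to a 12-step rolling accumulation before the distribution fit; a rough stand-in for that aggregation step (not the library's internal code):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Constant 100 mm/month series, 3 years of monthly data
time = pd.date_range("2000-01-01", periods=36, freq="MS")
precip = xr.DataArray(np.full(36, 100.0), dims="time", coords={"time": time})

# 12-month accumulation: each value is the total of the trailing 12 months
precip_12 = precip.rolling(time=12).sum()
print(float(precip_12[-1]))  # 1200.0 for a constant 100 mm/month series
```

The first 11 steps are NaN because a full 12-month window is not yet available, which is why standardized indices at longer scales start later than the raw record.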
Variable requirements
| Index | Required inputs | Units (expected) |
|---|---|---|
| SPI | precipitation | mm/month or mm/day |
| SPEI | precipitation + PET or precipitation + temperature + latitude | precip: mm/*, PET: mm/*, temp: °C |
SPEI: Three input paths
SPEI supports three ways to provide evapotranspiration data:
- **Pre-computed PET**: pass `pet=your_pet_data` directly (any PET method)
- **Temperature + latitude (Thornthwaite)**: pass `temperature=` and `latitude=`; PET is computed internally using the Thornthwaite method (requires only mean temperature)
- **Temperature + Tmin/Tmax + latitude (Hargreaves)**: pass `temperature=`, `temp_min=`, `temp_max=`, and `latitude=`; PET is computed using the Hargreaves-Samani method (better for arid/semi-arid regions)
```python
# Path 1: Pre-computed PET
spei_12 = spei(precip, pet=pet, scale=12)

# Path 2: Auto-PET from temperature (Thornthwaite - default)
spei_12 = spei(precip, temperature=temp, latitude=lat, scale=12)

# Path 3: Auto-PET using Hargreaves (requires Tmin/Tmax)
spei_12 = spei(precip, temperature=temp, latitude=lat, scale=12,
               pet_method='hargreaves', temp_min=tmin, temp_max=tmax)
```

PET Method Selection
| Method | Data Requirements | Best For | Reference |
|---|---|---|---|
| Thornthwaite (default) | Tmean, latitude | Humid/temperate climates | Thornthwaite (1948) |
| Hargreaves-Samani | Tmean, Tmin, Tmax, latitude | Arid/semi-arid regions | Hargreaves & Samani (1985) |
If you have TerraClimate or ERA5-Land data with Tmin/Tmax available, Hargreaves often correlates better with physically-based PET (like Penman-Monteith). Testing on Bali showed Hargreaves correlation r=0.80 vs Thornthwaite r=0.75 when compared to TerraClimate PET.
Variable auto-detection
When you pass an xarray.Dataset (rather than a DataArray), the code searches for variables matching these name patterns:
| Variable type | Auto-detected names |
|---|---|
| Precipitation | precip, prcp, precipitation, pr, ppt, rainfall |
| PET | pet, eto, et, evap |
| Temperature (mean) | temp, tas, t2m, tmean |
| Temperature (min) | tmin, tasmin |
| Temperature (max) | tmax, tasmax |
You can always override auto-detection with explicit parameters: precip_var_name=, pet_var_name=, temp_var_name=.
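If you want to predict which variable would be picked up, the name matching can be emulated; a sketch of the lookup logic using the precipitation aliases from the table above (not the library's actual resolver):

```python
import numpy as np
import xarray as xr

# Precipitation aliases listed in the auto-detection table
PRECIP_NAMES = {"precip", "prcp", "precipitation", "pr", "ppt", "rainfall"}

def detect_precip_var(ds):
    """Return the first data variable matching a known precipitation alias, else None."""
    for name in ds.data_vars:
        if str(name).lower() in PRECIP_NAMES:
            return str(name)
    return None

ds = xr.Dataset({"pr": ("time", np.zeros(3)), "elevation": ("y", np.zeros(2))})
print(detect_precip_var(ds))  # 'pr'
```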
Missing values and data quality
- SPI/SPEI are only meaningful when the underlying record is reasonably complete.
- Recommended minimum for robust statistics: ~30 years of monthly data.
- Ensure basic QC: handle impossible negative values, unit mistakes, and time gaps before fitting.
Distribution fitting thresholds
The code enforces these minimums per grid cell during calibration. If a cell fails any check, its output is set to NaN:
| Check | Threshold | What happens if violated |
|---|---|---|
| Minimum valid values | 30 | Fitting skipped (NaN output) |
| Minimum non-zero values | 10 | Fitting skipped (NaN output) |
| Maximum zero proportion | 95% | Fitting skipped (NaN output) |
Problem: In deserts and semi-arid regions, many months have zero precipitation. If >95% of months are zero, fitting fails entirely (NaN output). Even with 80-95% zeros, fitting quality degrades.
Why this matters: Pearson Type III distribution requires at least 4 non-zero values for L-moments computation. When this fails, the code falls back to Method of Moments, then to a normal approximation — but results may be unreliable.
Diagnosis: Use diagnose_data() before running SPI/SPEI to check zero proportions:
```python
from distributions import diagnose_data

# Check a single location
ts = precip.isel(lat=50, lon=100).values
diag = diagnose_data(ts)
print(f"Zero proportion: {diag.zero_proportion:.1%}")
print(f"Recommendation: {diag.recommendation}")
```

Options for arid regions:
- Accept NaN — If a region is truly hyper-arid, SPI may not be meaningful there
- Use longer time scales — SPI-12 or SPI-24 aggregates more precipitation, reducing zero proportion
- Mask desert pixels — Exclude cells with mean annual precipitation < 50 mm
- Use Gamma distribution — More robust than Pearson III for high-zero data (but still limited)
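Option 3 above (masking desert pixels) is straightforward with `xarray.where`; a sketch assuming monthly precipitation in mm:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("1991-01-01", periods=360, freq="MS")
vals = np.full((360, 2, 2), 10.0)  # 10 mm/month everywhere (120 mm/yr)...
vals[:, 0, 0] = 1.0                # ...except one hyper-arid pixel (12 mm/yr)
precip = xr.DataArray(vals, dims=("time", "lat", "lon"), coords={"time": time})

# Mean annual precipitation per pixel = mean monthly value * 12
map_mm = precip.mean("time") * 12

# Mask pixels drier than 50 mm/year before computing SPI
precip_masked = precip.where(map_mm >= 50)
print(bool(precip_masked.isel(lat=0, lon=0).isnull().all()))  # True (masked out)
```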
A quick validation pass before fitting:

```python
import xarray as xr

precip = xr.open_dataset('input/precip.nc')['precip']

# 1. Dimensions
print(f"Dims: {precip.dims}")  # Should include time, lat, lon

# 2. Non-negative
print(f"Min: {precip.min().values}")  # Should be >= 0

# 3. Reasonable range
print(f"Max: {precip.max().values}")  # Should be < 1000 mm/month

# 4. Time span
years = precip.time.dt.year
print(f"Period: {int(years.min())}-{int(years.max())}")  # >= 30 years

# 5. Missing data
missing_pct = float(precip.isnull().mean() * 100)
print(f"Missing: {missing_pct:.1f}%")  # Should be < 5%
```

Calibration and Scale Conventions
Calibration period
- Typical default: 1991-2020 (WMO-style baseline)
- Calibration should overlap your data range
- Calibration is applied per grid cell (each pixel fitted independently)
- Update every 10 years when WMO revises the normal period
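Whatever parameter the API uses to set the baseline, it is worth confirming your record actually covers it; a generic check for a 1991-2020 window:

```python
import numpy as np
import pandas as pd
import xarray as xr

# 40 years of monthly data, 1985-2024
time = pd.date_range("1985-01-01", periods=480, freq="MS")
precip = xr.DataArray(np.ones(480), dims="time", coords={"time": time})

# Does the record fully cover the 1991-2020 calibration window?
calib = precip.sel(time=slice("1991-01-01", "2020-12-31"))
print(calib.sizes["time"])  # 360 months = 30 full years
```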
Distribution selection
Five probability distributions are supported via the distribution parameter:
| Distribution | Key | Parameters | Recommended for |
|---|---|---|---|
| Gamma | `'gamma'` | `alpha`, `beta`, `prob_zero` | SPI (default) |
| Pearson Type III | `'pearson3'` | `skew`, `loc`, `scale`, `prob_zero` | SPEI |
| Log-Logistic | `'log_logistic'` | `alpha`, `beta`, `prob_zero` | SPEI (alternative) |
| GEV | `'gev'` | `shape`, `loc`, `scale`, `prob_zero` | Extreme value analysis |
| Generalized Logistic | `'gen_logistic'` | `shape`, `loc`, `scale`, `prob_zero` | Heavy-tailed data |
Default: 'gamma' for both SPI and SPEI. Recommendation: Gamma for SPI, Pearson III or Log-Logistic for SPEI.
See Probability Distributions for fitting methods, goodness-of-fit tests, and detailed guidance.
Zero precipitation handling (SPI)
SPI uses a mixed distribution when zeros exist:
\[H(x) = q + (1-q) \cdot G(x)\]
Where \(q\) = proportion of zeros in the calibration period, \(G(x)\) = fitted continuous distribution for positive values. See Methodology for the full derivation.
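A worked example: with q = 0.3 and G(x) = 0.5 at some observed accumulation, H(x) = 0.3 + 0.7 · 0.5 = 0.65, and the index value is the standard normal quantile of that probability. A sketch using `scipy.stats`:

```python
from scipy.stats import norm

q = 0.3      # proportion of zero-precipitation months in calibration
G_x = 0.5    # fitted continuous CDF evaluated at the observed accumulation

# Mixed CDF: H(x) = q + (1 - q) * G(x)
H_x = q + (1 - q) * G_x          # 0.65

# SPI is the standard normal quantile of the mixed probability
spi_value = norm.ppf(H_x)
print(round(spi_value, 3))       # 0.385
```

Note that even the driest possible observation (x = 0, G(x) = 0) maps to H = q, so SPI in high-zero regions is bounded below, which is one reason hyper-arid cells are better masked out.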
Parameter reuse (recommended for operational workflows)
For repeated runs (e.g., adding new months), you can:
- Fit once on a calibration period
- Save fitted parameters
- Reuse them later to avoid refitting
```python
from indices import spi, save_fitting_params, load_fitting_params

# One-time: fit and save
spi_12, params = spi(precip, scale=12, return_params=True)
save_fitting_params(params, 'spi_12_params.nc',
                    scale=12, periodicity='monthly', index_type='spi')

# Operational: load and reuse (much faster)
params = load_fitting_params('spi_12_params.nc', scale=12, periodicity='monthly')
spi_12 = spi(new_precip, scale=12, fitting_params=params)
```

This is especially important for large domains.
Output Contract (What You Get Back)
SPI / SPEI single-scale outputs
Return type: xarray.DataArray
- Dimensions: `(time, lat, lon)`
- Values: standardized index (typically within [-3.09, +3.09], clipped at bounds)
- Variable name: follows the pattern `{index}_{distribution}_{scale}_{periodicity}`
- Examples: `spi_gamma_12_month`, `spei_pearson3_3_month`
Attributes on the DataArray:
| Attribute | Example value |
|---|---|
| `scale` | `12` |
| `distribution` | `'gamma'` |
| `calibration_start_year` | `1991` |
| `calibration_end_year` | `2020` |
Multi-scale outputs
Return type: xarray.Dataset with one variable per scale:
- `spi_gamma_1_month`, `spi_gamma_3_month`, `spi_gamma_12_month`, …
- `spei_pearson3_3_month`, `spei_pearson3_12_month`, …
NetCDF file encoding
When saved via save_index_to_netcdf() or .to_netcdf():
| Property | Value |
|---|---|
| Data type | float32 |
| Fill value | -9999.0 |
| Compression | zlib, level 4-5 |
| Conventions | CF-1.8 |
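If you save with plain `.to_netcdf()` rather than `save_index_to_netcdf()`, the same encoding can be reproduced explicitly; a sketch of the equivalent encoding dict (the write call is commented out because it needs a netCDF backend such as the netCDF4 package):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=12, freq="MS")
spi_12 = xr.DataArray(
    np.zeros((12, 2, 2), dtype="float32"),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="spi_gamma_12_month",
)

# Encoding matching the defaults above: float32, -9999.0 fill, zlib level 4
encoding = {spi_12.name: {
    "dtype": "float32",
    "_FillValue": -9999.0,
    "zlib": True,
    "complevel": 4,
}}

# spi_12.to_dataset().to_netcdf("spi_12.nc", encoding=encoding)
```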
Global attributes (customizable via global_attrs or DEFAULT_METADATA):
```python
from config import DEFAULT_METADATA

# Set once at script start — applies to all subsequent outputs
DEFAULT_METADATA['institution'] = 'My University'
DEFAULT_METADATA['references'] = 'McKee et al. (1993)'
DEFAULT_METADATA['comment'] = 'Regional drought monitoring project'
```

Available fields: `institution`, `source`, `Conventions`, `references`, `comment`. You can also add custom fields (e.g., `project`, `contact`, `license`) via the `global_attrs` parameter.
Run Theory Outputs (Schemas)
Run theory functions operate on a single-location time series (DataArray/Series) or on a gridded index for period statistics.
1) Event-based extraction: identify_events(...)
Returns: pandas.DataFrame, one row per event
| Column | Meaning |
|---|---|
| `event_id` | sequential event number |
| `start_date` / `end_date` | event boundaries |
| `duration` | number of time steps in event |
| `magnitude` | cumulative magnitude (total deviation from threshold) |
| `intensity` | magnitude / duration |
| `peak` | most extreme index value during event |
| `peak_date` | when peak occurs |
| `interarrival` | time since previous event start |
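Since the result is an ordinary DataFrame, standard pandas operations apply; a sketch using a hand-built table with the column names from the schema above:

```python
import pandas as pd

# Toy events table following the identify_events() schema
events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "duration": [2, 5, 8],
    "magnitude": [1.1, 4.0, 9.6],
})
events["intensity"] = events["magnitude"] / events["duration"]

# Keep only events lasting at least 3 steps, most severe first
major = events[events["duration"] >= 3].sort_values("magnitude", ascending=False)
print(major["event_id"].tolist())  # [3, 2]
```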
2) Monitoring time series: calculate_timeseries(...)
Returns: pandas.DataFrame, one row per time step
| Column | Meaning |
|---|---|
| `index_value` | SPI/SPEI value at this time step |
| `is_event` | whether threshold condition is active (True/False) |
| `event_id` | current event number (or 0 if no event) |
| `duration` | current duration so far (resets between events) |
| `magnitude_cumulative` | accumulated magnitude so far (always increases during event) |
| `magnitude_instantaneous` | current-step severity (rises, peaks, falls) |
| `intensity` | magnitude_cumulative / duration |
- `magnitude_cumulative` = total accumulated deviation; like debt, it always grows during an event
- `magnitude_instantaneous` = current month's severity; like crop NDVI, it shows event evolution
See Magnitude Explained for detailed comparison.
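The distinction can be reproduced from a raw index series; a sketch with a threshold of -1.0 (illustrative logic, not the library's implementation):

```python
import pandas as pd

spi = pd.Series([0.5, -1.2, -1.8, -0.4, 0.2])
threshold = -1.0

is_event = spi <= threshold
# Instantaneous magnitude: deviation below threshold at each step
mag_inst = (threshold - spi).where(is_event, 0.0)
# Cumulative magnitude: running total that resets when the event ends
event_group = (~is_event).cumsum()
mag_cum = mag_inst.groupby(event_group).cumsum()
print([round(v, 2) for v in mag_cum])  # [0.0, 0.2, 1.0, 0.0, 0.0]
```

The instantaneous column rises and falls with the index, while the cumulative column only grows until the event ends, which is the "debt" behavior described above.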
3) Gridded period statistics: calculate_period_statistics(...)
Returns: xarray.Dataset with dimensions (lat, lon), containing:
| Variable | Description | Units |
|---|---|---|
| `num_events` | Number of dry/wet events | count |
| `total_event_months` | Total months spent in events | months |
| `total_magnitude` | Sum of all event magnitudes | index units |
| `mean_magnitude` | Average magnitude per event | index units |
| `max_magnitude` | Largest single event magnitude | index units |
| `worst_peak` | Most extreme SPI/SPEI value | index units |
| `mean_intensity` | Average intensity across events | index units/month |
| `max_intensity` | Maximum intensity from any event | index units/month |
| `pct_time_in_event` | Percentage of period spent in events | % |
Where to go next
- Want “how-to”? → Quick Start and Tutorials
- Want meaning/interpretation? → User Guide
- Want scientific detail? → Methodology
- Want distribution theory? → Distributions
See Also
- Installation - Setup and dependencies
- Quick Start - First SPI/SPEI calculation
- Configuration - Customize settings
- Examples - Common use cases and code snippets