Data Model & Outputs

Input contract, calibration conventions, and what functions return

Purpose

This page is the practical bridge between:

  • Quick Start / Tutorials (how to run),
  • User Guide (how to interpret), and
  • Technical Docs (why the math and design choices work).

If you are unsure whether your data is “ready” for SPI/SPEI or what a function returns (DataArray vs Dataset vs DataFrame), start here.


Input Data Contract

Supported containers

  • Primary: xarray.DataArray / xarray.Dataset
  • Single-location analysis: also supports pandas.Series for run theory functions

Required coordinates and dimensions

Your gridded inputs must include:

  • time
  • lat
  • lon

Dimension order can vary. Preferred is (time, lat, lon).

Temporal resolution and periodicity

  • Your data can be monthly or daily.
  • Always set periodicity consistently with your time axis.
  • Time scales (e.g., scale=12) represent aggregation at the chosen periodicity.

Variable requirements

Index Required inputs Units (expected)
SPI precipitation mm/month or mm/day
SPEI precipitation + PET or precipitation + temperature + latitude precip: mm/*, PET: mm/*, temp: °C

SPEI: Three input paths

SPEI supports three ways to provide evapotranspiration data:

  1. Pre-computed PET — pass pet=your_pet_data directly (any PET method)
  2. Temperature + latitude (Thornthwaite) — pass temperature= and latitude=; PET is computed internally using the Thornthwaite method (requires only mean temperature)
  3. Temperature + Tmin/Tmax + latitude (Hargreaves) — pass temperature=, temp_min=, temp_max=, and latitude=; PET is computed using the Hargreaves-Samani method (better for arid/semi-arid regions)
# Path 1: Pre-computed PET
spei_12 = spei(precip, pet=pet, scale=12)

# Path 2: Auto-PET from temperature (Thornthwaite - default)
spei_12 = spei(precip, temperature=temp, latitude=lat, scale=12)

# Path 3: Auto-PET using Hargreaves (requires Tmin/Tmax)
spei_12 = spei(precip, temperature=temp, latitude=lat, scale=12,
               pet_method='hargreaves', temp_min=tmin, temp_max=tmax)

PET Method Selection

Method Data Requirements Best For Reference
Thornthwaite (default) Tmean, latitude Humid/temperate climates Thornthwaite (1948)
Hargreaves-Samani Tmean, Tmin, Tmax, latitude Arid/semi-arid regions Hargreaves & Samani (1985)
TipWhen to use Hargreaves

If you have TerraClimate or ERA5-Land data with Tmin/Tmax available, Hargreaves often correlates better with physically-based PET (like Penman-Monteith). Testing on Bali showed Hargreaves correlation r=0.80 vs Thornthwaite r=0.75 when compared to TerraClimate PET.

Variable auto-detection

When you pass an xarray.Dataset (rather than a DataArray), the code searches for variables matching these name patterns:

Variable type Auto-detected names
Precipitation precip, prcp, precipitation, pr, ppt, rainfall
PET pet, eto, et, evap
Temperature (mean) temp, tas, t2m, tmean
Temperature (min) tmin, tasmin
Temperature (max) tmax, tasmax

You can always override auto-detection with explicit parameters: precip_var_name=, pet_var_name=, temp_var_name=.

Missing values and data quality

  • SPI/SPEI are only meaningful when the underlying record is reasonably complete.
  • Recommended minimum for robust statistics: ~30 years of monthly data.
  • Ensure basic QC: impossible negatives, unit mistakes, and time gaps should be handled before fitting.

Distribution fitting thresholds

The code enforces these minimums per grid cell during calibration. If a cell fails any check, its output is set to NaN:

Check Threshold What happens if violated
Minimum valid values 30 Fitting skipped (NaN output)
Minimum non-zero values 10 Fitting skipped (NaN output)
Maximum zero proportion 95% Fitting skipped (NaN output)
WarningArid regions and zero precipitation

Problem: In deserts and semi-arid regions, many months have zero precipitation. If >95% of months are zero, fitting fails entirely (NaN output). Even with 80-95% zeros, fitting quality degrades.

Why this matters: Pearson Type III distribution requires at least 4 non-zero values for L-moments computation. When this fails, the code falls back to Method of Moments, then to a normal approximation — but results may be unreliable.

Diagnosis: Use diagnose_data() before running SPI/SPEI to check zero proportions:

from distributions import diagnose_data

# Check a single location
ts = precip.isel(lat=50, lon=100).values
diag = diagnose_data(ts)
print(f"Zero proportion: {diag.zero_proportion:.1%}")
print(f"Recommendation: {diag.recommendation}")

Options for arid regions:

  1. Accept NaN — If a region is truly hyper-arid, SPI may not be meaningful there
  2. Use longer time scales — SPI-12 or SPI-24 aggregates more precipitation, reducing zero proportion
  3. Mask desert pixels — Exclude cells with mean annual precipitation < 50 mm
  4. Use Gamma distribution — More robust than Pearson III for high-zero data (but still limited)
TipQuick sanity checks (recommended)
import xarray as xr

precip = xr.open_dataset('input/precip.nc')['precip']

# 1. Dimensions
print(f"Dims: {precip.dims}")          # Should include time, lat, lon

# 2. Non-negative
print(f"Min: {precip.min().values}")   # Should be >= 0

# 3. Reasonable range
print(f"Max: {precip.max().values}")   # Should be < 1000 mm/month

# 4. Time span
years = precip.time.dt.year
print(f"Period: {int(years.min())}-{int(years.max())}")  # >= 30 years

# 5. Missing data
missing_pct = float(precip.isnull().mean() * 100)
print(f"Missing: {missing_pct:.1f}%")  # Should be < 5%

Calibration and Scale Conventions

Calibration period

  • Typical default: 1991-2020 (WMO-style baseline)
  • Calibration should overlap your data range
  • Calibration is applied per grid cell (each pixel fitted independently)
  • Update every 10 years when WMO revises the normal period

Distribution selection

Five probability distributions are supported via the distribution parameter:

Distribution Key Parameters Recommended for
Gamma 'gamma' alpha, beta, prob_zero SPI (default)
Pearson Type III 'pearson3' skew, loc, scale, prob_zero SPEI
Log-Logistic 'log_logistic' alpha, beta, prob_zero SPEI (alternative)
GEV 'gev' shape, loc, scale, prob_zero Extreme value analysis
Generalized Logistic 'gen_logistic' shape, loc, scale, prob_zero Heavy-tailed data

Default: 'gamma' for both SPI and SPEI. Recommendation: Gamma for SPI, Pearson III or Log-Logistic for SPEI.

See Probability Distributions for fitting methods, goodness-of-fit tests, and detailed guidance.

Zero precipitation handling (SPI)

SPI uses a mixed distribution when zeros exist:

\[H(x) = q + (1-q) \cdot G(x)\]

Where \(q\) = proportion of zeros in the calibration period, \(G(x)\) = fitted continuous distribution for positive values. See Methodology for the full derivation.

Output Contract (What You Get Back)

SPI / SPEI single-scale outputs

Return type: xarray.DataArray

  • Dimensions: (time, lat, lon)
  • Values: standardized index (typically within [-3.09, +3.09], clipped at bounds)
  • Variable name: follows the pattern {index}_{distribution}_{scale}_{periodicity}
    • Examples: spi_gamma_12_month, spei_pearson3_3_month

Attributes on the DataArray:

Attribute Example value
scale 12
distribution 'gamma'
calibration_start_year 1991
calibration_end_year 2020

Multi-scale outputs

Return type: xarray.Dataset with one variable per scale:

  • spi_gamma_1_month, spi_gamma_3_month, spi_gamma_12_month, …
  • spei_pearson3_3_month, spei_pearson3_12_month, …

NetCDF file encoding

When saved via save_index_to_netcdf() or .to_netcdf():

Property Value
Data type float32
Fill value -9999.0
Compression zlib, level 4-5
Conventions CF-1.8

Global attributes (customizable via global_attrs or DEFAULT_METADATA):

from config import DEFAULT_METADATA

# Set once at script start — applies to all subsequent outputs
DEFAULT_METADATA['institution'] = 'My University'
DEFAULT_METADATA['references'] = 'McKee et al. (1993)'
DEFAULT_METADATA['comment'] = 'Regional drought monitoring project'

Available fields: institution, source, Conventions, references, comment. You can also add custom fields (e.g., project, contact, license) via the global_attrs parameter.


Run Theory Outputs (Schemas)

Run theory functions operate on a single-location time series (DataArray/Series) or on a gridded index for period statistics.

1) Event-based extraction: identify_events(...)

Returns: pandas.DataFrame, one row per event

Column Meaning
event_id sequential event number
start_date / end_date event boundaries
duration number of time steps in event
magnitude cumulative magnitude (total deviation from threshold)
intensity magnitude / duration
peak most extreme index value during event
peak_date when peak occurs
interarrival time since previous event start

2) Monitoring time series: calculate_timeseries(...)

Returns: pandas.DataFrame, one row per time step

Column Meaning
index_value SPI/SPEI value at this time step
is_event whether threshold condition is active (True/False)
event_id current event number (or 0 if no event)
duration current duration so far (resets between events)
magnitude_cumulative accumulated magnitude so far (always increases during event)
magnitude_instantaneous current-step severity (varies — rises, peaks, falls)
intensity magnitude_cumulative / duration
NoteTwo magnitude columns
  • magnitude_cumulative = total accumulated deviation; like debt — always grows during an event
  • magnitude_instantaneous = current month’s severity; like crop NDVI — shows event evolution

See Magnitude Explained for detailed comparison.

3) Gridded period statistics: calculate_period_statistics(...)

Returns: xarray.Dataset with dimensions (lat, lon), containing:

Variable Description Units
num_events Number of dry/wet events count
total_event_months Total months spent in events months
total_magnitude Sum of all event magnitudes index units
mean_magnitude Average magnitude per event index units
max_magnitude Largest single event magnitude index units
worst_peak Most extreme SPI/SPEI value index units
mean_intensity Average intensity across events index units/month
max_intensity Maximum intensity from any event index units/month
pct_time_in_event Percentage of period spent in events %

Where to go next


See Also

Back to top