<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Benny Istanto&#39;s Blog</title>
<link>https://benny.istan.to/site/blog.html</link>
<atom:link href="https://benny.istan.to/site/blog.xml" rel="self" type="application/rss+xml"/>
<description>Exploring climate, GIS, and data science - with stories from work, family moments, and journeys around the world</description>
<generator>quarto-1.8.27</generator>
<lastBuildDate>Fri, 28 Feb 2025 00:00:00 GMT</lastBuildDate>
<item>
  <title>Word Clock</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20250228-word-clock.html</link>
  <description><![CDATA[ 





<p>Ever wondered what time it would be if clocks could speak? This Word Clock translates the current time into natural language phrases that we use in everyday conversation. Instead of looking at hands or digits, you simply read the highlighted words to tell the time!</p>
<p><em>“IT IS TWENTY MINUTES PAST FOUR” or “IT IS QUARTER TO SEVEN”</em> - just like you’d say it to a friend.</p>
<p>The clock updates every minute, automatically highlighting the words that form the correct time phrase. Notice how at different times, different word combinations light up to create readable sentences.</p>
<p>It’s a fun, more human way to experience time passing. Enjoy watching the words change as the minutes tick by!</p>
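<p>As an illustration, here is a minimal Python sketch of the phrasing idea. The rounding and wording rules below are my own assumption for illustration; they are not the exact logic of the Observable notebook.</p>

```python
# Hypothetical sketch of a word clock's phrasing rules (not the notebook's code).
NUMBERS = ["twelve", "one", "two", "three", "four", "five", "six",
           "seven", "eight", "nine", "ten", "eleven"]
MINUTE_WORDS = {5: "five minutes", 10: "ten minutes", 15: "quarter",
                20: "twenty minutes", 25: "twenty five minutes", 30: "half"}

def time_phrase(hour, minute):
    """Render a clock time as the phrase a word clock would highlight."""
    minute = (minute // 5) * 5          # word clocks round to 5-minute steps
    if minute == 0:
        return f"IT IS {NUMBERS[hour % 12]} O'CLOCK".upper()
    if minute <= 30:                    # "... past <this hour>"
        amount, link, h = MINUTE_WORDS[minute], "past", hour
    else:                               # "... to <next hour>"
        amount, link, h = MINUTE_WORDS[60 - minute], "to", hour + 1
    return f"IT IS {amount} {link} {NUMBERS[h % 12]}".upper()
```

<p>For example, <code>time_phrase(4, 20)</code> gives “IT IS TWENTY MINUTES PAST FOUR” and <code>time_phrase(6, 45)</code> gives “IT IS QUARTER TO SEVEN”.</p>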
<p><strong>Inspiration</strong></p>
<p>Inspired by the Word Clock sold by <a href="https://www.walmart.com/ip/The-Word-Clock-Shows-The-Time-In-A-Sentence/102811752">Walmart - The Word Clock - Shows The Time In A Sentence</a></p>
<p><a href="../assets/image-blog/20250228-word-clock-01.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20250228-word-clock-01.jpg" class="img-fluid"></a></p>
<p>See it at Observable: <a href="https://observablehq.com/@bennyistanto/word-clock" class="uri">https://observablehq.com/@bennyistanto/word-clock</a></p>



]]></description>
  <category>Data Science</category>
  <category>General</category>
  <guid>https://benny.istan.to/site/blog/20250228-word-clock.html</guid>
  <pubDate>Fri, 28 Feb 2025 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20250228-word-clock-01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>xkcd style for Country map</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20240525-xkcd-style-for-country-map.html</link>
  <description><![CDATA[ 





<p>Source: <a href="https://gist.github.com/bennyistanto/7b391b11e861334bc020dd03c06815f2" class="uri">https://gist.github.com/bennyistanto/7b391b11e861334bc020dd03c06815f2</a></p>



]]></description>
  <category>General</category>
  <category>GIS</category>
  <guid>https://benny.istan.to/site/blog/20240525-xkcd-style-for-country-map.html</guid>
  <pubDate>Sun, 26 May 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>xkcd style for LSEQM illustration</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20240523-xkcd-style-for-lseqm-illustration.html</link>
  <description><![CDATA[ 





<p>Source: <a href="https://gist.github.com/bennyistanto/b9e5d9c932dc6b034f559deaa26e2743" class="uri">https://gist.github.com/bennyistanto/b9e5d9c932dc6b034f559deaa26e2743</a></p>



]]></description>
  <category>General</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20240523-xkcd-style-for-lseqm-illustration.html</guid>
  <pubDate>Thu, 23 May 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Skip PEARSON fitting on climate-indices python package</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20240503-skip-pearson-fitting-on-climate-indices-python-package.html</link>
  <description><![CDATA[ 





<p>Source: <a href="https://gist.github.com/bennyistanto/e8710f89bfbebaf24498dd957a1fa961" class="uri">https://gist.github.com/bennyistanto/e8710f89bfbebaf24498dd957a1fa961</a></p>



]]></description>
  <category>Remote Sensing</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20240503-skip-pearson-fitting-on-climate-indices-python-package.html</guid>
  <pubDate>Fri, 03 May 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Utilizing CUDA</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20240416-utilizing-cuda.html</link>
  <description><![CDATA[ 





<p>This week I set up CUDA on my desktop to support upcoming heavy geospatial and climate analytics. It was a bit tricky, but I managed to install it on both Windows 11 and WSL2 Debian 12. See below.</p>
<section id="install-cuda-and-cudnn-using-conda" class="level3">
<h3 class="anchored" data-anchor-id="install-cuda-and-cudnn-using-conda">Install CUDA and cuDNN using Conda</h3>
<p>Tested on:</p>
<p>Windows 11 Pro for Workstations and WSL2 Debian 12<br>
Processor: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz (2 processors)<br>
Installed RAM: 384 GB<br>
VGA: NVIDIA Quadro P2000 5GB</p>
<hr>
<section id="install-the-gpu-driver" class="level4">
<h4 class="anchored" data-anchor-id="install-the-gpu-driver">1. Install the GPU driver</h4>
<p><strong>This step only applies to Windows</strong></p>
<p>Download and install the <a href="https://www.nvidia.com/download/index.aspx">NVIDIA Driver for GPU Support</a> to use with your existing CUDA ML workflows. In my case, I chose:</p>
<ul>
<li>Product type: NVIDIA RTX/Quadro</li>
<li>Product series: Quadro Series</li>
<li>Product: Quadro P2000</li>
<li>Operating System: Windows 11</li>
<li>Download Type: Production Branch/Studio</li>
<li>Language: English (US)</li>
</ul>
<p>Click Search, then Download, followed by Agree &amp; Download. This grabs a 483 MB file from <a href="https://us.download.nvidia.com/Windows/Quadro_Certified/551.86/551.86-quadro-rtx-desktop-notebook-win10-win11-64bit-international-dch-whql.exe" class="uri">https://us.download.nvidia.com/Windows/Quadro_Certified/551.86/551.86-quadro-rtx-desktop-notebook-win10-win11-64bit-international-dch-whql.exe</a>.</p>
<p>Next, run the installer and follow the steps until complete.</p>
<blockquote class="blockquote">
<p>Note</p>
<p><strong>This is the only driver we need to install. Do not install any Linux display driver in WSL.</strong></p>
<p>Reference: <a href="https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl-2" class="uri">https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl-2</a></p>
</blockquote>
<hr>
<p>Steps 2-7 below apply to both Windows and WSL.</p>
</section>
<section id="create-new-conda-environment" class="level4">
<h4 class="anchored" data-anchor-id="create-new-conda-environment">2. Create new Conda environment</h4>
<p>Open Anaconda Prompt on Windows or a terminal on WSL (both can live as tabs in the same Windows Terminal). Make sure no Conda environment is active by typing:</p>
<pre><code>conda deactivate</code></pre>
<p>Let’s create a new Conda environment called <code>cuda</code> with Python <code>3.11</code>:</p>
<pre><code>conda create -n cuda python==3.11</code></pre>
</section>
<section id="install-essential-python-package-for-geospatial-analysis-and-data-visualization" class="level4">
<h4 class="anchored" data-anchor-id="install-essential-python-package-for-geospatial-analysis-and-data-visualization">3. Install essential Python packages for geospatial analysis and data visualization</h4>
<p>I would like to use this <code>cuda</code> env for heavy geospatial and climate data processing, so I will install the Python <code>geospatial</code> <a href="https://geospatial.gishub.org/">package</a>:</p>
<pre><code>conda install -c conda-forge geospatial</code></pre>
<p>If needed, we can install other packages too, for example <code>cdo</code>, <code>nco</code>, <code>gdal</code>, <code>awscli</code>.</p>
<blockquote class="blockquote">
<p>The <code>cdo</code> package is only available in the Linux (WSL) environment.</p>
</blockquote>
</section>
<section id="install-cuda-toolkit" class="level4">
<h4 class="anchored" data-anchor-id="install-cuda-toolkit">4. Install CUDA toolkit</h4>
<p>Install <code>cudatoolkit v11.8.0</code> - <a href="https://anaconda.org/conda-forge/cudatoolkit" class="uri">https://anaconda.org/conda-forge/cudatoolkit</a></p>
<pre><code>conda install -c conda-forge cudatoolkit</code></pre>
</section>
<section id="install-cudnn" class="level4">
<h4 class="anchored" data-anchor-id="install-cudnn">5. Install cuDNN</h4>
<p>Install <code>cudnn v8.9.7</code> - <a href="https://anaconda.org/conda-forge/cudnn" class="uri">https://anaconda.org/conda-forge/cudnn</a></p>
<pre><code>conda install -c conda-forge cudnn</code></pre>
</section>
<section id="install-pytorch" class="level4">
<h4 class="anchored" data-anchor-id="install-pytorch">6. Install Pytorch</h4>
<p>Install Pytorch - <a href="https://pytorch.org/" class="uri">https://pytorch.org/</a></p>
<pre><code>conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia</code></pre>
</section>
<section id="install-tensorflow" class="level4">
<h4 class="anchored" data-anchor-id="install-tensorflow">7. Install Tensorflow</h4>
<p>Install TensorFlow 2.14.0, the last TensorFlow version compatible with CUDA 11.8. Reference: <a href="https://www.tensorflow.org/install/source#gpu" class="uri">https://www.tensorflow.org/install/source#gpu</a></p>
<pre><code>conda install -c conda-forge tensorflow=2.14.0=cuda118py311heb1bdc4_0</code></pre>
</section>
<section id="setting-the-library" class="level4">
<h4 class="anchored" data-anchor-id="setting-the-library">8. Setting the Library</h4>
<p><strong>This step only apply to WSL</strong></p>
<p>If we installed CUDA and cuDNN via Conda, we typically should not need to set <code>LD_LIBRARY_PATH</code> or <code>PATH</code> manually for these libraries, as many tutorials describe for system-wide installs, because Conda handles the environment setup for us.</p>
<p>However, if we encounter issues, such as errors about cuDNN not being registered correctly, we may still need to ensure that TensorFlow can find and use the correct libraries provided by the Conda environment.</p>
<p><strong>Why might we still need to set <code>LD_LIBRARY_PATH</code>?</strong></p>
<p>Even though Conda generally manages library paths internally, in some cases, especially when integrating complex software stacks like TensorFlow with GPU support, the automatic configuration might not work perfectly out of the box.</p>
<p><strong>Find the library paths</strong>: We can look for CUDA and cuDNN libraries within the Conda environment’s library directory:</p>
<pre><code>ls $CONDA_PREFIX/lib | grep libcudnn
ls $CONDA_PREFIX/lib | grep libcublas
ls $CONDA_PREFIX/lib | grep libcudart</code></pre>
<p><strong>Manually Set</strong> <code>LD_LIBRARY_PATH</code> (If Needed)</p>
<p>If we find that TensorFlow still fails to recognize these libraries despite them being present in the Conda environment, we might try setting <code>LD_LIBRARY_PATH</code> manually:</p>
<pre><code>export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH</code></pre>
<p>In my case, I have already set the path in <code>.zshrc</code>, so the step above is done:</p>
<pre><code># Anaconda 
# &gt;&gt;&gt; conda initialize &gt;&gt;&gt;
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/bennyistanto/anaconda3/bin/conda' 'shell.zsh' 'hook' 2&gt; /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/bennyistanto/anaconda3/etc/profile.d/conda.sh" ]; then
        . "/home/bennyistanto/anaconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/bennyistanto/anaconda3/bin:$PATH"
        export LD_LIBRARY_PATH="/home/bennyistanto/anaconda3/lib:$LD_LIBRARY_PATH"
    fi
fi
unset __conda_setup
# &lt;&lt;&lt; conda initialize &lt;&lt;&lt;</code></pre>
<p>Based on my <code>.zshrc</code> settings and the Conda environment settings, my <code>LD_LIBRARY_PATH</code> is already set to include the Conda libraries at <code>/home/bennyistanto/anaconda3/lib</code>. This should generally be sufficient for TensorFlow to locate and use the CUDA and cuDNN libraries installed via Conda, given that Conda typically manages its own library paths very well.</p>
<p><strong>Evaluation of Current Setup</strong></p>
<p>Since I’ve already set <code>LD_LIBRARY_PATH</code> in my <code>.zshrc</code>, TensorFlow should correctly recognize and utilize the CUDA and cuDNN libraries installed in my Conda environment, assuming there are no other conflicting settings or installations. The <code>LD_LIBRARY_PATH</code> in my <code>.zshrc</code> appears correctly configured to point to the general Conda library directory, but there are a few additional things we might consider:</p>
<p>Make sure we are still working inside the <code>cuda</code> environment.</p>
<p>If TensorFlow continues to have issues finding or correctly using the cuDNN libraries, we might consider adding a direct link to the specific CUDA and cuDNN library paths in <code>LD_LIBRARY_PATH</code> within our Conda activation scripts. We can modify the environment’s activation and deactivation scripts as follows:</p>
<ul>
<li><p><strong>Activate Script</strong> (<code>$CONDA_PREFIX/etc/conda/activate.d/env_vars.sh</code>):</p>
<pre><code>#! /bin/sh
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH</code></pre></li>
<li><p><strong>Deactivate Script</strong> (<code>$CONDA_PREFIX/etc/conda/deactivate.d/env_vars.sh</code>):</p>
<pre><code>#! /bin/sh
export LD_LIBRARY_PATH=$(echo $LD_LIBRARY_PATH | sed -e "s|$CONDA_PREFIX/lib:||g")</code></pre></li>
</ul>
<p>This explicitly ensures that our specific Conda environment’s library path is prioritized while the environment is active.</p>
<p>In my case (working inside the <code>cuda</code> environment), <code>$CONDA_PREFIX</code> is <code>/home/bennyistanto/anaconda3/envs/cuda</code>.</p>
<p>If the <code>env_vars.sh</code> file does not exist in both the <code>activate.d</code> and <code>deactivate.d</code> directories within our Conda environment, we should create them. These scripts are useful for setting up and tearing down environment variables each time we activate or deactivate our Conda environment. This ensures that any customizations to our environment variables are applied only within the context of that specific environment and are cleaned up afterwards.</p>
<p>Here’s how to create and use these scripts:</p>
<p><strong>Step 1: Create the Directories</strong></p>
<p>If the <code>activate.d</code> and <code>deactivate.d</code> directories don’t exist, we’ll need to create them first. Here’s how we can do it:</p>
<pre><code>mkdir -p $CONDA_PREFIX/etc/conda/activate.d
mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d</code></pre>
<p><strong>Step 2: Create the Activation Script</strong></p>
<p>Create the <code>env_vars.sh</code> script in the <code>activate.d</code> directory. This script will run every time we activate the environment.</p>
<ol type="1">
<li><p>Navigate to the directory:</p>
<pre><code>cd $CONDA_PREFIX/etc/conda/activate.d</code></pre></li>
<li><p>Create and edit the <code>env_vars.sh</code> file:</p>
<pre><code>nano env_vars.sh</code></pre></li>
<li><p>Add the following content to set up the <code>LD_LIBRARY_PATH</code>:</p>
<pre><code>#!/bin/sh
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH</code></pre></li>
<li><p>Save and exit the editor (in nano, press <code>Ctrl+O</code>, <code>Enter</code>, and then <code>Ctrl+X</code>).</p></li>
</ol>
<p><strong>Step 3: Create the Deactivation Script</strong></p>
<p>Similarly, create the <code>env_vars.sh</code> script in the <code>deactivate.d</code> directory. This script will clear the environment variables when we deactivate the environment.</p>
<ol type="1">
<li><p>Navigate to the directory:</p>
<pre><code>cd $CONDA_PREFIX/etc/conda/deactivate.d</code></pre></li>
<li><p>Create and edit the <code>env_vars.sh</code> file:</p>
<pre><code>nano env_vars.sh</code></pre></li>
<li><p>Add the following content to remove the environment’s path from <code>LD_LIBRARY_PATH</code>:</p>
<pre><code>#!/bin/sh
export LD_LIBRARY_PATH=$(echo $LD_LIBRARY_PATH | sed -e "s|$CONDA_PREFIX/lib:||g")</code></pre></li>
<li><p>Save and exit the editor.</p></li>
</ol>
<p><strong>Step 4: Make Scripts Executable</strong></p>
<p>Ensure that both scripts are executable:</p>
<pre><code>chmod +x $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
chmod +x $CONDA_PREFIX/etc/conda/deactivate.d/env_vars.sh</code></pre>
<p><strong>Step 5: Testing</strong></p>
<p>Activate our environment again to test the changes:</p>
<pre><code>conda deactivate
conda activate cuda</code></pre>
<p>Check that the <code>LD_LIBRARY_PATH</code> is correctly set:</p>
<pre><code>echo $LD_LIBRARY_PATH</code></pre>
<p>This should reflect the changes we’ve made, showing that the library path of our Conda environment is included.</p>
<p>In my case, the output of <code>echo $LD_LIBRARY_PATH</code> includes <code>/home/bennyistanto/anaconda3/envs/cuda/lib:</code>, indicating that <code>LD_LIBRARY_PATH</code> correctly points to the library directory of the <code>cuda</code> Conda environment. This is what we want: it directs the system to look in the environment’s <code>lib</code> directory for shared libraries, such as those provided by CUDA and cuDNN, which are crucial for TensorFlow to correctly utilize GPU resources.</p>
</section>
<section id="configure-jupyter-notebook" class="level4">
<h4 class="anchored" data-anchor-id="configure-jupyter-notebook">9. Configure Jupyter Notebook</h4>
<p>To configure Jupyter Notebook to use the GPU, we create a new kernel that uses the <code>cuda</code> Conda environment we set up earlier, so that the GPU-enabled libraries are available to notebooks. We can do this by running the following command:</p>
<pre><code>python -m ipykernel install --user --name cuda --display-name "Python 3 (GPU)"</code></pre>
<p>This command registers a new kernel called “Python 3 (GPU)” that uses the <code>cuda</code> Conda environment.</p>
<hr>
<p>Voilà, the installation process is complete. Next, we can test it using <code>test_GPU.ipynb</code>.</p>
<p>Github Gist file: <a href="https://gist.github.com/bennyistanto/46d8cfaf88aaa881ec69a2b5ce60cb58" class="uri">https://gist.github.com/bennyistanto/46d8cfaf88aaa881ec69a2b5ce60cb58</a></p>
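<p>As a quick sanity check, a short snippet along these lines (a sketch, not the notebook itself) can confirm that PyTorch and TensorFlow see the GPU; it degrades gracefully when either library is missing:</p>

```python
# Minimal GPU sanity check; run inside the `cuda` environment.
def gpu_report():
    """Report GPU visibility for PyTorch and TensorFlow (None = not installed)."""
    report = {}
    try:
        import torch
        report["torch_cuda"] = torch.cuda.is_available()
    except ImportError:
        report["torch_cuda"] = None
    try:
        import tensorflow as tf
        report["tf_gpus"] = len(tf.config.list_physical_devices("GPU"))
    except ImportError:
        report["tf_gpus"] = None
    return report

if __name__ == "__main__":
    for name, value in gpu_report().items():
        print(f"{name}: {value}")
```

<p>If both report <code>True</code> / a non-zero GPU count, the stack from steps 4-7 is wired up correctly.</p>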


</section>
</section>

]]></description>
  <category>General</category>
  <guid>https://benny.istan.to/site/blog/20240416-utilizing-cuda.html</guid>
  <pubDate>Tue, 16 Apr 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Maximizing Thinkpad T14 Gen 2 AMD</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20240125-maximizing-thinkpad-t14-gen-2-amd.html</link>
  <description><![CDATA[ 





<p>I bought a <a href="https://www.lenovo.com/us/en/p/laptops/thinkpad/thinkpadt/t14-g2-amd/22tpt14t4a1">Thinkpad T14 Gen 2 AMD</a> (released in August 2022) at the end of December 2023; it is second-hand, in mint condition, with the standard specification (AMD Ryzen™ 7 PRO 5850U Processor with Radeon Graphics, 256GB NVMe and 16GB RAM, FHD 14”).</p>
<p>This is my second pre-owned Thinkpad after the <a href="../blog/20220501-maximizing-thinkpad-t480">T480</a>, and as before, I did some upgrades on my T14. Here’s the list of components:</p>
<ol type="1">
<li>T14 Gen 2: 14″, 3840×2160, IPS, 500 nits, 100% Adobe RGB, Anti-glare from <a href="https://www.myfixguide.com/store/screen-for-thinkpad-t14/" class="uri">https://www.myfixguide.com/store/screen-for-thinkpad-t14/</a></li>
<li>40pin UHD cable from <a href="https://www.myfixguide.com/store/lcd-cable-for-t14-gen2/" class="uri">https://www.myfixguide.com/store/lcd-cable-for-t14-gen2/</a></li>
<li>4TB NVMe SSD from <a href="https://www.crucial.com/ssd/p3-plus/CT4000P3PSSD8" class="uri">https://www.crucial.com/ssd/p3-plus/CT4000P3PSSD8</a></li>
<li>32GB RAM DDR4 SODIMM 3200MHz from <a href="https://www.corsair.com/us/en/p/memory/cmsx32gx4m1a3200c22/corsair-high-performance-vengeance-memory-kit-cmsx32gx4m1a3200c22" class="uri">https://www.corsair.com/us/en/p/memory/cmsx32gx4m1a3200c22/corsair-high-performance-vengeance-memory-kit-cmsx32gx4m1a3200c22</a></li>
<li>WWAN card and the antenna from <a href="https://thinkparts.com/products/-4970" class="uri">https://thinkparts.com/products/-4970</a></li>
</ol>
<p>I did all the installation myself, following the awesome guides from <a href="https://www.ifixit.com/">IFIXIT</a> and using the Pro Tech Toolkit (<a href="https://www.ifixit.com/Store/Tools/Pro-Tech-Toolkit/IF145-307" class="uri">https://www.ifixit.com/Store/Tools/Pro-Tech-Toolkit/IF145-307</a>).</p>
<p>See some of the pictures below:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://benny.istan.to/site/assets/image-blog/20240125-maximizing-thinkpad-t14-gen-2-amd-01.jpg" class="img-fluid figure-img" alt="Remove the old screen and install the WWAN antenna"></p>
<figcaption>Remove the old screen and install the WWAN antenna</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://benny.istan.to/site/assets/image-blog/20240125-maximizing-thinkpad-t14-gen-2-amd-02.jpg" class="img-fluid figure-img" alt="Install the new screen"></p>
<figcaption>Install the new screen</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://benny.istan.to/site/assets/image-blog/20240125-maximizing-thinkpad-t14-gen-2-amd-06.jpg" class="img-fluid figure-img" alt="Put the memory in the free slot"></p>
<figcaption>Put the memory in the free slot</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://benny.istan.to/site/assets/image-blog/20240125-maximizing-thinkpad-t14-gen-2-amd-07.jpg" class="img-fluid figure-img" alt="Replace the old SSD with the new one, and put the WWAN card and antenna cable in place"></p>
<figcaption>Replace the old SSD with the new one, and put the WWAN card and antenna cable in place</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://benny.istan.to/site/assets/image-blog/20240125-maximizing-thinkpad-t14-gen-2-amd-05.jpg" class="img-fluid figure-img" alt="This is the machine with upgrades"></p>
<figcaption>This is the machine with upgrades</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://benny.istan.to/site/assets/image-blog/20240125-maximizing-thinkpad-t14-gen-2-amd-04.jpg" class="img-fluid figure-img" alt="Look, the Cellular connection is available now"></p>
<figcaption>Look, the Cellular connection is available now</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://benny.istan.to/site/assets/image-blog/20240125-maximizing-thinkpad-t14-gen-2-amd-03.jpg" class="img-fluid figure-img" alt="The new 4K screen is on and I just need to install the bezel 😎"></p>
<figcaption>The new 4K screen is on and I just need to install the bezel 😎</figcaption>
</figure>
</div>



]]></description>
  <category>General</category>
  <guid>https://benny.istan.to/site/blog/20240125-maximizing-thinkpad-t14-gen-2-amd.html</guid>
  <pubDate>Fri, 26 Jan 2024 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20240125-maximizing-thinkpad-t14-gen-2-amd-01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Drought Propagation</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20240124-drought-propagation-01.html</link>
  <description><![CDATA[ 





<p>Last month I ran an experiment to analyze the propagation of <strong>Meteorological Drought</strong> (Standardized Precipitation Index - SPI) into <strong>Hydrological Drought</strong> (Standardized Streamflow Index - SSI) using <strong>Lagged Correlation</strong> at the pixel level, with Indonesia as the area of interest.</p>
<p>To download the full repository, you can access it via this link: <a href="https://github.com/bennyistanto/drought-propagation" class="uri">https://github.com/bennyistanto/drought-propagation</a></p>
<section id="data" class="level3">
<h3 class="anchored" data-anchor-id="data">Data</h3>
<p>I use the <strong>Standardized Precipitation Index</strong> (<a href="https://library.wmo.int/viewer/39629/download?file=wmo_1090_en.pdf&amp;type=pdf&amp;navigator=1">SPI</a>) as a proxy for meteorological drought, and the Standardized Streamflow Index (<a href="https://doi.org/10.1029/2019WR026315">SSI</a>) as a proxy for hydrological drought.</p>
<p>The SPI uses monthly gridded satellite precipitation estimates from the Climate Hazards Group InfraRed Precipitation with Station data (<a href="https://doi.org/10.1038/sdata.2015.66">CHIRPS</a>).</p>
<p>The SSI uses daily gridded river discharge over the last 24 hours from the <a href="https://doi.org/10.5194/essd-12-2043-2020">GloFAS-ERA5 operational global river discharge reanalysis 1979–present</a> as a proxy for the streamflow time series information.</p>
</section>
<section id="folder-structure-and-files" class="level3">
<h3 class="anchored" data-anchor-id="folder-structure-and-files">Folder structure and files</h3>
<p>There are three notebooks, along with supporting folders, required to run the analysis. Feel free to arrange the folders and settings to your own preference.</p>
<ol type="1">
<li><code>hyd</code> # Files required to process the hydrological drought go here.</li>
<li><code>met</code> # Files required to process the meteorological drought go here.</li>
<li><code>prop</code> # Files required to compute the propagation using lagged correlation go here.</li>
<li><code>subset</code> # This folder holds <code>idn_subset_chirps.nc</code>, a subset file used to clip the input data to the area of interest. The file comes from a shapefile polygon with a <code>land</code> attribute column (<code>value = 1</code>), converted to raster based on the <code>land</code> column with the cell size set to our standard (0.05 deg, matching the spatial resolution of the SPI and SSI), then converted to netCDF. All of this was done in ArcGIS Desktop.</li>
</ol>
<p>The notebooks:</p>
<ol type="1">
<li><a href="https://github.com/bennyistanto/drought-propagation/blob/main/1_Steps_to_Generate_SPI_Using_CHIRPS_Data.ipynb"><code>1_Steps_to_Generate_SPI_Using_CHIRPS_Data.ipynb</code></a></li>
<li><a href="https://github.com/bennyistanto/drought-propagation/blob/main/2_Steps_to_Generate_SSI_Using_GloFAS-ERA5_Data.ipynb"><code>2_Steps_to_Generate_SSI_Using_GloFAS-ERA5_Data.ipynb</code></a></li>
<li><a href="https://github.com/bennyistanto/drought-propagation/blob/main/3_Drought_Propagation_Met2Hyd_Using_CCA.ipynb"><code>3_Drought_Propagation_Met2Hyd_Using_CCA.ipynb</code></a></li>
</ol>
<blockquote class="blockquote">
<p>This uses cross-correlation for each pixel across the entire time series, and also employs noise-filtering techniques such as Singular Spectrum Analysis (SSA), which help isolate the underlying trends and patterns in our data before performing the CCA. This step is crucial for enhancing the signal-to-noise ratio in our datasets.</p>
</blockquote>
</section>
<section id="approach" class="level3">
<h3 class="anchored" data-anchor-id="approach">Approach</h3>
<p>The analysis uses combinations of various time scales [<code>3</code>, <code>6</code>, <code>9</code>, and <code>12-month</code>] and lags ranging from 1 to 12 months:</p>
<pre><code>time_scale_combinations = [
    "spi03_ssi03", "spi06_ssi03", "spi09_ssi03", "spi12_ssi03",
    "spi03_ssi06", "spi06_ssi06", "spi09_ssi06", "spi12_ssi06",
    "spi03_ssi09", "spi06_ssi09", "spi09_ssi09", "spi12_ssi09",
    "spi03_ssi12", "spi06_ssi12", "spi09_ssi12", "spi12_ssi12"
]</code></pre>
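<p>For reference, the combination list above can also be generated programmatically rather than typed out; this one-liner (my own shorthand, not code from the repository) produces the same 16 names in the same order:</p>

```python
# Generate the 16 SPI x SSI time-scale combinations shown above.
scales = ["03", "06", "09", "12"]
time_scale_combinations = [f"spi{s}_ssi{t}" for t in scales for s in scales]

print(time_scale_combinations[:4])
# → ['spi03_ssi03', 'spi06_ssi03', 'spi09_ssi03', 'spi12_ssi03']
```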
<section id="preprocessing" class="level4">
<h4 class="anchored" data-anchor-id="preprocessing">Preprocessing</h4>
<p>The drought characteristics follow the method originally proposed by Yevjevich in <a href="https://www.engr.colostate.edu/ce/facultystaff/yevjevich/papers/HydrologyPapers_n23_1967.pdf">1967</a>, which has been employed to recognize the features of droughts. The paper by Le et al. (<a href="https://www.researchgate.net/publication/333171255_Space-time_variability_of_drought_over_Vietnam">2019</a>) provides a better explanation of these characteristics: duration, severity, intensity, and interarrival.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="../assets/image-blog/20240124-drought-propagation-01.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Drought"><img src="https://benny.istan.to/site/assets/image-blog/20240124-drought-propagation-01.png" class="img-fluid figure-img" alt="Drought"></a></p>
<figcaption>Drought</figcaption>
</figure>
</div>
<p><strong>Masking for Drought Event</strong> A drought condition is set when the SPI or SSI value is negative, specifically less than -1.2. Focusing on drought conditions could be a more relevant approach for our analysis compared to using all SPI and SSI data, which include both dry and wet conditions. By concentrating on these periods, we can potentially gain more insight into the correlation between meteorological and hydrological droughts.</p>
<p><strong>Calculate Drought Magnitude</strong> Compute the absolute cumulative values during drought events for both datasets. This gives a measure of drought magnitude, which may be more meaningful for correlation analysis than using raw SPI/SSI values.</p>
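<p>The masking and magnitude steps can be sketched in Python as follows (a minimal illustration on a plain list; the actual analysis runs on gridded data, and the function name is mine):</p>

```python
def drought_magnitude(series, threshold=-1.2):
    """Absolute cumulative value over each contiguous run below the threshold.

    Returns one magnitude per drought event, mirroring the masking
    (index < -1.2) and magnitude steps described above.
    """
    magnitudes = []
    run = 0.0
    in_drought = False
    for value in series:
        if value < threshold:           # inside a drought event
            run += abs(value)
            in_drought = True
        elif in_drought:                # event just ended: record its magnitude
            magnitudes.append(run)
            run, in_drought = 0.0, False
    if in_drought:                      # series ended mid-event
        magnitudes.append(run)
    return magnitudes
```

<p>For example, an SPI series <code>[0.5, -1.5, -2.0, 0.3, -1.3]</code> contains two events with magnitudes <code>3.5</code> and <code>1.3</code>.</p>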
<p><strong>Applying Singular Spectrum Analysis (SSA)</strong> For noise filtering and trend extraction in drought magnitude data in SPI and SSI datasets. In drought propagation analysis, noise filtering with SSA is a critical step for data preparation. SSA effectively separates the underlying signal from the noise in climate datasets, such as SPI and SSI.</p>
<p>SSA decomposes a time series into a sum of components:</p>
<p><img src="https://latex.codecogs.com/png.latex?X(t)%20=%20T(t)%20+%20S(t)%20+%20N(t)"></p>
<p>Where:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?X(t)">: Original time series</li>
<li><img src="https://latex.codecogs.com/png.latex?T(t)">: Trend component</li>
<li><img src="https://latex.codecogs.com/png.latex?S(t)">: Seasonal component</li>
<li><img src="https://latex.codecogs.com/png.latex?N(t)">: Noise component</li>
</ul>
<p>This process is crucial for enhancing the clarity and accuracy of the data, which in turn facilitates a more precise understanding of drought patterns and their progression.</p>
</section>
<section id="analysis" class="level4">
<h4 class="anchored" data-anchor-id="analysis">Analysis</h4>
<p><strong>Cross-Correlation Analysis</strong> Especially when applied to data refined through SSA noise filtering, is pivotal in understanding drought propagation. This technique examines the relationship between different drought indicators across various time scales. By utilizing data filtered through SSA, which isolates the core signal from noise, Cross-Correlation Analysis can more accurately determine the time lag and intensity with which meteorological droughts (indicated by SPI) translate into hydrological droughts (indicated by SSI).</p>
<p>The cross-correlation coefficient <img src="https://latex.codecogs.com/png.latex?%5Crho_%7Bxy%7D(%5Ctau)"> at lag <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is calculated as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Crho_%7Bxy%7D(%5Ctau)%20=%20%5Cfrac%7B%5Csum((X_i%20-%20%5Cbar%7BX%7D)(Y_%7Bi+%5Ctau%7D%20-%20%5Cbar%7BY%7D))%7D%7B%5Csqrt%7B%5Csum(X_i%20-%20%5Cbar%7BX%7D)%5E2%20%5Csum(Y_i%20-%20%5Cbar%7BY%7D)%5E2%7D%7D"></p>
<p>where:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?X_i">: Value of the first time series at time <img src="https://latex.codecogs.com/png.latex?i"></li>
<li><img src="https://latex.codecogs.com/png.latex?Y_%7Bi+%5Ctau%7D">: Value of the second time series at time <img src="https://latex.codecogs.com/png.latex?i+%5Ctau"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Ctau">: Time lag</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cbar%7BX%7D">: Mean of the first time series</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cbar%7BY%7D">: Mean of the second time series</li>
<li><img src="https://latex.codecogs.com/png.latex?N">: Number of data points</li>
</ul>
<p>This approach is essential for predicting the onset and progression of drought conditions, enabling timely decision-making and effective resource management to mitigate the adverse impacts of droughts.</p>
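<p>The coefficient above translates directly into code; a NumPy sketch on synthetic series, where the imposed 2-month delay (an illustrative choice, as is the noise level) is recovered as the best lag:</p>

```python
import numpy as np

def cross_corr(x, y, lag):
    """rho_xy(tau): correlate x[i] with y[i + lag], per the formula above."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xd = x - x.mean()
    yd = y - y.mean()
    num = np.sum(xd[: len(x) - lag] * yd[lag:])
    den = np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2))
    return num / den

# Synthetic SSI as a 2-month delayed copy of SPI plus noise
rng = np.random.default_rng(1)
spi = rng.normal(size=240)
ssi = np.roll(spi, 2) + rng.normal(0, 0.1, 240)

corrs = {lag: cross_corr(spi, ssi, lag) for lag in range(0, 7)}
best_lag = max(corrs, key=corrs.get)  # expected to recover lag 2
```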
<p><strong>Frequency Analysis</strong> In the context of drought propagation analysis, frequency analysis plays a critical role in identifying the most prominent patterns of correlation between meteorological and hydrological drought indicators over time. By classifying cross-correlation values into distinct ranges (e.g., 0.0-0.1, 0.1-0.2, etc.) and analyzing these across different lag times, researchers can pinpoint the range that most frequently occurs.</p>
<p>This approach helps in understanding the typical strength of correlation and the temporal shift (lag) between the onset of meteorological drought and its subsequent impact on hydrological conditions. The most frequent range provides insight into the commonality of correlation strengths, while the corresponding lag sheds light on the typical delay between atmospheric changes and their effects on hydrological systems. We can also derive the maximum correlation value, which indicates which areas show the strongest relationship, together with the lag time at which that maximum correlation between SPI and SSI is observed.</p>
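<p>A small NumPy sketch of this classification step for a single grid cell; the 0.1-wide bins follow the ranges described above, while the correlation values themselves are randomly generated for illustration:</p>

```python
import numpy as np

# Illustrative cross-correlation values for lags 1..12 at one grid cell
rng = np.random.default_rng(2)
corr_by_lag = np.clip(rng.normal(0.45, 0.15, 12), 0.0, 1.0)

# Classify into 0.1-wide ranges (0.0-0.1, 0.1-0.2, ...) and count occurrences
bins = np.arange(0.0, 1.1, 0.1)
counts, _ = np.histogram(corr_by_lag, bins=bins)
lo, hi = bins[np.argmax(counts)], bins[np.argmax(counts) + 1]
print(f"Most frequent range: {lo:.1f}-{hi:.1f}")

# Maximum correlation and the (1-based) lag at which it occurs
max_corr = corr_by_lag.max()
max_lag = int(np.argmax(corr_by_lag)) + 1
```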
</section>
</section>
<section id="visualisation" class="level3">
<h3 class="anchored" data-anchor-id="visualisation">Visualisation</h3>
<p>Two map types are used to illustrate the results of the cross-correlation analysis between meteorological and hydrological droughts.</p>
<p><strong>Lag Map</strong> This map displays the time lag (in months) between meteorological and hydrological droughts across the study area. It helps identify regions where hydrological responses to meteorological changes are immediate or delayed.</p>
<p><strong>Strength Map</strong> This map shows the strength of the correlation between meteorological and hydrological droughts. It highlights areas with a strong predictive relationship, indicating regions sensitive to meteorological changes.</p>
<p>Below are some examples of individual Strength Maps for various time-scale combinations and lags.</p>
<ol type="1">
<li><p>SPI 03 and SSI 03, Lag 1-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/cor_spi03_ssi03_lag01.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="SM1"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/cor_spi03_ssi03_lag01.png" class="img-fluid figure-img" alt="SM1"></a></p>
<figcaption>SM1</figcaption>
</figure>
</div></li>
<li><p>SPI 06 and SSI 03, Lag 1-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/cor_spi06_ssi03_lag01.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="SM2"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/cor_spi06_ssi03_lag01.png" class="img-fluid figure-img" alt="SM2"></a></p>
<figcaption>SM2</figcaption>
</figure>
</div></li>
<li><p>SPI 06 and SSI 03, Lag 3-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/cor_spi06_ssi03_lag03.png" class="lightbox" data-gallery="quarto-lightbox-gallery-4" title="SM3"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/cor_spi06_ssi03_lag03.png" class="img-fluid figure-img" alt="SM3"></a></p>
<figcaption>SM3</figcaption>
</figure>
</div></li>
<li><p>SPI 12 and SSI 06, Lag 6-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/cor_spi12_ssi06_lag06.png" class="lightbox" data-gallery="quarto-lightbox-gallery-5" title="SM4"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/cor_spi12_ssi06_lag06.png" class="img-fluid figure-img" alt="SM4"></a></p>
<figcaption>SM4</figcaption>
</figure>
</div></li>
</ol>
<p>And below are some examples of composite Strength and Lag Maps for various time-scale combinations.</p>
<ol type="1">
<li><p>Most frequent correlation, and the lag at which it is most frequently observed, for SPI to SSI 3-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_freq_corr_combination_1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-6" title="SM1"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_freq_corr_combination_1.png" class="img-fluid figure-img" alt="SM1"></a></p>
<figcaption>SM1</figcaption>
</figure>
</div></li>
<li><p>Most frequent correlation, and the lag at which it is most frequently observed, for SPI to SSI 6-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_freq_corr_combination_2.png" class="lightbox" data-gallery="quarto-lightbox-gallery-7" title="SM2"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_freq_corr_combination_2.png" class="img-fluid figure-img" alt="SM2"></a></p>
<figcaption>SM2</figcaption>
</figure>
</div></li>
<li><p>Most frequent correlation, and the lag at which it is most frequently observed, for SPI to SSI 9-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_freq_corr_combination_3.png" class="lightbox" data-gallery="quarto-lightbox-gallery-8" title="SM3"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_freq_corr_combination_3.png" class="img-fluid figure-img" alt="SM3"></a></p>
<figcaption>SM3</figcaption>
</figure>
</div></li>
<li><p>Most frequent correlation, and the lag at which it is most frequently observed, for SPI to SSI 12-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_freq_corr_combination_4.png" class="lightbox" data-gallery="quarto-lightbox-gallery-9" title="SM4"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_freq_corr_combination_4.png" class="img-fluid figure-img" alt="SM4"></a></p>
<figcaption>SM4</figcaption>
</figure>
</div></li>
<li><p>Maximum correlation, and the lag at which it is observed, for SPI to SSI 3-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_max_corr_combination_1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-10" title="SM1"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_max_corr_combination_1.png" class="img-fluid figure-img" alt="SM1"></a></p>
<figcaption>SM1</figcaption>
</figure>
</div></li>
<li><p>Maximum correlation, and the lag at which it is observed, for SPI to SSI 6-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_max_corr_combination_2.png" class="lightbox" data-gallery="quarto-lightbox-gallery-11" title="SM2"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_max_corr_combination_2.png" class="img-fluid figure-img" alt="SM2"></a></p>
<figcaption>SM2</figcaption>
</figure>
</div></li>
<li><p>Maximum correlation, and the lag at which it is observed, for SPI to SSI 9-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_max_corr_combination_3.png" class="lightbox" data-gallery="quarto-lightbox-gallery-12" title="SM3"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_max_corr_combination_3.png" class="img-fluid figure-img" alt="SM3"></a></p>
<figcaption>SM3</figcaption>
</figure>
</div></li>
<li><p>Maximum correlation, and the lag at which it is observed, for SPI to SSI 12-month</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_max_corr_combination_4.png" class="lightbox" data-gallery="quarto-lightbox-gallery-13" title="SM4"><img src="https://raw.githubusercontent.com/bennyistanto/drought-propagation/main/prop/images/idn_cli_max_corr_combination_4.png" class="img-fluid figure-img" alt="SM4"></a></p>
<figcaption>SM4</figcaption>
</figure>
</div></li>
</ol>
</section>
<section id="to-do" class="level3">
<h3 class="anchored" data-anchor-id="to-do">To do</h3>
<p>The number of lags (1-12 months) in the existing simulation is good enough.</p>
<p>Expanding the time scales from <code>3, 6, 9, 12</code> to <code>1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12</code>, and testing all their combinations, could potentially produce more insight.</p>
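<p>As a rough sense of scale: crossing twelve SPI time scales with twelve SSI time scales, each evaluated at twelve lags, can be enumerated with itertools (the full cross product is an assumption about the experiment design, not something the existing code does):</p>

```python
from itertools import product

spi_scales = range(1, 13)  # SPI 1..12-month
ssi_scales = range(1, 13)  # SSI 1..12-month
lags = range(1, 13)        # lag 1..12-month

# Every SPI/SSI time-scale pairing, evaluated at every lag
combinations = list(product(spi_scales, ssi_scales, lags))
print(len(combinations))
```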
<p><strong>THIS WORK STILL IN PROGRESS</strong></p>
</section>
<section id="live-testing" class="level3">
<h3 class="anchored" data-anchor-id="live-testing">Live testing</h3>
<p>You can access the notebook via Binder:</p>
<p><a href="https://mybinder.org/v2/gh/bennyistanto/drought-propagation/HEAD" class="uri">https://mybinder.org/v2/gh/bennyistanto/drought-propagation/HEAD</a></p>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Remote Sensing</category>
  <category>Research</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20240124-drought-propagation-01.html</guid>
  <pubDate>Wed, 24 Jan 2024 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20240124-drought-propagation-01.png" medium="image" type="image/png" height="55" width="144"/>
</item>
<item>
  <title>Firmware upgrade on Thuraya SatSleeve for iPhone</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20240103-firmware-upgrade-on-thuraya-satsleeve-for-iphone.html</link>
  <description><![CDATA[ 





<p><a href="../assets/image-blog/20240103-firmware-upgrade-on-thuraya-satsleeve-for-iphone-01.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20240103-firmware-upgrade-on-thuraya-satsleeve-for-iphone-01.jpg" class="img-fluid"></a></p>
<p>I have an old version of the <a href="https://www.thuraya.com/en/products-list/legacy/satsleeve-for-iphone">Thuraya SatSleeve for iPhone</a>, now categorized as a legacy product on the Thuraya website.</p>
<p>For about a year, the SatSleeve had been experiencing intermittent connections, both the Bluetooth link between the iPhone SE and the SatSleeve and the GPS connection. Luckily, Thuraya provides a firmware upgrade, release v3.0.1, on their website: <a href="https://www.thuraya.com/en/support/upgrades/legacy/thuraya-satsleeve-for-iphone" class="uri">https://www.thuraya.com/en/support/upgrades/legacy/thuraya-satsleeve-for-iphone</a></p>
<p>Before upgrading a SatSleeve, check which firmware is installed (SatSleeve &gt; Settings &gt; Device Info &gt; Firmware version). Perform the upgrade only if Thuraya releases a firmware version newer than your existing one (mine was v2.94).</p>
<p>To upgrade the firmware, follow steps on the website:</p>
<p>Step 1</p>
<ul>
<li><p>Download the below SatSleeve Upgrader program.</p>
<p><a href="https://www.thuraya.com/-/media/thuraya/downloads/upgrades/legacy-downloads/satsleeve-for-iphone/thurayasatsleeveupgraderv1331-1.zip?la=en&amp;hash=6383DE73A533A43B0C8043BF355806EB19BB8F05">SatSleeve upgrader</a></p></li>
<li><p>Unzip and Run the setup file - the Upgrader program including the USB driver will be installed.</p></li>
</ul>
<p>Step 2</p>
<ul>
<li><p>Download the latest Thuraya SatSleeve firmware release to your hard disk.</p>
<p><a href="https://www.thuraya.com/-/media/thuraya/downloads/upgrades/legacy-downloads/satsleeve-for-iphone/satsleevedatathurayav301.zip?la=en&amp;hash=C2145EA5F0579D34268B8B4ED1056D3CCF3436D1">SatSleeve iPhone firmware release v3.0.1</a></p>
<p>(works only on the SatSleeve for iPhone Data model) - Unzip it.</p>
<p>Release notes of v3.0.1: Fixed GPS rollover issues</p></li>
</ul>
<p>Step 3</p>
<ul>
<li><p>Connect your SatSleeve with the PC/laptop via USB data cable.</p></li>
<li><p>You can now start the SatSleeve Upgrader program (make sure you run this software as Administrator).</p>
<ul>
<li>The stated requirement is a PC with Windows 8/8.1, Windows 7, or Windows Vista, but it works fine on my Windows 11 machine</li>
<li>Right-click Thuraya SatSleeve Upgrader &gt; More &gt; Run as administrator</li>
</ul></li>
<li><p>Locate the firmware on your hard disk. The Upgrader program will guide you through the upgrade process.</p></li>
</ul>
<p>Now, my SatSleeve is working fine with my iPhone SE.</p>
<p><a href="../assets/image-blog/20240103-firmware-upgrade-on-thuraya-satsleeve-for-iphone-02.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://benny.istan.to/site/assets/image-blog/20240103-firmware-upgrade-on-thuraya-satsleeve-for-iphone-02.jpg" class="img-fluid"></a></p>



<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>General</category>
  <guid>https://benny.istan.to/site/blog/20240103-firmware-upgrade-on-thuraya-satsleeve-for-iphone.html</guid>
  <pubDate>Thu, 04 Jan 2024 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20240103-firmware-upgrade-on-thuraya-satsleeve-for-iphone-01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Hourly Humidity Data</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20231015-hourly-humidity-data.html</link>
  <description><![CDATA[ 





<p><a href="../assets/image-blog/20231015-hourly-humidity-data-01.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20231015-hourly-humidity-data-01.jpg" class="img-fluid"></a></p>
<p>Recently, I embarked on a journey to calculate humidity data from a myriad of sources. Throughout this process, I experimented with various methods, ranging from saturation water vapour pressure using Tetens’ formula (with parameters according to Buck), to saturation over ice from Alduchov and Eskridge, and finally to Clausius-Clapeyron.</p>
<p>For those in need of hourly humidity data spanning from 1 Jan 1950 to the present, there’s good news! You can seamlessly extract this information from ERA5-Land Hourly data via Google Earth Engine (GEE). The Specific and Relative Humidity is meticulously calculated based on three core parameters: T2m (Temperature at 2 meters), Dew Point, and Surface Pressure.</p>
<p>Interested in exploring further? Check out my GEE script: <a href="https://code.earthengine.google.com/9b23f929939122fb1fdc8418d17c43f5" class="uri">https://code.earthengine.google.com/9b23f929939122fb1fdc8418d17c43f5</a></p>
<p>By the way, for those diving deep into the technicalities, the GEE script I’ve shared leans on a simpler approach, kinda like a nod to the good ol’ Magnus formula. So, it’s pretty straightforward and user-friendly.</p>
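<p>The shared script runs in GEE JavaScript; an equivalent Python sketch of the Magnus-type calculation is below. The 17.625/243.04 coefficients are the improved Magnus values from the Alduchov and Eskridge reference; the sample inputs are illustrative:</p>

```python
import numpy as np

# Improved Magnus coefficients (Alduchov & Eskridge, 1996)
A, B = 17.625, 243.04

def sat_vapour_pressure(t_c):
    """Saturation vapour pressure (hPa) at air temperature t_c (deg C)."""
    return 6.1094 * np.exp(A * t_c / (B + t_c))

def relative_humidity(t2m_c, dewpoint_c):
    """Relative humidity (%) from 2 m temperature and dew point (deg C)."""
    return 100.0 * sat_vapour_pressure(dewpoint_c) / sat_vapour_pressure(t2m_c)

def specific_humidity(dewpoint_c, pressure_hpa):
    """Specific humidity (kg/kg) from dew point and surface pressure."""
    e = sat_vapour_pressure(dewpoint_c)  # actual vapour pressure (hPa)
    return 0.622 * e / (pressure_hpa - 0.378 * e)

rh = relative_humidity(30.0, 24.0)
```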
<p>I hope this proves beneficial to researchers, data scientists, and enthusiasts in the realm of climatology. If you have any suggestions, feedback, or improvements, please don’t hesitate to reach out.</p>
<p><strong>Reference</strong></p>
<p>Alduchov, O. A., &amp; Eskridge, R. E. (1996). Improved Magnus form approximation of saturation vapor pressure. Journal of Applied Meteorology, 35(4), 601-609.</p>
<hr>



<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Remote Sensing</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20231015-hourly-humidity-data.html</guid>
  <pubDate>Sun, 15 Oct 2023 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20231015-hourly-humidity-data-01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>A certified GISP</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230825-a-certified-gisp.html</link>
  <description><![CDATA[ 





<p><a href="../assets/image-blog/20230825-a-certified-gisp-01.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20230825-a-certified-gisp-01.jpg" class="img-fluid"></a></p>
<p>Exciting news! I’m now a certified GIS Professional (GISP)</p>
<p>Big thanks to GIS Certification Institute (GISCI) for this recognition. To learn more about the GISP certification process, visit <a href="https://www.gisci.org" class="uri">https://www.gisci.org</a></p>
<p>Shoutout to my mentors at the WBG, <a href="https://www.worldbank.org/en/about/people/k/keith-patrick-garrett">Keith Garrett</a> and <a href="https://www.worldbank.org/en/about/people/b/benjamin-p-stewart">Ben Stewart</a>, for their guidance in the last 2 years.</p>
<p>Looking forward to doing more meaningful work in climate analytics and geospatial technology for greater impact! 🗺️</p>
<p><a href="../assets/image-blog/20230825-a-certified-gisp-02.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://benny.istan.to/site/assets/image-blog/20230825-a-certified-gisp-02.jpg" class="img-fluid"></a></p>
<p>Update: happy to have received the certificate at the end of 2023.</p>



<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>General</category>
  <guid>https://benny.istan.to/site/blog/20230825-a-certified-gisp.html</guid>
  <pubDate>Fri, 25 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20230825-a-certified-gisp-01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Monthly mosaic of modified Radar Vegetation Index</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230824-monthly-mosaic-of-modified-radar-vegetation-index.html</link>
  <description><![CDATA[ 





<p><a href="../assets/image-blog/20230824-monthly-mosaic-of-modified-radar-vegetation-index-02.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20230824-monthly-mosaic-of-modified-radar-vegetation-index-02.jpg" class="img-fluid"></a></p>
<p>A few months ago, I wrote a <a href="../blog/20230614-sentinel-1-modified-radar-vegetation-index">post</a> about how to calculate the modified Radar Vegetation Index (mRVI) using the Sentinel-1 satellite. It extracted the mRVI every dekad, with Ukraine as the study case.</p>
<p>For areas in Europe, getting the S1 data every dekad is doable, but it’s a bit tricky for areas outside Europe. Currently, for my work, I would like to extract the mRVI for <a href="https://en.wikipedia.org/wiki/Mpumalanga">Mpumalanga</a> province in South Africa. The location is near 25.5°S, and according to the picture below, the revisit time there is every 12 days.</p>
<p><a href="https://sentinel.esa.int/documents/247904/4748961/Sentinel-1-Repeat-Coverage-Frequency-Geometry-2021.jpg"><img src="https://benny.istan.to/site/assets/image-blog/20230824-monthly-mosaic-of-modified-radar-vegetation-index-03.jpg" class="img-fluid" alt="Sentinel-1 repeat coverage frequency geometry"></a> Source: <a href="https://sentinel.esa.int/documents/247904/4748961/Sentinel-1-Repeat-Coverage-Frequency-Geometry-2021.jpg" class="uri">https://sentinel.esa.int/documents/247904/4748961/Sentinel-1-Repeat-Coverage-Frequency-Geometry-2021.jpg</a></p>
<p>Getting a monthly mRVI mosaic seems feasible for the South Africa case; if I kept the dekad interval, some dekads would return an empty collection.</p>
<p>So, I needed to modify the existing code to build the monthly list, calculate monthly mosaics, and compute the monthly mean and the ratio anomaly.</p>
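<p>The full code runs in GEE (JavaScript), but the monthly-grouping logic itself can be sketched in Python with pandas on illustrative values; the seasonal signal, the ~12-day revisit interval, and the aggregation choices here are assumptions for demonstration only:</p>

```python
import numpy as np
import pandas as pd

# Illustrative mRVI observations on a ~12-day revisit cycle
rng = np.random.default_rng(3)
dates = pd.date_range("2020-01-01", "2022-12-31", freq="12D")
mrvi = pd.Series(0.5 + 0.1 * np.sin(2 * np.pi * dates.dayofyear / 365)
                 + rng.normal(0, 0.02, len(dates)), index=dates)

# Monthly "mosaic": aggregate all scenes that fall in each calendar month
monthly = mrvi.resample("MS").mean()

# Long-term monthly mean and the ratio anomaly against it
climatology = monthly.groupby(monthly.index.month).mean()
ratio_anomaly = monthly / climatology.loc[monthly.index.month].to_numpy()
```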
<p>Full GEE code is here: <a href="https://code.earthengine.google.com/aea00cb8f3f1ccc921d5f6698b5c0c5a" class="uri">https://code.earthengine.google.com/aea00cb8f3f1ccc921d5f6698b5c0c5a</a></p>
<p><a href="../assets/image-blog/20230824-monthly-mosaic-of-modified-radar-vegetation-index-01.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://benny.istan.to/site/assets/image-blog/20230824-monthly-mosaic-of-modified-radar-vegetation-index-01.jpg" class="img-fluid"></a></p>



<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Remote Sensing</category>
  <guid>https://benny.istan.to/site/blog/20230824-monthly-mosaic-of-modified-radar-vegetation-index.html</guid>
  <pubDate>Thu, 24 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20230824-monthly-mosaic-of-modified-radar-vegetation-index-02.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Parsing BMKG’s daily climate data</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230822-parsing-bmkgs-daily-climate-data.html</link>
  <description><![CDATA[ 





<p>To replicate the code below, please download daily climate data from BMKG Data Online at <a href="https://dataonline.bmkg.go.id/home" class="uri">https://dataonline.bmkg.go.id/home</a>. Just a heads up: if you haven’t already registered on the portal, you will need to do so, as registration is required before you can download any data.</p>
<p>Then go to Climate Data &gt; Daily Data; choose the Station Type, Parameter, Province, Regency, Station Name, and the Date Period; then click the Process button. You will get the data in *.xlsx format.</p>
<p>You can get one of the data example from this link: <a href="https://docs.google.com/spreadsheets/d/1xbBWeHhiMNs8IehHbsrMV9yeZlcu8GqR/edit?usp=sharing&amp;ouid=104182606454912191559&amp;rtpof=true&amp;sd=true" class="uri">https://docs.google.com/spreadsheets/d/1xbBWeHhiMNs8IehHbsrMV9yeZlcu8GqR/edit?usp=sharing&amp;ouid=104182606454912191559&amp;rtpof=true&amp;sd=true</a></p>
<p>In the example above, I retrieved daily precipitation data for all stations from 1 Jun 2000 to 31 Dec 2021. I would like to use it to correct the value and distribution of daily <a href="https://gpm.nasa.gov/data/imerg">IMERG</a> data using a bias correction method that I am currently developing.</p>
<p>Unfortunately, there are too many missing values.</p>
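<p>A minimal pandas sketch of this missing-value check. The inline frame below mimics a BMKG export, and the 8888/9999 sentinel codes are BMKG's conventional flags for unmeasured/missing data; treat both as assumptions if your export differs:</p>

```python
import numpy as np
import pandas as pd

# Illustrative frame mimicking a BMKG daily-rainfall export (RR column);
# assumption: 8888 = unmeasured, 9999 = missing (BMKG sentinel codes)
df = pd.DataFrame({
    "Tanggal": pd.date_range("2021-01-01", periods=8, freq="D"),
    "RR": [12.0, 0.0, 8888, 5.4, 9999, 9999, 3.2, 8888],
})

# Replace sentinel codes with NaN so they count as missing values
df["RR"] = df["RR"].replace([8888, 9999], np.nan)

missing_pct = df["RR"].isna().mean() * 100
print(f"Missing: {missing_pct:.0f}%")
```

With a real download you would start from <code>pd.read_excel(...)</code> on the *.xlsx file instead of the inline frame.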
<p>I should find an alternative daily precipitation time series; gridded data will probably suit my objectives.</p>



<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20230822-parsing-bmkgs-daily-climate-data.html</guid>
  <pubDate>Tue, 22 Aug 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>SPI-based drought characteristics</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230811-spi-based-drought-characteristics.html</link>
  <description><![CDATA[ 








<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Remote Sensing</category>
  <category>GIS</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20230811-spi-based-drought-characteristics.html</guid>
  <pubDate>Fri, 11 Aug 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Fourier regression model to generate monthly to daily temperature data</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data.html</link>
  <description><![CDATA[ 





<section id="introduction" class="level3">
<h3 class="anchored" data-anchor-id="introduction">1 Introduction</h3>
<p>In the sphere of meteorology, the significance of statistical models in comprehending and forecasting diverse weather patterns is incontestable. Within this context, the Fourier regression model has emerged as a formidable asset, specifically in generating daily time series from monthly temperature data (Wilks, 1998). The model lays a robust foundation for simulating temperature patterns, yielding crucial insights that are indispensable for weather prediction, climate change studies, and managing water resources.</p>
<p>The Fourier regression model has been proven to be a highly effective tool for generating daily time series from monthly temperature data, enhancing our understanding and prediction capabilities in weather forecasting, climate change studies, and water resource management. This model’s unique ability to incorporate historical context allows it to capture intricate dependencies and transitions in temperature data, which are crucial in understanding temperature patterns.</p>
<p>By applying Fourier series, it is possible to reduce the number of parameters involved in the process, thereby simplifying complex calculations and making the model more efficient. Moreover, the Fourier regression model can seamlessly replace missing values and handle anomalies, which are often challenges in data analysis. This enables more accurate simulations and predictions, making it a vital tool in fields such as agriculture and urban planning.</p>
<p>The Fourier regression model’s success in generating daily time series from monthly temperature data not only contributes to our understanding of weather patterns but also provides practical solutions for real-world challenges, making it a powerful instrument in various domains.</p>
</section>
<section id="data" class="level3">
<h3 class="anchored" data-anchor-id="data">2 Data</h3>
<p>Over the past three decades, Bogor’s climate has remained relatively consistent. The city experiences an average annual temperature of around 26 °Celsius. The temperature varies little throughout the year, with the warmest month averaging around 27 °Celsius and the coolest month averaging around 25 °Celsius.</p>
<p>Daily temperature data from the Bogor Climatological Station for 1984-2021 were used in this analysis, downloaded from BMKG Data Online in *.xlsx format. The file was then cleaned by removing the logo and unnecessary text, aggregated into monthly values, and reduced to two columns (date in column A and temperature in column B, with the records extending downwards), then saved in *.csv format.</p>
<p>The final input file is accessible via this link: <a href="https://drive.google.com/file/d/1vKT5ekDnqahkG6um5wIm-ZfhExqZTAm8/view?usp=sharing" class="uri">https://drive.google.com/file/d/1vKT5ekDnqahkG6um5wIm-ZfhExqZTAm8/view?usp=sharing</a></p>
</section>
<section id="methods" class="level3">
<h3 class="anchored" data-anchor-id="methods">3 Methods</h3>
<p>This exercise focuses on the Fourier regression model as a tool for generating daily temperature data from monthly time series (Boer, 1999).</p>
<p>Fourier series can also be employed to generate other climate data. McCaskill (1990a) utilized Fourier series regression, incorporating rainfall events to generate pan evaporation data, maximum and minimum air temperature, and daily radiation intensity (P(i)).</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-01.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-01.jpg" class="img-fluid"></a></p>
<p>where <img src="https://latex.codecogs.com/png.latex?f"> represents a rain function, <img src="https://latex.codecogs.com/png.latex?R(i+j)"> is a rain event on day <img src="https://latex.codecogs.com/png.latex?(i+j)">, and <img src="https://latex.codecogs.com/png.latex?c_j">, <img src="https://latex.codecogs.com/png.latex?l">, and <img src="https://latex.codecogs.com/png.latex?n"> are determined through regression analysis. In the context of Australia, the incorporation of rain events in the Fourier series function did not exert a significant impact, although it substantially reduced the error level of the estimated value (McCaskill, 1990a).</p>
<p>In the above equation 1, the effect of rainfall events is assumed to be additive. However, for certain regions, this rainfall event impact could be multiplicative.</p>
<p>In many cases, climate data is generally presented as monthly data, making analysis requiring daily data difficult to execute. Fourier series regression can also be used to generate daily climate data from average monthly climate data (Epstein, 1991). The equation is written as follows:</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-02.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-02.jpg" class="img-fluid"></a></p>
<p>where <img src="https://latex.codecogs.com/png.latex?t'%20=%20%5Cfrac%7B2%5Cpi%20t%7D%7B12%7D">, and <img src="https://latex.codecogs.com/png.latex?t"> is the month. This equation assumes an equal number of days in each month, which is not the case in reality. Therefore, to adjust it, the value of <img src="https://latex.codecogs.com/png.latex?t"> in the above equation is changed as the <img src="https://latex.codecogs.com/png.latex?m">-th day for the <img src="https://latex.codecogs.com/png.latex?T">-th month so that the value <img src="https://latex.codecogs.com/png.latex?t%20=%20(T-0.5)+%5Cfrac%7B(m-0.5)%7D%7BD%7D">, where <img src="https://latex.codecogs.com/png.latex?D"> is the number of days in month <img src="https://latex.codecogs.com/png.latex?T">. The use of equation 2 to create fitting lines for daily data is highly effective. The fitting lines composed from daily data and those derived from monthly data almost overlap.</p>
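<p>A NumPy sketch of equation 2 in practice: fit Fourier coefficients to twelve monthly means by least squares, then evaluate them at the day-level t values described above. The monthly temperatures and the number of harmonics are illustrative, not the Bogor data:</p>

```python
import numpy as np

def fourier_design(t, n_harmonics=2):
    """Design matrix for a Fourier series with a 12-month period."""
    tp = 2 * np.pi * np.asarray(t, float) / 12.0
    cols = [np.ones_like(tp)]
    for k in range(1, n_harmonics + 1):
        cols += [np.cos(k * tp), np.sin(k * tp)]
    return np.column_stack(cols)

# Illustrative Bogor-like monthly mean temperatures (deg C), t = month number
monthly_t = np.array([25.3, 25.5, 25.8, 26.2, 26.4, 26.0,
                      25.7, 25.9, 26.1, 26.3, 25.9, 25.5])
coef, *_ = np.linalg.lstsq(fourier_design(np.arange(1, 13)), monthly_t,
                           rcond=None)

# Daily evaluation with t = (T - 0.5) + (m - 0.5) / D for day m of month T
days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
t_day = np.concatenate([(T - 0.5) + (np.arange(1, D + 1) - 0.5) / D
                        for T, D in enumerate(days_in_month, start=1)])
daily_t = fourier_design(t_day) @ coef
```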
<p>For simulation purposes, an error component, <img src="https://latex.codecogs.com/png.latex?e(i)">, which has a normal distribution with a mean of 0 and a variance of <img src="https://latex.codecogs.com/png.latex?s%5E2">, is typically included. Thus, the data series generated by each simulation will differ but still reflect the seasonal diversity of the data. Errors in climate data simulation models often autocorrelate. Therefore, the error component can be modeled using a k-th order autocorrelation function (Wonnacott and Wonnacott, 1987):</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-03.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-03.jpg" class="img-fluid"></a></p>
<p>where <img src="https://latex.codecogs.com/png.latex?r"> is the correlation value and <img src="https://latex.codecogs.com/png.latex?w(i)"> is the random error (white noise). The simplest autocorrelation error function linearly connects the error on day <img src="https://latex.codecogs.com/png.latex?i"> with the error on day <img src="https://latex.codecogs.com/png.latex?i-1"> plus the random error on day <img src="https://latex.codecogs.com/png.latex?i"> (first-order autocorrelation function), namely:</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-04.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-04.jpg" class="img-fluid"></a></p>
<p>Therefore, if the value of <img src="https://latex.codecogs.com/png.latex?r"> is positive, the error on day <img src="https://latex.codecogs.com/png.latex?i"> tends to increase if the error on the previous day was high, and vice versa. Practically, the value of <img src="https://latex.codecogs.com/png.latex?r"> is always less than one, but its magnitude is unknown.</p>
</section>
<section id="implementation" class="level3">
<h3 class="anchored" data-anchor-id="implementation">4 Implementation</h3>
<p>In the implementation phase of this analysis, we used Python with the pandas, NumPy, SciPy, and Matplotlib libraries to develop a Fourier regression model that generates a daily time series from monthly temperature data.</p>
<section id="how-to" class="level4">
<h4 class="anchored" data-anchor-id="how-to">4.1 How-to?</h4>
<p>The step-by-step guide for the model is available in Google Colab, a convenient platform for data analysis and machine learning. It walks through the entire process: reshaping the data to ensure compatibility with the model, aggregating the daily data to monthly averages, assigning the monthly values across the corresponding days of each month, fitting the Fourier series and extracting its coefficients, estimating temperature from the Fourier coefficients, calculating the autocorrelated error, and producing the final error-adjusted temperature estimates.</p>
<p><strong>4.1.1 Reshape the data</strong></p>
<p>The first step in our analysis involves pre-processing and reshaping the data to fit the requirements of the subsequent statistical modeling. Our raw temperature data, originally in a CSV file, consists of daily temperature readings recorded over several years. In this data, dates are represented in a ‘YYYY-MM-DD’ format. However, for our analysis, we require the ‘day of the year’ and the ‘year’ as separate variables.</p>
<p>We start by loading the data into a Pandas DataFrame. Next, we convert the ‘date’ column into a datetime format using the pd.to_datetime() function, which facilitates date-specific manipulations. This allows us to extract the ‘day of the year’ and the ‘year’ information from each date and store these in new columns titled ‘dayofyear’ and ‘year’, respectively.</p>
<p>Since we have multiple temperature readings per day, we average these readings for each day of the year across all years. We do this by grouping the data by ‘dayofyear’ and ‘year’, and then calculating the mean temperature for each group using the groupby() and mean() functions.</p>
<p>However, this leaves us with a long format DataFrame, where each row represents a day of a particular year. For easier visualization and modeling, we convert this into a wide format DataFrame, where each column represents a year and each row represents a day of the year. This transformation is performed using the unstack() function.</p>
<p>Lastly, we reset the DataFrame index for neatness and compatibility with future operations. The resulting DataFrame is saved into a new CSV file. This reshaping of data forms the foundation for our Fourier regression model and helps ensure accuracy and efficiency in the subsequent analysis.</p>
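<p>A minimal sketch of this reshaping step is shown below. The column names ("date", "temperature") and the synthetic input frame are assumptions for illustration; the original notebook reads station records from a CSV file instead.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical daily records standing in for the original CSV input
# (e.g. pd.read_csv("temperature_daily.csv") with 'date' and 'temperature')
dates = pd.date_range("1986-01-01", "1987-12-31", freq="D")
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "date": dates,
    "temperature": 26 + 2 * np.sin(2 * np.pi * dates.dayofyear / 365)
                   + rng.normal(0, 0.5, len(dates)),
})

# Extract 'dayofyear' and 'year' from the datetime column
df["dayofyear"] = df["date"].dt.dayofyear
df["year"] = df["date"].dt.year

# Average multiple readings per (dayofyear, year), then pivot long -> wide:
# each column becomes a year, each row a day of the year
daily_mean = df.groupby(["dayofyear", "year"])["temperature"].mean()
wide = daily_mean.unstack("year").reset_index()

# wide.to_csv("temperature_daily_wide.csv", index=False)
```

The same pattern (groupby, mean, unstack, reset_index) applies to any number of years; leap days simply leave NaN in the non-leap year columns.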
<p>The code above will produce an output preview like the one shown below.</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-05.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-5"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-05.jpg" class="img-fluid"></a></p>
<p><strong>Table 1.</strong> Reshape data from long to wide format</p>
<p><strong>4.1.2 Daily to Monthly</strong></p>
<p>In addition to the daily analysis, we decided to explore the temperature trends on a monthly basis. The process for reshaping the data for monthly temperature averages mirrors the daily approach.</p>
<p>First, we load the raw temperature data from a CSV file into a Pandas DataFrame and convert the ‘date’ column to a datetime format. With the datetime format, we’re able to extract the ‘month’ and ‘year’ from each date, creating new columns for each.</p>
<p>As with the daily analysis, we handle multiple temperature readings per day by averaging these for each month of each year. We achieve this by grouping the data by ‘month’ and ‘year’, then calculating the mean temperature for each group.</p>
<p>To facilitate further analysis and visualization, we convert this long format DataFrame to a wide format DataFrame, with each column representing a year and each row representing a month. This is done using the unstack() function.</p>
<p>After resetting the DataFrame index for better data structure, we save the resulting DataFrame into a new CSV file. This CSV file contains average monthly temperatures over the years and will be useful for understanding broader temperature trends and providing context to our Fourier regression model.</p>
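<p>A sketch of the monthly aggregation, again on a small synthetic frame (the column names and data are assumptions; the original works from the same daily CSV as the previous step):</p>

```python
import numpy as np
import pandas as pd

# Hypothetical daily data standing in for the original CSV input
dates = pd.date_range("1986-01-01", "1987-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": dates,
    "temperature": 26 + 2 * np.sin(2 * np.pi * dates.dayofyear / 365)
                   + rng.normal(0, 0.5, len(dates)),
})

# Extract 'month' and 'year' from the datetime column
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

# Monthly mean per (month, year), pivoted so each column is a year
monthly = df.groupby(["month", "year"])["temperature"].mean()
monthly_wide = monthly.unstack("year").reset_index()

# monthly_wide.to_csv("temperature_monthly_wide.csv", index=False)
```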
<p>The code above will produce an output preview like the one shown below.</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-06.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-6"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-06.jpg" class="img-fluid"></a></p>
<p><strong>Table 2.</strong> Monthly average of temperature</p>
<p><strong>4.1.3 Assigning monthly data across the corresponding days of the month</strong></p>
<p>In order to prepare our dataset for Fourier regression modeling, we need to map the average monthly temperature values to their corresponding days of the year. This step is crucial as it enables the creation of a continuous time series from the previously calculated monthly averages.</p>
<p>We begin this process by defining the number of days in each month, differentiating between leap and non-leap years. Then, we create a new DataFrame, dayofyear_df, with a ‘dayofyear’ column that sequentially enumerates each day of the year from 1 to 366. A binary ‘leap’ column is also added to indicate if the day corresponds to a leap year.</p>
<p>To map the ‘dayofyear’ to the corresponding month, we create a ‘month’ column using np.repeat() to repeat the month index according to the number of days in each month. This column is then adjusted for non-leap years.</p>
<p>The average monthly temperatures, stored in monthly_avg_df, are merged with the dayofyear_df DataFrame, repeating each monthly average across the corresponding days of the month. As a result, we obtain a DataFrame with daily granularity, which contains the corresponding average monthly temperature for each day.</p>
<p>We then handle the 366th day of non-leap years, setting the temperature to NaN, as it doesn’t exist in those years.</p>
<p>Finally, we remove the unnecessary ‘month’ and ‘leap’ columns, reset the index, and save this DataFrame into a new CSV file. This final, reshaped DataFrame serves as our input for the Fourier regression modeling, enabling us to predict temperatures at a daily level from average monthly temperatures.</p>
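<p>The day-of-year/month mapping can be sketched as follows. For brevity this version handles a single year of monthly averages on the 366-day (leap) frame; the monthly values are invented for illustration, and the original merges a multi-year <code>monthly_avg_df</code> instead.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical monthly averages for one year (values are illustrative)
monthly_avg = pd.DataFrame({
    "month": np.arange(1, 13),
    "tavg": [26.1, 26.3, 26.6, 26.9, 26.8, 26.2,
             25.8, 25.9, 26.0, 26.4, 26.5, 26.2],
})

# Days per month on the 366-day leap-year frame used in the post
days_in_month_leap = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

dayofyear_df = pd.DataFrame({"dayofyear": np.arange(1, 367)})
# Map each day of the year to its month by repeating month indices
dayofyear_df["month"] = np.repeat(np.arange(1, 13), days_in_month_leap)

# Broadcast each monthly average across the days of that month
daily_from_monthly = dayofyear_df.merge(monthly_avg, on="month", how="left")
```

For non-leap years, the post additionally sets day 366 to NaN and shifts the month mapping by one day from March onward.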
<p>The code above will produce an output preview like the one shown below.</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-07.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-7"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-07.jpg" class="img-fluid"></a></p>
<p><strong>Table 3.</strong> Assigning monthly data into daily</p>
<p><strong>4.1.4 Fourier series modeling and coefficient extraction</strong></p>
<p>The next step in the analysis process involves fitting a Fourier series to our daily temperature data. The Fourier series is a mathematical tool used for analyzing periodic functions, making it suitable for modeling periodic patterns in weather data like temperature.</p>
<p>To begin, we first load the reshaped DataFrame containing the daily average temperatures. Next, we define a Fourier function, specifying the form it should take. The function is expressed in terms of trigonometric terms (cosine and sine functions) and includes coefficients that we aim to estimate (a0, a1, b1, a2, b2).</p>
<p>To perform this estimation, we iterate over each year in the DataFrame. For each year, we calculate new variables ‘T’, ‘m’, ‘D’, and ‘t’. These variables represent respectively the month, the day of the month, the number of days in the month, and a transformed time index (where each month is considered as a unit time interval). We exclude data points with NaN or infinite values.</p>
<p>We then utilize the curve_fit function from the scipy.optimize module to fit the Fourier function to the temperature data for each year. This function returns the optimal values for the coefficients a0, a1, b1, a2, and b2 that best fit the data.</p>
<p>In cases where there’s insufficient data to fit the Fourier series, we handle the errors and assign NaN values to the coefficients for that year.</p>
<p>Once we obtain the coefficients for each year, we save this data into a new CSV file. This file will then be used to generate our Fourier regression model and perform temperature estimation. The generated Fourier coefficients provide insights into the amplitude and phase of the cyclical patterns in the temperature data.</p>
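<p>A sketch of the fit for a single year is shown below, using the second-order Fourier form and the transformed time index t = (T − 0.5) + (m − 0.5)/D described earlier. The synthetic "observed" series and its true coefficients are assumptions; the original loops this fit over every year in the reshaped DataFrame.</p>

```python
import numpy as np
from scipy.optimize import curve_fit

# Second-order Fourier series in the transformed time index t
def fourier(t, a0, a1, b1, a2, b2):
    w = 2 * np.pi * t / 12  # one cycle per 12 "month units"
    return (a0 + a1 * np.cos(w) + b1 * np.sin(w)
            + a2 * np.cos(2 * w) + b2 * np.sin(2 * w))

# Build t = (T - 0.5) + (m - 0.5) / D for one non-leap year:
# T = month, m = day of month, D = days in month
days_in_month = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
T = np.repeat(np.arange(1, 13), days_in_month)
m = np.concatenate([np.arange(1, d + 1) for d in days_in_month])
D = np.repeat(days_in_month, days_in_month)
t = (T - 0.5) + (m - 0.5) / D

# Hypothetical daily temperatures generated from known coefficients + noise
rng = np.random.default_rng(1)
temp = fourier(t, 26.0, 1.2, -0.8, 0.3, 0.1) + rng.normal(0, 0.2, t.size)

# Drop NaN/inf points before fitting, as in the post
mask = np.isfinite(temp)
coeffs, _ = curve_fit(fourier, t[mask], temp[mask])
a0, a1, b1, a2, b2 = coeffs
```

Because the model is linear in the coefficients, <code>curve_fit</code> converges reliably here even from its default starting values.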
<p>The code above will produce an output preview like the one shown below.</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-08.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-8"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-08.jpg" class="img-fluid"></a></p>
<p><strong>Table 4.</strong> Fourier coefficient</p>
<p><strong>4.1.5 Temperature estimation using Fourier coefficient</strong></p>
<p>Having determined the coefficients of the Fourier series for each year, we can now use these coefficients to generate temperature estimates. This step entails constructing a time series model for the daily temperatures based on the Fourier series.</p>
<p>We start by loading the DataFrame that contains the Fourier coefficients for each year. These coefficients were calculated in the previous step and are used to define the form of the Fourier series for each year.</p>
<p>Our next task is to create a new DataFrame, ‘temp_estimates’, to store our estimated temperatures. This DataFrame is initially populated with a ‘dayofyear’ column, containing each day of the year (from 1 to 366).</p>
<p>We then iterate over each year in our coefficients DataFrame. For each year, we create a separate DataFrame ‘year_df’ and calculate the transformed time index ‘t’ just as we did when fitting the Fourier series. This time index is used as the input to our Fourier function.</p>
<p>Next, we use our Fourier function, along with the coefficients for the current year, to calculate the estimated temperature for each day of that year. These estimated temperatures are then added as a new column in the ‘year_df’ DataFrame, with the column name being the current year.</p>
<p>We repeat this process for all years in our dataset, merging the temperature estimates for each year into the ‘temp_estimates’ DataFrame.</p>
<p>Finally, we save these temperature estimates to a new CSV file. The end result of this process is a DataFrame that provides a day-by-day estimate of the temperature for each year based on the Fourier regression model. These estimates serve as the basis for our subsequent analysis and allow us to visualize and quantify the cyclical patterns present in the temperature data.</p>
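<p>This estimation step can be sketched as below. The coefficient values for the two years are invented stand-ins for the CSV produced in the previous step; the Fourier function and t index follow the definitions used earlier in the post.</p>

```python
import numpy as np
import pandas as pd

def fourier(t, a0, a1, b1, a2, b2):
    w = 2 * np.pi * t / 12
    return (a0 + a1 * np.cos(w) + b1 * np.sin(w)
            + a2 * np.cos(2 * w) + b2 * np.sin(2 * w))

# Hypothetical per-year coefficients standing in for the saved CSV
coef_df = pd.DataFrame({"year": [1986, 1987],
                        "a0": [26.0, 26.2], "a1": [1.2, 1.1],
                        "b1": [-0.8, -0.7], "a2": [0.3, 0.2], "b2": [0.1, 0.1]})

# Transformed time index t for a non-leap year, as in the fitting step
days_in_month = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
T = np.repeat(np.arange(1, 13), days_in_month)
m = np.concatenate([np.arange(1, d + 1) for d in days_in_month])
D = np.repeat(days_in_month, days_in_month)
t = (T - 0.5) + (m - 0.5) / D

# One estimate column per year, keyed by day of year
temp_estimates = pd.DataFrame({"dayofyear": np.arange(1, t.size + 1)})
for _, row in coef_df.iterrows():
    temp_estimates[int(row["year"])] = fourier(
        t, row["a0"], row["a1"], row["b1"], row["a2"], row["b2"])
```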
<p>The code above will produce an output preview like the one shown below.</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-09.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-9"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-09.jpg" class="img-fluid"></a></p>
<p><strong>Table 5.</strong> Temperature estimates</p>
<p><strong>4.1.6 Autocorrelated error calculation</strong></p>
<p>This code calculates the autocorrelated error between the observed temperature and the estimated temperature from the Fourier model for each year, and stores the errors in a dataframe.</p>
<p>Firstly, we load the wide-format data and the estimated temperature data. Then, we specify an autocorrelation factor (r), which is a parameter that describes the correlation between values of the error at different points in time.</p>
<p>We loop over each year from 1986 to 2022, and for each year we:</p>
<p>Calculate the difference between the observed and estimated temperatures to get the error.</p>
<p>Generate a sequence of random numbers from a normal distribution, called white noise.</p>
<p>Compute the autocorrelated error. The error for the first day is simply the white noise, and for each subsequent day, the error is the autocorrelation factor multiplied by the previous day’s error, plus the white noise for that day.</p>
<p>Finally, we create a DataFrame from the dictionary of autocorrelated errors, and save it to a CSV file.</p>
<p>This autocorrelated error represents the error in our model’s estimate that cannot be explained by the model itself, but rather depends on previous errors. This could be due to factors that we did not include in our model, such as atmospheric conditions or climate change. By including this autocorrelation in our analysis, we can better understand and model these unexplained variations in temperature.</p>
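<p>The steps above can be sketched for a single year as follows. The autocorrelation factor r = 0.7, the synthetic observed/estimated series, and the choice of scaling the white noise by the residual standard deviation are all illustrative assumptions.</p>

```python
import numpy as np

r = 0.7  # assumed autocorrelation factor (illustrative)
rng = np.random.default_rng(7)

# Hypothetical observed and estimated daily series for one year
n = 365
signal = 26 + 1.5 * np.sin(2 * np.pi * np.arange(n) / 365)
observed = signal + rng.normal(0, 0.3, n)
estimated = signal

# 1. Difference between observed and estimated temperatures
residual = observed - estimated

# 2. White noise drawn from a normal distribution
#    (scaled here by the residual spread -- an assumption)
white_noise = rng.normal(0, residual.std(), n)

# 3. First-order autocorrelated error:
#    e[0] = w[0];  e[i] = r * e[i-1] + w[i]
error = np.empty(n)
error[0] = white_noise[0]
for i in range(1, n):
    error[i] = r * error[i - 1] + white_noise[i]
```

By construction, consecutive errors are positively correlated, mimicking the serial dependence described in the theory section.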
<p>The code above will produce an output preview like the one shown below.</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-10.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-10"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-10.jpg" class="img-fluid"></a></p>
<p><strong>Table 6.</strong> Autocorrelated error</p>
<p><strong>4.1.7 Final estimates adjusted temperature</strong></p>
<p>We integrated the autocorrelated error into our temperature estimates to generate a more refined model of temperature estimation. With this data in place, we transformed our wide-format data into a long-format data frame. Each row of this data frame represented a specific day from a specific year, containing information on the date, observed temperature, estimated temperature, error, and the adjusted estimated temperature (estimate + error).</p>
<p>This transformed format provided us with a holistic and granular view of our data, suitable for subsequent detailed analyses. Once the transformation was complete, the data was saved into a CSV file, enabling easy access for further research or data visualization tasks.</p>
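<p>The wide-to-long transformation and the error adjustment can be sketched like this. The three wide-format frames are synthetic stand-ins for the observed, estimated, and error CSVs built in the earlier steps.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format frames (one column per year)
days = np.arange(1, 366)
years = [1986, 1987]
rng = np.random.default_rng(3)

observed = pd.DataFrame({"dayofyear": days})
estimated = pd.DataFrame({"dayofyear": days})
errors = pd.DataFrame({"dayofyear": days})
for y in years:
    estimated[y] = 26 + 1.5 * np.sin(2 * np.pi * days / 365)
    errors[y] = rng.normal(0, 0.3, days.size)
    observed[y] = estimated[y] + rng.normal(0, 0.3, days.size)

# Wide -> long: one row per (dayofyear, year)
def to_long(df, name):
    return df.melt(id_vars="dayofyear", var_name="year", value_name=name)

long_df = (to_long(observed, "observed")
           .merge(to_long(estimated, "estimated"), on=["dayofyear", "year"])
           .merge(to_long(errors, "error"), on=["dayofyear", "year"]))

# Adjusted estimate = Fourier estimate + autocorrelated error
long_df["adjusted"] = long_df["estimated"] + long_df["error"]

# long_df.to_csv("temperature_final_adjusted.csv", index=False)
```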
<p>The code above will produce an output preview like the one shown below.</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-11.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-11"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-11.jpg" class="img-fluid"></a></p>
<p><strong>Table 7.</strong> Final adjusted temperature</p>
</section>
<section id="jupyter-notebook" class="level4">
<h4 class="anchored" data-anchor-id="jupyter-notebook">4.2 Jupyter Notebook</h4>
<p>The full notebook shows how monthly temperature data can be used to generate a daily temperature time series with Fourier regression, with each step described above implemented in code: <a href="https://gist.github.com/bennyistanto/a9e6045a78b230dbd5c443a0e0e4fa41" class="uri">https://gist.github.com/bennyistanto/a9e6045a78b230dbd5c443a0e0e4fa41</a></p>
</section>
</section>
<section id="results" class="level3">
<h3 class="anchored" data-anchor-id="results">5 Results</h3>
<p>The graphical visualization of the estimated daily temperature against the observed temperature provided a robust means of evaluating the efficacy of the Fourier model across the study period (1986-2021) at the Bogor Climatological Station. The estimated temperature, generated from the Fourier model, was superimposed onto a scatter plot of the observed temperatures. The latter were smoothed using the nonparametric LOESS technique to discern major trends within the data.</p>
<p>Each subplot delineated a separate year’s worth of data, allowing for an insightful year-to-year examination of the model’s performance. The observed temperatures were presented as scatter points, with the LOESS smoothing line capturing the general pattern of the temperature across different days of the year.</p>
<p>The comparison between the observed temperature trends and the estimates from the Fourier model revealed a substantial degree of congruence, indicating the model’s reliability in predicting daily temperature patterns. The Fourier model demonstrated a commendable ability to generate daily temperature estimates from monthly data. This affirms the model’s utility in climatological studies, particularly when granular daily data are not readily available.</p>
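<p>A minimal sketch of this comparison plot is shown below for one synthetic year. The original uses LOESS smoothing; a centered rolling mean stands in for it here, and the data, window length, and file name are assumptions.</p>

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

# Hypothetical one-year observed vs. estimated series
days = np.arange(1, 366)
rng = np.random.default_rng(5)
estimated = 26 + 1.5 * np.sin(2 * np.pi * days / 365)
observed = estimated + rng.normal(0, 0.4, days.size)

# Rolling mean as a stand-in for the LOESS smoother used in the post
smoothed = pd.Series(observed).rolling(31, center=True, min_periods=1).mean()

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(days, observed, s=6, alpha=0.4, label="Observed")
ax.plot(days, smoothed, label="Smoothed observed (rolling mean)")
ax.plot(days, estimated, label="Fourier estimate")
ax.set_xlabel("Day of year")
ax.set_ylabel("Temperature (°C)")
ax.legend()
fig.savefig("fourier_vs_observed.png", dpi=100)
```

In the original analysis, one such panel is drawn per year (1986–2021) to allow year-by-year inspection of the fit.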
<p>The code above produces the chart shown below.</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-13.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-12"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-13.jpg" class="img-fluid"></a></p>
<p><strong>Picture 1.</strong> Observed temperature vs Estimated temperature, year-by-year</p>
<p>In this exercise, we implemented a method to compute autocorrelated errors between observed and estimated temperatures from 1986 to 2022. We began by loading our dataset, after which we defined an autocorrelation factor, a parameter that reveals the correlation between different points in time within the error series.</p>
<p>Each year’s temperature discrepancy was calculated and white noise, a random sequence derived from a normal distribution, was added. We then introduced the autocorrelation factor into the errors, with the first day’s error being only the white noise. The error for subsequent days factored in both the white noise and a portion of the previous day’s error, weighted by the autocorrelation factor.</p>
<p>Post computation, we organized the autocorrelated errors into a dictionary, subsequently transforming it into a DataFrame for further analysis. This dataset of autocorrelated errors presents the unexplained variance within our model, potentially stemming from unaccounted factors such as climate changes or certain atmospheric conditions.</p>
<p>Finally, we integrated these autocorrelated errors with our estimated temperatures, resulting in an adjusted and more refined temperature prediction. This revised dataset was then visualized in a time series plot, allowing for a comparative analysis between observed and adjusted estimated temperatures. The plot revealed a clear upward trend in the temperature at the Bogor Climatological Station from 1986 to 2021. Notably, the application of autocorrelated errors offered an excellent fit to the observed temperatures, thus affirming the effectiveness of our model.</p>
<p>The code above produces the chart shown below.</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-14.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-13"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-14.jpg" class="img-fluid"></a></p>
<p><strong>Picture 2.</strong> Observed temperature vs Estimated temperature with error</p>
<p>The Fourier model’s estimates were further scrutinized by integrating autocorrelated errors into the calculations. This facilitated the generation of a modified temperature prediction that comprised the original estimate and the error. Upon examination, the plots vividly displayed a comprehensive juxtaposition of this adjusted forecast against the observed data for each year from 1986 to 2021.</p>
<p>Notably, the error-adjusted estimates, visualized through LOESS-smoothed lines, revealed minor disparities compared to the original model predictions. These charts underscored the pertinence of accommodating inherent model errors and substantiated the robustness of the Fourier model’s initial estimates. The insights derived from this comparison can guide further refinements to the model for superior accuracy in future temperature estimates.</p>
<p>The code above produces the chart shown below.</p>
<p><a href="../assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-12.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-14"><img src="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-12.jpg" class="img-fluid"></a></p>
<p><strong>Picture 3.</strong> Observed temperature vs Estimated temperature with error, year-by-year</p>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">6 Conclusion</h3>
<p>The application of the Fourier regression model for generating daily time series data from monthly temperature observations has demonstrated considerable efficacy in climatological studies. This modeling approach provides a mathematically rigorous way to interpolate intra-monthly fluctuations, leveraging periodicity inherent in annual temperature patterns. The model is thus capable of filling data gaps and offering granular insights into day-to-day temperature variations, a granularity that monthly data alone cannot provide.</p>
<p>Notably, the model’s provision for the autocorrelation of errors adds an additional layer of realism to the estimations, acknowledging the dependence of errors on preceding values. This factor makes the model responsive to the serial correlation often seen in climatic data, enhancing its predictive capabilities.</p>
<p>In conclusion, Fourier regression modeling serves as an invaluable tool for climatologists, offering an effective means of generating daily time series data from sparse or aggregated observations. Through its utilization, it is possible to acquire more detailed insights into temperature dynamics, paving the way for refined climate studies, policy formulation, and mitigation strategies against climatic anomalies. The model’s robustness, flexibility, and accommodating nature towards error correlation further enhance its applicability, making it a staple in the data-driven examination of climate patterns.</p>
</section>
<section id="references" class="level3">
<h3 class="anchored" data-anchor-id="references">7 References</h3>
<p>Epstein, E.S. 1991. On obtaining daily climatological values from monthly means. J. Climate 4:365-368. <a href="https://doi.org/10.1175/1520-0442(1991)004%3C0365:OODCVF%3E2.0.CO;2" class="uri">https://doi.org/10.1175/1520-0442(1991)004%3C0365:OODCVF%3E2.0.CO;2</a></p>
<p>Boer, R., Notodipuro, K.A., Las, I. 1999. Prediction of Daily Rainfall Characteristics from Monthly Climate Indices. RUT-IV report. National Research Council, Indonesia.</p>
<p>Castañeda-Miranda, A., Icaza-Herrera, M. de, &amp; Castaño, V. M. (2019). Meteorological Temperature and Humidity Prediction from Fourier-Statistical Analysis of Hourly Data. Advances in Meteorology, 2019, 1–13. <a href="https://doi.org/10.1155/2019/4164097" class="uri">https://doi.org/10.1155/2019/4164097</a></p>
<p>Hernández-Bedolla, J.; Solera, A.; Paredes-Arquiola, J.; Sanchez-Quispe, S.T.; Domínguez-Sánchez, C. A Continuous Multisite Multivariate Generator for Daily Temperature Conditioned by Precipitation Occurrence. Water 2022, 14, 3494. <a href="https://doi.org/10.3390/w14213494" class="uri">https://doi.org/10.3390/w14213494</a></p>
<p>McCaskill, M.R. 1990. An efficient method for generation of full climatological records from daily rainfall. Australian Journal of Agricultural Research 41, 595-602. <a href="https://doi.org/10.1071/AR9900595" class="uri">https://doi.org/10.1071/AR9900595</a></p>
<p>Parra-Plazas, J., Gaona-Garcia, P. &amp; Plazas-Nossa, L. Time series outlier removal and imputing methods based on Colombian weather stations data. Environ Sci Pollut Res 30, 72319–72335 (2023). <a href="https://doi.org/10.1007/s11356-023-27176-x" class="uri">https://doi.org/10.1007/s11356-023-27176-x</a></p>
<p>Srikanthan, R., &amp; McMahon, T. A. (2001). Stochastic generation of annual, monthly and daily climate data: A review. Hydrology and Earth System Sciences Discussions, 5(4), 653-670. <a href="https://doi.org/10.5194/hess-5-653-2001" class="uri">https://doi.org/10.5194/hess-5-653-2001</a></p>
<p>Stern, R. D., &amp; Coe, R. (1984). A model fitting analysis of daily rainfall data. Journal of the Royal Statistical Society. Series A (General), 147(1), 1-34. <a href="https://doi.org/10.2307/2981736" class="uri">https://doi.org/10.2307/2981736</a></p>
<p>Wonnacott, T.H. and R.J. Wonnacott. 1987. Regression: A Second Course in Statistics. Robert E. Krieger Publishing Co., Florida.</p>


</section>

]]></description>
  <category>Research</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data.html</guid>
  <pubDate>Sun, 02 Jul 2023 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20230702-fourier-regression-model-to-generate-monthly-to-daily-temperature-data-01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Regression analysis with dummy variables</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230616-regression-analysis-with-dummy-variables.html</link>
  <description><![CDATA[ 





<p>This exercise aims to determine the best reduced model (RM) in regression analysis with dummy variables from annual rainfall data and altitude data in three different regions. This will result in a new regression equation capable of describing the relationship between altitude and rainfall in these three regions.</p>
<section id="summary" class="level3">
<h3 class="anchored" data-anchor-id="summary">Summary</h3>
<p>The dummy variables constructed in this article are based on regional location, specifically regions 1, 2, and 3. The initial analysis entailed the representation of data for each region through scatter plots. Thereafter, a regression analysis with all parameters was conducted to derive the Full Model (FM). Subsequently, the scatter plot patterns for each region were examined, and regression equation models with identical intercepts or slopes were identified. The objective was to generate simpler regression models or equations (Reduced Models - RM) from the dummy variables constructed based on regional location. Upon obtaining several RMs, all were statistically tested using the F-test to ascertain their similarity to the FM. It was also necessary to compute and analyze the Mallows’s Cp value for all RMs to determine the optimal RM. A good Reduced Model is one that is similar or identical to the Full Model. The F-test performed in this report was designed to determine whether the RM is similar or identical to the FM. The hypotheses for the F-test were as follows.</p>
<p>H0: FM = RM</p>
<p>H1: FM ≠ RM</p>
<p>The null hypothesis is rejected when the observed F-value exceeds the F-table value, meaning the Reduced Model (RM) cannot be considered equivalent to the Full Model (FM) (Kutner et al., 2005). The F-test in this analysis uses a 95% confidence level. The observed F-value is calculated with the following formulation (Kutner et al., 2005):</p>
<p><img src="https://latex.codecogs.com/png.latex?F_%7B%5Ctext%7BObserved%7D%7D%20=%20%5Cfrac%7B(SSR_%7BFM%7D%20-%20SSR_%7BRM%7D)/(df_%7BR,FM%7D%20-%20df_%7BR,RM%7D)%7D%7BSSE_%7BFM%7D/df_%7BE,FM%7D%7D"></p>
<p>The F-table value is derived from the F-distribution with the calculated degrees of freedom</p>
<p><img src="https://latex.codecogs.com/png.latex?F_%7B%5Ctext%7Btable%7D%7D%20=%20F(df_%7BR,FM%7D%20-%20df_%7BR,RM%7D,%20df_%7BE,FM%7D)"></p>
<p>The RM is considered as efficient or akin to the FM if the Mallows’s Cp value is equal to or less than the total number of parameters (<img src="https://latex.codecogs.com/png.latex?C_%7BP,%5Ctext%7BMallows%7D%7D%20%5Cleq%20p">) (Mallows, 1973). The Cp value is determined using the equation:</p>
<p><img src="https://latex.codecogs.com/png.latex?C_%7BP,%5Ctext%7BMallows%7D%7D%20=%20p%20+%20%5Cfrac%7B(S%5E2%20-%20%5Csigma%5E2)(n%20-%20p)%7D%7B%5Csigma%5E2%7D"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?p"> represents the number of parameters utilized in the RM, <img src="https://latex.codecogs.com/png.latex?n"> denotes the total number of observations within the model (<img src="https://latex.codecogs.com/png.latex?n=45">), <img src="https://latex.codecogs.com/png.latex?S%5E2"> is the variance of the RM, and <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2"> is the variance of the FM.</p>
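<p>The F-test and Mallows's Cp comparison above can be sketched numerically. The sums of squares, degrees of freedom, and parameter counts below are made-up fit summaries purely for illustration; only n = 45 comes from the text.</p>

```python
from scipy.stats import f as f_dist

n = 45  # observations, as stated in the text

# Hypothetical fit summaries for a Full Model (6 params) and Reduced Model (3)
sse_fm, df_e_fm = 1200.0, n - 6
sse_rm, df_e_rm = 1350.0, n - 3
ssr_fm, df_r_fm = 5400.0, 5
ssr_rm, df_r_rm = 5250.0, 2

# Observed F: drop in regression SS per dropped parameter,
# scaled by the Full Model's error mean square
f_observed = ((ssr_fm - ssr_rm) / (df_r_fm - df_r_rm)) / (sse_fm / df_e_fm)

# Critical value at the 95% level
f_table = f_dist.ppf(0.95, df_r_fm - df_r_rm, df_e_fm)

# Mallows's Cp: the RM is considered efficient when Cp <= p
p = 3                        # parameters in the RM
s2 = sse_rm / df_e_rm        # RM variance
sigma2 = sse_fm / df_e_fm    # FM variance
cp = p + (s2 - sigma2) * (n - p) / sigma2
```

With these illustrative numbers, f_observed is below f_table (fail to reject H0: FM = RM), while the Cp criterion would still flag the RM as slightly inefficient (Cp &gt; p) — the two criteria need not agree.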
</section>
<section id="data" class="level3">
<h3 class="anchored" data-anchor-id="data">Data</h3>
<p>Total rainfall in different altitude and region. The data available in csv format with columns: altitude; rainfall; region</p>
<p>The data for this analysis is available from this link: <a href="https://drive.google.com/file/d/1v3CGHBykg3UUqjKS3oyy8rIGsogN1DY5/view?usp=sharing" class="uri">https://drive.google.com/file/d/1v3CGHBykg3UUqjKS3oyy8rIGsogN1DY5/view?usp=sharing</a></p>
</section>
<section id="implementation" class="level3">
<h3 class="anchored" data-anchor-id="implementation">Implementation</h3>
<p>In the implementation phase of this analysis, we used Python with the pandas, matplotlib, seaborn, and scikit-learn libraries to develop a regression with dummy variables.</p>
<section id="plot-the-input-data" class="level4">
<h4 class="anchored" data-anchor-id="plot-the-input-data">Plot the input data</h4>
<p>The code presented aims to investigate the relationship between rainfall and altitude across different regions. The dataset, obtained from a CSV file, contains information on rainfall and altitude for various regions. The code utilizes the <code>pandas</code> library to read the data and <code>matplotlib</code> and <code>seaborn</code> libraries for data visualization.</p>
<p>To begin, unique regions in the dataset are identified. A dictionary, <code>regression_params</code>, is created to store the coefficients of the regression equations for each region. Subsequently, a scatter plot is generated for each region, where altitude is plotted on the x-axis and rainfall on the y-axis. This is achieved using the <code>sns.scatterplot</code> function from the <code>seaborn</code> library.</p>
<p>A linear regression model is then fitted to the data for each region using the <code>LinearRegression</code> class from the <code>sklearn.linear_model</code> module. The model is trained with altitude as the predictor variable (<code>X</code>) and rainfall as the target variable (<code>y</code>). The slope and intercept coefficients of the regression equation are obtained from the fitted model.</p>
<p>The regression coefficients are stored in the <code>regression_params</code> dictionary, associating them with their respective regions. Additionally, the regression equation is displayed on the plot for each region using the <code>ax.text</code> function. A regression line is drawn on the plot using the <code>ax.plot</code> function to visualize the relationship between altitude and rainfall.</p>
<p>The resulting plot showcases the rainfall-altitude relationship for different regions, with each region’s data points, regression line, and equation displayed. The plot is saved as an image file, and the figure is displayed for further examination.</p>
<p>Finally, the regression parameters for each region are printed to provide insights into the specific regression equations obtained. The slope and intercept values are extracted from the <code>regression_params</code> dictionary and displayed for each region.</p>
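<p>A minimal sketch of the workflow just described, using synthetic stand-in data so the example is self-contained (plain Matplotlib scatter is used here in place of <code>sns.scatterplot</code>, and the true slopes and intercepts are illustrative, not the post's data):</p>

```python
# Sketch of the per-region regression plot described above.
# Synthetic data stands in for the CSV (columns: altitude; rainfall; region).
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
frames = []
for region, (true_slope, true_intercept) in {1: (2.4, 1000), 2: (2.0, 1100), 3: (0.7, 600)}.items():
    alt = rng.uniform(0, 500, 15)
    rain = true_intercept + true_slope * alt + rng.normal(0, 50, 15)
    frames.append(pd.DataFrame({"altitude": alt, "rainfall": rain, "region": region}))
df = pd.concat(frames, ignore_index=True)

regression_params = {}
fig, ax = plt.subplots()
for region, sub in df.groupby("region"):
    ax.scatter(sub["altitude"], sub["rainfall"], label=f"Region {region}")
    # Fit altitude (X) against rainfall (y) for this region.
    model = LinearRegression().fit(sub[["altitude"]], sub["rainfall"])
    slope, intercept = model.coef_[0], model.intercept_
    regression_params[region] = (slope, intercept)
    # Draw the regression line and annotate its equation.
    xs = np.linspace(sub["altitude"].min(), sub["altitude"].max(), 50)
    ax.plot(xs, intercept + slope * xs)
    ax.text(xs[-1], intercept + slope * xs[-1], f"y = {slope:.2f}x + {intercept:.2f}")
ax.set_xlabel("Altitude (m)")
ax.set_ylabel("Rainfall (mm)")
ax.legend()
fig.savefig("regression_by_region.png", dpi=150)

for region, (slope, intercept) in regression_params.items():
    print(f"For region {region}, the regression equation is y = {slope:.2f}x + {intercept:.2f}")
```
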
<p><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-11.jpg" class="img-fluid"></p>
<p><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-01.jpg" class="img-fluid" alt="For region 1, the regression equation is y = 2.43x + 996.53 For region 2, the regression equation is y = 1.99x + 1096.34 For region 3, the regression equation is y = 0.70x + 623.63"></p>
<p>For region 1, the regression equation is y = 2.43x + 996.53; for region 2, y = 1.99x + 1096.34; for region 3, y = 0.70x + 623.63.</p>
<p>The provided code snippet focuses on data preprocessing and feature creation based on the information in a CSV file. It employs the pandas library for data manipulation and transformation.</p>
<p>Initially, the CSV file is read into a DataFrame using the pd.read_csv function, with the resulting DataFrame stored as df.</p>
<p>Next, several new columns are created based on the region column. These new columns serve as indicator variables to represent different regions in the dataset. Specifically, columns I1, I2, and I3 are generated using logical comparisons to check if the region value matches the respective region number. The astype(int) method is then applied to convert the resulting Boolean values to integers.</p>
<p>Similarly, additional columns H1, H2, and H3 are created by multiplying the altitude column with the corresponding indicator variables (I1, I2, and I3). This results in the creation of separate altitude columns for each region, where the altitude values are present only for the respective region and are set to zero for other regions.</p>
<p>Following this, combinations of indicator variables are generated to represent different combinations of regions. Columns I12, I13, I23, and I123 are created using logical comparisons to check if the region value matches the respective region combination. Column I123 is assigned a constant value of 1 since it represents the inclusion of all regions.</p>
<p>Similarly, new altitude columns H12, H13, H23, and H123 are created by multiplying the altitude column with the respective combination indicator variables. These columns enable the representation of altitude values for specific region combinations.</p>
<p>Lastly, the modified DataFrame is saved as a new CSV file using the to_csv function, with the file path specified and the separator set to ;. The resulting DataFrame is displayed using the df.head() method to show the first few rows of the transformed dataset.</p>
<p>In summary, this code segment demonstrates a data preprocessing step where new columns are created to represent regions and region combinations based on the original data. These transformations facilitate subsequent analysis and modeling tasks by providing a more informative and structured dataset.</p>
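<p>The preprocessing steps above can be sketched as follows; the small inline DataFrame is a stand-in for the CSV the post reads:</p>

```python
# Sketch of the indicator-variable (dummy) feature creation described above.
import pandas as pd

# Stand-in for the post's CSV data.
df = pd.DataFrame({"altitude": [100, 250, 400, 150, 300],
                   "region":   [1,   2,   3,   1,   2]})

# Region indicators I1..I3 and per-region altitude columns H1..H3.
for r in (1, 2, 3):
    df[f"I{r}"] = (df["region"] == r).astype(int)
    df[f"H{r}"] = df["altitude"] * df[f"I{r}"]

# Combination indicators for pairs of regions, and all regions together.
df["I12"] = df["region"].isin([1, 2]).astype(int)
df["I13"] = df["region"].isin([1, 3]).astype(int)
df["I23"] = df["region"].isin([2, 3]).astype(int)
df["I123"] = 1  # constant: every row belongs to the all-regions group

# Altitude columns for each region combination.
for combo in ("12", "13", "23", "123"):
    df[f"H{combo}"] = df["altitude"] * df[f"I{combo}"]

df.to_csv("rainfall_dummies.csv", sep=";", index=False)
print(df.head())
```
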
<p><a href="../assets/image-blog/20230616-regression-analysis-with-dummy-variables-10.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-10.jpg" class="img-fluid"></a></p>
</section>
<section id="full-model-avoid-dummy-trap" class="level4">
<h4 class="anchored" data-anchor-id="full-model-avoid-dummy-trap">Full model, avoid dummy trap</h4>
<p>The Full Model regression equation doesn’t include I3 because of a technique used in regression analysis known as dummy coding. When we have a categorical variable with k levels (in this case, region with 3 levels), we need to create k-1 dummy variables to represent it in the regression model.</p>
<p>The reason for using k-1 dummy variables instead of k is to avoid the dummy variable trap, which is a scenario in which the independent variables are multicollinear. In other words, one variable can be predicted perfectly from the others.</p>
<p>In our case, I1, I2, and I3 represent the three regions. If we included all three in our model, we would have perfect multicollinearity because I3 can be perfectly predicted from I1 and I2 (if I1 = 0 and I2 = 0, then I3 has to be 1). This would make the model’s estimates unstable and uninterpretable.</p>
<p>By leaving out I3, we are implicitly choosing region 3 as the reference category. The coefficients for I1 and I2 then represent the difference in the outcome between regions 1 and 3, and regions 2 and 3, respectively.</p>
<p>If we want to make comparisons between regions 1 and 2, we can either change the reference category (by including I3 and leaving out I1 or I2 instead), or compute the difference between the I1 and I2 coefficients.</p>
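<p>The dummy-variable trap can be verified numerically: with an intercept present, a design matrix containing all three indicators is rank-deficient, because I1 + I2 + I3 = 1. A minimal sketch:</p>

```python
# Demonstrating the dummy-variable trap with matrix ranks.
import numpy as np
import pandas as pd

df = pd.DataFrame({"region": [1, 1, 2, 2, 3, 3]})
dummies = pd.get_dummies(df["region"], prefix="I", dtype=int)  # I_1, I_2, I_3

# Intercept plus all three dummies: columns are linearly dependent.
X_full = np.column_stack([np.ones(len(df)), dummies])
# Dropping I_3 (region 3 becomes the reference) restores full column rank.
X_k_minus_1 = np.column_stack([np.ones(len(df)), dummies[["I_1", "I_2"]]])

print(np.linalg.matrix_rank(X_full))       # 3, not 4: perfect multicollinearity
print(np.linalg.matrix_rank(X_k_minus_1))  # 3: full column rank
```
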
<p><a href="../assets/image-blog/20230616-regression-analysis-with-dummy-variables-09.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-09.jpg" class="img-fluid"></a></p>
</section>
<section id="reduced-model" class="level4">
<h4 class="anchored" data-anchor-id="reduced-model">Reduced Model</h4>
<p>Next, we create Reduced Models (RMs) from a Full Model (FM) that uses dummy variables to represent regions, plus altitude variables interacted with those region dummies. The Full Model (FM) in this context is:</p>
<p>FM: y123 = a1 I1 + a2 I2 + a3 I3 + b1 H1 + b2 H2 + b3 H3</p>
<p>where:</p>
<ul>
<li>y123 represents rainfall</li>
<li>I1, I2, I3 are dummy variables for regions 1, 2, and 3, respectively</li>
<li>H1, H2, H3 are altitude variables interacted with the respective region dummies</li>
</ul>
<p>Based on this FM, we can derive 5 different Reduced Models (RMs):</p>
<ul>
<li>RM1: Common intercept across regions 1 and 2, separate slopes for each region: y123 = a12 I12 + b1H1 + b2H2 + a3 I3 + b3H3</li>
<li>RM2: Common intercept and slope across all regions: y123 = a123 I123 + b123 H123</li>
<li>RM3: Common intercept and slope across regions 1 and 3, separate intercept and slope for region 2: y123 = a13 I13 + b13H13 + a2I2 + b2H2</li>
<li>RM4: Common intercept and slope across regions 1 and 2, separate intercept and slope for region 3: y123 = a12 I12 + b12 H12 + a3I3 + b3H3</li>
<li>RM5: Separate intercepts for each region, common slope across regions 1 and 2, separate slope for region 3: y123 = a1 I1 + a2 I2 + a3 I3 + b12 H12 + b3 H3</li>
</ul>
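<p>As a sketch, the FM and RM2 can both be fitted by ordinary least squares without a global intercept, since the indicator columns supply the intercepts; the synthetic data and parameter values here are illustrative, not the post's dataset:</p>

```python
# Fitting the Full Model and RM2 by least squares (no global intercept:
# the region dummies act as the intercept terms). Synthetic stand-in data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
region = np.repeat([1, 2, 3], 15)
altitude = rng.uniform(0, 500, 45)
true = {1: (2.4, 1000), 2: (2.0, 1100), 3: (0.7, 600)}  # (slope, intercept)
rainfall = np.array([true[r][1] + true[r][0] * a for r, a in zip(region, altitude)])
rainfall += rng.normal(0, 40, 45)

df = pd.DataFrame({"region": region, "altitude": altitude, "rainfall": rainfall})
for r in (1, 2, 3):
    df[f"I{r}"] = (df["region"] == r).astype(int)
    df[f"H{r}"] = df["altitude"] * df[f"I{r}"]

# Full model: separate intercept and slope per region.
fm = LinearRegression(fit_intercept=False).fit(
    df[["I1", "I2", "I3", "H1", "H2", "H3"]], df["rainfall"])

# RM2: one common intercept and one common slope across all regions.
df["I123"], df["H123"] = 1, df["altitude"]
rm2 = LinearRegression(fit_intercept=False).fit(df[["I123", "H123"]], df["rainfall"])

print("FM coefficients (a1 a2 a3 b1 b2 b3):", fm.coef_.round(2))
print("RM2 coefficients (a123 b123):", rm2.coef_.round(2))
```
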
<div class="image-gallery">
<div class="gallery-item">
<p><a href="../assets/image-blog/20230616-regression-analysis-with-dummy-variables-08.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-08.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230616-regression-analysis-with-dummy-variables-07.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-07.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230616-regression-analysis-with-dummy-variables-06.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-5"><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-06.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230616-regression-analysis-with-dummy-variables-05.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-6"><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-05.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230616-regression-analysis-with-dummy-variables-04.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-7"><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-04.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230616-regression-analysis-with-dummy-variables-03.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-8"><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-03.jpg" class="img-fluid"></a></p>
</div>
</div>
</section>
<section id="cp-mallows" class="level4">
<h4 class="anchored" data-anchor-id="cp-mallows">Cp Mallows</h4>
<p>Next, a summary table was created to provide a succinct overview of the Full Model (FM) and each Reduced Model (RM). This table encapsulates the vital statistics of each model: five columns - P, S, σ, n, and C_P_Mallow - and six rows corresponding to FM, RM1, RM2, RM3, RM4, and RM5.</p>
<p>Here P denotes the number of parameters in each model, S the standard deviation of the model's residuals, σ the standard deviation of the full model's residuals, n the number of observations, and C_P_Mallow the value of Mallows's C_P statistic.</p>
<p>In determining the effectiveness of the Reduced Models relative to the Full Model, the Mallows's <img src="https://latex.codecogs.com/png.latex?C_P"> statistic plays a crucial role. According to Mallows (1973), a Reduced Model can be considered comparable to the Full Model if its Mallows's <img src="https://latex.codecogs.com/png.latex?C_P"> value is less than or equal to the number of parameters (<img src="https://latex.codecogs.com/png.latex?C_%7BP,%5Ctext%7BMallow%7D%7D%20%5Cleq%20p">). This statistic is calculated using the formula:</p>
<p><img src="https://latex.codecogs.com/png.latex?C_%7BP,%5Ctext%7BMallow%7D%7D%20=%20p%20+%20%5Cfrac%7B(S%5E2%20-%20%5Csigma%5E2)(n%20-%20p)%7D%7B%5Csigma%5E2%7D"></p>
<p>In this context, <img src="https://latex.codecogs.com/png.latex?p"> corresponds to the number of parameters used in the Reduced Model, <img src="https://latex.codecogs.com/png.latex?n"> denotes the total data observations used in the model (in this case, <img src="https://latex.codecogs.com/png.latex?n=45">), <img src="https://latex.codecogs.com/png.latex?S%5E2"> is the variance of the Reduced Model, and <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2"> is the variance of the Full Model. By making use of this computation, we were able to evaluate the efficiency of each Reduced Model in comparison to the Full Model, aiding in the effective and accurate analysis of our data set.</p>
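<p>A minimal sketch of this computation, assuming S² and σ² are residual mean squares (SSE divided by n − p, a convention consistent with the formula above); the SSE values below are hypothetical:</p>

```python
# Mallows's Cp from the formula above; S2 and sigma2 are taken as residual
# mean squares SSE / (n - p), which is an assumption about the convention.
import numpy as np

def mallows_cp(sse_rm, p_rm, sse_fm, p_fm, n):
    S2 = sse_rm / (n - p_rm)      # variance of the reduced model
    sigma2 = sse_fm / (n - p_fm)  # variance of the full model
    return p_rm + (S2 - sigma2) * (n - p_rm) / sigma2

# Hypothetical sums of squared errors, for illustration only.
n = 45
cp = mallows_cp(sse_rm=52000.0, p_rm=2, sse_fm=48000.0, p_fm=6, n=n)
print(round(cp, 2))  # Cp <= p here, so this RM would count as efficient
```
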
<p><a href="../assets/image-blog/20230616-regression-analysis-with-dummy-variables-02.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-9"><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-02.jpg" class="img-fluid"></a></p>
</section>
<section id="plot-cp-mallows" class="level4">
<h4 class="anchored" data-anchor-id="plot-cp-mallows">Plot Cp Mallows</h4>
<p>Based on the results from the code, we can create a plot to visualize the Mallows's Cp statistic. The x-axis of the plot represents the number of parameters (P), while the y-axis represents the Cp values.</p>
<p>To begin, we draw the reference line Cp = P, extending from the point (1, 1) to the point (n, n), where n is the total number of observations. This line helps us identify the region of interest.</p>
<p>Next, we will plot the CP values on the y-axis corresponding to the respective number of predictors (P) on the x-axis. Each point on the plot will represent a reduced model, with the CP value indicating its performance compared to the full model.</p>
<p>To highlight the specific point that satisfies the criteria - the lowest number of parameters (P) among the models falling on or below the reference line - we can customize the marker style or color for that point. This makes it visually distinct from the other points on the plot.</p>
<p>By examining the plot, we can easily identify the reduced model that strikes a balance between simplicity (fewer predictors) and predictive power (CP Mallow value). The highlighted point will represent the optimal reduced model that meets these criteria.</p>
<p>This plot provides a visual representation of the CP Mallow statistic, allowing us to compare the performance of different reduced models and select the most appropriate one based on the desired balance between complexity and prediction accuracy.</p>
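<p>A sketch of such a plot, with hypothetical (P, Cp) pairs for the five reduced models; the post derives the actual values from the data:</p>

```python
# Cp-vs-P plot with the Cp = P reference line. The (P, Cp) pairs below
# are illustrative placeholders, not computed results.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

models = {"RM1": (5, 4.8), "RM2": (2, 1.9), "RM3": (4, 12.3),
          "RM4": (4, 3.7), "RM5": (5, 8.1)}  # name: (P, Cp)

n = 45
fig, ax = plt.subplots()
ax.plot([1, n], [1, n], "k--", label="Cp = P")  # reference line
for name, (p, cp) in models.items():
    efficient = cp <= p  # Mallows (1973) criterion
    ax.scatter(p, cp, color="tab:green" if efficient else "tab:red")
    ax.annotate(name, (p, cp), textcoords="offset points", xytext=(5, 5))
ax.set_xlim(0, 8)
ax.set_ylim(0, 15)
ax.set_xlabel("Number of parameters, P")
ax.set_ylabel("Mallows's Cp")
ax.legend()
fig.savefig("cp_mallows.png", dpi=150)
```
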
<p><a href="../assets/image-blog/20230616-regression-analysis-with-dummy-variables-12.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-10"><img src="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-12.jpg" class="img-fluid"></a></p>
<p>The linear regression analysis across the three designated regions indicated that annual precipitation increases with altitude. The reduced models (RMs) that corresponded most closely to the full model (FM) were RM2 and RM4, with RM2 being the most efficient. This suggests that the first and second regions share similar characteristics, while the third region diverges noticeably from both.</p>


</section>
</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Data Science</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20230616-regression-analysis-with-dummy-variables.html</guid>
  <pubDate>Fri, 16 Jun 2023 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20230616-regression-analysis-with-dummy-variables-11.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Sentinel-1 modified Radar Vegetation Index</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230614-sentinel-1-modified-radar-vegetation-index.html</link>
  <description><![CDATA[ 





<p>The Sentinel-1 modified Radar Vegetation Index (RVI) based on Google Earth Engine (GEE) script below originally developed by my friend Jose Manuel Delgado Blasco (<a href="https://scholar.google.com/citations?user=TwtlI-UAAAAJ">Scholar</a>, <a href="https://it.linkedin.com/in/josemanuel-delgadoblasco">Linkedin</a>) as part of our team (GOST) activities to support during Ukraine response last year, published as GOST Public Good’s Github repo <a href="https://github.com/worldbank/GOST_SAR/tree/master/Radar_Vegetation_Index" class="uri">https://github.com/worldbank/GOST_SAR/tree/master/Radar_Vegetation_Index</a></p>
<p>The original GEE script was meant for one-off updates. As time progresses and the need for vegetation monitoring continually increases, I believe it is necessary to obtain this RVI as time-series data, which can be matched with monthly rainfall time series for monitoring food crop phenology.</p>
<p>For this reason, I’ve added a function to mosaic every ten days and <a href="../blog/20220319-batch-task-execution-in-google-earth-engine-code-editor">batch downloading</a> if the list of data is quite extensive.</p>
<p>All credit goes to the awesome work of Jose Manuel! Hats off to him!</p>
<p><img src="https://benny.istan.to/site/assets/image-blog/20230614-sentinel-1-modified-radar-vegetation-index-01.jpg" class="img-fluid" alt="RVI in Crimean Peninsula"></p>
<p>RVI in Crimean Peninsula</p>
<p>The picture above shows vegetation indices based on Sentinel-1 (generated using the GEE script); the picture below shows vegetation indices for the same period based on Sentinel-2 (generated using Climate Engine <a href="https://climengine.page.link/sZnR" class="uri">https://climengine.page.link/sZnR</a>).</p>
<p><a href="../assets/image-blog/20230614-sentinel-1-modified-radar-vegetation-index-02.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20230614-sentinel-1-modified-radar-vegetation-index-02.jpg" class="img-fluid"></a></p>
<p>NDVI in Crimean Peninsula</p>
<p>Full GEE code is here: <a href="https://code.earthengine.google.com/62f799954525c997629cefdd435c500e" class="uri">https://code.earthengine.google.com/62f799954525c997629cefdd435c500e</a></p>



<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Remote Sensing</category>
  <guid>https://benny.istan.to/site/blog/20230614-sentinel-1-modified-radar-vegetation-index.html</guid>
  <pubDate>Thu, 15 Jun 2023 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20230614-sentinel-1-modified-radar-vegetation-index-01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Second-order Markov chain model to generate time series of occurrence and rainfall</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall.html</link>
  <description><![CDATA[ 





<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">1 Introduction</h2>
<p>In the realm of meteorological studies, the use of statistical models is pivotal for understanding and predicting various weather phenomena. Among these models, the second-order Markov chain model has emerged as a powerful tool, particularly in generating time series of rainfall occurrence (Wilks, 1998). This model provides a robust framework for simulating rainfall patterns, offering valuable insights that are crucial for weather forecasting, water resource management, and climate change studies.</p>
<p>The second-order Markov chain model distinguishes itself from its first-order counterpart through its ability to consider not just the state of the system at the previous time step, but also the state at the time step before that. This additional layer of historical context allows the model to capture more complex dependencies and transitions in the rainfall data (Bellone et al., 2000). This enhanced capability significantly improves the accuracy of the generated time series, making it a powerful tool in the study of rainfall patterns.</p>
<p>Rainfall, as a natural phenomenon, exhibits a high degree of variability and randomness. The second-order Markov chain model, with its ability to incorporate historical context, is well-equipped to handle this variability (Hughes et al., 1999). By considering the state of the system at two previous time steps, the model can capture the inherent randomness in rainfall occurrence, thereby generating a time series that closely mirrors real-world rainfall patterns.</p>
<p>The application of the second-order Markov chain model to rainfall data is not just a theoretical exercise. The generated time series of rainfall occurrence can have practical applications in various fields. For instance, in the field of agriculture, understanding rainfall patterns can help farmers plan their planting and harvesting schedules (Rosenzweig et al., 2000). In urban planning, accurate rainfall predictions can inform the design of drainage systems to prevent flooding (Ashley et al., 2005).</p>
</section>
<section id="data" class="level2">
<h2 class="anchored" data-anchor-id="data">2 Data</h2>
<p>Over the past three decades, Bogor’s climate has remained relatively consistent. The city experiences an average annual temperature of around 26 °Celsius. The temperature varies little throughout the year, with the warmest month averaging around 27 °Celsius and the coolest month averaging around 25 °Celsius.</p>
<p>In terms of rainfall, Bogor receives an average annual precipitation of over 3,000 millimeters. The city experiences the most rainfall from November to March, with each of these months receiving over 300 millimeters of rain on average. Even in the driest months, from June to September, Bogor still receives over 100 millimeters of rain per month on average.</p>
<p>This consistent and significant rainfall, combined with the city’s warm temperatures, contributes to its lush, tropical environment. The climatic conditions of Bogor provide a rich dataset for the application of a second-order Markov chain model to generate time series of occurrence and rainfall.</p>
<p>Daily rainfall data from the Bogor Climatological Station for 1984-2021 were used in this analysis, downloaded from BMKG Data Online in *.xlsx format. The file was then cleaned by removing the logo and unnecessary text, leaving only two columns - date in column A and rainfall in column B - with the data extending downwards, and saved in *.csv format.</p>
<p>The final input file is accessible via this link: <a href="https://drive.google.com/file/d/1molqggv9o71Z0VT50h5OvEqCxYq4Bp1Z/view?usp=sharing" class="uri">https://drive.google.com/file/d/1molqggv9o71Z0VT50h5OvEqCxYq4Bp1Z/view?usp=sharing</a></p>
</section>
<section id="methods" class="level2">
<h2 class="anchored" data-anchor-id="methods">3 Methods</h2>
<p>This exercise focuses on the second-order Markov chain model as a tool for generating rainfall occurrence probabilities and the gamma distribution for determining rainfall height (Boer, 1999).</p>
<p>The second-order Markov chain model is widely used to represent rainfall occurrence (Stern and Coe, 1984; Haan et al., 1976). In this model, rainfall occurrence on day i is influenced by the presence or absence of rainfall on the preceding days. If rainfall on day i is influenced only by rainfall on the previous day, the model is a first-order Markov chain; if it is also influenced by rainfall two days prior, it is a second-order Markov chain, and so on.</p>
<p>The second-order Markov chain model has been demonstrated to be effective in generating time series of rainfall occurrence (Nicks and Harp, 1980; Richardson, 1981; Wilks, 1990). Furthermore, the gamma distribution is frequently utilized to determine rainfall height (Wilks, 1990). By employing both the second-order Markov chain model and the gamma distribution, this study offers a comprehensive approach to generating rainfall data.</p>
<p>The combined use of the second-order Markov chain model for rainfall occurrence and gamma distribution for rainfall height provides a robust method for generating rainfall data. This approach has significant implications for various fields, such as agriculture and urban planning, where accurate rainfall data is crucial for informed decision-making.</p>
<p>In this exercise, the focus is limited to second-order Markov chains, as the analysis for lower- or higher-order chains is fundamentally similar. The analysis uses the symbol 0 for non-rainy days and 1 for rainy days. The probability of rainfall on day i, given that it did not rain on the previous day or on the day before the previous day, is denoted as P001(i), while the probability of rain given that it rained on both of those days is represented as P111(i). The general form of the estimated probability of rainfall occurrence is as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?P_%7Bjkl%7D(i)%20=%20%5Cfrac%7Bn_%7Bjkl%7D(i)%7D%7Bn_%7Bjk0%7D(i)%20+%20n_%7Bjk1%7D(i)%7D%20%5Ctag%7B1%7D"></p>
<p>where njkl(i) represents the number of years in which event l (0 or 1) occurred on day i, while events j and k (each 0 or 1) occurred on the day before the previous day and on the previous day, respectively.</p>
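<p>Equation (1) can be sketched in code. For brevity this example pools transitions over all days rather than counting per calendar day i across years, and uses a synthetic wet/dry series in place of the Bogor data:</p>

```python
# Estimating second-order transition probabilities P_jkl from a 0/1 series.
import numpy as np

rng = np.random.default_rng(42)
wet = (rng.random(3650) < 0.4).astype(int)  # stand-in for ~10 years of data

# Count triples (state two days ago, state yesterday, state today).
counts = {}
for t in range(2, len(wet)):
    j, k, l = wet[t - 2], wet[t - 1], wet[t]
    counts[(j, k, l)] = counts.get((j, k, l), 0) + 1

def p_jkl(j, k, l):
    """P(state l today | state j two days ago, state k yesterday)."""
    denom = counts.get((j, k, 0), 0) + counts.get((j, k, 1), 0)
    return counts.get((j, k, l), 0) / denom if denom else float("nan")

for (j, k) in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"P{j}{k}1 = {p_jkl(j, k, 1):.3f}")
```
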
<section id="rainfall-occurrence-model" class="level3">
<h3 class="anchored" data-anchor-id="rainfall-occurrence-model">3.1 Rainfall occurrence model</h3>
<p>Rainfall occurrence models commonly use Fourier regression equations to predict the probability of rainfall occurrence. However, these equations can sometimes produce a fitting line with values greater than 1 or smaller than 0. To address this issue, the probability values are first transformed into a logit function gjkl(i).</p>
<p><img src="https://latex.codecogs.com/png.latex?g_%7Bjkl%7D(i)%20=%20%5Cln%5Cleft(%5Cfrac%7BP_%7Bjkl%7D(i)%7D%7B1%20-%20P_%7Bjkl%7D(i)%7D%5Cright)%20%5Ctag%7B2%7D"></p>
<p>To transform gjkl(i) back into probability values, the following equation is used:</p>
<p><img src="https://latex.codecogs.com/png.latex?P_%7Bjkl%7D(i)%20=%20%5Cfrac%7B1%7D%7B1%20+%20%5Cexp(-g_%7Bjkl%7D(i))%7D%20%5Ctag%7B3%7D"></p>
<p>The fitting line for gjkl(i) follows the form presented by Stern and Coe (1984):</p>
<p><img src="https://latex.codecogs.com/png.latex?g_%7Bjkl%7D(i)%20=%20a_0%20+%20a_1%20%5Csin(t'(i))%20+%20b_1%20%5Ccos(t'(i))%20+%20a_2%20%5Csin(2t'(i))%20+%20b_2%20%5Ccos(2t'(i))%20%5Ctag%7B4%7D"></p>
<p>Where <img src="https://latex.codecogs.com/png.latex?t'(i)%20=%20%5Cfrac%7B2%5Cpi%20i%7D%7B365%7D"> and <img src="https://latex.codecogs.com/png.latex?i%20=%201,%202,%20%5Cldots,%20365"></p>
<p>The number of harmonics, m, can be determined using multiple regression techniques, where independent variables are introduced sequentially, starting with harmonic 1, harmonic 2, and so on until no more variance is explained by the newly introduced variable.</p>
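<p>Equations (2)-(4) can be sketched as a logit transform followed by a two-harmonic least-squares fit; the daily probability series here is synthetic:</p>

```python
# Logit transform (eq. 2), Fourier fit (eq. 4), inverse logit (eq. 3).
import numpy as np

rng = np.random.default_rng(7)
i = np.arange(1, 366)
t = 2 * np.pi * i / 365
# Synthetic "raw" daily probabilities with a seasonal cycle plus noise.
p_raw = np.clip(0.5 + 0.3 * np.sin(t) + rng.normal(0, 0.05, 365), 0.01, 0.99)

g = np.log(p_raw / (1 - p_raw))                      # eq. (2): logit
X = np.column_stack([np.ones_like(t), np.sin(t), np.cos(t),
                     np.sin(2 * t), np.cos(2 * t)])  # eq. (4) regressors
coef, *_ = np.linalg.lstsq(X, g, rcond=None)         # a0, a1, b1, a2, b2
g_fit = X @ coef
p_fit = 1 / (1 + np.exp(-g_fit))                     # eq. (3): inverse logit

print("coefficients:", coef.round(3))
print("fitted P stays in (0, 1):", bool((p_fit > 0).all() and (p_fit < 1).all()))
```

<p>The inverse-logit step is what guarantees the fitted probabilities never leave (0, 1), unlike fitting the Fourier regression to the raw probabilities directly.</p>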
</section>
<section id="rainfall-generation-model" class="level3">
<h3 class="anchored" data-anchor-id="rainfall-generation-model">3.2 Rainfall generation model</h3>
<p>To generate rainfall data, the probability information required is the probability of rainfall occurrence on day i, where the previous day’s occurrence is k (0 or 1), and the day before yesterday is j (0 or 1). The estimated value for gjkl(i) can be calculated if daily rainfall observation data is available.</p>
<p>For simulation purposes, probability data must be converted into occurrence data. This is done by generating random numbers from a uniform distribution U(0, 1) (VanTassel et al., 1990). If the random value is smaller than the probability value, it indicates rainfall; otherwise, it indicates no rainfall. If the simulation result indicates rainfall, the next step is to generate the rainfall height using a theoretical distribution.</p>
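<p>The probability-to-occurrence conversion is a one-line comparison; the daily probabilities below are hypothetical placeholders:</p>

```python
# Converting daily occurrence probabilities into simulated wet/dry days
# by comparison with U(0, 1) draws.
import numpy as np

rng = np.random.default_rng(3)
p_wet = np.full(365, 0.45)            # hypothetical daily probabilities
u = rng.random(365)                   # U(0, 1) draws
occurrence = (u < p_wet).astype(int)  # 1 = rain, 0 = no rain

print("simulated wet days:", occurrence.sum())
```
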
<p>The next step in creating a rainfall data simulation model is to calculate the parameters of a theoretical distribution that approximates the rainfall data distribution. The Gamma distribution is widely used to describe rainfall intensity variability (Ison et al., 1971; Stern and Coe, 1984; Waggoner, 1989; Wilks, 1990). The probability density function is as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?f(x,%20%5Calpha,%20%5Cbeta)%20=%20%5Cfrac%7B1%7D%7B%5Cbeta%5CGamma(%5Calpha)%7D%5Cleft(%5Cfrac%7Bx%7D%7B%5Cbeta%7D%5Cright)%5E%7B%5Calpha-1%7De%5E%7B-x/%5Cbeta%7D%20%5Ctag%7B5%7D"></p>
<p>with <img src="https://latex.codecogs.com/png.latex?%5Calpha"> being the shape parameter, <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> being the scale parameter, and <img src="https://latex.codecogs.com/png.latex?%5CGamma"> the gamma function.</p>
<p>Several methods can be employed to estimate the values of the two parameters of the gamma distribution, one of which is the Maximum Likelihood Method. According to Shenton and Bowman (1970, as cited in Haan, 1979), the <img src="https://latex.codecogs.com/png.latex?%5Calpha"> value obtained from the Maximum Likelihood Method may still have a bias, and therefore needs to be corrected. The corrected <img src="https://latex.codecogs.com/png.latex?%5Calpha"> value, calculated using the Greenwood and Durand method, is:</p>
<p><img src="https://latex.codecogs.com/png.latex?FC_%5Calpha%20=%20%5Cfrac%7B(n%20-%203)%5Calpha%7D%7Bn%7D%20%5Ctag%7B6%7D"></p>
<p>Subsequently, the <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> parameter is calculated as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cbeta%20=%20%5Cfrac%7B%5Cbar%7BX%7D%7D%7B%5Calpha%7D%20%5Ctag%7B7%7D"></p>
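<p>A sketch of the parameter estimation in equations (5)-(7), using SciPy's maximum-likelihood gamma fit with the location fixed at zero (an assumption) and synthetic wet-day amounts in place of the observed data:</p>

```python
# MLE of the gamma shape, the (n - 3)/n bias correction (eq. 6),
# then beta = mean / alpha (eq. 7).
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
rain = rng.gamma(shape=0.8, scale=12.0, size=500)  # synthetic wet-day amounts

alpha_mle, loc, _ = stats.gamma.fit(rain, floc=0)  # MLE with location fixed at 0
n = len(rain)
alpha_corr = (n - 3) * alpha_mle / n               # eq. (6): bias correction
beta = rain.mean() / alpha_corr                    # eq. (7)

print(f"alpha = {alpha_corr:.3f}, beta = {beta:.3f}")
```
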
<p>The predicted rainfall is based on a predetermined set of patterns, P001, P010, P011 and P111, referred to as P-types. These P-types represent different combinations of rainfall on the preceding days and the current day, and are used as event triggers in the model. The rainfall data is then separated into different seasons - DJF, MAM, JJA and SON - based on the month of occurrence. The model takes into account the day-to-day variability within each season.</p>
<p>The gamma distribution, parameterized by <img src="https://latex.codecogs.com/png.latex?%5Calpha"> (shape) and <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> (scale), is used to generate the predicted rainfall. The <img src="https://latex.codecogs.com/png.latex?%5Calpha"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> parameters for each season are pre-calculated and fetched for each P-type and season combination.</p>
<p>The random gamma function, employed to simulate rainfall events, generates samples from a Gamma distribution. The number of samples drawn (pertaining to the size parameter in the gamma distribution) ideally aligns with the valid event days within a given season, conforming to the original precipitation data (Wilks, 2011). In essence, for each season and event type, the gamma distribution is simulated as frequently as the number of event days occurring within the season, according to the original data.</p>
<p>Although the simulated events from a uniform distribution and the derived rainfall values from the gamma distribution are not intrinsically connected in the simulation process, they both represent the same event category. To maintain consistency in the temporal distribution of events, the generated rainfall values are matched with the valid event days in the original data. This coherence in the number of samples drawn from the gamma distribution is accomplished by aligning it with the structure of the initial precipitation data, rather than the simulated occurrences. Consequently, the gamma distribution is simulated for as many instances as the number of simulated event days within the season, thereby aligning with the frequency of simulated events in the synthetic weather data. The methodology of generating synthetic weather data using stochastic processes is a widely recognized approach in atmospheric sciences (Rodriguez-Iturbe, Cox, &amp; Isham, 1987; Srikanthan &amp; McMahon, 2001).</p>
<p>For every P-type, the model iterates through each season. During each iteration, it identifies the days in the season when an event (rainfall) is predicted to occur. These are the days that have a corresponding 1 in the event data for the current P-type.</p>
<p>Once these event days are identified, the model generates rainfall values for these days using the gamma distribution with the α and β parameters for the current season. This process is repeated for all the P-types and seasons.</p>
<p>The result is a predicted rainfall dataset that takes into account the specific patterns of rainfall events and the seasonal characteristics of rainfall intensity.</p>
</section>
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">4 Implementation</h2>
<p>In the implementation phase of this analysis, we utilized Python with the Pandas, NumPy and Matplotlib libraries to develop a rainfall occurrence generation model.</p>
<section id="how-to" class="level3">
<h3 class="anchored" data-anchor-id="how-to">4.1 How-to?</h3>
<p>The step-by-step guide for the model is readily accessible in Google Colab or Jupyter Notebook, ideal platforms for data analysis and machine learning. The guide walks through the entire process: reshaping the data to ensure compatibility with the model, generating the transition probabilities essential for accurate predictions, counting the events, translating these probabilities into meaningful rainfall event information, and finally generating charts that visualize the results for clear interpretation and analysis.</p>
<hr>
<p><strong>Configuration</strong></p>
<p>Configuration is a crucial aspect of setting up any data analysis or processing workflow. Proper configuration ensures seamless access to data, efficient execution of tasks, and smooth integration of required tools and libraries. This article covers several essential subtopics related to configuration, such as connecting Google Drive to Colab, installing packages, importing libraries, and setting up working directories.</p>
<p><strong>Google Drive directory into Colab</strong></p>
<p>Connecting Google Drive to Colab is a vital step when working with data stored in Google Drive. It allows us to access and manipulate files directly from our Colab notebook. To connect our Google Drive, we can use the google.colab.drive module to mount our drive, enabling seamless access to our files and folders.</p>
<p><strong>Notes</strong></p>
<p>This only applies if we are working in Colab.</p>
<p><strong>Working Directories</strong></p>
<p>Setting up working directories involves defining the input and output directory paths for our project. This ensures that our code knows where to find the input data and where to store the results. Properly organizing our working directories makes it easier to manage our project, share it with others, and maintain a clean and structured codebase.</p>
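<p>A minimal sketch of such a setup with pathlib follows; the directory names are assumptions for illustration, not the post's actual layout. In Colab, these paths would sit under /content/drive after mounting Google Drive with google.colab.drive.mount.</p>

```python
from pathlib import Path

# Assumed project layout for illustration; adjust to your own structure.
# In Colab these would live under /content/drive/MyDrive/... after mounting.
base_dir = Path("project")
input_dir = base_dir / "input"
output_dir = base_dir / "output"

# Create the output folder if it does not exist yet.
output_dir.mkdir(parents=True, exist_ok=True)
print(output_dir.exists())  # True
```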
<hr>
<section id="rainfall-categorization" class="level4">
<h4 class="anchored" data-anchor-id="rainfall-categorization">4.1.1 Rainfall categorization</h4>
<p>In the first stage of the analysis, we import the data and categorize whether the day is rainy (value = 1) or sunny (value = 0).</p>
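<p>A minimal sketch of this step, using a small inline table in place of the imported data; the 1 mm wet-day threshold is an assumption, since the post does not state the cutoff it uses.</p>

```python
import pandas as pd

# Hypothetical daily rainfall series; in practice this would come from
# pd.read_csv("rainfall.csv", parse_dates=["date"]).
df = pd.DataFrame({
    "date": pd.date_range("1984-01-01", periods=6, freq="D"),
    "rainfall": [0.0, 5.2, 0.0, 12.1, 0.4, 3.3],
})

# Categorize each day: 1 = rainy, 0 = sunny.
# The 1 mm threshold is an assumption, not taken from the post.
threshold = 1.0
df["wet"] = (df["rainfall"] >= threshold).astype(int)
print(df["wet"].tolist())  # [0, 1, 0, 1, 0, 1]
```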
<p>The code above produces an output preview like the one below.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-01.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-01.jpg" class="img-fluid"></a></p>
</section>
<section id="function-for-transition-probabilities-order-2" class="level4">
<h4 class="anchored" data-anchor-id="function-for-transition-probabilities-order-2">4.1.2 Function for transition probabilities order 2</h4>
<p>This step explains how to calculate the transition probabilities of weather states from one day to the next, considering the weather states of the previous two days, based on historical weather data. The weather states are represented as binary values: 0 for “Sunny” and 1 for “Rain”. The transition probabilities are calculated for eight different scenarios:</p>
<ul>
<li>P000: The probability that today is Sunny given that the day before yesterday was Sunny and yesterday was Sunny.</li>
<li>P010: The probability that today is Sunny given that the day before yesterday was Sunny and yesterday was Rain.</li>
<li>P100: The probability that today is Sunny given that the day before yesterday was Rain and yesterday was Sunny.</li>
<li>P110: The probability that today is Sunny given that the day before yesterday was Rain and yesterday was Rain.</li>
<li>P001: The probability that today is Rain given that the day before yesterday was Sunny and yesterday was Sunny.</li>
<li>P011: The probability that today is Rain given that the day before yesterday was Sunny and yesterday was Rain.</li>
<li>P101: The probability that today is Rain given that the day before yesterday was Rain and yesterday was Sunny.</li>
<li>P111: The probability that today is Rain given that the day before yesterday was Rain and yesterday was Rain.</li>
</ul>
<p>The given code defines a function calculate_transition_probabilities_orders_2_long that calculates transition probabilities based on weather conditions in a DataFrame (df). The function takes three conditions (condition1, condition2, and result) and checks if these conditions are met in consecutive rows of the DataFrame. It creates a new column with binary values indicating the occurrence of the specified conditions. NaN values are set for rows with missing data. The code then defines a list of conditions and results and iterates over them to calculate transition probabilities for each scenario. The resulting probabilities are stored in new columns in the DataFrame. The DataFrame is restructured and saved as a CSV file. Finally, the program prints ‘Completed!’ and displays a preview of the DataFrame.</p>
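<p>A simplified stand-in for that function can be sketched as below; the helper name and signature are illustrative, not the post's actual code.</p>

```python
import numpy as np
import pandas as pd

def transition_flags(wet, c1, c2, result):
    """Flag days whose (day-2, day-1, today) states equal (c1, c2, result).

    Returns 1/0 flags, with NaN for the first two days where no two-day
    history exists. An illustrative stand-in for the post's
    calculate_transition_probabilities_orders_2_long helper.
    """
    wet = pd.Series(wet, dtype=float)
    match = (wet.shift(2) == c1) & (wet.shift(1) == c2) & (wet == result)
    flags = match.astype(float)
    flags[wet.shift(2).isna() | wet.shift(1).isna()] = np.nan
    return flags

wet = [0, 0, 1, 1, 0, 1]
# P011: Sunny two days ago, Rain yesterday, Rain today.
print(transition_flags(wet, 0, 1, 1).tolist())  # [nan, nan, 0.0, 1.0, 0.0, 0.0]
```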
<p>The code above produces an output preview like the one below.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-02.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-02.jpg" class="img-fluid"></a></p>
</section>
<section id="reshape-the-data" class="level4">
<h4 class="anchored" data-anchor-id="reshape-the-data">4.1.3 Reshape the data</h4>
<p>The provided code segment executes a series of steps to transform the weather data from long to wide format, to simplify further processing.</p>
<p>Firstly, it generates a list of unique scenarios represented by “P” values. Subsequently, a ‘year’ column is added to the DataFrame bin_df based on the ‘date’ information. The code then iterates through each unique “P” value. For each iteration, it selects the relevant columns (‘year’, ‘day’, and the current “P” value) from bin_df while removing rows with missing values. The “P” column is renamed as ‘value’. The DataFrame is then pivoted, organizing the data with ‘day’ as the index, ‘year’ as the columns, and ‘value’ as the values. Each resulting pivoted DataFrame is saved as a CSV file, with the file name corresponding to the current “P” value.</p>
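<p>The pivot at the heart of this step can be sketched as follows, on a tiny hypothetical table for a single P-type (the values and years are invented for illustration):</p>

```python
import pandas as pd

# Hypothetical long-format flags for one P-type over two years.
bin_df = pd.DataFrame({
    "year": [1984, 1984, 1985, 1985],
    "day":  [1, 2, 1, 2],
    "P011": [1.0, 0.0, 0.0, 1.0],
})

# Pivot to wide format: rows are days of year, columns are years,
# mirroring the loop the post runs for every "P" column.
wide = (bin_df.rename(columns={"P011": "value"})
              .pivot(index="day", columns="year", values="value"))
print(wide.shape)  # (2, 2)
# wide.to_csv("P011.csv")  # each P-type saved to its own CSV
```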
<p>The code above produces an output preview like the one below.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-03.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-03.jpg" class="img-fluid"></a></p>
</section>
<section id="calculate-number-of-event" class="level4">
<h4 class="anchored" data-anchor-id="calculate-number-of-event">4.1.4 Calculate the number of events</h4>
<p>The code below calculates the total number of occurrences per day for each of the eight possible weather state transitions (P000, P001, P010, P100, P110, P101, P011, P111) over the entire period of the dataset.</p>
<p>In this context, each weather state transition represents a sequence of three consecutive days. For example, P010 represents a sequence where it was sunny two days ago, rained yesterday, and is sunny today. The weather states are represented as binary values: 0 for “Sunny” and 1 for “Rain”.</p>
<p>The code first calculates the total number of occurrences per day for each weather state transition by summing up the values in the respective columns of the binary DataFrame (bin_reshape_dfxxx). It then creates a new DataFrame (num_df) that includes these totals along with the corresponding day. This DataFrame provides a daily summary of the weather state transitions for the entire period of the dataset.</p>
<p>Finally, the code saves this DataFrame to a CSV file for further analysis and previews the data. This step is crucial as it allows for the inspection of the calculated totals and ensures the data is correctly processed and ready for the next steps of the analysis.</p>
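<p>The counting step can be sketched like this, with two tiny hypothetical wide tables standing in for the reshaped P-type files:</p>

```python
import pandas as pd

# Hypothetical wide (day x year) tables of 1/0 flags for two P-types.
reshape = {
    "P011": pd.DataFrame({1984: [1, 0], 1985: [1, 1]}, index=[1, 2]),
    "P010": pd.DataFrame({1984: [0, 1], 1985: [0, 0]}, index=[1, 2]),
}

# Total occurrences per calendar day across all years, as in num_df.
num_df = pd.DataFrame({f"n_{p}": tbl.sum(axis=1) for p, tbl in reshape.items()})
num_df.index.name = "day"
print(num_df["n_P011"].tolist())  # [2, 1]
```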
<p>The code above produces an output preview like the one below.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-04.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-04.jpg" class="img-fluid"></a></p>
</section>
<section id="calculate-the-probabilities" class="level4">
<h4 class="anchored" data-anchor-id="calculate-the-probabilities">4.1.5 Calculate the probabilities</h4>
<p>This specific code block calculates the transition probabilities for each of the four possible weather state transitions where the current day is rainy (P001, P011, P101, P111) and another four where the current day is sunny (P000, P010, P110, P100).</p>
<p>The transition probabilities are calculated by dividing the total number of occurrences of each rainy/sunny weather state transition by the total number of occurrences of both the rainy and sunny weather state transitions for the same previous two days. For example, the transition probability P011 is calculated by dividing the total number of P011 occurrences by the sum of the total number of P011 and P010 occurrences.</p>
<p>The calculated transition probabilities are then stored in a new DataFrame (prob_df_xxxx), which also includes the corresponding day. This DataFrame provides a daily summary of the transition probabilities for the entire period of the dataset.</p>
<p>Finally, the code saves this DataFrame to a CSV file for further analysis and previews the data. This step is crucial as it allows for the inspection of the calculated probabilities and ensures the data is correctly processed and ready for the next steps of the analysis.</p>
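<p>The ratio described above (e.g. P011 from the P011 and P010 counts) can be sketched with hypothetical daily counts:</p>

```python
import pandas as pd

# Hypothetical daily counts for the paired transitions sharing the same
# two-day history "01" (Sunny two days ago, Rain yesterday).
num_df = pd.DataFrame({"n_P011": [2, 1, 3], "n_P010": [2, 3, 1]},
                      index=pd.Index([1, 2, 3], name="day"))

# P011 = n011 / (n011 + n010): chance of rain today given that history.
prob_df = pd.DataFrame(index=num_df.index)
prob_df["P011"] = num_df["n_P011"] / (num_df["n_P011"] + num_df["n_P010"])
print(prob_df["P011"].tolist())  # [0.5, 0.25, 0.75]
```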
<p>The code above produces an output preview like the one below.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-05.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-5"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-05.jpg" class="img-fluid"></a></p>
</section>
<section id="converting-to-logit-function-and-transform-back-to-probability-value" class="level4">
<h4 class="anchored" data-anchor-id="converting-to-logit-function-and-transform-back-to-probability-value">4.1.6 Converting to logit function and transform back to probability value</h4>
<p>The code calculates Fourier coefficients and applies a logit transformation to the probability values in a pandas DataFrame prob_df. First, it modifies prob_df, replacing any instances of 0 or 1 probabilities with a small constant epsilon or 1 - epsilon respectively. This prevents errors when applying logarithms and exponentials later in the process. The script then calculates the trigonometric terms sin_t_prime, cos_t_prime, sin_2t_prime, and cos_2t_prime from the day of the year scaled by 2*pi/365, reflecting the cyclical nature of the calendar.</p>
<p>After that, the script computes the logit of the probabilities, g_a_df, which is the log of the odds ratio (i.e., the ratio of the probability of an event occurring to the probability of it not occurring). Fourier coefficients are calculated for each original column in prob_df. The Fourier series is a way to represent a function as a sum of periodic components, and in this context, it’s used to capture the cyclical patterns of the probabilities throughout the year.</p>
<p>Finally, the script constructs a new DataFrame result_df that includes the original probabilities, the calculated g_a_df values, fitted g_fit values (based on the Fourier series representation), and final probabilities (the inverse logit of g_a_df). This DataFrame is saved to a CSV file and then returned for review.</p>
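<p>The logit transform and two-harmonic Fourier fit can be sketched as follows, on a synthetic seasonal probability series. The least-squares fit here stands in for however the post computes the coefficients, and the seasonal curve is invented for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily probabilities with a seasonal cycle, standing in for one
# column of prob_df.
day = np.arange(1, 366)
p = 0.4 + 0.3 * np.sin(2 * np.pi * day / 365) + rng.normal(0, 0.02, day.size)

# Keep probabilities away from exact 0/1 so the logit stays finite,
# as the post does with epsilon.
eps = 1e-6
p = np.clip(p, eps, 1 - eps)
g = np.log(p / (1 - p))  # logit (log-odds)

# Two-harmonic Fourier design matrix and a least-squares fit of g.
t = 2 * np.pi * day / 365
X = np.column_stack([np.ones_like(t), np.sin(t), np.cos(t),
                     np.sin(2 * t), np.cos(2 * t)])
coef, *_ = np.linalg.lstsq(X, g, rcond=None)
g_fit = X @ coef
p_fit = 1 / (1 + np.exp(-g_fit))  # inverse logit back to probabilities
print(np.abs(p - p_fit).mean() < 0.05)  # the fit tracks the seasonal cycle
```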
<p>The code above produces an output preview like the one below.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-06.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-6"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-06.jpg" class="img-fluid"></a></p>
</section>
<section id="visualize-the-calculated-logit-and-their-fitted" class="level4">
<h4 class="anchored" data-anchor-id="visualize-the-calculated-logit-and-their-fitted">4.1.7 Visualize the calculated logit and their fitted</h4>
<p>The script visualizes the calculated logit (g) values and their fitted counterparts (g_fit) from the result_df DataFrame for both rainy and sunny day scenarios.</p>
<p>This is accomplished by setting up a 2-row, 4-column grid of subplots. In the first row, it plots the rainy day scenarios (P_types_rainy) and in the second row, it plots the sunny day scenarios (P_types_sunny). For each scenario (rainy or sunny) and each type of day (defined by P_types), it creates a scatter plot of ‘g’ values and overlays a line plot of ‘g_fit’ values over the course of the year (represented by the ‘day’ variable).</p>
<p>The script then labels each subplot with its respective day type and scenario, sets the x and y labels, and includes a legend indicating which points represent ‘g’ and which line represents ‘g_fit’.</p>
<p>Finally, it adjusts the layout for better visualization and displays the plot. This way, it helps to analyze how well the fitted values (g_fit) are approximating the calculated logit (g) values.</p>
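<p>A sketch of that 2x4 grid, using invented g and g_fit values since the fitted data is not reproduced here:</p>

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
day = np.arange(1, 366)

# The four rainy and four sunny P-types, one subplot each.
p_types_rainy = ["P001", "P011", "P101", "P111"]
p_types_sunny = ["P000", "P010", "P100", "P110"]

fig, axes = plt.subplots(2, 4, figsize=(16, 6), sharex=True)
for row, names in enumerate([p_types_rainy, p_types_sunny]):
    for col, name in enumerate(names):
        g_fit = np.sin(2 * np.pi * day / 365)     # stand-in fitted curve
        g = g_fit + rng.normal(0, 0.3, day.size)  # stand-in daily logits
        ax = axes[row, col]
        ax.scatter(day, g, s=4, label="g")
        ax.plot(day, g_fit, color="orange", label="g_fit")
        ax.set_title(name)
        ax.set_xlabel("day")
        ax.set_ylabel("logit")
        ax.legend()
fig.tight_layout()
fig.savefig("logit_fits.png")
```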
</section>
<section id="generate-random-numbers-from-a-uniform-distribution-to-get-the-rainfall-events" class="level4">
<h4 class="anchored" data-anchor-id="generate-random-numbers-from-a-uniform-distribution-to-get-the-rainfall-events">4.1.8 Generate random numbers from a uniform distribution to get the rainfall events</h4>
<p>This code generates random numbers from a uniform distribution for each day and compares these to our probabilities to generate the events. Events are coded as 1 for rain and 0 for no rain. The new DataFrame event_df only contains the event data, with columns named event_Pxxx as specified. The data is saved in a CSV file called events.csv.</p>
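<p>The comparison rule can be sketched for a single P-type; the four probabilities here are invented, with 0.0 and 1.0 included to show the deterministic endpoints:</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical fitted daily probabilities for one P-type.
prob = pd.Series([0.0, 1.0, 0.3, 0.8], name="P011")

# One uniform draw per day; the day is rainy (1) when the draw falls
# below that day's probability.
u = rng.uniform(size=len(prob))
event_df = pd.DataFrame({"event_P011": (u < prob).astype(int)})
# A day with probability 0.0 is always 0; probability 1.0 is always 1.
print(event_df["event_P011"].tolist())
```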
<p>The code above produces an output preview like the one below.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-07.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-7"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-07.jpg" class="img-fluid"></a></p>
</section>
<section id="visualize-the-probability-of-rainfall-occurrence" class="level4">
<h4 class="anchored" data-anchor-id="visualize-the-probability-of-rainfall-occurrence">4.1.9 Visualize the probability of rainfall occurrence</h4>
<p>The given script produces a set of heatmaps to visualize event data related to different scenarios of rainfall given that the current day is rainy. The data, divided by months and days, represents whether it’s a rainy day (indicated by a color) or a sunny day (represented by a white block).</p>
<p>A heatmap is an apt choice of visualization here as it allows for an immediate visual assessment of patterns and trends in the data over a period of time (in this case, over the days of each month). Moreover, the color contrast between rainy and sunny days helps to easily distinguish between the two events. Heatmaps also excel at handling and displaying data over two dimensions (months and days, in this context), making them a clear choice for this kind of data presentation.</p>
<p>Firstly, the code defines different types of events (represented as ‘P_types’), the layout for the subplots, and the number of days in each month (accounting for leap years).</p>
<p>Then, it loops over each event type, creating a 2D array filled with NaNs to hold the event data for each day of each month. The event data is split by month and filled into this array, ensuring the correct day and month placement for each event.</p>
<p>Next, a heatmap for each event type is generated using seaborn, with a color scheme denoting the presence or absence of rainfall, and an outline for each day block to enhance readability. The heatmap’s axes and title are customized for each scenario.</p>
<p>A legend is also created to indicate the meanings of the colors in the heatmaps. The code finally adds a main title for the set of heatmaps, adjusts the layout for clear viewing, and displays the visualizations.</p>
<p>Before running the code below, please make sure you already have “seaborn” installed. If not, install it using “pip install seaborn”.</p>
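<p>A sketch of one such heatmap with seaborn, using a randomly generated month-by-day grid in place of the real event data; the P-type name and figure styling are illustrative:</p>

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)

# Hypothetical month x day grid of 1/0 rain events for one P-type;
# NaN pads the days that a month does not have (non-leap year shown).
days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
grid = np.full((12, 31), np.nan)
for m, n in enumerate(days_in_month):
    grid[m, :n] = rng.integers(0, 2, n)

fig, ax = plt.subplots(figsize=(8, 4))
sns.heatmap(grid, cmap="Blues", cbar=False, linewidths=0.5,
            linecolor="grey", ax=ax)
ax.set_xlabel("Day of month")
ax.set_ylabel("Month")
ax.set_title("event_P011")
fig.savefig("event_P011.png")
```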
</section>
<section id="gamma-distribution" class="level4">
<h4 class="anchored" data-anchor-id="gamma-distribution">4.1.10 Gamma distribution</h4>
<p>This code analyses a dataset of rainfall patterns. It first loads the data, and prepares it by converting the ‘date’ column into a datetime format and adding a ‘month’ column. It then assigns each entry to a season (DJF, MAM, JJA, or SON) based on the month of the year. After isolating only the rainy days, the script applies a Gamma distribution model for each season’s rainfall data. The parameters (alpha and beta) of the Gamma distribution for each season are corrected for small sample sizes using the Greenwood and Durand method. These corrected parameters are then stored in a new DataFrame, which is exported as a CSV file for future use or analysis. The resulting DataFrame provides a seasonal breakdown of the rainfall data, and offers insights into how the rainfall pattern is distributed for each season.</p>
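<p>The Greenwood and Durand (1960) approximation can be sketched as below, checked against a synthetic sample drawn with known parameters; the function name is illustrative, not the post's actual code:</p>

```python
import numpy as np

def gamma_params_gd(x):
    """Greenwood &amp; Durand (1960) approximation to the ML gamma fit.

    Returns (alpha, beta) for positive rainfall amounts x; a sketch of
    the seasonal fitting step the post describes.
    """
    x = np.asarray(x, dtype=float)
    a = np.log(x.mean()) - np.log(x).mean()  # sample statistic D
    if a <= 0.5772:
        alpha = (0.5000876 + 0.1648852 * a - 0.0544274 * a**2) / a
    else:
        alpha = (8.898919 + 9.059950 * a + 0.9775373 * a**2) / (
            a * (17.79728 + 11.968477 * a + a**2))
    return alpha, x.mean() / alpha

rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=5.0, size=5000)  # known truth to check
alpha, beta = gamma_params_gd(sample)
print(round(alpha, 2), round(beta, 2))  # close to 2.0 and 5.0
```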
<p>The code above produces an output preview like the one below.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-08.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-8"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-08.jpg" class="img-fluid"></a></p>
</section>
<section id="generate-rainfall-value" class="level4">
<h4 class="anchored" data-anchor-id="generate-rainfall-value">4.1.11 Generate rainfall value</h4>
<p>Now that we have estimated the parameters for the gamma distribution for each season, and have generated event data, we can generate rainfall values based on these parameters and events.</p>
<p>The gamma distribution is only used to generate rainfall values for rainy days (where event = 1), as it is typically used to model positive continuous data, and cannot generate the zero values corresponding to non-rainy days.</p>
<p>In this script, we create a new rainfall_PXXX column for each event_PXXX column. For each season, we select the days where event_PXXX = 1, and generate rainfall values for these days using the gamma distribution with the corresponding alpha and beta parameters. These generated values are then stored in the rainfall_PXXX column. At the end, the updated DataFrame is saved to a new CSV file.</p>
<p>Here’s how we could do this for each P-type.</p>
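<p>A minimal sketch of this step for one P-type; the events, seasons, and (alpha, beta) values here are invented for illustration:</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical events for one P-type with their season labels.
df = pd.DataFrame({
    "season": ["DJF", "DJF", "JJA", "JJA", "DJF"],
    "event_P011": [1, 0, 1, 1, 1],
})
params = {"DJF": (2.0, 8.0), "JJA": (1.5, 3.0)}  # assumed (alpha, beta)

# Draw a gamma amount for each rainy day using that season's parameters;
# dry days stay at zero, since the gamma cannot produce zeros.
df["rainfall_P011"] = 0.0
for season, (alpha, beta) in params.items():
    mask = (df["season"] == season) & (df["event_P011"] == 1)
    df.loc[mask, "rainfall_P011"] = rng.gamma(alpha, beta, mask.sum())

print(df["rainfall_P011"].round(1).tolist())
```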
<p>The code above produces an output preview like the one below.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-09.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-9"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-09.jpg" class="img-fluid"></a></p>
</section>
</section>
<section id="evaluations" class="level3">
<h3 class="anchored" data-anchor-id="evaluations">4.2 Evaluations</h3>
<p>Evaluating the quality of our predicted rainfall values depends on the specific goals of our analysis and the characteristics of our data. However, here are several common methods for evaluating prediction quality.</p>
<section id="visualize-the-rainfall-compared-to-predicted-rainfall" class="level4">
<h4 class="anchored" data-anchor-id="visualize-the-rainfall-compared-to-predicted-rainfall">4.2.1 Visualize the rainfall compared to predicted rainfall</h4>
<p>The given script produces a set of plots comparing the observed rainfall with the predicted rainfall for the different scenarios in which the current day is rainy.</p>
<p>This code is meant to load, process, and plot data on annual rainfall and rainfall predictions from the years 1984 to 2021.</p>
<p>It initializes a plot with 10 rows and 4 columns to make room for a line plot for each year from 1984 to 2021. Each plot will compare actual rainfall (in light blue) with the predicted rainfall (in orange) over the course of a year.</p>
</section>
<section id="performance" class="level4">
<h4 class="anchored" data-anchor-id="performance">4.2.2 Performance</h4>
<p>Distribution of Errors (Residuals): We can plot a histogram or a Kernel Density Estimate plot of the residuals, which are the differences between the actual and predicted values. If our model is a good fit, the residuals should be normally distributed around zero.</p>
<p>Time Series of Residuals: Plotting residuals over time can show whether the errors are consistent throughout the time series, or if they vary significantly at certain time periods.</p>
<p>Boxplot of Errors by Year: This can help us see if the model’s performance varies significantly from year to year.</p>
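<p>The numbers behind those three plots can be sketched as follows, using synthetic observed and predicted series in place of the station data:</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical observed vs predicted daily rainfall for two years.
idx = pd.date_range("1984-01-01", "1985-12-31", freq="D")
obs = pd.Series(rng.gamma(1.2, 6.0, len(idx)), index=idx)
pred = obs + rng.normal(0, 2.0, len(idx))  # predictions with random error

resid = pred - obs  # residuals, one per day
by_year = resid.groupby(resid.index.year)

# Summary numbers behind the three diagnostics described above:
print(abs(resid.mean()) < 0.3)           # roughly centred on zero
print(by_year.std().round(1).to_dict())  # spread per year for the boxplot
```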
</section>
</section>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">5 Results</h2>
<p>We delve into a comprehensive analysis of rainfall prediction and its various aspects. By examining the curve adjustment chart and transforming probabilities into rainfall events, we gain insights into the predicted outcomes. Furthermore, we assess the performance of these predictions using visual comparisons, distributed errors (residuals), time series of residuals, and boxplot of error by year. This chapter aims to elucidate the accuracy and reliability of our rainfall prediction model.</p>
<section id="adjustment-curve" class="level3">
<h3 class="anchored" data-anchor-id="adjustment-curve">5.1 Adjustment curve</h3>
<p>The scatter plot visualizes the adjustment curve for generating daily rainfall data using Fourier regression analysis. The data spans from 1984 to 2021. Each subplot corresponds to different weather patterns, characterized by the variables ‘P001’, ‘P011’, ‘P101’, ‘P111’, ‘P000’, ‘P010’, ‘P100’, and ‘P110’.</p>
<p>The top row of plots shows the fitting model for rainy days (‘P001’, ‘P011’, ‘P101’, ‘P111’). Here, the patterns in the fitted models (g_fit) align with the data generated by g_a, indicating that the Fourier model accurately captures the distribution pattern of rainfall across different types of rainy day events.</p>
<p>The second row presents the fitting model for dry days (‘P000’, ‘P010’, ‘P100’, ‘P110’). In these plots, the peak of the dry season, occurring in the June-July-August (JJA) period, is prominently reflected in the peak of the g_fit line plot. Conversely, the rainfall is lowest during this period, which is depicted as a valley in the model.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-19.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-10"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-19.jpg" class="img-fluid"></a></p>
<p>Above visualization effectively demonstrates the application and accuracy of the Fourier regression analysis in modeling and simulating daily weather patterns, both for rainy and dry conditions, over a significant period. The g_fit line plots accurately reflect the distribution patterns of the original data (g), implying that the Fourier model is a suitable tool for simulating these weather patterns.</p>
</section>
<section id="transforming-the-probability-into-rainfall-event" class="level3">
<h3 class="anchored" data-anchor-id="transforming-the-probability-into-rainfall-event">5.2 Transforming the probability into rainfall event</h3>
<p>The image is a set of four heatmaps, each representing a different scenario: ‘P001’, ‘P011’, ‘P101’, and ‘P111’, the four rainy-day transition patterns. Each heatmap shows the pattern of rainfall across a year. The x-axis denotes the day of the month while the y-axis represents the month itself, ranging from 1 (January) to 12 (December).</p>
<p>The color intensity in each cell indicates the probability of rainfall. Darker shades symbolize a higher likelihood of rain, while lighter shades indicate a lower likelihood. This color gradient allows us to visually comprehend the variability and seasonality of rainfall across different periods of the year.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-20.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-11"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-20.jpg" class="img-fluid"></a></p>
<p>From these heatmaps, one can observe the days and months when rainfall is more or less likely, given that the day is classified as ‘rainy’. These visualizations provide an intuitive understanding of rainfall patterns and their variations throughout the year for each respective scenario.</p>
</section>
<section id="predicted-rainfall" class="level3">
<h3 class="anchored" data-anchor-id="predicted-rainfall">5.3 Predicted rainfall</h3>
<p>The daily rainfall generated by the Fourier regression model is compared with the daily observation data from the Bogor Climatology Station for 1984-2021 in the image below (example using years 2008-2009 and 2012-2013). The rainfall values produced by the model are higher than the observation data (overestimated), with a pattern that tends to be somewhat dissimilar.</p>
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-10.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-12"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-10.jpg" class="img-fluid"></a></p>
</section>
<section id="performance-1" class="level3">
<h3 class="anchored" data-anchor-id="performance-1">5.4 Performance</h3>
<p>Evaluating the quality of our predicted rainfall values depends on the specific goals of our analysis and the characteristics of our data. However, here are several common methods for evaluating prediction quality:</p>
<p>Visual comparison: a plot of the predicted values against the observed values. This gives us a quick, intuitive sense of how closely our predictions match the actual values. While visually comparing the predicted and actual rainfall data is important and necessary, it is not sufficient on its own to evaluate the performance of the prediction model.</p>
<div class="image-gallery">
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-21.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-13"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-21.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-22.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-14"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-22.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-23.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-15"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-23.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-24.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-16"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-24.jpg" class="img-fluid"></a></p>
</div>
</div>
<p>Distribution of Errors (Residuals): This plot shows the distribution of residuals (errors), which are the differences between the predicted and actual rainfall values.</p>
<div class="image-gallery">
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-15.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-17"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-15.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-16.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-18"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-16.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-17.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-19"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-17.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-18.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-20"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-18.jpg" class="img-fluid"></a></p>
</div>
</div>
<p>In the context of rainfall prediction, if the residuals are normally distributed and centered around zero, it indicates that your model has made errors that are random and not biased, which is a good sign. If the distribution is not centered around zero or is highly skewed, it indicates that your model may be consistently overestimating or underestimating the rainfall.</p>
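<p>The bias check described above can be sketched numerically. The snippet below uses a synthetic residual series as a stand-in for the model errors (the data and thresholds are illustrative, not taken from this analysis):</p>

```python
import numpy as np

# Synthetic stand-in residuals (predicted minus observed rainfall, mm/day)
rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=2.0, size=365)

mean_err = residuals.mean()  # should sit near zero if the model is unbiased
skewness = ((residuals - mean_err) ** 3).mean() / residuals.std() ** 3

print(f"Mean error: {mean_err:.3f}")
print(f"Skewness:   {skewness:.3f}")

# A mean far from zero suggests systematic over- or underestimation;
# strong skew suggests an asymmetric error structure.
```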
<p>Time Series of Residuals: This plot shows how residuals change over time.</p>
<div class="image-gallery">
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-25.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-21"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-25.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-26.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-22"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-26.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-27.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-23"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-27.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-28.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-24"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-28.jpg" class="img-fluid"></a></p>
</div>
</div>
<p>We should expect to see no clear pattern in the residuals over time. If we see patterns, such as the residuals increasing or decreasing over time, it suggests that our model is not capturing some trend in the data. This could indicate a problem with our model that needs to be addressed.</p>
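<p>One simple way to screen for such a trend is to fit a straight line to the residual series and inspect its slope; the series below is a synthetic stand-in:</p>

```python
import numpy as np

# Stand-in residual series; a fitted slope well away from zero
# flags a time trend the model failed to capture
rng = np.random.default_rng(7)
t = np.arange(365)
residuals = rng.normal(0.0, 2.0, t.size)

slope, intercept = np.polyfit(t, residuals, 1)
print(f"Trend in residuals: {slope:.4f} mm/day per time step")
```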
<p>Boxplot of Error by Year: This plot shows the distribution of residuals for each year.</p>
<div class="image-gallery">
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-11.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-25"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-11.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-12.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-26"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-12.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-13.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-27"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-13.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-14.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-28"><img src="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-14.jpg" class="img-fluid"></a></p>
</div>
</div>
<p>This can help you understand if your model’s performance is consistent over time. If some years have much higher or lower residuals, it may indicate that those years had unusual rainfall patterns that your model didn’t capture. You may want to investigate further to understand what’s causing these discrepancies.</p>
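<p>The per-year grouping behind such a boxplot can be sketched with pandas; again, the residual series here is synthetic and only illustrates the mechanics:</p>

```python
import numpy as np
import pandas as pd

# Synthetic stand-in residuals indexed by day, spanning several years
rng = np.random.default_rng(0)
dates = pd.date_range("2001-01-01", "2005-12-31", freq="D")
resid = pd.Series(rng.normal(0.0, 2.0, len(dates)), index=dates)

# Per-year summary: a year whose median sits far from zero, or whose
# spread is much wider, is a candidate for closer inspection
summary = resid.groupby(resid.index.year).agg(["median", "std"])
print(summary)

# The same yearly grouping feeds a boxplot directly, e.g. via
# matplotlib's boxplot with one array of residuals per year.
```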
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">6 Conclusion</h2>
<p>Markov-chain models, when combined with Fourier regression equations and logit transformations, can be useful in estimating rainfall occurrence probabilities and generating synthetic rainfall data. This generated data can have practical applications in various fields, such as agriculture and urban planning, where accurate rainfall data is crucial for informed decision-making.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">7 References</h2>
<p>Ashley, R. M., Balmforth, D. J., Saul, A. J., &amp; Blanskby, J. D. (2005). Flooding in the future–predicting climate change, risks and responses in urban areas. Water Science and Technology, 52(5), 265-273. <a href="https://doi.org/10.2166/WST.2005.0142" class="uri">https://doi.org/10.2166/WST.2005.0142</a></p>
<p>Bellone, E., Hughes, J. P., &amp; Guttorp, P. (2000). A hidden Markov model for downscaling synoptic atmospheric patterns to precipitation amounts. Climate Research, 15(1), 1-12. <a href="https://www.jstor.org/stable/e24867295" class="uri">https://www.jstor.org/stable/e24867295</a></p>
<p>Boer, R., Notodipuro, K. A., &amp; Las, I. (1999). Prediction of Daily Rainfall Characteristics from Monthly Climate Indices. RUT-IV report. National Research Council, Indonesia.</p>
<p>Cho, H., Bowman, K. P., &amp; North, G. R. (2004). A Comparison of Gamma and Lognormal Distributions for Characterizing Satellite Rain Rates from the Tropical Rainfall Measuring Mission. J. Appl. Meteor. Climatol., 43, 1586–1597. <a href="https://doi.org/10.1175/JAM2165.1" class="uri">https://doi.org/10.1175/JAM2165.1</a></p>
<p>Hughes, J. P., Guttorp, P., &amp; Charles, S. P. (1999). A non-homogeneous hidden Markov model for precipitation occurrence. Journal of the Royal Statistical Society: Series C (Applied Statistics), 48(1), 15-30. <a href="https://doi.org/10.1111/1467-9876.00136" class="uri">https://doi.org/10.1111/1467-9876.00136</a></p>
<p>Rodriguez-Iturbe, I., Cox, D. R., &amp; Isham, V. (1987). Some models for rainfall based on stochastic point processes. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 410(1839), 269-288. <a href="https://doi.org/10.1098/rspa.1987.0039" class="uri">https://doi.org/10.1098/rspa.1987.0039</a></p>
<p>Rosenzweig, C., Tubiello, F. N., Goldberg, R., Mills, E., &amp; Bloomfield, J. (2002). Increased crop damage in the US from excess precipitation under climate change. Global Environmental Change, 12(3), 197-202. <a href="https://doi.org/10.1016/S0959-3780(02)00008-0" class="uri">https://doi.org/10.1016/S0959-3780(02)00008-0</a></p>
<p>Srikanthan, R., &amp; McMahon, T. A. (2001). Stochastic generation of annual, monthly and daily climate data: A review. Hydrology and Earth System Sciences Discussions, 5(4), 653-670. <a href="https://doi.org/10.5194/hess-5-653-2001" class="uri">https://doi.org/10.5194/hess-5-653-2001</a></p>
<p>Stern, R. D., &amp; Coe, R. (1984). A model fitting analysis of daily rainfall data. Journal of the Royal Statistical Society. Series A (General), 147(1), 1-34. <a href="https://doi.org/10.2307/2981736" class="uri">https://doi.org/10.2307/2981736</a></p>
<p>VanTassell, L. W., Richardson, J. W., &amp; Conner, J. R. (1990). Simulation of meteorological data for use in agricultural production studies. Agricultural Systems, 34, 319-336. <a href="https://doi.org/10.1016/0308-521X(90)90011-E" class="uri">https://doi.org/10.1016/0308-521X(90)90011-E</a></p>
<p>Waggoner, P. E. (1989). Anticipating the frequency distribution of precipitation if climate change alters its mean. Agricultural and Forest Meteorology, 47, 321-337. <a href="https://doi.org/10.1016/0168-1923(89)90103-2" class="uri">https://doi.org/10.1016/0168-1923(89)90103-2</a></p>
<p>Wilks, D. S. (1990). Maximum likelihood estimation for the gamma distribution using data containing zeros. Journal of Climate, 3(12), 1495-1501. <a href="https://doi.org/10.1175/1520-0442(1990)003%3C1495:MLEFTG%3E2.0.CO;2" class="uri">https://doi.org/10.1175/1520-0442(1990)003%3C1495:MLEFTG%3E2.0.CO;2</a></p>
<p>Wilks, D. S. (1998). Multisite generalization of a daily stochastic precipitation generation model. Journal of Hydrology, 210(1-4), 178-191. <a href="https://doi.org/10.1016/S0022-1694(98)00186-3" class="uri">https://doi.org/10.1016/S0022-1694(98)00186-3</a></p>
<p>Wilks, D. S. (2011). Statistical methods in the atmospheric sciences (Vol. 100). Academic press. <a href="https://doi.org/10.1016/C2017-0-03921-6" class="uri">https://doi.org/10.1016/C2017-0-03921-6</a></p>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Research</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall.html</guid>
  <pubDate>Fri, 26 May 2023 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20230526-second-order-markov-chain-model-to-generate-time-series-of-occurrence-and-rainfall-01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Impact of climate change in cities</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230525-impact-of-climate-change-in-cities.html</link>
  <description><![CDATA[ 





<p>A new World Bank report has been <a href="https://www.worldbank.org/en/publication/thriving">launched</a>, to which I had the opportunity to contribute analysis.</p>
<p>The report examines the two-way relationship between cities and climate change, offering valuable insights to help cities boost resilience and thrive, both now and in the future.</p>
<p>The team used temperature- and precipitation-based indices (SPEI, SPI, CDD, CWD, the number of annual hot days, and annual mean temperature), together with the distance and magnitude of tropical cyclones relative to the city center, to support the analysis, which is organized into four inter-related workstreams:</p>
<ul>
<li>Who is affected?</li>
<li>Stressors that make urban development less green.</li>
<li>Stressors that make urban development less resilient.</li>
<li>Stressors that make urban development less inclusive.</li>
</ul>
<p>If you are interested in reading the story and the report, the links are below:</p>
<ol type="1">
<li>Story: <a href="https://www.worldbank.org/en/publication/thriving" class="uri">https://www.worldbank.org/en/publication/thriving</a></li>
<li>Recording of the live event: <a href="https://live.worldbank.org/events/thriving-making-cities-climate-ready" class="uri">https://live.worldbank.org/events/thriving-making-cities-climate-ready</a></li>
<li>Publication: <a href="https://openknowledge.worldbank.org/entities/publication/7d290fa9-da18-53b6-a1a4-be6f7421d937" class="uri">https://openknowledge.worldbank.org/entities/publication/7d290fa9-da18-53b6-a1a4-be6f7421d937</a></li>
</ol>
<p><a href="../assets/image-blog/20230525-impact-of-climate-change-in-cities-01.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20230525-impact-of-climate-change-in-cities-01.jpg" class="img-fluid"></a></p>



<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Research</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20230525-impact-of-climate-change-in-cities.html</guid>
  <pubDate>Thu, 25 May 2023 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20230525-impact-of-climate-change-in-cities-01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Fuzzy Inference System (FIS) for Flood Risk Assessment</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment.html</link>
  <description><![CDATA[ 





<p><strong>1 Implementation</strong></p>
<p>In the implementation phase of this analysis, we utilized Python and the <code>Scikit-Fuzzy</code> library to develop a fuzzy logic-based flood risk assessment model. This model took into account four essential factors affecting flood risks: precipitation intensity, soil moisture, land cover, and slope. By defining the fuzzy sets and rules for these variables, the model was able to estimate the flood risk for various combinations of input values. The ultimate goal of this implementation was to identify the conditions under which low flood risks could be achieved, even in situations where precipitation intensity was at its maximum.</p>
<p><strong>1.1 How-to?</strong></p>
<p>In the first stage of the analysis, we defined the variables that influence flood risk. These variables include precipitation intensity, soil moisture, land cover, and slope. Each of these variables was represented as a fuzzy variable using the <code>Scikit-Fuzzy</code> library’s <code>Antecedent</code> class. Additionally, we defined the output variable <code>flood_risk</code> using the <code>Consequent</code> class. This stage set the foundation for the fuzzy logic-based flood risk assessment model by establishing the key variables that the model would use to estimate flood risk.</p>
<p>In the second and third stages, we focused on defining the fuzzy sets and their respective membership functions for each of the variables defined in the first stage. We used the <code>automf()</code> function to automatically generate triangular membership functions for precipitation intensity, soil moisture, and flood risk, each with three levels: <code>low</code>, <code>medium</code>, and <code>high</code>. For the land cover and slope variables, we manually defined triangular membership functions, specifying the appropriate ranges for each fuzzy set (<code>urban</code>, <code>vegetation</code>, and <code>bare_soil</code> for land cover, and <code>flat</code>, <code>moderate</code>, and <code>steep</code> for slope). These stages were critical for establishing the relationships between the input variables and the output flood risk, which would later be used to evaluate different combinations of input values in the fuzzy inference process.</p>
<p>In the fourth stage, we defined the rules that describe the relationships between the input variables (precipitation intensity, soil moisture, land cover, and slope) and the output variable (flood risk). We first created a list of classifications for each input variable and the output variable. Using the multiplication principle, we calculated the total number of possible combinations of these classifications, resulting in 81 unique rules.</p>
<p>For each combination of input classifications, we determined the appropriate flood risk level based on a set of predefined conditions. These conditions were based on expert knowledge and domain understanding, considering factors such as high precipitation and soil moisture, bare soil land cover, and steep slopes. After determining the flood risk level for each combination, we created a fuzzy rule using the <code>Scikit-Fuzzy</code> library’s <code>Rule</code> class, linking the input conditions with the corresponding flood risk level. These rules formed the basis of the fuzzy inference system that was used to evaluate different scenarios and estimate the corresponding flood risks.</p>
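<p>The combinatorial construction of the 81 rules can be sketched as follows. The risk-assignment heuristic below is an illustrative stand-in for the expert conditions described above, and the Scikit-Fuzzy <code>ctrl.Rule</code> call each tuple would feed is shown in a comment:</p>

```python
from itertools import product

# Classification names follow the post; the scoring heuristic in
# risk_level() is illustrative, not the author's exact rule table.
precip_levels = ["low", "medium", "high"]
moisture_levels = ["low", "medium", "high"]
cover_levels = ["urban", "vegetation", "bare_soil"]
slope_levels = ["flat", "moderate", "steep"]

def risk_level(p, m, c, s):
    """Heuristic stand-in for the expert rule conditions."""
    score = {"low": 0, "medium": 1, "high": 2}[p]
    score += {"low": 0, "medium": 1, "high": 2}[m]
    score += {"vegetation": 0, "urban": 1, "bare_soil": 2}[c]
    score += {"flat": 0, "moderate": 1, "steep": 2}[s]
    return "low" if score <= 2 else "medium" if score <= 5 else "high"

rules = []
for p, m, c, s in product(precip_levels, moisture_levels,
                          cover_levels, slope_levels):
    out = risk_level(p, m, c, s)
    # With Scikit-Fuzzy, each tuple becomes a rule, e.g.:
    # ctrl.Rule(precipitation[p] & soil_moisture[m] &
    #           land_cover[c] & slope[s], flood_risk[out])
    rules.append((p, m, c, s, out))

print(len(rules))  # 3**4 = 81 combinations
```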
<p>In the fifth stage, we created the control system and simulation by combining the defined rules from the previous stage. The <code>Scikit-Fuzzy</code> library’s <code>ControlSystem</code> and <code>ControlSystemSimulation</code> classes were used for this purpose. The <code>ControlSystem</code> class takes the set of rules as input and initializes the fuzzy inference system, while the <code>ControlSystemSimulation</code> class initializes a simulation environment that can be used to compute the output based on the input values.</p>
<p>In the sixth stage, we provided example input values for each input variable (precipitation, soil moisture, land cover, and slope) to test the fuzzy inference system. The input values were assigned to their corresponding input variables in the simulation, and the compute method of the <code>ControlSystemSimulation</code> object was called to perform the fuzzy inference process and obtain the output flood risk level.</p>
<p>In the final stage, we output the computed flood risk level and visualize the result using the <code>Scikit-Fuzzy</code> library’s built-in plotting capabilities. The flood risk level was displayed as a numerical value, while the visualization provided a graphical representation of the membership functions and the defuzzified output. This allowed us to assess the performance of the fuzzy inference system and analyze the relationships between the input variables and the flood risk.</p>
<p>Running the simulation returns:</p>
<pre><code>Flood Risk Value: 83.33333333333336
Flood Risk Category: high</code></pre>
<p>And a plot below</p>
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-01.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-01.jpg" class="img-fluid"></a></p>
<p>The initial implementation of the fuzzy inference system for flood risk assessment has been completed successfully. By providing example input values for precipitation (<code>100</code>), soil moisture (<code>50</code>), land cover (<code>25</code>), and slope (<code>30</code>) in Stage 6, we have demonstrated the functionality of the fuzzy system. The system processes these inputs through the defined membership functions, rules, and defuzzification methods to produce an output flood risk value and the corresponding flood risk category.</p>
<p>Upon evaluating the system with the given input values, a flood risk value is generated, and the <code>flood_risk.view(sim=flood_risk_sim)</code> function provides a visual representation of the output. The plot displays the aggregated output membership functions and indicates the defuzzified crisp value. In this case, the plot reflects the flood risk level based on the provided inputs, and the computed flood risk category helps to understand the risk associated with the given conditions. With this initial implementation, we have set the foundation for further analyses and can adapt or extend the fuzzy system as needed to address specific flood risk assessment scenarios.</p>
<p><strong>1.2 Plot the membership function of the input variables</strong></p>
<p>The provided code visualizes the membership functions for each of the input variables (Precipitation Intensity, Soil Moisture, Land Cover, and Slope) and the output variable (Flood Risk Level) in the fuzzy inference system. Here’s a summary of what each part of the code does:</p>
<ul>
<li><code>precipitation.view(sim=flood_risk_sim)</code>: Plots the membership functions for the Precipitation Intensity variable, displaying how the input values are categorized into low, medium, and high precipitation levels.</li>
<li><code>soil_moisture.view(sim=flood_risk_sim)</code>: Plots the membership functions for the Soil Moisture variable, showing how the input values are categorized into low, medium, and high soil moisture levels.</li>
<li><code>land_cover.view(sim=flood_risk_sim)</code>: Plots the membership functions for the Land Cover variable, illustrating how the input values are categorized into the <code>urban</code>, <code>vegetation</code>, and <code>bare_soil</code> classes.</li>
<li><code>slope.view(sim=flood_risk_sim)</code>: Plots the membership functions for the Slope variable, demonstrating how the input values are categorized into the <code>flat</code>, <code>moderate</code>, and <code>steep</code> classes.</li>
<li><code>flood_risk.view()</code>: Plots the membership functions for the output variable, Flood Risk Level, indicating how the output values are categorized into low, medium, and high flood risk levels.</li>
<li><code>flood_risk.view(sim=flood_risk_sim)</code>: Plots the final Flood Risk Level for given input values, illustrating how the fuzzy inference system computes the flood risk based on the input variable values and the defined fuzzy rules.</li>
</ul>
<div class="image-gallery">
<div class="gallery-item">
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-04.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-04.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-05.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-05.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-06.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-06.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-07.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-5"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-07.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-08.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-6"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-08.jpg" class="img-fluid"></a></p>
</div>
<div class="gallery-item">
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-09.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-7"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-09.jpg" class="img-fluid"></a></p>
</div>
</div>
<p>To interpret the plots, observe how each input variable is divided into categories (low, medium, high) based on the membership functions. These categories represent the degree to which an input value belongs to a particular category.</p>
<p>The output variable plot shows how the flood risk levels are determined based on the input variables’ membership values and the fuzzy rules defined in the system. The final plot, Flood Risk Level for Given Input Values, displays the aggregated output membership functions and the computed flood risk level as a single value.</p>
<p><strong>1.3 2D Plot</strong></p>
<p>The provided code generates a 2D contour plot of flood risk as a function of Precipitation Intensity and Land Cover, while fixing the values of Soil Moisture and Slope. Here’s a summary of what each part of the code does:</p>
<ul>
<li>Create grid points for input variables: Define a range of values for each input variable (Precipitation Intensity, Soil Moisture, Land Cover, and Slope) using <code>np.linspace()</code>.</li>
<li>Define <code>compute_flood_risk()</code> function: This function takes Precipitation Intensity (P), Soil Moisture (M), Land Cover (L), and Slope (S) as inputs and computes the flood risk using the fuzzy inference system (<code>flood_risk_sim</code>).</li>
<li>Fix Soil Moisture and Slope values: Assign fixed values to Soil Moisture (M_fixed) and Slope (S_fixed).</li>
<li>Create flood risk matrix: Initialize a matrix with the size of the combination of Precipitation Intensity (P_values) and Land Cover (L_values). Iterate through each combination of these values and compute the flood risk using the <code>compute_flood_risk()</code> function with the fixed values of Soil Moisture and Slope.</li>
<li>Plot the 2D contour plot: Using <code>plt.contourf()</code>, create a contour plot that visualizes the flood risk as a function of Precipitation Intensity and Land Cover. The color map ‘viridis’ is used to represent the flood risk levels, with 20 contour levels.</li>
<li>Add colorbar, labels, and title: Add a colorbar to represent the flood risk values, label the axes, and add a title that includes the fixed values of Soil Moisture and Slope.</li>
</ul>
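<p>A minimal version of this contour-plot scaffold is sketched below. Because the fuzzy simulation itself is not reproduced here, <code>compute_flood_risk()</code> is a simple analytic stand-in; in the actual analysis it would set the inputs on <code>flood_risk_sim</code>, call <code>compute()</code>, and return the defuzzified output:</p>

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

def compute_flood_risk(P, M, L, S):
    """Analytic stand-in for the fuzzy simulation (illustrative only)."""
    return 0.4 * P + 0.2 * M + 0.2 * L + 0.2 * (100 * S / 45)

# Grid points for the two free variables
P_values = np.linspace(0, 100, 50)   # precipitation intensity
L_values = np.linspace(0, 100, 50)   # land cover
M_fixed, S_fixed = 50, 20            # fixed soil moisture and slope

# Flood risk matrix over all (P, L) combinations
risk = np.zeros((len(P_values), len(L_values)))
for i, P in enumerate(P_values):
    for j, L in enumerate(L_values):
        risk[i, j] = compute_flood_risk(P, M_fixed, L, S_fixed)

# 2D contour plot with 20 levels on the 'viridis' color map
fig, ax = plt.subplots()
cs = ax.contourf(L_values, P_values, risk, levels=20, cmap="viridis")
fig.colorbar(cs, label="Flood risk")
ax.set_xlabel("Land cover")
ax.set_ylabel("Precipitation intensity")
ax.set_title(f"Flood risk (soil moisture={M_fixed}, slope={S_fixed})")
fig.savefig("flood_risk_contour.png")
```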
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-10.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-8"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-10.jpg" class="img-fluid"></a></p>
<p>To interpret the plot, observe how the flood risk values change as the Precipitation Intensity and Land Cover values vary. The plot shows how the flood risk is influenced by these two input variables while keeping the other two (Soil Moisture and Slope) fixed at specific values.</p>
<p>The contour lines in the plot represent different levels of flood risk, with the color intensity indicating the flood risk level. Darker colors represent lower flood risk, and lighter colors represent higher flood risk.</p>
<p><strong>1.4 3D Plot</strong></p>
<p>The provided code generates a 3D surface plot of flood risk as a function of Precipitation Intensity and Land Cover, while fixing the values of Soil Moisture and Slope. Here’s a summary of what each part of the code does:</p>
<ul>
<li>Create a 3D plot figure: Initialize a new figure using <code>plt.figure()</code> and add a 3D subplot with the <code>projection='3d'</code> argument.</li>
<li>Create the 3D surface plot: Use the <code>ax.plot_surface()</code> function to create a 3D surface plot for Precipitation Intensity (Y-axis) vs Land Cover (X-axis), with the flood risk as the Z-axis. The color map ‘viridis’ is used to represent the flood risk levels.</li>
<li>Add colorbar, labels, and title: Add a colorbar to represent the flood risk values, label the axes, and add a title that includes the fixed values of Soil Moisture and Slope.</li>
</ul>
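<p>The 3D counterpart can be sketched with <code>plot_surface()</code>, again using a simple analytic stand-in in place of the fuzzy simulation output:</p>

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

def compute_flood_risk(P, M, L, S):
    """Analytic stand-in for the fuzzy simulation (illustrative only)."""
    return 0.4 * P + 0.2 * M + 0.2 * L + 0.2 * (100 * S / 45)

M_fixed, S_fixed = 50, 20  # fixed soil moisture and slope
L_grid, P_grid = np.meshgrid(np.linspace(0, 100, 50),
                             np.linspace(0, 100, 50))
Z = compute_flood_risk(P_grid, M_fixed, L_grid, S_fixed)

# 3D surface: land cover (X) vs precipitation (Y) vs flood risk (Z)
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
surf = ax.plot_surface(L_grid, P_grid, Z, cmap="viridis")
fig.colorbar(surf, shrink=0.6, label="Flood risk")
ax.set_xlabel("Land cover")
ax.set_ylabel("Precipitation intensity")
ax.set_zlabel("Flood risk")
ax.set_title(f"Soil moisture={M_fixed}, slope={S_fixed}")
fig.savefig("flood_risk_surface.png")
```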
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-11.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-9"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-11.jpg" class="img-fluid"></a></p>
<p>To interpret the plot, observe how the flood risk values (Z-axis) change as the Precipitation Intensity and Land Cover values (X and Y axes) vary. The plot shows how the flood risk is influenced by these two input variables while keeping the other two (Soil Moisture and Slope) fixed at specific values. The color intensity on the surface indicates the flood risk level, with darker colors representing lower flood risk and lighter colors representing higher flood risk.</p>
<p>The 3D surface plot provides a more detailed visualization of the relationship between flood risk, Precipitation Intensity, and Land Cover compared to the 2D contour plot. You can observe the shape of the surface to identify areas with high or low flood risk and better understand the interaction between the input variables.</p>
<p><strong>2 Minimizing Flood Risks under Maximum Precipitation</strong></p>
<p>This chapter focuses on reducing flood risks under the most challenging condition (maximum precipitation), examining how the three remaining variables (soil moisture, land cover, and slope) shape the outcome.</p>
<p>Flood risk management is a critical aspect of urban planning and environmental protection. Understanding the factors that contribute to flood risks and identifying strategies to minimize these risks is essential for creating resilient communities. In this analysis, we explore the relationships between four key variables - precipitation intensity, soil moisture, land cover, and slope - to determine their influence on flood risk. <strong>Our goal is to identify the combinations of these variables that result in low flood risks, even under conditions of maximum precipitation.</strong></p>
<p>Using a fuzzy logic-based simulation model, we examine the interactions between these variables and their impact on flood risk. The model incorporates expert knowledge and rule-based systems to predict flood risk levels based on various input scenarios. By analyzing the simulation results, we aim to provide insights into the conditions that can effectively mitigate flood risks, helping policymakers and urban planners make informed decisions for better flood management strategies.</p>
<p>The analysis includes a scatterplot matrix visualization that highlights the relationships between soil moisture, land cover, and slope under maximum precipitation conditions. By interpreting this matrix, we can identify patterns and correlations between these variables that contribute to lower flood risks. These insights will help guide future efforts in designing urban areas and implementing flood management measures that are both effective and sustainable.</p>
<p>The provided code performs a sensitivity analysis to minimize flood risks under maximum precipitation conditions. It evaluates flood risk categories based on all input variables, generates a dataset of data points with different combinations of soil moisture, land cover, and slope values, and finally creates a scatterplot matrix. Here’s a summary of what each part of the code does:</p>
<ul>
<li>Define the maximum precipitation intensity: Set the value of <code>max_precipitation</code> to 100, which is considered high precipitation intensity.</li>
<li>Define the flood risk category function: Create a function <code>get_flood_risk_category()</code> that takes precipitation, soil moisture, land cover, and slope as input variables, and returns the flood risk category using the previously defined <code>categorize_flood_risk()</code> function.</li>
<li>Generate data points for soil moisture, land cover, and slope: Create arrays of evenly spaced values for each of these input variables.</li>
<li>Iterate through all combinations of soil moisture, land cover, and slope: For each combination, use the maximum precipitation value and the <code>get_flood_risk_category()</code> function to obtain the flood risk category. If the category is not <code>None</code>, append the combination to the <code>data_points</code> list.</li>
<li>Create a DataFrame containing the data points: Convert the list of data points into a pandas DataFrame, which makes it easier to analyze and visualize the data.</li>
<li>Create a scatterplot matrix: Use the seaborn library’s <code>pairplot()</code> function to create a scatterplot matrix of the data points, with flood risk categories represented by different colors.</li>
</ul>
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-02.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-10"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-02.jpg" class="img-fluid"></a></p>
<p>Here’s how to read the scatterplot matrix:</p>
<ul>
<li>The diagonal plots (from the top-left to the bottom-right) are bar plots showing the distribution of each variable. These plots give an idea of the frequency of different values for each variable when the flood risk is low under maximum precipitation conditions.</li>
<li>The off-diagonal plots are scatter plots showing the relationships between pairs of variables. These plots help identify any patterns or correlations between the variables. The color of the dots indicates the flood risk category associated with each data point.</li>
<li>In the off-diagonal plots, if you see that dots of a specific color (in this case, low flood risk) are clustered in a particular region, it indicates that certain combinations of variables are more likely to result in low flood risk conditions.</li>
</ul>
<p>To interpret the plots, consider the following:</p>
<ul>
<li>In the scatterplot between soil moisture and land cover, if there is a pattern or a specific region where low flood risk dots are clustered, it would suggest that there’s a relationship between these two variables that contributes to lower flood risks under maximum precipitation conditions.</li>
<li>Similarly, in the scatterplot between soil moisture and slope, look for clusters or patterns of low flood risk dots to identify any relationships between these variables that contribute to lower flood risks.</li>
<li>Finally, in the scatterplot between land cover and slope, examine the distribution of low flood risk dots to determine if there’s a connection between these variables that leads to lower flood risks.</li>
</ul>
<p>By analyzing these plots, we can gain insights into the relationships between soil moisture, land cover, and slope that contribute to low flood risks even under maximum precipitation conditions.</p>
<p><strong>3 Sensitivity Analysis</strong></p>
<p>After obtaining the flood risk assessment results from the Fuzzy Inference System (FIS), we can assess the quality of the model by comparing its predictions to observed data. To do this, we’ll need a dataset containing historical flood events along with the corresponding values of the input variables (Precipitation Intensity, Soil Moisture, Land Cover, and Slope).</p>
<p><strong>What if observation data on flood events never exist?</strong></p>
<p>If we don’t have any observation data to compare against the FIS model results, evaluating the model’s performance becomes more challenging. However, we can still follow some steps to ensure that our FIS model is reasonable and plausible:</p>
<ul>
<li>Expert knowledge: Consult with experts in the field of flood risk assessment to ensure that the fuzzy sets, membership functions, and fuzzy rules are realistic and based on sound principles. This can help us refine our FIS model even without actual observation data.</li>
<li>Sensitivity analysis: Perform a sensitivity analysis to understand how the output flood risk varies with changes in input variables. By altering the input variables within their expected range and studying the corresponding changes in flood risk, we can gain insight into the behavior of the model and identify any unrealistic responses.</li>
<li>Comparison with other models: If there are other flood risk assessment models available (either deterministic or statistical), compare our FIS model’s predictions with those from the other models. Although this is not a direct comparison with observed data, it can provide some indication of how our model’s performance compares to alternative approaches.</li>
<li>Simulation data: If we have access to hydrological or hydraulic models that can simulate flood events, we can use the simulated data as a proxy for observed data. Although this approach has its limitations, as the simulated data may not perfectly represent real-world conditions, it can still provide valuable information for evaluating our FIS model.</li>
<li>Temporal validation: If we have historical data for some of the input variables but not for the flood risk, we can still evaluate our FIS model by analyzing its performance over time. For instance, we can assess whether the model’s predictions of high flood risk align with periods of heavy rainfall, high soil moisture, or other conditions known to increase flood risk.</li>
</ul>
<p>Remember that without observed data, it is more challenging to assess the performance of our FIS model accurately. However, following the steps outlined above can help us gain some confidence in the model and identify areas for potential improvement.</p>
<p><strong>Let’s try Sensitivity Analysis</strong></p>
<p>We’ll use the One-at-a-time (OAT) sensitivity analysis method to understand the effect of varying each input variable while keeping the others fixed. Assume that we have the FIS model already built and implemented in Python using the variables and fuzzy rules defined earlier.</p>
<p>This code performs a sensitivity analysis to study the relationship between input variables (Precipitation, Soil Moisture, Land Cover, and Slope) and the output variable (Flood Risk) in a FIS. It evaluates the FIS model for different values of the input variables, keeping the other input variables at their median values.</p>
<p>Here is a summary of the main steps in the code:</p>
<ul>
<li>Define the range and step size for each input variable.</li>
<li>Calculate the flood risk for each input variable using the FIS model. The <code>sensitivity_analysis</code> function iterates over different values of each input variable while keeping the other input variables fixed at their median values.</li>
<li>Categorize the flood risk levels (low, medium, high) based on the computed flood risk values.</li>
<li>Plot the sensitivity analysis results, showing how flood risk varies with changes in the input variables.</li>
</ul>
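<p>A minimal sketch of this OAT loop follows. The <code>compute_flood_risk()</code> function below is a stand-in for the actual FIS, and the variable ranges and step sizes are hypothetical:</p>

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Stand-in for the FIS: any callable mapping the four inputs to a risk score
def compute_flood_risk(precip, soil_moisture, land_cover, slope):
    return (0.4 * precip + 0.3 * soil_moisture
            + 0.2 * land_cover + 0.1 * (45 - slope) / 45 * 100)

# Range (min, max) and step size for each input variable (hypothetical units)
variables = {
    "precipitation": (0, 100, 5),
    "soil_moisture": (0, 100, 5),
    "land_cover":    (0, 100, 5),
    "slope":         (0, 45, 2.5),
}
medians = {name: (lo + hi) / 2 for name, (lo, hi, _) in variables.items()}

def sensitivity_analysis(varied):
    """Vary one input over its range while holding the others at their medians."""
    lo, hi, step = variables[varied]
    values = np.arange(lo, hi + step, step)
    fixed = dict(medians)
    risks = []
    for v in values:
        fixed[varied] = v
        risks.append(compute_flood_risk(fixed["precipitation"],
                                        fixed["soil_moisture"],
                                        fixed["land_cover"],
                                        fixed["slope"]))
    return values, np.array(risks)

# One subplot per input variable: flood risk vs. the varied input
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, name in zip(axes.flat, variables):
    values, risks = sensitivity_analysis(name)
    ax.plot(values, risks, "o-")
    ax.set_xlabel(name)
    ax.set_ylabel("flood risk")
fig.tight_layout()
fig.savefig("oat_sensitivity.png")
```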
<p>The plot consists of four subplots, one for each input variable, with flood risk on the y-axis and the input variable on the x-axis. The background of each plot is filled with colors corresponding to the flood risk categories (low, medium, and high). The data points are plotted with different markers (‘o’, ‘s’, ‘x’) based on the input variable’s categories (low, medium, and high).</p>
<p><a href="../assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-03.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-11"><img src="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-03.jpg" class="img-fluid"></a></p>
<p>To interpret the plot, observe how the flood risk changes as the input variable value increases or decreases. A steep slope in the plot indicates that the flood risk is highly sensitive to changes in the input variable. If the flood risk remains relatively constant despite changes in the input variable, it suggests that the flood risk is less sensitive to that input variable.</p>
<p>To understand the meaning of the plot, consider that it represents how much the flood risk is affected by each input variable, given that other input variables are kept constant. By analyzing the plot, you can identify which input variables have a more significant impact on flood risk and prioritize interventions or mitigation strategies accordingly.</p>
<p><strong>4 Summary</strong></p>
<p>Flood risk assessment is a critical component of disaster management and urban planning. Accurate and reliable flood risk estimation helps authorities make informed decisions, prioritize resources, and implement effective mitigation strategies. With the increasing impacts of climate change and urbanization, there is a growing need for advanced techniques that can provide better insights into flood risk under varying conditions.</p>
<p>Fuzzy Inference Systems (FIS) offer a robust and flexible approach to model complex relationships between multiple input variables and an output variable, such as flood risk. By incorporating expert knowledge and handling uncertainties, FIS models can capture the intricacies of real-world systems, providing more accurate and reliable estimates of flood risk compared to traditional methods.</p>
<p>FIS models have gained popularity in the field of hydrological modeling and flood risk assessment due to their ability to handle imprecise and incomplete data, as well as their capability to incorporate human reasoning and intuition in the form of linguistic rules. This ability to integrate expert knowledge with quantitative data provides a valuable advantage, especially in situations where data availability is limited or uncertain.</p>
<p>The utilization of FIS in flood risk assessment typically involves defining input variables that influence flood risk, such as precipitation intensity, soil moisture, land cover, and slope. These variables are then used to estimate the flood risk level, which can be categorized into different levels, such as low, medium, or high.</p>
<p>To build an FIS model for flood risk assessment, the first step is to identify relevant input variables and their value domains. Next, fuzzy sets and membership functions are defined for each variable, followed by the formulation of fuzzy rules that describe the relationship between input variables and flood risk. These rules are derived from expert knowledge or empirical data and are used to determine the output flood risk level.</p>
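<p>As an illustration of these steps, here is a toy Mamdani-style FIS in plain NumPy, with triangular membership functions, hypothetical breakpoints (all variables on a 0–100 domain), and a deliberately tiny rule base. It sketches the mechanics only, not the post’s actual model:</p>

```python
import numpy as np

def trimf(x, a, b, c):
    """Triangular membership function with corners a <= b <= c."""
    x = np.asarray(x, dtype=float)
    left = (x - a) / (b - a) if b > a else np.ones_like(x)
    right = (c - x) / (c - b) if c > b else np.ones_like(x)
    return np.clip(np.minimum(left, right), 0.0, 1.0)

# Steps 1-2: fuzzy sets low/medium/high for every input (hypothetical breakpoints)
def fuzzify(value):
    return {"low":    trimf(value, 0, 0, 50).item(),
            "medium": trimf(value, 0, 50, 100).item(),
            "high":   trimf(value, 50, 100, 100).item()}

universe = np.linspace(0, 100, 201)          # output universe for flood risk
risk_sets = {"low":    trimf(universe, 0, 0, 50),
             "medium": trimf(universe, 0, 50, 100),
             "high":   trimf(universe, 50, 100, 100)}

def flood_risk(precip, soil_moisture, land_cover, slope):
    p, s, l, sl = map(fuzzify, (precip, soil_moisture, land_cover, slope))
    # Step 3: fuzzy rules (AND = min); an illustrative, tiny rule base
    rules = [("high",   min(p["high"], s["high"])),    # heavy rain on wet soil
             ("medium", min(p["medium"], l["high"])),  # moderate rain, built-up land
             ("low",    min(p["low"], sl["high"])),    # light rain, steep drainage
             ("low",    min(s["low"], l["low"]))]      # dry soil, vegetated land
    # Mamdani inference: clip each consequent at its firing strength, take max
    aggregate = np.zeros_like(universe)
    for label, strength in rules:
        aggregate = np.maximum(aggregate, np.minimum(strength, risk_sets[label]))
    # Step 4: defuzzify by centroid (midpoint fallback if no rule fires)
    if aggregate.sum() == 0:
        return 50.0
    return float((universe * aggregate).sum() / aggregate.sum())
```

<p>A real application would replace the breakpoints and rules with ones elicited from experts or data, as discussed above.</p>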
<p>One of the critical aspects of FIS models is their ability to handle uncertainties and vagueness in the input data. This is particularly important in the context of flood risk assessment, where data can be scarce or subject to significant measurement errors. By using fuzzy sets and membership functions, FIS models can accommodate these uncertainties, providing more reliable and robust estimates of flood risk.</p>
<p>Sensitivity analysis is a valuable tool for evaluating the performance of FIS models in flood risk assessment. By varying input variables within their expected range and studying the corresponding changes in flood risk, modelers can gain insight into the behavior of the model and identify any unrealistic responses or potential areas for improvement.</p>
<p>FIS models can be further enhanced by incorporating optimization techniques to identify the most critical factors contributing to flood risk. This can help decision-makers focus on specific areas or interventions that have the most significant impact on reducing flood risk and improving overall resilience.</p>
<p>One of the challenges in applying FIS models for flood risk assessment is the lack of observed data for model validation. In such cases, the performance of the model can be evaluated using expert knowledge, sensitivity analysis, comparison with other models, or the use of simulated data from hydrological or hydraulic models.</p>
<p>In conclusion, FIS provides a promising approach for flood risk assessment, offering a flexible and robust framework for modeling complex relationships and handling uncertainties. By incorporating expert knowledge and quantitative data, FIS models have the potential to significantly improve our understanding of flood risk and support more effective decision-making in disaster management and urban planning.</p>



<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Data Science</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment.html</guid>
  <pubDate>Sat, 15 Apr 2023 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20230415-fuzzy-inference-system-fis-for-flood-risk-assessment-01.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Visualising the WRF output</title>
  <dc:creator>Benny Istanto</dc:creator>
  <link>https://benny.istan.to/site/blog/20230401-visualising-the-wrf-output.html</link>
  <description><![CDATA[ 





<p><strong>1 Introduction</strong></p>
<p>The Weather Research and Forecasting (WRF) model is a powerful numerical weather prediction system used to simulate atmospheric phenomena at various scales. WRF produces a significant amount of output data that can provide valuable information for meteorological and climatological research, weather forecasting, and environmental management.</p>
<p>WRF output data is typically stored in netCDF files, which contain multiple variables with different units and dimensions. However, the netCDF files generated by WRF do not always follow the Climate and Forecast (CF) metadata convention, which can make it difficult to interpret the data.</p>
<p>The CF metadata convention provides a standardized way of describing the metadata and units of variables in netCDF files. Adhering to this convention makes it easier to interpret the data and compare it with other datasets. However, the WRF output files are often not fully CF-compliant.</p>
<p>Visualizing the output from the WRF model can help researchers and practitioners to gain insights into various meteorological and climatological phenomena, including temperature, wind, precipitation, cloud cover, and atmospheric pressure. Visualization is an important step in understanding the data and extracting meaningful insights. Visualization can help identify patterns, trends, and anomalies in the data that might not be apparent from raw numerical output.</p>
<ul>
<li>Time series plots are a common way to visualize the temporal evolution of variables such as temperature, precipitation, and wind speed. These plots can reveal patterns and trends in the data and help to identify anomalies or outliers.</li>
<li>Contour plots can be used to visualize the spatial distribution of variables such as temperature, pressure, and precipitation. These plots can show the magnitude and direction of the variables and help to identify patterns and features such as fronts, ridges, and troughs.</li>
<li>Maps are a common way to visualize the spatial distribution of variables over a region of interest. Maps can be used to display variables such as temperature, precipitation, wind speed, and cloud cover, and can help to identify patterns and features such as mountains, coastlines, and rivers.</li>
<li>Animations can be created from the WRF output data to visualize the temporal and spatial evolution of variables over a specific period. Animations can be useful for identifying trends, patterns, and anomalies in the data and for communicating the results to a wider audience.</li>
<li>3D visualizations can be used to represent the three-dimensional structure of atmospheric phenomena such as clouds and fronts. These visualizations can provide a more detailed and realistic representation of the data and help to identify features such as updrafts, downdrafts, and vortices.</li>
</ul>
<p>There are several tools available for visualizing WRF output, ranging from simple plotting libraries to more advanced graphical user interfaces.</p>
<ul>
<li>One of the most popular tools for visualizing WRF output is the NCAR Command Language (NCL). NCL is a programming language designed specifically for scientific data analysis and visualization. NCL provides a powerful set of tools for working with NetCDF files, including the ability to subset and manipulate data, generate contour plots and maps, and create animations.</li>
<li>Python is a popular programming language for data processing, analysis, and visualization. Python libraries such as Matplotlib, Cartopy, and Basemap can be used to create a wide range of visualizations from the WRF output data.</li>
<li>Xarray is another popular Python library that can be used to handle and visualize multi-dimensional datasets. Xarray provides a powerful set of functions for data manipulation, analysis, and visualization and can be used to create a variety of plots and maps.</li>
<li>R is a statistical programming language that can also be used for data processing, analysis, and visualization. R provides a wide range of packages for creating static and interactive visualizations, including ggplot2, lattice, and Shiny.</li>
<li>Commercial software packages like ArcGIS and open-source Geographic Information System (GIS) software such as QGIS can be used to visualize WRF output data on maps and perform spatial analysis. QGIS provides a user-friendly interface for creating maps, visualizing data, and conducting geospatial analysis.</li>
</ul>
<p>Regardless of the tool used, there are some key considerations to keep in mind when visualizing WRF output. These include choosing appropriate color maps, selecting appropriate contour intervals, and ensuring that the data is presented in a clear and understandable way.</p>
<p>It is also important to consider the spatial and temporal scales of the data when visualizing WRF output. For example, high-resolution data may require different visualization techniques than coarser resolution data, and data spanning multiple time scales may require animations or time series plots.</p>
<p>Visualization of WRF output is not only important for understanding the data but also for communicating results to stakeholders and decision-makers. Effective visualization can help convey complex scientific information in a way that is easily understood by non-experts.</p>
<p>Overall, visualization is a crucial step in the WRF modeling process, enabling scientists and researchers to extract meaningful insights from the vast amounts of data generated by the model. While there are many tools and techniques available for visualizing WRF output, it is important to choose the most appropriate tool for the specific task at hand and to consider best practices for data visualization to ensure clear and accurate representation of the data.</p>
<p><strong>2 New CF based NetCDF files from native WRFOUT NetCDF files</strong></p>
<p>The WRFOUT files created by the WRF model are not the most straightforward to interpret, and pose several challenges. These files include a series of three-dimensional fields that cover a specific region over a specific period of time, which can be used to analyze different meteorological phenomena, such as temperature, pressure, wind speed and direction, clouds, and precipitation.</p>
<p>However, WRF output data does not always follow the CF convention, especially when dealing with large datasets. This makes it difficult to interpret and analyze the data in its raw format, without the ability to visualize it. Additionally, the files can be very large and contain many variables related to the WRF simulation that may not be necessary for a research project.</p>
<p>Modifying the WRF registry can help to address some of these issues by allowing the user to change the variables included in the WRFOUT files. This can be a tedious and potentially messy process though, as some variables may be included in WRFOUT for reasons that the user may never know. As such, it is important to consider the potential unintended consequences of removing certain variables. Furthermore, the variables will still be on the staggered grid and on the model vertical levels, so visualization can still be challenging.</p>
<p>To address this issue, users can use the NCL to translate the netCDF output into a format that follows the CF convention. This can help to standardize the metadata and units of the variables, making it easier to interpret and compare the data.</p>
<p><code>wrfout_to_cf</code> is an NCL-based script designed to create CF-compliant NetCDF files with user-selectable variables, time reference, vertical levels, and spatial and temporal subsetting. This script is designed to be a simple, user-flexible post-processing utility and is particularly useful for research projects because it produces output files that are more convenient to work with.</p>
<p>However, <code>wrfout_to_cf</code> can be a relatively inefficient post-processing utility and may not be suitable for all applications. It is important to weigh the pros and cons of each post-processing utility to determine which one is the best fit for any given research project.</p>
<ol type="1">
<li>After running the simulation, WRF will produce three output files; you can check them by running a command such as <code>ls -l wrfout*</code> in your simulation folder.</li>
</ol>
<p>It will return:</p>
<p><a href="../assets/image-blog/20230401-visualising-the-wrf-output-01.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://benny.istan.to/site/assets/image-blog/20230401-visualising-the-wrf-output-01.jpg" class="img-fluid"></a></p>
<p>Download wrfout_to_cf.ncl from <a href="https://sundowner.colorado.edu/wrfout_to_cf/wrfout_to_cf.ncl" class="uri">https://sundowner.colorado.edu/wrfout_to_cf/wrfout_to_cf.ncl</a> and put it in the same folder as the WRFOUT files. If you inspect one of the output files with <code>ncdump -h</code>, you will see the netCDF structure, but it is not easy to understand.</p>
<p><a href="../assets/image-blog/20230401-visualising-the-wrf-output-02.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://benny.istan.to/site/assets/image-blog/20230401-visualising-the-wrf-output-02.jpg" class="img-fluid"></a></p>
<p>Convert the native WRFOUT netCDF file to a new CF-based netCDF file with a command along the lines of <code>ncl 'file_in="wrfout_d01_&lt;date&gt;"' 'file_out="wrfpost.nc"' wrfout_to_cf.ncl</code>:</p>
<p><a href="../assets/image-blog/20230401-visualising-the-wrf-output-03.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://benny.istan.to/site/assets/image-blog/20230401-visualising-the-wrf-output-03.jpg" class="img-fluid"></a></p>
<p>You will find a new file <code>wrfpost.nc</code> in the folder. Let’s check it using the <code>ncdump -h wrfpost.nc</code> command:</p>
<p><a href="../assets/image-blog/20230401-visualising-the-wrf-output-04.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://benny.istan.to/site/assets/image-blog/20230401-visualising-the-wrf-output-04.jpg" class="img-fluid"></a></p>
<p>Once <code>wrfpost.nc</code> is in place, you are ready to visualize it using various tools.</p>
<p><strong>3 Visualizing the Wind</strong></p>
<p>With three days of hourly data, there are several ways to visualize wind speed and direction to best illustrate the information.</p>
<p>Let’s start writing the code using Python.</p>
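<p>A sketch of that setup is below, with synthetic arrays standing in for the contents of <code>wrfpost.nc</code>. The names <code>u10</code>/<code>v10</code> and the array shapes are illustrative; in practice you would read the actual variables with, e.g., <code>xarray.open_dataset("wrfpost.nc")</code>:</p>

```python
import numpy as np

# Synthetic stand-in for the 10 m wind components in wrfpost.nc; in practice
# read them with xarray, e.g. ds = xr.open_dataset("wrfpost.nc")
# (variable names u10/v10 are illustrative, not necessarily the file's names)
rng = np.random.default_rng(0)
u10 = rng.normal(0, 5, size=(72, 20, 20))   # (time, lat, lon): 3 days, hourly
v10 = rng.normal(0, 5, size=(72, 20, 20))

# Wind speed is the magnitude of the (u, v) vector
wspd = np.hypot(u10, v10)

# Meteorological wind direction: degrees clockwise from north, the direction
# the wind blows FROM (0 = northerly, 90 = easterly, ...)
wdir = (270.0 - np.degrees(np.arctan2(v10, u10))) % 360.0
```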
<p>After importing the library and defining the data, we can start visualizing the wind data.</p>
<p>Here are some options:</p>
<p><strong>3.1 Time series plot</strong></p>
<p>We can create a time series plot showing the variation of wind speed and direction over the three-day period. This type of plot is useful for identifying patterns or trends in the data over time. We can use Python libraries such as Matplotlib or Seaborn to create time series plots.</p>
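<p>A minimal time-series sketch on synthetic stand-in data follows (the post’s actual figures come from the notebook linked at the end); it plots the domain-averaged wind speed per hour:</p>

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Synthetic 3-day hourly wind field (replace with values read from wrfpost.nc)
rng = np.random.default_rng(1)
times = pd.date_range("2023-03-01", periods=72, freq="h")
u10 = rng.normal(2, 3, size=(72, 20, 20))
v10 = rng.normal(-1, 3, size=(72, 20, 20))
wspd = np.hypot(u10, v10)

# Domain-averaged wind speed per hour
mean_speed = wspd.mean(axis=(1, 2))

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(times, mean_speed)
ax.set_xlabel("Time (UTC)")
ax.set_ylabel("10 m wind speed (m/s)")
ax.set_title("Domain-averaged wind speed")
fig.autofmt_xdate()
fig.savefig("wind_timeseries.png")
```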
<p>The resulting plot is shown below.</p>
<p><a href="../assets/image-blog/20230401-visualising-the-wrf-output-05.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-5"><img src="https://benny.istan.to/site/assets/image-blog/20230401-visualising-the-wrf-output-05.jpg" class="img-fluid"></a></p>
<p>As an alternative, we can plot each grid cell as a separate line over time.</p>
<p>The resulting chart is shown below.</p>
<p><a href="../assets/image-blog/20230401-visualising-the-wrf-output-06.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-6"><img src="https://benny.istan.to/site/assets/image-blog/20230401-visualising-the-wrf-output-06.jpg" class="img-fluid"></a></p>
<p><strong>3.2 Wind rose plot</strong></p>
<p>A wind rose plot can be used to show the distribution of wind direction and speed over the three-day period. This type of plot is useful for identifying the prevailing wind direction and speed. You can use Python libraries such as Windrose or Matplotlib to create wind rose plots.</p>
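<p>A wind-rose sketch using only Matplotlib’s polar axes follows (the <code>windrose</code> package adds conveniences such as speed-stacked petals; here we simply count directions per sector on synthetic data):</p>

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Synthetic hourly winds at one grid point (replace with wrfpost.nc values)
rng = np.random.default_rng(2)
u = rng.normal(3, 2, size=72)
v = rng.normal(1, 2, size=72)
wdir = (270.0 - np.degrees(np.arctan2(v, u))) % 360.0

# Bin directions into 16 compass sectors and count occurrences
n_sectors = 16
edges = np.linspace(0, 360, n_sectors + 1)
counts, _ = np.histogram(wdir, bins=edges)

# Polar bar chart: 0 degrees at the top (north), clockwise like a compass
theta = np.deg2rad(edges[:-1] + 360 / n_sectors / 2)
fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.set_theta_zero_location("N")
ax.set_theta_direction(-1)
ax.bar(theta, counts, width=np.deg2rad(360 / n_sectors), edgecolor="black")
ax.set_title("Wind direction frequency (72 h)")
fig.savefig("wind_rose.png")
```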
<p>The resulting chart is shown below.</p>
<p><a href="../assets/image-blog/20230401-visualising-the-wrf-output-07.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-7"><img src="https://benny.istan.to/site/assets/image-blog/20230401-visualising-the-wrf-output-07.jpg" class="img-fluid"></a></p>
<p><strong>3.3 Contour plot</strong></p>
<p>A contour plot can be used to show the spatial distribution of wind speed and direction at a particular time during the three-day period. This type of plot is useful for identifying areas with high or low wind speed and direction. You can use Python libraries such as Matplotlib, Cartopy, or Basemap to create contour plots.</p>
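<p>A contour-plot sketch with plain Matplotlib on a synthetic snapshot follows; Cartopy or Basemap would add coastlines and a proper map projection. The domain coordinates are hypothetical:</p>

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Synthetic snapshot of 10 m winds on a lat/lon grid (replace with wrfpost.nc)
lon = np.linspace(106.0, 108.0, 30)   # hypothetical domain
lat = np.linspace(-7.0, -5.0, 30)
LON, LAT = np.meshgrid(lon, lat)
u10 = 5 * np.sin(np.deg2rad(LAT * 30))
v10 = 5 * np.cos(np.deg2rad(LON * 30))
wspd = np.hypot(u10, v10)

fig, ax = plt.subplots(figsize=(7, 6))
cf = ax.contourf(LON, LAT, wspd, levels=10, cmap="viridis")
fig.colorbar(cf, ax=ax, label="wind speed (m/s)")
# Overlay arrows to show direction (thinned for legibility)
step = 3
ax.quiver(LON[::step, ::step], LAT[::step, ::step],
          u10[::step, ::step], v10[::step, ::step], color="white")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("10 m wind speed and direction")
fig.savefig("wind_contour.png")
```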
<p>The resulting map is shown below.</p>
<p><a href="../assets/image-blog/20230401-visualising-the-wrf-output-08.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-8"><img src="https://benny.istan.to/site/assets/image-blog/20230401-visualising-the-wrf-output-08.jpg" class="img-fluid"></a></p>
<p><strong>3.4 Streamline plot</strong></p>
<p>A streamline plot can be used to show the wind flow at a particular time during the three-day period. This type of plot is useful for identifying how air moves over a geographic area. You can use Python libraries such as Matplotlib, Cartopy, or Basemap to create streamline plots.</p>
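<p>A streamline sketch with Matplotlib’s <code>streamplot</code> on a synthetic rotational flow follows (again, Cartopy would add map context; the domain coordinates are hypothetical):</p>

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Synthetic snapshot of 10 m winds (replace with values from wrfpost.nc)
lon = np.linspace(106.0, 108.0, 40)
lat = np.linspace(-7.0, -5.0, 40)
LON, LAT = np.meshgrid(lon, lat)
# A simple rotational flow around (107, -6) to make the streamlines interesting
u10 = -(LAT + 6.0) * 5
v10 = (LON - 107.0) * 5
wspd = np.hypot(u10, v10)

fig, ax = plt.subplots(figsize=(7, 6))
strm = ax.streamplot(LON, LAT, u10, v10, color=wspd, cmap="plasma")
fig.colorbar(strm.lines, ax=ax, label="wind speed (m/s)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("10 m wind streamlines")
fig.savefig("wind_streamlines.png")
```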
<p>The resulting map is shown below.</p>
<p><a href="../assets/image-blog/20230401-visualising-the-wrf-output-09.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-9"><img src="https://benny.istan.to/site/assets/image-blog/20230401-visualising-the-wrf-output-09.jpg" class="img-fluid"></a></p>
<p>Overall, the choice of visualization method will depend on the specific research question or application, and the intended audience. It may be useful to try different visualization methods and compare the results to determine which method is most effective for the task at hand.</p>
<p><strong>4 Notebook</strong></p>
<p>All of the above code is compiled into a notebook hosted here: <a href="https://gist.github.com/bennyistanto/3f7877b44eebaf0db5e37fa8e7b8603a" class="uri">https://gist.github.com/bennyistanto/3f7877b44eebaf0db5e37fa8e7b8603a</a></p>



<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>Remote Sensing</category>
  <category>Research</category>
  <category>Climate</category>
  <guid>https://benny.istan.to/site/blog/20230401-visualising-the-wrf-output.html</guid>
  <pubDate>Sat, 01 Apr 2023 00:00:00 GMT</pubDate>
  <media:content url="https://benny.istan.to/site/assets/image-blog/20230401-visualising-the-wrf-output-01.jpg" medium="image" type="image/jpeg"/>
</item>
</channel>
</rss>
