Experimental climatological rainfall zone

Remote Sensing

Research

Climate

Climatological rainfall zones represent distinct areas with different rainfall patterns

Author

Benny Istanto

Published

March 10, 2023

Climatological rainfall zones represent distinct areas with different rainfall patterns. They are characterized by specific precipitation behaviors, including factors such as the amount, frequency, and timing of rainfall. Understanding these zones is crucial for a variety of reasons:

Agriculture: Agriculture heavily relies on rainfall patterns. Knowing the climatological rainfall zones can help farmers and agricultural planners understand when to plant crops and which types of crops would be most suitable for a given area.
Water Resource Management: Rainfall contributes significantly to freshwater resources. Understanding the rainfall patterns can aid in planning and managing these resources effectively.
Climate Change Studies: Changes in rainfall patterns can be an indicator of larger climatic changes. Studying these zones over time can provide insights into climate change.
Disaster Planning: Areas prone to heavy rainfall could be at a higher risk of flooding. Knowledge of these zones can inform disaster planning and mitigation strategies.

The Python code provided below aids in the process of identifying climatological rainfall zones. The code does this through the following steps:

Preprocessing: The code first preprocesses the rainfall data to make it suitable for clustering. This includes standardizing the data and reducing its dimensionality using Principal Component Analysis (PCA).
Clustering: The code then applies a clustering algorithm (KMeans or Agglomerative Clustering) to the preprocessed data to identify distinct rainfall patterns, representing different climate zones. The optimal number of clusters is determined using the Calinski-Harabasz and Silhouette methods.
Assignment: Each location is then assigned to the climate zone of the nearest cluster centroid.
Visualization: Finally, the code visualizes the identified climate zones on a map, allowing for easy interpretation and application of the results.

In essence, the code facilitates the data-driven identification and visualization of climatological rainfall zones, providing valuable insights for various applications.

——————————————————————-

This Python code performs an analysis of climatological rainfall zones, with study case for Uganda, applying KMeans or Agglomerative Clustering to precipitation data sourced from CHIRPS (Climate Hazards Group InfraRed Precipitation with Station) for two time periods: 1981-2010 and 1991-2020. This data is contained in a CSV file that includes unique identifiers, longitude, latitude, and dates.

Here’s a summary of what the code does:

Import necessary libraries: The code begins by importing necessary libraries for data manipulation, clustering, standardization, calculating metrics, and visualization.
Choose the clustering method: The variable cluster_method is set to ‘KMeans’ by default, but can be changed to ‘AgglomerativeClustering’.
Load the precipitation data: The CSV data file is loaded into a pandas dataframe. By default, the code reads data for the period 1991-2020, but it can be switched to load the 1981-2010 data.
Data Transformation: The date columns in the dataframe are renamed and reformatted to a datetime object. The dataframe is then ‘melted’ to convert it into long format, which makes it easier to manage and analyze. The ‘date’ column is again converted into a datetime object.
Calculate monthly mean precipitation: The code then calculates the monthly mean precipitation for each location (defined by unique id, longitude, and latitude) by extracting the month from the ‘date’ column and using it to group the data. This monthly mean precipitation data is then rearranged into a pivot table format for further processing.
Data Visualization: Finally, the transformed dataframe is displayed for visual inspection.

This initial portion of the code is focused on loading and preparing the data for clustering analysis, which is performed in subsequent steps (not shown in the provided code). Depending on the chosen method, KMeans or Agglomerative Clustering is applied to this monthly mean precipitation data to classify the different climatological rainfall zones. The number of clusters can be a specific integer or determined using an optimal result generated by the Calinski-Harabasz or Silhouette method.

This section of the code is about data standardization and dimensionality reduction using Principal Component Analysis (PCA).

Here’s the step-by-step explanation:

Remove unnecessary columns: The code first creates a new dataframe X that drops the “id”, “lon”, and “lat” columns from the monthly_precip_df dataframe. This is done because clustering should be based on the rainfall data, not identifiers or coordinates.
Standardize the data: The data is then standardized using StandardScaler(), which scales the data to have a mean of 0 and a standard deviation of 1. This is a common requirement for many machine learning estimators, as they might behave badly if the individual features do not more or less look like standard normally distributed data.
Apply PCA: Next, PCA is applied to the scaled data to reduce its dimensionality. PCA(n_components=0.90) means that PCA will keep enough components to explain 90% of the variance in the data. This is a way to reduce the complexity of the model and avoid overfitting.
Fit and transform the data: The fit_transform() function fits the PCA model with the scaled data X_scaled and applies the dimensionality reduction on X_scaled.

The print statement ‘Done!’ indicates the successful completion of these steps. Now, the data is ready for the clustering step. The transformed data, X_pca, can be used as the input for the clustering algorithms. The PCA transformation is beneficial especially for visualization purposes, as it allows us to plot high-dimensional data in 2D or 3D space, and it can also improve the computational efficiency and performance of the clustering algorithm.

This function get_optimal_plot_calinski() calculates the optimal number of clusters for a given dataset X and a specified clustering method (KMeans or AgglomerativeClustering), and then visualizes the results. It does this using the Calinski-Harabasz criterion, which is a method for determining the optimal number of clusters. It operates on the principle that clusters should be compact and well separated.

Here’s what the code does, step-by-step:

Define the model: Depending on the cluster_method parameter, it sets the model to either KMeans or AgglomerativeClustering. If another value is passed, it raises a ValueError.
Define the compute_score function: This inner function creates a model with k clusters, fits the model to the data X, and returns the Calinski-Harabasz score. The Calinski-Harabasz score is a measure of cluster validity; higher scores indicate better clustering configurations.
Calculate scores for range of clusters: It then calculates the Calinski-Harabasz score for each number of clusters in the range from 2 to 20. These computations are performed in parallel to speed up the process, especially beneficial for large datasets.
Find the optimal number of clusters: The number of clusters (k) that yields the maximum Calinski-Harabasz score is identified as the optimal number of clusters.
Plot the scores: Finally, it visualizes these scores in a plot, where the x-axis represents the number of clusters and the y-axis represents the corresponding Calinski-Harabasz scores. The optimal number of clusters is marked with a vertical red dashed line, and its value is also displayed on the plot.
Return the optimal number of clusters: The function returns the optimal number of clusters as determined by the Calinski-Harabasz criterion.

This function is used to explore and determine the optimal number of clusters for the dataset, which can then be used in the actual clustering process. It’s an essential step in unsupervised machine learning tasks like clustering, as deciding on the number of clusters can often be non-trivial.

This function get_optimal_plot_silhouette() is designed to compute the optimal number of clusters for a given dataset X using either the KMeans or Agglomerative Clustering method, using the silhouette score as a measure of cluster quality. It also produces a plot of the silhouette scores as a function of the number of clusters.

Here’s a step-by-step breakdown of what the code does:

Define the compute_score function: The function compute_score is defined to calculate the silhouette score for a given number of clusters. The silhouette score measures the quality of a clustering. A higher silhouette score indicates that the instances in the same cluster are similar to each other and different from the instances in other clusters.
Calculate scores for range of clusters: It then calculates the silhouette score for each number of clusters in the range from 2 to 20. These computations are performed in parallel to speed up the process, which can be especially beneficial for large datasets.
Find the optimal number of clusters: The number of clusters (k) that yields the maximum silhouette score is identified as the optimal number of clusters.
Plot the scores: The silhouette scores are then plotted against the number of clusters. The optimal number of clusters is marked with a vertical red dashed line, and its value is also displayed on the plot.
Return the optimal number of clusters: Finally, the function returns the optimal number of clusters as determined by the silhouette score.

The silhouette score is an alternative to the Calinski-Harabasz score for finding the optimal number of clusters in a dataset. It considers both the compactness of the clusters (how close the instances in the same cluster are) and the separation between the clusters (how far apart the clusters are). The optimal number of clusters is the one that maximizes the average silhouette score over all instances.

This part of the code calculates the optimal number of clusters for the PCA-transformed data X_pca using two different methods: the Calinski-Harabasz method and the Silhouette method. The clustering method is defined by the variable cluster_method.

Calinski-Harabasz: The function get_optimal_plot_calinski(X_pca, cluster_method) is called to calculate the optimal number of clusters using the Calinski-Harabasz method. This function computes the Calinski-Harabasz scores for different numbers of clusters, plots the scores as a function of the number of clusters, and returns the optimal number of clusters that yields the highest Calinski-Harabasz score. The optimal number of clusters is stored in the variable optimal_c.
Silhouette: Similarly, the function get_optimal_plot_silhouette(X_pca, cluster_method) is called to calculate the optimal number of clusters using the Silhouette method. This function computes the Silhouette scores for different numbers of clusters, plots the scores as a function of the number of clusters, and returns the optimal number of clusters that yields the highest Silhouette score. The optimal number of clusters is stored in the variable optimal_s.

Then, it prints out the optimal number of clusters as determined by both the Calinski-Harabasz and Silhouette methods. The ‘Done!’ print statement indicates the successful completion of these steps.

This section of the code is crucial as it determines the most suitable number of clusters for the data, which is a key parameter for clustering algorithms. The Calinski-Harabasz and Silhouette methods are two popular methods for determining this optimal number, and comparing their results can provide additional validation for the chosen number of clusters.

The cluster_data() function takes in the input data array X_pca, the chosen clustering method cluster_method, and an optional number of clusters n_clusters. It performs clustering on the data and returns the cluster labels.

Here’s a step-by-step breakdown of what the code does:

Check for optimal cluster number: If n_clusters is set to 'optimal_c' or 'optimal_s', the function calls the previously defined functions get_optimal_plot_calinski() or get_optimal_plot_silhouette(), respectively, to compute the optimal number of clusters. It does this for the specified clustering method, either 'KMeans' or 'AgglomerativeClustering'.
Raise an error for invalid input: If n_clusters is not one of the allowed strings and it is not an integer, the function raises a ValueError.
Create the cluster model: Depending on the value of cluster_method, the function creates a KMeans or AgglomerativeClustering model with the specified number of clusters.
Fit the model to the data: The function then fits the clustering model to the input data. It also provides a progress bar to track the process.
Retrieve the cluster labels: After the model has been fitted, the function retrieves the cluster labels, which indicate the cluster to which each data point has been assigned.
Return the labels and number of clusters: Finally, the function returns the cluster labels and the number of clusters used in the clustering model.

By encapsulating the clustering process into a function, the code allows for easy and repeatable clustering of the data using different methods and numbers of clusters. The function also handles the computation of the optimal number of clusters, making it easy to compare the results of different clustering approaches.

This section of the code conducts clustering of the data with the specified clustering method and the defined number of clusters (14 in this case). Once the clusters are determined, it assigns a climatic zone to each row based on the nearest centroid. Here’s a step-by-step breakdown:

Perform Clustering: The function cluster_data(X_pca, cluster_method, n_clusters=14) is invoked to perform clustering on the data. The function returns the labels of the clusters and the number of clusters used, which are stored in labels and n_clusters respectively.
Calculate Centroids: A centroid is a point at the center of each cluster. It’s the mean position of all the points in a cluster. The code calculates these centroids for each cluster and stores them in the centroids array.
Calculate Euclidean Distances: The code then calculates the Euclidean distance between each data point (each row) and each of the cluster centroids. The Euclidean distance is a measure of the straight line distance between two points in a space. These distances are stored in the distances array.
Assign Closest Cluster: For each row in the dataset, the code assigns the cluster that is closest (has the smallest Euclidean distance) to it. This is achieved using the np.argmin() function, which returns the index of the smallest value along an axis. The assigned clusters are added as a new column climate_zone in the monthly_precip_df dataframe.
Grouping Data: The dataset is then grouped by id, lon (longitude), and lat (latitude), and for each group, the mode (the most frequently appearing value) of the climate_zone is calculated. This essentially assigns a single climate zone to each unique location based on the most frequent climate zone assigned to it over the timeseries. This grouped data is stored in monthly_precip_df_centroid.
Completion Message: Upon successful completion of these steps, a ‘Done!’ message is printed.

This section of the code allows each location (represented by a unique combination of id, lon, and lat) in the dataset to be assigned to a specific climatological rainfall zone based on the monthly timeseries precipitation data. This information can be useful for various climatological and environmental studies.

The function plot_climate_zone_map(climate_zone_csv, shapefile_path) is designed to generate a scatter plot of climate zones over a given geographical region, which could be a country or a continent, for instance. The plot utilizes longitude and latitude coordinates from a CSV file and a polygon shapefile to depict the geographical boundaries of the region of interest. Here’s a breakdown of the steps:

Load the Data: The function starts by loading a CSV file containing longitude (lon), latitude (lat), and climate zone (climate_zone) data.
Create a Scatter Plot: A scatter plot is generated using the longitude and latitude values as x and y coordinates, respectively. The climate zone data is used to color-code the points on the scatter plot.
Load the Polygon Shapefile: A polygon shapefile, which represents the geographical boundaries of the region of interest, is loaded using the geopandas library.
Plot the Polygon Shapefile: The loaded polygon shapefile is overlaid on the scatter plot to provide geographical context. The boundaries are shown as black lines.
Add a Colorbar: A colorbar is added to the plot to provide a reference for the color-coding of the climate zones.
Set the Title and Axis Labels: The plot is given a title, and the x and y axes are labeled as longitude and latitude, respectively. A footnote reference to the study providing the climatological rainfall zone is also included.
Display the Plot: Finally, the plot is displayed using plt.show().

This function provides a visual representation of the climate zones within a specific geographical region, which can help researchers and policymakers understand the spatial distribution of different climate characteristics based on rainfall patterns.

The code snippet provided is performing the following steps:

Save DataFrame to CSV: It saves the monthly_precip_df_centroid DataFrame, which contains the ‘id’, ‘lon’, ‘lat’, and ‘climate_zone’ columns, into a CSV file. This file will be saved in the location specified by the path string. The filename is constructed using the method of clustering and the number of clusters.
Print Confirmation Message: After saving the file, a confirmation message is printed stating “Save the output to csv completed”. A separator line is then printed for clarity.
Plot Climate Zone Map: The plot_climate_zone_map function is then called, which generates a scatter plot of climate zones over a given geographical region based on the CSV file saved in the previous step and a shapefile which represents the geographical boundaries.
Print Separator Line: Another separator line is printed for clarity.
Print Completion Message: Finally, a message is printed stating “Done!” to signify the end of the code execution.

Remember to ensure that the directory paths used in the code (“../csv/” and “../shapefiles/”) exist in your current working directory and contain the necessary files. If not, you will need to change these paths to the appropriate ones that are relevant to your working environment.