Bornean orangutan nest identification using computer vision and deep learning models to improve conservation strategies

Donna Simon; Keeyen Pang; Rayner Bili; Song-Quan Ong; Henry Bernard

doi:10.7717/peerj.20333

Donna Simon¹,², Keeyen Pang³, Rayner Bili⁴, Song-Quan Ong², Henry Bernard²

¹WWF-Malaysia, Kota Kinabalu, Sabah, Malaysia

²Institute for Tropical Biology and Conservation, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia

³Intrajasa Sdn. Bhd., Sandakan, Sabah, Malaysia

⁴Sabah Forestry Department, Sandakan, Sabah, Malaysia

Published: 2025-12-03
Accepted: 2025-10-13
Received: 2025-04-23

Academic Editor: Anthony Caravaggi

Subject Areas: Biodiversity, Computational Biology, Conservation Biology, Zoology, Data Mining and Machine Learning
Keywords: Pongo pygmaeus, Artificial intelligence, Population survey, Drone, Ecology

Copyright: © 2025 Simon et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

Cite this article: Simon D, Pang K, Bili R, Ong S, Bernard H. 2025. Bornean orangutan nest identification using computer vision and deep learning models to improve conservation strategies. PeerJ 13:e20333 https://doi.org/10.7717/peerj.20333

The authors have chosen to make the review history of this article public.

Abstract

Background

Regular population surveys are crucial for evaluating conservation measures and managing critically endangered species such as the Bornean orangutan. Uncrewed aerial vehicles (UAVs) are useful for monitoring orangutan populations because they capture images of the canopy, including nests. However, manually detecting and counting nests from UAV imagery is time-consuming and requires trained experts. Computer vision and deep learning (DL) models for image classification offer a promising alternative for orangutan nest identification.

Methods

This study investigated DL for nest recognition from UAV imagery. A binary dataset (“with nest” and “without nest”) was created from UAV imagery from Sabah, Malaysian Borneo. The images were captured using a fixed-wing UAV with a complementary metal-oxide semiconductor camera. After image augmentation, 1,624 images were used for the dataset and further split into 70% training, 15% testing and 15% validation for model performance evaluation, i.e., accuracy, precision, recall and F1-score. Four DL models (InceptionV3, MobileNetV2, VGG19 and Xception) were trained to learn from the labeled dataset and predict the presence of nests in new images.

Results

The results show that, of the four DL models, InceptionV3 performed best, with more than 99% accuracy and precision, while VGG19 performed worst. In addition, gradient-weighted class activation maps were used to interpret the results, allowing visualization of the regions used by InceptionV3 and VGG19 for classification. This study shows the potential of integrating DL into orangutan conservation, particularly for monitoring orangutan populations in protected environments. Future research should focus on the automatic detection of nests to improve UAV-based monitoring of orangutans.

Introduction

All three orangutan species (Sumatran orangutan, Pongo abelii; Bornean orangutan, Pongo pygmaeus; and Tapanuli orangutan, Pongo tapanuliensis), which occur on Borneo and Sumatra, have been listed as “Critically Endangered” on the International Union for Conservation of Nature (IUCN) Red List since 2016, as their populations have declined sharply (Ancrenaz et al., 2023). These population declines are primarily due to habitat loss, degradation and fragmentation, as well as retaliatory killings due to conflict with humans (Ancrenaz et al., 2023). In Sabah, Malaysian Borneo, several measures have been taken to protect orangutans, including the restoration of forests in degraded areas such as the Bukit Piton Forest Reserve (Mansourian, Vallauri & France, 2020), the expansion of fully protected areas to 30% and a commitment to sustainable timber production (Simon, Davies & Ancrenaz, 2019). In addition, the 10-Year Action Plan for Sabah’s Orangutans (2020–2029) has been developed to ensure the long-term survival of the species in the region (Sabah Wildlife Department, 2020). Continuous monitoring is crucial to evaluate population trends and assess the effectiveness of these conservation measures (Piel et al., 2022).

Orangutans are primarily found in lowland tropical rainforests (<1,000 m altitude), where they spend most of their time in the forest canopy (Manduell, Harrison & Thorpe, 2012). They construct new nests each day, with juveniles relying on their mothers to build them (Permana et al., 2024). These nests are used for both night-time sleeping and daytime resting (Casteren et al., 2012). Since observing orangutans directly is difficult due to the dense canopy and their elusive nature, researchers often monitor populations by counting nests, which serve as reliable indicators of their presence (Kühl et al., 2008; Santika et al., 2019). Population estimates are derived from nest densities (nests per km²), which are converted into orangutan numbers using established statistical methods (Ancrenaz et al., 2005; Kühl et al., 2008; Pandong et al., 2018).

Orangutan nests are distinct from those of other animals. Orangutans typically build their nests in the upper canopy, around 11–20 m above the ground (Casteren et al., 2012), and the nests are about 100 cm wide to accommodate their large body size (Kamaruszaman et al., 2018). The nest’s base is made from thick branches, with thinner branches twisted and bent but not fully broken. This partial break, known as a “greenstick fracture,” is unique to orangutan nests (Casteren et al., 2012). Leaves are added to form a flat sleeping platform. Orangutan nests are usually oval and asymmetrical, with the long axis oriented towards the tree trunk (Biddle, Deeming & Goodman, 2014). While most nests are built in the upper canopy, they can also be found at branch ends or close to the main tree stem (Rayadin & Saitoh, 2009).

Orangutan population size and density are usually estimated using two indirect nest counting methods: Standing Crop Nest Count (SCNC) and Marked Nest Count (MNC) (Spehar et al., 2010). SCNC involves conducting a single survey along transects to count visible nests and estimate orangutan density using the formula D = d/(p × r × t), where D is the orangutan density, d is the nest density, p is the proportion of nest builders, r is the nest production rate and t is the nest decay rate (Ancrenaz et al., 2005). While SCNC is efficient and integrates data over time, its accuracy is highly dependent on the estimation of t, which can vary greatly and lead to unreliable results if not measured correctly. In contrast, MNC avoids the need for a decay rate by tracking newly built nests between repeated surveys, making it better suited to detecting population trends over time. However, MNC is more labor intensive because the nests must be counted repeatedly.
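
The SCNC formula above can be illustrated with a short calculation. The sketch below is only a worked example: the function name and all input values are hypothetical, and it assumes that t is expressed in days and d in nests per km², so that D comes out in individuals per km².

```python
def orangutan_density(d, p, r, t):
    """Return D = d / (p * r * t), the SCNC density estimate.

    d: nest density, p: proportion of nest builders,
    r: nest production rate, t: nest decay term (symbols as in the text).
    """
    for name, value in (("d", d), ("p", p), ("r", r), ("t", t)):
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    return d / (p * r * t)


# Hypothetical inputs for illustration only; these are not survey results.
example_density = orangutan_density(d=300.0, p=0.9, r=1.0, t=200.0)
print(f"Estimated density: {example_density:.3f} individuals per km^2")
```

Because t sits in the denominator, halving t doubles the estimate, which is why the text stresses that errors in t propagate directly into D.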

Various methods are used to count orangutan nests, including ground-based nest surveys (Pandong et al., 2018; Santika et al., 2019), helicopter surveys (Ancrenaz et al., 2005; Payne, 1988; Simon, Davies & Ancrenaz, 2019), and the latest technology involving uncrewed aerial vehicles (UAVs) or drones (Hanggito, 2020; Milne et al., 2021; Wich et al., 2015). Among these methods, drones are becoming increasingly important as they are relatively inexpensive compared to helicopters and can capture images or time-lapse video from the forest canopy, allowing many hard-to-access areas to be studied (Wich & Koh, 2018). In contrast to ground and helicopter surveys, where nests are detected through direct field observations, drone imagery requires careful examination of each image on a computer to identify nests. As nests decay, the fresh green foliage withers and turns brown, making them stand out more clearly against the surrounding green canopy in the images (Fig. 1). During manual nest identification, each nest is marked or labelled and then counted across all images. This allows researchers to calculate nest density, which can be used to estimate the orangutan population size.

Figure 1: Example drone image with orangutan nests circled in red.

DOI: 10.7717/peerj.20333/fig-1

To classify the images, it is important to consider the canopy classification perspective. Although nests made of branches and leaves can be distinguished from healthy trees as they decay over time (Casteren et al., 2012), a key challenge in using drone imagery to explore orangutan nests is that labeling nests from large volumes of image data still relies heavily on human experts, making the process tedious and time-consuming (Milne et al., 2021; Wich et al., 2015). Therefore, there is a need for an alternative method to identify nests from drone imagery that is as effective as, if not more effective than, human expertise for nest detection.

The integration of artificial intelligence (AI), in particular machine learning (ML), offers an alternative approach to improve the efficiency of orangutan nest detection in drone imagery. This study focuses on supervised learning, where algorithms are trained on labeled/annotated datasets to identify nests based on features such as color, texture, and shape in aerial imagery. Labeled data is essential for developing accurate models. Although ecological studies and ML models use similar evaluation measures such as accuracy and precision, their definitions differ: Ecological accuracy refers to the estimation of population values, while ML accuracy and precision are derived from confusion matrices reflecting true and false classifications (Lebovitz, Levina & Lifshitz-Assaf, 2021). Deep learning (DL), a subset of ML, has been demonstrated as a model for image classification (Pearse et al., 2021). Applications in ecology include identifying tree species or understanding orangutan nesting behavior by combining remote sensing data such as Light Detection and Ranging (LiDAR) with behavioral patterns (Davies et al., 2019).

DL models are well suited for image classification because their architecture uses multiple layers of neural networks consisting of perceptrons to model complex data (e.g., images with different colour channels) by learning features from images and making predictions (Smith et al., 2018). Further details on how DL models work can be found in Chatzilygeroudis, Hatzilygeroudis & Perikos (2021), Purwono et al. (2022), the protocol article by Isawasan et al. (2023) and Madhavan & Jones (2024). In image processing, DL is widely used for image classification and object detection in ecological studies, such as species identification, animal behavior classification and species diversity estimation from camera traps, video and audio recordings (Christin, Hervet & Lecomte, 2019). For primate studies, Guo et al. (2020) developed Tri-AI, an automatic recognition system that identifies 41 primates and four carnivores with 94% accuracy. In addition, Desai et al. (2023) developed an annotated database of apes in different poses, which enables object recognition for behavioral studies of apes in zoos.

Studies that use computational methods to detect and count orangutan nests remain limited. Nest building, a unique daily behavior of orangutans for sleeping, offers valuable data for ecological monitoring, and integrating DL techniques could enhance population monitoring efforts. Amran et al. (2023) initiated the use of ML, specifically a support vector machine (SVM), to classify objects in aerial images into branches, buildings and orangutan nests. Teguh et al. (2024) provided the most recent study (at the time of writing this manuscript) on orangutan nest detection using a DL model, You Only Look Once (YOLO) version 5, with 414 labelled orangutan nests, and achieved a precision of 0.973 and a recall of 0.949. Teguh et al. (2024) applied an object detection algorithm and demonstrated the effectiveness of a DL model, but this raises additional questions. For instance, YOLO typically identifies and classifies objects in a single step, but alternative classification algorithms may offer improved performance. As biologists and ecologists, it is crucial not to treat these tools as black boxes. This study focuses on interpreting the outputs to gain insight into how DL models ‘visualize’ image patterns and to identify the features used by neural network layers to classify tree canopy patterns as ‘with nest’ or ‘without nest.’ Understanding this process is essential for accurate ecological interpretation.

Therefore, this study aims to evaluate the effectiveness of different DL models in detecting orangutan nests from aerial images captured at two orangutan sites (Sepilok Virgin Jungle Reserve and Bukit Piton Forest Reserve) in Sabah, Malaysia. More importantly, this study visualizes the model layers to understand how the features and characteristics of orangutan nests are ‘learned’ by the models. Specifically, the aim of this study is to create a labelled dataset of drone images containing both the presence and absence of orangutan nests, and subsequently to develop and compare four DL models for detecting and predicting nest presence from drone images. Additionally, gradient-weighted class activation maps (Grad-CAM) are presented to visualize the activation region used by the models to distinguish orangutan nests from the tree canopy.

Materials & Methods

Study site

Drone surveys were conducted in Sepilok Virgin Jungle Reserve (VJR) (5.865092291902751, 117.94834681807893) and Bukit Piton Forest Reserve (FR) (5.110623189107781, 118.00774396297084) in Sabah, Malaysia (Fig. 2). Both reserves are under the management of the Sabah Forestry Department and are known habitats for orangutans. It is estimated that there are about 200 (Range: 100–300) orangutans in Sepilok (Ancrenaz et al., 2005) and 176 (Range: 119–261) orangutans in Bukit Piton (Simon, Davies & Ancrenaz, 2019). The Sepilok VJR covers an area of approximately 40 km² and is characterized by lowland dipterocarp and heath forests (Ball et al., 2023). The reserve has been designated as a protected area where logging is strictly prohibited to keep the forest canopy intact. In contrast, Bukit Piton FR, which consists mainly of dipterocarp lowland rainforest and is about 120 km² in size, is severely degraded due to heavy logging and forest fires in the past. In 2008, a large-scale project was initiated to restore the forest for orangutans, and the area was declared a protected forest in 2012. Since then, the forest has slowly regenerated, with fast-growing tree species being used by the orangutans for nesting just three years after planting (Mansourian, Vallauri & France, 2020).

Figure 2: Location of Sepilok Virgin Jungle Reserve and Bukit Piton Forest Reserve in the Malaysian state of Sabah, Northern Borneo.

DOI: 10.7717/peerj.20333/fig-2

Study duration

The Sepilok VJR survey was conducted in July 2015 and covered an area of approximately 0.5 km². A total of three flight missions were conducted to complete the survey yielding 1,720 images. The Bukit Piton FR survey was conducted in January 2016 and covered an area of approximately 0.5 km² resulting in 1,911 images. A total of four missions were flown to survey the area in Bukit Piton FR. All surveys were conducted in the morning on a sunny day (temperature 26−33 °C, relative humidity 73–80%).

Equipment

This study uses UAV imagery captured by a fixed-wing drone and compiled by ConservationDrones.org (https://conservationdrones.org/), which focuses on the use of low-cost drones for conservation applications by conservationists and researchers worldwide. The fixed-wing drone was equipped with an FX-71 body housing a Canon PowerShot S100 digital camera with an RGB CMOS sensor (12.1 megapixels, 1/1.7-inch CMOS sensor, a 24–120 mm f/2.0–5.9 lens and the DIGIC 5 image processor, Canon Malaysia). The drone was flown on a flight plan that kept it at least 100 m from the highest point, which was determined using the Digital Elevation Model (DEM). The waypoint map was created using WaypointMap software (https://www.waypointmap.com/), with 80% vertical and horizontal image overlap along the flight path and time-lapse images captured at 3-second intervals. The aim of this study is to compare DL classifier models used to solve a classification problem where the whole image serves as the target object (nest or non-nest). This is in contrast to another type of DL model, object detection models, where the algorithms usually have to both localize and classify objects in an image (two tasks, and therefore more computationally complex). Therefore, the dataset was created by combining images from both sites, which were annotated by six human experts from WWF, each with at least two years of experience in studying nests, and categorized into two binary classes, i.e., images with nests and images without nests. This binary classification is needed to train the model to determine whether an image contains an orangutan nest or not. The field study and the use of the drone for aerial images were conducted in 2014 with the permission of the Sabah Forestry Department under reference number (JPHTN/PP 100 − 22/4/K LT.11(44)).

Pre-processing of the data and categorization

The images of Sepilok VJR (1,720) and Bukit Piton FR (1,911) were mostly repeated and overlapping, as they were captured to create a combined map for the geospatial study. For this study, only one image was selected from each set of repeated time-lapse images. In addition, each entire image was classified as either “with nest” or “without nest”, in line with the image classification task. Images with multiple nests were therefore pre-processed by cropping out each nest and labeling the crop as “with nest”. The total number of drone images from Sepilok and Bukit Piton amounted to 406 images, which were divided into two classes, i.e., with nest (162 images) and without nest (244 images) (Table 1). The original image resolution was 4,000 × 3,000 px at 180 dpi and was normalized to 300 × 300 px for the development of the deep learning model.

Table 1:

Number of aerial images used in the dataset according to classes and after augmentation.

Classes	Original images	Augmented images	Train (70%): original (augmented)	Test (15%): original (augmented)	Predict (15%): original (augmented)
With nest	162	648	113 (452)	25 (100)	24 (96)
Without nest	244	976	171 (684)	37 (148)	36 (144)
Total	406	1,624	284 (1,136)	62 (248)	60 (240)

DOI: 10.7717/peerj.20333/table-1

Nests in the drone images were identified by six orangutan field specialists, each with more than two years of field experience in conducting ground and helicopter nest surveys. The identification of orangutan nests at the sites where the drone images were captured was also validated with ground survey data, which confirmed the presence of nests through direct observations. Then, the total number of images in each class was divided into three parts, also known as data splitting, with 70% of the total images used for training, 15% for validation and 15% for testing, i.e., a 70:15:15 ratio (Fig. 3). The ratio of data splitting is based on the amount of data used for training and evaluation, and reducing the size of the training dataset tends to result in a poorly performing model. Therefore, the standard of an international computer vision and DL competition (Fei-Fei, Deng & Li, 2009) was referenced, along with insights from previous studies (Khan & Ullah, 2022; Ong & Hamid, 2022). Data splitting enables the machine to use the training set to obtain the weights and biases for classification. The validation set helps the models generalize to new, unseen data and prevents over-fitting, while the testing set is used to assess the model’s performance. As the number of images was relatively small, each image was augmented by rotation at 0°, 90°, 180° and 270°, which increased the number of images by a factor of four (Ong et al., 2022; Chen et al., 2021) to a total of 1,624 images used for model development.
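
The splitting and rotation steps described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names, the random seed and the use of integer image identifiers are assumptions, and images are represented as nested lists of pixel rows rather than real image files.

```python
import random


def split_70_15_15(items, seed=0):
    """Shuffle one class and split it into three parts in a 70:15:15 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_first = round(0.70 * len(shuffled))
    n_second = round(0.15 * len(shuffled))
    return (shuffled[:n_first],
            shuffled[n_first:n_first + n_second],
            shuffled[n_first + n_second:])


def rotate90(image):
    """Rotate a 2D image (list of pixel rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def four_rotations(image):
    """Return the 0, 90, 180 and 270 degree versions of one image."""
    versions = [image]
    for _ in range(3):
        versions.append(rotate90(versions[-1]))
    return versions


# Split each class separately so that both classes appear in every subset.
# The identifiers stand in for the 162 "with nest" and 244 "without nest" images.
for label, count in (("with nest", 162), ("without nest", 244)):
    parts = split_70_15_15(range(count))
    print(label, [len(part) for part in parts])

# Rotation multiplies the dataset size by four: 406 originals give 1,624 images.
print(4 * (162 + 244))
```

Splitting before augmenting, as Table 1 suggests was done, keeps rotated copies of the same original image within a single subset, so that the evaluation images are not rotated duplicates of training images.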

Figure 3: Workflow from image processing and data splitting to performance measurement.

DOI: 10.7717/peerj.20333/fig-3

Models development

Model build-up

To develop the DL models, the convolutional blocks of the pre-trained convolutional neural networks (CNNs) were unfrozen for retraining, a process in which the weights and biases that the model learned from ImageNet are unlocked for a customized task (here, orangutan nest classification). This was done for four DL architectures (InceptionV3, MobileNetV2, VGG19 and Xception) to optimize them for the specific task of identifying nests from aerial images, as described in Ong et al. (2022). The models were trained and evaluated with the Keras DL framework on the Google Cloud Platform (https://cloud.google.com/) using an NVIDIA Tesla A100 Graphics Processing Unit (GPU). The models were trained with the Adaptive Moment Estimation (Adam) optimizer, which improves the stability and efficiency of the training process (Okewu, Misra & Lius, 2020). Three learning rates (0.01, 0.001 and 0.0001) were analyzed with a batch size of 32. The training process was set to 50 epochs, meaning that the model performed 50 complete iterations through the training dataset (Chatzilygeroudis, Hatzilygeroudis & Perikos, 2021). Increasing the number of epochs allows the model to refine its parameters and could improve its performance. After the models were developed, their performance was evaluated using four metrics: accuracy, precision, recall and F1-score (Table 2) (Kumar et al., 2020). In addition, the mean accuracy (number of correct predictions/total number of images) was compared between the models to test whether the four DL models differed significantly. The code used for model development is publicly available on GitHub at https://github.com/songguan26/Bornean-Orangutan-Nest-.

Table 2:

Calculation of evaluation metrics (Li et al., 2019).

Evaluation metric	Evaluation focus	Formula
Accuracy	The proportion of all classifications that were correct, whether positive or negative.	$\frac{TP + TN}{TP + TN + FP + FN}$
Precision	The proportion of all the model’s positive classifications that are actually positive.	$\frac{TP}{TP + FP}$
Recall	The proportion of all actual positives that were classified correctly as positives.	$\frac{TP}{TP + FN}$
F1-Score	The harmonic mean of the recall and precision values.	$\frac{2 \times p \times r}{p + r}$

DOI: 10.7717/peerj.20333/table-2

Notes:

TP: true positive
TN: true negative
FP: false positive
FN: false negative
p: precision
r: recall
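
The formulas in Table 2 translate directly into code. The sketch below is an illustrative helper rather than the authors' implementation; the function name and the confusion-matrix counts in the usage example are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 240-image evaluation set, for illustration only.
for name, value in classification_metrics(tp=95, tn=140, fp=4, fn=1).items():
    print(f"{name}: {value:.4f}")
```

In this setting a false positive is a canopy image wrongly labelled "with nest", which lowers precision, and a false negative is a missed nest, which lowers recall.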

Activation map to distinguish orangutan nests from aerial images

To gain further insight into how the neural network in the DL models recognizes the orangutan nest, Grad-CAM was used to visualize the area used by the neural network to classify the orangutan nest against a variety of normal tree canopy backgrounds. In general, one layer at a time was retrieved to extract low- and high-level features.
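
The core Grad-CAM computation can be sketched without a DL framework. In a real pipeline the feature maps and gradients come from a chosen convolutional layer of the trained network; here both are small hand-written toy arrays, and the function name is hypothetical. Each feature map is weighted by the spatial mean of its gradient, the weighted maps are summed, negative values are clipped to zero, and the result is scaled to the range 0 to 1.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one normalized heatmap.

    activations[k][i][j]: output of channel k of a convolutional layer.
    gradients[k][i][j]: gradient of the class score with respect to that output.
    """
    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, grad in zip(activations, gradients):
        # Channel weight: spatial average of the gradient.
        alpha = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * fmap[i][j]
    # ReLU keeps only regions that push the score towards the class.
    cam = [[max(0.0, value) for value in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: channel 0 supports the class, channel 1 argues against it.
toy_activations = [[[1.0, 0.0], [0.0, 0.0]],
                   [[0.0, 0.0], [0.0, 2.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]],
                 [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(toy_activations, toy_gradients))
```

The resulting low-resolution map is normally upsampled to the input image size and overlaid as a heatmap, with warm colors marking high values.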

Results

Model performance

Four DL models were trained, validated and tested on the image classification task of sorting UAV images into “with nest” and “without nest” categories. Figure 4 shows the performance of the four models in predicting the presence or absence of nests. VGG19 performed worse than the other models, while InceptionV3, MobileNetV2 and Xception ranked first, second and third, respectively. The Shapiro–Wilk test was performed to assess the normality of the accuracy values of the models across the three learning rates (LRs). The results are as follows: InceptionV3 (W = 0.75, p = 0.0000009), MobileNetV2 (W = 0.99, p = 0.99), VGG19 (W = 0.95, p = 0.566) and Xception (W = 0.89, p = 0.37). Based on these results, only the accuracy values of InceptionV3 are not normally distributed (p < 0.05). Therefore, a non-parametric test, the Kruskal–Wallis H-test, was used to compare the models based on their accuracy values across the three LRs. The result of the Kruskal–Wallis H-test shows no significant difference (at the threshold p < 0.05) in the accuracy of the four models at the three learning rates (H(3) = 6.751, p = 0.087). Additionally, as the accuracy values of all models except InceptionV3 (which had a very small p-value) are normally distributed, the model performance is presented in Fig. 4 using the mean value to better represent the data.
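
For readers unfamiliar with the Kruskal–Wallis H-test, the statistic can be computed from pooled ranks as sketched below. This is a standard-library illustration with a hypothetical function name and made-up accuracy values, so it does not reproduce the reported H(3) = 6.751. With four groups the statistic is compared against a chi-square distribution with 4 − 1 = 3 degrees of freedom; the p-value lookup is omitted here.

```python
def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)
    rank_of = {}   # average rank of each distinct value
    tie_term = 0   # sum of (c^3 - c) over runs of c tied values
    start = 0
    while start < n_total:
        end = start
        while end < n_total and pooled[end] == pooled[start]:
            end += 1
        # Tied values occupy ranks start+1 .. end; give each the mean rank.
        rank_of[pooled[start]] = (start + 1 + end) / 2
        run = end - start
        tie_term += run ** 3 - run
        start = end
    rank_sums = [sum(rank_of[value] for value in group) for group in groups]
    h = 12 / (n_total * (n_total + 1)) * sum(
        rs ** 2 / len(group) for rs, group in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Hypothetical accuracies of four models at three learning rates (not the study's data).
accuracies = [
    [0.99, 0.98, 0.97],   # model A
    [0.96, 0.95, 0.94],   # model B
    [0.93, 0.91, 0.90],   # model C
    [0.89, 0.85, 0.80],   # model D
]
print(f"H = {kruskal_wallis_h(accuracies):.3f}, df = {len(accuracies) - 1}")
```

Because the test works on ranks rather than raw values, it does not require the accuracy values to be normally distributed, which is why it was preferred once one model failed the normality check.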

To assess the generalization capabilities of the models, i.e., their ability to make accurate predictions on new data (Caro et al., 2022), the training validation accuracy (TVA) and training validation loss (TVL) of the models across three learning rates on the test set were evaluated and are presented in Table 3. The new data were the validation splits (15%, see ‘Materials & Methods’) that were never used in model development. Although the number of epochs was set to 50, early stopping was employed to prevent overfitting and underfitting (Cai et al., 2022), so that model computation halted once the validation accuracy stopped improving (the epochs reached are indicated on the x-axis). The results of TVA and TVL (Table 3) show that LR 0.001 generally achieves a balance between efficient training and robust generalization across the models. In contrast, LR 0.01 risks instability and overfitting, which occurs when the model fits the training data too closely and fails to generalize to new data (Charilaou & Battat, 2022). Meanwhile, LR 0.0001 results in slow or failed convergence and underfitting, as shown by the poor performance of the VGG19 model, which was incapable of learning the patterns in the training data (Jabbar & Khan, 2015).
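
The early stopping rule described above can be sketched as a small function that scans a validation accuracy history. This is an illustration only: the function name and the accuracy history are hypothetical, and the patience value (the number of consecutive epochs without improvement that is tolerated) is an assumption, since the text does not state it.

```python
def early_stopping_epoch(val_accuracy, patience=5):
    """Return the 1-based epoch at which training would stop.

    Training stops once validation accuracy has failed to exceed its best
    value for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, accuracy in enumerate(val_accuracy, start=1):
        if accuracy > best:
            best = accuracy
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch
    return len(val_accuracy)


# Hypothetical validation accuracies for a run capped at 50 epochs.
history = [0.80, 0.85, 0.90, 0.90, 0.89, 0.90, 0.88, 0.90]
print(early_stopping_epoch(history, patience=3))
```

Stopping on a plateau limits overfitting because training ends before the model continues to fit noise in the training set, and it also explains why different models and learning rates reach different final epochs.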

Figure 4: Overall performance of four models.
The bar chart represents the mean probability and the error bar represents the standard error (SE).

DOI: 10.7717/peerj.20333/fig-4

In addition, the confusion matrix for each model is shown in Table 4 to visualize how well the classification model works by comparing the correct and incorrect predictions made by the model with the actual labels. The confusion matrix in binary classification consists of four components: true positive (TP), when the model correctly predicts the positive class; true negative (TN), when the model correctly predicts the negative class; false positive (Type 1 error), when the model incorrectly predicts the positive class; and false negative (Type 2 error), when the model incorrectly predicts the negative class (Saito & Rehmsmeier, 2015). InceptionV3 at LR 0.01 and LR 0.001 and Xception at LR 0.0001 made all predictions correctly. InceptionV3 and MobileNetV2 at LR 0.0001, Xception at LR 0.01 and LR 0.001, and VGG19 at all LRs made Type 1 errors in nest prediction, while MobileNetV2 at LR 0.001 made a Type 2 error in nest prediction.

Identification and visualization of input features

Heatmaps illustrate which parts of an image the model considers important by highlighting them in warm colors such as yellow, orange and red. Due to the superior overall performance of InceptionV3, five convolutional layers of the InceptionV3 architecture covering the low- and high-level features were used to visualize how the neural network identified the orangutan nest. Table 5 shows some examples of the convolutional layers of InceptionV3 compared with the original image as seen by a human. The most common 2D convolutional layer, “Conv2d” (Khan, 2019), is used to visualize the region used by the model for classification. The heatmaps derived from Conv2d_89 and Conv2d_90 highlighted the corners of the images and faintly colored the nest itself. In contrast, the nest was emphasized in the Conv2d_91 and Conv2d_92 heatmaps. In addition, the upper right corner of the image was emphasized in the heatmap derived from Conv2d_93. Based on these results, the neural network was able to identify the features of the nest (edge, shape and texture), as reflected in the different intensities of warm color. As mentioned by LeCun, Bengio & Hinton (2015), there are blocks of low- and high-level feature extraction in InceptionV3. Figure 5 shows an example of the original image used to extract the features for classification.

Discussion

The increasing use of drones could be an excellent way to improve the monitoring and protection of orangutan populations (Burke et al., 2019). However, the enormous amount of data generated by UAV imagery, which needs to be identified and annotated by trained experts, poses a major challenge in terms of time and labor. Therefore, this study was conducted to evaluate the feasibility of using computer vision and DL to classify orangutan nests from UAV imagery.

This study focuses on image classification rather than object detection (Sharma, 2019). In particular, it supports the second stage of a two-stage object detection pipeline, which in this case is the identification of the orangutan nest. In two-stage detection, the first stage locates the object of interest (usually with the YOLO or SSD algorithm) and the second stage classifies it with a DL algorithm (the DL models examined in this study). Many data scientists and ML engineers have proposed using the YOLO algorithm alone, because it can solve both localization (detecting the position of the object of interest in an image) and classification in one step. In practice, however, detecting and classifying an orangutan nest in aerial images of tree canopies is very challenging (because tree canopies have very similar patterns) and requires a large number of aerial images as training data. The advantage of the DL models used in this study is therefore that they are computationally more efficient, because the neural network handles only the classification task rather than the two tasks of an object detection model. However, the output of the DL model is a label for each image (with nest or without nest), so the nests can only be counted by applying a separate counting algorithm to the images placed in the folder labeled as containing nests. The result of this orangutan nest recognition study is consistent with that of Chen et al. (2014), who integrated various AI methods, including ML, optimization algorithms and adaptive decision-making systems, to develop intelligent systems capable of performing complex orangutan nest detection tasks from UAV imagery. In addition, by using DL architectures that extract features from the images, the current study continues the work of Amran et al. (2023), who used hand-crafted feature extraction and multi-class classification with SVM for orangutan nests in Borneo. Although Teguh et al. (2024) used YOLOv5 and achieved a precision of 0.973 and a recall of 0.949 when recognizing orangutan nests in drone images, this study has shown that orangutan nest recognition can achieve higher accuracy and precision with lower computational power (by focusing only on the classification task). In addition, this study has shown that, unlike YOLO (a single-stage recognition algorithm), transfer learning (transferring the weights and biases learned in classifying ImageNet images to another classification task) also helps to overcome the data scarcity caused by a lack of sufficient training examples.

While counting nests from the ground is easier than locating and counting individual orangutans, drone surveys capture only a fraction of nests in aerial views. Nests under the canopy in dense forests are often missed, and fresh green nests or those in advanced decay stages are harder to detect in drone images (Andini et al., 2021). As a result, too few nest images may be available for model training. Despite their efficiency, UAV-based surveys often suffer from occlusion by dense tree canopies, which makes it difficult to detect nests that are clearly visible from the ground. In addition, variations in lighting conditions, camera angles and flight altitudes can lead to inconsistent image quality, which affects the accuracy of automatic detection models. In contrast, manual observation from the ground allows for more accurate inspection of nest features such as age, freshness and position, which are difficult to assess reliably from aerial imagery alone.
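
The division of labor described above (a first stage that proposes image regions, a second stage that labels each region as nest or no nest, and a separate counting step) can be illustrated with a minimal sketch. Everything in it is a stand-in chosen for illustration: `split_into_tiles` replaces a real detector such as YOLO or SSD with a fixed grid, and `toy_nest_classifier` is a hypothetical brightness rule that replaces the trained DL classifier. It is not the pipeline used in this study.

```python
from typing import Callable, List, Sequence

Tile = List[List[float]]  # a small grayscale patch, values in [0, 1]


def split_into_tiles(image: Sequence[Sequence[float]], tile_size: int) -> List[Tile]:
    """Stage 1 stand-in: cut the image into non-overlapping square tiles.

    A real pipeline would use a detector (e.g. YOLO or SSD) to propose regions;
    a fixed grid is used here only to keep the sketch self-contained.
    """
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tiles.append([list(row[left:left + tile_size])
                          for row in image[top:top + tile_size]])
    return tiles


def toy_nest_classifier(tile: Tile, threshold: float = 0.5) -> bool:
    """Stage 2 stand-in: hypothetical rule replacing the trained DL classifier."""
    pixels = [value for row in tile for value in row]
    return sum(pixels) / len(pixels) > threshold


def count_nests(image, tile_size: int, classifier: Callable[[Tile], bool]) -> int:
    """Label each tile as nest / no nest, then count the positive labels."""
    labels = [classifier(tile) for tile in split_into_tiles(image, tile_size)]
    return sum(labels)


# 4x4 toy "image": only the top-left 2x2 tile is bright enough to be labeled a nest.
image = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.9, 0.0, 0.1],
    [0.1, 0.2, 0.3, 0.1],
    [0.0, 0.1, 0.2, 0.4],
]
print(count_nests(image, tile_size=2, classifier=toy_nest_classifier))  # prints 1
```

Because the classifier is passed in as a function, the counting step stays the same whichever classification model produces the labels.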

So far, this study is the first to compare four state-of-the-art pre-trained DL models (InceptionV3, MobileNetV2, VGG19 and Xception). The data were further augmented and the hyperparameters were tuned during training for nest recognition from UAV imagery, resulting in high accuracies (>96%). The model performance is in line with Ong & Hamid (2022) and Ong et al. (2022), in that InceptionV3 is the best model for this task, while VGG19 performs the worst. Among the three learning rates (LR) compared, an LR of 0.001 achieved the best performance, with fewer problems related to overfitting and underfitting. InceptionV3 with an LR of 0.001 performed well and delivered all correct predictions.
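
The comparison described above covers four models at three learning rates, i.e., twelve training runs. The sketch below enumerates that grid and shows one possible way to choose among runs. The selection rule in `pick_best` (highest validation accuracy among runs whose training/validation gap stays below a threshold) is an assumption made for illustration, not the criterion used in this study, and the numbers in `placeholder_results` are invented placeholders, not results.

```python
from itertools import product

MODELS = ["InceptionV3", "MobileNetV2", "VGG19", "Xception"]
LEARNING_RATES = [0.01, 0.001, 0.0001]


def experiment_grid(models, learning_rates):
    """Return every (model, learning rate) pair to be trained and evaluated."""
    return list(product(models, learning_rates))


def pick_best(results, max_gap=0.05):
    """Choose the run with the highest validation accuracy among runs whose
    train/validation gap stays within max_gap (a crude overfitting screen).
    If no run passes the screen, fall back to all runs."""
    stable = {cfg: r for cfg, r in results.items()
              if abs(r["train_acc"] - r["val_acc"]) <= max_gap}
    candidates = stable or results
    return max(candidates, key=lambda cfg: candidates[cfg]["val_acc"])


grid = experiment_grid(MODELS, LEARNING_RATES)
print(len(grid))  # 12 configurations

# Placeholder numbers for illustration only; NOT results from this study.
placeholder_results = {
    ("ModelA", 0.01): {"train_acc": 0.99, "val_acc": 0.85},
    ("ModelA", 0.001): {"train_acc": 0.97, "val_acc": 0.95},
    ("ModelB", 0.001): {"train_acc": 0.90, "val_acc": 0.89},
}
print(pick_best(placeholder_results))
```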

It is worth noting that VGG19 performed the worst in this study, in contrast to other studies, which showed that VGG19 performs better than InceptionV3 and MobileNet. A look at the layouts of VGG19 (Table 6) compared to InceptionV3 (Table 5) shows that VGG19 is not able to recognize the features of the orangutan nest, which could be the main reason for its poor performance. Nevertheless, there are previous studies that also show that VGG19 performs poorly. This emphasizes the need to compare DL models for each specific task.

Figure 5: When processing the input aerial image of the orangutan nest, the edge of the nest (Conv2d_89 and 90) and the texture of the nest (Conv2d_91 to 93) are recognized step by step.

DOI: 10.7717/peerj.20333/fig-5

To interpret the result of the computer vision system for the orangutan nest, the layers of the architecture were visualized with Grad-CAM, which to our knowledge is also the first such report. With Grad-CAM, the image regions that the DL architecture weighted most strongly highlighted the shape and texture of the orangutan nest, which were later used in the classification block for classification. Considering the similarity of the present study to the task of classifying the canopy of a forest, the result of this study is consistent with that of Nezami et al. (2020), who used a multilayer perceptron (MLP) to classify tree species using aerial images generated from RGB and hyperspectral (HS) images and achieved an accuracy of 99.6% with the best 3D CNN classifier. Moreover, the result of this study in classifying tree canopy with and without orangutan nests is consistent with that of Huang et al. (2023), who used ResNet, ConvNeXt, ViT and Swin Transformer and achieved at least 96% accuracy in classifying tree species from aerial images.
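
For readers unfamiliar with Grad-CAM, the core computation can be sketched in a few lines: each feature map of a convolutional layer is weighted by the average gradient of the class score with respect to that map, the weighted maps are summed, and negative values are clipped (ReLU). The sketch assumes that the feature maps and gradients have already been extracted from the network; the toy values below are hand-picked for illustration and do not come from the models in this study.

```python
def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heatmap from K feature maps and their gradients.

    feature_maps, gradients: lists of K two-dimensional lists with equal shape;
    gradients[k][i][j] is d(class score) / d(feature_maps[k][i][j]).
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weight: global average of the gradients over spatial positions.
    weights = [sum(sum(row) for row in grad) / (rows * cols) for grad in gradients]
    # Weighted sum of the feature maps, followed by ReLU.
    cam = [[max(0.0, sum(w * fmap[i][j] for w, fmap in zip(weights, feature_maps)))
            for j in range(cols)] for i in range(rows)]
    # Scale to [0, 1] for display; leave an all-zero map unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input: two 2x2 feature maps with hand-picked gradients (illustrative only).
feature_maps = [[[1.0, 0.0], [0.0, 2.0]],
                [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]],      # mean gradient = +1.0
             [[-1.0, -1.0], [-1.0, -1.0]]]  # mean gradient = -1.0
heatmap = grad_cam(feature_maps, gradients)
print(heatmap)  # [[0.5, 0.0], [0.0, 1.0]]
```

In the toy output, positions where the positively weighted map is active stay bright, while positions dominated by the negatively weighted map are clipped to zero. In practice the heatmap is upsampled to the input image size and overlaid on the aerial image.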

However, many aspects still require further investigation and improvement. One of these is the quality of aerial images. As mentioned by Huang et al. (2023), the degradation of image quality and the effect of capturing aerial images at different altitudes need to be explored further. The key question for future studies is which altitude achieves the ideal balance between drone flight feasibility and image quality. In this study, for example, a fixed-wing drone with a Canon PowerShot S100 RGB CMOS sensor was used, which was flown at the highest point of the treetops at an altitude of 100 m. The image quality could be improved by using a multi-rotor UAV with better camera control. Image quality could also be improved by flying at a lower altitude, where the camera is closer to the canopy and can capture more detail. However, this depends on the feasibility of the flight, because many factors determine the closest safe distance between the drone and the tree canopy, such as the availability of a crash sensor. With better image quality, further exploration can be conducted, such as classifying the decay stage of nests and increasing the ability to detect fresh green nests. Additionally, both the quantity and the diversity of aerial imagery need to be increased to improve the robustness and subsequent generalization of the model. The training data could also be diversified by including false positives and false negatives to further improve the generalization of the model. Another important consideration is the deployment of the model in the field for detecting and counting orangutan nests, to ensure its practical applicability. Additionally, a model built on a local or regional dataset often generalizes poorly to other similar datasets (e.g., when the DL model in this study is used to predict aerial images from Indonesia).
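
The trade-off between flight altitude and image detail can be made concrete with the ground sampling distance (GSD), the ground distance covered by one pixel of a downward-pointing camera. The sketch below uses the standard relation GSD = (sensor width × distance) / (focal length × image width). The camera parameters are assumptions for illustration only (nominal values for a compact camera with a 1/1.7-inch sensor at its widest focal length); the sensor size, focal length and image width actually used in this study are not reported in the text.

```python
def ground_sampling_distance_cm(distance_m, sensor_width_mm, focal_length_mm, image_width_px):
    """Ground distance covered by one pixel (cm/pixel) for a nadir-pointing camera.

    distance_m is the distance between the camera and the imaged surface
    (here, the top of the canopy).
    """
    gsd_m = (sensor_width_mm * distance_m) / (focal_length_mm * image_width_px)
    return gsd_m * 100.0


# ASSUMED camera parameters (not reported in the text).
SENSOR_WIDTH_MM = 7.44
FOCAL_LENGTH_MM = 5.2
IMAGE_WIDTH_PX = 4000

for distance in (100, 75, 50):
    gsd = ground_sampling_distance_cm(distance, SENSOR_WIDTH_MM, FOCAL_LENGTH_MM, IMAGE_WIDTH_PX)
    print(f"{distance} m above the canopy -> {gsd:.2f} cm/pixel")
```

Under these assumed parameters, GSD scales linearly with distance, so halving the distance to the canopy halves the ground footprint of each pixel. This is the gain in detail that has to be weighed against flight feasibility.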

This study presents a comparative analysis of deep learning models for the automatic detection of orangutan nests in aerial images. Although the results demonstrate the effectiveness of different deep learning models in classifying orangutan nests, we acknowledge several limitations that constrain the novelty and broader applicability of our work. First, classification and object recognition using deep learning are already well established in computer science and are applied in various fields, including wildlife monitoring. Although our study provides practical insight into the performance of the models, a comparison of existing models is inherently incremental rather than novel. Therefore, this work primarily contributes an assessment of the ability of the models to classify orangutan “nests” from tree canopies, rather than quantitative data about the species’ ecology. Second, while analyzing the importance of features provides some insight into how different models “see” and interpret orangutan nests, further investigations, such as linking these results to field-based nest features or forest structure, were beyond the scope of this study. Future work could investigate how the interpretability of the models relates to ecological context or how it influences conservation decisions.

Many future studies will aim to improve the model, software and hardware. However, it is vital to ensure that these improvements consistently contribute to orangutan conservation. Streamlining orangutan survey and monitoring processes to be more cost- and time-efficient, alongside leveraging computer vision and DL models for automatic annotation of orangutan nests from aerial images, could significantly advance orangutan monitoring efforts.

Conclusions

Our results show that InceptionV3 is able to classify the aerial images of orangutan nests with an accuracy and precision of 99%. The present study encourages further development of DL models for the automatic detection of orangutan nests from aerial UAV images. Further research and refinement in this area could lead to more time- and cost-efficient methods of identifying nests and thus monitoring the orangutan population. Nevertheless, additional datasets, especially from the different forest types used by orangutans, such as forest patches within plantations, timber plantations, and logged and unlogged forests, are crucial to improve the generalization of the model in the field. In the future, other remote sensing data, obtained for example through partnerships with other agencies, could be incorporated to provide more imagery and make significant improvements in this area.

Ethics statement

The drone was deployed in the primary protection forest where no residents lived, and only images of the canopy were collected, so there was no risk to people’s privacy. The field study and the use of the drone for aerial photography were conducted in 2014 with permission from the Sabah Forestry Department under reference number (JPHTN/PP 100-22/4/KLT.11(44)).

[1] Amran AA, On CK, Hung LP, Rossdy M, Simon D, See CS. 2023. Bornean orangutan nests classification using Multiclass SVM. In: 2023 IEEE Symposium on Computers & Informatics (ISCI). 1-6

[2] Ancrenaz M, Gimenez O, Ambu L, Ancrenaz K, Andau P, Goossens B, Payne J, Sawang A, Tuuga A, Lackman-Ancrenaz I. 2005. Aerial surveys give new estimates for orangutans in Sabah, Malaysia. PLOS Biology 3(1):e3, 0030-0037

[3] Ancrenaz M, Gumal M, Marshall AJ, Meijaard E, Wich S, Husson S. 2023. Pongo pygmaeus (amended version of 2016 assessment). The IUCN Red List of Threatened Species 2023: e.T17975A247631797.

[4] Andini R, Rahmi E, Rasnovi S, Moulana R. 2021. Nest characteristics of the sumatran orangutan (Pongo abelii) in the wildlife sanctuary soraya station in Aceh Province, Indonesia. Tropical Life Sciences Research 32(3):161

[5] Ball J, Hickman S, Jackson T, Jing KX, Hirst J, Jay WM, Coomes DA. 2023. Accurate delineation of individual tree crowns in tropical forests from aerial RGB imagery using mask r-cnn. Remote Sensing in Ecology and Conservation 9(5):641-655

[6] Biddle L, Deeming D, Goodman A. 2014. Morphology and biomechanics of the nests of the common blackbird Turdus merula. Bird Study 62(1):87-95

[7] Burke C, Rashman MF, Longmore SN, McAree O, Glover-Kapfer P, Ancrenaz M, Wich SA. 2019. Successful observation of orangutans in the wild with thermal-equipped drones. Journal of Unmanned Vehicle Systems 7(3):235-257

[8] Cai Y, Wang Z, Yao L, Lin T, Zhang J. 2022. Ensemble dilated convolutional neural network and its application in rotating machinery fault diagnosis. Computational Intelligence and Neuroscience 2022:1-14

[9] Caro MC, Huang HY, Cerezo M, Sharma K, Sornborger A, Cincio L, Coles PJ. 2022. Generalization in quantum machine learning from few training data. Nature Communications 13(1):4919

[10] Casteren AV, Sellers WI, Thorpe SK, Coward S, Crompton RH, Myatt JP, Ennos AR. 2012. Nest-building orangutans demonstrate engineering know-how to produce safe, comfortable beds. Proceedings of the National Academy of Sciences of the United States of America 109(18):6873-6877

[11] Charilaou P, Battat R. 2022. Machine learning models and over-fitting considerations. World Journal of Gastroenterology 28(5):605-607

[12] Chatzilygeroudis K, Hatzilygeroudis I, Perikos I. 2021. Machine learning basics. In: Intelligent computing for interactive system design: statistics, digital signal processing, and machine learning in practice (1st edition). New York, NY, USA: Association for Computing Machinery. 143-193

[13] Chen H, Guo S, Hao Y, Fang Y, Fang Z, Wu W, Li S. 2021. Auxiliary diagnosis for Covid-19 with deep transfer learning. Journal of Digital Imaging 34(2):231-241

[14] Chen Y, Shioi H, Montesinos CF, Koh LP, Wich S, Krause A. 2014. Active detection via adaptive submodularity. In: Proceedings of the 31st international conference on machine learning. 55-63

[15] Christin S, Hervet É, Lecomte N. 2019. Applications for deep learning in ecology. Methods in Ecology and Evolution 10(10):1632-1644

[16] Davies AB, Oram F, Ancrenaz M, Asner GP. 2019. Combining behavioral and LiDAR data to reveal relationships between canopy structure and orangutan nest site selection in disturbed forests. Biological Conservation 232:97-107

[17] Desai N, Bala P, Richardson R, Raper J, Zimmermann J, Hayden B. 2023. Open Ape Pose, a database of annotated ape photographs for pose estimation. elife 12:RP86873

[18] Fei-Fei L, Deng J, Li K. 2009. ImageNet: constructing a large-scale image database. Journal of Vision 9(8):1037

[19] Guo S, Xu P, Miao Q, Shao G, Chapman CA, Chen X, Li B. 2020. Automatic identification of individual primates with deep learning techniques. Iscience 23(8):1-9

[20] Hanggito MS. 2020. Development of an unmanned aerial vehicle-based orangutan population assessment and monitoring method for the multifunctional landscape of East Kalimantan, Indonesia Open Access Theses and Dissertations thesis

[21] Huang Y, Wen X, Gao Y, Zhang Y, Lin G. 2023. Tree species classification in UAV remote sensing images based on super-resolution reconstruction and deep learning. Remote Sensing 15(11):2942

[22] Isawasan P, Abdullah ZI, Ong SQ, Salleh KA. 2023. A protocol for developing a classification system of mosquitoes using transfer learning. MethodsX 10:101947

[23] Jabbar H, Khan RZ. 2015. Methods to avoid over-fitting and under-fitting in supervised machine learning (Comparative study). In: Computer science, communication and instrumentation devices, 70(10.3850). Singapore: Research Publishing Services (RPS). 978-981

[24] Kamaruszaman SA, Fadzly N, Mutalib AH, Muslim AM, Atmoko SSU, Mansor M, Mansor A, Rupert N, Zakaria R, Hashim ZH, Sah ASR, Jamsari FF, Azman NM. 2018. Measuring Orangutan nest structure using Unmanned Aerial Vehicle (UAV) and Image J. BioRxiv.

[25] Khan AA. 2019. What is Keras Conv2D. (accessed 03 June 2024)

[26] Khan MK, Ullah MO. 2022. Deep transfer learning inspired automatic insect pest recognition. In: Proceedings of the 3rd international conference on computational sciences and technologies. Jamshoro, Pakistan: Mehran University of Engineering and Technology. 17-19

[27] Kühl H, Maisels F, Ancrenaz M, Williamson EA. 2008. Best practice guidelines for surveys and monitoring of great ape populations occasional paper of the IUCN Species Survival Commission 36 (36).

[28] Kumar R, Arora R, Bansal V, Sahayasheela VJ, Buckchash H, Imran J, Raman B. 2020. Accurate prediction of COVID-19 using chest X-ray images through deep feature learning model with SMOTE and machine learning classifiers. MedRxiv.

[29] Lebovitz S, Levina N, Lifshitz-Assaf H. 2021. Is AI ground truth really true? The dangers of training and evaluating AI tools based on experts’ know-what. MIS Quarterly 45(3):1501-1526

[30] LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature 521(7553):436-444

[31] Li J, Xu Q, Shah N, Mackey TK. 2019. A machine learning approach for the detection and characterization of illicit drug dealers on instagram: model evaluation study. Journal of Medical Internet Research 21(6):e13803

[32] Madhavan S, Jones TM. 2024. (accessed 18 September 2024) data

[33] Manduell KL, Harrison ME, Thorpe SKS. 2012. Forest structure and support availability influence orangutan locomotion in Sumatra and Borneo. American Journal of Primatology 74(12):1128-1142

[34] Mansourian S, Vallauri D, France W. 2020. Lessons learnt from 12 years restoring the orangutan’s habitat: the Bukit Piton Forest Reserve in the Malaysian State of Sabah.

[35] Milne S, Martin JGA, Reynolds G, Vairappan CS, Slade EM, Brodie JF, Wich SA, Williamson N, Burslem DFRP. 2021. Drivers of Bornean orangutan distribution across a multiple-use tropical landscape. Remote Sensing 13(3):1-16

[36] Nezami S, Khoramshahi E, Nevalainen O, Pölönen I, Honkavaara E. 2020. Tree species classification of drone hyperspectral and RGB imagery with deep learning convolutional neural networks. Remote Sensing 12(7):1070

[37] Okewu E, Misra S, Lius FS. 2020. Parameter tuning using adaptive moment estimation in deep learning neural networks. In: International conference on computational science and its applications. Cham. Springer International Publishing. 261-272

[38] Ong SQ, Hamid SA. 2022. Next generation insect taxonomic classification by comparing different deep learning algorithms. PLOS ONE 17(12):e0279094

[39] Ong SQ, Nair G, Yusof UK, Ahmad H. 2022. Community-based mosquito surveillance: an automatic mosquito-on-human-skin recognition system with a deep learning algorithm. Pest Management Science 78(10):4092-4104

[40] Pandong J, Gumal M, Alen L, Sidu A, Ng S, Koh LP. 2018. Population estimates of Bornean orangutans using Bayesian analysis at the greater Batang Ai-Lanjak-Entimau landscape in Sarawak, Malaysia. Scientific Reports 10:1-11

[41] Payne J. 1988. Orang-utan Conservation in Sabah (Report No. 3754). WWF-Malaysia, Kuala Lumpur. 274

[42] Pearse GD, Watt MS, Soewarto J, Tan AY. 2021. Deep learning and phenology enhance large-scale tree species classification in aerial imagery during a biosecurity response. Remote Sensing 13(9):1789

[43] Permana AL, Permana JJ, Nellissen L, Prasetyo D, Wich SA, Schaik CPV, Schuppli C. 2024. The ontogeny of nest-building behaviour in Sumatran orang-utans, Pongo abelii. Animal Behaviour 211:53-67

[44] Piel AK, Crunchant A, Knot IE, Chalmers C, Fergus P, Mulero-Pazmany M, Wich SA. 2022. Non-invasive technologies for primate conservation in the 21st century. International Journal of Primatology 43:133-167

[45] Purwono P, Ma’arif A, Rahmaniar W, Fathurrahman HIK, Frisky AZK, Mul Haq Q. 2022. Understanding of convolutional neural network (CNN): a review. International Journal of Robotics and Control Systems 2(4):739-748

[46] Rayadin Y, Saitoh T. 2009. Individual variation in nest size and nest site features of the Bornean orangutans (Pongo pygmaeus). American Journal of Primatology 71(5):393-399

[47] Saito T, Rehmsmeier M. 2015. The precision–recall plot is more informative than the ROC Plot When evaluating binary classifiers on imbalanced datasets. PLOS ONE 10(3):e0118432

[48] Santika T, Wilson KA, Meijaard E, Ancrenaz M. 2019. The power of mixed survey methodologies for detecting decline of the Bornean orangutan. bioRxiv.

[49] Sharma P. 2019. Image classification vs. object detection vs. image segmentation. Analytics Vidhya. (accessed 14 October 2024)

[50] Sabah Wildlife Department. 2020. Orangutan action plan for Sabah 2020–2029 Kota Kinabalu, Sabah, Malaysia.

[51] Simon D, Davies G, Ancrenaz M. 2019. Changes to Sabah’s orangutan population in recent times: 2002–2017. PLOS ONE 14(7):1-14

[52] Smith J, Legg P, Matovic M, Kinsey K. 2018. Predicting user confidence during visual decision making. ACM Transactions on Interactive Intelligent Systems 8(2):1-30

[53] Spehar SN, Mathewson PD, Nuzuar, Wich SA, Marshall AJ, Kuehl H, Meijaard E. 2010. Estimating orangutan densities using the standing crop and marked nest count methods: lessons learned for conservation. Biotropica 42(6):748-757

[54] Teguh R, Maleh IMD, Sahay AS, Pratama MP, Simon O. 2024. Object detection of the Bornean orangutan nests using drone and YOLOv5. IAES International Journal of Artificial Intelligence 2252(8938):1641

[55] Wich S, Dellatore D, Houghton M, Ardi R, Koh LP. 2015. A preliminary assessment of using conservation drones for Sumatran orangutan (Pongo abelii) distribution and density. Journal of Unmanned Vehicle Systems 4(1):45-52

[56] Wich SA, Koh LP. 2018. Conservation drones: mapping and monitoring biodiversity. United Kingdom: Oxford University Press.

Block4_conv2	Block4_conv3	Block5_conv1	Block5_conv2	Block5_conv3

[Table: accuracy and loss panels for InceptionV3, MobileNetV2, VGG19 and Xception at learning rates of 0.01, 0.001 and 0.0001; panel images not reproduced.]