Fairness-enhancing classification methods for non-binary sensitive features—How to fairly detect leakages in water distribution systems
- Academic Editor
- Claudio Ardagna
- Subject Areas
- Algorithms and Analysis of Algorithms, Artificial Intelligence, Data Mining and Machine Learning, Spatial and Geographic Information Systems, Neural Networks
- Keywords
- Fairness, Machine learning, Fair machine learning, Disparate impact, Equal opportunity, Leakage detection, Water distribution systems
- Copyright
- © 2024 Strotherm et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
- Cite this article
- 2024. Fairness-enhancing classification methods for non-binary sensitive features—How to fairly detect leakages in water distribution systems. PeerJ Computer Science 10:e2317 https://doi.org/10.7717/peerj-cs.2317
Abstract
The fairness of artificial intelligence (AI)-based methodologies is an important area of research, especially when AI-supported decisions affect society. In this contribution, we investigate the application of AI to the socioeconomically relevant infrastructure of water distribution systems (WDSs). We propose an appropriate definition of protected groups in WDSs and generalized definitions of group fairness, applicable even to multiple non-binary sensitive features, that provably coincide with existing definitions for a single binary sensitive feature. We demonstrate that typical methods for the detection of leakages in WDSs are unfair in this sense. We therefore propose a general fairness-enhancing framework, as an extension of the specific leakage detection pipeline but also applicable to an arbitrary learning scheme, to increase the fairness of the AI-based algorithm. Finally, we evaluate and compare several specific instantiations of this framework on a toy and on a realistic WDS to show their utility.
Introduction
Due to the increasing use of artificial intelligence (AI)-based decision-making systems in socially relevant fields of application, the question of fair decision making has gained much importance in recent years (cf. Angwin et al., 2016; European Union, 2019). Fairness here relates to the (protected) groups or individuals that are affected by the algorithmic decision making and characterized by sensitive features such as gender or ethnicity. Most algorithms on which these tools are based rely on data that can be unintentionally biased with respect to fairness, resulting in skewed models. In addition, the algorithm itself can discriminate against protected groups or individuals, without explicitly aiming to do so, due to an undesirable algorithmic bias (cf. Mehrabi et al., 2021; Pessach & Shmueli, 2022). This gives rise to the question of how to define fairness and how to mitigate unfairness where it occurs in the context of machine learning (ML), i.e., in the context of data-driven algorithms.
Background: Fairness definitions Several definitions of fairness as well as approaches to achieve these fairness standards have been theoretically discussed and tested in practice (cf. Barocas, Hardt & Narayanan, 2019; Castelnovo et al., 2022; Dwork et al., 2012; Mehrabi et al., 2021; Pessach & Shmueli, 2022). From a legal perspective, one distinguishes between disparate treatment and disparate impact (DI) (cf. Barocas, Hardt & Narayanan, 2019). While disparate treatment occurs whenever a group or an individual is intentionally treated differently because of their membership in a protected group, disparate impact is a consequence of indirect discrimination happening despite “seemingly neutral policy” (cf. Pessach & Shmueli, 2022).
From a scientific viewpoint, the variety of fairness notions is much larger, and many popular approaches focus mainly on (binary) classification tasks (cf. Castelnovo et al., 2022; Mehrabi et al., 2021; Pessach & Shmueli, 2022). Different definitions can be grouped into the concepts of group fairness, individual fairness, causal fairness and dynamic fairness: Group fairness aims at treating different groups equally while individual fairness aims at treating similar individuals similarly. Causal fairness examines the extent to which the sensitive feature, such as gender or ethnicity, influences the prediction of a model and dynamic fairness examines the long-term effects of (supposedly) fair decisions (cf. Strotherm et al., 2023).
The fairness notions that we will discuss in this work belong to the first of these concepts, group fairness. Here, most works focus on fairness definitions with respect to a single binary sensitive feature that splits the underlying population into a discriminated and a privileged group (cf. Feldman et al., 2015; Hardt, Price & Srebro, 2016; Kamiran & Calders, 2009, 2010; Mehrabi et al., 2021; Pessach & Shmueli, 2022; Ruf & Detyniecki, 2021; Zafar et al., 2017a, 2017b). There is some work on fairness definitions based on the assumption of independence between the model’s prediction and a single non-binary sensitive feature; however, there is no rigorous theory on how this assumption translates to generalized fairness notions that are necessary and sufficient for this independence, nor on their relation to the binary case (cf. Agarwal et al., 2018; Castelnovo et al., 2022). We will build on this point.
Background: Fairness methods Besides the definition of fairness, the problem arises as to how to enhance fairness in well-known ML methods while maintaining a reasonable overall performance of the model. Approaches can hereby be grouped into three categories: Depending on when in the training pipeline the model is enhanced with respect to fairness, we speak about pre-, in- or post-processing techniques (cf. Barocas, Hardt & Narayanan, 2019; Mehrabi et al., 2021; Pessach & Shmueli, 2022).
Pre-processing methods usually modify the training data which is fed to the training algorithm. For example, Kamiran & Calders (2010) use a resampling technique by removing unpreferred samples, i.e., positive outcomes in the privileged group and negative outcomes in the discriminated group, and duplicating preferred samples, i.e., positive outcomes in the discriminated group and negative outcomes in the privileged group, that lie close to the decision boundary of a binary classifier. In another work, they modify the training data by changing the labels of training samples that lie close to the decision boundary of a binary classifier such that negative outcomes in the privileged group and positive outcomes in the discriminated group appear more often (cf. Kamiran & Calders, 2009). While these methods aim at putting more emphasis on the discriminated group and less emphasis on the privileged group, Feldman et al. (2015) modify the non-sensitive features of training samples such that the sensitive feature cannot be predicted from the non-sensitive ones. This reduces the chance that the model’s predictions, which are based on the non-sensitive features, are correlated with the sensitive feature.
In contrast, post-processing methods modify the model after the training. For example, Pleiss et al. (2017) modify a pre-trained model by randomly changing some outputs of a binary classifier on the group on which the classifier performs better to ensure equal performance over all groups. As another example, Hardt, Price & Srebro (2016) retrain a pre-trained model by optimizing a loss between the new and the pre-trained binary classifier while satisfying fairness-constraints. Another simple approach is to use group-specific thresholds for a threshold-based classifier (cf. Corbett-Davies et al., 2017).
Finally, in-process methods modify the (original) training algorithm. A common way to do so is by adding fairness-constraints (cf. Agarwal et al., 2018; Agarwal, Dudík & Wu, 2019; Calders et al., 2013; Komiyama et al., 2018; Narasimhan et al., 2020; Zafar et al., 2017a, 2017b) or a fairness-regularization-term (cf. Aghaei, Azizi & Vayanos, 2019; Berk et al., 2017; Pessach & Shmueli, 2022) to the loss function that is to be optimized. Next to classification, also regression tasks usually fall into this category (cf. Agarwal, Dudík & Wu, 2019; Aghaei, Azizi & Vayanos, 2019; Berk et al., 2017; Calders et al., 2013; Komiyama et al., 2018; Narasimhan et al., 2020). The methods presented in this work are also in-process methods and are extensions of our work, Strotherm & Hammer (2023), published in Springer’s Lecture Notes in Computer Science. Both of these works are based on the methods of Zafar et al. (2017b), but adapted to more generalized settings, as we will elaborate in the contributions paragraph.
Background: Fairness in water distribution systems (WDSs). The question of fairness becomes especially relevant when the decisions of an ML model impact socioeconomic infrastructure, such as WDSs. To the best of our knowledge, our previous work, Strotherm & Hammer (2023), was the first approach to introduce fairness within this domain. In that work, we address the important problem of leakage detection in WDSs and investigate to what extent typical models treat different groups of consumers of the WDS (un)equally; we extend these considerations in this work as outlined in the next paragraph. As this is an extended version, portions of this work were previously published in Strotherm & Hammer (2023).
Contributions. Our approaches to improve group fairness in such a domain of high social and ethical relevance are based on the idea of considering the locality in the WDS as a sensitive feature. All our proposed methods build either on the empirical covariance between the sensitive feature(s) and the model’s prediction, used as a proxy for the fairness measure similar to Zafar et al. (2017b), or on the generalized fairness notions directly. The advantage of our fairness-enhancing algorithms is that they can handle even multiple non-binary sensitive features and satisfy both the concept of disparate treatment and disparate impact simultaneously, which is an advantage over most fairness-enhancing algorithms (cf. Pessach & Shmueli, 2022; Zafar et al., 2017b). In more detail, our contributions (also in view of what this extension offers compared to our previous work, Strotherm & Hammer (2023)) are as follows:
- We propose group fairness definitions even for multiple non-binary sensitive features, which are generalizations of well-known corresponding fairness notions in the common setting of a binary classifier and a single binary sensitive feature. 
- As an extension to our previous work, we provide details on the mathematical concept of independence, derive easy-to-test independence criteria, and leverage these in order to derive those generalized group fairness definitions. Moreover, we prove that those coincide with the aforementioned well-known corresponding fairness notions. 
- We introduce a common leakage detection pipeline and propose a suitable definition of sensitive features and group fairness in the context of leakage detection in WDSs, with more detail in this work compared to our previous work. Subsequently, we present specific and already existing instantiations of this pipeline and show that common leakage detection methods do not obey these fairness criteria, with one more specific instantiation (based on the more powerful graph convolutional network (GCN) based virtual sensors instead of linear regression based virtual sensors) and with more detail in this work compared to our previous work.
- We introduce a fairness-enhancing leakage detection framework as an extension of the common leakage detection pipeline, with more detail in this work compared to our previous work. Subsequently, we present specific instantiations of this framework, among others by modifying the ideas of Zafar et al. (2017b) to any (ensemble) classification model instead of a convex margin-based binary classifier, to propose several fairness-enhancing methods. Compared to our previous work, we add further specific instantiations, based among others on the possible modifications of our methodologies suggested there.
- We provide an empirical evaluation of our proposed methods. As an extension to our previous work, next to the application of these methods to the toy WDS Hanoi, we investigate the application to the more complex and realistic WDS L-Town. 
Structure of the work. The rest of this work is structured as follows: In section “Group fairness in machine learning”, we introduce definitions of group fairness for multiple non-binary sensitive features, giving the mathematical background for the derivation of such generalized definitions and how they are connected to already existing definitions. Afterwards, in section “Leakage detection in water distribution systems”, we present a standard methodology to detect leakages in WDSs, introduce the meaning of sensitive features in this context and investigate whether the resulting model makes fair decisions with respect to the previously defined notions of fairness. Subsequently, in section “Fairness-enhancing leakage detection in water distribution networks”, we propose and evaluate several adaptations of this methodology that enhance fairness and provide empirical evidence for our theoretical findings regarding the equivalence of different fairness notions in this specific domain of application. Finally, our findings are summarized and discussed in section “Conclusion”.
Group fairness in machine learning
On an abstract level, the concept of group fairness is based on the mathematical concept of (conditional) independence of two random variables (cf. Barocas, Hardt & Narayanan, 2019; Castelnovo et al., 2022). Therefore, in this section, we will first investigate this concept of independence in general (subsection “Independence of two random variables”). Subsequently, we will introduce the mathematical notation required to define an ML task and its group fairness based on this general concept of independence (subsection “Mathematical notation for machine learning”) to be able to derive group fairness definitions in generalized ML tasks, which coincide with well-known definitions in more specific settings (subsection “Generalized notions of group fairness in machine learning”).
Independence of two random variables
As it is the main mathematical concept to characterize different notions of group fairness, for the sake of convenience, we recapitulate the concept of independence of two random variables in this subsection. Moreover, we target an easy, necessary and sufficient condition for this concept, which is particularly simple to apply and test in the context of fairness of ML models. Hence, we derive an equivalent formulation, lemma 2.2, which can be tested on canonical subsets of the full $\sigma$-fields.
For the rest of this subsection, let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space, $(\mathcal{X}, \mathcal{A}_{\mathcal{X}})$, $(\mathcal{Y}, \mathcal{A}_{\mathcal{Y}})$ measurable spaces and $X: \Omega \to \mathcal{X}$, $Y: \Omega \to \mathcal{Y}$ random variables.¹
Definition 2.1 (Independence of two random variables (cf. Bauer, 1996)). X and Y are independent with respect to the probability measure $\mathbb{P}$ if the $\sigma$-fields² $\sigma(X)$ and $\sigma(Y)$ generated by these random variables are independent with respect to $\mathbb{P}$.
Based on that, in Appendix A.2, we derive general necessary and sufficient conditions for independence of two random variables. In the context of fairness of ML models, we are usually interested in a more specific setting, namely the independence of two discrete random variables.
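Lemma 2.2 (Independence of two discrete random variables). Assume that X and Y are discrete, i.e., that $\mathcal{X}$ and $\mathcal{Y}$ are countable. Then X and Y are independent with respect to $\mathbb{P}$ if and only if (iff)
$$\mathbb{P}(X = x \mid Y = y) = \mathbb{P}(X = x) \tag{2.1}$$
holds for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ for which $\mathbb{P}(Y = y) > 0$ holds.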
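Proof. $\mathcal{E}_{\mathcal{X}} := \{\{x\} \mid x \in \mathcal{X}\} \cup \{\emptyset\}$ and $\mathcal{E}_{\mathcal{Y}} := \{\{y\} \mid y \in \mathcal{Y}\} \cup \{\emptyset\}$ are $\cap$-stable generators of the $\sigma$-fields $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}(\mathcal{Y})$, respectively. Therefore, by remark A.9 and lemma A.12 (we can replace the $\sigma$-fields $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}(\mathcal{Y})$ in lemma A.12 by their generators $\mathcal{E}_{\mathcal{X}}$ and $\mathcal{E}_{\mathcal{Y}}$, respectively), X and Y are independent iff
$$\mathbb{P}(X \in A \mid Y \in B) = \mathbb{P}(X \in A)$$
holds for all $A \in \mathcal{E}_{\mathcal{X}}$ and all $B \in \mathcal{E}_{\mathcal{Y}}$ for which $\mathbb{P}(Y \in B) > 0$ holds. Note that we can omit the cases $A = \emptyset$ and $B = \emptyset$, as these are trivially fulfilled.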
Lemma 2.2 guarantees independence of two discrete random variables by testing only single-element events. As the $\sigma$-fields $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}(\mathcal{Y})$, given by all subsets of $\mathcal{X}$ and $\mathcal{Y}$, respectively, contain many more non-trivial events, lemma 2.2 gives us a valuable necessary and sufficient condition for independence of two discrete random variables, which we will make use of in the setting of ML.
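To illustrate how the criterion of lemma 2.2 can be tested in practice, the following minimal Python sketch checks the singleton condition on a finite joint probability table. The function name `is_independent`, the dictionary representation of the joint distribution and the numerical tolerance are assumptions made for this illustration only.

```python
from itertools import product


def is_independent(joint, tol=1e-9):
    """Test P(X=x | Y=y) == P(X=x) on all singletons with P(Y=y) > 0.

    `joint` maps pairs (x, y) to P(X=x, Y=y); missing pairs count as 0.
    """
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginal distributions of X and Y.
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    for x, y in product(xs, ys):
        if p_y[y] <= 0:
            continue  # the condition is only required where P(Y=y) > 0
        conditional = joint.get((x, y), 0.0) / p_y[y]
        if abs(conditional - p_x[x]) > tol:
            return False
    return True


# A product distribution (independent by construction) ...
independent = {(x, y): px * py
               for x, px in {0: 0.3, 1: 0.7}.items()
               for y, py in {"a": 0.5, "b": 0.25, "c": 0.25}.items()}
# ... and a distribution in which Y determines X.
dependent = {(0, "a"): 0.5, (1, "b"): 0.5}

print(is_independent(independent))  # True
print(is_independent(dependent))    # False
```

Only $|\mathcal{X}| \cdot |\mathcal{Y}|$ singleton pairs are inspected, instead of all pairs of subsets of $\mathcal{X}$ and $\mathcal{Y}$.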
Mathematical notation for machine learning
Our next goal is the mathematical definition of group fairness as a formalization of equal treatment of an ML model independent of sensitive attributes. Such attributes, also called sensitive features, provide information about the membership or non-membership of a protected group, such as gender or ethnicity, to which the model should not exhibit any prejudice (cf. Mehrabi et al., 2021; Pessach & Shmueli, 2022). A later goal of this work will be to find a reasonable meaning of sensitive information in the context of WDS.
To formalize this independence and derive easy-to-test notions of group fairness based on the previous subsection “Independence of two random variables” in the next subsection “Generalized notions of group fairness in machine learning”, we need to introduce mathematical notation that allows us to consider independence of two random variables in the context of ML. In such context, probabilities such as in subsection “Independence of two random variables” appear for random variables such as the model’s output $\hat{Y}: \Omega \to \mathcal{Y}$, the labels $Y: \Omega \to \mathcal{Y}$, the features $X: \Omega \to \mathcal{X}$ or, in case of fair ML, the sensitive features $S: \Omega \to \mathcal{S}$ with target space $\mathcal{Y}$, feature space $\mathcal{X}$ and sensitive feature space $\mathcal{S}$.
In this work, we will consider a one-dimensional binary classification task and a single discrete but possibly non-binary sensitive feature,³ i.e., the target space is binary, $\mathcal{Y} = \{0, 1\}$, and the sensitive feature space $\mathcal{S}$ is finite. Equipping each with its power set makes them measurable spaces.
Example 2.3 (Getting an intuition on fair ML). The domain $\Omega$ could consist of criminals in the US and the model could predict whether ($\hat{Y} = 1$) or not ($\hat{Y} = 0$) a criminal will reoffend in the future. This prediction should be independent of their ethnicity (cf. Angwin et al., 2016).
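The typical goal of ML is to learn the relation between the features X and the labels Y, i.e., either the distribution of $(X, Y)$ (generative ML) or, more often, the distribution of Y given $X = x$ (discriminative ML) for any $x \in \mathcal{X}$.⁴ However, as these distributions are usually unknown, we use training data, i.e., samples⁵
$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$$
to estimate $\mathbb{P}(Y = y \mid X = x)$, etc. When it comes to fairness, we extend the training data by the sensitive attribute:
$$\mathcal{D} = \{(x_i, s_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \mathcal{S} \times \mathcal{Y}$$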
Next to these distributions, often, a functional relation between the features X and the labels Y is the object of interest. This is done by learning the overall model $\hat{Y} = f(X)$, composed of a learnable model (or model function) $f: \mathcal{X} \to \mathcal{Y}$, applied to the features $X$. In such a case, we consider a hypothesis space $\mathcal{H}$, i.e., a (sub-)space of functions mapping from the feature space $\mathcal{X}$ to the target space $\mathcal{Y}$. Subsequently, we want to learn the relation between X and Y by finding the optimal function $f \in \mathcal{H}$, such that $f(X) \approx Y$ holds. In most cases, the hypothesis space is a set of functions parameterized by a parameter $\theta$.
Finally, learning the functional relation between the features X and the labels Y by learning the optimal model $f$ is done by comparing the results $f(x_i)$ of the model to the desired results $y_i$ for all $(x_i, y_i)$ from the training data $\mathcal{D}$. The comparisons are done by using a suitable loss function which is applied to these quantities and optimized with respect to the parameter(s) $\theta$ that characterize(s) $f$.
Remark 2.4. Note that often, in ML-related literature, the introduction of $(\Omega, \mathcal{A}, \mathbb{P})$ is omitted. Instead, random variables X on $\mathcal{X}$, Y on $\mathcal{Y}$, etc., are introduced. We introduce $(\Omega, \mathcal{A}, \mathbb{P})$ to guarantee a well-defined usage of probabilities such as $\mathbb{P}(X = x)$ for some $x \in \mathcal{X}$, etc.
Generalized notions of group fairness in machine learning
Motivation
Reflecting that there is no unique definition of fairness in real life, there is an enormous number of different definitions of fairness in ML. While focusing on group fairness, even this category can be further grouped into three subcategories: Independence,⁶ separation and sufficiency. In this context, group fairness can be characterized by some independence connected to the (binary) classification model $\hat{Y}$, the true label $Y$, and the sensitive feature $S$. Barocas, Hardt & Narayanan (2019) define these concepts as follows: Independence requires (mathematical) independence between the model’s prediction $\hat{Y}$ and the sensitive feature S. Separation requires independence between the model’s prediction $\hat{Y}$ and the sensitive feature S, conditioned on events based on the label Y. Sufficiency requires independence between the label Y and the sensitive feature S, conditioned on events based on the model’s prediction $\hat{Y}$. In this work, we will focus on the usually harder to achieve concepts of independence and separation.
However, there are also other definitions that fall under the broad umbrella of group fairness, but which can also be sorted into one of these subcategories (cf. Ruf & Detyniecki, 2021). They are usually defined for a one-dimensional binary classification task and for a single binary sensitive feature only, i.e., in settings where $\mathcal{Y} = \mathcal{S} = \{0, 1\}$ holds (cf. Mehrabi et al., 2021; Pessach & Shmueli, 2022; Ruf & Detyniecki, 2021). For example, one well-known fairness definition is called disparate impact. In most literature, it is assumed that $\hat{y} = 1$ is the class of interest and $s = 0$ is the discriminated group, and therefore, $\mathbb{P}(\hat{Y} = 1 \mid S = 0) \leq \mathbb{P}(\hat{Y} = 1 \mid S = 1)$ holds. In this case, the disparate impact score is defined as the ratio of the passing rate of the discriminated group to that of the privileged group
$$\mathrm{DI} := \frac{\mathbb{P}(\hat{Y} = 1 \mid S = 0)}{\mathbb{P}(\hat{Y} = 1 \mid S = 1)} \tag{2.2}$$
and should satisfy $\mathrm{DI} \geq 1 - \epsilon$ or $\mathrm{DI} \geq p\,\%$ for some $\epsilon \in [0, 1]$ or $p \in [0, 100]$ (cf. Pessach & Shmueli, 2022). The latter rule is also known as the $p\,\%$-rule, and $\epsilon = 0.2$ (or $p = 80$) is a desirable choice (cf. Pessach & Shmueli, 2022; Zafar et al., 2017b). At the same time, the $80\,\%$-rule is also a popular legal term and the reason that the disparate impact score received its importance: It is “designed to mathematically represent the legal notion of disparate impact” (cf. Pessach & Shmueli, 2022), which requires to avoid that “one group’s passing rate is less than 80% of the group with the highest rate” (cf. Biddle, 2006).
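As a small worked illustration of the disparate impact score in the binary setting and of the 80%-rule, consider the following Python sketch. The data are made up, the coding of group 0 as the discriminated group follows the convention described above, and the helper names are assumptions for this example.

```python
def passing_rate(y_pred, s, group):
    """Empirical P(Y_hat = 1 | S = group)."""
    preds = [yp for yp, si in zip(y_pred, s) if si == group]
    return sum(preds) / len(preds)


def disparate_impact_binary(y_pred, s):
    """Passing rate of group 0 (discriminated) over group 1 (privileged)."""
    return passing_rate(y_pred, s, 0) / passing_rate(y_pred, s, 1)


# Hypothetical predictions for five members of each group.
y_pred = [1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
s      = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

di = disparate_impact_binary(y_pred, s)   # 0.4 / 0.8 = 0.5
satisfies_80_rule = di >= 0.8             # False: 0.5 < 0.8
print(di, satisfies_80_rule)
```

Here the discriminated group passes at half the rate of the privileged group, so the 80%-rule is violated.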
The goal of the rest of this subsection is on the one hand to connect these different group fairness notions and on the other hand to introduce generalized notions for more general settings. More precisely, our contribution to this existing research is as follows:
- Starting from the definitions of Barocas, Hardt & Narayanan (2019) for independence (subsection “Independence”) and separation (subsection “Separation”) each, we will make use of the particularly easy necessary and sufficient condition for the independence of two random variables (lemma 2.2), which we derived in subsection “Independence of two random variables” to derive easy-to-test and generalized notions of group fairness in the context of subsection “Mathematical notation for machine learning”. In detail, these notions are applicable for more general settings, i.e., not only for a one-dimensional binary classification task and a single binary sensitive feature, which is the setting on which the majority of the literature focuses (cf. Feldman et al., 2015; Hardt, Price & Srebro, 2016; Kamiran & Calders, 2009; Kamiran & Calders, 2010; Mehrabi et al., 2021; Pessach & Shmueli, 2022; Ruf & Detyniecki, 2021; Zafar et al., 2017a, 2017b). 
- Based on these notions, we will derive generalized empirical notions of the most common group fairness definitions… 
- … and prove that these coincide with the corresponding definitions in the setting of a one-dimensional binary classification task and a single binary sensitive feature. 
While the generalized empirical group fairness definitions will appear to be intrinsic compared to already existing definitions, our theoretical work in subsection “Independence of two random variables” shows that these generalizations constitute not only a necessary but also a sufficient condition for the desired independence criterion on which they are based. As a summary, an overview of already existing definitions and how we extend these is displayed in Tables 1 and 2 (subsection “Summary of generalized notions of group fairness”).
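Table 1

| | Definition according to Barocas et al. | Derived necessary and sufficient condition |
|---|---|---|
| Independence | $\hat{Y}$ and $S$ are independent with respect to $\mathbb{P}$ | $\mathbb{P}(\hat{Y} = \hat{y} \mid S = s) = \mathbb{P}(\hat{Y} = \hat{y})$ for all $\hat{y} \in \mathcal{Y}$, $s \in \mathcal{S}$ |
| Separation | $\hat{Y}$ and $S$ are independent with respect to $\mathbb{P}(\cdot \mid Y = y)$ for all $y \in \mathcal{Y}$ | $\mathbb{P}(\hat{Y} = \hat{y} \mid S = s, Y = y) = \mathbb{P}(\hat{Y} = \hat{y} \mid Y = y)$ for all $\hat{y}, y \in \mathcal{Y}$, $s \in \mathcal{S}$ |

Table 2

| | Derived generalized empirical definitions (multi cases) | Existing empirical definitions (binary cases) |
|---|---|---|
| Independence: disparate impact | $\mathcal{Y} = \{0, 1\}$, $\mathcal{S}$ arbitrary: $\frac{\min_{s \in \mathcal{S}} \mathbb{P}(\hat{Y} = 1 \mid S = s)}{\max_{s \in \mathcal{S}} \mathbb{P}(\hat{Y} = 1 \mid S = s)}$ | $\mathcal{Y} = \mathcal{S} = \{0, 1\}$: $\frac{\mathbb{P}(\hat{Y} = 1 \mid S = 0)}{\mathbb{P}(\hat{Y} = 1 \mid S = 1)}$ |
| Independence: demographic parity | $\mathcal{Y}$, $\mathcal{S}$ arbitrary: $\max_{\hat{y} \in \mathcal{Y}} \max_{s, s' \in \mathcal{S}} \lvert \mathbb{P}(\hat{Y} = \hat{y} \mid S = s) - \mathbb{P}(\hat{Y} = \hat{y} \mid S = s') \rvert$ | $\mathcal{Y} = \mathcal{S} = \{0, 1\}$: $\lvert \mathbb{P}(\hat{Y} = 1 \mid S = 0) - \mathbb{P}(\hat{Y} = 1 \mid S = 1) \rvert$ |
| Separation: equal opportunity | $\mathcal{Y} = \{0, 1\}$, $\mathcal{S}$ arbitrary: $\max_{s, s' \in \mathcal{S}} \lvert \mathbb{P}(\hat{Y} = 1 \mid S = s, Y = 1) - \mathbb{P}(\hat{Y} = 1 \mid S = s', Y = 1) \rvert$ | $\mathcal{Y} = \mathcal{S} = \{0, 1\}$: $\lvert \mathbb{P}(\hat{Y} = 1 \mid S = 0, Y = 1) - \mathbb{P}(\hat{Y} = 1 \mid S = 1, Y = 1) \rvert$ |
| Separation: equalized odds | $\mathcal{Y}$, $\mathcal{S}$ arbitrary, $y \in \mathcal{Y}$: $\max_{\hat{y} \in \mathcal{Y}} \max_{s, s' \in \mathcal{S}} \lvert \mathbb{P}(\hat{Y} = \hat{y} \mid S = s, Y = y) - \mathbb{P}(\hat{Y} = \hat{y} \mid S = s', Y = y) \rvert$ | $\mathcal{Y} = \mathcal{S} = \{0, 1\}$, $y \in \mathcal{Y}$: $\lvert \mathbb{P}(\hat{Y} = 1 \mid S = 0, Y = y) - \mathbb{P}(\hat{Y} = 1 \mid S = 1, Y = y) \rvert$ |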
For technical reasons, we assume that all of the following conditional probabilities exist.
Independence
An easy-to-test notion of group fairness
Definition 2.5 (Fairness according to the independence criterion (cf. Barocas, Hardt & Narayanan, 2019)).
The classification model $\hat{Y}$ is fair with respect to the sensitive feature S in the sense of the independence criterion if and only if (iff) $\hat{Y}$ and S are mathematically independent with respect to $\mathbb{P}$.
Based on this definition of Barocas, Hardt & Narayanan (2019), lemma 2.2 induces the following easy-to-test independence criterion in the context of fair ML:
Corollary 2.6 (Fairness according to the independence criterion).
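The classification model $\hat{Y}$ is fair with respect to the sensitive feature S in the sense of the independence criterion iff
$$\mathbb{P}(\hat{Y} = \hat{y} \mid S = s) = \mathbb{P}(\hat{Y} = \hat{y})$$
holds for all $\hat{y} \in \mathcal{Y}$ and all $s \in \mathcal{S}$.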
Generalized empirical notions of group fairness In practice, exact equality according to corollary 2.6 is usually not achieved. This motivates keeping the difference between both sides of the equation(s) as small as possible, which translates to the following two generalized definitions of disparate impact and demographic parity (DP).
More precisely, while for group fairness, the majority of the literature focuses on a binary classification model $\hat{Y}$ and a single binary sensitive feature S (cf. Feldman et al., 2015; Hardt, Price & Srebro, 2016; Kamiran & Calders, 2009; Kamiran & Calders, 2010; Mehrabi et al., 2021; Pessach & Shmueli, 2022; Ruf & Detyniecki, 2021; Zafar et al., 2017a, 2017b), in this work’s definitions, we generalize the understanding of group fairness to a non-binary sensitive feature S, which can also be used to model even multiple non-binary sensitive features (remark 2.9).
While disparate impact is specifically designed for a binary classification task, i.e., for a setting where $\mathcal{Y} = \{0, 1\}$ holds and where the class $\hat{y} = 1$ is the preferred one (remark 2.11), the demographic parity score additionally allows generalization to a one- or multidimensional non-binary classifier by definition and based on the theoretical background in subsection “Independence of two random variables”:⁷
Definition 2.7 (Disparate impact).
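Let $\epsilon \in [0, 1]$. The disparate impact score
$$\mathrm{DI} := \frac{\min_{s \in \mathcal{S}} \mathbb{P}(\hat{Y} = 1 \mid S = s)}{\max_{s \in \mathcal{S}} \mathbb{P}(\hat{Y} = 1 \mid S = s)}$$
measures the (un-)fairness of the classification model $\hat{Y}$ with respect to the sensitive feature S in the sense of the independence criterion. For the model $\hat{Y}$, disparate impact is limited to $\epsilon$ iff $\mathrm{DI} \geq 1 - \epsilon$ holds.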
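The following Python sketch estimates a disparate impact score for a non-binary sensitive feature as the ratio of the smallest to the largest group-wise passing rate. This min/max form is a reading of the definition based on the surrounding discussion of the most discriminated and most privileged group; the function names and the three-group data are hypothetical.

```python
from collections import defaultdict


def positive_rates(y_pred, s):
    """Empirical P(Y_hat = 1 | S = group) for every observed group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for yp, group in zip(y_pred, s):
        counts[group][0] += yp
        counts[group][1] += 1
    return {g: pos / tot for g, (pos, tot) in counts.items()}


def disparate_impact(y_pred, s):
    """Smallest over largest group-wise passing rate.

    Assumes at least one group receives a positive prediction,
    otherwise the denominator is zero.
    """
    rates = positive_rates(y_pred, s).values()
    return min(rates) / max(rates)


# Hypothetical example with three groups (e.g., areas of a network).
s      = ["north"] * 4 + ["center"] * 4 + ["south"] * 4
y_pred = [1, 1, 0, 0,    1, 1, 1, 0,     1, 1, 1, 1]

print(positive_rates(y_pred, s))    # north 0.5, center 0.75, south 1.0
print(disparate_impact(y_pred, s))  # 0.5
```

With exactly two groups, the score reduces to the ratio of the smaller to the larger passing rate.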
Definition 2.8 (Demographic parity).
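Let $\epsilon \in [0, 1]$. The demographic parity score
$$\mathrm{DP} := \max_{\hat{y} \in \mathcal{Y}} \max_{s, s' \in \mathcal{S}} \left| \mathbb{P}(\hat{Y} = \hat{y} \mid S = s) - \mathbb{P}(\hat{Y} = \hat{y} \mid S = s') \right|$$
measures the (un-)fairness of the classification model $\hat{Y}$ with respect to the sensitive feature S in the sense of the independence criterion. For the model $\hat{Y}$, demographic parity holds with respect to $\epsilon$ iff $\mathrm{DP} \leq \epsilon$ holds.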
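A demographic parity score of this kind can be estimated as the largest gap, over all predicted labels, between the group-wise prediction rates. The sketch below assumes this pairwise-gap reading of the definition; the function name, the `labels` parameter and the data are illustrative.

```python
def demographic_parity(y_pred, s, labels=(0, 1)):
    """Largest spread of P(Y_hat = label | S = group) over groups and labels."""
    groups = sorted(set(s))
    score = 0.0
    for label in labels:
        rates = []
        for group in groups:
            members = [yp for yp, g in zip(y_pred, s) if g == group]
            rates.append(sum(1 for yp in members if yp == label) / len(members))
        # The maximal pairwise difference equals max rate minus min rate.
        score = max(score, max(rates) - min(rates))
    return score


# Hypothetical example with three groups.
s      = ["north"] * 4 + ["center"] * 4 + ["south"] * 4
y_pred = [1, 1, 0, 0,    1, 1, 1, 0,     1, 1, 1, 1]

print(demographic_parity(y_pred, s))  # 0.5
```

Passing more than two values in `labels` covers a non-binary classifier, in which case every label contributes to the score.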
Remark 2.9 In our previous work (cf. Strotherm & Hammer, 2023), we consider different binary random variables when defining disparate impact. Encoding the single non-binary random variable S from this work to K such binary random variables for all yields the same definition of disparate impact as given in Strotherm & Hammer (2023). We change the notation in this work because it is more intuitive compared to common fairness definitions (e.g., cf. Eq. (2.2) and proof of lemma 2.10) and easily shows how these fairness definitions can be extended even to multiple non-binary sensitive features: In this case, the random vector with and encodes all possibly non-binary single sensitive features for .
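The two encodings mentioned in remark 2.9 can be sketched in a few lines of Python. This is one plausible reading, not necessarily the exact construction used in the paper: several sensitive features are merged into one composite feature whose values are tuples, and a single non-binary feature is expanded into one binary indicator per group. All names and data are hypothetical.

```python
from collections import Counter


def combine_sensitive_features(*features):
    """Merge several per-sample feature lists into one composite feature."""
    return list(zip(*features))


def one_hot_groups(s):
    """Encode one non-binary feature as a binary indicator list per group."""
    return {group: [int(value == group) for value in s]
            for group in sorted(set(s))}


district = ["north", "north", "south", "south", "north"]
income   = ["low", "high", "low", "high", "low"]

combined = combine_sensitive_features(district, income)
group_sizes = Counter(combined)        # sizes of the composite groups
indicators = one_hot_groups(district)  # K = 2 binary variables

print(combined[0])                     # ('north', 'low')
print(indicators["north"])             # [1, 1, 0, 0, 1]
```

The composite feature can then be passed to any group fairness score that accepts a single non-binary sensitive feature.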
Accordance of empirical notions of group fairness in the binary case In case of a binary classification task and a single binary sensitive feature, our definitions coincide with the according definitions known from the before-mentioned literature:
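Lemma 2.10 If $\mathcal{S} = \{0, 1\}$ holds, the disparate impact score and the demographic parity score according to definition 2.7 and 2.8, respectively, coincide with the corresponding definitions known from the literature.
Proof. If $\mathcal{S} = \{0, 1\}$ holds, the fact that $\mathcal{Y} = \{0, 1\}$ holds implies that the probability measure $\mathbb{P}(\hat{Y} \in \cdot \mid S = s)$ is uniquely determined by the probability $\mathbb{P}(\hat{Y} = 1 \mid S = s)$ for all $s \in \mathcal{S}$. Therefore, the independence criterion (corollary 2.6) becomes
$$\mathbb{P}(\hat{Y} = 1 \mid S = s) = \mathbb{P}(\hat{Y} = 1) \quad \text{for all } s \in \{0, 1\}.$$
By the same fact,
$$\left| \mathbb{P}(\hat{Y} = 0 \mid S = 0) - \mathbb{P}(\hat{Y} = 0 \mid S = 1) \right| = \left| \mathbb{P}(\hat{Y} = 1 \mid S = 0) - \mathbb{P}(\hat{Y} = 1 \mid S = 1) \right|$$
holds. Therefore, the demographic parity score (definition 2.8) becomes
$$\mathrm{DP} = \left| \mathbb{P}(\hat{Y} = 1 \mid S = 0) - \mathbb{P}(\hat{Y} = 1 \mid S = 1) \right|.$$
Moreover, the disparate impact score (definition 2.7) becomes
$$\mathrm{DI} = \frac{\min\{\mathbb{P}(\hat{Y} = 1 \mid S = 0), \mathbb{P}(\hat{Y} = 1 \mid S = 1)\}}{\max\{\mathbb{P}(\hat{Y} = 1 \mid S = 0), \mathbb{P}(\hat{Y} = 1 \mid S = 1)\}}.$$
In most literature, where $s = 0$ is assumed to be the discriminated group, and therefore, where $\mathbb{P}(\hat{Y} = 1 \mid S = 0) \leq \mathbb{P}(\hat{Y} = 1 \mid S = 1)$ holds, this simplifies to
$$\mathrm{DI} = \frac{\mathbb{P}(\hat{Y} = 1 \mid S = 0)}{\mathbb{P}(\hat{Y} = 1 \mid S = 1)}$$
(cf. Eq. (2.2)). These are the definitions of the disparate impact and the demographic parity score usually found in the literature (cf. Mehrabi et al., 2021; Pessach & Shmueli, 2022; Ruf & Detyniecki, 2021; Zafar et al., 2017b).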
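The statement of lemma 2.10 can also be checked numerically on a small example. The sketch below uses exact rational arithmetic and assumes min/max and pairwise-gap forms for the generalized scores; function names and data are hypothetical, and group 0 plays the role of the discriminated group.

```python
from fractions import Fraction


def rate(y_pred, s, group, label=1):
    """Exact empirical P(Y_hat = label | S = group)."""
    members = [yp for yp, g in zip(y_pred, s) if g == group]
    return Fraction(sum(1 for yp in members if yp == label), len(members))


def generalized_di(y_pred, s):
    rates = [rate(y_pred, s, g) for g in set(s)]
    return min(rates) / max(rates)


def generalized_dp(y_pred, s):
    spreads = []
    for label in (0, 1):
        rates = [rate(y_pred, s, g, label) for g in set(s)]
        spreads.append(max(rates) - min(rates))
    return max(spreads)


# Binary sensitive feature; group 0 has the lower passing rate.
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
s      = [0, 0, 0, 0, 1, 1, 1, 1]

a, b = rate(y_pred, s, 0), rate(y_pred, s, 1)  # 1/4 and 3/4
binary_di = a / b                              # literature form, Eq. (2.2)
binary_dp = abs(a - b)                         # literature form

print(generalized_di(y_pred, s) == binary_di)  # True
print(generalized_dp(y_pred, s) == binary_dp)  # True
```

Both comparisons hold because, for a binary classifier, the gap for label 0 equals the gap for label 1.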
As already briefly touched on in the subsection “Motivation”, the disparate impact criterion assures that the relative amount of positive predictions within the discriminated group (or, in our generalized case of non-binary sensitive features, within the most discriminated group) falls short of the relative amount of positive predictions within the privileged group (or, in our generalized case, within the most privileged group) by at most the relative margin $\epsilon$ (definition 2.7). In short and in either case: It aims at obtaining similar or equal success or passing rates among groups.
Similarly, in a binary classification task, the demographic parity criterion assures that the relative amount of positive predictions deviates at most $\epsilon$ among groups (cf. proof of lemma 2.10 or Table 2). In contrast, in a non-binary classification task, the demographic parity criterion assures that the relative amount of predictions of any class deviates at most $\epsilon$ among groups (definition 2.8).
Thus, while both criteria assure similar or equal passing rates among groups in the setting of a binary classification task, they assure different things in the setting of a non-binary classification task due to the consideration of all labels in the demographic parity criterion (Table 2).
Remark 2.11 (Generalizability of the disparate impact score). Similar to the demographic parity score (definition 2.8), one could ask whether it makes sense to generalize the disparate impact score to arbitrary discrete target spaces by
(2.3) However, this generalized definition would not coincide with the common one from Eq. (2.2) in the setting of lemma 2.10: For example, consider the case
In this case, the disparate impact score according to definition 2.7 is equal to , which usually is a score considered to be fair. In contrast, the disparate impact score according to Eq. (2.3) is equal to , which usually is a score considered to be unfair. The reason is that the idea of disparate impact relies on the fact that the class is the desired one and only the relative amount of positive predictions among groups is of interest (cf. Pessach & Shmueli, 2022). Therefore, it does only make sense to define disparate impact score as we do in definition 2.7.
Separation
An easy-to-test notion of group fairness. Depending on the application, one disadvantage of fairness notions that belong to the fairness concept independence could be the missing dependence on the true label Y. In such a case, even if the model was perfect, i.e., if $\hat{Y} = Y$ held, it would be denoted as unfair if the relative amount of positive training labels differed significantly among groups (cf. Hardt, Price & Srebro, 2016).
This issue is addressed by the fairness concept separation, which in contrast to the fairness concept independence requires (mathematical) independence between the model’s prediction $\hat{Y}$ and the sensitive feature S, conditioned on Y:
Definition 2.12 (Fairness according to the separation criterion (cf. Barocas, Hardt & Narayanan, 2019)).
The classification model is fair with respect to the sensitive feature S in the sense of the separation criterion iff and S are mathematically independent with respect to for all .
Using the modified probability measure for , lemma 2.2 again induces the following easy-to-test separation criterion in the context of fair ML:
Corollary 2.13 (Fairness according to the separation criterion).
The classification model is fair with respect to the sensitive feature S in the sense of the separation criterion iff
holds for all and all .
Generalized empirical notions of group fairness Again, in practice, exact equality according to corollary 2.13 is usually not achieved. Therefore, again, keeping the difference between both sides of the equation as small as possible motivates the following generalized definitions, where similar to the previous subsection “Independence”, the second one is specifically designed for a binary classification task, i.e., for settings where holds, and where the class is the preferred one.
Definition 2.14 (Equalized odds).
Let . The equalized odds scores
measure the (un-)fairness of the classification model with respect to the sensitive feature S in the sense of the separation criterion for all . For the model , equalized odds hold with respect to iff holds for all .8
Definition 2.15 (Equal opportunity).
Let . The equal opportunity (EO) score
measures the (un-)fairness of the classification model with respect to the sensitive feature S in the sense of the separation criterion. For the model , equal opportunity holds with respect to iff holds.
Arguments similar to those in subsection “Independence” show how these definitions allow a generalized understanding of group fairness for non-binary and even multiple non-binary sensitive features S, and for a one- or multi-dimensional non-binary classifier .
Accordance of empirical notions of group fairness in the binary case In case of a binary classification task and a single binary sensitive feature, our definitions coincide with the according definitions known from other literature:
Lemma 2.16. If holds, the equalized odds scores for and the equal opportunity score according to definitions 2.14 and 2.15, respectively, coincide with the corresponding definitions known from the literature.
Proof. If holds, similar to the proof of lemma 2.10, the separation criterion (corollary 2.13) becomes
for , and
holds.9 Therefore, the equalized odds scores (definition 2.14) become
(comparison of true positive rates (TPRs) and false positive rates (FPRs) among groups) and the equal opportunity score (definition 2.15) becomes
(comparison of TPRs among groups). These are the definitions of the equalized odds and the equal opportunity score(s) usually found in the literature (cf. Mehrabi et al., 2021; Pessach & Shmueli, 2022; Ruf & Detyniecki, 2021; Zafar et al., 2017a).
While equalized odds ensure that the true positive rates (TPRs) and true negative rates (TNRs) (or equivalently, false positive rates (FPRs)) among groups differ at most % in a binary classification task, equal opportunity only concentrates on TPRs among groups. In contrast, in a non-binary classification task where the TPRs and FPRs are not well-defined, equalized odds refer to similar or equal correct classification rates per label among groups (cf. definition 2.14) and display a natural generalization of equal opportunity in this setting (cf. definition 2.15).
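The binary case described above can be sketched in a few lines. This is a minimal illustration, assuming that the equal opportunity score is measured as the largest gap between the TPRs of any two groups; the function names `group_tprs` and `equal_opportunity_score` are hypothetical, and definition 2.15 remains the authoritative form.

```python
import numpy as np

def group_tprs(y_true, y_pred, s):
    """True positive rate per protected group (hypothetical helper)."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    tprs = {}
    for k in np.unique(s):
        # Samples of group k whose true label is positive.
        mask = (s == k) & (y_true == 1)
        if mask.any():
            tprs[k.item()] = float(y_pred[mask].mean())
    return tprs

def equal_opportunity_score(y_true, y_pred, s):
    """Assumed form: largest TPR gap between any two groups (0 is ideal)."""
    rates = list(group_tprs(y_true, y_pred, s).values())
    return max(rates) - min(rates)
```

With more than two groups, the max-minus-min gap bounds every pairwise TPR difference, which is why it is used here as the illustrative aggregate.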
Remark 2.17. Nevertheless, we will not make use of equalized odds in this work, as the TNRs and FPRs given by and for any , respectively, do not exist in our domain of application although being in the setting of a binary classification task, as we will see in subsection “Fairness in leakage detection”.
Summary of generalized notions of group fairness
To conclude, in this section, we derived generalized exact and empirical notions of group fairness based on the mathematical concept of independence and suitable for a single, but also multiple non-binary sensitive feature(s). All exact and some empirical notions are suitable for not only one-, but also multi-dimensional non-binary classification models. We additionally showed that the notions coincide with common group fairness definitions in the case of a binary classification task and a single binary sensitive feature.
Tables 1 and 2 summarize the already existing definitions and our contributions.
Remark 2.18 (Computation of group fairness scores in practice). In practice, the true distributions and , on which the probabilities displayed in Tables 1 and 2 are based, are unknown. Therefore, as elaborated in subsection “Mathematical notation for machine learning”, the fairness scores are computed using the empirical approximations and based on the training data , respectively, yielding the required approximated probabilities
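As a minimal sketch of such an empirical computation, the following estimates group-wise positive prediction rates by relative frequencies and combines them into a disparate impact score. It assumes that the generalized score of definition 2.7 is the ratio of the smallest to the largest group-wise rate; the function name and the handling of the degenerate case are illustrative choices, not taken from the paper.

```python
import numpy as np

def disparate_impact_score(y_pred, s):
    """Ratio of smallest to largest group-wise positive prediction rate.

    Assumed form of the generalized score; probabilities are replaced
    by relative frequencies on the given samples.
    """
    y_pred, s = np.asarray(y_pred, dtype=float), np.asarray(s)
    rates = [y_pred[s == k].mean() for k in np.unique(s)]
    largest = max(rates)
    if largest == 0:
        # No positive prediction in any group: the ratio is undefined.
        return float("nan")
    return float(min(rates) / largest)
```

A value of 1 means that all groups receive positive predictions at the same empirical rate; values close to 0 indicate that at least one group is rarely predicted positive compared to the best-served group.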
Leakage detection in water distribution systems
In view of the AI Act, by being part of the critical infrastructure, WDSs belong to high-risk systems (cf. Veale & Borgesius, 2021). In this context, “(m)uch attention has been paid to the potential for AI systems to facilitate indirect discrimination, (which is) in principle illegal under EU law” (cf. Veale & Borgesius, 2021). One requirement of such systems is therefore to check the system for bias and to document the system’s performance for different demographic groups (cf. Strotherm et al., 2023). While this could suggest the use of group fairness definitions implicitly, the guidelines for trustworthy AI explicitly name fairness as one of the seven essential requirements for such systems (cf. European Union, 2019).
A key challenge in the domain of WDSs where AI, or more precisely, ML, is used, is to detect leakages (cf. Artelt et al., 2022; Guo et al., 2021; Li et al., 2022; Romero-Ben et al., 2022; Steffelbauer et al., 2022; Vrachimis et al., 2022). The main components of a WDS relevant for this work are nodes and pipes, through which water can be supplied to end users such as private households, hospitals or schools located at the nodes of the network, but which are also vulnerable to leakages. Detecting leakages is therefore crucial to guarantee a consistent water supply, and it can also affect other important tasks such as short-term decision making and long-term planning of WDSs.
Therefore, as requested by the AI Act and the guidelines for trustworthy AI, in this section, we present a common ML-based pipeline and concrete instantiations of how to detect leakages in WDSs (subsection “Methodology of leakage detection”). Subsequently, we investigate what fairness can mean (subsection “Fairness in leakage detection”) and whether it is satisfied (subsections “Application domain and data set” and “Experimental results and analysis: Residual-based ensemble leakage detection does not obey fairness”) in this context according to common group fairness notions as introduced in subsection “Generalized notions of group fairness in machine learning”.
Methodology of leakage detection
In the task of leakage detection, the domain (cf. subsection “Mathematical notation for machine learning”) corresponds to possible states of a WDS, determined by time-dependent demands of the end users located at the nodes in the system and which may be affected by leakages. We assume that among these nodes, nodes are provided with sensors (usually, ), which deliver pressure measurements at different times and which can be used for the task at hand. As the sensors usually measure pressure values within fixed time intervals , we introduce the notation , where is some fixed reference point with respect to time.
There are several methodologies that make use of such pressure measurements to approach the problem of leakage detection using ML. Using the notation from subsection “Mathematical notation for machine learning”, the goal is to train a binary classifier with that predicts the true state of the WDS with respect to the question whether a leakage is active ( ) or not ( ). Hereby, the feature space depends on the specific method but is related to the before-mentioned pressure measurements.
One standard approach comes in three steps (cf. Isermann, 2006): First, so-called virtual sensors are trained, i.e., regression models that are able to predict the pressure at some time and at a node (or even ), based on the pressure measurements observed at that (or earlier) time(s) and at (a choice of) the sensor nodes . Subsequently, these virtual sensors are used to compute pressure residuals of measured and predicted pressure. Finally, these pressure residuals are fed into a leakage detector that is able to predict whether a leakage is present in the WDS at the time of the used residual (cf. Isermann, 2006). An overview of this pipeline is displayed in Fig. 1.
Figure 1: Standard leakage detection pipeline.
The approach can differ depending on the concrete instantiation of virtual sensors and the leakage detector. In this subsection, we first formalize the idea of the general leakage detection pipeline described above in more detail (subsection “Leakage detection pipeline”). Subsequently, we present two concrete instantiations of this pipeline (subsection “Leakage detection instantiations”), which we will investigate with respect to the question of fairness in the rest of this section.
Leakage detection pipeline
Virtual sensors Given vector inputs that are derived from the pressure measurements observed at (multiple) times and at the sensor nodes in the WDS, so-called virtual sensors, i.e., regression models
that predict the pressure at times and at the sensor node are trained for each sensor node . Hereby, the dimension and the inputs depend on the specific model architecture used (cf. subsection “Leakage detection instantiations” and Artelt et al., 2022; Ashraf et al., 2023; Isermann, 2006).
Pressure residuals Independent of what specific instantiations of virtual sensors for are used, standard leakage detection methods rely on the pressure residuals
we obtain from the pressure measurements and the pressure predictions at (possibly unseen) times and at the sensor node for all (cf. Artelt et al., 2022; Isermann, 2006).
Leakage detection Based on pressure residuals (i.e., ) at times and at the sensor nodes in the WDS, a classification model (or more precisely, the learnable model , which is applied to the feature pressure residuals; cf. subsection “Mathematical notation for machine learning”)
that predicts whether ( ) or not ( ) a leakage is present in the WDS is defined or trained. Hereby, indicates a choosable or trainable (hyper-)parameter and the hypothesis space depends on the specific model architecture used (subsections “Leakage detection instantiations”, “Fairness-enhancing leakage detection in water distribution networks” and Artelt et al., 2022; Isermann, 2006).
Leakage detection instantiations
The previous subsection gives a general pipeline on how to detect leakages in a WDS based on pressure measurements, pressure predictions based on virtual sensors, resulting pressure residuals and finally, the leakage detection itself (cf. Fig. 1). In this subsection, we present specific instantiations of this approach.
Linear virtual sensors The first approach is based on the work of Artelt et al. (2022): In this case, each virtual sensor at each sensor node corresponds to a linear regression model. The inputs at times consist of rolling means
at all sensor nodes except the node and with a to be chosen time window . By that, each regression model’s input dimension equals .
Based on that, the virtual sensors for each sensor node are trained on leakage free training data . More precisely, holds for all realisations of the label Y.
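The linear virtual sensors can be sketched as follows. This is a minimal illustration under several assumptions that are not fixed by the text above: the rolling mean is a trailing window over the last `window` time steps, the regression target is the raw pressure at the held-out sensor node, an intercept is included, and the fit is an ordinary least-squares fit. All function names are hypothetical.

```python
import numpy as np

def rolling_mean(p, window):
    """Trailing rolling mean of p with shape (T, n_sensors).

    Row i of the result averages p[i : i + window], i.e. it belongs to
    time index i + window - 1. Output shape: (T - window + 1, n_sensors).
    """
    c = np.cumsum(np.vstack([np.zeros((1, p.shape[1])), p]), axis=0)
    return (c[window:] - c[:-window]) / window

def _design_matrix(p, j, window):
    # Rolling means of all sensor nodes except node j, plus an intercept.
    x = np.delete(rolling_mean(p, window), j, axis=1)
    return np.hstack([x, np.ones((x.shape[0], 1))])

def fit_linear_virtual_sensor(p, j, window):
    """Least-squares fit on leakage-free pressure data p for node j."""
    target = p[window - 1:, j]
    coef, *_ = np.linalg.lstsq(_design_matrix(p, j, window), target, rcond=None)
    return coef

def pressure_residuals(p, j, window, coef):
    """Measured minus predicted pressure at node j (possibly unseen data)."""
    return p[window - 1:, j] - _design_matrix(p, j, window) @ coef
```

Excluding node `j` from its own inputs leaves each model with one input per remaining sensor node, which is what makes the residual at node `j` informative: a leak changes the measured pressure in a way the other nodes' rolling means do not explain.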
GCN virtual sensors In contrast, the second approach is based on the work of Ashraf et al. (2023): In this case, each virtual sensor at each sensor node is obtained by training a single GCN model.
The GCN model is trained on leakage free training data
More precisely, the GCN model inputs the sparse pressure measurements at the sensor nodes and outputs the pressure predictions at each node of the WDS. However, for this work, the pressure predictions at the sensor nodes are enough: The virtual sensors at each sensor node can be considered as the entry-wise output of the overall GCN model .
By that, the inputs at times are given by the node-independent pressure measurements themselves for all sensor nodes , and each regression model’s input dimension equals .
Ensemble leakage detection: The H-method Independent of the choice of virtual sensor, based on the pressure residuals at times we obtain from these, a simple leakage detection method that performs well on standard benchmarks is the threshold-based ensemble classification introduced by Artelt et al. (2022): Without any further training, we can choose a node-dependent hyperparameter to define a (local) classifier for each sensor node by
We then obtain an ensemble classifier with feature space and hyperparameter (i.e., ) that predicts whether there is a leakage present in the WDS at time or not, defined by
(3.4) Simply put into words, a node-dependent classifier detects a leakage when the node-dependent pressure-residual at time exceeds the node-dependent threshold and the ensemble classifier detects a leakage when one of the node-dependent classifiers for any does.
We call this overall instantiation of the standard leakage detection pipeline (cf. Fig. 1) independent of the instantiation of the virtual sensors and characterized by choosing the Hyperparameter the H-method. Note that the H-method does not need further training once it has access to feature pressure residuals . How to introduce a trainable structure to this last component of the pipeline will be part of subsection “Fairness-enhancing leakage detection in water distribution networks”.
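The verbal description of the H-method translates directly into a few lines of code. The sketch below is illustrative only: it assumes that a local classifier compares the absolute residual with its threshold using a strict inequality, which the text above does not spell out, and the function name `h_method_predict` is hypothetical.

```python
import numpy as np

def h_method_predict(residuals, thresholds):
    """Threshold-based ensemble prediction.

    residuals:  array of shape (T, d), one column per sensor node
    thresholds: array of shape (d,), one threshold per sensor node
    returns:    array of shape (T,), 1 = leakage detected, 0 = no leakage
    """
    residuals = np.asarray(residuals, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Local classifiers: assumed comparison of absolute residual and threshold.
    local_alarms = np.abs(residuals) > thresholds
    # Ensemble: alarm as soon as at least one local classifier raises one.
    return local_alarms.any(axis=1).astype(int)
```

Because the ensemble takes a logical OR over the sensor nodes, lowering a single node's threshold can only add alarms; this is the degree of freedom that later becomes a trainable parameter.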
Fairness in leakage detection
After having introduced a pipeline to define a leakage detection model and possible concrete instantiations of such in the previous subsection, the question arises as to how leakage detection is related to fairness in the sense of subsection “Generalized notions of group fairness in machine learning”. One key contribution of this work is to answer this question, i.e., to introduce the notion of fairness in the application domain of WDSs by defining suitable sensitive features in the context of leakage detection or other ML-based services in WDSs.
Sensitive features in ML-based services in WDS Knowing that each node of the WDS corresponds to a group of consumers, a natural question is whether these local groups benefit from the WDS and its related services, such as leakage detection, to an equal degree. To ensure that the methods that will be presented in subsection “Methodology of fairness-enhancing leakage detection” scale to larger WDSs, we do not consider single nodes but groups of nodes in the WDS as protected groups in terms of fairness. Then, given that a leakage is active in the WDS, i.e., that holds, we define the sensitive feature to answer the question where, i.e., in which protected group , this leakage is active.10 In terms of equal service, one would expect an equally good detection of leakages independent of the leakage location, i.e., the protected group. This understanding of sensitive features, protected groups and, consequently, fairness in WDSs can of course be adapted to other ML-based services in WDSs, for example, to contamination detection.
Fairness notions in ML-based services in WDS In this work, we will focus on the evaluation of fairness by choosing one fairness notion each from the fairness concepts independence and separation (subsections “Independence” and “Separation”): Disparate impact for independence (definition 2.7) due to its importance also in the legal context (cf. subsection “Motivation”), and equal opportunity for separation (definition 2.15). Regarding the latter concept, considering that our sensitive feature S is defined on the event , this shows why using equalized odds is not possible in this setting, as already mentioned in remark 2.17 and as shown in the proof of the next lemma 3.1.
Fairness properties in ML-based services in WDS Given this definition of a non-binary sensitive feature S in the WDS, we obtain the following important results with regard to the notions of fairness chosen.
Lemma 3.1 (Equivalence of disparate impact and equal opportunity in WDSs). Let be the sensitive feature describing where a leakage in one of the protected groups of the WDS is active. Moreover, let and define .
1. If disparate impact is limited to , equal opportunity holds with respect to .
2. If equal opportunity holds with respect to , disparate impact is limited to .
Proof. First of all, note that for any for which there exists a such that holds, must hold by definition of the sensitive feature S (this is why it only makes sense to define S on ). Therefore, is empty for all . Subsequently, we obtain
and thus, for all .
Secondly, we also define . Then, we easily find that and, together with the first observation, holds (cf. definition 2.7 and 2.15).
Now the rest follows by simple equivalent transformations: In setting 1, we find that
(3.5) holds. In setting 2, we obtain
(3.6)
Corollary 3.2. Given the setting of lemma 3.1,
- 1. for and 
- 2. for holds. 
Proof. This is a direct consequence of lemma 3.1, where we choose in setting 1 and in setting 2, and where we can work with equalities instead of estimations in Eqs. (3.5) and (3.6), respectively.
Application domain and data set
After having introduced an appropriate definition of a sensitive feature and protected groups in WDSs in the previous subsection “Fairness in leakage detection”, in order to test whether the concrete instantiations of leakage detection methods presented in subsection “Leakage detection instantiations” are fair in this sense, we need to generate suitable data based on given WDS structures.
The WDSs considered are Hanoi (cf. Santos-Ruiz et al., 2022; Vrachimis et al., 2018) and L-Town (cf. Vrachimis et al., 2022; Vrachimis et al., 2020) displayed in Figs. 2 and 3, respectively. While Hanoi consists of 32 nodes, among which three are provided with sensors, and 34 links, L-Town is a more realistic WDS consisting of 785 nodes, among which 33 are provided with sensors, and 909 links. The latter is constructed in a way to mimic a true WDS while satisfying security defaults and is one of the state-of-the-art WDSs in the water domain.
Figure 2: The Hanoi WDS, its sensor nodes (IDs 3, 10 and 25) and the protected groups, each highlighted in another color (group 1 on the left side in light shade, group 2 in the middle in dark shade, group 3 on the right side in middle shade).
The sensor nodes are colored in the same color as the protected group to which they belong and highlighted with a grey circle.
Figure 3: The L-Town WDS, its sensor nodes and the protected groups, each highlighted in another color (group 1, also called area C, on the top left in middle shade; group 2, also called area A, in light shade; group 3, also called area B, on the bottom in the middle in dark shade).
The sensor nodes are colored in the same color as the protected group to which they belong and highlighted with a grey circle.
Pressure measurement simulation For security reasons, only a limited number of real-world data sets based on such systems are available. Therefore, to evaluate methods such as the H-method presented in subsection “Leakage detection instantiations”, data has to be simulated.
For Hanoi, we generate pressure measurements with a time window of min. using the atmn toolbox (cf. Vaquet et al., 2023). The pressure is simulated at the sensor nodes displayed in Fig. 2 and for different leakage scenarios, which differ in the leakage location and size. As the WDS is relatively small, we are able to simulate a leakage at each node in the system and for three different diameters . In total, the data set is balanced with respect to the label, i.e., with respect to whether ( ) or not ( ) a leakage is present at the time of the considered sample.
For L-Town, we generate pressure measurements with a time window of min. as used in the work of Ashraf et al. (2023). The pressure is simulated at the sensor nodes displayed in Fig. 3 and for different leakage scenarios. Due to the larger system size, we are only able to simulate a leakage at some nodes in the system and for three different diameters cm11 .
Pressure residual computation Subsequently, in order to obtain the pressure residuals required for the H-method or other methods, virtual sensor predictions have to be generated (Fig. 1). For Hanoi, we train and use linear virtual sensors with a preprocessing hyperparameter of as done by Artelt et al. (2022). For L-Town, we train and use GCN virtual sensors (cf. subsection “Leakage detection instantiations”).
Protected groups Finally, the protected groups as introduced in subsection “Fairness in leakage detection” are displayed in Figs. 2 and 3 as well. Here, we work with different groups for both the Hanoi and the L-Town WDS.
Experimental results and analysis: Residual-based ensemble leakage detection does not obey fairness
In Table 3, the results of the H-method presented in subsection “Leakage detection instantiations” are shown. The hyperparameter is chosen manually per diameter such that the test accuracy is close to maximal. On the one hand, we see that independent of the WDS and the virtual sensors used, in general, the larger the leakage size, the better the method performs in terms of accuracy , as larger leakages are associated with larger pressure drops. Moreover, the method is capable of detecting even small leakages with high(er) accuracy in larger (and therefore, more realistic) WDSs (cf. footnote 11 for details).
| 5 | 0.6223 | 0.8468 | 0.4880 | 0.5763 | 0.3558 | 0.5763 | 0.3588 | 
| 10 | 0.7998 | 0.9983 | 0.6372 | 0.6383 | 0.3611 | 0.6383 | 0.3611 | 
| 15 | 0.8837 | 1.0000 | 0.6402 | 0.6402 | 0.3598 | 0.6402 | 0.3598 | 
(a) Hanoi WDS and linear virtual sensors.
| 1.9 | 0.7034 | 0.8935 | 0.4828 | 0.5404 | 0.4107 | 0.5404 | 0.4107 | 
| 2.3 | 0.8346 | 1.0000 | 0.6652 | 0.6652 | 0.3348 | 0.6652 | 0.3348 | 
| 2.7 | 0.8476 | 1.0000 | 0.4254 | 0.4254 | 0.5746 | 0.4254 | 0.5746 | 
(b) L-Town WDS and GCN virtual sensors.
On the other hand, and more importantly, we see that independent of the WDS and the virtual sensors used, the method is unfair in terms of disparate impact score DI, where a value of 0.8 or larger is desirable (cf. Zafar et al., 2017b), and equal opportunity score EO. However, the experimental evaluation confirms the mathematical findings of corollary 3.2 by comparing the column of the disparate impact score calculated according to definition 2.7 (DI) to the one according to corollary 3.2.2 ( ), and the column of the equal opportunity score calculated according to definition 2.15 (EO) to the one according to corollary 3.2.1 ( ). This also justifies that in our setting, the usage of one of the two scores is sufficient. Therefore, from now on, we mostly work with the disparate impact score DI only.
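The agreement between the columns can be checked by hand on the rows of Table 3. The sketch below rests on an assumption that the numbers suggest but that is not stated in the text here: that the third and fourth columns of Table 3 hold the largest and the smallest group-wise detection rate, respectively. Under this reading, their ratio and their difference reproduce the last two columns up to rounding. The dictionary layout and the function name are illustrative.

```python
# (WDS, diameter): (assumed largest rate, assumed smallest rate,
#                   second-to-last column, last column) copied from Table 3.
TABLE_3 = {
    ("Hanoi", 5): (0.8468, 0.4880, 0.5763, 0.3588),
    ("Hanoi", 10): (0.9983, 0.6372, 0.6383, 0.3611),
    ("Hanoi", 15): (1.0000, 0.6402, 0.6402, 0.3598),
    ("L-Town", 1.9): (0.8935, 0.4828, 0.5404, 0.4107),
    ("L-Town", 2.3): (1.0000, 0.6652, 0.6652, 0.3348),
    ("L-Town", 2.7): (1.0000, 0.4254, 0.4254, 0.5746),
}

def check_table(rows, tol=2e-4):
    """Compare ratio and difference of the two rates with the reported scores."""
    report = {}
    for key, (largest, smallest, ratio_col, diff_col) in rows.items():
        ratio_ok = abs(smallest / largest - ratio_col) <= tol
        diff_ok = abs((largest - smallest) - diff_col) <= tol
        report[key] = (ratio_ok, diff_ok)
    return report

REPORT = check_table(TABLE_3)
```

The tolerance accounts for the four-digit rounding of the tabulated values. None of the ratios reaches the value of 0.8 mentioned above as desirable.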
Fairness-enhancing leakage detection in water distribution networks
Motivated by the result that the standard leakage detection method presented in subsection “Leakage detection instantiations” does not satisfy the notions of fairness, as another main contribution of this work, we modify this H-method to enhance fairness as introduced in subsection “Generalized notions of group fairness in machine learning”. The main idea is based on the fact that in the H-method the only models trained are the virtual sensors for all (cf. subsection “Leakage detection instantiations”). However, given these virtual sensors and resulting residuals , as well as labels for times , we can turn the choice of the hyperparameter of the ensemble classifier (cf. Eq. (3.4)) into an optimization problem (OP), where now acts as a parameter. The corresponding hypothesis space is (cf. subsection “Mathematical notation for machine learning”).
In the following section, we therefore propose (subsection “Methodology of fairness-enhancing leakage detection”) and evaluate (subsection “Experimental results and analysis”) different methods that, in contrast to the H-method, are optimization-based and aim at optimizing the parameter in order to obtain an optimal ensemble classifier . Optimality hereby depends on the OP at hand: On the one hand, these methods are further baselines, where treating the modeling problem as an OP enables us to optimize the result of the H-method itself without fairness considerations. On the other hand, we consider fairness-enhancing methods, where the parameter needs to be optimized such that the resulting ensemble classifier is simultaneously as accurate and fair on the given training data as possible.
Methodology of fairness-enhancing leakage detection
The following methods define training algorithms to find an optimal ensemble classifier . The scores considered in these algorithms rely on labeled training data 12 , which also holds samples based on leaky states of the WDS. For simplicity, we omit the dependence of all considered functions on the training data .
Fair leakage detection framework
In general, a learning problem such as the training of an optimal ensemble classifier can be phrased as an OP, where the objective is to minimize some suitable loss function over the hypothesis space , or more precisely, with respect to the parameter , based on its evaluations on the training data :
(4.7) The advantage of redefining the choice of hyperparameters (which is what we do in the H-method) as an OP is that we can now extend this OP by side constraints :
(4.8) A typical way of optimizing a constrained OP is to integrate the side constraints in the objective in order to apply unconstrained optimization algorithms. This can be done using a barrier- or penalty function (cf. Nocedal & Wright, 2006). Using such functions, the constrained OP Eq. (4.8)13 can be transformed to
(4.9) Hereby, the hyperparameter regulates the importance of the constraints for all compared to the loss function L.
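The transformation from a constrained to an unconstrained OP can be sketched generically. The code below is a minimal illustration, assuming that the side constraints are written in the standard form g_i(θ) ≤ 0 and that the transformed objective adds the barrier or penalty terms to the loss, weighted by a single hyperparameter (called `mu` here, a hypothetical name). The toy loss and constraint are made up for illustration.

```python
def penalized_objective(loss, constraints, mu, penalty):
    """Build theta -> loss(theta) + mu * sum_i penalty(g_i(theta)).

    loss:        callable theta -> float
    constraints: list of callables g_i with g_i(theta) <= 0 when satisfied
    mu:          weight of the constraints relative to the loss
    penalty:     barrier or penalty function applied to each g_i(theta)
    """
    def objective(theta):
        return loss(theta) + mu * sum(penalty(g(theta)) for g in constraints)
    return objective

# Toy example: minimize theta^2 subject to theta >= 1, i.e. 1 - theta <= 0.
toy_objective = penalized_objective(
    loss=lambda theta: theta ** 2,
    constraints=[lambda theta: 1.0 - theta],
    mu=10.0,
    penalty=lambda g: max(0.0, g),  # zero inside the feasible region
)
```

For a feasible point such as `theta = 2.0` the objective equals the plain loss, while the infeasible point `theta = 0.0` is charged `mu` times the constraint violation.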
Fair leakage detection instantiations
Equation (4.9) gives a general framework on how to train a (fair) leakage detection model based on the general leakage detection pipeline presented in subsection “Leakage detection pipeline”. While the H-method presented in subsection “Leakage detection instantiations” is an instantiation of this pipeline that only requires the training of the virtual sensors, i.e., the first component of the pipeline, the following methods also require the training of the leakage detection model itself, i.e., the third component of the pipeline (cf. Fig. 1).
More precisely, the following methods are instantiations of this third component using the framework proposed in the previous subsection “Fair leakage detection framework”. Hereby, the ensemble classifier on which the loss function L and the side constraints for rely is of the same structure as in the H-method (cf. Eq. (3.4)); the resulting optimal models only differ in their optimal parameter .
We obtain such different optimal parameters by choosing different loss functions L, different side constraints for and different algorithmic choices. In the following, we propose such possible choices. The indices (loss index, constraint index, optimization index and barrier or penalty function index) introduced along the way will later be used for the names of the resulting explicit methods as combinations of such choices. A general scheme of this overall idea as an extension of Fig. 1 is displayed in Fig. 4.
Figure 4: Fair leakage detection framework as an extension of Fig. 1.
Optimizing performance as baseline methods By choosing a typical evaluation score as the loss function L and not using any further (fairness) constraints (i.e., or ), we obtain further baseline methods which output optimized parameters compared to the H-method and by that, with respect to the performance of the leakage detection model , but not with respect to its fairness.
Typical such evaluation scores for a binary classification task are:
- The negative accuracy, i.e. (loss index “ACC”), 
- the negative difference between the TPR and the FPR (loss index “TFPR”). 
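Both baseline losses are straightforward to compute from labels and predictions. The following is a minimal sketch with hypothetical function names; it assumes binary labels and predictions in {0, 1} and that both classes occur in the data.

```python
import numpy as np

def neg_accuracy(y_true, y_pred):
    """Loss index "ACC": negative fraction of correct predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return -float(np.mean(y_true == y_pred))

def neg_tpr_minus_fpr(y_true, y_pred):
    """Loss index "TFPR": negative difference between TPR and FPR."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred, dtype=float)
    tpr = y_pred[y_true == 1].mean()  # alarms among leaky samples
    fpr = y_pred[y_true == 0].mean()  # alarms among leakage-free samples
    return -float(tpr - fpr)
```

Both losses are minimal at -1 for a perfect classifier. Unlike the accuracy, the difference between TPR and FPR does not depend on the class balance of the data.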
Optimizing Performance under Fairness Constraints For the following approaches, the loss function L controls the performance while the constraints control the fairness for all .
Choice of performance loss functions: When optimizing the performance under fairness constraints, we choose the same loss functions as when optimizing the performance without fairness constraints as introduced in the previous paragraph.
Choice of fairness constraints: As done in our previous work (cf. Strotherm & Hammer, 2023), in terms of fairness constraints, we make use of the covariance between the sensitive feature(s) and the prediction of the ensemble classifier. For technical reasons14 , we have to transform the non-binary sensitive feature to K binary sensitive features , each of which answers the question of whether ( ) or not ( ) a leakage is active in group for all .15 Using that holds for all realisations , for all binary sensitive features for , the empirical covariance between a single sensitive feature and the model is given by
(4.10) The usage of the (empirical) covariance as a proxy for fairness is based on the idea that group fairness of a model , or more precisely, a high disparate impact score on which we focus in this work, relies on the assumption of being independent of the sensitive feature S (cf. subsection “Independence”), or in our case, each of the sensitive features for . As the independence of two random variables implies their covariance being equal to zero, the latter can be interpreted as a necessary condition for fairness. For more information on this intuition, but also on how our contributions are generalizations of the work of Zafar et al. (2017b), we refer to our previous work Strotherm & Hammer (2023).
Motivated by that, we require to hold, or, equivalently formulated in standard form:
- We require to hold for all (i.e., in Eq. (4.8)). Hereby, the hyperparameter regulates how much the covariance’s absolute value is bounded and therefore, the desired fairness (constraint index “COV”). 
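The covariance-based constraints can be sketched as follows. This is an illustration under stated assumptions: the group of each sample is encoded as an integer in 1, ..., K, with 0 as a hypothetical code for samples without an active leakage, the empirical covariance is the mean of the centered sensitive feature times the prediction, and the bound is called `c`. None of these names or encodings is taken from the paper.

```python
import numpy as np

def binarize_groups(s, n_groups):
    """Turn group labels in {0 (no leak, assumed code), 1..K} into K binary features."""
    s = np.asarray(s)
    return (s[:, None] == np.arange(1, n_groups + 1)[None, :]).astype(float)

def empirical_covariances(s_binary, y_pred):
    """Empirical covariance between each binary sensitive feature and the prediction."""
    y_pred = np.asarray(y_pred, dtype=float)
    centered = s_binary - s_binary.mean(axis=0)
    # Centering one factor suffices, since the centered feature sums to zero.
    return centered.T @ y_pred / len(y_pred)

def covariance_constraints(s_binary, y_pred, c):
    """Standard form g_k <= 0: absolute covariance of feature k bounded by c."""
    return np.abs(empirical_covariances(s_binary, y_pred)) - c
```

A vanishing covariance is only a necessary condition for independence, so satisfying these constraints does not by itself guarantee a high disparate impact score.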
Optimizing Fairness under Performance Constraints For the following approaches, the loss function L controls the fairness while the constraints control the performance for all .
Choice of fairness loss functions: As done in our previous work Strotherm & Hammer (2023), we choose the disparate impact score as a loss function. Moreover, as elaborated in the conclusion of our previous work Strotherm & Hammer (2023) and similar to Zafar et al. (2017b), we additionally change the role of the empirical covariance by optimizing a fairness proxy similar to the one introduced in Eq. (4.10) directly. Therefore, taking into account that we have multiple sensitive values, two reasonable loss functions are:
- The sum of absolute values of the empirical covariance between a single sensitive feature and the model for all (loss index “Cov”), 
- the negative disparate impact score (definition 2.7) (loss index “DI”). 
Choice of performance constraints: In terms of performance constraints, we stick to the choice of the accuracy , which is only allowed to differ by some percentage of the optimal accuracy obtained when training without fairness constraints (cf. Strotherm & Hammer, 2023; Zafar et al., 2017b). More precisely, we require or, equivalently formulated in standard form:
- We require to hold (i.e., in Eq. (4.8)). Hereby, the hyperparameter regulates how much the accuracy is allowed to differ from the optimal accuracy obtained, e.g., by another baseline method (constraint index “ACC”). 
By that, it indirectly regulates the fairness as well, as the more the accuracy is allowed to differ from the optimal accuracy, the larger the feasible subspace of gets and by that, the more the fairness as the loss in the objective can be optimized.
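As a small worked example of this constraint, the following sketch assumes the requirement has the form "accuracy at least (1 - lam) times the optimal accuracy", in the spirit of Zafar et al. (2017b). The exact form used in the paper is not reproduced here, and `lam`, `acc_opt` and the numbers are hypothetical.

```python
def accuracy_constraint(acc, acc_opt, lam):
    """Standard form g <= 0 for the ASSUMED requirement acc >= (1 - lam) * acc_opt."""
    return (1.0 - lam) * acc_opt - acc


acc_opt = 0.92  # hypothetical accuracy of the unconstrained baseline
for lam in (0.01, 0.05, 0.10):
    lowest_feasible = (1.0 - lam) * acc_opt
    feasible = accuracy_constraint(0.88, acc_opt, lam) <= 0
    print(lam, round(lowest_feasible, 4), feasible)
```

A model with accuracy 0.88 is infeasible for a tolerance of 1% but feasible for 5% and 10%, which illustrates how a larger tolerance enlarges the feasible region in which the fairness loss can be optimized.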
Algorithmic choices Next to the choices of loss function and constraints, the final methods also differ depending on the algorithmic choices made, e.g., which optimization algorithm and which barrier or penalty function is used (cf. Eq. (4.9)).
One question to answer when choosing an optimization algorithm is whether the considered objective of the OP is (continuously) differentiable. In the setting of ML, the objective clearly depends on the model’s predictions for all samples. However, in view of the ensemble classifier’s definition (cf. Eq. (3.4)), the prediction is not differentiable with respect to the model parameter.
Therefore, depending on whether we choose a differentiable (db) or non-differentiable (ndb) optimization approach, we may need to approximate the model:
- If we want to use a gradient-based optimization technique, we make the model differentiable by approximating each indicator function by the sigmoid function with a steepness hyperparameter (optimization index “db”). Replacing the ensemble classifier’s prediction (cf. Eq. (3.4)) by this smooth surrogate for all samples yields a differentiable approximation of the model. Hereby, we replace the threshold 1 of the exact ensemble classifier with a hyperparameter to handle the uncertainty of the sigmoid function around zero. 
- If we want to use a non-gradient-based optimization technique, we do not make any changes (optimization index “ndb”). 
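The sigmoid approximation can be sketched as follows. Eq. (3.4) is not reproduced in this section, so the sketch assumes that the exact ensemble classifier raises an alarm as soon as at least one sensor residual exceeds its threshold, which is consistent with the "threshold 1" mentioned above. The names `exact_ensemble`, `smooth_ensemble`, `theta`, `b` and `T`, the default values and the toy residuals are assumptions.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid


def exact_ensemble(residuals, theta):
    """ASSUMED exact classifier: alarm if at least one residual exceeds its threshold."""
    votes = (np.abs(residuals) > theta).sum(axis=-1)
    return (votes >= 1).astype(float)


def smooth_ensemble(residuals, theta, b=100.0, T=0.8):
    """Differentiable surrogate: every indicator is replaced by a sigmoid with
    steepness b, and the outer threshold 1 is replaced by T."""
    soft_votes = expit(b * (np.abs(residuals) - theta)).sum(axis=-1)
    return expit(b * (soft_votes - T))


# Toy residuals for two samples and two sensors (hypothetical values).
theta = np.array([0.3, 0.3])
residuals = np.array([[0.5, 0.1],    # first sensor clearly exceeds its threshold
                      [0.05, 0.1]])  # no sensor exceeds its threshold

print(exact_ensemble(residuals, theta))
print(smooth_ensemble(residuals, theta))
print(smooth_ensemble(residuals, theta, T=1.0))
```

With an outer threshold of exactly 1, a single clearly firing sensor yields a soft vote count of about 1 and hence an ambiguous output of about 0.5; lowering the threshold slightly removes this ambiguity, which is one way to read the remark on the sigmoid’s behavior around zero.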
For more details on that, we refer to our previous work Strotherm & Hammer (2023). Depending on what optimization algorithm is used, different (differentiable or non-differentiable) barrier or penalty functions can be used. In this work, we make use of
- the barrier function (barrier function index “log”) and 
- the penalty function (penalty function index “max”). 
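The two options can be sketched for constraints given in standard form g <= 0. The exact functions used in Eq. (4.9) are not reproduced here; the sketch assumes the textbook logarithmic barrier and a linear max-penalty, and the weight `mu` as well as all function names are hypothetical.

```python
import numpy as np


def log_barrier(g, mu):
    """ASSUMED logarithmic barrier: finite only strictly inside the feasible
    region (all g < 0) and growing without bound towards its boundary."""
    g = np.asarray(g, dtype=float)
    if np.any(g >= 0):
        return np.inf
    return -mu * np.log(-g).sum()


def max_penalty(g, mu):
    """ASSUMED penalty: zero while a constraint holds, linear in the violation."""
    g = np.asarray(g, dtype=float)
    return mu * np.maximum(0.0, g).sum()


def augmented_objective(loss_value, g, mu, kind="log"):
    """Unconstrained surrogate: loss plus barrier ("log") or penalty ("max") term."""
    extra = log_barrier(g, mu) if kind == "log" else max_penalty(g, mu)
    return loss_value + extra


print(augmented_objective(0.3, [-0.5, -0.2], mu=0.2, kind="log"))
print(augmented_objective(0.3, [-0.5, 0.2], mu=100.0, kind="max"))
```

Under these assumptions, a barrier keeps the iterates strictly feasible and therefore needs a feasible initial parameter, whereas a penalty tolerates infeasible iterates but charges for every violation.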
Explicit methods Finally, after having presented all possible choices, we obtain the following explicit methods using the following nomenclature:
loss index+[constraint index–optimization index–barrier/penalty function index]. Each resulting fairness-enhancing method comes with a corresponding baseline method to which it will be compared in the evaluation:
- the fairness-enhancing TFPR+COV-db-log-method with corresponding baseline TFPR-db-method, 
- the fairness-enhancing TFPR+COV-ndb-log- and TFPR+COV-ndb-max-method with corresponding baseline TFPR-ndb-method, 
- the fairness-enhancing ACC+COV-db-log-method with corresponding baseline ACC-db-method, 
- the fairness-enhancing ACC+COV-ndb-log- and ACC+COV-ndb-max-method with corresponding baseline ACC-ndb-method, 
- the fairness-enhancing COV+ACC-ndb-log- and COV+ACC-ndb-max-method also with corresponding baseline ACC-ndb-method and 
- the fairness-enhancing DI+ACC-ndb-log- and DI+ACC-ndb-max-method also with corresponding baseline ACC-ndb-method. 
The first four items refer to the fairness-enhancing methods where performance is optimized under fairness constraints, and the last two items refer to the fairness-enhancing methods where fairness is optimized under performance constraints.
Experimental results and analysis
Based on the pressure measurements in the Hanoi WDS and the pressure residuals we obtain from these measurements by making use of the virtual sensors (cf. subsection “Application domain and data set”), we test all methods introduced in subsections “Leakage detection instantiations” (H-method) and “Fair leakage detection instantiations in practice”. Afterwards, we will test the best performing method on the data associated with the more complex and more realistic L-Town WDS.
Training and testing setup: To test the considered methods, a model is trained per method and per leakage diameter on training data (40% of the overall data) and evaluated on test data (60% of the overall data). For the training, the different OPs presented in subsection “Methodology of fairness-enhancing leakage detection” are solved using the BFGS algorithm (cf. Nocedal & Wright, 2006) in case of a differentiable OP and the downhill simplex search algorithm, also known as the Nelder-Mead algorithm (cf. Gao & Han, 2012), in case of a non-differentiable OP in order to find the optimal parameter of the leakage detection model.
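Both optimizers are available through `scipy.optimize.minimize`. The following sketch only shows how each solver is selected; the toy objectives, the target vector and the initial parameter are hypothetical stand-ins for the OPs of this work, and whether the published implementation uses SciPy in this way is an assumption.

```python
import numpy as np
from scipy.optimize import minimize

target = np.array([1.0, 2.0])  # hypothetical optimum of the toy objectives


def smooth_objective(theta):
    """Differentiable toy objective, suitable for a gradient-based solver."""
    return float(np.sum((theta - target) ** 2))


def nonsmooth_objective(theta):
    """Toy objective with kinks, handled by a gradient-free solver."""
    return float(np.sum(np.abs(theta - target)))


theta0 = np.array([0.5, 0.5])  # initial parameter

# BFGS approximates gradients by finite differences when no jac is supplied.
res_db = minimize(smooth_objective, theta0, method="BFGS")
# Nelder-Mead only compares function values and needs no gradient.
res_ndb = minimize(nonsmooth_objective, theta0, method="Nelder-Mead")

print(res_db.x, res_db.fun)
print(res_ndb.x, res_ndb.fun)
```

Both solvers are local methods, so on non-convex objectives the result depends on the initial parameter.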
The implementation of all methods and all our results can be found on GitHub (https://github.com/jstrotherm/FairnessInWDSs_extended).
Hanoi
Initial parameters Optimization algorithms require an initial start parameter. For the experiments on the Hanoi WDS, we use the hyperparameter found for the H-method (cf. subsection “Methodology of leakage detection”) as an initial parameter for the remaining optimization-based methods (cf. subsection “Methodology of fairness-enhancing leakage detection”).
Hyperparameters While the parameters are now outputs of these optimization-based methods, the methods themselves depend on further hyperparameters. In Table 4, an overview of these hyperparameters is displayed per method (and if required, per diameter). We choose suitable values for the hyperparameters of the barrier or penalty function and of the sigmoid approximation and keep them fixed afterwards. In contrast, the fairness-hyperparameters, i.e., the hyperparameters that regulate the fairness directly (covariance constraint) or indirectly (accuracy constraint), are changed to obtain different score combinations of performance, measured by the accuracy score ACC, and fairness, measured by the disparate impact score DI or the equal opportunity score EO. We do so by starting with a hyperparameter that causes perfect fairness, i.e., a disparate impact score of 1.0, whenever possible, and then increasing or decreasing it, respectively, in steps of 0.01 until the resulting fairness-enhanced model achieves a disparate impact score equal to or worse than that of its corresponding baseline method (cf. paragraph “Explicit methods” in subsection “Fair leakage detection instantiations” or Table 4 for the corresponding baseline method per fairness-enhancing method).
| Method | Fairness-hyperparameter (covariance constraint) | Fairness-hyperparameter (accuracy constraint) | Barrier/penalty hyperparameter (for diameters 5, 10, 15) | Sigmoid hyperparameter | T |
|---|---|---|---|---|---|
| TFPR-db (b) | – | – | – | 100 | 0.8 |
| TFPR+COV-db-log | varied | – | 0.10, 0.20, 0.20 | 100 | 0.8 |
| TFPR-ndb (b) | – | – | – | – | – |
| TFPR+COV-ndb-log | varied | – | 0.20, 0.25, 0.25 | – | – |
| TFPR+COV-ndb-max | varied | – | 100 | – | – |
| ACC-db (b) | – | – | – | 100 | 0.8 |
| ACC+COV-db-log | varied | – | 0.15, 0.05, 0.05 | 100 | 0.8 |
| ACC-ndb (b) | – | – | – | – | – |
| ACC+COV-ndb-log | varied | – | 0.2, 0.3, 0.05 | – | – |
| ACC+COV-ndb-max | varied | – | 100 | – | – |
| COV+ACC-ndb-log | – | varied | 0.01, 0.01, 0.01 | – | – |
| COV+ACC-ndb-max | – | varied | 100 | – | – |
| DI+ACC-ndb-log | – | varied | 0.05, 0.025, 0.04 | – | – |
| DI+ACC-ndb-max | – | varied | 100 | – | – |
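The sweep over a fairness-hyperparameter described above can be sketched as a simple loop. Here, `train_and_score` is a hypothetical stand-in for training a model and evaluating its accuracy and disparate impact score, and the toy relation between hyperparameter and scores is invented purely for illustration; it does not reflect results of this work.

```python
def sweep(train_and_score, start, step, baseline_di, max_steps=1000):
    """Vary a fairness-hyperparameter in fixed steps until the disparate impact
    score is no longer better than the one of the baseline method.

    train_and_score : callable mapping a hyperparameter value to (acc, di);
                      hypothetical stand-in for training and evaluation.
    step            : positive to increase, negative to decrease the value.
    """
    results = []
    value = start
    for _ in range(max_steps):
        acc, di = train_and_score(value)
        if di <= baseline_di:
            break
        results.append((value, acc, di))
        value = round(value + step, 10)  # avoid accumulating rounding errors
    return results


def toy_model(c):
    """Invented toy relation: a looser bound c raises accuracy and lowers DI."""
    return 0.5 + c, 1.0 - 2.0 * c


results = sweep(toy_model, start=0.0, step=0.01, baseline_di=0.61)
print(len(results), results[0], results[-1])
```

Each tuple in `results` corresponds to one score combination of the kind reported in the evaluation.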
Results With these settings in mind, we obtain the following results. As we test 15 methods in total, namely five baseline methods (the H-method and the ones proposed in subsection “Fair leakage detection instantiations”) and ten fairness-enhancing methods (cf. subsection “Fair leakage detection instantiations”), we only present the key findings in this section and defer further detailed findings regarding the comparison of all these methods to Appendix B.
Moreover, for a better overview of the results, we divide the ten fairness-enhancing methods into four subcategories: The TFPR-methods including all methods with loss index “TFPR”, and analogously the ACC-methods, the COV-methods and the DI-methods.
In some of the results, these methods are presented together with their corresponding baseline methods. Note that two methods from the same subcategory can have different corresponding baseline methods (cf. paragraph “Explicit methods” in subsection “Fair leakage detection instantiations” or Table 4).
Increasing fairness: In Fig. 5, we see the performance and fairness of some exemplary trained ensemble classifiers measured by accuracy and disparate impact score, respectively. For the fairness-enhancing methods, testing different fairness-hyperparameters yields a range of scores per method, which is shown as error bars. The height of each bar corresponds to the mean accuracy or disparate impact score achieved by the method over all hyperparameter values tested. The error bars extend from the lowest to the highest value of the respective score.
Figure 5: Accuracy and disparate impact score per method and leakage diameter in the Hanoi-WDS for different fairness-hyperparameters.
We see that, compared to their corresponding baseline methods, the fairness-enhancing methods on average increase fairness while decreasing accuracy. However, the average increase in fairness is larger than the average decrease in accuracy. For details regarding different diameters, the score ranges and the other methods, we refer to Appendix B. Based on these, one can say that fairness and overall performance are mutually dependent to about the same extent.
In addition to that, Fig. 6 shows the performance and indirectly, also the fairness of some exemplary trained ensemble classifiers measured by the TPR per group. The height of the bars and the range of the error bars behave analogously to Fig. 5.
Figure 6: TPR per method, group and leakage diameter in the Hanoi-WDS for different fairness-hyperparameters.
In view of the definition of the equal opportunity score (cf. definition 2.15) and due to the fact that this score is equivalent to the disparate impact score in our domain of application (cf. lemma 3.1 and corollary 3.2), in our context, the more similar the TPRs per group are, the fairer a method is. This is what we observe in Fig. 6 (and Fig. B.2) when comparing the TPRs among groups for the fairness-enhancing methods to the TPRs among groups for their corresponding baseline methods. Moreover, Fig. 6 (and Fig. B.2) shows that the average increase in fairness observed in Fig. 5 (and Fig. B.1) is obtained not only by decreasing the performance of the group that performs best under the corresponding baseline method, but also by increasing the performance of the group that performs worst under it. For some methods, the TPRs of all groups even increase on average.
The coherence of fairness and overall performance, and non-optimality: While Figs. 5 and 6 only hint at the relationship between fairness and overall performance, measured by disparate impact and accuracy score, respectively, a more detailed visualization of how fairness is related to the overall performance of a model can be found in Fig. 7. For each tested fairness-hyperparameter of the respective fairness-enhancing method, the obtained score combinations, i.e., the accuracy and the disparate impact score, are visualized for some exemplary trained ensemble classifiers.
Figure 7: Coherence of accuracy and disparate impact score for the different fairness-enhancing methods and different leakage sizes in the Hanoi-WDS, based on different fairness-hyperparameters.
The cross data points visualize the accuracy and disparate impact score of the corresponding baseline methods (cf. paragraph “Explicit methods” in subsection “Fair leakage detection instantiations” or Table 4).

The characteristic curve that can be observed in most of the sub-images is called the Pareto front, visualizing that an increase in fairness is accompanied by a reduction in accuracy and vice versa. Note that the non-optimal solutions off the Pareto front in Fig. 7, and also the local jumps visible later in Fig. 8, can be explained by the non-convexity of the objective functions. Because of that, the solutions found strongly depend on the initial parameter and might not correspond to the global optimum.
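To make the notion of non-optimal score combinations concrete, the following sketch extracts the non-dominated (accuracy, disparate impact) pairs from a set of results, where both scores are to be maximized. The function name and the score pairs are made up for illustration and are not results of this work.

```python
def pareto_front(points):
    """Return the non-dominated (accuracy, disparate impact) pairs.

    A pair is dominated if another pair is at least as good in both scores
    and strictly better in at least one of them.
    """
    # Sort by accuracy (descending), break ties by disparate impact (descending).
    ordered = sorted(set(points), key=lambda p: (-p[0], -p[1]))
    front, best_di = [], float("-inf")
    for acc, di in ordered:
        # Accuracy can only get worse along the sorted list, so a pair is
        # kept only if it strictly improves the best fairness seen so far.
        if di > best_di:
            front.append((acc, di))
            best_di = di
    return front


# Made-up score combinations (accuracy, disparate impact).
scores = [(0.95, 0.5), (0.93, 0.7), (0.90, 0.65),
          (0.88, 0.8), (0.5, 1.0), (0.91, 0.7)]
print(pareto_front(scores))
```

In this example, the pairs (0.90, 0.65) and (0.91, 0.7) lie off the front because (0.93, 0.7) is at least as good in both scores.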
Figure 8: Coherence of accuracy, disparate impact, equal opportunity and the training hyperparameter for different fairness-enhancing methods and different leakage sizes in the Hanoi-WDS.
Nevertheless, with most fairness-enhancing methods, a desired disparate impact score of about 0.8 can be achieved at the cost of a decrease in accuracy of approximately 0.03–0.06 points below the optimal accuracy obtained by the corresponding baseline methods (depending on the specific method used). Hereby, both fairness and overall performance can be influenced by the fairness-hyperparameters. Deciding which choice of fairness-hyperparameter is optimal, and thereby deciding on the trade-off between fairness and overall performance, is a difficult task that depends on the impact of the decisions of the underlying model as well as on legal requirements. Regarding legal requirements, by not using the sensitive features for the decision making of the algorithms, the methods presented can satisfy the legal definitions of disparate treatment and disparate impact (depending on the hyperparameter chosen) simultaneously.
Another observation is that the largest accuracies of the fairness-enhancing methods are usually approximately as good as the accuracy of their corresponding baseline methods while achieving equal or better fairness results. In the opposite direction, perfect fairness, i.e., a disparate impact score of 1.0, can be achieved at the cost of the worst possible accuracy. Depending on the method, the jump in disparate impact and accuracy score when reaching this extreme is either abrupt or more fine-grained: Especially the COV- and the DI-methods, which optimize fairness while constraining the accuracy, allow fine-grained variations, because the accuracy constraint is less sensitive than the covariance constraints.
However, some of the TFPR- and the ACC-methods, which optimize performance while constraining the fairness, also allow fine-grained variations in both scores. This motivates us to investigate the different methods within the chosen subcategories as well. We do so in Appendix B.
Here, we find that the DI+ACC-ndb-max-method provides the best results while also providing the benefit of only requiring a few hyperparameters which are easy to choose. This finding makes the DI+ACC-ndb-max-method the best candidate to be tested on a more complicated and thereby more realistic WDS, as we will do in subsection “L-Town”. However, before we do so, we further investigate the relation between the performance and fairness scores and the fairness-hyperparameters.
The influence of the fairness-hyperparameters on fairness and overall performance: In Fig. 8, we show, for the best-performing method of the TFPR-methods and of the DI-methods, how the hyperparameters are related to disparate impact and accuracy and, this time, also to the equal opportunity score. (The results for the ACC-methods look similar to those of the TFPR-methods, and the results for the COV-methods look similar to those of the DI-methods.) Each of the three scores is plotted against the hyperparameter of the corresponding fairness-enhancing method tested.
For the TFPR+COV-ndb-log-method (and the ACC+COV-ndb-log-method), decreasing the hyperparameter improves the fairness measures and lowers the performance measure. This can be explained as follows: A high empirical covariance between a sensitive feature and the prediction of the ensemble classifier means that the relative number of positive predictions within the related group differs significantly from the relative number of positive predictions within a group with small covariance. Thus, the more the covariance is constrained by the hyperparameter, the fewer such extreme differences in the relative number of positive predictions across groups occur, leading to a better fairness score. In the case of disparate impact, this means a (better) higher score at the expense of a (worse) lower overall performance, compared to the overall performance in the unconstrained case or under a looser constraint, that is, a larger bound. In the case of equal opportunity, it means a (better) lower score at the expense of a (worse) lower overall performance.
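The link between covariance and positive prediction rates can be checked numerically. Assuming the sensitive feature is a 0/1 group indicator (an assumption of this sketch), the empirical covariance with a binary prediction equals p(1 - p) times the difference between the positive prediction rate inside and outside the group, where p is the group’s share of the data. The data below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=1000)  # 0/1 group indicator (assumed encoding)
y = rng.integers(0, 2, size=1000)  # binary predictions (random toy data)

# Empirical covariance between group membership and prediction.
cov = np.mean(s * y) - s.mean() * y.mean()

# The same quantity expressed through positive prediction rates.
p = s.mean()                 # share of samples belonging to the group
rate_in = y[s == 1].mean()   # positive prediction rate inside the group
rate_out = y[s == 0].mean()  # positive prediction rate outside the group
via_rates = p * (1 - p) * (rate_in - rate_out)

print(cov, via_rates)
```

Bounding the absolute covariance therefore bounds the gap between the two rates, scaled by p(1 - p), which is why a tighter bound pushes the rates of the groups closer together.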
In contrast, for the DI+ACC-ndb-max-method (and the COV+ACC-ndb-log-method), increasing the hyperparameter improves the fairness measures and lowers the performance measure, because a higher hyperparameter allows a larger deviation from the optimal accuracy score. Thus, the feasible search space is extended and a worse accuracy is penalized less or not at all, so that the fairness score in the objective can be optimized to a larger extent.
Equivalence of disparate impact and equal opportunity: Moreover, our theoretical results from lemma 3.1 and corollary 3.2 can also be observed in practice: The curves relating the equal opportunity score to the hyperparameters in Fig. 8 equal those for the disparate impact score in the same figure, reflected along the horizontal axis through the point (0, 0.5). This empirically confirms the equivalence of both fairness measures proven theoretically in lemma 3.1 and corollary 3.2. Nevertheless, note that this is an application-specific result and does not hold in general.
Finally, as another new contribution compared to our previous work Strotherm & Hammer (2023), we will test the best-performing DI+ACC-ndb-max-method on a more complex and thereby more realistic WDS, L-Town, using the more powerful GCN-virtual sensors incorporated into the leakage detection method.
L-Town
Initial parameters The dimension of the search space depends on the number of sensors and is larger in L-Town than in Hanoi (cf. subsection “Application domain and data set”). Consequently, chances are high that the graph of the objective function that needs to be optimized in each of the presented optimization-based methods (cf. subsection “Methodology of fairness-enhancing leakage detection”) gets more complex and exhibits more saddle points and local minima. This intuition turns out to be true in practice, where the choice of the initial parameter is crucial to the success of the methods tested. Therefore, for the experiments on the L-Town WDS, we use the hyperparameter found for the H-method (cf. subsection “Methodology of leakage detection”) only as an initial parameter for the ACC-ndb-method, which is the corresponding baseline method for the DI+ACC-ndb-max-method (cf. paragraph “Explicit methods” in subsection “Fair leakage detection instantiations” or Table 4) that turned out to work best in the previous subsection “Hanoi”. Using the same initial parameter for the DI+ACC-ndb-max-method itself did not provide optimal results: the Pareto front obtained did not end in the score combination of the corresponding baseline method. Therefore, we subsequently use the parameter found by the ACC-ndb-method as an initial parameter for the DI+ACC-ndb-max-method.
Hyperparameters In view of Table 4, the ACC-ndb-method does not require the choice of any hyperparameters. For the DI+ACC-ndb-max-method, we vary the fairness-hyperparameter and choose the remaining hyperparameter as discussed in subsection “Hanoi”.
Results Similar to Fig. 7 for Hanoi, Fig. 9 shows the relation between the fairness and the overall performance of the trained model applied to L-Town.
Figure 9: Coherence of accuracy and disparate impact score for the DI+ACC-ndb-max-method and different leakage sizes in the L-Town-WDS.
The cross data points visualize the accuracy and disparate impact score of the corresponding baseline, the ACC-ndb-method.

The observations for L-Town are similar to those for Hanoi. Although at first it seems that there are fewer score combinations off, or more precisely below, the Pareto front compared to the results of the same method applied to Hanoi, some score combinations above the seemingly optimal Pareto front suggest the existence of an even better Pareto front, which is not fully observed due to the non-convexity of the OP.
Nevertheless, a desired disparate impact score of about 0.8 can be achieved at the cost of a decrease in accuracy of approximately 0.1 and 0.01 points below the optimal accuracy obtained, depending on the leakage diameter. For the remaining diameter, the leakages are already detected almost perfectly and fairly by the corresponding baseline ACC-ndb-method. Even so, the fairness-enhancing DI+ACC-ndb-max-method is better by approximately 0.015 points in disparate impact score with barely any loss in accuracy.
Finally, similar to Fig. 8 for Hanoi, Fig. 10 shows how the hyperparameters are related to accuracy, disparate impact and equal opportunity score in the setting of L-Town. The results go hand in hand with the observations found for Hanoi, and also the equivalence between the two fairness scores can be observed again.
Figure 10: Coherence of accuracy, disparate impact, equal opportunity and the training hyperparameter for the DI+ACC-ndb-max-method and different leakage sizes in the L-Town-WDS.
Additionally, we see by the position of the accuracy curves and the slope of the fairness curves that on the one hand, the better the model performs in general, measured by the accuracy score, the fairer the model is initially, and on the other hand, the harder it is to make the model even fairer.
Conclusion
In this work, we introduced the notion of group fairness in an application domain of high social and ethical relevance, namely in the field of water distribution systems (WDSs). This required the generalization of common group fairness definitions to a single or possibly multiple non-binary sensitive feature(s). To do so, we gave a detailed introduction to the concept of group fairness based on the mathematical concept of independence, derived these generalized definitions from this concept and proved that they coincide with common group fairness definitions in the case of a binary sensitive feature and a binary classification task. We then investigated the fairness issue in the area of leakage detection within WDSs. We showed that standard approaches are not fair in the context of different groups related to the locality within the network. As a remedy, we presented multiple methods that increase fairness of the leakage detection model with respect to the introduced fairness notion while satisfying the legal notions of disparate treatment and disparate impact simultaneously. We tested these methods not only on the Hanoi WDS, but also on the more complex and thereby more realistic L-Town WDS. We empirically demonstrated that fairness and overall performance of the model are interdependent and that the use of hyperparameters provides the ability to trade off fairness and overall performance. However, this trade-off is the responsibility of the policy maker.
From a practical perspective, this trade-off can be achieved by testing different hyperparameters during training, which requires multiple training runs. Hereby, limitations of the proposed methods are their non-convexity and their limited scalability to larger networks, which affects the training time. Future work could investigate this issue. Moreover, the fact that increasing the fairness of a model comes with a loss in accuracy leads to the question of whether this loss is acceptable. While in leakage detection, in practice, detecting as many leakages as possible without observing false positives is a priority, there are further applications in the domain of WDSs that are even more relevant for fairness. So far, tackling these use cases has failed due to the lack of necessary data; this remains for future work. To conclude, research on fairness within the water domain is still at its beginning, and further work on other applications within this domain is crucial.
 
                








