Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation without the selected completely at random assumption

PeerJ Computer Science
A significant portion of the text and experimental results presented in this manuscript was previously published as part of a preprint (https://arxiv.org/abs/2303.08269; Kumar & Lambert, 2023).


Introduction

  • 1)

    We propose PULSCAR, a PU learning algorithm for estimating α when the SCAR assumption holds. It uses kernel density estimates of the positive and unlabeled distributions of ML probabilities to estimate α. The algorithm employs the beta distribution to estimate density and introduces an objective function that provides a rapid, robust estimate of α.

  • 2)

    We propose PULSNAR, a PU learning algorithm for estimating α when the positives are SNAR. It employs a clustering approach to group SNAR positives into subtypes and estimates separate α for each subtype using PULSCAR on positives from each cluster and all unlabeled instances. The final α is calculated by aggregating the α estimated for each subtype.

  • 3)

    We propose methods to calibrate the probabilities of PU examples to their true (unknown) labels and improve the classification performance in SCAR and SNAR settings.
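The divide-and-conquer idea in contribution 2 can be sketched roughly as follows. This is an illustration only, not the authors' implementation: `estimate_alpha_scar` is a hypothetical stand-in for a PULSCAR-style estimator, the cluster labels are assumed to come from any clustering routine, and combining the per-cluster estimates by summation (capped at 1) is an assumption about what "aggregating" means here.

```python
from collections import defaultdict


def pulsnar_style_alpha(positives, cluster_labels, unlabeled, estimate_alpha_scar):
    """Estimate alpha per positive cluster, then aggregate.

    positives           -- labeled positive examples (any type)
    cluster_labels      -- one cluster id per positive example
    unlabeled           -- all unlabeled examples, reused for every cluster
    estimate_alpha_scar -- hypothetical callable(cluster_positives, unlabeled) -> float
    """
    if len(positives) != len(cluster_labels):
        raise ValueError("need exactly one cluster label per positive example")

    # Group the positives by their assigned cluster (subtype).
    clusters = defaultdict(list)
    for example, label in zip(positives, cluster_labels):
        clusters[label].append(example)

    # One SCAR-style estimate per cluster, always against all unlabeled examples.
    per_cluster = {
        label: estimate_alpha_scar(members, unlabeled)
        for label, members in clusters.items()
    }

    # ASSUMPTION: aggregate by summing, and cap at 1 because alpha is a proportion.
    total = min(1.0, sum(per_cluster.values()))
    return total, per_cluster
```

Any estimator with the same call shape can be passed in, which keeps the clustering step and the per-cluster estimation step independent of each other.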

Problem formulation and algorithms

SCAR assumption and SNAR assumption

PU data assumptions

Positive and Unlabeled Learning Selected Completely At Random (PULSCAR) algorithm

Beta kernel density estimation

for $x \in [0,1]$, where $\Gamma$ is the gamma function, $a = 1 + \frac{z}{bw}$ and $b = 1 + \frac{1-z}{bw}$, with $z$ the bin edge (a value in an array of evenly spaced n_bins numbers over the interval [0, 1]), and $bw$ the bandwidth.
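As a minimal sketch of how such a beta kernel density estimate might be computed, the snippet below reads the shape parameters as a = 1 + z/bw and b = 1 + (1 − z)/bw and evaluates the density at each bin edge z by averaging the beta density over the samples. The averaging step, the function names, and the way the bin edges are generated are assumptions made for illustration; the authors' code may differ.

```python
import math


def beta_pdf(x, a, b):
    """Beta density Gamma(a+b) / (Gamma(a) Gamma(b)) * x^(a-1) * (1-x)^(b-1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)


def beta_kernel_density(samples, n_bins, bw):
    """Evaluate a beta kernel density estimate at n_bins evenly spaced edges.

    samples -- values in [0, 1] (for example, predicted probabilities)
    n_bins  -- number of evenly spaced bin edges over [0, 1], at least 2
    bw      -- bandwidth, strictly positive
    """
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    if bw <= 0:
        raise ValueError("bw must be positive")
    if not samples or any(x < 0.0 or x > 1.0 for x in samples):
        raise ValueError("samples must be non-empty and lie in [0, 1]")

    edges = [i / (n_bins - 1) for i in range(n_bins)]
    density = []
    for z in edges:
        a = 1.0 + z / bw
        b = 1.0 + (1.0 - z) / bw
        # ASSUMPTION: the estimate at z is the mean beta density over all samples.
        density.append(sum(beta_pdf(x, a, b) for x in samples) / len(samples))
    return edges, density
```

Because a and b are both at least 1, the kernel stays bounded on [0, 1] and its shape adapts near the boundaries, which is the usual motivation for beta kernels on probability-valued data.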

Histogram bin count

  • i)

    Square root method: Number of bins $= \sqrt{n}$

  • ii)

    Sturges’ rule: Number of bins $= 1 + \log_2(n)$

  • iii)

    Rice’s rule: Number of bins $= 2 \times n^{1/3}$

  • iv)

    Scott’s rule:

    $h = \frac{3.5 \times \mathrm{StandardDeviation}(pr)}{n^{1/3}}$, Number of bins $= \frac{\max(pr) - \min(pr)}{h}$

  • v)

    Freedman–Diaconis (FD) rule:

    $h = \frac{2 \times \mathrm{InterQuartileRange}(pr)}{n^{1/3}}$, Number of bins $= \frac{\max(pr) - \min(pr)}{h}$
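The five rules above can be computed side by side with a short helper. This is a sketch under stated assumptions: `pr` is taken to be a list of predicted probabilities and `n` its length, the population standard deviation and inclusive quartiles are used, results are rounded up to whole bins, and a zero bin width falls back to a single bin. None of these choices is specified in the text.

```python
import math
import statistics


def _bins_from_width(spread, h):
    """Convert a bin width h into a bin count; one bin if the width is zero."""
    if h <= 0:
        return 1  # ASSUMPTION: degenerate data collapses to a single bin
    return max(1, math.ceil(spread / h))


def histogram_bin_counts(pr):
    """Return the bin count suggested by each rule for the values in pr."""
    n = len(pr)
    if n < 2:
        raise ValueError("need at least two values")

    spread = max(pr) - min(pr)
    cube_root_n = n ** (1.0 / 3.0)

    # ASSUMPTION: population standard deviation and inclusive quartiles.
    sd = statistics.pstdev(pr)
    q1, _, q3 = statistics.quantiles(pr, n=4, method="inclusive")

    scott_h = 3.5 * sd / cube_root_n
    fd_h = 2.0 * (q3 - q1) / cube_root_n

    # ASSUMPTION: every rule is rounded up to a whole number of bins.
    return {
        "sqrt": math.ceil(math.sqrt(n)),
        "sturges": math.ceil(1 + math.log2(n)),
        "rice": math.ceil(2 * cube_root_n),
        "scott": _bins_from_width(spread, scott_h),
        "fd": _bins_from_width(spread, fd_h),
    }
```

Computing all five at once makes it easy to compare how strongly the rules disagree on a given set of probabilities before choosing one.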

Beta kernel bandwidth estimation

Positive and Unlabeled Learning Selected Not At Random (PULSNAR) algorithm

Clustering rationale

Determining the number of clusters in the positive set

Calculating calibrated probabilities

Improving classification performance

Experimental methods

Synthetic data

SCAR data

SNAR data

ML benchmark datasets

Estimation of fraction of positives among unlabeled examples

Using the PULSCAR algorithm

Using the PULSNAR algorithm

Results

Synthetic datasets

ML benchmark datasets

SCAR data

SNAR data

Probability calibration

Classification performance metrics

Execution time of PU methods

Discussion and conclusion

Appendix

Algorithm for calibrating probabilities

where $p_c$ is the probability of an unlabeled example from cluster $c$. Since we do not know the subclass of an unlabeled example, Eq. (6) calculates one minus the probability that it is in none of the subclasses, and constrains the probability to be at most 1.
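Read literally, the description above amounts to p = 1 − ∏(1 − p_c) over the clusters c, capped at 1. The following minimal sketch implements that reading; the function name is hypothetical, and treating the per-cluster probabilities as independent is an assumption implied by taking the product.

```python
import math


def combine_cluster_probabilities(cluster_probs):
    """Combine per-cluster probabilities p_c into one probability of being positive.

    Returns one minus the probability that the example belongs to none of
    the subclasses, capped at 1.
    """
    if any(p < 0.0 or p > 1.0 for p in cluster_probs):
        raise ValueError("each probability must lie in [0, 1]")

    # Probability of belonging to none of the subclasses (independence assumed).
    prob_none = math.prod(1.0 - p for p in cluster_probs)
    return min(1.0, 1.0 - prob_none)
```

For example, two clusters that each assign probability 0.5 combine to 0.75, since the chance of belonging to neither is 0.25.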

Experiments and results

Improving classification performance with PULSCAR and PULSNAR

Experiments and results

DEDPUL vs. PULSNAR: alpha estimation

DEDPUL vs. PULSNAR: classification performance

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Praveen Kumar conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Christophe G. Lambert conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The full source code for our algorithms (and the code to generate all of the simulated datasets) is available at GitHub and Zenodo:

- https://github.com/unmtransinfo/PULSNAR.

- Praveen Kumar, & Christophe Lambert. (2024). unmtransinfo/PULSNAR: PULSNAR 0.0.1 (PULSCAR). Zenodo. https://doi.org/10.5281/zenodo.13126647.

The code that generates the SCAR simulated data is available at GitHub: https://github.com/unmtransinfo/PULSNAR/blob/main/examples/pulscar_syn_alpha_estimation_1.py.

The code that generates the SNAR simulated data is available at GitHub: https://github.com/unmtransinfo/PULSNAR/blob/main/examples/pulscar_syn_alpha_estimation_2.py.

The third-party SCAR data and SNAR data are available at GitHub: https://github.com/unmtransinfo/PULSNAR/tree/main/examples/UCIdata

The third-party datasets are available at:

- KDDCup 2004 repository: https://kdd.org/kdd-cup/view/kdd-cup-2004

- UCI Machine learning repository: https://archive.ics.uci.edu/datasets.

Funding

This research was supported by the National Institute of Mental Health of the National Institutes of Health under award numbers R01MH129764 and R56MH120826. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
