Similarity to a single set
- Published
- Accepted
- Subject Areas
- Data Mining and Machine Learning, Theory and Formal Methods
- Keywords
- binary similarity measure, set similarity, classification, clustering, diagnostic test, data mining
- Copyright
- © 2016 Naish
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
- Cite this article
- 2016. Similarity to a single set. PeerJ PrePrints 4:e1713v1 https://doi.org/10.7287/peerj.preprints.1713v1
Abstract
Identifying patterns and associations in data is fundamental to discovery in science. This work investigates a very simple instance of the problem, where each data point consists of a vector of binary attributes, and attributes are treated equally. For example, each data point may correspond to a person and the attributes may be their sex, whether they smoke cigarettes, whether they have been diagnosed with lung cancer, etc. Measuring similarity of attributes in the data is equivalent to measuring similarity of sets - an attribute can be mapped to the set of data points which have the attribute. Furthermore, there is one identified base set (or attribute) and only similarity to that set is considered - the other sets are just ranked according to how similar they are to the base set. For example, if the base set is lung cancer sufferers, the set of smokers may well be high in the ranking. Identifying set similarity or correlation has many uses and is often the first step in determining causality. Set similarity is also the basis for comparing binary classifiers such as diagnostic tests for any data set. More than a hundred set similarity measures have been proposed in the literature is but there is very little understanding of how best to choose a similarity measure for a given domain. This work discusses numerous properties that similarity measures can have, weakening some previously proposed definitions so they are no longer incompatible, and identifying important forms of symmetry which have not previously been considered. It defines ordering relations over similarity measures and shows how some properties of a domain can be used to help choose a similarity measure which will perform well for that domain.
Author Comment
This is the first complete version of this paper.