Geographic Feature Type Topic Model (GFTTM): grounding topics in the landscape

Computer Science, Centre for eResearch, The University of Auckland, Auckland, New Zealand
DOI
10.7287/peerj.preprints.816v1
Subject Areas
Data Mining and Machine Learning, Data Science, Natural Language and Speech, Spatial and Geographic Information Systems
Keywords
Text mining, Topic modeling, Volunteered geographic information, Bayesian inference
Copyright
© 2015 Adams
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Adams B. 2015. Geographic Feature Type Topic Model (GFTTM): grounding topics in the landscape. PeerJ PrePrints 3:e816v1

Abstract

Probabilistic topic models are a class of unsupervised machine learning models used for understanding the latent topics in a corpus of documents. A new method is proposed for combining geographic feature data with text from geo-referenced documents to create topic models that are grounded in the physical environment. The Geographic Feature Type Topic Model (GFTTM) models each document in a corpus as a mixture of feature type topics and abstract topics. Feature type topics are conditioned on additional observation data of the relative densities of geographic feature types co-located with the document's location referent, whereas abstract topics are trained independently of that information. The GFTTM is evaluated using geo-referenced Wikipedia articles and feature type data from volunteered geographic information sources. A technique is also presented for measuring the semantic similarity of feature types and places based on the mixtures of topics associated with the types. The results of the evaluation demonstrate that the GFTTM finds two distinct types of topics, which can be used to disentangle how places are described in terms of their physical features from how they are described in terms of more abstract topics such as history and culture.
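As an illustration of how a similarity between feature types could be derived from topic mixtures, the sketch below compares hypothetical mixtures using the Jensen-Shannon divergence. This is an assumption made for illustration only: the abstract does not state which similarity measure the GFTTM uses, and the function names, the feature types and the mixture values are all invented for the example.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing to the sum.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(p, q):
    """Return 1 minus the Jensen-Shannon divergence (base 2), a value in [0, 1].

    Assumption: JS divergence is one common way to compare topic mixtures;
    it is not claimed to be the measure used in the paper.
    """
    # Midpoint distribution; it is non-zero wherever p or q is non-zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1.0 - jsd


# Hypothetical topic mixtures over three topics for three feature types.
lake = [0.70, 0.20, 0.10]
reservoir = [0.60, 0.30, 0.10]
church = [0.05, 0.15, 0.80]

sim_lake_reservoir = js_similarity(lake, reservoir)
sim_lake_church = js_similarity(lake, church)
```

With these made-up values, the two water-related feature types come out as more similar to each other than either is to the church type, which is the kind of ordering such a measure is meant to capture.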

Author Comment

This preprint will be a submission to PeerJ CS for review.

Supplemental Information

GFTTM plate notation

Geographic feature type topic model plate notation

DOI: 10.7287/peerj.preprints.816v1/supp-1

LDA plate notation

DOI: 10.7287/peerj.preprints.816v1/supp-2

Screenshot of geonames features and related Wikipedia articles

Sample of geonames.org features and geo-referenced Wikipedia articles in the vicinity of Yosemite Valley. The 'W' icons refer to Wikipedia articles and the numbered icons refer to geonames.org features. The colors correspond to the 9 broad feature type categories defined by geonames.org.

DOI: 10.7287/peerj.preprints.816v1/supp-3