Evaluating named entity recognition tools for extracting social networks from novels

PeerJ Computer Science
We follow Sainte-Beuve (1910) here in defining a classic novel not as one written by the ancient Greeks or Romans (‘the classics’) but as a canonical work.
A gazetteer is a list of names
MetaCPAN is a search engine for Perl code and documentation: https://metacpan.org/source/BRIANL/Lingua-EN-Nickname-1.14/nicknames.txt (last retrieved: 30 October 2017).

Main article text

 

Introduction

  • To what extent are off-the-shelf NER tools suitable for identifying fictional characters in novels?

  • Which differences or similarities can be discovered between social networks extracted for different novels?

Materials and Data Preparation

Corpus selection

Data preprocessing

Annotation

Annotation data

Annotation instructions

Named Entity Recognition Experiments and Results

Network Analysis

Network construction

Network features

  1. Average degree is the mean degree of all the nodes in the network. The degree of a node is the number of other nodes it is connected to; a node with degree zero is connected to no other nodes. The degree of a node in a social network is thus a measure of its social ‘activity’ (Wasserman & Faust, 1994). A high value (for example, in Ulysses) indicates that the characters interact with many different other characters. Conversely, a low value (for example, in 1984) indicates that the characters interact with only a small number of other characters.

  2. Average Weighted Degree is similar to the average degree, but the distinction matters for social networks: the weighted degree also takes into account the weight of each connecting edge. A character in our social network could have a high degree, indicating a high level of social activity, but if the weights of all its edges are relatively small, this suggests only superficial contact. Conversely, a character could have a low degree (for example, being connected to only two other characters) while those two edges have very large weights, indicating a deep social connection between the characters. Newman (2006) underlines the importance of this distinction in his work on scientific collaborations. To continue the examples of Ulysses and 1984: while their average degrees are vastly different (with Ulysses being the highest of its class and 1984 the lowest), their average weighted degrees are comparable.

  3. Average Path Length is the mean length of the shortest paths between all pairs of nodes in the network; the length of a shortest path is also known as the geodesic distance. If there is no path connecting two nodes, this distance is infinite and the two nodes are part of different graph components (see item 7, Connected Components). The shortest path between two nodes can be found by using Dijkstra’s algorithm (Dijkstra, 1959). The path length is typically an indication of how efficiently information is relayed through the network. A low average path length indicates that the people in the network can reach each other through a relatively small number of steps.

  4. Network Diameter is the largest distance between any two nodes in the network. It is in essence the longest of all the shortest paths between pairs of nodes, and is indicative of the linear size of the network (Wasserman & Faust, 1994).

  5. Graph density is the number of edges present in the network as a fraction of the total number of possible edges. It thus indicates how complete the network is, where a complete network is one in which every pair of nodes is directly connected by an edge. This is often used in social network analysis to represent how closely the participants of the network are connected (Scott, 2012).

  6. Modularity is used to represent community structure. The modularity of a network is ‘...the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random’ (Newman, 2006). Newman shows that modularity can be used as an optimisation metric to approximate the number of community structures found in the network. To identify the community structures, we used the Louvain algorithm (Blondel et al., 2008). Identifying community structures in a graph is useful because the nodes in the same community are more likely to have other properties in common (Danon et al., 2005). It would therefore be interesting to see whether the prevalence of communities differs between the classic and modern novels.

  7. Connected components is the number of distinct graph components. A graph component is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the supergraph. In other words, it is not possible to traverse from one component to another. In most social communities, one ‘giant component’ can typically be identified, which contains the majority of all vertices (Kumar, Novak & Tomkins, 2010). A higher number of connected components would indicate a higher number of isolated communities. This differs from modularity in that components are stricter: if even a single edge connects a subgraph to the rest of the graph, that subgraph is no longer considered a separate component. Modularity attempts to identify those communities that are ‘almost’ separate components.

  8. Average clustering coefficient is the mean of the clustering coefficients of all nodes. The clustering coefficient of a node is the fraction of pairs of its neighbours that are themselves connected, and can perhaps best be described as ‘all-my-neighbours-know-each-other’. Social networks with a high clustering coefficient (and low average path length) may exhibit small world (https://en.wikipedia.org/wiki/Smallworld_experiment) properties (Watts & Strogatz, 1998). The small world phenomenon was originally described by Stanley Milgram in his perennial work on social networks (Travers & Milgram, 1967).
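
The features listed above can be illustrated with a minimal Python sketch on a toy network. This is not the authors' implementation: the character names, edge weights and function names are hypothetical. As assumptions of this sketch, path lengths are counted in hops (so a breadth-first search stands in for Dijkstra's algorithm), average path length and diameter are computed within the largest component only (distances between components being infinite), and modularity is evaluated for a partition that is given rather than found with the Louvain algorithm.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy network: (character_a, character_b, interaction_weight)
EDGES = [("Ann", "Bob", 5), ("Ann", "Cat", 1), ("Bob", "Cat", 2),
         ("Cat", "Dan", 1), ("Eve", "Fay", 4)]

def build_graph(edges):
    """Return an undirected adjacency map: node -> {neighbour: weight}."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

def bfs_distances(graph, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def connected_components(graph):
    """Return a list of node sets, one per connected component."""
    seen, components = set(), []
    for node in graph:
        if node not in seen:
            comp = set(bfs_distances(graph, node))
            seen |= comp
            components.append(comp)
    return components

def local_clustering(graph, node):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    nbs = list(graph[node])
    if len(nbs) < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbs, 2) if v in graph[u])
    return links / (len(nbs) * (len(nbs) - 1) / 2)

def modularity(graph, communities):
    """Unweighted modularity of a given partition of the nodes."""
    m = sum(len(nbs) for nbs in graph.values()) / 2
    q = 0.0
    for comm in communities:
        internal = sum(1 for u in comm for v in graph[u] if v in comm) / 2
        degree_sum = sum(len(graph[u]) for u in comm)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q

def network_features(graph):
    """Compute the features described above (modularity is separate)."""
    n = len(graph)
    m = sum(len(nbs) for nbs in graph.values()) // 2
    components = connected_components(graph)
    giant = max(components, key=len)
    # Assumption: only node pairs inside the largest component are measured.
    lengths = [d for s in giant
               for t, d in bfs_distances(graph, s).items() if t != s]
    return {
        "average_degree": 2 * m / n,
        "average_weighted_degree":
            sum(sum(nbs.values()) for nbs in graph.values()) / n,
        "average_path_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "diameter": max(lengths, default=0),
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "connected_components": len(components),
        "average_clustering": sum(local_clustering(graph, v) for v in graph) / n,
    }

graph = build_graph(EDGES)
for name, value in network_features(graph).items():
    print(f"{name}: {value:.3f}")
# Here the two components are used as a stand-in partition.
print(f"modularity: {modularity(graph, connected_components(graph)):.3f}")
```

In this toy network the heavily weighted Ann and Bob edge raises the average weighted degree without changing the average degree, which is the distinction drawn in item 2, and the isolated Eve and Fay pair forms a second connected component in the sense of item 7.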

Results of network analysis

Network exploration

Discussion and Performance Boosting Options

The Black Company

The Three Musketeers

Conclusion and Future Work

  • To what extent are off-the-shelf NER tools suitable for identifying fictional characters in novels?

  • Which differences or similarities can be discovered between social networks extracted for different novels?

Appendix: Additional Statistics

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Niels Dekker conceived and designed the experiments, performed the experiments, analysed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, performed the computation work, authored or reviewed drafts of the paper, approved the final draft.

Tobias Kuhn contributed reagents/materials/analysis tools, authored or reviewed drafts of the paper, approved the final draft.

Marieke van Erp conceived and designed the experiments, contributed reagents/materials/analysis tools, authored or reviewed drafts of the paper, approved the final draft.

Data Availability

The following information was supplied regarding data availability:

Code and data are available at GitHub: https://github.com/Niels-Dekker/Out-with-the-Old-and-in-with-the-Novel.

Funding

The authors received no funding for this work.
