Posted on December 15, 2013 by Doori Lee
In machine learning, a topic model is a type of statistical model for discovering the “topics” that occur in a corpus composed of documents. The Latent Dirichlet Allocation (LDA) model is one of the most commonly used topic models that represents the corpus as a network of topics.
I have been using the LDA model to see how specific philosophical topics relate to each other in a selection of 1315 volumes from the Hathi Trust library. LDA assumes that a given textual corpus has K topics and that each document in the corpus is a mixture of those topics. A “topic” is defined as a probability distribution over words and is often represented as a list of the most probable words in the topic. The number of topics is chosen by the user when the model is trained, so the LDA model can be trained over the same textual corpus with different numbers of topics.
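To make these definitions concrete, here is a minimal sketch in plain Python. The two topics, their probabilities, and the document mixture are made-up toy values, not output from the Hathi Trust corpus, and the helper names `top_words` and `word_probability` are only illustrative.

```python
# Toy topic-word distributions (made-up numbers, not from the real corpus).
topics = {
    0: {"church": 0.4, "god": 0.3, "people": 0.2, "law": 0.1},
    1: {"social": 0.5, "individual": 0.3, "life": 0.1, "people": 0.1},
}


def top_words(topic, n=3):
    """Return the n most probable words of a topic, most probable first."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def word_probability(word, doc_mixture, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topics[k].get(word, 0.0) for k, weight in doc_mixture.items()
    )


# Hypothetical document: 75% topic 0 and 25% topic 1.
doc_mixture = {0: 0.75, 1: 0.25}

print(top_words(topics[0]))
print(word_probability("people", doc_mixture, topics))
```

The list returned by `top_words` is the usual way a topic is displayed, while `word_probability` shows what "a document is a mixture of topics" means for a single word.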
The number of topics is important in topic modeling because it determines how broad or specific each topic is. Previous research states that there is a natural number of topics for a given corpus (R. Arun et al., “On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations,” 2010). However, depending on the task, a small or large number of topics (in other words, broader or more specific topics) may be suitable.
By visualizing topic networks, we investigate the connections between LDA models trained over the same corpus with different numbers of topics.
In this experiment, we train the LDA model over the same corpus with different numbers of topics (K=20, 40, 160) and find similar topics between models using similarity functions from the Indiana Philosophy Ontology project's vector space model toolkit. For example, for every topic in the 20- and 40-topic models, we find similar topics in the 160-topic model. Each pair of models (e.g. the 20- and 160-topic LDA models) is combined in a graph using Gephi. The graphs below show the network of topics as color-coded clusters based on modularity.
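The matching step can be sketched as follows. This is not the toolkit's code: the post does not say which similarity function was used, so cosine similarity between topic-word distributions is an assumption here, and the function names and toy topics are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse word-probability dicts."""
    dot = sum(prob * q.get(word, 0.0) for word, prob in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def most_similar(small_model, large_model, n):
    """For each topic in small_model, list the n closest topics in large_model."""
    matches = {}
    for small_id, small_topic in small_model.items():
        scored = [
            (cosine(small_topic, large_topic), large_id)
            for large_id, large_topic in large_model.items()
        ]
        scored.sort(reverse=True)  # highest similarity first
        matches[small_id] = [large_id for _, large_id in scored[:n]]
    return matches


# Toy stand-ins for a small-K model and a large-K model (made-up values).
small = {
    "T0": {"church": 0.6, "god": 0.4},
    "T1": {"social": 0.7, "life": 0.3},
}
large = {
    0: {"church": 0.9, "people": 0.1},
    1: {"god": 0.8, "church": 0.2},
    2: {"social": 0.5, "individual": 0.5},
    3: {"life": 0.6, "social": 0.4},
}

print(most_similar(small, large, n=2))
```

Each (small topic, large topic) pair returned here corresponds to one edge in the combined graph.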
The graphs show topics from the K=20 and K=40 models as T# and topics from the K=160 model as plain numbers. Each graph distinguishes modules (clusters) by color; a module groups similar topics that are more densely connected to one another than to the rest of the network.
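As a rough illustration of what modularity measures, the sketch below scores a given partition of a small undirected, unweighted graph with the standard modularity formula. It only evaluates a partition that is handed to it; it does not search for the best partition the way Gephi's modularity tool does. The edges and community labels are made up.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy topic network: T-prefixed nodes come from the smaller model.
edges = [("T0", 0), ("T0", 1), ("T1", 2), ("T1", 3), ("T0", 2)]
two_clusters = {"T0": "a", 0: "a", 1: "a", "T1": "b", 2: "b", 3: "b"}
one_cluster = {node: "a" for node in two_clusters}

print(modularity(edges, two_clusters))
print(modularity(edges, one_cluster))
```

A partition scores higher when more edges fall inside clusters than would be expected from the node degrees alone, which is why the two-cluster split scores above the single all-in-one cluster.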
In Graph 1, each topic in the K=20 LDA model is mapped to 8 similar topics in the K=160 model. The 20 topics are grouped into 9 clusters. In Graph 2, each topic in the K=40 LDA model is mapped to 4 similar topics in the K=160 model, and the 40 topics are grouped into 15 clusters.
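A mapping like the one in Graph 1 amounts to an edge list with 20 × 8 = 160 edges. The sketch below serializes such a mapping as a CSV with `Source` and `Target` columns, which is the edge-table layout I understand Gephi's spreadsheet import to expect; treat that, the function name `edge_list_csv`, and the placeholder links as assumptions rather than the exact files used for these graphs.

```python
import csv
import io


def edge_list_csv(matches):
    """Serialize {small_topic_id: [large_topic_id, ...]} as a Source,Target CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    for small_id, large_ids in matches.items():
        for large_id in large_ids:
            # Topics from the smaller model get the T prefix, as in the graphs.
            writer.writerow([f"T{small_id}", large_id])
    return buffer.getvalue()


# Placeholder mapping: 20 topics, each linked to 8 topics of the K=160 model.
# Real links come from the similarity step and need not be disjoint like these.
matches = {t: list(range(t * 8, t * 8 + 8)) for t in range(20)}
csv_text = edge_list_csv(matches)

print(csv_text.splitlines()[:3])
```

Writing `csv_text` to a file gives an edge table that can be loaded into a graph tool for layout and clustering.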
The tables below show a sample of topic clusters from each network graph. The first two rows are topics from the K=20 and K=40 models (labeled T#), and the following rows are from the K=160 model. In these tables, each topic is labeled with an arbitrary identifying number and represented by the 5 words that occur most commonly in the topic. The topics in bold blue are the topics from the K=160 model that are similar to all topics in the topic cluster.
In Table 1, the related topics concern church, gods, and people. In Table 2, Topics 17 and 35 from the K=40 model share 3 common topics from the K=160 model, which relate to ‘social’, ‘individual’, and ‘life’.
Through visualizing topic networks, we observe that topics from models trained with different numbers of topics can be grouped into clusters by modularity or semantic similarity. Further research could compare clustering algorithms with LDA models. For example, comparing the semantic closeness of N topic clusters from a (K > N) LDA model with the topics of a (K = N) LDA model could help us obtain high-quality topics.