Latent Dirichlet Allocation Model Using Prior Derived from Empirical Data
One obstacle to exploiting the full power of computers is that they understand very little of the meaning of human language. Significant progress is therefore being made in developing computational tools that organise text corpora so that users can quickly find relevant information in the sea of collective knowledge that is digitised and stored in online databases. Many semantic models, including latent Dirichlet allocation (LDA), have been proposed to help computers deal with the vagueness that can arise from variability in word usage. The LDA model reduces each document in a text collection to a mixture of topics that summarises the main themes in the collection, and it has been widely used to identify and extract hidden structures in data. Recent literature, however, has reported that the model suffers from a restriction: the values of its controlling parameters, namely the prior distributions used to compute the mixture components for theme extraction, are not derived from data. Rather, pre-allocated, fixed priors are adopted and used irrespective of the domain of application. The use of pre-allocated priors rests on the assumption that the computation of thematic structures is independent of the occurrence of words and documents in text collections. This assumption is too strong, and it has been observed that pre-allocated priors, which are often inconsistent with the underlying data, have caused some well-developed models to fail to produce reasonable predictions in real applications. In this study, an empirical prior latent Dirichlet allocation (epLDA) model is presented that uses the latent semantic indexing framework to derive the priors required for topic computation from data. The derived priors incorporate knowledge from the data into the LDA model. 
The parameters of the priors so obtained are related to the parameters of the conventional LDA model through an exponential function. The model was implemented in the C# programming language and tested on benchmark data. It achieved higher prediction accuracy than the conventional LDA, supervised latent Dirichlet allocation (sLDA) and other existing models that have used the same dataset for predictive tasks. The epLDA model consistently outperformed the conventional LDA on different datasets, and its performance holds at a high confidence level. The best reported model in the literature, Random Walk Heterogeneous Graph (RWHG), achieves a prediction accuracy of 90.36 percent, whereas the proposed model achieves 92.15 percent, thereby providing higher prediction confidence. The model also achieves lower perplexity, and hence better generalisation performance, than the conventional LDA model on the same dataset. The average generalisation performance of the model on test data, measured by perplexity (lower is better), is 65.46, while that of the conventional LDA on the same dataset is 72.94.
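
For readers unfamiliar with the perplexity measure used in the comparison above, the following is a minimal sketch of how perplexity is conventionally computed from held-out per-token log-probabilities: the exponential of the negative mean log-likelihood. It is written in Python for brevity and is not the C# implementation evaluated in the study; the function name `perplexity` and the two toy probability lists are hypothetical and serve only to show that a model assigning higher probability to held-out words obtains a lower perplexity.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of per-token natural-log probabilities)."""
    if not token_log_probs:
        raise ValueError("at least one held-out token is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical held-out data: eight tokens scored by two toy models.
# These values are illustrative only and are not taken from the study.
uniform_over_4 = [math.log(0.25)] * 8   # every token given probability 1/4
sharper_model = [math.log(0.5)] * 8     # every token given probability 1/2

print(perplexity(uniform_over_4))   # close to 4: as uncertain as a 4-way guess
print(perplexity(sharper_model))    # close to 2: the better-fitting model
```

Under this definition, the reported values of 65.46 and 72.94 indicate that epLDA is, on average, less uncertain about unseen words than the conventional LDA.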