Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
558481 | 874939 | 2011 | 20-page PDF | Free download |
A novel updating method for Probabilistic Latent Semantic Analysis (PLSA), called Recursive PLSA (RPLSA), is proposed. The updating of conditional probabilities is derived from first principles for both the asymmetric and the symmetric PLSA formulations. The performance of RPLSA for both formulations is compared to that of the PLSA folding-in, the PLSA rerun from the breakpoint, and well-known LSA updating methods, such as the singular value decomposition (SVD) folding-in and the SVD-updating. The experimental results demonstrate that the RPLSA outperforms the other updating methods under study with respect to the maximization of the average log-likelihood and the minimization of the average absolute error between the probabilities estimated by the updating methods and those derived by applying the non-adaptive PLSA from scratch. A comparison in terms of CPU run time is conducted as well. Finally, in document clustering using the Adjusted Rand index, it is demonstrated that the clusters generated by the RPLSA are: (a) similar to those generated by the PLSA applied from scratch; (b) closer to the ground truth than those created by the other PLSA or LSA updating methods.
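For orientation, the PLSA folding-in baseline and the average log-likelihood criterion mentioned above can be sketched as follows. This is a minimal illustration of standard asymmetric PLSA fitted by EM, not the RPLSA recursion itself, which the abstract does not spell out. The function names (`plsa_fit`, `plsa_fold_in`, `avg_log_likelihood`), the random initialization, and the per-token normalization of the log-likelihood are assumptions made for this example and may differ from the paper's definitions.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)

def plsa_fit(counts, n_topics, n_iter=50, seed=0):
    """Asymmetric PLSA via EM. counts: (n_docs, n_words) term-count matrix.
    Returns P(w|z) with shape (n_topics, n_words) and P(z|d) with shape
    (n_docs, n_topics)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w), shape (n_docs, n_topics, n_words)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + EPS
        # M-step: re-estimate both conditionals from expected counts
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + EPS
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + EPS
    return p_w_z, p_z_d

def plsa_fold_in(new_counts, p_w_z, n_iter=50, seed=0):
    """Folding-in: P(w|z) stays fixed; only P(z|d) of the new documents
    is estimated by EM."""
    rng = np.random.default_rng(seed)
    p_z_d = rng.random((new_counts.shape[0], p_w_z.shape[0]))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + EPS
        p_z_d = (new_counts[:, None, :] * post).sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + EPS
    return p_z_d

def avg_log_likelihood(counts, p_w_z, p_z_d):
    """Log-likelihood per token (assumed normalization)."""
    p_w_d = p_z_d @ p_w_z  # P(w|d), shape (n_docs, n_words)
    return float((counts * np.log(p_w_d + EPS)).sum() / counts.sum())
```

Because folding-in never revisits P(w|z), words and topics introduced by the added documents cannot reshape the model. That limitation is what an updating scheme such as the one proposed here is meant to address.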
Research highlights
► A novel updating method for PLSA (RPLSA), when new documents are added incrementally to an initial document collection, is proposed.
► Both the asymmetric and the symmetric PLSA formulations are examined.
► Two initialization schemes for the probability distribution of the added documents are tested.
► RPLSA outperforms the established PLSA and LSA updating methods in terms of accuracy and in document clustering performance.
► RPLSA is more time-consuming than the PLSA folding-in but less time-consuming than the PLSA rerun from the breakpoint.
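The clustering comparison above is scored with the Adjusted Rand index. As a rough sketch of how that score is computed from two labelings, the function below uses the standard contingency-table formula; the name `adjusted_rand_index` is chosen for this example only, and how the paper derives cluster labels from the PLSA output is not covered here.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items.
    Assumes at least two items."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # Map arbitrary labels to 0..k-1 so they can index a contingency table
    _, ti = np.unique(labels_true, return_inverse=True)
    _, pi = np.unique(labels_pred, return_inverse=True)
    ti, pi = ti.ravel(), pi.ravel()
    table = np.zeros((ti.max() + 1, pi.max() + 1), dtype=np.int64)
    np.add.at(table, (ti, pi), 1)

    def comb2(x):
        # Number of unordered pairs, "x choose 2", applied elementwise
        return x * (x - 1) / 2.0

    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    total_pairs = comb2(labels_true.size)
    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put all items in one cluster
        return 1.0
    return float((sum_cells - expected) / (max_index - expected))
```

A value of 1 means the two partitions agree up to a renaming of the clusters, and values near 0 indicate agreement no better than chance.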
Journal: Computer Speech & Language - Volume 25, Issue 4, October 2011, Pages 741–760