TECHNICAL PUBLICATIONS:

Modeling information scent: a comparison of LSA, PMI and GLSA similarity measures on common tests and corpora

 

In this paper we compare three systems that estimate semantic similarity between words: Latent Semantic Analysis (LSA; Landauer & Dumais, 1997), Pointwise Mutual Information (PMI; Turney, 2001), and Generalized Latent Semantic Analysis (GLSA; Matveeva, Levow, Farahat, & Royer, 2005). We compare all three techniques on a single common corpus (TASA) and, for PMI and GLSA, we also report performance on a larger web-based corpus. The evaluation uses two kinds of tests: (1) synonymy tests, and (2) comparison with human word similarity judgments. The results indicate that, for large corpora, PMI works best on word similarity tests and GLSA works best on synonymy tests. For the smaller TASA corpus, GLSA produced the best performance on most tests. The larger corpus improved the performance of PMI but, in most cases, did not improve that of GLSA.

 
citation

Budiu, R.; Royer, C.; Pirolli, P. L. Modeling information scent: a comparison of LSA, PMI and GLSA similarity measures on common tests and corpora. 8th RIAO Conference; 2007 May 30 - June 1; Pittsburgh, PA.