Semantic Web for Effective Healthcare Systems - Group of authors
1.4.2 Latent Semantic Indexing (LSI)
A term-document (TD) matrix built over every word in a document collection is very sparse. Many terms occur in only a few documents, so most entries are zero, and each additional distinct term adds another row to the matrix. A further drawback of the simple vector space model is that it cannot relate two synonymous words that appear in different documents. To reduce the sparseness of the matrix and to address the synonymy issue, the vector space model can be extended with Latent Semantic Indexing (LSI) for document indexing. Figure 1.4 shows the synonymy and polysemy representation of words in the English language. The LSI technique analyzes text documents to uncover their hidden (latent) meanings or concepts. For example, when the word “bank” appears with words such as “mortgage” and “loan”, it can be concluded that it refers to the financial sector; when it appears with words such as “fish” and “pond”, it refers to a body of water. LSI handles this ambiguity by comparing words and documents in the concept space rather than merely comparing words in the document space.
Figure 1.4 Synonymy and polysemy issues in English.
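The sparseness and synonymy problems described above can be illustrated with a minimal sketch. The four toy documents, the whitespace tokenizer, and the helper names (`build_term_document_matrix`, `sparsity`, `cosine`) are assumptions made for this illustration and do not come from the book.

```python
import math
from collections import Counter

# Hypothetical toy collection: two "financial" and two "water body" documents.
docs = [
    "bank approves mortgage loan",
    "bank offers mortgage credit",
    "fish swim in the pond near the bank",
    "pond fish feed near the bank",
]

def build_term_document_matrix(documents):
    """Return (vocabulary, matrix) with one row per term and one column per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[count[term] for count in counts] for term in vocabulary]
    return vocabulary, matrix

def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary, td = build_term_document_matrix(docs)
term_row = {term: td[i] for i, term in enumerate(vocabulary)}

print("terms:", len(vocabulary), "documents:", len(docs))
print("sparsity:", round(sparsity(td), 3))
# "loan" and "credit" are related words, but they never share a document,
# so the plain vector space model sees them as completely unrelated.
print("cosine(loan, credit):", cosine(term_row["loan"], term_row["credit"]))
```

Even in this tiny collection more than half of the entries are zero, and the two related terms have a similarity of zero because their rows never overlap.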
LSI uses Singular Value Decomposition (SVD) to reduce the dimensions of the TD matrix and to reconstruct it from a reduced amount of information. SVD is a matrix factorization technique that factors an m × n matrix into the product of three matrices, U S V^T, where U represents the terms in the concept space, V^T represents the documents in the concept space, and S is the diagonal matrix of singular values, from which the number of dimensions (concepts) to retain can be selected. The difficulty in applying SVD lies in determining how many dimensions or concepts actually exist in the document collection when approximating the matrix. The original TD matrix can be approximated with k dimensions, where k is much smaller than the rank of the TD matrix. The value of k is determined empirically; for large document collections it usually ranges between 100 and 350. Figure 1.5 shows a schematic representation of the truncated approximation of the TD matrix.
Figure 1.5 Approximated TD matrix by SVD.
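The truncation step can be sketched with NumPy's `numpy.linalg.svd`. The 5 × 4 toy matrix, the choice k = 2, and the helper name `truncated_svd` are assumptions for illustration only; a realistic collection would use a far larger matrix and an empirically chosen k.

```python
import numpy as np

# Hypothetical toy TD matrix: rows are terms, columns are documents d1..d4.
terms = ["bank", "loan", "mortgage", "fish", "pond"]
A = np.array([
    [1, 1, 1, 1],  # bank
    [1, 1, 0, 0],  # loan
    [1, 0, 0, 0],  # mortgage
    [0, 0, 1, 1],  # fish
    [0, 0, 1, 0],  # pond
], dtype=float)

def truncated_svd(matrix, k):
    """Return the k leading factors (U_k, s_k, Vt_k) of the SVD of `matrix`."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

k = 2  # number of concepts to keep (assumed for this toy example)
U_k, s_k, Vt_k = truncated_svd(A, k)

# Rank-k approximation of the TD matrix: A_k = U_k S_k V_k^T.
A_k = U_k @ np.diag(s_k) @ Vt_k

s_full = np.linalg.svd(A, compute_uv=False)
error = np.linalg.norm(A - A_k)  # Frobenius norm of the residual

print("singular values:", np.round(s_full, 3))
print("shapes:", U_k.shape, s_k.shape, Vt_k.shape)
print("rank of A_k:", np.linalg.matrix_rank(A_k))
print("approximation error:", round(float(error), 3))
```

The residual equals the square root of the sum of the squared discarded singular values, which is one reason the size of the singular values is a common guide when choosing k.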
LSI indexes words using a low-dimensional representation based on word co-occurrence. The association of terms with documents, i.e., the semantic structure, improves the relevance of query results [56]. A value of k in the low hundreds tends to improve precision and recall. LSI also has disadvantages, such as longer computation time and negative values in the approximated TD matrix.
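As a rough illustration of querying in the concept space, the sketch below folds a one-word query into k dimensions using the commonly used projection q_k = S_k^{-1} U_k^T q and ranks documents by cosine similarity. The toy matrix, the query term, k = 2, and the helper name `cosine` are assumptions, not the book's own procedure. The last lines also check the approximated matrix for negative entries, the disadvantage mentioned above.

```python
import numpy as np

# Hypothetical toy TD matrix: rows are terms, columns are documents d1..d4.
terms = ["bank", "loan", "mortgage", "fish", "pond"]
A = np.array([
    [1, 1, 1, 1],  # bank
    [1, 1, 0, 0],  # loan
    [1, 0, 0, 0],  # mortgage
    [0, 0, 1, 1],  # fish
    [0, 0, 1, 0],  # pond
], dtype=float)

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

k = 2  # number of concepts kept (assumed)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Query containing only the term "mortgage", expressed in term space.
q = np.zeros(len(terms))
q[terms.index("mortgage")] = 1.0

# Fold the query into the concept space: q_k = S_k^{-1} U_k^T q.
q_k = (U_k.T @ q) / s_k

n_docs = A.shape[1]
term_scores = np.array([cosine(q, A[:, j]) for j in range(n_docs)])
concept_scores = np.array([cosine(q_k, Vt_k[:, j]) for j in range(n_docs)])
ranking = np.argsort(-concept_scores)  # document indices, best match first

print("term-space scores:   ", np.round(term_scores, 3))
print("concept-space scores:", np.round(concept_scores, 3))
print("ranking (0-based document indices):", ranking.tolist())

# The rank-k reconstruction is not guaranteed to stay non-negative.
A_k = U_k @ np.diag(s_k) @ Vt_k
has_negative = bool((A_k < -1e-9).any())
print("negative entries in A_k:", has_negative)
```

In this toy setup the second document does not contain "mortgage" and scores zero in term space, yet it ranks above both water-related documents in the concept space because it shares "bank" and "loan" with the first document.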