Semantic Web for Effective Healthcare Systems, by a group of authors
1.4.1 Vector Space Model
In the vector space model, documents are represented as vectors using the Bag of Words (BoW) model. BoW treats a document as an unordered “bag” of words: it records which words occur, but not the order in which they occur. Term weights may be assigned using either the Boolean model or the vector space model. The Boolean model assigns a weight of 1 or 0 according to the presence or absence of the word in the document. The vector space model uses the term frequency as the weight. Term weighting is an important factor in document representation and decides the efficiency of the IR system. It has three components: Term Frequency (TF), Inverse Document Frequency (IDF), and document length normalization. TF measures how often each word occurs within a document, whereas IDF expresses how important a word is across the whole collection. The more documents a word occurs in, the lower its IDF value. Equation 1.1 determines the weight of a word using the TF-IDF scheme.
$$w_{ij} = tf_{ij} \times \log\left(\frac{N}{df_i}\right) \qquad (1.1)$$

where $tf_{ij}$ is the term frequency of term $i$ in document $j$, $N$ is the total number of documents in the collection, $df_i$ is the document frequency of term $i$ in the collection (the number of documents that contain it), and $w_{ij}$ is the weight of term $i$ in document $j$. Generally, a Term Document (TD) matrix of size $m \times n$ is built between words and documents, where the $m$ rows represent the terms, the $n$ columns represent the documents, and each entry $w$ is the weight of a term in a document.
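The three weighting schemes described in this section (Boolean, term frequency, and TF-IDF) can be illustrated with a minimal sketch that builds a small TD matrix. The toy corpus, the function name `build_td_matrices`, whitespace tokenization, and the use of the natural logarithm are all assumptions made for illustration; the text does not specify a logarithm base or a tokenizer, and document length normalization is left out.

```python
import math
from collections import Counter


def build_td_matrices(docs):
    """Build Boolean, TF, and TF-IDF term-document matrices (terms x documents).

    Hypothetical helper for illustration only; it is not taken from the book.
    """
    tokenized = [d.lower().split() for d in docs]          # naive whitespace tokenizer (assumption)
    terms = sorted({t for doc in tokenized for t in doc})  # m terms, one per row
    n_docs = len(tokenized)                                # N documents, one per column

    # tf[i][j]: raw count of term i in document j (Counter returns 0 for absent terms)
    counts = [Counter(doc) for doc in tokenized]
    tf = [[counts[j][t] for j in range(n_docs)] for t in terms]

    # Boolean model: 1 if the term is present in the document, else 0
    boolean = [[1 if c > 0 else 0 for c in row] for row in tf]

    # df[i]: number of documents that contain term i
    df = [sum(row) for row in boolean]

    # w[i][j] = tf[i][j] * log(N / df[i]); natural log is an assumption
    tfidf = [
        [tf[i][j] * math.log(n_docs / df[i]) for j in range(n_docs)]
        for i in range(len(terms))
    ]
    return terms, boolean, tf, df, tfidf


# Toy corpus (illustrative only): "patient" occurs in every document.
docs = [
    "patient fever cough fever",
    "patient cough headache",
    "patient fever rash",
]
terms, boolean, tf, df, tfidf = build_td_matrices(docs)

for term, row in zip(terms, tfidf):
    print(f"{term:<10}", [round(w, 3) for w in row])
```

In this toy corpus, "patient" appears in all three documents, so its IDF is log(3/3) = 0 and its weight vanishes in every column. "headache" and "rash" each appear in a single document and so receive the largest IDF.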