
1.4.4 Topic Modeling


Vector Space Models (VSMs) take the raw text data and index each term against the documents. The LSI technique shows 30% improved accuracy compared with traditional VSMs [29]. Further, previous research carried out in the IR domain with the LSI technique indicates that LSI improves recall, but precision is not comparably improved [22, 26]. This disadvantage can be overcome by processing the raw data before applying the indexing technique. Hidden concepts in the document collection can be included while indexing the terms and documents, which substantially improves the accuracy [58]. The Latent Dirichlet Allocation (LDA) technique [58] uncovers latent “topics” in a document collection, where topics act as a kind of feature. It is a language model that describes the topics of documents in a probabilistic way. Each document may contain a mixture of different topics, and each topic may be represented by many occurrences of its related words in the documents. Figure 1.6 shows the framework of the LDA model for topic (or feature) categorization of text documents.
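As a rough illustration of topic modeling in practice (not taken from this chapter), the following Python sketch fits an LDA model to a tiny, made-up document collection using scikit-learn; the corpus, the choice of two topics, and all parameter values are assumptions for illustration only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus; any collection of raw text documents would do.
docs = [
    "patient blood pressure treatment hospital",
    "blood sugar insulin diabetes treatment",
    "semantic web ontology knowledge representation",
    "ontology reasoning semantic annotation web",
]

vectorizer = CountVectorizer()
td = vectorizer.fit_transform(docs)            # term-document counts (the TD matrix)

lda = LatentDirichletAllocation(n_components=2, random_state=0)   # K = 2 topics, chosen arbitrarily
doc_topic = lda.fit_transform(td)              # per-document topic mixtures
topic_term = lda.components_                   # per-topic term weights

terms = vectorizer.get_feature_names_out()     # requires scikit-learn >= 1.0
for k, row in enumerate(topic_term):
    top = row.argsort()[::-1][:5]              # five most probable terms of topic k
    print(f"topic {k}:", [terms[i] for i in top])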


Figure 1.6 LDA framework.

For example, Word 2 is categorized under two different topics, say “Topic 1” and “Topic 2.” The context of this word varies and is determined by the co-occurrence of other words. So, Word 2 with the context “Topic 1” is more relevant to “Doc 1,” and the same word with the context “Topic 2” is more relevant to “Doc 2.” Identifying latent concepts thus improves the accuracy of feature categorization.

LDA is a matrix factorization technique. It reduces the TD matrix into two low-dimensional matrices, M1 and M2: M1 is a document–topic matrix (D × K) and M2 is a topic–term matrix (K × N), where D is the number of documents, K the number of topics, and N the number of terms. LDA uses sampling techniques to improve these matrices. The model assumes that every word–topic mapping is correct except that of the current word. The technique iterates over each word “w” of each document “d” and adjusts the current topic assignment of “w” using the product of two probabilities, p1 and p2, where p1 = p(topic t | document d), the proportion of words in document “d” currently assigned to topic “t,” and p2 = p(word w | topic t), the proportion of assignments to topic “t” over all documents that come from word “w.” A steady state is reached after a number of iterations, at which point the word–topic and topic–document distributions are fairly good.
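The per-word update described above can be sketched as follows. This is a minimal Python/numpy illustration, assuming count matrices ndk (document–topic counts), nkw (topic–word counts) and nk (topic totals) that a full sampler would maintain; it is not the authors' implementation.

import numpy as np

def gibbs_resample_word(d, i, w, z, ndk, nkw, nk, alpha, beta, K, V, rng):
    # Resample the topic of word w at position i of document d, keeping
    # every other word-topic assignment fixed (treated as correct).
    old = z[d][i]
    ndk[d, old] -= 1                            # remove the current assignment from the counts
    nkw[old, w] -= 1
    nk[old] -= 1
    p1 = ndk[d] + alpha                         # p(topic t | document d), up to a constant
    p2 = (nkw[:, w] + beta) / (nk + V * beta)   # p(word w | topic t)
    p = p1 * p2
    p /= p.sum()
    new = rng.choice(K, p=p)                    # draw the new topic assignment of "w"
    ndk[d, new] += 1                            # add the word back under its new topic
    nkw[new, w] += 1
    nk[new] += 1
    z[d][i] = new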

For the LDA model, the number of topics K has to be fixed in advance. LDA assumes the following generative process for a document w = (w1, . . . , wN) of a corpus D containing N words drawn from a vocabulary of V different terms, wi ϵ {1, …, V} for all i = {1, …, N}. LDA consists of the following steps [12]:

(1) For each topic k, draw a distribution over words Φ(k) ~ Dir(β).

(2) For each document d,
 (a) Draw a vector of topic proportions θ(d) ~ Dir(α).
 (b) For each word i,
  (i) Draw a topic assignment zd,i ~ Mult(θd), zd,i ϵ {1, …, K},
  (ii) Draw a word wd,i ~ Mult(Φzd,i), wd,i ϵ {1, …, V},

where α is a Dirichlet prior on the per-document topic distribution, and β is a Dirichlet prior on the per-topic word distribution. Let θtd be the probability of topic t for document d, zd,i be the topic assignment of word i in document d, and Φtw be the probability of word w in topic t. The probability of generating word w in document d is:

p(w | d) = Σt θtd Φtw (1.2)

Equation 1.2 gives the weighted average of the per-topic word probabilities, where the weights are the per-document topic probabilities. The resulting distribution p(w|d) varies from document to document, as the topic weights change across documents. Corpus documents are fitted to the LDA model by inferring a collection of hidden variables. These variables are denoted by θ = {θtd}, the |K| × |D| matrix of per-document topic weights, and Φ = {Φtw}, the |K| × |N| matrix of per-topic word weights. Inference for LDA is the problem of determining the joint posterior distribution of θ and Φ after observing a corpus of documents, as shaped by the LDA priors.
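To make the generative steps and Equation 1.2 concrete, the short numpy sketch below samples Φ and θ from Dirichlet priors, forms p(w|d) as the weighted average Σt θtd Φtw, and then generates a few documents; all sizes and prior values are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
D, K, V, N = 3, 2, 6, 10              # documents, topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1                # illustrative Dirichlet priors

Phi = rng.dirichlet(np.full(V, beta), size=K)      # step (1): per-topic word distributions (K x V)
theta = rng.dirichlet(np.full(K, alpha), size=D)   # step (2a): per-document topic proportions (D x K)

p_w_given_d = theta @ Phi             # Equation 1.2: p(w|d) = sum_t theta[d, t] * Phi[t, w]

for d in range(D):                    # steps (2b i-ii): generate each word of each document
    words = []
    for _ in range(N):
        t = rng.choice(K, p=theta[d])     # z_{d,i} ~ Mult(theta_d)
        w = rng.choice(V, p=Phi[t])       # w_{d,i} ~ Mult(Phi_{z_{d,i}})
        words.append(w)
    print(f"doc {d}: word ids {words}")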

The simple LDA model gives term–topic probabilities for all terms under each topic. According to the literature [59, 60], only the top 5 or 10 terms under each topic are usually selected for modeling. The CFSLDA model (Contextual Feature Selection LDA), however, selects the set of probable terms from the data set that represent the topic or concept of a domain. It builds a contextual model using LDA together with a correlation technique to select the list of probable and correlated terms under each topic (or feature) of the data set. These lists of terms represent the topic or concept of a domain and establish the context between the terms. The plate notation of CFSLDA topic modeling is shown in Figure 1.7, and a sketch of the term-selection idea follows it. The notations used in the CFSLDA model are:

 D—number of documents

 N—number of words or terms

 K—number of topics

 α—a Dirichlet prior on the per-document topic distribution

 β—a Dirichlet prior on the per-topic word distribution

 θtd—probability of topic t for document d

 Φtw—probability of word w in topic t

 zd,i—topic assignment of word i in document d

 wd,i—word i observed in document d

 C—correlation between the terms


Figure 1.7 Plate notation of CFSLDA model.
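As a hedged sketch of the term-selection idea behind CFSLDA described above, the function below keeps, for each topic, the probable terms whose document co-occurrence correlates with the topic's most probable term. The use of Pearson correlation on a dense document–term count matrix, the threshold value, and all names are assumptions for illustration, not the authors' procedure.

import numpy as np

def select_contextual_terms(phi, dtm, top_n=10, corr_threshold=0.3):
    # phi: K x V topic-term probability matrix from a fitted LDA model
    # dtm: D x V dense document-term count matrix (the TD matrix, documents-by-terms orientation)
    selected = []
    for k in range(phi.shape[0]):
        candidates = phi[k].argsort()[::-1][:top_n]   # most probable terms of topic k
        anchor = candidates[0]                        # top term acts as the topic's context
        keep = [anchor]
        for t in candidates[1:]:
            # correlation of the two terms' frequencies across documents
            c = np.corrcoef(dtm[:, anchor], dtm[:, t])[0, 1]
            if c >= corr_threshold:
                keep.append(t)
        selected.append(keep)
    return selected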

LDA associates documents with a set of topics, where each topic is a set of words. Under the LDA model, the next word is generated by first selecting a random topic from the set of topics T and then choosing a random word from that topic's distribution over the vocabulary W. The hidden variables θ and Φ are determined by fitting the LDA model to a set of corpus documents. The CFSLDA model uses Gibbs sampling to perform the topic modeling of text documents. Given values for the Gibbs settings (b, n, iter), the LDA hyper-parameters (α, β, and K), and the TD matrix M, a Gibbs sampler produces “n” random observations from the inferred posterior distribution of θ and Φ [60].
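A minimal sketch of how the settings b, n, and iter might drive such a sampler is given below; sweep() and snapshot() stand for hypothetical callables (one full Gibbs pass over all words, and an estimate of θ and Φ from the current counts) that are not defined in the text.

def run_gibbs(sweep, snapshot, b=500, n=20, iter_=50):
    # b: burn-in sweeps that are discarded; n: observations kept;
    # iter_: sweeps between two kept observations (the text's "iter").
    for _ in range(b):                # burn-in: the chain has not reached steady state yet
        sweep()
    samples = []
    while len(samples) < n:
        for _ in range(iter_):        # thinning: skip iter_ sweeps between samples
            sweep()
        samples.append(snapshot())    # one observation of (theta, phi)
    return samples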

Here the Gibbs parameters include “b”—the number of burn-in iterations, “n”—the number of samples, and “iter”—the number of iterations between samples. The Gibbs sequences produce θ and Φ from the desired distribution, but only after a large number of iterations. For this reason, it is necessary to discard (or burn) the initial “b” observations [60]. The Gibbs setting “n” determines how many observations from the two Gibbs sequences are kept. The setting “iter” specifies how many iterations the Gibbs sampler runs before returning the next useful observation [60]. The procedure of the CFSLDA model is shown below:

