1.4 Feature Extraction
The World Wide Web contains a huge volume of text documents, and the need to annotate them has become vital. Customers express their views through product reviews and user feedback written in a descriptive, mostly unstructured format. This format makes it difficult for machines to process these text documents. Hence, it becomes necessary to annotate large volumes of text in order to develop business intelligence or automated solutions. These data have to be analyzed and modelled to enable the decision-making process. The challenging task of extracting information is made easier by annotating documents, which in turn paves the way for automated solutions [53].
Feature extraction is the process of building a dataset of informative and non-redundant features from the initial raw data. Subsequent methods such as feature selection then reduce the amount of resources required to represent it. Many machine learning algorithms, such as classification and clustering, are used to extract features such as entities and attributes from text documents based on their similar properties. Challenges such as the absence of semantic relations between entities during feature selection and the lack of prior domain knowledge may be overcome by applying suitable NLP and Information Extraction (IE) techniques [54, 55].
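As a hedged illustration only, the following sketch shows how entity and attribute candidates might be pulled from a single review sentence with the spaCy NLP library; the model name en_core_web_sm and the sample sentence are assumptions introduced for this example, not techniques prescribed in [54, 55].

# A minimal sketch of NLP-based entity/attribute extraction from a review,
# assuming spaCy and its small English model are installed
# (pip install spacy; python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

review = "The Acer laptop has a bright screen but the battery life is poor."
doc = nlp(review)

# Named entities (e.g., brand or product names) recognized by the model
for ent in doc.ents:
    print(ent.text, ent.label_)

# Noun chunks give rough candidates for product attributes
for chunk in doc.noun_chunks:
    print(chunk.text)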
Feature extraction from product or service review documents typically includes steps such as data pre-processing, document indexing, dimension reduction, model training, testing, and evaluation. A labeled collection of documents is used to train the model, and the learned model is then used to identify unlabeled concept instances in new sets of documents. Document indexing is the most critical and complex task in text analysis: it decides the set of key features that represent a document and enhances the relevancy between a word (or feature) and the document. It needs to be very effective, as it determines the storage space required and the query processing time for documents.
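A compact sketch of this workflow, assuming scikit-learn as the toolkit, is given below; the tiny labeled review list, the tf-idf indexing step, and the logistic regression classifier are illustrative choices, not the specific methods of the cited works.

# A sketch of the workflow above: pre-process/index the documents,
# train on labeled data, then test and evaluate (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical labeled review collection (1 = positive, 0 = negative)
reviews = [
    "Great battery life and a sharp display",
    "The screen cracked within a week, very poor quality",
    "Fast delivery and the product works as described",
    "Terrible customer service and a defective charger",
]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.5, stratify=labels, random_state=42
)

# Document indexing via tf-idf term weighting, followed by a classifier
model = Pipeline([
    ("index", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression()),
])

model.fit(X_train, y_train)  # learn from the labeled documents
print(classification_report(y_test, model.predict(X_test), zero_division=0))

# The learned model can now label new, unseen review documents
print(model.predict(["The battery drains too quickly"]))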
Pre-processed data is built into a Term-Document (TD) matrix using term weighting schemes, as shown in Figure 1.3. The text documents are converted into numbers using Information Retrieval (IR) techniques, which differ in the weight they allocate to each term.
Figure 1.3 Term weighting schemes for feature extraction.
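The sketch below contrasts two common term weighting schemes when constructing such a matrix, raw term frequency and tf-idf; the three example documents are invented for illustration, and note that scikit-learn returns the matrix in document-term orientation, i.e. the transpose of the TD matrix.

# A sketch of two term weighting schemes for the TD matrix,
# assuming scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the camera quality is excellent",
    "the battery quality is poor",
    "excellent camera but poor battery",
]

# Scheme 1: raw term frequency (each cell = how often a term occurs)
tf = CountVectorizer()
tf_matrix = tf.fit_transform(docs)

# Scheme 2: tf-idf (terms common across all documents are down-weighted)
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)

print(tf.get_feature_names_out())       # extracted terms (features)
print(tf_matrix.toarray())              # documents as term-count vectors
print(tfidf_matrix.toarray().round(2))  # documents as tf-idf vectors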
IR techniques such as the Vector Space Model (VSM), Latent Semantic Indexing (LSI), topic modeling, and clustering are used for term weighting in the feature extraction of text documents. The following subsections describe the rationale behind the different feature extraction techniques used in text analysis.
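Before turning to those subsections, a small hedged sketch of how some of these techniques fit together is given here: it builds a VSM (tf-idf) representation, applies LSI via truncated SVD for dimension reduction, and then clusters the documents with k-means. The number of latent dimensions, the number of clusters, and the sample documents are arbitrary choices made only for illustration.

# A sketch combining VSM (tf-idf), LSI (truncated SVD) and k-means
# clustering for unsupervised feature extraction (assumes scikit-learn).
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the doctor prescribed a new medicine for the patient",
    "patient records are stored in the hospital database",
    "the camera and battery of this phone are excellent",
    "poor battery life but a very good camera",
]

# VSM: represent each document as a tf-idf weighted term vector
vsm = TfidfVectorizer(stop_words="english")
X = vsm.fit_transform(docs)

# LSI: project the term space onto 2 latent semantic dimensions
lsi = TruncatedSVD(n_components=2, random_state=0)
X_lsi = lsi.fit_transform(X)

# Clustering: group documents that share latent topics
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X_lsi))  # expected to separate the two topic groups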