Enterprise AI For Dummies - Zachary Jarvinen
Text mining
Text mining deals with unstructured data, which must be organized and structured before applying data modeling and analytics. Using natural-language processing (NLP), text-mining software can extract data elements to populate the structured metadata fields such as author, date, and content summary that enable analysis.
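To make the idea of populating metadata fields concrete, here is a minimal Python sketch. It is not how commercial text-mining software works: simple regular expressions stand in for real NLP, and the function name `extract_metadata`, the "From:"/"By" author convention, the ISO date format, and the sample message are all assumptions made for illustration.

```python
import re

def extract_metadata(raw_text: str, summary_words: int = 12) -> dict:
    """Pull author, date, and a crude content summary out of free-form text."""
    # Assumed convention: the author appears on a line like "From: Name" or "By Name".
    author = re.search(r"^(?:From:|By)\s+(.+)$", raw_text, flags=re.MULTILINE)
    # Only ISO-style dates (YYYY-MM-DD) are recognized in this sketch.
    date = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", raw_text)
    # Treat everything after the first blank line as the body;
    # if there is no blank line, the summary is simply empty.
    _, _, body = raw_text.partition("\n\n")
    summary = " ".join(body.split()[:summary_words])
    return {
        "author": author.group(1).strip() if author else None,
        "date": date.group(1) if date else None,
        "summary": summary,
    }

# Hypothetical sample document.
sample = (
    "From: Dana Lee\n"
    "Date: 2021-03-04\n"
    "\n"
    "The new billing portal is confusing and I could not find my invoice."
)
record = extract_metadata(sample)
print(record)
```

The result is a small structured record that could be loaded into a database table, which is the point at which ordinary data analysis can begin.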
Text mining can go beyond data mining to synthesize vast amounts of content to identify people, places, things, events, and time frames mentioned in written text, assign emotional tone to each mention of them (negative, positive, or neutral), and even determine whether the document presents fact or opinion.
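The following Python sketch shows the shape of that output: each mention of a person, place, or organization paired with a tone. It is a toy under stated assumptions. The `ENTITIES`, `POSITIVE`, and `NEGATIVE` word lists are hypothetical stand-ins for the trained language models a real product would use, and tone is assigned per sentence by simply counting words.

```python
import re

# Hypothetical toy lexicons; real systems rely on trained NLP models instead.
ENTITIES = {"Acme": "organization", "Paris": "place", "Dana": "person"}
POSITIVE = {"great", "love", "helpful"}
NEGATIVE = {"slow", "broken", "rude"}

def tone_of(sentence: str) -> str:
    """Label a sentence by counting positive and negative words."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def tag_mentions(text: str) -> list:
    """Return (entity, type, tone) for every known entity in each sentence."""
    mentions = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tone = tone_of(sentence)
        for name, kind in ENTITIES.items():
            if re.search(rf"\b{re.escape(name)}\b", sentence):
                mentions.append((name, kind, tone))
    return mentions

review = "Dana was helpful. The Acme app is slow and broken. We shipped it to Paris."
mentions = tag_mentions(review)
for mention in mentions:
    print(mention)
```

Word counting of this kind misses negation and sarcasm ("not helpful" would still score as positive), which is one reason production systems use statistical language models rather than fixed word lists.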
Text mining is important for its ability to digest unstructured textual data, which contains more context and valuable insights than structured, transactional data, because it reflects the author’s opinion, intention, emotion, and conclusions.
In 2018, Google introduced a technique for NLP pre-training called Bidirectional Encoder Representations from Transformers (BERT). This technique replaces ontologies with statistics-based mining to ratchet up the relevance of search results.
With AI and machine learning comes an assumption that the more clean data you have, the more accurate your predictions become. But this also assumes you have the horsepower to process and analyze that data quickly, at scale, without dimming the city’s lights. To be effective at customer analysis, AI solutions must process immense amounts of data efficiently and scale to meet increasing volumes of data over time as it is collected and persisted.
Table 1-2 compares and contrasts the properties and uses of data mining versus text mining.
TABLE 1-2 Data Mining Versus Text Mining
Property | Data Mining | Text Mining |
Overview | Data mining searches for patterns and relationships in structured data. | Text mining transforms unstructured textual data into structured information to enable data analysis. |
Data Type | Structured data from large datasets is found in systems such as databases, spreadsheets, ERP, and accounting applications. | Unstructured textual data is found in emails, documents, presentations, videos, file shares, social media, and the Internet. |
Data Retrieval | Structured data is homogeneous and organized, making it easy to retrieve. | Unstructured textual data comes in many different formats and content types located in a more diverse range of applications and systems. |
Data Preparation | Structured data is formal and formatted, facilitating the process of ingesting data into analytical models. | Linguistic and statistical techniques, including NLP keywording and meta-tagging, must be applied to turn unstructured data into usable structured data. |
Taxonomy | There is no need to create an overriding taxonomy. | A global taxonomy must be applied to organize the data into a common framework. |
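The Data Preparation and Taxonomy rows can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: keywording is reduced to counting the most frequent non-trivial words, meta-tagging is reduced to matching words against a tiny hypothetical `TAXONOMY`, and the stop-word list, function names, and sample sentence are invented for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real pipelines use much larger ones.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "for", "my", "was", "in", "on", "it"}

# Hypothetical global taxonomy: category -> words that trigger it.
TAXONOMY = {
    "billing": {"invoice", "refund", "charge"},
    "shipping": {"delivery", "package", "courier"},
}

def keywords(text: str, top_n: int = 5) -> list:
    """Keywording: the most frequent words once stop words are removed."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(top_n)]

def meta_tags(text: str) -> list:
    """Meta-tagging: taxonomy categories whose trigger words appear in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, triggers in TAXONOMY.items() if tokens & triggers)

def to_record(doc_id: str, text: str) -> dict:
    """Turn one unstructured document into a structured row."""
    return {"id": doc_id, "keywords": keywords(text), "tags": meta_tags(text)}

row = to_record("doc-1", "The refund for my invoice was late and the package delivery was late.")
print(row)
```

Because every document is tagged against the same taxonomy, rows produced from emails, reviews, and support tickets share a common framework and can be compared or counted together.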