Linked Data Visualization - Laura Po
CHAPTER 2
Principles of Data Visualization
Information visualization aims at visually representing different types of data (e.g., geographic, numerical, text, network) in order to enable and reinforce cognition. Information visualization offers intuitive ways for information perception and manipulation that essentially amplify the overall cognitive performance of information processing, especially for non-expert users. Visual analytics combines information visualization with data exploration capabilities. It enables users to explore and analyze unknown (in terms of semantics and structure) sets of information, discover hidden correlations and causalities, and make sense of data in ways that are not always possible with traditional quantitative data analysis and mining techniques. This is of great importance, especially given the massive volumes of digital information concerning nearly every aspect of human activity that are currently being produced and collected. The so-called Big Data era refers to this tremendous volume of information collected by digital means and analyzed to produce new knowledge in a plethora of scientific domains.
The LOD (Linked Open Data) cloud is one of the main pillars of the so-called Big Data era. The number of datasets published on the Web, the amount of information they contain, and the interconnections established between disparate sources in the form of LOD are ever expanding. This growth makes traditional ways of analyzing such data insufficient and poses new challenges for how humans can explore, visualize, and gain insights from them.
In this chapter, we present the basic principles and tasks for data visualization. We first describe the tasks for the preparation and visualization of data. We then present the most popular ways of graphically representing data according to its type, followed by an overview of the main techniques that let users interact with the data. Finally, we show the main techniques used for visualizing and interacting with Big Data.
2.1 DATA VISUALIZATION DESIGN PROCESS
Information visualization requires a set of preprocessing tasks, such that data can be first extracted from data sources, transformed, enriched with additional metadata and properly modeled before they can be visually analyzed. An abstract view of this process along with its constituent tasks is shown in Figure 2.1. It presents a generic user-driven end-to-end roadmap of the tasks, problems, and challenges relating to the visualization of data in general.
Figure 2.1: Process for the visualization of data.
The information is usually present in various formats depending on the source; the appropriate data extraction and analysis technique must be selected to transform the raw information into a more semantically rich, structured format. A set of data processing techniques is then applied to enhance the quality of the collected data; these include cleaning data inconsistencies, filling in missing values, and detecting and eliminating duplicates. After that, the data is enriched and customized with visual characteristics, meaningful aggregations, and summaries which facilitate user-friendly data exploration; finally, proper indexing is performed to enable efficient searching and retrieval. The details for each step are presented below.
Data Retrieval, Extraction. The first step concerns the retrieval and extraction of the data to be visualized. Raw data exists in various formats: books and reports describe phenomena in plain text (unstructured information); websites and social networks contain annotated text and semi-structured data; and open data sources and corporate databases provide structured information. Data must first be retrieved in a digital format that is appropriate for processing (e.g., digital text files from newspapers). The core modeling concepts and observations are then extracted. Especially when the source data is in plain text, this is usually performed in an iterative, human-curated way that refines the quality of the extracted data. For structured data sources, the process involves the extraction of the source concepts and their mapping to the target model.
Linked Data usually comes in structured formats, i.e., it conforms to well-defined ontologies and schemas. Linked Data is usually the result of a data extraction and transformation task that has turned raw data into semantically rich data representations. Therefore, LD visualization usually starts from the next step, that of preparing already structured data for visualization.
Data Preparation. Input data are provided either in databases or data files (e.g., .csv, .xml). This step involves identifying all the concepts within the input datasets and representing them in a uniform data model that supports their proper visualization and visual exploration. For example, the multidimensional model is largely employed in social and statistical domains and represents concepts as observations and dimensions. Observations are measures of phenomena (e.g., indices) and dimensions are properties of these observations (e.g., reference period, reference area). Thus, a first step is to analyze the datasets and identify the different types of attributes they contain (date, geolocation, numeric, coded lists, literal). Each attribute is mapped to the corresponding concept of the multidimensional model, such as dimension, observation, coded list, etc. In addition, data processing requires a set of quality improvement activities that eliminate data inconsistencies and violations in source data. For example, missing or inconsistent codes are filled in for coded list attributes, date and time attributes are transformed to the appropriate format, and numerical values are validated so that wrong values can be corrected. Moreover, input data from multiple sources usually contain duplicate facts and a deduplication of the dataset must be performed. Deduplication is the process of identifying duplicate concepts within the input dataset based on a set of distinctive characteristics. A final task concerns the enrichment or the interlinking of the data with information from external sources. For example, places and locations are usually extracted as text; they can subsequently be annotated and enriched with spatial information (e.g., coordinates, boundaries) from external web services or interlinked with geospatial linked data sources (e.g., geonames.org) for their proper representation on maps.
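As a rough illustration of the quality improvement and deduplication tasks described above, the following minimal sketch cleans a few records and maps them to observations with two dimensions. The rows, the field names (`area`, `period`, `value`), the accepted date formats, and the rule that two rows are duplicates when they share the same reference area and period are all assumptions made for this example; a real dataset would need its own rules.

```python
from datetime import datetime

# Hypothetical raw records, e.g. rows read from a .csv file.
RAW_ROWS = [
    {"area": "Athens ", "period": "25/01/2015", "value": "120"},
    {"area": "athens", "period": "2015-01-25", "value": "120"},
    {"area": "Patras", "period": "2015-01-25", "value": "n/a"},
]

# Date formats assumed to occur in the input.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_number(text):
    """Return a float, or None when the value is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def prepare(rows):
    """Clean each row and drop duplicates that share all dimension values."""
    seen = set()
    observations = []
    for row in rows:
        dimensions = (row["area"].strip().lower(), normalize_date(row["period"]))
        if dimensions in seen:
            continue  # duplicate fact: same reference area and period
        seen.add(dimensions)
        observations.append({
            "ref_area": dimensions[0],
            "ref_period": dimensions[1],
            "measure": parse_number(row["value"]),  # None flags a value to review
        })
    return observations
```

Calling `prepare(RAW_ROWS)` keeps one record per reference area and period, and leaves a `None` measure where the value could not be validated, so that it can be corrected in a later pass.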
Visual Preparation. This set of tasks involves the enrichment and customization of the data with characteristics that enable the proper visualization of the underlying information. These characteristics extend the underlying data model with visual information. For example, colors can be assigned to coded values, and different types of diagrams can be bound to different types of data, such as timelines to date attributes and maps to geographical values. Thus, customization and the building of the visual model are necessary tasks before data visualization. In addition, the production of visual summaries and highlights is a common task, especially in the visualization of very large datasets. Summaries provide the user with overviews of the visualized data, and visual highlights are used to present interesting charts and findings. Finally, all the data are stored and indexed in a format that will enable efficient exploration and searching. For example, traditional relational database systems can be used for the exploration and visualization of tabular data, NoSQL database systems such as RDF and graph databases can be used for visualizing network and graph data, and inverted indexes can be used to support text search capabilities.
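One possible way to picture the visual model is as the data model extended with a chart binding per attribute and a color per coded value. The sketch below assumes a data model given as a plain mapping from attribute name to attribute type. The bindings of timelines to dates and maps to geolocations follow the text; the bindings for numeric and coded list attributes, the fallback to a table, and the palette are arbitrary choices made for illustration.

```python
# Chart bound to each attribute type. "timeline" and "map" follow the text;
# the other two bindings are assumptions for this example.
CHART_BY_TYPE = {
    "date": "timeline",
    "geolocation": "map",
    "numeric": "bar chart",
    "coded list": "pie chart",
}

# An arbitrary palette; any list of distinct colors would do.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]


def assign_colors(codes):
    """Give each distinct code a color, cycling when the palette runs out."""
    distinct = sorted(set(codes))
    return {code: PALETTE[i % len(PALETTE)] for i, code in enumerate(distinct)}


def build_visual_model(schema, coded_values):
    """Extend a data model (attribute -> type) with visual information."""
    model = {}
    for attribute, attr_type in schema.items():
        # Attribute types without a binding fall back to a plain table.
        entry = {"type": attr_type, "chart": CHART_BY_TYPE.get(attr_type, "table")}
        if attr_type == "coded list":
            entry["colors"] = assign_colors(coded_values.get(attribute, []))
        model[attribute] = entry
    return model
```

For instance, `build_visual_model({"ref_period": "date", "party": "coded list"}, {"party": ["A", "B"]})` binds `ref_period` to a timeline and gives each code of the hypothetical `party` attribute its own color.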
Data Visualization. The final task in this process is the actual visualization of the data. This involves the provision of the different types of charts, maps, and graphs that present the data and the different visual means (e.g., searching, browsing, filtering) for performing data analysis. Different types of charts can be provided according to the type of information: numerical and tabular data can be presented through typical charts, such as bar and line diagrams, pies, stacked and scatter diagrams, areas, etc.; temporal information (dates and time periods) can be visualized with timeline diagrams; hierarchical information (e.g., hierarchical coded lists) can be explored with hierarchical diagrams such as tree maps and nested diagrams; network data are visualized as graph diagrams, which provide an explicit representation of the interrelationships between the visualized objects; and, finally, geographic information is usually visualized on maps: choropleth maps, heatmaps, and bubble maps capture the density of an observation over a region, while points and clusters can be used for presenting the location of individual entities. Multiple charts can also be combined to provide the user with more sophisticated visualizations.
Allowing the user to choose from different kinds of visualization is crucial, since no single visualization configuration suits every data analysis context. For example, map-related visualizations such as choropleth maps and heatmaps are suitable for geographical data, while network data are usually represented using graph-related visualizations, and statistical data and indices may be better visualized via traditional charts such as line and area charts, timelines, pies, stacked diagrams, and scatter plots. In the next subsections, an overview of the most common techniques used for the visualization of different types of data is presented.
Figure 2.2: A bar chart visualization of the number of people who voted per constituency in the Greek elections of January 2015.
2.2 DATA VISUALIZATION TYPES
In most visualization scenarios we are interested in graphically representing numerical values, i.e., amounts corresponding to a real-world observation (e.g., the population) which is measured along a list of categorical data points (e.g., the population of countries in the EU). The categories can exhibit a flat structure, such as a list of colors, or a hierarchical structure, such as the organization of regional information into cities, countries, and continents.
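The difference between flat and hierarchical categories can be sketched with a small roll-up. In the example below, each numerical value is attached to a path through an assumed three-level hierarchy (continent, country, city), and summing over a prefix of the path yields the value for a higher level. The numbers are illustrative placeholders, not real statistics, and the function name `roll_up` is made up for this sketch.

```python
from collections import defaultdict

# Illustrative placeholder values only, not real statistics.
# Each key is a path through the hierarchy: (continent, country, city).
CITY_VALUES = {
    ("Europe", "Greece", "Athens"): 3,
    ("Europe", "Greece", "Patras"): 1,
    ("Europe", "Italy", "Modena"): 2,
}


def roll_up(values, depth):
    """Sum leaf values over the first `depth` levels of each path."""
    totals = defaultdict(int)
    for path, value in values.items():
        totals[path[:depth]] += value
    return dict(totals)
```

With `depth=3` the categories behave like a flat list of cities, while `depth=2` and `depth=1` aggregate the same values per country and per continent, which is what a hierarchical diagram would let a user drill into.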