Big Data, by Seifedine Kadry (page 34)
1.8.3.4 Data Transformation
Data transformation refers to transforming or consolidating data into an appropriate format and converting it into logical, meaningful information for data management and analysis. The real challenge in data transformation arises when the fields in one system do not match the fields in another system. Data cleaning and manipulation take place before data transformation. Organizations are collecting massive amounts of data, and the volume of that data is increasing rapidly. The captured data are transformed using extract, transform, load (ETL) tools.
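The field-mismatch problem described above can be illustrated with a minimal Python sketch. The field names, the `FIELD_MAP` table, and the `transform_record` helper are all hypothetical and chosen only for illustration; real ETL tools offer much richer mapping facilities.

```python
# Hypothetical mapping from a source system's field names to the
# target system's field names (all names here are assumptions).
FIELD_MAP = {
    "cust_id": "customer_id",
    "fname": "first_name",
    "lname": "last_name",
}


def transform_record(record, field_map):
    """Rename source fields to target fields.

    Returns the transformed record and the list of source fields
    that have no counterpart in the target system.
    """
    transformed = {}
    unmapped = []
    for name, value in record.items():
        if name in field_map:
            transformed[field_map[name]] = value
        else:
            unmapped.append(name)
    return transformed, unmapped


source_record = {"cust_id": 17, "fname": "Ada", "lname": "Lovelace", "fax": None}
target_record, unmapped_fields = transform_record(source_record, FIELD_MAP)
print(target_record)
print(unmapped_fields)
```

Reporting the unmapped fields rather than silently dropping them is one possible design choice; it makes mismatches between the two systems visible so they can be resolved explicitly.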
Data transformation involves the following strategies:
Smoothing, which removes noise from the data by applying techniques such as binning, clustering, and regression.
Aggregation, which applies summary or aggregation operations to the data to produce consolidated data. (E.g., the daily profit of an organization may be aggregated to give the consolidated monthly or yearly turnover.)
Generalization, which is normally viewed as climbing up a concept hierarchy: attributes are generalized to a higher level, and the lower-level attributes are overlooked. (E.g., a street name may be generalized to a city name or, at a still higher level of the hierarchy, to a country name.)
Discretization, which is a technique where raw values in the data (e.g., age) are replaced by conceptual labels (e.g., teen, adult, senior) or interval labels (e.g., 0–9, 10–19, etc.).
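The four strategies listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: smoothing is shown with equal-size bins replaced by bin means (one of several binning variants), and the lookup tables, age thresholds, the extra "child" label, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Smoothing: sort the values, split them into equal-size bins,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def aggregate_monthly(daily_profit):
    """Aggregation: sum daily figures keyed by 'YYYY-MM-DD'
    into monthly totals keyed by 'YYYY-MM'."""
    monthly = defaultdict(float)
    for day, profit in daily_profit.items():
        monthly[day[:7]] += profit
    return dict(monthly)


# Hypothetical concept hierarchy: street -> city -> country.
STREET_TO_CITY = {"Baker Street": "London", "Rue de Rivoli": "Paris"}
CITY_TO_COUNTRY = {"London": "United Kingdom", "Paris": "France"}


def generalize(street, level):
    """Generalization: climb the hierarchy to 'city' or 'country'."""
    city = STREET_TO_CITY[street]
    return city if level == "city" else CITY_TO_COUNTRY[city]


def age_label(age):
    """Discretization with conceptual labels (thresholds are assumptions)."""
    if age < 13:
        return "child"
    if age < 20:
        return "teen"
    if age < 65:
        return "adult"
    return "senior"


def age_interval(age, width=10):
    """Discretization with interval labels such as '10-19'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3))
print(aggregate_monthly({"2024-01-30": 100.0, "2024-01-31": 150.0, "2024-02-01": 80.0}))
print(generalize("Baker Street", "country"))
print(age_label(17), age_interval(17))
```

Each function maps to one strategy, so any of them can be replaced independently, for example by swapping bin means for bin medians or a regression-based smoother.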