Data Analytics in Bioinformatics, by a group of authors

3.3.1 Data Pre-Processing and its Necessity

After data are collected from a database, they pass through several processing steps, because data held in databases are often raw, noisy, incomplete, or inconsistent. For these reasons the data cannot be used directly for mining, as doing so may produce unsatisfactory results. To improve the classification result, a pre-processing step is therefore carried out as an essential step before mining the data. It usually includes methods such as data cleaning, data integration, data transformation, and dimensionality reduction [11]. Data pre-processing significantly improves the quality of the data and the performance of the classification model, and it reduces the time required for the actual mining.

We now address some of the problems that must be solved to achieve a better classification result. Data cleaning deals with noisy data, missing data, duplicate data, and similar defects in the database so that classification can proceed smoothly. Noisy data are unnecessary information in the dataset that is meaningless and cannot be interpreted by machines; they are also called corrupt data. Their presence can affect the data preparation process. Noisy data can be smoothed using binning, clustering, or regression methods, or, in large datasets, simply deleted, depending on the amount of noise present [11, 12]. Another major problem in biological data is the absence of values. In complex biological datasets this issue greatly affects the accuracy of the model, so various imputation techniques are used to handle missing values [12]. Data duplication is an ongoing data quality problem reported in diverse domains, including health care, business, and molecular biology [13]. The presence of duplicate data leads to inconsistency and redundancy, which have several adverse consequences for classification. This issue can be handled by detecting and eliminating duplicates.
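The three cleaning steps named above (duplicate elimination, imputation of missing values, and smoothing by binning) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure used in the cited works: the function names and the toy values are hypothetical, missing entries are assumed to be encoded as `None`, mean imputation is just one of many imputation techniques, and equal-frequency bins smoothed by their means is just one binning variant.

```python
from statistics import mean


def drop_duplicates(rows):
    """Remove exact duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_mean(rows):
    """Replace missing values (None) with the mean of the observed values
    in the same column. Assumes every column has at least one observed value."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning: order the values, split them into bins,
    and replace every value by the mean of its bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bin_size = -(-len(values) // n_bins)  # ceiling division
    smoothed = [0.0] * len(values)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        bin_mean = mean(values[i] for i in members)
        for i in members:
            smoothed[i] = bin_mean
    return smoothed


# Hypothetical toy records: two measured features per sample.
records = [[2.0, 10.0], [4.0, None], [2.0, 10.0], [6.0, 14.0]]
cleaned = impute_mean(drop_duplicates(records))

# Hypothetical noisy measurements smoothed with three bins.
noisy = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(noisy, 3)
```

In this sketch duplicates are removed before imputation so that repeated records do not bias the column means; whether that ordering is appropriate depends on the dataset.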

Biological data may contain thousands of features because they are highly dependent on comparing the behavior of various biological units. Such data often contain a large amount of irrelevant information, which degrades classification accuracy and machine learning efficiency [14]. Dimensionality reduction techniques focus on reducing the number of input features, which helps to reduce computation time and redundant data. Dimensionality can be reduced using different methods, such as Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), and feature selection.
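As a rough illustration of reducing the number of input features, the sketch below applies a variance-threshold filter, one very simple form of feature selection. It is not PCA or LDA, which transform the features rather than discard them and are normally run through a numerical library. The function name, the threshold, and the toy matrix are assumptions made for the example.

```python
from statistics import pvariance


def select_by_variance(rows, threshold):
    """Keep only the columns whose population variance exceeds `threshold`.

    Returns the indices of the kept columns and the reduced data matrix.
    Near-constant features carry little information for a classifier.
    """
    n_cols = len(rows[0])
    keep = [
        j for j in range(n_cols)
        if pvariance([row[j] for row in rows]) > threshold
    ]
    reduced = [[row[j] for j in keep] for row in rows]
    return keep, reduced


# Hypothetical matrix: three samples, three features.
# Feature 0 is constant and feature 2 barely varies.
samples = [
    [1.0, 5.0, 0.1],
    [1.0, 7.0, 0.1],
    [1.0, 9.0, 0.2],
]
kept_columns, reduced_samples = select_by_variance(samples, threshold=0.01)
```

A variance filter ignores the class labels, so it can discard a low-variance feature that is nevertheless discriminative; supervised selection methods or LDA take the labels into account.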
