Data Mining and Machine Learning Applications - Group of authors - Page 19

1.4 Data Mining Algorithms


AdaBoost, KNN, PageRank, Naïve Bayes, Support Vector Machine (SVM), Apriori, and C4.5 are some well-known data mining algorithms. Data mining algorithms are primarily used for modeling tasks, which include clustering and classification problems. Let us discuss each of these two tasks in detail [1–6].

 Classification

Classification is a data mining task in which data are modeled and assigned to classes. In other words, it is a process in which given objects are categorized into one of a set of classes. First, a training set of observations with known classes is identified; a model learned from it is then used to classify new observations. Hence, the task has two phases: the learning/training phase and the classification of the given objects. For example, a bank manager may wish to classify customer loans into categories such as risky and less risky, based on factors such as the customer's trustworthiness. To apply classification to the given objects, one uses one or more classifiers: rules are applied, the classifier is trained, and the given data are assigned to the desired classes. The following classification algorithms can be used in data mining:

 Logistic regression

 Naïve Bayes

 K-nearest neighbors (KNN)

 Decision tree

 Random forest

 Support vector machine (SVM).
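
The two phases described above (learning/training, then classification) can be sketched with a K-nearest-neighbors classifier, one of the algorithms in the list. This is a minimal illustration only: the loan features (income and existing debt, in thousands), the labels, and the function names `train` and `classify` are hypothetical and do not come from the source.

```python
from collections import Counter
from math import dist

# Hypothetical training set: (income, existing debt) in thousands -> known class.
TRAINING_SET = [
    ((80.0, 5.0), "less risky"),
    ((75.0, 10.0), "less risky"),
    ((90.0, 8.0), "less risky"),
    ((30.0, 25.0), "risky"),
    ((25.0, 30.0), "risky"),
    ((35.0, 28.0), "risky"),
]


def train(examples):
    """Learning phase: for KNN this simply stores the labeled examples."""
    return list(examples)


def classify(model, features, k=3):
    """Classification phase: majority class among the k nearest stored examples."""
    nearest = sorted(model, key=lambda example: dist(example[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


model = train(TRAINING_SET)
print(classify(model, (78.0, 7.0)))   # new loan close to the "less risky" examples
print(classify(model, (28.0, 27.0)))  # new loan close to the "risky" examples
```

Other classifiers from the list follow the same train-then-classify pattern; only what is learned in the first phase differs.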

 Clustering

Clustering is the grouping of objects based on similarity. A threshold is applied, and an object is added to a specific cluster when it satisfies that cluster's criterion. This technique is helpful in various applications, such as:

 Market basket analysis

 Pattern recognition

 Image processing

 Financial analysis.

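Clustering is categorized as unsupervised learning: no class labels are provided, and the given data are compared against a threshold (a predefined value). Clusters can be described in terms of intra-cluster similarity (among objects within the same cluster) and inter-cluster similarity (between objects in different clusters).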
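
One common way to read the intra-cluster and inter-cluster distinction is as two distance measurements: the average distance between points in the same cluster, and the average distance between points in different clusters. The sketch below assumes small hypothetical 2-D clusters and Euclidean distance; the helper names are illustrative, not from the source.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points that have already been grouped into two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}


def mean_intra_distance(points):
    """Average Euclidean distance over all pairs inside one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(points_a, points_b):
    """Average Euclidean distance over all cross-cluster pairs."""
    total = sum(dist(p, q) for p in points_a for q in points_b)
    return total / (len(points_a) * len(points_b))


intra_a = mean_intra_distance(clusters["A"])
intra_b = mean_intra_distance(clusters["B"])
inter_ab = mean_inter_distance(clusters["A"], clusters["B"])
print(intra_a, intra_b, inter_ab)
```

A grouping is usually considered good when the intra-cluster distances are small relative to the inter-cluster distance, as is the case for these two well-separated groups.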

 Types of Clustering

Clustering is the grouping of elements based on similarity, and it is an unsupervised learning technique. One can apply partition clustering, also known as non-hierarchical clustering, to divide the data/records/values into ‘k’ groups/clusters. This is an iterative process that continues until every element has been processed. Users can also use a support vector machine (SVM) model, where ‘n’ features are identified in the initial phase and then processed to identify the relevant results.

 ◦ K-means: The K-means clustering algorithm can be used to train on the samples. With this method, training amounts to computing the distance between each sample and the cluster centers and identifying the nearest cluster for each sample. When center points are chosen, distances between samples are calculated, and a sample that lies farther away is more likely to be selected as a center point. (The Euclidean distance metric can be used in this case.) K-means stores ‘k’ centroids that define the clusters to be formed. An object/value is considered to be in a specific cluster if it is closer to that cluster’s centroid than to any other centroid.

 ◦ Hierarchical: This is one of the popular algorithms used in data mining and machine learning. The idea is to find the two clusters that are closest to each other and merge them into a single cluster, repeating this process until the desired clusters have been merged. Hierarchical clustering is divided into bottom-up and top-down approaches, known as agglomerative and divisive, respectively; the merging procedure just described is the agglomerative one. The result can be viewed as a nesting of clusters that together form a tree (the merged cluster).

 ◦ Fuzzy: Clusters are treated as fuzzy sets, and objects are allocated to these clusters. The method is unsupervised and, as its name suggests, assigns each point a degree of membership (a probability-like value) in multiple clusters instead of placing it in a single cluster. It is also known as soft clustering. One of its popular applications is pattern recognition. Its primary objective is the minimization of an objective function, so the number of iterations may increase. With ‘n’ iterations, the time complexity of the algorithm grows accordingly.
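
To contrast hard (K-means) and soft (fuzzy) assignment, the following sketch runs plain K-means iterations with Euclidean distance and then computes fuzzy-c-means-style membership degrees against the resulting centroids. It is a simplified illustration, not a full fuzzy c-means implementation: the sample points, the starting centroids, and the fuzzifier `m=2.0` are assumptions chosen for the example.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """K-means: assign each point to its nearest centroid, recompute means, repeat."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, labels


def fuzzy_memberships(point, centroids, m=2.0):
    """Soft assignment: membership degree of one point in every cluster (sums to 1)."""
    distances = [dist(point, c) for c in centroids]
    zeros = distances.count(0.0)
    if zeros:  # the point sits exactly on one or more centroids
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances) for d_i in distances]


# Hypothetical data: two well-separated groups; both starting centroids lie in the first.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0), (9.0, 9.0)]
centroids, labels = kmeans(points, centroids=[(1.0, 1.0), (2.0, 1.0)])
print(centroids, labels)
print(fuzzy_memberships((5.0, 5.0), centroids))  # a point midway between the clusters
```

K-means gives every point exactly one label, while the membership function spreads a point across clusters according to its relative distances, which is what makes the fuzzy approach a soft clustering.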

