Data mining. Textbook - Vadim Shmal - Page 6
Clustering
Clustering is the task of discovering groups and structures in data whose members are «similar» to some extent, not by relying on known structures or labels, but by learning from the data that is already there.
In particular, clustering is often used in such a way that new data points are only added to existing clusters, without changing the shape of those clusters to fit the new data. In other words, the clusters are formed from data that has already been collected, and they stay fixed as later data arrives rather than being recomputed each time.
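As a minimal sketch of this idea, the snippet below assigns new points to clusters that already exist, without moving them. The centroid values, the labels `A` and `B`, and the helper name `assign_to_cluster` are assumptions made for illustration; the text does not prescribe a specific assignment rule, so nearest-centroid assignment by Euclidean distance is used here as one common choice.

```python
import math

# Hypothetical centroids of clusters fitted earlier (values are assumptions).
CENTROIDS = {
    "A": (0.0, 0.0),
    "B": (5.0, 5.0),
}

def assign_to_cluster(point, centroids):
    """Return the label of the centroid closest to `point` (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

# New observations are only assigned; the centroids themselves are not updated.
new_points = [(0.5, 1.0), (4.0, 6.0), (2.0, 2.0)]
labels = [assign_to_cluster(p, CENTROIDS) for p in new_points]
print(labels)
```

Because `CENTROIDS` is never modified, the clusters keep their shape no matter how many new points are assigned.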
Given a set of parameters that describe (mostly) variable data, together with their «collinearity», clustering can be thought of as a hierarchical algorithm for finding groups of data points that satisfy a set of criteria. The parameters fall into two categories: values that define the spatial arrangement of clusters, and values that define the relationships between clusters.
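One way to picture the hierarchical view is bottom-up (agglomerative) merging. The sketch below is an assumption about what such an algorithm could look like, not the method of this book: it uses single linkage (the distance between two clusters is the distance between their closest members) and a hypothetical function name `single_linkage`.

```python
import math
from itertools import combinations

def single_linkage(points, num_clusters):
    """Merge the two closest clusters repeatedly until `num_clusters` remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > num_clusters:
        # combinations() yields index pairs (i, j) with i < j.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is still valid
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
result = single_linkage(points, num_clusters=2)
print(result)
```

Here the stopping criterion is a target number of clusters; a distance cutoff would be an equally reasonable criterion.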
Given a set of parameters for a dataset, clustering can be thought of as discovering those clusters. Which parameters do we use for this? The implicit clustering method, which finds the nearest clusters (or, in some versions, the clusters most similar to each other) at the lowest computational cost, is probably the simplest and most commonly used approach. In clustering, we aim to keep the points within each cluster as closely related to each other as possible, whether we do this by taking more measurements or by restricting ourselves to a particular data collection technique.
But what is the difference between clustering and splitting data into one or more datasets?
The methods of implicit clustering and managed clustering are actually very similar. The only difference is the parameters we use to decide along which direction to split the data. Take as an example a set of points on a sphere that define an interconnected network. Both methods aim to keep the network as close as possible to the network defined by the two nearest points, because we do not care whether we are very far from one point or the other. Using the implicit clustering algorithm (cluster distance), we divide the sphere into two parts that define very different networks: one is the network defined by the two closest points, and the other is the network defined by the two farthest points. The result is two completely separate networks. This is not a good approach, however: the further we move away from the two closest points, the larger the distance between the points becomes and the harder it is to find connections between them, since only a limited number of points are connected by a small distance.
On the other hand, the managed clustering method would require us to measure the distance between each pair of points, and then perform calculations that bring the networks closest to each other to the smallest possible distance. The result is likely to be two separate networks that are close to each other but not exactly the same. Since we need two networks to be similar to each other in order to detect a relationship, this method is unlikely to work: the two clusters will instead be completely different.
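Measuring the distance between each pair of points can be sketched as follows. The point names and coordinates are made up for illustration, and plain Euclidean distance in the plane stands in for whatever distance the application actually calls for (on a sphere, a great-circle distance would be a more natural choice).

```python
import math
from itertools import combinations

# Hypothetical points; names and coordinates are assumptions.
points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (4.0, 3.0), "d": (0.0, 2.0)}

# Distance for every unordered pair of points: n * (n - 1) / 2 entries.
pairwise = {
    (p, q): math.dist(points[p], points[q])
    for p, q in combinations(points, 2)
}

closest = min(pairwise, key=pairwise.get)
farthest = max(pairwise, key=pairwise.get)
print("closest pair:", closest, "farthest pair:", farthest)
```

The number of pairs grows quadratically with the number of points, which is the main cost of any method that needs all pairwise distances.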
The difference between these two methods comes down to how we define a «cluster». In the first method (cluster distance), we define a cluster as a set of points belonging to a network similar to the network defined by the two nearest points. By this definition, networks will always be connected (they will be the same distance apart) no matter how many points we include in the definition. In the second method (managed clustering), we define clusters as pairs of points that are the same distance from all other points in the network. This definition can make finding connected points very difficult, because it requires us to find every point that is similar to other points in the network.
However, this is an understandable compromise. By focusing on clusters that are the same distance from each other, we are likely to get more useful data: if we find connections between them, we can use that information to work out the relationship between them. We then have more opportunities to find connections, which makes relationships easier to identify. By defining clusters through distance measurements, we ensure that we can find a relationship between two points even if there is no way to measure the distance between them directly. But this often results in very few connections in the data.
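One possible reading of relating two points without a direct measurement between them is to link them through intermediate points. The sketch below groups points that are connected by chains of short hops; the function name `connected_groups` and the threshold `max_gap` are hypothetical, and this is an illustration of the idea rather than either of the two methods described above.

```python
import math

def connected_groups(points, max_gap):
    """Group point indices linked by chains of hops no longer than `max_gap`."""
    unvisited = set(range(len(points)))
    groups = []
    while unvisited:
        start = unvisited.pop()
        group, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            # Unvisited points within one hop of the current point.
            near = {i for i in unvisited
                    if math.dist(points[current], points[i]) <= max_gap}
            unvisited -= near
            group.extend(near)
            frontier.extend(near)
        groups.append(sorted(group))
    return sorted(groups)

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
groups = connected_groups(points, max_gap=1.5)
print(groups)
```

In this example the points at indices 0 and 2 are farther apart than `max_gap`, yet they end up in the same group because index 1 links them.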
Looking at the example of creating two datasets, one for implicit clustering and one for managed clustering, we can easily see the difference between the two methods. The results may be the same in one case and different in another. If a method is good at finding interesting relationships (as it usually is), it will give us useful information about the overall structure of the data. If it is not good at identifying relationships, it will give us very little information.
Let’s say we are developing a system for determining the direction of a new product and want to identify similar products. Since it is not possible to measure the direction of a product outside the system, we have to find relationships between products based on information about their names. If there is a good rule for establishing relationships between similar products, this information is very useful, because it lets us find interesting relationships by identifying similar products that appear close to each other. However, if the relationship between two products is not very obvious, the products are probably just unrelated, which means the feature detection method we choose may not matter much. On the other hand, if the relationship is not very obvious but extremely useful (as in the example above), we can start to learn how the product name is related to the process the product went through. This is an example of how different methods can produce very different results.
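A simple rule for relating products by name can be sketched with word overlap. The product names below are invented, and Jaccard similarity on lowercase words is only one assumed rule among many; the text does not say which rule the system would use.

```python
def name_tokens(name):
    """Lowercase a product name and split it into a set of words."""
    return set(name.lower().split())

def jaccard(a, b):
    """Share of distinct words the two names have in common (0.0 to 1.0)."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def most_similar(name, candidates):
    """Return the candidate (other than `name`) with the highest word overlap."""
    others = [c for c in candidates if c != name]
    return max(others, key=lambda c: jaccard(name, c))

# Hypothetical product names, made up for illustration.
products = ["Trail Running Shoes", "Road Running Shoes", "Espresso Machine"]

score = jaccard("Trail Running Shoes", "Road Running Shoes")
match = most_similar("Trail Running Shoes", products)
print(score, match)
```

A rule this simple only captures obvious relationships: names that share no words score zero, even when the products are related in ways the names do not show.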
Apart from the characteristics of the different methods, there are also different possible techniques. For example, when I say that my system uses image recognition, it does not necessarily mean that the process the product goes through uses image recognition. If there are product images that we have taken in the past, or if we have captured some input from a product image, the resulting system will probably not use image recognition. It could be something completely different, and much more complex. Each of these methods is capable of identifying very different things, and the result may depend on the characteristics of the actual data or on the data used. This means it is not enough to look at a specific type of tool: we also need to look at which type of tool will be used for a particular type of process. This is an example of how data analysis should not focus only on the problem being solved. Most likely, the system goes through many different processes, so we need to look at how different tools will be used to create a relationship between two points, and then decide which type of data to consider.
Often, we will be more concerned with how the method will be applied. For example, we might want to see what type of data is most likely to be useful for finding a relationship. There is not much difference in how natural language processing is applied from case to case, which means that natural language processing is a good choice if we want to find a relationship. However, natural language processing does not capture every possible relationship. It is often useful when we want to take a huge number of small steps, but it does little when we want to go really deep. Natural language processing lets you establish relationships in data that other methods cannot. This is one of the reasons why natural language processing can be useful but is not always necessary.
However, natural language processing often does not find connections as strong as those found by image recognition, because natural language processing focuses on simpler data whereas image recognition looks at very complex data. In such cases natural language processing is not very good, but it can still be useful. It is not always the best way to solve a problem: it can be useful if the data is simple, but it cannot always handle very complex data.
This example can be applied to many different types of data, but natural language processing is generally more useful for natural language data such as text files. For more complex data (such as images), natural language processing is often not enough. If natural language processing falls short, it is important to consider other methods, such as detecting words in an image and determining what data is actually stored in it. This type of data requires a different data structure to find the relationship.
With the increasing complexity of technology, we often do not have time to examine the data we are working with. Even if we do examine it, we may not find a good solution, because we have a large number of options but not much time to consider them all. This is why many companies employ a data scientist who can try many different approaches and then decide what works best for the data.