2.2. Supervised and unsupervised learning
The biggest difference between general machine learning systems and the subset of deep learning tools is the degree to which the system needs instructions. Programmes do end up developing knowledge on their own in both cases. However, in more traditional machine learning systems, this autonomous development is heavily directed by coding, especially during the training stages.57 Conversely, in the case of deep learning, the computer is able to draw conclusions, develop knowledge and tune itself with very little instruction.58 In these cases, the machine is coded so that it is mostly autonomous and can learn from its surroundings, reaching conclusions in a manner very similar to the way a human brain works.59
These new technologies can be used to confirm suspected correlations between variables. In fact, algorithms are constantly put to use to verify relationships in data.60 This kind of analysis is known as top-down or supervised learning and is much closer to traditional statistical analysis61 as it requires feeding the computer a selected sample of data from which to extract the relevant relationships. This initial set of data is known as training data, and it needs to be collected and prepared before it is processed.
However, a different way in which machine learning can be put to use is through bottom-up analysis, also known as unsupervised learning. These techniques differ greatly from traditional statistical analysis since, instead of developing a hypothesis which is then tested repeatedly against the available data, data is fed into a computer programme, which then extracts the relevant hypotheses.62
The inversion of the traditional process is made possible by the vast amount of data that is now available and the development of the technologies needed to process it. The enormous volume of data makes it very hard for human beings to detect the possible relationships within it, thus rendering the use of automated systems necessary. In addition, the availability of such large quantities of data in theory ensures that data-processing computer programmes will reach more accurate results than they would when processing smaller datasets.63
Through the following example, we aim to illustrate the difference between supervised and unsupervised learning. Imagine a supermarket chain believed that a group of employees was stealing cleaning products and asked the firm’s in-house IT department to develop an algorithm to verify this suspicion and identify the responsible parties. The individuals developing the system would select a representative sample of employees and inform the algorithm of what behaviour is considered stealing and what behaviour is not. In other words, the data provided would be labelled. The company would then follow the daily work routine of the selected sample of employees so as to determine which staff members were stealing and which were not. This information would be supplemented with other data such as employee schedules, positions in the company’s hierarchy, behaviour in the moments before and after stealing, and the days and times at which the products were stolen, amongst others.
All the information gathered would enable the firm to estimate the cost of stolen cleaning products for the selected sample and extrapolate it to the entire firm. More importantly, by feeding this information into a machine learning algorithm, the supermarket chain would be able to build a model to predict which employees would steal cleaning products and flag them so that enhanced supervision could be placed on suspicious workers.
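By way of illustration only, the following is a minimal sketch of what such a supervised model might look like in code. The feature names, the toy data and the choice of library (scikit-learn’s LogisticRegression) are assumptions made for this example and are not drawn from the text; the point is simply that the model is fitted on data that humans have already labelled.

```python
# Minimal sketch of a supervised (top-down) model for the supermarket example.
# The feature names, data and library choice (scikit-learn) are illustrative
# assumptions, not taken from the text.
from sklearn.linear_model import LogisticRegression

# Labelled training data: each row describes one observed employee shift,
# e.g. [hours_worked, night_shift (0/1), stockroom_access (0/1)].
X_train = [
    [8, 0, 0],
    [10, 1, 1],
    [9, 1, 1],
    [7, 0, 0],
]
# Labels supplied by the humans who observed the sample: 1 = theft observed.
y_train = [0, 1, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)          # learn the relationship between features and label

# Apply the learned model to employees outside the observed sample.
X_new = [[9, 1, 0], [8, 0, 1]]
print(model.predict(X_new))          # predicted labels: flag employees for closer supervision
print(model.predict_proba(X_new))    # predicted probabilities behind each flag
```

The essential point is that the fitting step receives both the observed behaviour and the human-supplied label, so the model only learns the relationship the organisation already suspected and chose to encode.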
Under the bottom-up or unsupervised approach, all of the data regarding employees, their behaviours and their schedules would be fed into the machine learning algorithm so that it could identify any relevant relationships, not just with regard to the stealing problem but also in relation to any other issue, such as job performance. In this case, the historical data fed to the algorithm would not be labelled, and the relationships between the data and the results would be completely unknown.
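A correspondingly minimal and purely illustrative sketch of the unsupervised variant might use a clustering algorithm such as k-means; again, the data, features and number of clusters are invented for this example. Here no labels are supplied, and the meaning of the resulting groups is not known in advance.

```python
# Minimal sketch of the unsupervised (bottom-up) variant: the same kind of
# employee data, but with no theft labels. Feature names, data and the choice
# of k-means clustering are illustrative assumptions.
from sklearn.cluster import KMeans

# Unlabelled data: [hours_worked, night_shift (0/1), stockroom_access (0/1)].
X = [
    [8, 0, 0],
    [10, 1, 1],
    [9, 1, 1],
    [7, 0, 0],
    [8, 1, 0],
    [10, 0, 1],
]

# The algorithm groups employees by similarity; what each group "means"
# (theft risk, job performance, something else) is not known in advance.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = clusterer.fit_predict(X)
print(groups)   # cluster index assigned to each employee
```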
Hence, the key element in supervised learning is that the information being searched for is already known and labelled; this enables the organisation to feed the algorithm with already selected and partly pre-processed data.64 Conversely, unsupervised learning is used to extract completely unknown and unsuspected relationships in the data, leading to a reduction in human control over the process.65 Nonetheless, although the information being searched for is already known in supervised learning, which undoubtedly facilitates the implementation of tools aimed at controlling these systems, algorithms and models using supervised learning are still highly opaque.
Supervised and unsupervised learning can be complementary since, once subjects have been clustered according to the relationships between their input variables, these groups can then be used as the basis for supervised analysis.66 Supervised learning algorithms are used in applications such as predictive policing, credit scoring and predicting employee performance, and thus currently have greater implications from a legal perspective.67
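As a rough, assumption-laden sketch of this complementarity, the cluster assignments produced without labels could be appended as an additional input variable for a subsequent supervised model; the data and features below are once more invented purely for illustration.

```python
# Sketch of combining the two approaches: unsupervised cluster assignments
# become an extra input feature for a later supervised model. Data and
# features are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = [[8, 0, 0], [10, 1, 1], [9, 1, 1], [7, 0, 0], [8, 1, 0], [10, 0, 1]]
y = [0, 1, 1, 0, 0, 1]               # labels later gathered for these employees

# Step 1: cluster the unlabelled data.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: append each employee's cluster index to their feature vector.
X_augmented = [row + [int(c)] for row, c in zip(X, clusters)]

# Step 3: fit a supervised model on the augmented, labelled data.
model = LogisticRegression().fit(X_augmented, y)
print(model.predict(X_augmented))    # supervised predictions informed by the clustering
```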