Data protection for the prevention of algorithmic discrimination - Alba Soriano Arnanz

1.1. The three Vs in big data

The three Vs of big data were first identified in a paper on how the surge in e-commerce in the early 2000s affected data management,[19] although the paper did not use the phrase “big data”. This first element in the conceptualisation of big data mostly refers to the characteristics that a dataset must have in order to be labelled as big data.

The first V, volume, refers to the depth and amount of available data.[20] The amount of digitally stored information has increased massively over the past couple of decades, from 25% of the world’s information being stored digitally in 2000 to 98% in 2013.[21] This shift must also be considered alongside the absolute increase in available information. In general, then, there is an unprecedented amount of information available, most of which is digitally stored and can therefore be processed using the technologies that are the object of this study.

These large quantities of data are created and delivered at a very high speed (velocity) and appear in increasingly different formats (variety). Variety means that the data may come not only in structured formats but also in semi-structured or unstructured ones. A typical example of structured data is an Excel spreadsheet containing information on a series of people, such as their names and contact details.[22] Semi-structured data do not follow a rigidly defined structure and include much more information than structured formats. Word documents and emails fall within this type of data format. Extracting and organising the information they contain is not as straightforward as it is with tables of data, but they do offer a certain degree of organisation and, consequently, a blueprint for systematising data and obtaining information.[23] For example, email metadata makes it easier to classify the information contained in the actual message. Unstructured data may appear in many kinds of formats, such as text, video or audio. In general, it is much harder to establish relationships between pieces of semi-structured or unstructured data than between pieces of structured data.[24]
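The difference between structured and semi-structured formats can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table, the email, the names, addresses and neighbourhoods are all hypothetical, and the two helper functions are invented for this example. It uses only the standard `csv` and `email` modules.

```python
import csv
import io
from email import message_from_string

# Structured data: a hypothetical table in which every value sits in a named column.
CSV_TEXT = """name,email,neighbourhood
Ana Perez,ana@example.com,Ruzafa
Luis Marti,luis@example.com,Benimaclet
"""

# Semi-structured data: a hypothetical email. The headers (metadata) are organised,
# but the body is free text with no predefined fields.
EMAIL_TEXT = """From: ana@example.com
To: luis@example.com
Subject: Moving day
Date: Mon, 04 Mar 2013 10:15:00 +0000

Hi Luis, we have finally moved to our new flat near the market in Ruzafa.
"""


def read_structured(text):
    """Return one dictionary per table row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def read_semi_structured(text):
    """Return the email's header metadata and its free-text body separately."""
    msg = message_from_string(text)
    metadata = {key.lower(): msg[key] for key in ("From", "To", "Subject", "Date")}
    return metadata, msg.get_payload()


rows = read_structured(CSV_TEXT)
metadata, body = read_semi_structured(EMAIL_TEXT)

# In the table, the neighbourhood is a direct lookup.
print(rows[0]["name"], "->", rows[0]["neighbourhood"])

# In the email, the metadata helps classify the message, but the neighbourhood
# is buried in the body and would require further text analysis to extract.
print(metadata["from"], "|", metadata["subject"])
print(body.strip())
```

In the table, the neighbourhood is retrieved by column name. In the email, the headers say who wrote to whom and about what, but the location itself remains inside free text, which is why relationships are harder to establish as the structure weakens.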

For example, it is much harder, both for machines and for human beings, to extract the relevant information from a picture than from a table of data. If the objective of an algorithm is to work out which neighbourhoods different families live in and the only data available are pictures posted on social media (unstructured data), it will be much more difficult to extract the relevant information from such raw data than from an Excel document listing the names and addresses of individuals (structured data). The similarities between unstructured and semi-structured formats lead some tech experts to distinguish only between structured and unstructured types of data.
