Crawlers, web-indexing bots
The World Wide Web became widely available in the early 1990s, growing exponentially more complex and difficult to navigate as it gained more and more users. Gradually, people began to realize that there was simply too much information on the web for humans to navigate easily. It was clear to the companies and researchers at the forefront of computing that a tool was needed to help humans make sense of the vast web. Bots came to fill this void, playing a new infrastructural role as an intermediary between humans and the internet itself. Computer programs were developed to move from webpage to webpage, analyzing and organizing their content (“indexing”) so that it was easily searchable. These bots were often called “crawlers” or “spiders,”6 since they “crawled” across the web to gather information. Without bots visiting sites on the internet and taking notes on their content, humans simply couldn’t know what websites were online. This fact is as true today as it was back then.
The basic logic that drives crawlers is very simple. At their base, websites are text files. These text files are written in hypertext markup language (HTML), a standardized format that is the primary base language of all websites.7 HTML documents can be accessed with an HTTP call; users submit one every time they type a webpage’s URL into a browser and press enter, or click on a link. One of the core features of HTML – the one that enables the World Wide Web to exist as a network of HTML pages – is the ability to embed hypertext, or “links,” to outside documents within a webpage. Crawler bots work by accessing a website through an HTTP call, collecting the hyperlinks embedded within the website’s HTML code, then visiting those hyperlinks with further HTTP calls. This process is repeated over and over again to map and catalogue web content. Along the way, crawler bots can be programmed to download the HTML underlying every website they visit, or to record facts about those sites in real time (such as whether a site appears to be a news outlet or an e-commerce site).
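To make this loop concrete, the following is a minimal sketch in Python (not from the book) of the fetch–extract–follow cycle just described, using only the standard library. The starting URL (example.com) and the ten-page cap are arbitrary placeholders, and a real indexing crawler would add politeness rules, robots.txt handling, and persistent storage.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=10):
    """Visit pages starting from start_url, following embedded hyperlinks."""
    seen = {start_url}
    queue = deque([start_url])
    index = {}  # URL -> list of outgoing links ("taking notes" on each page)

    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as response:  # the HTTP call
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to load

        parser = LinkExtractor()
        parser.feed(html)  # parse the HTML and gather its embedded hyperlinks
        outgoing = [urljoin(url, link) for link in parser.links]
        index[url] = outgoing

        for link in outgoing:  # enqueue newly discovered pages for later visits
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)

    return index


if __name__ == "__main__":
    # Hypothetical starting point; any publicly reachable page would do.
    for page, links in crawl("https://example.com").items():
        print(page, "->", len(links), "links")

The design mirrors the paragraph above: each iteration is one HTTP call, the parser pulls out the hyperlinks embedded in the page’s HTML, and the queue of unvisited links is what lets the process repeat “over and over again” across the web.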
Initially, these bots crawled the web and took notes on all the URLs they visited, assembling this information in a database known as a “Web Directory” – a place users could visit to see what websites existed on the web and what they were about. Quickly, advertisers and investors poured funds into these proto-search engines, realizing how many eyes these services would attract each day as the internet continued to grow (Leonard, 1996).
Though Google eventually became the dominant search engine for navigating the web, the 1990s saw a host of corporate and individual search engine start-ups, all of which used bots to index the web. The first of these was Matthew Gray’s World Wide Web Wanderer in 1993. The next year, Brian Pinkerton wrote WebCrawler, and Michael Mauldin created Lycos (Latin for “wolf spider”), both of which were even more powerful spiders than the World Wide Web Wanderer. Other search engines, like AltaVista and (later) Google, also employed bots to perfect the art of searching for8 and organizing information on the web9 (Indiana University Knowledge Base, 2020; Leonard, 1997, pp. 121–124). The indexable internet – that is, publicly available websites on the World Wide Web that allow themselves to be visited by crawler bots and be listed in search engine results – is known as the “clear web.”10