Bots, by Nick Monaco
Spambots and the development of the Robot Exclusion Standard
We have already seen that bots can be used for either good or bad ends, and World Wide Web bots were no different. Originally used as a solution to the problem of organizing and trawling through vast amounts of information on the World Wide Web, bots were quickly adapted for more devious purposes. As the 1990s went on and the World Wide Web (and other online communities like Usenet and IRC) continued to grow, entrepreneurial technologists realized that there was a captive audience on the other end of the terminal. This insight led to the birth of the spambot: an automated online tool for promoting commercial products and advertisements at scale.
One of the very first spambots appeared on Usenet. In April 1994, two lawyers, Laurence Canter and Martha Siegel, contracted a programmer to help promote an advert for their law firm’s assistance in the US Green Card Lottery. The programmer decided to use automation to reach as many users as possible. His bot – considered the first spambot on the modern internet – posted the ad to 6,000 newsgroups in under ninety minutes. The incident provoked a strong backlash from the Usenet community, and one user responded by building a cancelbot that removed all of the spambot’s posts from the targeted newsgroups (Leonard, 1997, pp. 165–167).
What happened on Usenet was a precursor to more widespread spambot swarms on the internet at large, especially in email (Ohno, 2018). Incidents like the botwars on Usenet newsgroups and IRC servers had, by the late 1990s, made it all too clear that bots would not be only a positive force on the internet. Negative uses of bots (spreading spam, crashing servers, denying content and services to humans, and posting irrelevant content en masse, to name just a few) could easily cause great harm. Perhaps the most damaging use was crawling websites to gather private or sensitive information.
To solve the problem of bots crawling sensitive websites, a Dutch engineer named Martijn Koster developed the Robot Exclusion Standard (Koster, 1994, 1996). The Robot Exclusion Standard (RES) is a simple convention that functions as a digital “Do Not Enter” sign. A site that follows the convention publishes a “robots.txt” file that states which content bots are allowed to access. Some sites allow bots to access any part of their domain, others allow access to some (but not all) parts of the website, and still others disallow bot access altogether. A site’s robots.txt file, where one exists, can be found by adding “/robots.txt” to the end of the site’s root URL. For instance, you can access Facebook’s instructions for crawler bots at facebook.com/robots.txt. As you would expect, this file disallows nearly all forms of crawling on Facebook’s platform, since this would violate users’ privacy, as well as the platform’s terms of service.
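The three kinds of policy described above can be illustrated with a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt contents, the `example.com` host, the `ExampleBot` user agent, and the helper names `build_parser` and `check` are all assumptions made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Three hypothetical robots.txt files, one per policy described in the text.
POLICIES = {
    # An empty Disallow value means nothing is off limits.
    "allow_all": "User-agent: *\nDisallow:\n",
    # Only paths under /private/ are off limits.
    "partial": "User-agent: *\nDisallow: /private/\n",
    # Every path on the site is off limits.
    "deny_all": "User-agent: *\nDisallow: /\n",
}


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check(policy_name: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under the named policy."""
    parser = build_parser(POLICIES[policy_name])
    return parser.can_fetch(agent, "https://example.com" + path)


if __name__ == "__main__":
    for name in POLICIES:
        for path in ("/public/page", "/private/data"):
            print(f"{name:10} {path:15} allowed={check(name, path)}")
```

A crawler working against a live site would typically call `set_url()` and `read()` on the parser to download the file instead of passing the text in directly.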
The late 1990s saw several high-profile examples of controversial bots: some followed these standards while arguably violating their intentions, and others proudly flouted them. RoverBot, a crawler created in 1996, was one of these controversial bots. It retrieved a set of websites relating to a pre-specified topic and scraped email addresses from them. The company that built RoverBot then sold these lists of email addresses to paying customers, who used them to send out spam advertisements. While RoverBot certainly had its detractors, the firm behind it insisted that it followed rules (such as the RES) while scraping the web.
Other spambots did not even follow the letter of the law. For example, a bot known as ActiveAgent ignored the RES altogether, scraping any website it could find for email addresses, regardless of the site’s policies on bot access. The anonymous developer behind ActiveAgent had a different business model, though. Rather than selling the email addresses the bot collected, the developer sold the bot’s source code to aspiring spammers for $100. Buyers could then modify this code for their own purposes, sending out spam emails with whatever message or product they wanted (Leonard, 1997, pp. 140–144). Thanks in part to malicious developers like the one behind ActiveAgent, new spamming techniques quickly multiplied as the web grew. Today, spambots and spamming techniques are still evolving and thriving. Estimates vary greatly, but by some counts as much as 84 percent of all email was spam as of October 2020 (Cisco Talos Intelligence, 2020).
Clearly, the RES is not an absolute means of shutting down crawler bot activity online – it’s an honor system that presumes good faith on the part of bot developers, who must actively decide to make each bot honor the convention and encode these values into the bot’s programming. Despite these imperfections, the RES has seen success and, for that reason, continues to underlie bot governance to this day. It is an efficient way to let bot designers know when they are violating a site’s terms of service and possibly the law.
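To make the honor-system point concrete, here is a minimal sketch of what "encoding these values into the bot's programming" might look like. The check sits in the crawler's own code; nothing on the server enforces it, and a bot that skips the call simply fetches everything. The robots.txt text, the URLs, the bot names `FriendlyBot` and `HarvesterBot`, and the function `filter_allowed` are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is banned outright, and every
# other bot is asked to stay out of /members/.
ROBOTS_TXT = """\
User-agent: HarvesterBot
Disallow: /

User-agent: *
Disallow: /members/
"""

URLS = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/members/list",
]


def filter_allowed(urls, robots_txt, user_agent):
    """Split `urls` into (allowed, skipped) according to `robots_txt`.

    This is a voluntary step: the crawler's author chooses to call it
    before fetching anything.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed, skipped = [], []
    for url in urls:
        if parser.can_fetch(user_agent, url):
            allowed.append(url)
        else:
            skipped.append(url)
    return allowed, skipped


if __name__ == "__main__":
    for agent in ("FriendlyBot", "HarvesterBot"):
        allowed, skipped = filter_allowed(URLS, ROBOTS_TXT, agent)
        print(agent, "may fetch:", allowed, "must skip:", skipped)
```

A compliant crawler would fetch only the `allowed` list. A bot that ignores the convention never runs the filter at all, which is why the standard depends on developers' good faith.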