Читать книгу The Creativity Code - Marcus du Sautoy - Страница 16
Desert island algorithm
ОглавлениеOne of the most extraordinary algorithms of the modern age is the one that helps millions of us navigate the internet every day. If I were cast away on a desert island and could only take one algorithm with me, I’d probably choose the one that drives Google. (Not that it would be much use, as I’d be unlikely to have an internet connection.)
In the early days of the internet (we’re talking the early 1990s) there was a directory that listed all of the existing websites. In 1994 there were only 3000 of them. The internet was small enough for you to pretty easily thumb through and find what you were looking for. Things have changed quite a bit since then. When I started writing this paragraph there were 1,267,084,131 websites live on the internet. A few sentences later that number has gone up to 1,267,085,440. (You can check the current status here: http://www.internetlivestats.com/.)
How does Google figure out exactly which one of the billion websites to recommend? Mary Ashwood, an 86-year-old granny from Wigan, was careful to send her requests with a courteous ‘please’ and ‘thank you’, perhaps imagining an industrious group of interns on the other end sifting through the endless requests. When her grandson Ben opened her laptop and found ‘Please translate these roman numerals mcmxcviii thank you’, he couldn’t resist tweeting the world about his nan’s misconception. He got a shock when someone at Google replied with the following tweet:
Dearest Ben’s Nan.
Hope you’re well.
In a world of billions of Searches, yours made us smile.
Oh, and it’s 1998.
Thank YOU
Ben’s Nan brought out the human in Google on this occasion, but there is no way any company could respond personally to the million searches Google receives every fifteen seconds. So if it isn’t magic Google elves scouring the internet, how does Google succeed in so spectacularly locating the answers you want?
It all comes down to the power and beauty of the algorithm Larry Page and Sergey Brin cooked up in their dorm rooms at Stanford in 1996. They originally wanted to call their new algorithm ‘Backrub’, but eventually settled instead on ‘Google’, inspired by the mathematical number for one followed by 100 zeros, which is known as a googol. Their mission was to find a way to rank pages on the internet to help navigate this growing database, so a huge number seemed like a cool name.
It isn’t that there weren’t other algorithms out there being used to do the same thing, but these were pretty simple in their conception. If you wanted to find out more about ‘the polite granny and Google’, existing algorithms would have identified all of the pages with these words and listed them in order, putting the websites with the most occurrences of the search terms up at the top.
That’s OK but easily hackable: any florist who sticks into their webpage’s meta-data the words ‘Mother’s Day Flowers’ a thousand times will shoot to the top of every son or daughter’s search. You want a search engine that can’t easily be pushed around by savvy web designers. So how can you come up with an unbiased measure of the importance of a website? And how can you find out which sites you can ignore?
Page and Brin struck on the clever idea that if a website has many links pointing to it, then those other sites are signalling that it is worth visiting. The idea is to democratise the measure of a website’s worth by letting other websites vote for who they think is important. But, again, this could be hacked. I just need to set up a thousand artificial websites linking to my florist’s website and it will bump the site up the list. To prevent this, they decided to give more weight to a vote that came from a website that itself commanded respect.
This still left them with a challenge: how do you rank the importance of one site over another? Take this mini-network as an example.
We want to start by giving each site equal weight. Let’s think of the websites as buckets; we’ll give each site eight balls to indicate that they have equal rank. Now the websites have to give their balls to the sites they link to. If they link to more than one site, then they will share their balls equally. Since website A links to both website B and website C, for example, it will give 4 balls to each site. Website B, however, has decided to link only to website C, putting all eight of its balls into website C’s bucket.
After the first distribution, website C comes out looking very strong. But we need to keep repeating the process because website A will be boosted by the fact that it is being linked to by the now high-ranking website C. The table below shows how the balls move around as we iterate the process.
At the moment, this does not seem to be a particularly good algorithm. It appears not to stabilise and is rather inefficient, failing two of our criteria for the ideal algorithm. Page and Brin’s great insight was to realise that they needed to find a way to assign the balls by looking at the connectivity of the network. It turned out they’d been taught a clever trick in their university mathematics course that worked out the correct distribution in one step.
The trick starts by constructing a matrix which records the way that the balls are redistributed among the websites. The first column of the matrix records the proportion going from website A to the other websites. In this case 0.5 goes to website B and 0.5 to website C. The matrix of redistribution is therefore given by the following matrix:
The challenge is to find what is called the eigenvector of this matrix with eigenvalue 1. This is a column vector that does not get changed when multiplied by the matrix.1 Finding these eigenvectors or stability points is something we teach undergraduates early on in their university career. In the case of our network we find that the following column vector is stabilised by the redistribution matrix:
This means that if we split the balls in a 2:1:2 distribution we see that this weighting is stable. Distribute the balls using our previous game and the sites still have a 2:1:2 distribution.
Eigenvectors of matrices are an incredibly potent tool in mathematics and the sciences more generally. They are the secret to working out the energy levels of particles in quantum physics. They can tell you the stability of a rotating fluid like a spinning star or the reproduction rate of a virus. They may even be key to understanding how prime numbers are distributed throughout all numbers.
By calculating the eigenvector of the network’s connectivity we see that websites A and C should be ranked equally. Although website A is linked to by only one site (website C), the fact that website C is highly valued and links only to website A means that its link bestows high value to website A.
This is the basic core of the algorithm. There are a few extra subtleties that need to be introduced to get the algorithm working in its full glory. For example, the algorithm needs to take into account anomalies like websites that don’t link to any other websites and become sinks for the balls being redistributed. But at its heart is this simple idea.
Although the basic engine is very public, there are parameters inside the algorithm that are kept secret and change over time, and which make the algorithm a little harder to hack. But the fascinating thing is the robustness of the Google algorithm and its imperviousness to being gamed. It is very difficult for a website to do anything on its own site that will increase its rank. It must rely on others to boost its position. If you look at the websites that Google’s page rank algorithm scores highly, you will see a lot of major news sources and university websites like Oxford and Harvard. This is because many outside websites will link to findings and opinions on university websites, because the research we do is valued by many people across the world.
Interestingly this means that when anyone with a website within the Oxford network links to an external site, the link will cause a boost to the external website’s page rank, as Oxford is sharing a bit of its huge prestige (or cache of balls) with that website. This is why I often get requests to link from my website in the maths department at Oxford to external websites. The link will help increase the external website’s rank and, it is hoped, make it appear on the first page of a Google search, the ultimate holy grail for any website.
But the algorithm isn’t immune to clever attacks by those who understand how the mathematics works. For a short period in the summer of 2018, if you googled ‘idiot’ the first image that appeared was that of Donald Trump. Activists had understood how to exploit the powerful position that the website Reddit has on the internet. By getting people to vote for a post on the site containing the words ‘idiot’ and an image of Trump, the connection between the two shot to the top of the Google ranking. The spike was smoothed out over time by the algorithm rather than by manual intervention. Google does not like to play God but trusts in the long run in the power of its mathematics.
The internet is of course a dynamic beast, with new websites emerging every nanosecond and new links being made as existing sites are shut down or updated. This means that page ranks need to change dynamically. In order for Google to keep pace with the constant evolution of the internet, it must regularly trawl through the network and update its count of the links between sites using what it rather endearingly calls ‘Google spiders’.
Tech junkies and sports coaches have discovered that this way of evaluating the nodes in a network can also be applied to other networks. One of the most intriguing external applications has been in the realm of football (of the European kind, which Americans think of as soccer). When sizing up the opposition, it can be important to identify a key player who will control the way the team plays or be the hub through which all play seems to pass. If you can identify this player and neutralise them early on in the game, then you can effectively close down the team’s strategy.
Two London-based mathematicians, Javier López Peña and Hugo Touchette, both football fanatics, decided to see whether Google’s algorithm might help analyse the teams gearing up for the World Cup. If you think of each player as a website and a pass from one player to another as a link from one website to another, then the passes made over the course of a game can be thought of as a network. A pass to a teammate is a mark of the trust you put in that player – players generally avoid passing to a weak teammate who might easily lose the ball, and you will only be passed to if you make yourself available. A static player will rarely be available for a pass.
They decided to use passing data made available by FIFA during the 2010 World Cup to see which players ranked most highly. The results were fascinating. If you analysed England’s style of play, two players, Steven Gerrard and Frank Lampard, emerged with a markedly higher rank than others. This reflects the fact that the ball very often went through these two midfielders: take them out and England’s game collapses. England did not get very far that year in the World Cup – they were knocked out early by their old nemesis, Germany.
Contrast this with the eventual winners: Spain. The algorithm shared the rank uniformly around the whole team, indicating that there was no clear hub through which the game was being played. This is a reflection of the very successful ‘total football’ or ‘tiki-taka’ style played by Spain, in which players constantly pass the ball around, a strategy that contributed to Spain’s ultimate success.
Unlike many sports in America that thrive on data, it has taken some time for football to take advantage of the mathematics and statistics bubbling underneath the game. But by the 2018 World Cup in Russia many teams boasted a scientist on board to crunch the numbers to understand the strengths and weaknesses of the opposition, including how the network of each team behaves.
A network analysis has even been applied to literature. Andrew Beveridge and Jie Shan took the epic saga A Song of Ice and Fire by George R. R. Martin, otherwise known as Game of Thrones. Anyone who knows the story will be aware that predicting which characters will make it through to the next volume, let alone the next chapter, is notoriously tricky, as Martin is ruthless at killing off even the best characters he has created.
Beveridge and Shan decided to create a network between characters in the books. They identified 107 key people who became the nodes of the network. The characters were then connected with weighted edges according to the strength of the relationship. But how can an algorithm assess the importance of a connection? The algorithm was simply asked to count the number of times the two names appeared in the text within fifteen words of each other. This doesn’t measure friendship – it indicates some measure of interaction or connection between them.
They decided to analyse the third volume in the series, A Storm of Swords, as the narrative had settled down by this point, and began by constructing a page rank analysis of the nodes or characters in the network. Three characters quickly stood out as important to the plot: Tyrion, Jon Snow and Sansa Stark. Anyone who has read the books or seen the series would not be surprised by this revelation. What is striking is that a computer algorithm which does not understand what it is reading achieved this same insight. It did so not simply by counting how many times a character’s name appears – that would pull out other names. It turned out that a subtler analysis of the network revealed the true protagonists.
To date, all three characters have survived Martin’s ruthless pen which has cut short some of the other key characters in the third volume. This is the mark of a good algorithm: it can be used in multiple scenarios. This one can tell you something useful from football to Game of Thrones.