HOW AI GOT HERE

From zeros and ones to board game champion: How AI learned to strategize

In 2016, a computer beat one of the world's top Go players with a surprising move that its engineers thought was a mistake. Three years later, he retired from the game, saying that "there is an entity that cannot be defeated."

Journalists watch a big screen showing live footage of the third game of the Google DeepMind Challenge Match between Lee Se-Dol and AlphaGo at a hotel in Seoul, South Korea, on Mar. 12, 2016. (Jung Yeon-Je/AFP)

This article is the first in a series being produced in partnership with Urban Analytica at SAIL, a cross-disciplinary initiative at the American University of Beirut that aims to democratize knowledge around artificial intelligence and data analytics for advocacy projects within the Arab world.

Today, artificial intelligence can do what was once thought of as uniquely human: write poetry, create images, and even propose tax systems. The AI Economist, for instance, uses AI to simulate entire economies in search of fairer ways to balance productivity with equality.

But before machines could optimize policies, they had to learn something basic: how to learn from experience.

This is what’s called “reinforcement learning,” which has become a key foundation for AI systems. Understanding how it works means going back to when a machine learned how to play one of the oldest board games known to humankind.

In 2016, DeepMind’s AlphaGo defeated Lee Sedol, one of the greatest players of the ancient Chinese strategy game Go. AlphaGo’s 37th move in the second game was so unusual that DeepMind, the British-American tech company acquired by Google two years prior, thought it was a mistake, but it wasn’t.

In this handout image provided by Google, South Korean professional Go player Lee Sedol (R) reviews the match with other professional Go players after the final game against Google's artificial intelligence program, AlphaGo, during the Google DeepMind Challenge Match on March 15, 2016 in Seoul, South Korea. (Credit: Google press handout)

The move was celebrated as a stroke of creativity, with many exclaiming that AlphaGo was inventing its own strategies rather than relying on pre-existing ones.

The match left a permanent mark on the Go community: three years later, Lee announced his retirement from professional play. “Even if I become the number one,” he told South Korean media, “there is an entity that cannot be defeated.”

Experts review moves during the Go match between 19-year-old Ke Jie and Google's artificial intelligence program AlphaGo in Wuzhen, in eastern China's Zhejiang province on May 27, 2017. That same day, AlphaGo's developer said the program would be retiring from playing human Go opponents after roundly defeating the world's top player. (Credit: AFP)

His words demonstrate how far AI had come (already nearly a decade ago) and how different AlphaGo was from the machines that preceded it. Earlier chess programs relied on brute-force calculation, searching through vast numbers of possible moves and scoring positions with fixed rules. They didn't learn; they just computed. AlphaGo, by contrast, processed information in a more human-like manner. How did it get there?

The initial state of the board is empty, with two players taking turns placing white and black stones on a 19x19 grid. As the stones spread, a battle for space emerges.

For AlphaGo, each arrangement of stones is a state, and placing a stone is an action. States change with every action, as each new move reshapes the board.
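The state-and-action vocabulary can be sketched in a few lines of Python. This is a minimal illustration, not AlphaGo's actual board representation: the names empty_board, place_stone and legal_actions are hypothetical, and the sketch ignores captures and Go's other rules about which moves are allowed.

```python
from typing import List, Tuple

EMPTY, BLACK, WHITE = ".", "B", "W"
Board = Tuple[Tuple[str, ...], ...]  # a state: one immutable grid of points


def empty_board(size: int = 19) -> Board:
    """The initial state: a size x size grid with no stones on it."""
    return tuple(tuple(EMPTY for _ in range(size)) for _ in range(size))


def place_stone(board: Board, row: int, col: int, colour: str) -> Board:
    """An action: put one stone on an empty point and return the NEW state.

    Simplification: captures and other legality rules are not modelled.
    """
    if board[row][col] != EMPTY:
        raise ValueError("point already occupied")
    new_row = board[row][:col] + (colour,) + board[row][col + 1:]
    return board[:row] + (new_row,) + board[row + 1:]


def legal_actions(board: Board) -> List[Tuple[int, int]]:
    """Every empty point is treated as a possible action in this sketch."""
    return [
        (r, c)
        for r, row in enumerate(board)
        for c, point in enumerate(row)
        if point == EMPTY
    ]
```

On an empty 19x19 board, legal_actions returns 361 points. After one call to place_stone, the new state has 360 empty points left, while the original state is unchanged.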

AlphaGo understands the state of the board and then estimates which moves increase its chances of winning.

In this example, the black stones surround the white stone, ‘capturing’ it. Captures help a player claim territory, and the player who controls more of the board when the game ends wins.

Here, as the player of the black stones, AlphaGo has made a winning move, which it registers as a reward. When its own stones are captured, however, the preceding move counts against it: the system is “penalized” and learns to avoid that move in the future.

AlphaGo started out like a diligent student, learning from the game’s existing masters. It studied and processed thousands of professional matches in order to understand which actions resulted in rewarding states and which resulted in penalties.

Then it turned inwards. It played against itself millions of times and learned the incentives: if a move brought it closer to winning, it reinforced that choice; if it led to a loss, it learned to avoid it.
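The reinforce-or-avoid loop described above can be shown on a game far smaller than Go. The sketch below is an illustration only, not AlphaGo's method, which combined deep neural networks with tree search. It assumes a toy game in which two players take turns removing one, two or three stones from a pile, and whoever takes the last stone wins. The function names, the table of scores and all the numeric settings are hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def train_self_play(pile=10, episodes=20000, explore=0.2, step=0.1, seed=0):
    """Learn move scores for the toy pile game purely by playing against itself."""
    rng = random.Random(seed)
    # score[(stones_left, stones_taken)] -> how good that move has looked so far
    score = defaultdict(float)
    for _ in range(episodes):
        stones, history = pile, []
        while stones > 0:
            options = [n for n in (1, 2, 3) if n <= stones]
            if rng.random() < explore:
                take = rng.choice(options)  # occasionally try something new
            else:
                take = max(options, key=lambda n: score[(stones, n)])
            history.append((stones, take))
            stones -= take
        # The player who made the last move won. Walking backwards through
        # the game, the two players' moves alternate, so the outcome flips.
        outcome = 1.0
        for state, take in reversed(history):
            # Nudge the score toward +1 after a win and toward -1 after a loss.
            score[(state, take)] += step * (outcome - score[(state, take)])
            outcome = -outcome
    return score


def best_move(score, stones):
    """Pick the move with the highest learned score in this position."""
    options = [n for n in (1, 2, 3) if n <= stones]
    return max(options, key=lambda n: score[(stones, n)])
```

Nothing in the code tells the program what a good move is. In this toy game the strongest play is to leave the opponent a pile that is a multiple of four, and the learned scores end up favoring such moves because those moves kept leading to wins during self-play.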

Simply put, in the same way a student repeats a word to remember it, or a pianist gets a song right on the 100th attempt, DeepMind created a machine able to tell good moves from bad ones by learning from its own mistakes and everyone else’s.

And that was the breakthrough behind ‘Move 37.’ AlphaGo chose a move that almost no human would have thought to play, going beyond its human guides. It made that move based on intuition shaped by experience, while also deviating from the human games it had studied.

The same loop of states, actions, rewards, and penalties was implemented in a very different arena: StarCraft II. If Go is a contest of strategy, StarCraft II is a contest of managing chaos.

Players must gather resources, build armies, and conquer the opponent’s base. Professional players describe the intensity of this game as being akin to playing chess while conducting an orchestra. Its playstyle felt alien to those who sought to master this game, as it required doing a great number of tasks simultaneously, and with precision.

Like Go, StarCraft II begins from an initial state, a small base that grows with every action.

For AlphaStar, DeepMind’s StarCraft II-playing program, every snapshot of the battlefield was a state. Every command (train a soldier, expand a base, launch an attack) was an action. Victories brought rewards; defeats brought penalties. Both played out in combat: success if the army held, punishment if it fell apart.
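The loop of state, action and reward described above has the same shape whatever the game. The sketch below shows that shape with a made-up, turn-based skirmish. It is not StarCraft II and not DeepMind's code: the ToySkirmish class, its numbers and the two example policies are all hypothetical.

```python
from dataclasses import dataclass

ACTIONS = ("train_soldier", "expand_base", "attack")


@dataclass
class ToySkirmish:
    """Hypothetical stand-in for a strategy game, reduced to a few counters."""
    soldiers: int = 0
    income: int = 1
    minerals: int = 0
    enemy_strength: int = 5
    turn: int = 0

    def observe(self):
        """The state: a snapshot of everything the player can see."""
        return (self.soldiers, self.income, self.minerals, self.turn)

    def step(self, action):
        """Apply one action and return (next_state, reward, done)."""
        self.turn += 1
        self.minerals += self.income
        if action == "attack":
            won = self.soldiers > self.enemy_strength
            return self.observe(), (1.0 if won else -1.0), True
        if action == "train_soldier" and self.minerals >= 1:
            self.minerals -= 1
            self.soldiers += 1
        elif action == "expand_base" and self.minerals >= 2:
            self.minerals -= 2
            self.income += 1
        done = self.turn >= 20  # running out of time counts as a defeat
        return self.observe(), (-1.0 if done else 0.0), done


def run_episode(policy, env=None):
    """Play one game: observe the state, act, collect the reward, repeat."""
    if env is None:
        env = ToySkirmish()
    state, total, done = env.observe(), 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total


def rush(state):
    """Attack immediately, before any soldiers exist."""
    return "attack"


def build_then_attack(state):
    """Train soldiers until the army outnumbers the enemy, then attack."""
    soldiers = state[0]
    return "attack" if soldiers > 5 else "train_soldier"
```

Here run_episode(rush) ends with a total reward of -1.0, because the attack is launched with no army, while run_episode(build_then_attack) ends with 1.0. A learning system would have to discover the second kind of behavior from those rewards alone.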

AlphaStar played itself, sometimes defensively, other times aggressively, and sometimes with risky tactics. Over time, it acquired enough information to dynamically respond to every situation thrown at it. By endlessly playing itself, AlphaStar adapted, countered, and improved, eventually reaching the highest human rank in-game, Grandmaster, and defeating seasoned professionals with strategies unlike any seen before.

What does this prove? Machines can learn like humans, through repetition, feedback, and adaptation. But can the same trial-and-error loop be used to tackle something far messier, like the rules of an economy? That is the challenge behind the AI Economist.
