A team of Italian mathematicians, including one who is also a neuroscientist from the Champalimaud Centre for the Unknown (CCU), in Lisbon, Portugal, has shown that artificial vision machines can learn to recognize complex images spectacularly faster by using a mathematical theory that was developed 25 years ago by one of this new study's co-authors. Their results have been published in the journal Nature Machine Intelligence.
Over the last few decades, machine vision performance has improved dramatically. For example, these artificial systems can now learn to recognise virtually any human face - or to identify any individual fish moving in a tank, in the midst of a large number of other almost identical fish which are also moving.
The machines we're talking about are, in fact, electronic models of networks of biological neurons, and their aim is to simulate the functioning of our brain, which is as good as it gets at performing these visual tasks - and this, without any conscious effort on our part.
But how do these neural networks actually learn? In the case of face recognition, for instance, they do it by acquiring experience about what human faces look like in the form of a series of portraits. More specifically, after being digitized into a matrix of pixel values (think about your computer monitor's RGB system), each image is "crunched" inside the neural network, which then manages to extract general, meaningful features from the set of sample faces (such as the eyes, mouth, nose, etc.).
This learning (deep learning, in its more modern development) then enables the machine to spit out another set of values, which will in turn enable it, for instance, to identify a face it has never seen before in a databank of faces (much like a fingerprint database), and therefore to predict who that face belongs to with great accuracy.
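To make the "matrix of pixel values" concrete, here is a minimal sketch in Python. The tiny 2x2 image and the helper names `to_grayscale` and `flatten` are made up for illustration; real systems work with far larger images and usually keep the colour channels.

```python
# A hypothetical 2x2 RGB image: each pixel is a (red, green, blue) triple in 0..255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def to_grayscale(rgb_image):
    """Average the three colour channels of every pixel."""
    return [[sum(pixel) / 3 for pixel in row] for row in rgb_image]

def flatten(matrix):
    """Turn a matrix into the flat list of numbers a network would receive."""
    return [value for row in matrix for value in row]

gray = to_grayscale(image)
vector = flatten(gray)
print(gray)
print(vector)
```

From the network's point of view, a portrait is nothing more than such a list of numbers, only much longer.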
The story of Clever Hans
Before the neural network can perform this well, however, it is typically necessary to present it with thousands of faces (i.e. matrices of numbers). Moreover, much as these machines have been increasingly successful at pattern recognition, the fact is that nobody really knows what goes on inside them as they learn their task. They are, basically, black boxes. You feed them something, they spit out something, and if you designed your electronic circuits properly... you'll get the correct answer.
What this means is that it is not possible to determine which or how many features the machine is actually extracting from the initial data - and not even how many of those features are really meaningful for face recognition. "To illustrate this, consider the paradigm of the wise horse", says first author of the study Mattia Bergomi, who works in the Systems Neuroscience Lab at the CCU.
The story dates from the early years of the 20th century. It's about a horse in Germany called Clever Hans that, so his master claimed, had learned to do arithmetic and announce the result of additions, subtractions, etc. by tapping one of his front hooves on the ground the right number of times. Everyone who witnessed the horse's performance was convinced he could count (the event was even reported by the New York Times). But then, in 1907, a German psychologist showed that the horse was in fact picking up unconscious cues in his master's body language that were telling him when to stop tapping...
"It's the same with machine learning; there is no control over how it works or what it has learned during training", Bergomi explains. The machine having no a priori knowledge of faces, it just somehow does its stuff - and it works.
This led the researchers to ask: could there be a way to inject some knowledge of the real world (about faces or other objects) into the neural network, before training, in order to make it explore a more limited space of possible features instead of considering them all - including those that are impossible in the real world? "We wanted to control the space of learned features", Bergomi points out. "It's similar to the difference between a mediocre chess player and an expert: the first sees all possible moves, while the latter only sees the good ones", he adds.
Another way of putting it, he says, is by saying that "our study addresses the following simple question: When we train a deep neural network to distinguish road signs, how can we tell the network that its job will be much easier if it only has to care about simple geometrical shapes such as circles and triangles?".
The scientists reasoned that this approach would substantially reduce training time - and, no less importantly, give them a "whiff" of what the machine might be doing to obtain its results. "Allowing humans to drive the learning process of learning machines is fundamental to move towards a more intelligible artificial intelligence and reduce the skyrocketing cost in time and resources that current neural networks require in order to be trained", Bergomi remarks.
What's in a shape?
Here's where a very abstract and novel mathematical theory, called "topological data analysis" (TDA), enters the stage. The first steps in the development of TDA were taken in 1992 by the Italian mathematician Patrizio Frosini, co-author of the new study and currently at the University of Bologna. "Topology is one of the purest forms of math", says Bergomi. "And until recently, people thought that Topology would not be applied to anything concrete for a long time. Until TDA became famous in the last few years."
Topology is a sort of extended geometry that, instead of measuring lines and angles in rigid shapes (such as triangles, squares, cones, etc.), seeks to classify highly complex objects according to their shape. For a topologist, for example, a donut and a mug are the same object: one can be deformed into the other by stretching or compression.
Now, the thing is, current neural networks are not good at topology. For instance, they do not recognize rotated objects. To them, the same object will look completely different every time it is rotated. That is precisely why the only solution is to make these networks "memorise" each configuration separately - by the thousands. And it is precisely what the authors were planning to avoid by using TDA.
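The rotation problem is easy to see at the level of raw pixels. The sketch below is a toy illustration, not the authors' code: the 5x5 drawing of a 7 and the helpers `rotate90` and `pixel_distance` are invented for the example. It compares a digit with a copy of itself rotated by a quarter turn.

```python
def rotate90(matrix):
    """Rotate a matrix 90 degrees clockwise."""
    return [list(row) for row in zip(*matrix[::-1])]

def pixel_distance(a, b):
    """Count the pixels at which two same-sized binary images disagree."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

# A crude 5x5 binary drawing of the digit 7 (hypothetical toy data).
seven = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

rotated = rotate90(seven)

# Same shape, same amount of ink, yet many pixels no longer match.
print(pixel_distance(seven, rotated))
print(sum(map(sum, seven)), sum(map(sum, rotated)))
```

A network that compares images pixel by pixel sees the two drawings as different inputs, even though a person would call them the same object.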
Think of TDA as a mathematical tool for finding meaningful internal structure (topological features) in any complex "object" that can be represented as a huge set of numbers, by looking at the data through certain well-chosen "lenses" or filters. The data itself can be about faces, financial transactions or cancer survival rates. For faces in particular, by applying TDA, it becomes possible to teach a neural network to recognize faces without having to present it with each of the different orientations faces might assume in space. The machine will now recognize a face as a face, even when it is rotated.
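As a rough illustration of what a topological "lens" can look like, the sketch below counts two simple features of a black-and-white image: the number of connected pieces of ink and the number of holes. This is a deliberately simplified stand-in, not the machinery used in the study, and all names (`count_components`, `topological_summary`, the toy images) are hypothetical. It only checks quarter-turn rotations, which need no resampling of the pixel grid.

```python
FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT = FOUR + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def count_components(grid, value, neighbours):
    """Count connected regions of cells equal to `value` using flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen, count = set(), 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or (r, c) in seen:
                continue
            count += 1
            stack = [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == value and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return count

def topological_summary(image):
    """Return (pieces of ink, holes) for a binary image."""
    # Pad with a background border so the outside forms a single region.
    width = len(image[0]) + 2
    padded = [[0] * width] + [[0] + list(row) + [0] for row in image] + [[0] * width]
    pieces = count_components(padded, 1, EIGHT)
    holes = count_components(padded, 0, FOUR) - 1  # subtract the outside region
    return pieces, holes

def rotate90(matrix):
    """Rotate a matrix 90 degrees clockwise."""
    return [list(row) for row in zip(*matrix[::-1])]

# Hypothetical toy drawings of a 0 and a 7.
zero = [
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
]
seven = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

for name, img in (("zero", zero), ("seven", seven)):
    print(name, topological_summary(img), topological_summary(rotate90(img)))
```

The summary is unchanged when the drawing is rotated, which is the kind of invariance raw pixels lack. On its own this lens is far too coarse: a 5 and a 7 both come out as one piece with no holes, so telling them apart takes several complementary lenses.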
It's a 5! No, it's a 7!
In their study, the scientists tested the benefits of combining machine learning and TDA by teaching a neural network to recognise hand-written digits. The results speak for themselves.
As these networks are bad topologists and handwriting can be very ambiguous, two different hand-written digits may prove indistinguishable for current machines - and conversely, two instances of the same hand-written digit may be seen by them as different.
That is why, to be performed by today's vision machines, this task requires presenting the network, which knows nothing about digits in the world, with thousands of images of each of the 10 digits, written with all sorts of slants, calligraphies, etc.
To inject knowledge about digits, the team built a set of a priori features that they considered meaningful (in other words, a set of "lenses" through which the network would "see" the digits), and forced the machine to choose among these lenses to look at the images. And what happened was that the number of images (that is, the time) needed for the TDA-enhanced neural network to learn to distinguish 5's from 7's, however badly written, while maintaining its predictive power, dropped to fewer than 50! "What we mathematically describe in our study is how to enforce certain symmetries, and this provides a strategy to build machine learning agents that are able to learn salient features from a few examples, by taking advantage of the knowledge injected as constraints", says Bergomi.
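The idea of forcing a learner to choose among a fixed set of lenses can be sketched with a toy example. The code below is only an analogy for the approach described above, not the operators or the network used in the study: the lens bank, the 5x5 drawings and the functions `choose_lens` and `classify` are all invented for illustration. The "learning" consists of picking, from a handful of hand-made lenses, the one that best separates a single example of each digit.

```python
# Hypothetical bank of hand-made "lenses": each maps a binary image to one number.
LENSES = {
    "top_row_ink": lambda img: sum(img[0]),
    "bottom_row_ink": lambda img: sum(img[-1]),
    "left_column_ink": lambda img: sum(row[0] for row in img),
    "total_ink": lambda img: sum(sum(row) for row in img),
}

def mean(values):
    return sum(values) / len(values)

def choose_lens(class_a, class_b):
    """Return (name, threshold, a_is_low) for the lens with the largest gap between class means."""
    best = None
    for name, lens in LENSES.items():
        mean_a = mean([lens(img) for img in class_a])
        mean_b = mean([lens(img) for img in class_b])
        gap = abs(mean_a - mean_b)
        if best is None or gap > best[0]:
            best = (gap, name, (mean_a + mean_b) / 2, mean_a < mean_b)
    return best[1:]

def classify(img, chosen, label_a, label_b):
    """Label an image by which side of the threshold its lens value falls on."""
    name, threshold, a_is_low = chosen
    is_low = LENSES[name](img) < threshold
    return label_a if is_low == a_is_low else label_b

# Toy training data: one crude drawing per digit.
FIVE = [[1, 1, 1, 1, 1], [1, 0, 0, 0, 0], [1, 1, 1, 1, 0], [0, 0, 0, 0, 1], [1, 1, 1, 1, 0]]
SEVEN = [[1, 1, 1, 1, 1], [0, 0, 0, 0, 1], [0, 0, 0, 1, 0], [0, 0, 1, 0, 0], [0, 0, 1, 0, 0]]

# Sloppier drawings the learner has not seen.
SLOPPY_FIVE = [[1, 1, 1, 1, 0], [1, 0, 0, 0, 0], [1, 1, 1, 1, 0], [0, 0, 0, 0, 1], [1, 1, 1, 0, 0]]
SLOPPY_SEVEN = [[1, 1, 1, 1, 0], [0, 0, 0, 1, 0], [0, 0, 1, 0, 0], [0, 0, 1, 0, 0], [0, 1, 0, 0, 0]]

chosen = choose_lens([FIVE], [SEVEN])
print(chosen[0])
print(classify(SLOPPY_FIVE, chosen, "5", "7"), classify(SLOPPY_SEVEN, chosen, "5", "7"))
```

Because the learner only has to pick one option from a short list instead of discovering features from scratch, a couple of examples are enough here, and the chosen lens can be read off by name. The lenses themselves are crude; the point is the restriction of the search space, not the quality of these particular features.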
Does this mean that the inner workings of learning machines which mimic the brain will become more transparent in the future, enabling new insights into the inner workings of the brain itself? In any case, this is one of Bergomi's goals. "The intelligibility of artificial intelligence is necessary for its interaction and integration with biological intelligence", he says. He is currently working, in collaboration with his colleague Pietro Vertechi, also from the Systems Neuroscience Lab at CCU, on developing a new kind of neural network architecture that will allow humans to swiftly inject high-level knowledge into these networks to control and speed up their training.