CAMBRIDGE, Mass. -- Police and security teams guarding airports, docks and border crossings from terrorist attack or illegal entry need to know immediately when someone enters a prohibited area, and who that person is. A network of surveillance cameras is typically used to monitor these at-risk locations 24 hours a day, but such networks can generate far more images than human eyes can analyze.
Now, a system being developed by Christopher Amato, a postdoc at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), can perform this analysis more accurately and in a fraction of the time it would take a human camera operator. "You can't have a person staring at every single screen, and even if you did the person might not know exactly what to look for," Amato says. "For example, a person is not going to be very good at searching through pages and pages of faces to try to match [an intruder] with a known criminal or terrorist."
Existing computer-vision systems designed to carry out this task automatically tend to be fairly slow, Amato says. "Sometimes it's important to come up with an alarm immediately, even if you are not yet positive exactly what is happening," he says. "If something bad is going on, you want to know about it as soon as possible."
So Amato and his University of Minnesota colleagues Komal Kapoor, Nisheeth Srivastava and Paul Schrater are developing a system that uses mathematics to reach a compromise between accuracy (so that the system does not trigger an alarm every time a cat walks in front of the camera, for example) and the speed needed to allow security staff to act on an intrusion as quickly as possible.
For camera-based surveillance systems, operators typically have a range of computer-vision algorithms they could use to analyze the video feed. These include skin-detection algorithms that can identify a person in an image, and background-detection systems that flag unusual objects or anything moving through the scene.
To decide which of these algorithms to use in a given situation, Amato's system first carries out a learning phase, in which it assesses how each piece of software works in the type of setting in which it is being applied, such as an airport. To do this, it runs each of the algorithms on the scene, to determine how long it takes to perform an analysis, and how certain it is of the answer it comes up with. It then adds this information to its mathematical framework, known as a partially observable Markov decision process (POMDP).
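The learning phase described above can be pictured with a small Python sketch. It is only an illustration, not the researchers' code: the two toy detectors, the brightness-list "frames" and the `profile` helper are all hypothetical, and accuracy on hand-labelled frames is used here as a simple stand-in for how certain an algorithm is of its answer.

```python
import time
from statistics import mean


def profile(algorithm, labelled_frames):
    """Return (mean seconds per frame, fraction of correct answers)."""
    durations, correct = [], 0
    for frame, truth in labelled_frames:
        start = time.perf_counter()
        answer = algorithm(frame)
        durations.append(time.perf_counter() - start)
        correct += int(answer == truth)
    return mean(durations), correct / len(labelled_frames)


# Hypothetical stand-ins for real vision algorithms.
# A "frame" here is just a list of pixel brightness values.
def any_bright_pixel(frame):
    return max(frame) > 200


def mean_brightness(frame):
    return sum(frame) / len(frame) > 100


# Hand-labelled sample frames: (pixels, is_someone_present)
frames = [
    ([100, 120, 250], True),
    ([10, 20, 30], False),
    ([150, 160, 170], True),
    ([90, 80, 70], False),
]

stats = {f.__name__: profile(f, frames)
         for f in (any_bright_pixel, mean_brightness)}

for name, (seconds, accuracy) in stats.items():
    print(f"{name}: {seconds:.6f} s per frame, accuracy {accuracy:.2f}")
```

Per-algorithm timing and reliability figures of this kind are the sort of information the article says is fed into the POMDP framework.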
Then, for any given situation — if it wants to know if an intruder has entered the scene, for example — the system can decide which of the available algorithms to run on the image, and in which sequence, to give it the most information in the least amount of time. "We plug all of the things we have learned into the POMDP framework, and it comes up with a policy that might tell you to start out with a skin analysis, for example, and then depending what you find out you might run an analysis to try to figure out who the person is, or use a tracking system to figure out where they are [in each frame]," Amato says. "And you continue doing this until the framework tells you to stop, essentially, when it is confident enough in its analysis to say there is a known terrorist here, for example, or that nothing is going on at all."
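The run-until-confident loop Amato describes can be sketched in a few lines of Python. This is not a POMDP solver and not the team's method: it swaps in a much simpler greedy rule that picks whichever unused detector offers the largest expected reduction in uncertainty per second, updates its belief with Bayes' rule, and stops once the belief crosses an alarm or all-clear threshold. The detector names, hit rates, timings and thresholds are all assumed values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    name: str
    tpr: float      # assumed P(detector fires | intruder present)
    fpr: float      # assumed P(detector fires | no intruder)
    seconds: float  # assumed average run time


def entropy(p):
    """Uncertainty, in bits, of a yes/no belief p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def update(belief, det, fired):
    """Bayes' rule for P(intruder) after seeing one detector result."""
    like_intruder = det.tpr if fired else 1 - det.tpr
    like_clear = det.fpr if fired else 1 - det.fpr
    num = like_intruder * belief
    den = num + like_clear * (1 - belief)
    return num / den if den > 0 else belief


def gain_per_second(belief, det):
    """Expected drop in entropy from running det, divided by its cost."""
    p_fire = det.tpr * belief + det.fpr * (1 - belief)
    expected = (p_fire * entropy(update(belief, det, True))
                + (1 - p_fire) * entropy(update(belief, det, False)))
    return (entropy(belief) - expected) / det.seconds


def run(detectors, observe, prior=0.05, alarm=0.95, clear=0.01, budget=5.0):
    """Run detectors one at a time until confident or out of time."""
    belief, spent, trace = prior, 0.0, []
    remaining = list(detectors)  # each detector is used at most once
    while clear < belief < alarm and remaining:
        det = max(remaining, key=lambda d: gain_per_second(belief, d))
        if spent + det.seconds > budget:
            break
        remaining.remove(det)
        belief = update(belief, det, observe(det))
        spent += det.seconds
        trace.append(det.name)
    if belief >= alarm:
        decision = "alarm"
    elif belief <= clear:
        decision = "clear"
    else:
        decision = "uncertain"
    return decision, belief, trace


DETECTORS = [
    Detector("skin", tpr=0.90, fpr=0.20, seconds=0.05),
    Detector("background", tpr=0.80, fpr=0.10, seconds=0.02),
    Detector("face", tpr=0.95, fpr=0.01, seconds=1.00),
]

# Simulate a frame on which every detector fires.
print(run(DETECTORS, observe=lambda det: True))
```

The sketch treats the detectors' outputs as independent given the true state, which is a simplifying assumption. A POMDP policy of the kind described in the article plans over whole sequences of analyses rather than one greedy step at a time.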
Like a human detective, the system can also take context into account when analyzing a set of images, Amato says. So for instance, if the system is being used at an airport, it could be programmed to identify and track particular people of interest, and to recognize objects that are strange or in unusual locations, he says. It could also be programmed to sound an alarm whenever there are any objects or people in the scene, when there are too many objects, or if the objects are moving in ways that give cause for concern.
In addition to port and airport security, the system could monitor video information obtained by a fleet of unmanned aircraft, Amato says. It could also be used to analyze data from weather-monitoring sensors to determine where tornados are likely to appear, or information from water samples taken by autonomous underwater vehicles, he says. The system would determine how to obtain the information it needs in the least amount of time and with the fewest possible sensors.