MIT computer scientists have developed a system that learns to identify objects within an image based on a spoken description of that image. Given an image and an audio caption, the model highlights, in real time, the regions of the image being described.
Unlike current speech-recognition technologies, the model doesn’t require manual transcriptions and annotations of the examples it’s trained on. Instead, it learns words directly from recorded speech clips and objects in raw images, and associates them with one another.
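To make the idea of associating speech with image regions more concrete, here is a minimal sketch in Python. It is not the researchers' implementation; the article does not describe the model's internals. The sketch assumes, purely for illustration, that the image has already been split into a grid of cells, that the audio caption has been split into short frames, and that both have been mapped into the same embedding space. The function names (`highlight_map`, `highlighted_cells`), the similarity measure (a dot product), and all numbers are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product, used here as an assumed similarity measure."""
    return sum(x * y for x, y in zip(a, b))


def highlight_map(
    region_grid: List[List[Vector]], audio_frames: List[Vector]
) -> List[List[float]]:
    """For each image cell, return its best similarity to any audio frame."""
    return [
        [max(dot(cell, frame) for frame in audio_frames) for cell in row]
        for row in region_grid
    ]


def highlighted_cells(
    scores: List[List[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Return (row, column) positions whose score meets the threshold."""
    return [
        (r, c)
        for r, row in enumerate(scores)
        for c, score in enumerate(row)
        if score >= threshold
    ]


# Toy 2x2 image grid with made-up 2-dimensional embeddings.
grid = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0]],
]
# Two made-up audio-frame embeddings from a spoken caption.
audio = [[1.0, 0.0], [0.2, 0.1]]

scores = highlight_map(grid, audio)
highlighted = highlighted_cells(scores, threshold=0.5)
```

In this toy example, the cells whose embeddings line up with some part of the spoken caption receive high scores and would be highlighted, while the rest stay dark. In a real system, the embeddings would be learned from paired images and recordings rather than written by hand.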
The model can currently recognize only several hundred different words and object types. But the researchers hope that one day their co…
MIT News – Electrical engineering and computer science (EECS) – Computer science and technology