Author(s): Rösch P. J.; Deuser F.; Habel K.; Oswald N.

ISSN: 3005-2092

ABSTRACT

Artificial intelligence for cognitive superiority aims to extract relevant information from large amounts of data to create military and non-military situational awareness. Reliable and timely interpretation of visual information contributes to gaining such superiority. With the rise of large-scale, multimodal deep learning models such as Contrastive Language-Image Pre-training (CLIP), a promising type of neural network has emerged for such visual recognition tasks. Networks of this kind can extract knowledge from visual input, covering tasks such as Optical Character Recognition (OCR), facial recognition, and object classification, without being explicitly fine-tuned for any of them. This zero-shot capability of CLIP is enabled by the choice of specific text prompts targeting the searched object within an image.
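To make this zero-shot mechanism concrete, the following is a minimal sketch using the Hugging Face `transformers` implementation of CLIP. The checkpoint, the image file `street_scene.jpg`, and the prompt wordings are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal CLIP zero-shot classification sketch (assumed setup,
# not the paper's exact pipeline).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")  # hypothetical input image
prompts = [
    "a photo of a military vehicle",
    "a photo of a civilian vehicle",
    "a photo with no vehicle",
]

inputs = processor(text=prompts, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns
# them into per-prompt probabilities without any fine-tuning.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(prompts, probs[0].tolist())))
```

The only task-specific input here is the prompt wording, which is exactly the lever the paper's prompt engineering operates on.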


In this paper, we investigate how CLIP can be used to identify vehicles in the military domain, drawing on lessons learned from the Ukraine-Russia war. For the analysis, a new dataset was created containing images with military and civilian vehicles as well as images without vehicles. First, we search for appropriate queries to improve single-prompt results and then ensemble multiple prompts. Second, we explore whether this approach can identify military vehicles in video streams from surveillance cameras and smartphones. We show on our image dataset that, with thoughtful prompt engineering, the CLIP model is able to identify military vehicles with high precision and recall. Performance on the video dataset depends on object size and video quality. With this approach, allies as well as hostile parties can systematically analyze large amounts of video and image data without time-consuming data collection and training.
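The prompt ensembling step mentioned above can be sketched as follows, again assuming the Hugging Face `transformers` CLIP API. The templates and class names are hypothetical stand-ins; the paper's actual prompts come from its own engineering study.

```python
# Sketch of prompt ensembling: average unit-normalized text embeddings
# over several paraphrased templates per class, then compare the image
# embedding against the ensembled class embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

templates = [                      # illustrative templates
    "a photo of a {}",
    "a blurry photo of a {}",
    "a surveillance-camera image of a {}",
]
classes = ["military vehicle", "civilian vehicle", "empty street"]

with torch.no_grad():
    class_embeddings = []
    for name in classes:
        text_inputs = processor(
            text=[t.format(name) for t in templates],
            return_tensors="pt", padding=True,
        )
        emb = model.get_text_features(**text_inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)   # unit-normalize
        class_embeddings.append(emb.mean(dim=0))     # ensemble: mean over templates
    text_emb = torch.stack(class_embeddings)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    image_inputs = processor(images=Image.open("frame.jpg"),  # hypothetical video frame
                             return_tensors="pt")
    img_emb = model.get_image_features(**image_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Scale cosine similarities by CLIP's learned temperature before softmax.
    logits = model.logit_scale.exp() * (img_emb @ text_emb.T)
    scores = logits.softmax(dim=-1)

print(dict(zip(classes, scores[0].tolist())))
```

Averaging embeddings over several paraphrases reduces sensitivity to any single prompt wording, which is the effect the abstract attributes to ensembling multiple prompts.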


FULL ARTICLE