Modeling concepts using supervised or unsupervised machine learning approaches is becoming increasingly important for video semantic indexing, retrieval, and filtering applications. Videos naturally contain multimodal data (audio, speech, visual, and text) that can be combined to infer the overall semantic concepts. However, most prior research has been conducted within only a single domain. In this paper we propose an unsupervised technique that builds context-independent keyword lists from WordNet for speech-based concept modeling. Furthermore, we propose an extended speech-based video concept (ESVC) model that reorders and extends these keyword lists by supervised learning based on multimodal annotations. Experimental results show that the context-independent models achieve performance comparable to conventional supervised learning algorithms, and that the ESVC model achieves about 53% and 28.4% relative improvement on two testing subsets of the TRECVID 2003 corpus over a prior state-of-the-art speech-based video concept detection algorithm.
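As a rough illustration of the kind of keyword-list construction the abstract describes, the sketch below gathers a context-independent keyword list for a concept by expanding its WordNet synsets one level through hypernyms and hyponyms (shown here with NLTK). The concept word, expansion depth, and use of NLTK are illustrative assumptions; the report's actual construction and weighting scheme are not specified in this abstract.

```python
# Hypothetical sketch: build a context-independent keyword list for a concept
# from WordNet via NLTK (requires: nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def wordnet_keywords(concept, max_depth=1):
    """Collect lemma names of a concept's synsets plus nearby hypernyms/hyponyms."""
    keywords = set()
    frontier = wn.synsets(concept, pos=wn.NOUN)
    for _ in range(max_depth + 1):
        next_frontier = []
        for synset in frontier:
            # Lemma names of the synset itself (e.g. "aeroplane", "plane").
            keywords.update(l.replace("_", " ") for l in synset.lemma_names())
            # Expand one level up and down the WordNet hierarchy.
            next_frontier.extend(synset.hypernyms() + synset.hyponyms())
        frontier = next_frontier
    return sorted(keywords)

if __name__ == "__main__":
    print(wordnet_keywords("airplane"))
```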
By: Xiaodan Song; Ching-Yung Lin; Ming-Ting Sun
Published in: RC23648 in 2005
LIMITED DISTRIBUTION NOTICE:
This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).