Now the internet teems with consumer videos which are highly diverse and often with low quality. Video semantic indexing, which extracts meaningful information from those Internet videos, has been extensively studied. In this paper, we introduce the research activities in TRECVID workshop, an international workshop, where many participants compete with each other on video semantic indexing techniques and forecast their prospect. We focus on machine learning approach robust against the variety and the low quality, fast computational techniques required for real applications, and multi-modal recognition techniques which combine various modes such as image, audio, and text. We discuss their problem, and predict their future development.