Publication Information


Title
Japanese: 
English: Similarity search for office XML documents based on style and structure data
Authors
Japanese: 渡辺 陽介, 上垣外 英剛, 横田 治夫.
English: Yousuke Watanabe, Hidetaka Kamigaito, Haruo Yokota.
Language: English
Journal/Book title
Japanese: 
English: International Journal of Web Information Systems
Volume, Issue, Pages: Vol. 9, No. 2, pp. 100-116
Publication date: June 2013
Publisher
Japanese: 
English: Emerald Group Publishing Limited
Conference name
Japanese: 
English: 
Venue
Japanese: 
English: 
Official link: http://dx.doi.org/10.1108/IJWIS-03-2013-0005
DOI: https://doi.org/10.1108/IJWIS-03-2013-0005
Abstract: Purpose – Office documents are widely used in daily activities, and their number keeps increasing, so the demand for sophisticated search over office documents is growing. Recent office file formats are packages of multiple XML files. These XML files include not only body text but also page structure data and style data. The purpose of this paper is to utilize these data to find similar office documents. Design/methodology/approach – The authors propose SOS, a similarity search method based on the structures and styles of office documents. SOS needs to compute similarity values between multiple pairs of XML files included in the office documents. The authors also propose LAX+, an algorithm that calculates a similarity value for a pair of XML files by extending an existing XML leaf node clustering algorithm. Findings – SOS and LAX+ are evaluated on three types of office documents (docx, xlsx and pptx) in the authors' experiments. The results of LAX+ and SOS are better than those of the existing algorithms. Originality/value – Existing text-based search engines do not take the structure and style of documents into account. SOS can find similar documents by calculating similarities between multiple XML files corresponding to body text, structures and styles.
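
The abstract describes office documents as packages of multiple XML files and a search that compares pairs of those files. The following is a minimal Python sketch of that general idea only. It is not SOS or LAX+, whose details are not given in this record. The part names, the tiny XML contents, the leaf-path Jaccard measure and the plain averaging across parts are all assumptions made for illustration, and real docx files carry namespaced, far richer XML.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def make_package(parts):
    """Build an in-memory ZIP package from {part name: XML string} (toy stand-in for a docx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, xml in parts.items():
            zf.writestr(name, xml)
    return buf.getvalue()


def leaf_paths(xml_bytes):
    """Return the set of (root-to-leaf tag path, leaf text) pairs found in one XML file."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if not children:
            paths.add((path, (node.text or "").strip()))
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_bytes), "")
    return paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def package_similarity(pkg_a, pkg_b):
    """Score each XML part by leaf-path Jaccard, then average (an assumed aggregation, not the paper's)."""
    with zipfile.ZipFile(io.BytesIO(pkg_a)) as za, zipfile.ZipFile(io.BytesIO(pkg_b)) as zb:
        names_a = {n for n in za.namelist() if n.endswith(".xml")}
        names_b = {n for n in zb.namelist() if n.endswith(".xml")}
        scores = {}
        for name in sorted(names_a | names_b):
            if name in names_a and name in names_b:
                scores[name] = jaccard(leaf_paths(za.read(name)), leaf_paths(zb.read(name)))
            else:
                scores[name] = 0.0  # a part present in only one package contributes no similarity
    overall = sum(scores.values()) / len(scores) if scores else 0.0
    return scores, overall


# Hypothetical parts: one for body text, one for styles.
STYLES = "<styles><style><name>Heading1</name></style></styles>"
doc_a = make_package({
    "word/document.xml": "<doc><p><t>Hello</t></p><p><t>World</t></p></doc>",
    "word/styles.xml": STYLES,
})
doc_b = make_package({
    "word/document.xml": "<doc><p><t>Hello</t></p><p><t>There</t></p></doc>",
    "word/styles.xml": STYLES,
})

scores, overall = package_similarity(doc_a, doc_b)
print(scores, overall)
```

In this toy case the two documents share their style part exactly and differ in one body paragraph, so the style part scores higher than the body part. A text-only comparison would not see the style agreement at all, which is the gap the abstract points to.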
