Recent office documents follow an XML-based archive format, so each document consists of multiple XML files. These XML files describe page structure and styles such as font, color, and position. Existing text-based search engines, however, do not take the structure and style of documents into account. By exploiting this information, we can perform similarity search over office documents based on their structures and styles. We propose SOS, a similarity search method based on the structures and styles of office documents. Computing a similarity value between two office documents requires computing similarity values between multiple pairs of XML files in those documents. We therefore also propose LAX+, an algorithm that calculates a similarity value for a pair of XML files by extending an existing XML leaf node clustering algorithm. In our experiments, we use docx, xlsx, and pptx files and evaluate SOS and LAX+ in terms of precision and recall.
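
As a rough illustration of the setting, the sketch below treats an office document as a ZIP archive of XML parts (which holds for docx, xlsx, and pptx files), compares same-named parts pairwise, and averages the per-pair scores into a document-level value. It is not SOS or LAX+: the Jaccard overlap of leaf-element signatures is a simple stand-in chosen for this example, and every function name (`xml_parts`, `leaf_signatures`, `jaccard`, `document_similarity`, `make_archive`) is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def xml_parts(archive_bytes):
    """Return {part name: XML bytes} for every XML part in a ZIP-based office file."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {
            name: zf.read(name)
            for name in zf.namelist()
            if name.endswith((".xml", ".rels"))
        }


def leaf_signatures(xml_bytes):
    """Collect (tag path, sorted attribute names) for each leaf element."""
    root = ET.fromstring(xml_bytes)
    signatures = set()

    def walk(node, path):
        current = path + "/" + node.tag
        children = list(node)
        if not children:
            # A leaf element: record where it sits and which attributes it carries.
            signatures.add((current, tuple(sorted(node.attrib))))
        for child in children:
            walk(child, current)

    walk(root, "")
    return signatures


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def document_similarity(doc_a, doc_b):
    """Average the stand-in similarity over XML parts present in both archives."""
    parts_a, parts_b = xml_parts(doc_a), xml_parts(doc_b)
    shared = sorted(set(parts_a) & set(parts_b))
    if not shared:
        return 0.0
    scores = [
        jaccard(leaf_signatures(parts_a[name]), leaf_signatures(parts_b[name]))
        for name in shared
    ]
    return sum(scores) / len(scores)


def make_archive(parts):
    """Build a small in-memory ZIP archive from {part name: XML text} (toy data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in parts.items():
            zf.writestr(name, text)
    return buf.getvalue()
```

With a real file, the bytes could be read via `open(path, "rb").read()` and passed to `document_similarity`. This stand-in compares only element paths and attribute names, so it ignores attribute values such as the actual font or color.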