Data Aggregation: Imputed Topic Analysis

Traditional approaches to organizing content involve tedious "meta-tagging": labeling individual documents (or organizing them into folders) to indicate their subject matter. For large repositories of legacy information, such as federal government web sites, this approach is too time-consuming, too expensive, and too inaccurate. And that's if it can be done at all.

"Clustering" search approaches attempt to alleviate this problem by organizing search engine results "on the fly", but their clusters reflect only the results delivered by an outside search solution, and give users no further indication of the topicality of the document set as a whole. Furthermore, because they operate only at "search time", they offer no consistency across searches and no help with the wider problem of content aggregation throughout the site.

Unlike "clustering" technology, IT.com's imputed topic analysis pre-analyzes a corpus of documents and discovers the topics that naturally occur in it, before any user query is run. This lets us take advantage of the taxonomy that already exists within the corpus. We build a hierarchical, multi-path taxonomy of the corpus; "multi-path" means that each document may be found within several topics, depending on the mixture of its content. We also analyze the terms within documents and associate them with topics.
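
To make the idea of a hierarchical, multi-path taxonomy concrete, here is a minimal sketch of one way such a structure could be held in memory. It is an illustration only, not IT.com's implementation: the class name, method names, topic labels, document ID, and weights are all hypothetical, and the sketch assumes the topics and weights have already been discovered by some prior analysis step.

```python
from collections import defaultdict


class Taxonomy:
    """Toy hierarchical, multi-path taxonomy (hypothetical, for illustration)."""

    def __init__(self):
        self.parent = {}                      # topic -> parent topic (None for a root)
        self.doc_topics = defaultdict(dict)   # document id -> {topic: weight}
        self.term_topics = defaultdict(dict)  # term -> {topic: weight}

    def add_topic(self, topic, parent=None):
        self.parent[topic] = parent

    def assign_document(self, doc_id, topic, weight):
        # A document may be assigned to any number of topics ("multi-path").
        self.doc_topics[doc_id][topic] = weight

    def assign_term(self, term, topic, weight):
        # Terms are associated with topics in the same way as documents.
        self.term_topics[term][topic] = weight

    def path(self, topic):
        """Return the chain of topics from the root down to `topic`."""
        chain = []
        while topic is not None:
            chain.append(topic)
            topic = self.parent[topic]
        return list(reversed(chain))

    def paths_for_document(self, doc_id):
        """Every taxonomy path under which a document appears, strongest first."""
        topics = self.doc_topics.get(doc_id, {})
        ranked = sorted(topics, key=lambda t: (-topics[t], t))
        return [self.path(t) for t in ranked]


# Made-up example: one document whose content mixes two subjects.
tax = Taxonomy()
tax.add_topic("Health")
tax.add_topic("Medicare", parent="Health")
tax.add_topic("Taxes")
tax.add_topic("Deductions", parent="Taxes")
tax.assign_document("doc-17", "Medicare", 0.6)
tax.assign_document("doc-17", "Deductions", 0.4)

paths = tax.paths_for_document("doc-17")
```

In this example the same document is reachable by two routes, Health > Medicare and Taxes > Deductions, which is what "multi-path" is meant to convey.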

This allows us to respond to each query with a list of topics related to what the searcher is looking for. When a searcher enters a query, we determine which topics are relevant to that particular search and list them alongside the search results. These topics suggest alternate lines of inquiry, help searchers organize their results and focus their search, and provide a view of the topics addressed by the site as a whole.
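
The query-time step described above can be sketched as a simple lookup against term-to-topic associations computed ahead of time. This is a rough illustration under stated assumptions, not the actual scoring method: the `related_topics` function, the `TERM_TOPICS` table, and all weights are hypothetical, and the additive scoring is a stand-in chosen for simplicity.

```python
from collections import defaultdict

# Hypothetical term-to-topic weights, standing in for the output of offline analysis.
TERM_TOPICS = {
    "passport": {"Travel": 0.9, "Citizenship": 0.5},
    "renewal": {"Travel": 0.4, "Licensing": 0.6},
    "visa": {"Travel": 0.7, "Immigration": 0.8},
}


def related_topics(query, term_topics, limit=3):
    """Rank topics by the summed weights of the query terms associated with them."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for topic, weight in term_topics.get(term, {}).items():
            scores[topic] += weight
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [topic for topic, _ in ranked[:limit]]


# Topics to show alongside the ordinary search results for this query.
suggestions = related_topics("passport renewal", TERM_TOPICS)
```

Because the associations are built before any query arrives, the same query yields the same topic list every time, which is the consistency that search-time clustering lacks.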