I recently posted a question about designing cross database search on The Asilomar Institute for Information Architecture's aifia-members list. A few people provided me with descriptions of their experiences. Most respondents emphasized understanding the business rules and users. The following is my summary of this discussion.
Defining the problem
My project is site search for a digital library portal that provides entree to our data. A cross database search of our portal and databases provides access to information across all these sources. We presently provide a site search feature that searches our portal pages and all databases. The results screen shows the results for each database 5 at a time, with links to view the remaining hits per database. Log analysis of our site search shows that users expect to find any type of information we collect and serve via our site. In our search logs you can find known-item searches for databases, for book and article titles, for individual market reports, for topics, etc. This is the feature we are redesigning.
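As a rough illustration of the results screen described above, the sketch below returns the first 5 hits for each database along with a count of the hits still to be viewed. The database names, the `preview_results` helper and the sample hits are all hypothetical; this is not our production code.

```python
from typing import Dict, List, Tuple

PREVIEW_SIZE = 5  # the results screen shows 5 hits per database at a time


def preview_results(
    hits_by_db: Dict[str, List[str]],
    preview_size: int = PREVIEW_SIZE,
) -> Dict[str, Tuple[List[str], int]]:
    """Return, per database, the first few hits and the number of hits left to view."""
    previews = {}
    for db_name, hits in hits_by_db.items():
        shown = hits[:preview_size]
        remaining = max(len(hits) - preview_size, 0)
        previews[db_name] = (shown, remaining)
    return previews


# Hypothetical hit lists standing in for real database responses.
sample = {
    "market_reports": [f"report-{i}" for i in range(12)],
    "company_profiles": ["profile-a", "profile-b"],
}
previews = preview_results(sample)
for db_name, (shown, remaining) in previews.items():
    print(db_name, shown, f"({remaining} more)")
```

Each database keeps its own ranking here; nothing is compared across databases, which is exactly the limitation the redesign is meant to address.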
The data we are searching includes many different content types. Content includes digital library collections (from newsletter archives to news feeds, market reports and company profiles), enterprise databases (from online dictionaries to document repositories) and external data sources (such as citation indexes and external database vendor sites). Our design approach has centered around two simultaneous efforts: 1) low-level examination and analysis of search logs and 2) top-down definition of rules based on bottom-up creation of use cases for our personas. The rules we came up with are relatively high level. We're now struggling with more specific rules specifying site search functionalities.
I describe cross database searching as a task that can 1) help users identify resources and/or 2) help people find more granular results within resources or sets of resources. This is a high-level view. Prior to describing functionalities, we expected that we would encounter issues with any effort to 1) divide sources into categories and 2) define relevancy rules across sources. We also believed that the time needed to build cross database search results would be an issue.
Feedback
The three numbered points below summarize the responses to my inquiry. They describe the approaches taken in applications where multiple databases are searched and the results collated in some form. A fourth approach was not to provide collated results in any form. It was noted that this was due to an inability to implement collation and not to any lack of need.
1) Collated and clustered
One person is considering blended results and post-query clustering on documents. His approach is similar to Vivisimo's or Northern Light's. We agreed that the task of blending or collating results is difficult and not always sensible. Very often we're dealing with different document types, different metadata schemas and different sets of relevancy rules per database. He notes that collating results across databases with disparate document types may produce better results for his users, but it is more difficult to execute. He also suggested that one of our current prototypes -- results with databases mapped to categories -- may be enough if our current search is bad or nonexistent. He reports that relevancy rules across databases are difficult to understand and that in his organization this task belongs to database owners/administrators.
2) Collation under categorized searches
Another person presents cross database results in categories, mapping each database to a category. Their group opted for this strategy after considering collating all results. She notes that a decision to go with categories vs. deduped/blended results really depends on user needs. Their group decided that categorized results are sufficient after doing search term analysis. They happen to be using the Verity UltraSeek database module as their tool.
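The database-to-category mapping can be sketched as follows. This is only a minimal illustration under my own assumptions: the category names, the database names and the `group_by_category` helper are hypothetical, and it says nothing about how the UltraSeek module does this internally.

```python
from collections import defaultdict

# Hypothetical mapping of each searchable database to one display category.
DB_TO_CATEGORY = {
    "newsletter_archive": "News",
    "news_feed": "News",
    "market_reports": "Market research",
    "company_profiles": "Companies",
}


def group_by_category(hits_by_db, db_to_category=DB_TO_CATEGORY, default="Other"):
    """Group hits under the category of the database they came from."""
    grouped = defaultdict(list)
    for db_name, hits in hits_by_db.items():
        category = db_to_category.get(db_name, default)
        for hit in hits:
            # Keep the source database with each hit so it can be shown to the user.
            grouped[category].append((db_name, hit))
    return dict(grouped)


grouped = group_by_category({
    "newsletter_archive": ["Q3 newsletter"],
    "news_feed": ["Merger announced"],
    "market_reports": ["Widgets 2004"],
    "citation_index": ["Smith 2001"],  # unmapped, so it falls into "Other"
})
for category, hits in grouped.items():
    print(category, hits)
```

Within a category the hits simply follow the order of their source databases, so no relevancy rules have to be reconciled across databases.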
3) Collated
James Robertson pointed me to BBCi Search, and John O'Donnovan discussed BBC site search and its cross-database search functionality with me at a high level, using the illustrations in Martin Belam's "A Day In The Life Of BBCi Search". BBC's site search provides collated results from across their entire site, with the exception of their news corpus, which is kept separate from this collation. I assume that they did not have difficulty with relevancy ranking in this combined corpus if their system primarily indexes web page content, i.e. does not index non-HTML documents or specific fields in data sources.
John provided thoughts about cross database search design. He says the core issues will revolve around figuring out if and how we will want to normalise, compare and display results given our particular business rules and users. If it is to be a generic solution then we will need to have adjustable rules and collation.
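One way to picture "normalise, compare and display" with adjustable rules is the sketch below. It is an assumption on my part, not something John described: it uses min-max scaling to bring each database's raw scores into a common range, then applies a hypothetical per-database weight before merging. The database names, scores and weights are invented.

```python
def normalise(scores):
    """Min-max scale the raw scores from one database into the range 0..1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All hits scored the same; treat them as equally relevant.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def collate(results_by_db, weights=None):
    """Merge hits from all databases into one list ordered by weighted, normalised score."""
    weights = weights or {}
    merged = []
    for db_name, hits in results_by_db.items():
        if not hits:
            continue
        scaled = normalise([score for _, score in hits])
        weight = weights.get(db_name, 1.0)  # adjustable rule: default weight is 1.0
        for (title, _), s in zip(hits, scaled):
            merged.append((title, db_name, s * weight))
    return sorted(merged, key=lambda row: row[2], reverse=True)


# Raw scores are on different scales per database, as is typical.
results = {
    "portal_pages": [("Home", 12.0), ("About", 3.0)],
    "market_reports": [("Widgets 2004", 0.9), ("Gadgets 2003", 0.4)],
}
ranked = collate(results, weights={"market_reports": 0.8})
for title, db_name, score in ranked:
    print(f"{score:.2f}  {title}  [{db_name}]")
```

Min-max scaling makes the top hit of every database look equally strong, which is one reason the weights need to stay adjustable per database and per set of business rules.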
Summary
A few other people commented that they experienced considerable pain dealing with cross database searching. This is not surprising. This is a nascent area of the search industry. Information professionals are familiar with cross database search in database aggregator applications such as Dialog and in Z39.50-based federated search. These search tools have typically only helped you identify a resource, and they require an extra step of executing your search on a single database. The type of cross database searching that is being pursued more heavily is concerned with presenting collated, deduped and relevancy-ranked results. As noted by Webfeat in a recent InfoToday article, in truth this goal is not completely possible right now. (For more on this topic, see "The Truth About Federated Searching".)
The process of specifying cross database search is obviously concerned with defining rules that apply to our specific target users. Turning the high-level goals into more specific rules can be an arduous task when the searchable targets (sources) contain a diverse set of document or content types and metadata schemas. Collating results in this type of environment is increasingly difficult and may not produce the high relevance that is associated with search within individual databases. This situation gives rise to two questions whose answers may be in conflict. The first is related to users -- who will be using this and what types of results do they expect? The second is related to systems -- how can we give the most relevant results given the types of searches we see executed presently? We're continuing to bang out prototypes to see if our ideas are executable, but balancing what people will want with what will give the best results is tricky.