Search and data mining on the web: from string matching to assistive exploration
A quarter-century ago Web search stormed the world: within a few years the Web search box became a standard tool of daily life, ready to satisfy informational, transactional, and navigational queries as required towards some task completion. However, two recent trends are dramatically changing the role of this box: first, the explosive spread of smartphones brings significant computational resources literally into the pockets of billions of users; second, recent technological advances in machine learning, artificial intelligence, natural language understanding, and speech processing have led to the wide deployment of assistive AI systems, culminating in personal digital assistants. Along the way, the "Web search box" has become an "assistance request box" (implicit, in the case of voice-activated assistants) and likewise, many other information processing systems (e.g., e-mail, navigation, personal search, etc.) have adopted assistive aspects.
Formally, an assistive system is defined as a selection process within a base set of alternatives driven by some user input. The output is either one alternative or a smaller set of alternatives, possibly subject to further selection. Hence, classic IR is a particular instance of this formulation, where the input is a textual query and the selection process is relevance ranking over the corpus.
In increasing order of selection capabilities, assistive systems can be classified into three categories:
- Subordinate: systems where the selection is fully specified by the request; if this results in a singleton the system provides it, otherwise the system provides a random alternative from the result set. Therefore, the challenge for subordinate systems consists only in the correct interpretation of the user request (e.g., weather information, simple personal schedule management, a "play jazz" request).
- Conducive: systems that reduce the set of alternatives to a smaller set, possibly via an interactive process (e.g., the classic ten blue links, the three "smart replies" in Gmail, interactive recommendations, etc.).
- Decisive: systems that make all necessary decisions to reach the desired goal (in other words, select a single alternative from the set of possibilities) including resolving ambiguities and other substantive decisions without further input from the user (e.g., typical translation systems, self-driving cars).
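The three categories can be illustrated with a minimal sketch that treats each system as a selection function over a base set of alternatives. Everything named here is hypothetical and chosen only for illustration: the function names, the `matches` predicate standing in for an interpreted request, the `score` function standing in for whatever ranking or decision model a real system would use, and the choice to return `None` when nothing qualifies (the definitions above do not cover that case).

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def subordinate(
    alternatives: Sequence[T],
    matches: Callable[[T], bool],
    rng: Optional[random.Random] = None,
) -> Optional[T]:
    """The request fully specifies the result set; no ranking is applied."""
    result = [a for a in alternatives if matches(a)]
    if not result:
        return None  # assumption: an empty result set yields no answer
    if len(result) == 1:
        return result[0]
    # Several alternatives satisfy the request: pick one at random.
    chooser = rng if rng is not None else random
    return chooser.choice(result)


def conducive(
    alternatives: Sequence[T],
    score: Callable[[T], float],
    k: int = 10,
) -> List[T]:
    """Reduce the base set to at most k alternatives for the user to choose from."""
    return sorted(alternatives, key=score, reverse=True)[:k]


def decisive(
    alternatives: Sequence[T],
    score: Callable[[T], float],
) -> Optional[T]:
    """Commit to a single alternative without further user input."""
    return max(alternatives, key=score, default=None)
```

Under these assumptions, a "play jazz" request corresponds to `subordinate(tracks, is_jazz)`, the ten blue links to `conducive(pages, relevance, k=10)`, and a translation system to `decisive(candidate_translations, quality)`. The sketch omits the interactive refinement that a conducive system may offer, and it hides the real difficulty of each category inside `matches` and `score`.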
This lecture is organized as follows: in the first part I will present a personal perspective and retrospective on selected Web search and Web data mining technologies: both what turned out gratifyingly right and what turned out embarrassingly wrong. Topics will include near-duplicates, the Web graph, query intent, inverted index efficiency, and others. While this is an idiosyncratic collection, the driving force behind all of these technologies has been an inexorable quest for larger scale, more speed, and greater functionality. In the second part of the lecture I will show how this quest naturally leads to assistive AI solutions that are becoming pertinent to a wide variety of information processing problems. I will mostly present ideas and work in progress, and there will be many more open questions than definitive answers.