Prabhakar Raghavan on building a secure foundation for information retrieval.
Dr. Prabhakar Raghavan is vice president and chief technology officer of Verity, Inc., in Sunnyvale, California. An ACM Fellow, he was a researcher for IBM before joining Verity. He holds a bachelor's degree from the Indian Institute of Technology, Madras; a master's in electrical and computer engineering from the University of California, Santa Barbara; and a Ph.D. in computer science from the University of California, Berkeley.
UBIQUITY: Let's start with Verity, why don't we? Tell us about Verity and your relationship with it.
PRABHAKAR RAGHAVAN: Verity is a public company in enterprise infrastructure software. Within the knowledge management industry, it is well acknowledged by analysts such as IDC and Gartner as the leader. I am vice president and chief technology officer.
UBIQUITY: Describe your job.
RAGHAVAN: My job is to set the strategic future of the company technically. It's a mixture of a technical role and a business role, where I see the direction of the markets and the direction of the technologies and bring them together.
UBIQUITY: In the simplest terms, what does Verity do?
RAGHAVAN: Most people are familiar with search and classification on the Web: for instance, Web search services such as Google and Web classification services such as Yahoo!. Verity does the same thing for enterprise and e-commerce portals as well as e-business applications and the content within those spaces. The amount of content on the Web is a few tens of terabytes. The amount of content in corporate enterprise repositories is many orders of magnitude larger than that. I would argue that the most valuable content in the world is that which is protected by and kept inside enterprises. Searching, organizing and personalizing information inside of companies is what Verity is about. While I used the analogy of the Web to set the stage, the technical challenges inside companies are very different from those for the Web.
UBIQUITY: How are they different?
RAGHAVAN: It goes both ways. There are things that are easier inside companies, and then there are things that are more challenging. Let me explain what I mean by this. What makes it easier is you don't have as many people creating content with very divergent motives. You don't have totally random authorship styles. And typically you don't have spam in enterprises, nor do you have users who are trying to break your system. Those are reasons why enterprise content is easier to deal with than the Web. Now, there are many reasons why it's harder to deal with. The primary factor is what we call "fine-grained security."
UBIQUITY: What is fine-grained security?
RAGHAVAN: The point of fine-grained security is that the enterprise can restrict the documents that users can access. Let me amplify. If you are on the Web, you and I are equal citizens, meaning if you can access a Web page, so can I. In enterprises, it's different. For example, mergers and acquisitions documents are only visible to people in the mergers and acquisitions department or in high management. Human resources documents about benefits are available to everybody, but maybe documents on an impending layoff shouldn't be visible to everybody.
UBIQUITY: How would you explain why this is important for search and classification?
RAGHAVAN: Let's say an engineer issues an innocuous query for "Cisco", and all this person is trying to figure out is what Cisco routers she can buy for use in her company. If the search results were to say, "Here is a document about a Cisco takeover," then at that point you have already compromised information. It doesn't matter that she never sees the full text of that document; the result listing alone has revealed that it exists. So a completely innocent query compromised information this individual was not supposed to see. In summary, fine-grained security is the ability to interlace search with security at the document and individual levels.
UBIQUITY: This sounds like a large technical challenge.
RAGHAVAN: A huge one! The challenge is compounded by the fact that documents reside in different repositories within a company -- such as intranet HTML, databases, enterprise applications like human resources applications as well as Lotus Notes, Microsoft Exchange and other content management systems. Corporate data is stored in disparate systems in geographically diverse networks. You have to pull it all together and make it indexable, searchable and classified, while respecting the fine-grained security.
UBIQUITY: Are there other challenges that make it more difficult to work on intranets?
RAGHAVAN: Another challenge is the diverse types of documents that exist within an enterprise. Documents can exist in hundreds of different file formats such as PDF, Microsoft Word, as well as in legacy formats that many people may have forgotten. You may not even have the application anymore, but it's a key document that was written five years ago. The question becomes: how do enterprises show the information to you? Those are some of the challenges that arise in enterprise software that are very different from the Web.
UBIQUITY: Can you give a general idea of some of the solutions? For instance, how do you protect that document that talks about a Cisco takeover? (And let's be sure we emphasize that the Cisco takeover is just a hypothetical example. There's no need to call a stockbroker on our account.) So how do you protect it?
RAGHAVAN: I'll describe some of the framework for the solution, but how it gets deployed depends on the customer's needs. Search engines typically begin with the process known as indexing. For each document, the search engine creates an index, which is a list of all of the words that are contained in the document. So when I say, I want all documents containing the word 'John', the system goes through the indexes and retrieves the documents containing the word 'John.' It can be more sophisticated. For instance, it could ask for documents that contain a particular phrase like "International Business Machines" or "Association for Computing Machinery." That's fairly standard technology.
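The indexing step described above is the classic inverted index. Here is a minimal sketch in Python; the documents and the bare whitespace tokenizer are illustrative assumptions (a production engine would also record word positions to support phrase queries like "International Business Machines"):

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: word -> set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the IDs of all documents containing the given word."""
    return index.get(word.lower(), set())

# Hypothetical corpus for illustration
docs = {
    1: "John wrote the quarterly report",
    2: "Minutes of the board meeting",
    3: "John reviewed the merger documents",
}
index = build_index(docs)
hits = search(index, "John")  # documents 1 and 3 match
```

The query never scans document text at search time; it only consults the precomputed word-to-document map, which is what makes retrieval fast.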
UBIQUITY: What happens after the indexing process?
RAGHAVAN: During the indexing process, the system knows which word or phrase is contained in which document. However, we also maintain for each document a list of groups of users who can access the document. For instance, a particular document can only be accessed by people in our Japan office or by people in executive management. When the user initiates a search, the engine checks her credentials. It figures out which group she is a member of. For instance, is the user in the Japan office? Is she in human resources? Then an intersection process happens, where we take all of the retrieved results and check them against all of the groups that the user belongs to. It's a fairly exhaustive process.
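The intersection process just described can be sketched as a filter applied after retrieval: keep a hit only if the document's list of permitted groups overlaps the searching user's group memberships. The index, the access-control lists, and the group names below are all hypothetical, not Verity's actual data model:

```python
# Hypothetical inverted index and per-document access-control lists
index = {"cisco": {1, 2}}
doc_groups = {
    1: {"all-employees"},          # router datasheet: visible to everyone
    2: {"m-and-a", "executives"},  # takeover memo: restricted
}

def secure_search(index, doc_groups, user_groups, word):
    """Retrieve matching documents, then keep only those whose
    access-control list intersects the user's group memberships."""
    hits = index.get(word.lower(), set())
    return {d for d in hits if doc_groups.get(d, set()) & user_groups}

engineer = {"all-employees", "engineering"}
results = secure_search(index, doc_groups, engineer, "Cisco")  # only document 1
```

Because filtering happens before results are shown, the restricted memo never appears in the engineer's result list, so its mere existence is not leaked.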
UBIQUITY: Does the engine verify the credentials of the person initiating the search?
RAGHAVAN: It could say that you're a manager because some static directory says you are a manager. Or at the time of search it could go and ask somebody with authority if you are still a manager. There are a few customers who are interested in having real-time reassessment of credentials. But that can slow things down a little bit. The solution again is to maintain group memberships, and then check what groups the user is a member of, and then verify those against which groups can access these documents.
UBIQUITY: Would there be a way for an assistant to conduct a search?
RAGHAVAN: The assistant will have his or her own identity, and would in turn only be shown documents that an assistant is able to see, and nothing else. We let customers decide how to delegate that privilege.
UBIQUITY: And is there a way to retroactively make things invisible? Suppose a company suddenly got into a lawsuit involving Cisco. Is there a way to say, "From now on people cannot view certain documents that mention Cisco"?
RAGHAVAN: Yes, absolutely. That's easy. However, that's only from now on, meaning that if you saw the document yesterday, there is nothing I can do about it today.
UBIQUITY: So, in what you do, security is every bit as important as searching?
RAGHAVAN: Yes. This aspect of secure search is the foundation from which we build up deeper functionality that corporations demand. To lay it out at the very high level, Verity's information management infrastructure can be divided into three tiers. The first tier is "Discovery," which includes basic and advanced search features such as secure search. There are other forms of search that enterprise users do. This has to do with how enterprise users' behavior diverges from that of consumers on the Web. Consumers on the Web typically have an interest in making a transaction, meaning, "I want to buy a product," or a navigational need such as, "I want to find the home page of Pizza Hut because I want to take my kids there tonight." Enterprise knowledge workers tend to place a higher premium on their time, and so they demand a richer suite of discovery services than just keyword search.
UBIQUITY: What comes after the base layer?
RAGHAVAN: The second tier is "Content Organization," what I briefly alluded to as classification. Content organization means to analyze, extract significant concepts, and build out taxonomies on the content in the enterprise. It goes without saying that these taxonomies, again, will respect security privileges in the manner I mentioned earlier.
UBIQUITY: And the final layer is?
RAGHAVAN: The third tier, which I think really is the harbinger of the future, is to invoke ideas from social network theory. I'm sure you are familiar with Milgram's theory of "six degrees of separation," one of the standard examples of social network theory. The point is instead of looking only at interactions between people, we want to look at the interactions between people and the documents in the enterprise. Which documents are you reading? Which documents are you opening? Who else in your department is looking at those same documents? Or what other documents are your employees looking at?
UBIQUITY: So you're actually extending social network theory, right?
RAGHAVAN: Right. There was a bunch of research in the '90s on tapping the social network for knowledge management. We are looking inside the social network. The idea is to analyze, to mine these linkages and use them to recommend documents to you as you are going about your daily tasks, or to bring to you experts in your enterprise on the path you seem to be taking. For instance, if you were researching a particular subject or an individual, it might bring up not only their home page, but maybe documents they have written, as recommendations. The idea is to minimize the burden of how much you have to type in physically and to infer cues based on your behavior.
UBIQUITY: How is this different from classic citation studies, where you would look at all of the people who have cited your papers?
RAGHAVAN: Good question. Citation analysis is actually one of the cornerstones of the field. But understand, Verity does more than citation analysis. Citation analysis looks at behavior: if you and I are looking at the same documents, and I then go look at a new set of documents, citation analysis would suggest that you too should look at those new documents. That's based on our behavior. Verity takes this to the second level. We also bring documents together: since their content overlaps, they tend to have the same vocabulary. As a result, if you and I are looking at similar documents, that somehow brings us closer together. Citation analysis simply focuses on behavior, but we also focus on the content. There is a third aspect that has to do with role. Verity's technology examines the notion of role. For example, if you and I are in the same department within a company, let's say, then somehow your opinion and your behavior might mean more than that of a person in a different department. In other words, you and I will influence each other more heavily, because presumably our tasks and our business goals are similar.
UBIQUITY: So how does it all come together?
RAGHAVAN: In the combination of these three aspects: behavior, content and role, the challenge is to try to derive value from all of these as opposed to from one or the other. Verity focuses heavily on the business benefits that we can deliver to our customers. We like to try and quantify the ROI that companies make as a result of our technologies.
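Deriving value from behavior, content and role together, rather than from any one alone, can be sketched as a weighted blend of three normalized signals. The weights and the signal definitions below are illustrative assumptions, not Verity's actual scoring model:

```python
def combined_score(behavior, content, role_match, weights=(0.5, 0.3, 0.2)):
    """Blend three relevance signals into one recommendation score.

    behavior   -- co-access overlap with the user, normalized to 0..1
    content    -- vocabulary similarity to the user's recent reading, 0..1
    role_match -- closeness of the recommender's role to the user's, 0..1
    """
    wb, wc, wr = weights
    return wb * behavior + wc * content + wr * role_match

# A colleague in the same department (role_match=1.0) with modest
# behavioral overlap outweighs a stranger with slightly more overlap.
same_dept = combined_score(behavior=0.4, content=0.6, role_match=1.0)
stranger = combined_score(behavior=0.5, content=0.6, role_match=0.0)
```

The design point is that the weights let an enterprise tune how strongly shared role amplifies shared behavior, which is exactly the "influence each other more heavily" effect described above.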
UBIQUITY: How do you quantify ROI on a search and classification system?
RAGHAVAN: Much of it comes from time-motion studies. We've had customers document this and tell us that before installing our software, their employees spent X amount of time looking for information. Once they put our software in place, X dropped to Y (whatever amount). Say knowledge workers are compensated on average $90,000 a year. The fact that the knowledge workers can save two-and-a-half hours per week searching for information translates to an enterprise savings of $300 million a year. A second form of ROI comes from deployments on e-commerce sites, such as Home Depot or pressplay. These companies look at the conversion of browsers into buyers and the resulting uptick in transactions. Here is a simple example for an e-commerce site. If you are not sure how to spell Tchaikovsky, you're not going to find that Tchaikovsky CD. Therefore, a shopper goes away who should have made a transaction. Verity's search software includes spell correction. If you misspell Tchaikovsky, it will say, "It looks like you meant 'Tchaikovsky,'" and pull up the results. Another way of measuring ROI is to consider accepted industry figures from analysts like Forrester and IDC. Let's say that if you had to manually classify documents, it would cost you about $25 per document. The value proposition here is to say, well, if you have two million documents, you could manually maintain them at a net bill of $50 million. Instead, if you augment the process so Verity's intelligent classification technology does most of the job, you save yourself a lot of money. So these are some of the ways we like to concretely measure ROI. It's great to bring fantastic technology to the market, but we want to make sure there is a business benefit because in this economic climate especially, ROI is very critical for our customers.
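The time-motion arithmetic above can be worked through explicitly. The hourly rate and working year below are illustrative assumptions (a $90,000 salary over a 48-week year of 40-hour weeks); the $25-per-document and two-million-document figures come from the interview:

```python
# Time-savings ROI, under assumed working-year parameters
salary = 90_000
working_hours_per_year = 48 * 40            # assumed: 48 weeks of 40 hours
hourly_rate = salary / working_hours_per_year

hours_saved_per_week = 2.5
saved_per_worker = hours_saved_per_week * 48 * hourly_rate  # dollars/year

# Classification ROI figures from the interview
manual_cost_per_doc = 25
document_count = 2_000_000
manual_classification_cost = manual_cost_per_doc * document_count
```

Under these assumptions each worker saves about $5,625 a year, so reaching $300 million a year implies a workforce on the order of 50,000 knowledge workers, which is plausible only for the very largest enterprises; the classification side comes to the $50 million quoted in the interview.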
UBIQUITY: To what extent is it possible to move the basic system components from client to client? How much tailoring is necessary?
RAGHAVAN: We have maintained over the years a ratio of roughly 10 percent of our total revenue from consulting, consulting being the portion where our software consultants go out and build the application, which means 90 percent of our revenue comes from our software. The idea is to build as much functionality into the core engine as possible. A good way to think about it is we are like a database for text, just like Oracle is a database for relational and transactional data. People build applications on top of their database. We focus on the core engine and do relatively little customization from one installation to the other. Now, that said, some customers like to tailor the look and feel of their content. So there is some customization, but for operational reasons we keep that to a minimum and embed the more critical functionality in the engine.
UBIQUITY: You mentioned the Home Depot case of an e-commerce site. That's not within the intranet?
RAGHAVAN: Home Depot is an e-commerce Website which has deployed Verity's advanced search technology. Approximately 60 percent of our revenue comes from intranet applications; 15 to 20 percent comes from e-business sites. Verity doesn't go after the free content on the Web. We manage the proprietary content within and facing out of companies.
UBIQUITY: Is there some reason that you specifically don't want to get into the free content on the Web?
RAGHAVAN: We don't see a lot of money to be made on the Web. When the Web boom came, Verity was there and predated the Web search engines. In some sense it might have been a good marketing exercise, but from the standpoint of revenues and business, it didn't make sense so we never went in that direction. Our choice is being validated because you see a stream of Web search companies such as AltaVista, Inktomi and even Google trying to move into the enterprise space.
UBIQUITY: And yet, you think of the Verity system as a great engine, don't you?
RAGHAVAN: Absolutely. But at the same time, to put it out on the Web and build up an infrastructure for serving Web sites is definitely not money that we would invest at this point.
UBIQUITY: How would you compare it to something like Google?
RAGHAVAN: In terms of technical evaluation, I would say that just as Google does a fine job of addressing the challenges on the Web, we do as good a job or better in addressing the challenges of the enterprise, such as security requirements, diverse formats and repositories. So we are much more cued into dealing with the needs of the enterprise. Google is a service. We are focused on being a software company, not on being a service. I'd like to think we excel at what we are doing, much as Google excels at dealing with the Web. Now, as much as they try to migrate, they'll run into enterprise realities and will have to build their experience and products to address those realities before they can compete with us in the marketplace. Two years ago, when I left IBM, I considered the Web search engine companies, and I thought there was a more compelling proposition here.
UBIQUITY: Before we leave the subject of general search engines, are there any that you find particularly good?
RAGHAVAN: I think the market speaks for itself. People have voted with their clicks, and Google seems certainly to be the most popular. I see the other search engines picking up as well, but certainly Google has a reputation and a brand now that is going to be hard for others to overcome.
UBIQUITY: Let's go back in time and look at your intellectual evolution. How did you get to where you are now?
RAGHAVAN: I finished my Ph.D. in 1986 and went to IBM Research. I spent nine years at the T.J. Watson Research Center, and then four and a half at the Almaden Research Center. Both are very exciting workplaces in terms of scientific excellence. I had the opportunity to work with some very intelligent people there. IBM is a large enough company that it can say, "Go and do neat things and have an impact." What that translates to, especially when you are young and fresh out of college, is a tremendous opportunity to grow professionally and establish a scientific reputation and establish your presence in the field. During the period, I moved to a number of different areas, starting from core algorithms in theory to more of a focus on what you could call applied theory. Interestingly, even now I get e-mail from people in that business saying, "We are trying some of your techniques from way back then." Then I moved on to work with databases and data mining, especially after I moved to Almaden. The Almaden Lab has a fairly strong focus on data storage and data management. In that environment, I was stimulated by colleagues who were thinking about the issues of databases and data mining. And so, both my research and publications moved to more of that focus. One of the things I did there was to initiate a project on Web search and mining, which was publicly called the Clever project. We presented a lot of nice ideas to the scientific community, many of which I am pretty certain have found their way into commercial Web search engines. That was really an exciting period as well, because the Web was taking off at the time.
UBIQUITY: What's it like doing research in a corporate environment?
RAGHAVAN: It is often exhilarating. I'll give you an example of the kind of question that I used to ask my colleagues there, and this will give you a sense of how things go in a research lab environment. I said, "The big challenge is to imagine that you have no computation constraints. If you had this power at your disposal, how could you do a better search engine?" Meaning, could you give people better answers? Now, that's more than an academic exercise, because it actually led to the development of some of the algorithms that we came up with in the project. What it does is set you free to think in a different dimension, and then you come back and say, "All right, now we invented an algorithm that gives great results, but is excruciatingly slow." What can you then do as a computer scientist to bring its performance to an acceptable level? Being able to do that and thinking about ideas is very exhilarating, and that's what you could do at an IBM Research facility.
UBIQUITY: And what brought you to Verity?
RAGHAVAN: Having done that for some years, what I came to recognize perhaps three years ago was that I wanted to be much closer to where the rubber meets the road, so to speak, or where the algorithm meets the electronics. I wanted to take these great ideas and move them aggressively into products. That led to my coming to Verity two years ago, where I could see that I had influence in all parts of the business, in sales and marketing and, most importantly, the technology. I could decide which technical ideas go into the product to help solve the customers' business problems.
UBIQUITY: Compare the two experiences.
RAGHAVAN: It's a very different lifestyle. One way I described it to my former IBM colleagues was, "At Verity, I have the right and responsibility to make a difference." It's a different set of challenges.
UBIQUITY: When you design systems for clients, do you, yourself, get involved with the clients and the clients' problems?
RAGHAVAN: Certainly with some of the larger clients, for a variety of reasons, ranging from projecting technical competence to deciding what we do when issues arise. I have to say, it's not just me. Over the last two years, we have built a team of people with research backgrounds. About 15 percent of our technical organization holds PhDs. It's an environment that is intellectually stimulating.
UBIQUITY: What is the largest and smallest level of company that your products are used by?
RAGHAVAN: Eighty percent of the Fortune 50 and 66 percent of the Fortune 100 companies are Verity customers. That includes both the enterprise customers as well as e-commerce customers.
UBIQUITY: As your involvement with information retrieval has deepened and broadened, have you been surprised especially by anything that's happened?
RAGHAVAN: Well, I don't know if this should surprise you, but I feel that information retrieval as a science has languished for a while. But then the Web reshaped information retrieval in a number of ways. The Web finally took in terabytes of information that most of us had no access to and put it all together in a network. Suddenly all of us were able to get all of this information, and that shifted everybody's expectations on what it is we should be able to do with knowledge management, that we should each be able to easily search and browse categories. So that really reshaped people's expectations and behavior. The other thing that went on at the same time, and to me this is the more valuable part in the long run, is the networking revolution. Over the '90s, if you think about it, companies like Cisco and Nortel flourished. But why did they flourish? Well, what they were doing was selling networking gear to all of the biggest companies in the world. So large multinational companies like ABB, which is one of our customers, or PwC or KPMG, instead of having a lot of data and information silos scattered around the world, suddenly found that they could tie all of this information together. If you think about it, the '90s was a period when the bits and bytes scattered around the world were tied together by the network revolution.
UBIQUITY: So what's the future look like?
RAGHAVAN: In the next 10 years we have to go beyond the bits and bytes and look at the information. What does it take to tie that information together? Companies have realized that having spent billions of dollars on tying the bits together, they now have to get value. Information that is now being made accessible should also be searchable, retrievable and browsable. There's a parallel between what happened on the Web, which changed information retrieval, and what's happening inside the enterprise. Call it a deeper driver of information retrieval, because it's no longer the case that I want to search the 5,000 documents on my local computer. I have the whole world at my fingertips.