It's all about the database

Ubiquity, Volume 2002 Issue April, April 1- April 30, 2002 | BY John Gehl


Phil Bernstein on the unheralded potential of meta-data management.

Phil Bernstein is a Senior Researcher at Microsoft Corporation. Over the past 25 years, he has published more than 100 articles on the theory and implementation of database systems, and is a co-author of three books, the latest of which is "Principles of Transaction Processing for the System Professional" (Morgan Kaufmann, 1997). He is an ACM Fellow, a recipient of the ACM SIGMOD Innovations Award (for database research), and a member of the Board of Trustees of the Computing Research Association and of the VLDB Foundation. His research is in the areas of meta-data management and transaction processing systems, details of which are at http://www.research.microsoft.com/~philbe.

UBIQUITY: One of the most common complaints about technology is that everything's gotten too complicated. Do you sympathize with that complaint?

BERNSTEIN: Well, yes, up to a point; products are certainly getting very complicated. The fact is that everybody wants more features, but then they're unhappy when some of the complexity comes back to confuse them or bite them in some way.

UBIQUITY: Is there any hope for dealing with the complexity problem, or is it going to get continuously worse?

BERNSTEIN: It gets both continuously worse and better at the same time, but in different respects. When people come up with ideas for useful features to add to products, and when customer demand for those features becomes evident, software vendors incorporate those features because of the competitive advantage and because their customers like them. But very often, the interactions of those new features with the existing complex software base that they are added to are not fully understood.

UBIQUITY: So the glass of water is half full and half empty. And the half-empty part will always be empty?

BERNSTEIN: I'm afraid so. Nobody knowingly adds features with the intention of creating bugs, but these big products -- not just applications but also operating systems, database systems, compilers and so on -- are too complicated for any one person to understand. When you add a feature, you can't possibly be sure that there will be no adverse interactions. That's the sense in which things get worse -- more complicated and buggier, or at least more confusing.

At the same time, product developers and researchers are always looking for more powerful and simpler abstractions that capture existing functionality plus the new features that creep in over time. When those simpler abstractions are discovered, and ultimately incorporated into products, it often requires a rewrite of a product or a portion of a product. The simplicity embodied in that rewrite makes the features more robust and more reliable because their interactions are easier for designers and developers to understand. In that sense, things get better. But these improvements don't come overnight. And every now and then you have a big leap where a simple abstraction actually comes before the feature complexity.

UBIQUITY: What would be an example?

BERNSTEIN: The relational data model is one. But these leaps are few and far between. Most progress is done incrementally, partly because these great new ideas come infrequently and partly because you can't throw out these enormous products and start over every time you have a new idea. For example, I work in the database field. The big database vendors are all facing this problem today with XML. XML introduces a new query language, XQuery, which shares a lot of its capabilities with SQL, the current standard, but also has some opportunities for doing things differently. Do you do it incrementally? Or do you write your query processor over again?

UBIQUITY: Characterize the new possibilities from a database point of view.

BERNSTEIN: If adding XML were neatly encapsulated in one part of the system, a rewrite of that one part wouldn't be so bad. The problem is that it potentially affects every layer. XML is a new data type. You could store it using existing SQL data types or you could create a new data type for XML. You could use existing storage structures like binary large objects to store XML in a bit stream, or you could shred it into separate fields of different columns of tables. Or you could develop a new access method, a new physical representation that somehow gives you advantages over the existing options that the system supports. There are potentially new operators you can perform on XML trees that are not easily represented in a table-oriented SQL language. You could implement those. You could have a new language on top with the potential to optimize it differently, given the new repertoire of operators and access methods that are available underneath. So, you could make an argument for rewriting the whole thing. Obviously, some database system code bases are more flexible in being able to make these additions in all the different parts of the system without disrupting the overall shape of the software. Others are too old and have undergone too much modification, making these sorts of changes much more difficult. It's a classic case of what happens with new functionality. Do you rewrite or make incremental forward progress?
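
An illustrative sketch, not part of the interview: two of the storage options mentioned above, keeping an XML document whole in a binary large object versus shredding it into columns of ordinary tables. The table names, the sample order document, and the use of SQLite are assumptions made only for this example; they do not describe how any particular database product handles XML.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
ORDER_XML = """<order id="17">
  <customer>Acme</customer>
  <item sku="A1" qty="2"/>
  <item sku="B7" qty="5"/>
</order>"""


def store_as_blob(conn, xml_text):
    """Option 1: keep the whole document as bytes in a single BLOB column."""
    conn.execute("CREATE TABLE order_doc (id INTEGER PRIMARY KEY, doc BLOB)")
    root = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO order_doc VALUES (?, ?)",
        (int(root.get("id")), xml_text.encode("utf-8")),
    )


def store_shredded(conn, xml_text):
    """Option 2: shred the tree into separate columns of ordinary tables."""
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("CREATE TABLE order_item (order_id INTEGER, sku TEXT, qty INTEGER)")
    root = ET.fromstring(xml_text)
    order_id = int(root.get("id"))
    conn.execute(
        "INSERT INTO orders VALUES (?, ?)", (order_id, root.findtext("customer"))
    )
    conn.executemany(
        "INSERT INTO order_item VALUES (?, ?, ?)",
        [
            (order_id, item.get("sku"), int(item.get("qty")))
            for item in root.findall("item")
        ],
    )


def total_qty_from_blob(conn, order_id):
    # The database sees only bytes, so the application has to parse the document.
    (doc,) = conn.execute(
        "SELECT doc FROM order_doc WHERE id = ?", (order_id,)
    ).fetchone()
    root = ET.fromstring(doc.decode("utf-8"))
    return sum(int(item.get("qty")) for item in root.findall("item"))


def total_qty_from_tables(conn, order_id):
    # Shredded data can be queried with plain SQL.
    (total,) = conn.execute(
        "SELECT SUM(qty) FROM order_item WHERE order_id = ?", (order_id,)
    ).fetchone()
    return total


conn = sqlite3.connect(":memory:")
store_as_blob(conn, ORDER_XML)
store_shredded(conn, ORDER_XML)
print(total_qty_from_blob(conn, 17), total_qty_from_tables(conn, 17))
```

Both functions return the same total. The difference is where the work happens: with the BLOB the application parses the tree itself, while the shredded form lets the existing SQL query processor do the work, at the cost of deciding a table layout up front.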

UBIQUITY: Does the Microsoft vision differ from anybody else's in this respect?

BERNSTEIN: Probably not. I think that the history of the PC software industry is a little bit different in that there's been more willingness to rewrite or make massive changes to a product. I think there's been more of that with PC products than in the traditional enterprise-oriented software business. Therefore culturally, I think Microsoft is sometimes more willing to take big steps forward.

But that too is changing, because more of Microsoft's business now is oriented toward enterprise software and systems like Windows 2000 and Windows XP in the operating systems space, or SQL Server in the database market. These are already very complex programs and reliability is an important feature. You don't want to rewrite huge portions of them from one release to the next if that jeopardizes stability. Instead, you work hard on architecture and clean interfaces, and then use them to provide the base necessary to make incremental modifications for long periods in the future. Every vendor faces this. With every release they have to decide, is this the one where we're going to bite the bullet, redesign major pieces, and work four years on the next release? Why don't we just rewrite the whole thing from scratch?

UBIQUITY: What is your particular focus within databases?

BERNSTEIN: Currently I'm working on meta-data. Many people know me best for work I did in transaction processing, from the late 1970s to early 90s. But for the last five to ten years I've been focusing on problems of meta-data.

There are two kinds of meta-data that people commonly talk about. One is structural meta-data, that is, schemas, interface definitions, and other data-structure-like things, which describe how information is put together. So, an interface definition in a programming language, a database schema in a database system, a Web site map that describes how the Web pages are connected to one another -- these are all examples of structural meta-data.

The other kind of meta-data is more in the spirit of information retrieval. It's things like keyword descriptions and other content-oriented descriptions of information. People often call that meta-data, as well. The kind of meta-data that I'm currently focused on is more of the first kind, the structural kind, which is more naturally a database problem, since a big piece of all database systems is in the manipulation of schema information and other structural meta-data.

UBIQUITY: What's the nature of the central problem?

BERNSTEIN: It's currently quite laborious to manipulate meta-data. In commercial data processing, for example, we've been working with the relational model for many years as a way of eliminating record-at-a-time manipulation of data, which greatly shrinks the amount of application code that must be written. But we don't have anything comparable for the manipulation of structural meta-data. In meta-data you look at the objects one by one, and this leads to some very long programs.
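
An illustrative sketch, not part of the interview: the contrast between record-at-a-time and set-at-a-time manipulation that the relational model made possible for ordinary data. The `emp` table, its rows, and the function names are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3


def make_db():
    """Build a small in-memory table of invented employee records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO emp VALUES (?, ?, ?)",
        [("Ann", "eng", 100), ("Bob", "eng", 120), ("Cy", "ops", 90)],
    )
    return conn


def raise_record_at_a_time(conn, dept, amount):
    # The application walks every record, tests it, and updates it individually.
    rows = conn.execute("SELECT rowid, dept, salary FROM emp").fetchall()
    for rowid, row_dept, salary in rows:
        if row_dept == dept:
            conn.execute(
                "UPDATE emp SET salary = ? WHERE rowid = ?",
                (salary + amount, rowid),
            )


def raise_set_at_a_time(conn, dept, amount):
    # One declarative statement describes the whole set of records to change.
    conn.execute(
        "UPDATE emp SET salary = salary + ? WHERE dept = ?", (amount, dept)
    )


def salaries(conn):
    return conn.execute("SELECT name, salary FROM emp ORDER BY name").fetchall()


a = make_db()
raise_record_at_a_time(a, "eng", 10)
b = make_db()
raise_set_at_a_time(b, "eng", 10)
print(salaries(a) == salaries(b))
```

The two functions leave the table in the same state. The point made above is that programs working on structural meta-data still tend to look like the first function, with no widely available counterpart to the second.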

UBIQUITY: How would the world be different if you solved meta-data problems?

BERNSTEIN: There are two classes of applications that make heavy use of structural meta-data and thus would benefit from progress on this problem. These applications are never as powerful or plentiful as you might like.

One of those areas is software tools. What I mean by that is any kind of design or development tool, including tools written for software design, computer-aided design, electrical engineering, or even complex document release management systems. These are all very complicated tools. It seems like these tools are never quite as good and plentiful as we want. Part of the reason for that is that they're very laborious to write. If we could do better in offering database support for manipulating structural meta-data, these programs would shrink in length by quite a lot, which means that with the same level of effort, we could build more and better tools than we do currently.

UBIQUITY: And the second class of applications?

BERNSTEIN: The other place where it would make a difference is in schema-driven applications. There are many run-time activities that are done in systems that are currently interpreting structural meta-data. An example that's been around for over 30 years is data translation, where somebody has data in a format that was maybe state-of-the-art at the time it was developed, but is no longer widely used. They now want to translate that data into a more current format. Typically this is done by creating a mapping of some kind from the old format to the new, and then interpreting that mapping in order to translate the data. Writing translators like that is currently very laborious because it involves object-at-a-time programming on meta-data structures. A more modern example of the same thing is in electronic commerce where you're translating between message formats, for example, in a business-to-business scenario.
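
An illustrative sketch, not part of the interview: data translation driven by a mapping that is itself data, which a small interpreter applies to each record. The legacy field names (`CUSTNM`, `ORDDT`, `AMT`), the target field names, and the conversions are all invented for this example.

```python
from datetime import datetime

# Hypothetical mapping from a legacy record layout to a newer one:
# target field -> (source field, conversion function).
MAPPING = {
    "customer_name": ("CUSTNM", str.strip),
    "order_date": (
        "ORDDT",
        lambda s: datetime.strptime(s, "%y%m%d").date().isoformat(),
    ),
    "amount_cents": ("AMT", lambda s: round(float(s) * 100)),
}


def translate(record, mapping):
    """Interpret the mapping: build each target field from its source field."""
    return {
        target: convert(record[source])
        for target, (source, convert) in mapping.items()
    }


legacy = {"CUSTNM": "  Acme Corp ", "ORDDT": "020401", "AMT": "19.99"}
print(translate(legacy, MAPPING))
```

Here the translator is a few lines only because the mapping is flat. With nested schemas, the code that builds and walks such mappings is the object-at-a-time programming on meta-data structures described above.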

UBIQUITY: Do you have much interaction with people in application areas?

BERNSTEIN: For one thing, my wife is a business application architect, so I hear a lot from her about real world problems. Also, although I'm currently in Microsoft Research, I used to be responsible for a product and talked to application people all the time to get good examples of nasty problems that they were running into to help motivate good directions for products.

UBIQUITY: Do you have a methodology for finding the nastiest problems?

BERNSTEIN: The best way to do this from a researcher's standpoint, particularly in an industrial setting, is to develop a prototype system that shows people who are working with customers a solution that you think might help them. Then you let them try it out, and have them tell you all the things that are wrong.

UBIQUITY: Is that a fairly orderly process?

BERNSTEIN: Sometimes it can be a bit chaotic, but it's always interesting. In the end, the research only moves forward when you have good ideas. Not every problem has a good idea waiting in the wings to be applied. Ultimately, the proof of the pudding is in the eating. If you produce a technology that you think is going to solve a problem, and it doesn't, then it's important to find out why, even if you're not happy about hearing the reasons.

UBIQUITY: What was your undergraduate degree in?

BERNSTEIN: Computer science. I was in the first generation of undergrads that had that opportunity. Back in the early '70s there were very few computer science departments.

UBIQUITY: Then you've always been interested in computers?

BERNSTEIN: Well, since my undergraduate days at Cornell. I didn't know anything about them before I arrived. I was intending to be an electrical engineer. But I quickly discovered that I enjoyed programming, and that, lucky for me, there was a major computer science department there. It gave me the opportunity to pursue that interest early on.

UBIQUITY: What frustrates you about the field of meta-data?

BERNSTEIN: On the product side, meta-data management has never really developed into a healthy business. Like a lot of middleware, it gets stuck between the big products like programming environments, database systems, and the like. So what happens is that tool developers extend the underlying database technology, usually expending very modest amounts of effort, because they can't really sell it. The database technology is just an enabling technology that provides a somewhat nicer data manipulation capability to their meta-data driven tools.

Occasionally, someone goes off and builds a serious meta-data management product. There have been many runs at this over the years. I think some interesting and useful products have been built. Some were commercial successes, but that success has usually been short-lived. I don't think it's that the technology has necessarily been inappropriate, but perhaps it really wasn't meeting enough of people's needs to make them willing to change.

I don't know. It's hard to figure out why product categories don't succeed in a big way. But currently there's not much, if any, independent market left for meta-data management systems. My hope is that this is an opportunity for researchers to take another look at the technology and come up with something that's better than what was developed during the last round that apparently didn't succeed commercially.

UBIQUITY: How do you see meta-data, particularly, and data management generally, in terms of the whole field of computer science? Do you see it as central or peripheral in the realm of computer sciences?

BERNSTEIN: There's an old joke, that whenever you ask researchers to describe their field, they start writing on the whiteboard with a big circle in the middle that consumes half the board and contains all their stuff. And then there are these little circles on the perimeter, which is what everybody else does. Of course, databases are the big circle in the middle. Absolutely central!

Seriously, I guess I have two answers to this. One is that I think software people in general are looking for effective ways of using the enormous amount of desktop and notebook computing power that's being offered. We often point to the benefits of having more "knowledge-oriented" applications, for lack of a better term, as one opportunity. I don't think that term has a specific technical meaning, but the feeling is that we could be doing more symbolic processing, more "intelligent" things based on this huge computing power. Some of that involves manipulating much more information, whether it be lots of databases on the Web, or larger databases that can be made available and downloaded to our desktops. So databases are central. And a lot of it involves manipulating descriptions of that information. That's akin to cognitive thought, manipulating abstractions of the world rather than all the detailed knowledge we might have. That's the problem of meta-data management. This database view of meta-data is not the only approach to taking advantage of the trend.

UBIQUITY: What's another approach?

BERNSTEIN: Certainly, the artificial intelligence community has several interesting abstractions -- knowledge-based systems, natural language systems, speech systems -- which attempt to do the same thing, which is to manipulate abstractions of all the facts in the world in some intelligent way. The graphics and data visualization people also have something to say about this. So this isn't purely a matter of thought. This is also a matter of visualization.

I think many areas of computer science have their particular spin on how to do this. There's a lot of similarity in goals and probably a lot of similarity in the actual technology that they're trying to develop. For example, I've recently been collaborating more with people who have experience in the AI area, particularly in the knowledge representation area. It's clear they've been addressing the same problems that database meta-data people have been after for years. There's always been some cross-fertilization between these two areas, but it's growing now as a recognition that we're working on the same thing. I wouldn't be surprised to find other subfields that would also find quite a lot in common with the meta-data work that we do. Not to say that the meta-data work is at the center of this, but just that there are many overlaps, and that as they get discovered, hopefully we'll make faster progress.

UBIQUITY: It's always nice to hear about harmony. Is there any controversy that's broiling?

BERNSTEIN: None that has percolated up to the surface. Meta-data is a Rodney Dangerfield kind of area of computer science. It doesn't get a whole lot of respect. Everybody acknowledges that many of the problems that plague the data management field have a strong meta-data component, but the work that appears most glamorous and seems to attract the most attention among researchers doesn't have a huge amount of meta-data content. Maybe people don't know quite how to deal with it. I don't know if that's controversial exactly, but meta-data is definitely not widely considered to be at the center of things.

UBIQUITY: Putting on your academic hat for a moment, is it hard to get students interested in meta-data?

BERNSTEIN: No, it's really a straightforward matter to get graduate students interested: have a compelling vision and the prospect of some problems that are of the right size for a PhD thesis. Also, the database field is now well known to be important, so it naturally attracts students. When I started out in the early '70s, it was not clear that database was even a field. Now, of course, it's a huge field. Many PhDs graduate every year, and the field is still expanding.

UBIQUITY: Are you still teaching as an adjunct?

BERNSTEIN: I teach at University of Washington and I have a couple of PhD students there.

UBIQUITY: Can they develop a relationship with Microsoft at the same time?

BERNSTEIN: Yes, some of them do. For anybody who's working closely with me, it makes sense for him or her to visit here as an intern during the summers. There's been a flow of students back and forth. But it's not a requirement.

UBIQUITY: You're obviously in two worlds now -- Microsoft and the University of Washington. Throughout your career you've spent a lot of time in both academia and real world companies. What are your thoughts about the different cultures that you've experienced?

BERNSTEIN: I'm sure there are cultural differences, but it's not the difference that's uppermost in my mind. Instead, I think of differences between the job categories in which a PhD-educated computer scientist can make a career. The three that stand out as obvious choices are being a professor, an industrial researcher, or a product designer or related role in the product world. I spent a lot of time doing each of the three. What I've discovered is that there is no such thing as a perfect job and that each one of these roles has its benefits and liabilities. I've enjoyed doing all of them, but I don't yearn for any one of them as being obviously better.

My current affiliation with University of Washington gives me the opportunity to get some of the benefits of both places. Culturally, I think the places are really compatible, with the exception that in the university you have a little more freedom and encouragement to work on things whose direct practical importance is not so clear. That would be difficult to justify in an industrial setting. Clearly you have an advantage in an industrial setting of being closer to the problems that are experienced by customers and product developers, which makes it easier to do work that is likely to be relevant in the near future.

UBIQUITY: Talk about some of the differences between doing research in the product design, industrial, and university settings.

BERNSTEIN: The difference is that when you're in a product group, everybody's measured by their contribution to getting a high-quality product out on schedule, for an appropriate cost. Although innovation is certainly valued, it's not done at the expense of the other goals. If you produce a great patent or research paper, but the product is a year late, you're likely to be, well, let's say, criticized, for the lost business opportunities. In industrial research, you look further out. You work on problems that the product developers are unlikely to work on for the next, say, four years or more. By the time you understand it well, they'll be ready to make use of it. Obviously, it could even be farther out than that. But you don't want to be any closer in than, say, two releases out, which makes it qualitatively different than what you do in the product group. On the other hand, people in universities feel like they have to work even further out, more in the five- to ten-year timeframe, so that they don't conduct research and then find out a year later that somebody is shipping it in a product, making their research obsolete.

UBIQUITY: Has it been an advantage to you to have done all three kinds of jobs?

BERNSTEIN: For me, yes. But whatever job you have, part of that job is to figure out where your advantages are, that is, the points of high leverage that make it possible for you to make an outsized contribution. So, when working in a university as a professor, you have graduate students. They are apprentice researchers who are willing to work on problems that you choose. That's not something you get in an industrial setting very often. So, you have to make use of that advantage. By contrast, in an industrial setting you have this high bandwidth communication channel to product developers and customers, and that can give you some advantages in being able to do better, more effective work that affects lots of users. In whatever environment you find yourself, it's important to analyze what the potential benefits are, what the opportunities are, and how that plays synergistically with the strengths you have to offer.
