Word Matters Podcast

An Interview with John Morse, Part 2

The second of three special episodes from our interview with the former President and Publisher of Merriam-Webster

Hosted by Emily Brewster and Peter Sokolowski.

Produced in collaboration with New England Public Media, with much gratitude to John Voci.

Transcript

John Morse: The dictionary is one of the few places where the world's expert on a particular subject speaks directly to an end user unmediated by any editor, interpreter, popularizer.

Emily Brewster: Thanks for joining us for a very special episode of Word Matters. I'm Emily Brewster. My colleague, Peter Sokolowski, and I recently spoke with John Morse about his long career in dictionary publishing. As a lexicographer, production editor, and executive at Merriam-Webster, John's career has been a kind of microcosm of reference publishing during the years of transition from mass market print to the online information economy. His unique perspective gives us all a look behind the scenes of dictionary making, and we think you'll find part two a fascinating conversation.

Peter Sokolowski: We're going to talk about electronic dictionaries, and that involves CD-ROMs, it involves online references that were before Merriam-Webster.com, and it also involves the Tenth Collegiate Dictionary. There's a lot of the early years of dictionary digitization that we want to hear about. Where do you want to begin?

John Morse: I'm enough of an old timer that I want to push us back even a little bit further than that.

Peter Sokolowski: Good.

John Morse: Which is to say, I think if you want to talk about the beginnings of digital dictionary publishing, you're really beginning in the 1960s and some things that happened in the 1960s that ultimately I think really help us significantly 30 years later. And that has to do with the fact that in the 1960s we created, with the assistance of others, digital versions of the Seventh Collegiate and of the current Merriam-Webster paperback dictionary; the creation of a digital version of the Seventh Collegiate ended up being very important. And it was very much on the Merriam-Webster end, something that was done with Ward Gilman [aka Gil] being intimately involved in working on that. And I think it's sort of important to bring that up because I think Merriam-Webster people, maybe others, really associate Gil with the print program-

Peter Sokolowski: Right, exactly.

John Morse: ... for the use of academic researchers to early efforts at lexicology to understand dictionaries, their content, their structures. It was licensed to universities and to scholars for scholarly use only.

Peter Sokolowski: And that was the '60s.

John Morse: Yes.

Peter Sokolowski: And the Seventh Collegiate.

John Morse: The Seventh Collegiate.

Peter Sokolowski: And what was Gil's involvement?

John Morse: Gil really was one of the people within the Merriam-Webster editorial department who took a very active interest in the digital possibilities for dictionaries. I think it came out of his work on being a production person on Webster's Third, of being a proofreader, and of creating his whole style manual, which just lovingly documents the style rules which are in effect the data structure rules. Merriam-Webster entries are so rigorously structured that in fact they constitute a database, a fairly well-structured database with real strict rules of grammar and syntax for how that entry goes together. And I think because of that, Gil was able to see that this wasn't just a typesetting tape, this was a database and that it would be useful for researchers to use it in that way.

Peter Sokolowski: It occurs to me that the incredibly detailed instructions for editors, which in-house we call the Black Books, gave very specific directives about how to define an adverb and how many modifiers you can [have] before the genus term. And that does lead to this kind of uniformity, and that wouldn't have been true for other dictionaries. It's just an interesting point that there was this intersection of specificity of lexicographical style and the needs of a digital data structure.

John Morse: Right. And it's something that when we do the data capture of Webster's Third comes back in a very prominent way. But the anecdote there is at one point, and this was before any data capture had begun, but it's certainly after Merriam-Webster began thinking about its text files as being databases. We had a visit from the principal editor of the OED, who came over to the United States at one point to visit the office, and he and our editor-in-chief, Fred Mish, had a conversation, and Fred took him over to show him the Black Books and how they were structured and what was the content. And Fred said, "And of course I'm sure that you at the OED have something just like this." And the guy said, "Well, no, we don't have anything like this." So Fred asked, "Well then if an editor hits a particular kind of entry, how do they know how they should approach that entry?" "They're told to go think of a word like that word and look it up and do it the same way." Which is why when they did the initial data capture of the OED, they wanted to have as one of the features an author search. So if you wanted to see all of the quotations from Jane Austen, you could do it, or William Shakespeare. The only problem was that they styled the author's name in author quotations in about five different ways.

Emily Brewster: That's right. That's right. I can attest to that.

John Morse: Whereas we had a very rigid sense of if you're going to quote an author, you name the author in this particular way. And we had an Authors Quoted editor, Kathy Doherty, who checked every quoted illustration we had to make sure that that was in fact being styled the same way we had styled it someplace else.

Peter Sokolowski: Yeah.

Emily Brewster: It's very interesting. I feel like it's a very significant part of Merriam-Webster's editorial culture that things are consistent. We put an incredible amount of effort into consistency in the styling of really every aspect, which really does lend itself to this kind of database structure idea.

Peter Sokolowski: And one notion that has occurred to me recently just in thinking in kind of the long term of the Merriam-Webster style is because I hear it out loud more than most people for two reasons. One is I read the word of the day to record it, but I also attend the National Spelling Bee every year. And that becomes this sort of performance of the definitions because they're all read into this enormous room and on television. And what I have found is that that very technical kind of language, the very technical approach to language actually has given a timeless quality to our definitions, a quality of authority, a quality, you could argue, of dictionaryese perhaps, but that has allowed some definitions that were written decades ago to still be valid, to still work perfectly well and to be unquestioned and good definitions. And I just think that very technical approach to defining was really smart.

John Morse: An aspect to point out is that the dictionary is one of the few places where the world's expert on a particular subject speaks directly to an end user unmediated by any editor, interpreter, popularizer; the person who, the day we define paradigm, is probably the most steeped in knowledge about paradigm there is, crafts that definition. And then that definition is what is read by the end user as if somehow Linus Pauling speaks directly to the person getting vaccinated. And of course, that's not the way we do it. Almost every other way that information gets imparted, there is an interpreter or a popularizer or an editor or someone who says, "Well, I think what Linus meant to say..." And recasts what the world's expert said in order to make it more accessible. And to be sure that's a needed activity. But the dictionary sets itself a different situation. There will be no interpreter. The lexicographer does the research, comes to the conclusions, and speaks directly to the end user. And I think in some ways that is the language of transmission that we use in order for the expert to speak to the end user unmediated. So it is stylized. It is probably not the way the end user normally speaks. It's probably not even the way that the creator of the information necessarily speaks, but it is the language of the transmission, and I think that we are constantly tinkering with that. I don't think it's a fixed state what that language should look like. I think it's always going to look a little bit like the way your parents or your grandparents spoke. I think the language of the dictionary is always a generation or two behind the current generation. So it does have that feel of authority because it's what an older person might say. But one of the challenges I think we have to have, and I think we've been doing it, is understanding that as we move through time, that grandparent is not the same grandparent.
We still find, I think in the dictionary, an awful lot of it may be still the language of people who were graduate students at Columbia in the 1930s. And I think we know we have to move that forward because now a user's grandfather might sound like me.

Emily Brewster: Formality is a constant. There is a formality that is required in a definition. I think the formality gives it structure, gives it authoritative quality, but still has to communicate very clearly. And sometimes the words that were relied on for defining text years ago have shifted in meaning; those words have to be replaced.

John Morse: Right. Right.

Peter Sokolowski: And I think two other nuances; one is, that generation of your parents or grandparents is also the generation of your teachers, so that you're hearing the language that you might've heard in the classroom. And also, of course, this was the deliberate strategy of the King James Bible. It was supposed to be a couple generations old because that sounded more authoritative and they knew that, they were aware of that, and we're still doing it today in a different way, but it's for the same purpose.

John Morse: But to get back to the digital aspects. What I'm thinking of more is not so much the text of the definition as the structure of the entire entry. There is a place for the head word, a place for the function label, a place for the etymology, a place for the pronunciation, a place for the status label, and it is all rigorously controlled. Because of that, it becomes easier, and I'm going to get to this in a minute. By the time we get into the 1980s, digital publishing is mostly a licensing business, so we're not going to create the product. So we're going to license our data to other people to use in their products. And in order for them to use that data, they need to know exactly what is the intellectual significance of everything that's in that text file.

Peter Sokolowski: Right. And that includes what a bold-faced colon means, or what a light-faced italic semicolon means, because those are so specific in the Merriam-Webster style that we can actually make a tag for them.

John Morse: And I think that's the sort of sensitivity that Gil brought to the project from the initial efforts to digitize because he was so steeped in the style rules of our dictionaries, such that he potentially writes the book. Part of why Gil is important in the early days of digital publishing is because he understands the inherent underlying structure that is sitting in the text file. Also, because I think [he was] really the first person to really articulate the vision of the possibilities of digital dictionaries: [that we] would come to an age where dictionaries can be created and read and studied all in digital format. As I dig through our files to make that such a clear statement, that's the potential to create dictionaries, to read dictionaries, and to study dictionaries all in digital. And to move it into the '70s, I think it's with that kind of sensitivity that that's the possibility, and here [are] the resources we already have to move in that direction. That leads to the way the Eighth Collegiate is typeset and the way we create the typesetting file for the Eighth Collegiate. The typesetting file is a structured database. It's a crude structure, but there is a place for the head word. There is a place for the functional label. There is a place for the pronunciation. There's a place for the etymology.

Peter Sokolowski: And this is for the keyboarders?

John Morse: The keyboarders who are keyboarding the Eighth Collegiate are creating the beginnings of a structured process.

Peter Sokolowski: And these are kind of-

Emily Brewster: And this is 1983?

Peter Sokolowski: No, no, 1973.

Emily Brewster: '73, '73.

Peter Sokolowski: This is early days for digitization.

John Morse: Right. The most outstanding example, and probably underutilized, was the Geographical Dictionary because we came out with Webster's New Geographical Dictionary in 1972. And some people have looked at it and said, "This is the most heavily tagged typesetting file anyone has ever seen." Because in that, the same sort of logic is brought to bear. There's a place for the head word, there's a place for pronunciation, there's a place for the descriptor of what this is. Is it a river? Is it a town? Is it a city? Is it a mountain? A place for the population figure to sit, a place for the larger geographical entity of which the entry is a part. That could have been exploited massively. If we had the right interface for that, you could have gone in and said, "Show me the rivers in Turkey," and it would've been able to do that. We did take advantage of that in the 1980s when we did a copyright update to the Geographical that was tantamount to a new edition. There had been new census figures for almost every place in the world at that point. And so we kind of turned the Geog inside out and created printouts that showed all of the place names with population figures separated by their larger entities. So we had all of the place names in Turkey or all of the place names in Pennsylvania and what population figure we were currently showing for them. And with that in hand, it was relatively easy to say, "All right, here's my printout of the population for all the places in Pennsylvania. Here's Pennsylvania's census report. Let's just go down, compare them, fix them, go into the database from the printout, update those figures and create new pages." Again, all of that late in the 1970s. So I think-
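
[Editorial illustration, not part of the conversation. The fielded Geographical Dictionary file described above can be pictured as a list of records with one slot per data element. The sketch below is only a guess at that idea in modern terms: the `PlaceEntry` class, its field names, and both helper functions are hypothetical, and the population figures are invented placeholders, not census data. The pronunciation field is omitted for brevity.]

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlaceEntry:
    """One fielded entry: each data element has its own slot (hypothetical layout)."""
    headword: str
    descriptor: str             # what the place is: river, town, city, mountain
    parent: str                 # the larger geographical entity it belongs to
    population: Optional[int]   # None where no population figure applies


# Placeholder records: the population figures are invented, not census data.
ENTRIES = [
    PlaceEntry("Ankara", "city", "Turkey", 1000),
    PlaceEntry("Euphrates", "river", "Turkey", None),
    PlaceEntry("Kizilirmak", "river", "Turkey", None),
    PlaceEntry("Pittsburgh", "city", "Pennsylvania", 2000),
    PlaceEntry("Scranton", "city", "Pennsylvania", 3000),
]


def find_places(entries, descriptor, parent):
    """Answer a query such as 'show me the rivers in Turkey'."""
    return [e.headword for e in entries
            if e.descriptor == descriptor and e.parent == parent]


def population_printout(entries, parent):
    """List every place in one larger entity with the figure currently shown,
    ready to be checked line by line against a census report."""
    rows = [e for e in entries if e.parent == parent and e.population is not None]
    return [f"{e.headword}\t{e.population}"
            for e in sorted(rows, key=lambda e: e.headword)]


print(find_places(ENTRIES, "river", "Turkey"))
print("\n".join(population_printout(ENTRIES, "Pennsylvania")))
```

[Because the descriptor and the larger entity are separate fields rather than words buried in running text, both the query and the "inside out" printout reduce to simple filters.]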

Peter Sokolowski: It's proto digital. It's a way of working.

John Morse: It's definitely getting to the first part of what Gil is saying. We're going to create dictionaries on a computer. We're not going to consume them yet on a computer. I mean, that obviously, that comes later when more users have a digital device in their hands, which is going to come a good deal later. But we're going to exploit how we create dictionaries from our knowledge of their data structures.

Peter Sokolowski: And if we were to look at the manuscript of that, we could see the tags.

John Morse: It is the most illegible. The typesetter we used for the Geographical was, I think, the same one we used for several other editions of the Collegiate. And for decades afterwards, we were reminded of the terrible shape of the Geog manuscript because we were trying to build in all of that kind of metadata around the text file.

Emily Brewster: Well, I will say that today working in the data file is kind of brutal. I don't really like to do it. I can do it. I don't typically work in it. It always takes me a little while to get used to it again so that I can see the parts for what they are.

John Morse: And I probably would have the same trouble also because I think we're in an entirely new generation of the way information is stored from what these early efforts were. These early efforts were in the time, still, where people were thinking of SGML and HTML, which really were the necessary coding technologies for graphic user interfaces. And I kind of understood the world as long as we were talking about flat tagged files, because then it really did all kind of make sense. I think now we rely more on relational databases, and that's kind of where John left the movie.

Peter Sokolowski: The style file became a data file for the Unabridged Dictionary right after the '70s or into the '80s or around 1980. Webster's Third, the unabridged edition, was turned into a digital file around the time that you started. Is that right?

John Morse: Not the unabridged.

Peter Sokolowski: Okay.

John Morse: The unabridged was really still in hot metal type, well through my early days at Merriam-Webster. Now, if you're a real dictionary or database weenie, the weird irony there is that hot metal type was set mostly through Monotype machines, and in Monotype machines, the typist who was doing the typesetting was actually creating a punched paper tape, and the punched paper tape was then fed into the typesetting machine and actually created all the hot metal letters that would go into the galley. That punched paper tape is probably really the first digital dictionary. And people who have sort of studied this look at this and say, "We were that close. We were that [close] to really understanding that you can store information digitally in the Monotype days." But in fact, it was not the way we understood it. W3 was fundamentally a big pile of metal type.

Peter Sokolowski: Metal plates later. So the type became stereotype plates.

John Morse: We kept standing type of those pages made out of hot metal until about the 1980s, and it was about at that point that what we did was, we simply made proofs of all those pages. We said, "From here on out, we're just going to make plate changes by going into the reproduction proofs that we've made of the standing metal type."

Peter Sokolowski: So they were no longer replacing metal struck letters?

John Morse: Right. And we were one of the last people on the face of the Earth who still had anything like that. And by the 1980s, we realized this is not the way you do things any longer. And we sold off all the standing metal type to General Motors who melted them down and used them to create batteries for Buicks.

Emily Brewster: Wow.

Peter Sokolowski: Those are the plates for Webster's Third.

John Morse: Yes. You want to know where the standing type for Webster's Third [is]? It's probably now in a junkyard, because those Buicks are probably long since-

Emily Brewster: Right. It's sitting in a decaying Skylark somewhere.

John Morse: Right.

Peter Sokolowski: Because they occupied a huge amount of space. It was a whole kind of warehouse full of plates that I'm going to imagine weighed five or six pounds each for 3000 pages. So it was a lot of metal.

John Morse: Right. Frankly, it was not the way to do things.

Peter Sokolowski: But that did mean that there were 25 years of corrections, what we call plate changes, corrections made to that text so that by the time this was sort of frozen in that way, many or most of those corrections, at least of typos and such, were already made.

John Morse: Right. We only did a few more years of plate changes before we gave up on doing anything to the main A to Z. So the main A to Z of Webster's Third is pretty much frozen at some point in the 1990s because what we had to do was create a digital version of the typeface used in Webster's Third in order to set new entries and put them into the pages. And we did. It's hard to tell the difference if you look at them.

Emily Brewster: And at that point, when we would make updates to Webster's Third, we added a series of addendas, and I remember Steve Perrault telling me very specifically that it was one addenda, two addendas, it didn't do the Latin addendum, addenda. We made updates to that dictionary by inserting a section to the front of the book.

John Morse: Right. Although Fred would always correct people. If you called it "in the addenda," it'd be ok in speech, but if you were writing correspondence and said, "That word is in the addenda," he'd always correct it: It's in the addenda section. The entries are the addenda. If you're going to say "it's in the," then it's "in the addenda section."

Peter Sokolowski: Oh, of course.

Emily Brewster: Wow. Yes, the section of the dictionary that is the addenda.

John Morse: I only got corrected on that about 50 times.

Emily Brewster: I have a question about this kind of structure. We have tagged, but invisible to the user, subject codes at our entries, and I am curious about when those subject codes were added, and to which dictionary they were first added.

John Morse: Yes. The concept of subject coding actually was with Webster's Third, at least. That was in the pre-digital age. So what did they do? They snipped out each individual entry on the page. They pasted it onto a 3 x 5 slip of paper, and then they subject coded those slips of paper, and then the use for that was they sent all the entries for an individual subject code to an outside consultant who then looked at all of those entries, reviewed them and sent back comments.

Emily Brewster: Interesting. So the subject codes were first for the purpose of having those definitions reviewed or written by the appropriate expert.

John Morse: Yes. And if you go to our consolidated file where all of the citations are, you will see some of those, you will see an entry pasted to a three by five slip, and you will see a two letter code on it. That would've been the version of the entry that was sent out to the consultant. Again, we're still in the 1970s. That's the point at which we said, when the Eighth Collegiate was being created, and where we're going to introduce a data structure to the typesetting file; that's when we also introduced the concept of subject codes. And again, it was part of Gil's notion that the dictionary could be created, revised, consumed, and studied in digital form. You would ask in a sense, "Well, why bother doing that? We're having a perfectly good job revising dictionaries on paper. Why bother putting all of this stuff in there?" And one of the early ideas of how would you make some sort of profitable use out of having a digital dictionary is, well, why not assign a subject code to every entry? Two things can then happen. You can create derivative dictionaries, but also you can speed the review or ease the review of entries by an in-house or an out-of-house consultant.

Peter Sokolowski: And the obvious derivative dictionaries, for example, Law Dictionary, Dictionary of Medicine, are two that we do publish, but these are subject codes that could read music, physics, sports, or even more specific baseball, right?

Emily Brewster: Yes. The subject codes are really so complex. They're not just for individual entries, they're for individual senses of entries. And so when an entry has its sense order revised, then someone has to make sure that all of the subject codes are also revised so that they apply to the particular sense or subsense. I mean, the complexity that the subject codes introduce and also make possible is, I think, remarkable. And the end user knows nothing about it.

Peter Sokolowski: Right.

John Morse: And needn't. This really was an in-house productivity tool more than anything we expected would be visible to a user. Over the years, we've tried to think of other ways to make use of all of that. For a while, I tried to convince our advertising salespeople that you could tell advertisers if you're trying to sell baseball gloves, we'll only run your ad on entries that are tagged for baseball. For one reason or another, [that] never really worked. The power of the digital dictionary was slow to, I think, really emerge, and someone made the wise observation sometime when thinking about these new technologies that we often tend to overestimate the impact they can have in the short term, but underestimate the impact they can have in the long term. And I think that really was true for us in digitizing, that I think a lot of the early advantages we thought you might get never really materialized as much, but long-term had a tremendous impact.

Peter Sokolowski: And so on the way to that, on the way to a digital dictionary, what was the next stop along the way?

John Morse: Well, I think really in the 1980s, the heyday in terms of digital publishing [was] licensing as a way of taking advantage of the digitization, because people now are thinking, "Well, we could bundle a digital dictionary with a digital encyclopedia. We could create a digital bookshelf of products. We could bundle a spell checker with a word processor, or conceivably even bundle a whole dictionary with a spell checking-"

Peter Sokolowski: Or handheld digital dictionary.

John Morse: Or a handheld, which really is sort of a simple thing. But probably where we had our biggest initial impact as digital publishers was creating the list of acceptable spellings, which ranged from about 80,000 to 100,000, that would go into spell checkers. And we did that originally with a technology company called Proximity. Proximity was acquired by Franklin Electronic Publishers, and it was with Franklin Electronic Publishers that the Proximity Technology created the first Spelling Ace, which was a handheld or desktop spell checker. So if you weren't sure how to spell a word, you put in your best guess on this little keyboard and you would be told whether that's a good spelling, or if not, what might've been the spelling you were looking for. Within just a few years of the launch of that product, Franklin had sold over a million units of the Spelling Ace. It was that deal, and also with Proximity, that created checking technology based on our word lists that went into all kinds of different software platforms that you might not ordinarily think of. That was the first major thing propelling us into digital publishing. The notion of licensing your data, of course, was nothing new, particularly to us. All of our competitors were doing the same thing. Most particularly Houghton Mifflin was very active. They created a whole software division in order to take advantage of these licensing opportunities to get your dictionary text or data into other kinds of electronic products. I would say throughout the 1980s, for most dictionary publishers, that was the key piece of digital publishing, was licensing your data to someone else to make that product. And it was across the board. I mean, sort of the one that was most entertaining I think was after Steve Jobs left Apple for the first time, he created a company called NeXT computers, and he created a kind of Dream Scholars workstation based around his computer that he was creating, the NeXT computer.
And in this Dream Scholars workstation would be bundled right into the operating system, Merriam-Webster's Collegiate Dictionary, Merriam-Webster's Collegiate Thesaurus, and the works of Shakespeare.

Peter Sokolowski: Wow, I did not know that.

John Morse: So that was a fun one. But in the fullness of time, the NeXT computer did not really catch on, but it certainly was my own training. That was my first experience with doing full text displays of dictionary entries on a computer screen. And pretty much what I brought to the rest of digital publishing really began with that NeXT computer. And without doing too much name-dropping, I mean, it was an interface that kind of was done with Steve Jobs over the shoulder. And then one of his really talented lieutenants, Mike Hawley, who was working at the MIT Digital Lab at the time, but working for Steve Jobs and the NeXT computer at the same time, he was also one of the real architects of thinking, what is the most elegant way to display dictionary text? And the first thing we decided was every numbered sense begins a new line.

Peter Sokolowski: Yes, yes. It's hard to read otherwise.

John Morse: It is.

Peter Sokolowski: And that's something that people don't think about, but that's a big difference between the printed page and the digital page.

John Morse: Well, it's the difference between Webster's Second and Webster's Third also. This notion that you run in all numbered senses into a single block of text, I think we really blame on Webster's Third.

Peter Sokolowski: True, because 19th century dictionaries didn't do that.

John Morse: Right. And they did it as a space-saving device, and what they didn't realize is how much space that saved, which is why Webster's Third ends up being a good deal smaller than Webster's Second, smaller than they thought it was going to be. As much as we like having each numbered sense begin a new line, it consumes tons of space, and that's obviously one of the virtues of digital publishing: we could unpack all that. And I began to think of a dictionary entry more like a catalog card than as a dictionary entry. So all of your data elements could begin on different lines, and you might even label what is there. So you could say "head word" or "main entry" or "pronunciation." For some of the early digital products, we actually would do that. I think we backed away from that, with the understanding that the typography really does carry that kind of information. When we get to the period of online digital dictionaries, one of the most important aspects of online digital publishing is search engine optimization. What can you do to make sure the search engines find you? One of the ways the search engines will find you is the relevance of your entry or article or whatever to the search term, and the way the search engines tend to do that is how often in that entry or that article is the search term mentioned? Now, you can imagine how that really works against a dictionary, because in general it's going to be once and then the rest of the time it should be some other words describing that head word. From the search engine's point of view, that makes it a very non-relevant entry. So one of the things we did at a certain point was we found ways to repeat that name over and over again. So now at Merriam-Webster Dictionary online, it will say, definition of paradigm, etymology of paradigm, functional label of paradigm, derived forms of paradigm, which the user does not need.
I mean, you know you're looking at the paradigm entry; you do not need to be told this is the etymology of it.

Peter Sokolowski: And the user also doesn't notice. They don't even take it in.

John Morse: All of that cascade of different developments, I still, at least certainly in my experience, and I think to some degree in Merriam-Webster's, it begins with those two very important licenses. The one for Franklin, where we really see we can sell this in really big numbers and reinforcing for us the importance of the structure of the database because Franklin needed information out of that database that you only get if you understand the data structure. So they really wanted to get anything that was a head word or a derived form, but not necessarily a variant and not necessarily an inflected form. This was something that Robert Copeland really took the lead on, is making sure that we had a database that tagged every element for what it was intellectually and not what it was typographically.

Peter Sokolowski: Because typographically, they might both be boldface, for example.

John Morse: Right. In that first Eighth Collegiate data structure, which also carried over to the Ninth Collegiate as well, that's not always the case. After you get through, as Gil would've called it, the matter before the definition, then the rigorous fielding of data stops and it becomes more typographical. So there's no easy way to tell the difference between an inflected form and a derived form.
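
[Editorial illustration, not part of the conversation. The distinction drawn above between tagging an element for what it is intellectually and tagging it for how it prints can be sketched as two attributes on each element. Everything here is an assumption made for illustration: the `Element` class, the role names, and the sample elements are invented and do not reproduce Merriam-Webster's actual tag set or any real entry file.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    text: str
    typeface: str   # typographic tag: how the element prints
    role: str       # intellectual tag: what the element is


# Invented sample elements; all four print in boldface.
ELEMENTS = [
    Element("theater", "bold", "headword"),
    Element("theatre", "bold", "variant"),
    Element("happier", "bold", "inflected_form"),
    Element("happily", "bold", "derived_form"),
]


def word_list(elements, wanted_roles=frozenset({"headword", "derived_form"})):
    """Pick out headwords and derived forms while skipping variants and
    inflected forms, which only works when roles are tagged."""
    return [e.text for e in elements if e.role in wanted_roles]


# Selecting by typeface cannot separate the four elements.
print([e.text for e in ELEMENTS if e.typeface == "bold"])
# Selecting by role can.
print(word_list(ELEMENTS))
```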

Peter Sokolowski: Did Franklin require Merriam-Webster to provide tagged text or did they tag our text?

John Morse: We worked together, and what emerged is that it became a more and more successful license. The president of the company at that point, Bill Llewellyn, said there should be a person inside the Merriam-Webster editorial department who is Franklin's editor, someone that Franklin knows they can come to at any time with a question or a project, and that editor is available. Robert Copeland essentially became Franklin's editor inside Merriam-Webster. And between Robert's efforts and Franklin's efforts, we got to the kind of word list we wanted, but he became one of the real advocates for "we need more rigorous tagging throughout all of the database."

Peter Sokolowski: Which is all kind of proto-web tagging.

John Morse: It is proto-web, and it's why eventually our movement to the web becomes a lot easier and why ultimately the data capture of Webster's Third becomes much easier to do, because one of the great questions, and OED faced this as well, was: if we're going to move forward digitally, we've got to get this old legacy print product into digital form. How on earth are you going to do it? Both OED and Merriam-Webster looked carefully at optical scanning. For every difficult question that you face, there's an easy and obvious answer that's also wrong. And as soon as you mentioned, "We really need to get these big dictionaries into digital form," people said, "Oh, well, optical scanning." The error rate on optical scanning is just through the roof, and it would've created a database with probably hundreds of thousands of typographical errors. So that really did not look like it was going to work. What ended up working is having typists type it in.

Peter Sokolowski: They had to rekey the entire book.

John Morse: And as a matter of fact, rekey it twice. What worked well is, you type this page, you type this page, we'll do a comparison, and any place where the two of you haven't typed it the same way, one of you is right and one of you's wrong. And when you did that file compare between the two keyboarders and then resolved any differences, we got to about 99.98% accuracy.
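
The double-keying comparison described here can be sketched in Python. This is a minimal sketch under stated assumptions: the function name and sample strings are hypothetical, and the standard-library `difflib` module stands in for whatever file-compare tool the keyboarding company actually used.

```python
from difflib import SequenceMatcher


def keying_discrepancies(first: str, second: str):
    """Return (first_version, second_version) pairs wherever two keyings differ.

    Each pair marks a spot where at least one typist made an error and a
    person has to check the printed page to resolve it.
    """
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (first[i1:i2], second[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Hypothetical keyings of the same line by two typists.
typist_a = "a typical example or pattern"
typist_b = "a typical exanple or pattern"

for a_text, b_text in keying_discrepancies(typist_a, typist_b):
    print(f"typist A keyed {a_text!r}, typist B keyed {b_text!r}")
```

The comparison only flags disagreements. An error survives unnoticed only when both typists make the same mistake in the same place, which is why the approach leaves so few errors behind.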

Emily Brewster: That is fantastic. So much more reliable than having one proofreader.

John Morse: Right.

Emily Brewster: Right. I think because it instinctually makes sense to me that doing that comparison of two carefully keyed versions is going to result in fewer errors in the final project.

John Morse: Right. So the keyboarding company did no proofreading, to the best of my knowledge, beyond early batches to make sure people weren't way off, and we only did sampling proofreading. Every time we got a batch done, we would bring it in, and Madeleine Novak, who was really running the project of getting this in digital form, would have proofreaders proofread some batch. And as long as they were still hitting 99.98, we felt pretty good.

Peter Sokolowski: Didn't you tell me once that if it was just 99% accurate, there would effectively be an error on every page?

John Morse: Oh, at least. At least.

Peter Sokolowski: Just think about that for a dictionary. It's just not acceptable. And it's also, a page is incredibly dense with letters, with text.

John Morse: Right. Think about a W3 page, think about how many typeset characters there are on a W3 page, and you are probably around, I forget the number, let's say 8,000 to 9,000, 10,000. So even if you are saying 99.98, that's two for every 10,000. So even there, you're going to have two typos per page.
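
The arithmetic can be checked with a short Python sketch. The figure of 10,000 characters per page is the rough upper estimate given in the conversation, not a measured value, and the function name is hypothetical.

```python
from decimal import Decimal


def expected_errors_per_page(accuracy_percent: str, chars_per_page: int) -> Decimal:
    """Expected miskeyed characters per page at a given accuracy.

    The accuracy is passed as a string so that Decimal keeps it exact.
    """
    error_rate = (Decimal(100) - Decimal(accuracy_percent)) / Decimal(100)
    return error_rate * chars_per_page


CHARS_PER_PAGE = 10_000  # assumption: the rough per-page figure quoted above

print(expected_errors_per_page("99.98", CHARS_PER_PAGE))
print(expected_errors_per_page("99", CHARS_PER_PAGE))
```

Under that assumption, 99.98% accuracy leaves about two errors per page, while 99% accuracy would leave about a hundred, which is why "at least" one error per page is an understatement.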

Emily Brewster: Now what year did that happen?

John Morse: I think we started this someplace around 1990, and it was done in pretty short order. And to keep going back to this theme, they were not only keying the text, they were assigning the tags.

Emily Brewster: They would have to.

John Morse: And what they did was they set up a template for typists to type into, and they had fed the logic of W3 syntax into it. So the first thing you do is you type in the head word. There are only a limited number of things that W3 style allows to come right after a head word. So they were just given those choices of what are the only things that can follow a head word, and they were able to then put it into the right window on their screen, which would assign the tag. Then having had that data element, and again, there's only a limited number of things that in W3 rules can come after a head word and a pronunciation field. So because that has such a rigorous data structure, the typists were not only able to type in the text, but they were able to assign the tags with a remarkable degree of accuracy. Peter has it on his wall, I believe.
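
The template idea, where each data element limits what may come next, amounts to a small state machine. The Python sketch below is illustrative only: the tag names and the transition table are hypothetical simplifications and do not reproduce the actual W3 entry rules or the keyboarding company's software.

```python
# Hypothetical, simplified transition table: for each element,
# the set of elements allowed to follow it.
ALLOWED_NEXT = {
    "START": {"headword"},
    "headword": {"pronunciation", "functional_label"},
    "pronunciation": {"functional_label"},
    "functional_label": {"etymology", "definition"},
    "etymology": {"definition"},
    "definition": {"definition", "derived_form", "END"},
    "derived_form": {"derived_form", "END"},
}


def choices_after(tag: str):
    """The options a typist would be offered after keying the given element."""
    return sorted(ALLOWED_NEXT.get(tag, set()) - {"END"})


def is_valid_entry(tags) -> bool:
    """Check that a sequence of tags follows the table from start to end."""
    state = "START"
    for tag in tags:
        if tag not in ALLOWED_NEXT.get(state, set()):
            return False
        state = tag
    return "END" in ALLOWED_NEXT.get(state, set())


print(choices_after("headword"))
print(is_valid_entry(["headword", "pronunciation", "functional_label", "definition"]))
print(is_valid_entry(["headword", "definition"]))
```

Because the typist only ever picks from the few options the table allows, choosing where to type is the same act as assigning the tag.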

Peter Sokolowski: In my office.

John Morse: The flowchart that I was able to create for the typesetting company and also for one of the early people who were thinking about being able to program a text editing system for us that just shows how that structure works and what is the flowchart of a W3 entry.

Peter Sokolowski: It's remarkably simple. The dictionary is very complex, but there was an order for it.

John Morse: And it took me a while to get it. I practiced trying to assign HTML tags to entry after entry after entry, and eventually I saw it for its simplicity and elegance. There is a recursive quality to what you can do and the order of information, the information before a definition, and then the order of information in each sense is the same, and in the derived forms it is the same. There's just a perfect kind of elegance to it. And then all I had to do was capture it. Here it is in its simplicity and its elegance. When the developer who was trying to come up with a text editing system looked at that, he said, "John, you have just saved someplace on the order of five figures of cost because you have this."

Peter Sokolowski: They didn't have to figure it out.

John Morse: They didn't have to figure it out, and if P.B. Gove did nothing else for us, he created that real rigor of entries.

Peter Sokolowski: It's as if he was thinking in a proto-digital way, I sometimes feel, I feel like his natural inclination was to go toward the organized structure. And some people criticized him, of course, in that book for its rigor that kind of removed maybe what some might call the poetry of the dictionary and replaced it with a kind of technical document. But this is exactly the result.

John Morse: And it redounds to our benefit to this day.

Peter Sokolowski: And so the digitization of Webster's Third becomes a CD-ROM.

John Morse: First deployed as a CD-ROM, but then also deployed as the dictionary inside the subscription unabridged website. And what our thought had been to do with the Unabridged is not unlike what OED had thought about doing, which is you create a website that people can subscribe to knowing that they are watching a dictionary undergo revision. OED came out with its Second Edition, which was pretty much just integrating the supplements and some other changes, and then they essentially said to the world, "And now we're going to work on the Third Edition," which is what they're still essentially doing. And so you subscribe to OED understanding that you were watching it go from the Second Edition to Third Edition. Our hope really had been to do the same thing with our unabridged to say, you're going to subscribe to something called Merriam-Webster Unabridged. It is fundamentally the text of the Third, but you're going to watch us turn it into the Fourth Edition. That plan never really came to fruition in that form, but those were the two major ways we were going to utilize the data capture was first in the CD-ROM, there were CD-ROM dictionaries before there were online dictionaries, and so there were any number of CD-ROM dictionaries created during the 1980s, and Merriam-Webster had licensed its data to CD-ROM developers during the 1980s, and lots of other dictionary publishers did as well. So those CD-ROMs came before there was an online presence. Webster's Third itself actually comes a little later. I think the first CD-ROM version of it isn't probably till about 2000. Well after we have done our other online.

Peter Sokolowski: Gone online, right.

John Morse: The whole switch from that licensing model of dictionary publishing to creating your own websites is really a feature of what happens in the 1990s, which really is the period when online delivery really begins to take over. I think it was in 1991, for instance, that the World Wide Web really becomes available to the public in general. In a lot of ways, to have an online presence really requires, for it to work well, graphical user interfaces and not command line. And that really takes off, I think, mostly, in my experience, with the introduction of Windows 95. So if we're looking at when does the online start becoming an important presence, it's in the early '90s for us. I think the first real experience is in '95 with the AOL dictionary, which we've spoken of, which I think was important to us because it really demonstrated that you could build a whole community of people, an online community of people around a dictionary and a thesaurus. We had message boards associated with the dictionary and thesaurus on AOL, where people could start talking to each other about given topics. And we at that point even had some moderators to moderate what was going on on the message boards. They were categorized. There was a message board about slang or about usage or about etymology, and they were filled with people. People were just itching to have a place to talk about language online, and I think that was eye-opening to us and also very energizing because I think that really suggested that's something we can work around.

Peter Sokolowski: There's a constituency.

John Morse: There was a constituency. And the other thing that happened particularly in the '90s goes back to where this story was beginning, which is that data file of the Seventh Collegiate that was supposed to be just for scholarly use is now popping up all over the web. Scholars, maybe not the original ones, but people who somehow were at a university and had this digital dictionary said, "Wow, we could put this online." And so the Seventh Collegiate was all over the web being used in a completely unauthorized way, including a really creative application called the Hypertext Webster. And the Hypertext Webster, every word in the dictionary was a link to its definition, every word.
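
The idea behind a hypertext dictionary of this kind, with every word in every entry turned into a link to its own definition, can be sketched in Python. This is a hypothetical reconstruction of the general technique, not the Hypertext Webster's actual code; the `/define/` URL scheme and the function name are assumptions, and the input is assumed to be plain text with no existing markup.

```python
import re

WORD = re.compile(r"[A-Za-z]+")


def link_every_word(text: str, base_url: str = "/define/") -> str:
    """Wrap each word in an HTML link pointing at that word's own entry."""

    def to_link(match):
        word = match.group(0)
        return f'<a href="{base_url}{word.lower()}">{word}</a>'

    return WORD.sub(to_link, text)


print(link_every_word("a typical example or pattern"))
```

Punctuation and spacing pass through unchanged, so the definition still reads normally while every word in it becomes clickable.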

Peter Sokolowski: So somebody wrote a good program.

John Morse: Someone wrote a program that essentially turned every word in every entry into a link to that definition, and it was getting a lot of use. We knew at that point in the '90s, a couple of things. We knew that a business model was sort of there with AOL, at least you could see what might happen, and we knew that there were a lot of people who wanted to be able to get their dictionary information free on the web, and we had a choice, are you going to meet that need or let someone else meet it? We made the wise choice to say, "If there are that many people who want to get dictionary information free off the web, we've got to be a part of this. We can't stand aloof to this." And fortunately for us, we did it because the other thing that happens in 1995, of course, is that the founders of Dictionary.com register the URLs of Dictionary.com, Thesaurus.com and Reference.com. They had no dictionary. It was really purely a technology platform at that point, and the only dictionary they had access to was the one that the Gutenberg project had created, which was based for no good reason that I can think of on a 1913 version of the Unabridged.

Peter Sokolowski: Of the Merriam-Webster Unabridged.

John Morse: Of the Merriam-Webster Unabridged.

Peter Sokolowski: Which is a little bit ironic.

John Morse: Here's the weirdest part. That 1913 edition was really kind of a bargain dictionary that Merriam-Webster created. It was not the current edition. So the 1913 is not based on the 1909, it's based on the 1890. They took the 1890, it was like bargain publishing, they got the out-of-print dictionary. They put a couple of new words in it, and they sold it. Of all the versions of the dictionary, that's the one that the Gutenberg people somehow digitized because that was freely available.

Peter Sokolowski: Because it was out of copyright.

John Morse: Oh, yes. The guys at Dictionary.com essentially sucked in that data and used their technology for displaying dictionary data on Dictionary.com.

Peter Sokolowski: The little footnote I have is that I believe in 1913, a very big dictionary published by Funk & Wagnalls was new, and this was a way to undersell, to have a book that was big and looked fresh and cost a lot less than the then-current Merriam-Webster big dictionary. I believe it was that kind of bare-knuckle competition that we've seen over and over in our history.

Emily Brewster: Our conversation with John Morse continues in our next episode. For the word of the day and all your general dictionary needs, visit merriam-webster.com. Our theme music is by Tobias Voigt. Many thanks to John Morse and to Ed Finnegan of the Dictionary Society of North America for suggesting this oral history project. And to New England Public Media for the use of their studios, and especially to NEPM's John Voci for producing these three conversations. I'm Emily Brewster. For Peter Sokolowski, thank you for listening.
