Inside Our Citation Files
Our Springfield office holds a file of 16 million alphabetized scraps of paper, each containing a citation for a word. Some of them are from as far back as the 19th century. Many are written by hand. So... how did we create this bit of living history? It's a long story.
Emily Brewster: Coming up on Word Matters: a visit to Merriam-Webster's citation files. I'm Emily Brewster and Word Matters is produced by Merriam-Webster in collaboration with New England Public Media. On each episode, Merriam-Webster editors, Ammon Shea, Peter Sokolowski, and I explore some aspect of the English language from the dictionary's vantage point.
The backbone of Merriam-Webster's lexicography from its earliest days is a particular and peculiar collection of data known as the citation files. Today, Peter and I are going to give you a tour of this remarkable collection. As we speak, it is late November of 2021 and I have not actually worked in our Springfield, Massachusetts office since the middle of March of 2020 and I'm missing it. I'm missing working in our office. One of my favorite things about the Merriam-Webster Office Building in Springfield, Massachusetts is the smell that it has. It smells like books. It smells like a library. It also contains one of my favorite resources in the world, and that is our extensive citation files. The editorial department's on the second floor of this brick building and there are a lot of cubicles, of course, where the editors sit and silently work typing away on their keyboards. But one half of the editorial floor, a large portion of it, is taken up with what we refer to as the citation files. They are file drawers. They're almost five feet high.
Peter Sokolowski: Cabinets.
Emily Brewster: These file cabinets, they look like the card catalog files from libraries of days past. And in those citation files are more than 16 million slips of paper. They date from the late 19th century all the way through the early 2010s. They are the backbone of Merriam-Webster's lexicography.
Peter Sokolowski: Absolutely. No one's ever challenged me. I wonder if there's another body of collected evidence on paper of any language in the world that is this large. It's just a massive, massive resource. It shows the method, in a sense, just by being on that floor and having the weight of these and the space that they take up, especially in the middle of the floor, they take up sort of all of the real estate and we use them as kind of bookcases. We put books on top of them.
Emily Brewster: Many of them are digitized now also, so we do not have to leaf through these little slips of paper. As I said, we've been collecting these citations since the late 19th century. And from the 1980s until the 2010s, citations (we can get into what exactly those are) were created in both electronic and paper form. You can search digitally. And then a number of the older citations from before the 1980s have been digitized also, but not all of them.
Peter Sokolowski: No, no. No. You have to get up off your desk and go and open a drawer and go search.
Emily Brewster: We call them the citation files because the bulk of these file drawers are filled with what we call citations. They are excerpts from a text with a word or a phrase highlighted on it, usually just underlined. And it has been identified by an editor as being of some interest to a lexicographer, as being of some use to a lexicographer. So it could be that this is a new word that the editor has never seen before, or it is a word that they have seen before, but this is a particularly good example of the word in use. It could be a new use of an existing word. It could be an example that would just be really perfect for including in a definition entry.
Peter Sokolowski: A good example of a word often kind of explains that word.
Emily Brewster: It explains that word. It shows it in its typical collocation. So these citations can come from anything. They can even be audio, right? There are citations that are heard on the radio, and then it's an editor's transcription of what they had heard on the radio. They are also cutouts from a menu or a comic strip or-
Peter Sokolowski: Soup can labels.
Emily Brewster: ... soup can label, yes. Some of them are from an era when they were literally cut and pasted. So someone cut it out of a magazine and then pasted it to this three by five-
Peter Sokolowski: Standard size.
Emily Brewster: ... paper. Every slip of paper in these citation files is 3 X 5. It's lighter weight than an index card.
Peter Sokolowski: And I think that might have been deliberate because card would've taken up a lot of space and think about the volume that's in here.
Emily Brewster: That's right. And would be more expensive also.
Peter Sokolowski: Of course. And harder to type on. A lot of them were cut and pasted, sometimes underlined in red with markings on them. But some of them were keyed in, were typed onto the paper. Some of them were written in ink. Some of this ink is now browned and some of the writing is beautiful, some of it's harder to read.
Emily Brewster: The oldest ones are in ink. They are those-
Peter Sokolowski: They're in brown ink, what looks like brown ink.
Emily Brewster: You can sort of trace the evolution of ballpoint pens.
Peter Sokolowski: And the whole history of photographic duplication. There are a lot that are negatives, in the 1950s especially that was a form that we saw a lot. And then what we saw later, I'm going to guess in the '80s, was almost like a printout. In other words, it was keyed in clearly on a word processor and then was printed. So it was a citation from a source, like a book or a newspaper, but it was keyed in by one of our typists.
Emily Brewster: Sometimes a citation is created by an editor who hears something on the radio and jots it down and describes the source of this quotation. But the bulk of these citations had actually been typed out by typists who were not editors, but who were employed by Merriam-Webster as typists on the editorial floor. And editors used to have multiple publications that the company subscribed to for that editor. You would sign up for which magazines you wanted to get, or which journals. They would be delivered to the Merriam-Webster offices. And part of our job every day was to spend an hour or so reading these different publications. You could also do books, journals, newspapers, magazines, those were primarily what we were reading. And then we would leave them on a file cabinet-
Peter Sokolowski: Oh, that's right.
Emily Brewster: ... and the typists would come by and pick it up and take it back to their cubicles and type, type, type it all up. In the entire time that I have been at Merriam-Webster's, since 2000, I, as an editor, have also created electronic citations on my own when I'm doing reading online. So it's both these citations created from a printed text and then also citations, I'm on the Washington Post website and I see a good example, and I just copy and paste it into a special document that we have that is somehow linked up to our files in ways that mystify me.
Peter Sokolowski: It's kind of a digitized version of the same activity. It's so familiar, I think you forgot to mention, it has a dedicated name, even on the time sheets that we used to have. It was called reading and marking. And originally, we were asked as editors to have about an hour a day dedicated to reading and marking. At some points, there were even certain senior editors who were kind of in charge of reading and marking and they would marshal which publications were read by which editors. You had your own choice. I think I had TLS, the Times Literary Supplement, which is a joy to read, looking for British usage.
Emily Brewster: I did Food & Wine for a while-
Peter Sokolowski: That's a good one.
Emily Brewster: ... and that got kind of rough in the late afternoons when I'm really hungry, and reading and marking Food & Wine was fun. I also did Vibe magazine, which was really good.
Peter Sokolowski: Very rich source of informal language. This starts to show specific magazines about music, about food, also academic journals about math and physics and biology, and certainly legal journals, law journals, and also newspapers. Even in my time, I believe every six months we would subscribe to a different geographical newspaper. One from Albuquerque and then one from Portland, Oregon, just to get this kind of greater balance, in addition to, of course, the big ones like The Wall Street Journal, The New York Times, the big magazines, The Atlantic, Harper's, The New Yorker. Big surprise, we're reading everything.
Emily Brewster: Right. We also had many novels and nonfiction books coming in and we would mark them all up.
Peter Sokolowski: Say you had a book on the history of World War II or something, and you might have made 60 markings and you would usually put little angle brackets on both sides of the beginning and the end of this sentence. And then you'd underline the word or several words that you wanted focused on. And then to help the keyboardists, we would make a note of what page it was.
Emily Brewster: All the different pages in the text that had been marked.
Peter Sokolowski: And so that the typists could then get that book and simply cruise through the book and key in all of those citations.
Emily Brewster: And some of those books that were read and marked would then go into our library. We have a very extensive library. The thing I miss most about working in our office is having access to all the many books.
Peter Sokolowski: And so, what we're getting at is this basic mission of describing the language required evidence and little slips of paper have been part of making dictionaries since the modern era began. Samuel Johnson had little slips of paper and they were almost always literary quotations, Milton and Spenser and Shakespeare. I've seen some of these. Our friend, John Overholt, who's a special collections conservator and librarian at Harvard, oversees one of the largest collections of Samuel Johnson material in the United States and he showed me some of these slips. What's amazing is, he would have friends or acquaintances reading books, and then noting, "Oh, this is an interesting word and Shakespeare uses it in this play at this point." And that way, if it's on a card or a slip of paper, it can then be alphabetized or separated into vocabulary. And of course, this was also the method used by the Oxford English Dictionary, the famous photograph of James Murray, the editor of the first edition of the OED, the tall bearded man with a sort of mortarboard, very academic looking. But if you look behind him in that photograph, there are all these pigeon holes where they would put the slips of paper. At Merriam-Webster, it almost became a kind of assembly line process because all of the editors contributed to this and then the typists would key them in. And then there was actually a position of a clerk who then filed them. And in point of fact, we were not allowed to file them ourselves.
Emily Brewster: No.
Peter Sokolowski: I don't know if you remember this, only the clerk could file them. And that was the way that we ensured great alphabetical order and a real sense of correct order to everything. But I have to say, the biggest group of these words, it seems like the largest collection, were from the late 1930s through the mid 1960s, that big fat middle of the 20th century, during which time the research was done for Webster's 3rd, when Philip Gove, in particular, really emphasized the use of citations in writing definitions.
Emily Brewster: Well, and the citations were all that editors had, that lexicographers had, to base their definitions on; it was only the citations. We nowadays have access to multiple corpora and other databases. A corpus, plural corpora, is a carefully collected set of sources that meet different characteristics. So you could have a corpus that is only of scientific journals, but it would be balanced. It'd be very particularly chosen what scientific documents and sources are included in this corpus. And then somebody would be able to sift through it and search for every instance of the word microbe, for example, and find all of those. And we now have access to those things too. But the citation file is something different. It is not a corpus, because it is not carefully balanced in that way. Instead, every single citation is a reflection of some editor's determination that, "Hey, this is interesting. This is useful for writing a dictionary."
Peter Sokolowski: Yes. You're getting at something really important here. First of all, they're both tools used in the service of writing definitions, based upon evidence of course. Another thing about a good corpus, a balanced corpus, as you mentioned, is that it doesn't have garbage. And what I mean by that is, it doesn't have code or advertising and it especially doesn't have duplication. Google, in a sense, is a messy corpus. If you look up a given word, you might find that it's actually the same article repeated many times, and that's not helpful. Sometimes we want to take a kind of census and say, "How many times per million words is the word microbe used?" for example. You can do that with a good balanced corpus, and I think that's exciting. But with a citation system, you're really getting just the bumps, the things that you notice, the things that are usually kind of extraordinary or new uses of an existing term. I sometimes think of a well-balanced corpus as kind of the horizon. This is the language. You can measure the proportions of natural language in a corpus, because it is natural writing. These are long articles. They might be only about sports or only British. They may be isolated in certain ways, but generally speaking, they're natural use of language. A citation file isn't. A citation file measures the stuff you notice, which might be like the little bumps on the horizon, the little bumps that don't have the connecting language between them. And so that was noticed, in fact, by Philip Gove, the editor of Webster's 3rd, who recognized that we had huge amounts of evidence for the most obscure terminology, because those are the things that we would be inclined to notice and make citations of. And so there was a point during the production of Webster's 3rd at which he mandated that, whatever it was, the 4th word of the 35th page of every single book that was read, had to be turned into a citation-
Emily Brewster: What?
Peter Sokolowski: ... yes, just to make a series of base normal language quotations and citations.
Emily Brewster: I have never heard that before. That is absolutely fascinating.
Peter Sokolowski: And he had the typists do that. And part of it was, sometimes there would be a backlog. During the production of a dictionary, the typists did two things. They keyed in the manuscript and then they keyed in the citations. And so what would happen is the manuscript would come and go in different waves. They would go to the typesetting and they'd go to the printing and there'd be proofreading. There'd be all these stages during which time the typists were free to type many more citations. Sometimes in those gaps, they would also make citations themselves, not from the editor's reading, but from a very kind of mechanical third word on the third page, you know, that kind of thing. And that is interesting. Gove wanted a corpus and he didn't have one. So he had that instinct to look for one.
Emily Brewster: Well, the technology was on the horizon, but it had not arrived yet. And now in this modern era, since the late 20th century, corpus lexicography has been the main way, the most common way, that dictionaries are made. And Merriam-Webster also does corpus lexicography. But I really can't imagine doing lexicography that was solely corpus based. Citations are so valuable especially for common words, because if you are sitting down to revise the entry for a, as I did for the 11th Collegiate Dictionary, and this would have been around like 2002 or so, to look through a corpus for all the examples of the article a. You would look for different collocations, you would look for different positions in a sentence that the a is used, what kinds of words it's coming before or after. But using the citations, you end up with a collection of examples of the word, a, that are remarkable in some way.
Peter Sokolowski: Of course, they're different.
Emily Brewster: And so that tells the lexicographer, it really does that job of narrowing down so the lexicographer can just look through and it's still an enormous stack as it is for any common word.
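An editorial aside, not part of the recorded conversation: the census Peter describes, counting how often a word occurs per million words in a corpus with duplicate articles removed, can be sketched in a few lines of Python. The function names, the very simple tokenizer, and the three sample sentences are illustrative assumptions, not Merriam-Webster's actual tooling.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def deduplicate(documents):
    """Drop verbatim repeats of a document, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for doc in documents:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

def per_million(word, documents):
    """Return how often `word` occurs per million tokens across the deduplicated documents."""
    tokens = [tok for doc in deduplicate(documents) for tok in tokenize(doc)]
    if not tokens:
        return 0.0
    return tokens.count(word.lower()) / len(tokens) * 1_000_000

# Hypothetical sample: the second text is a verbatim repeat of the first,
# the kind of duplication a messy corpus would contain.
sample = [
    "A microbe is too small to see.",
    "A microbe is too small to see.",
    "The lab studied one microbe after another.",
]
rate = per_million("microbe", sample)
```

With the repeat removed, the sample has fourteen tokens, two of which are microbe; a real corpus would be many millions of words, and a real tokenizer would handle hyphens, numbers, and inflected forms more carefully.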
You're listening to Word Matters. I'm Emily Brewster. We'll be back in a minute with more tales from the citation files. Word Matters is produced by Merriam-Webster in collaboration with New England Public Media.
Ammon Shea: I'm Ammon Shea. Do you have a question about the origin, history or meaning of a word? Email us at firstname.lastname@example.org.
Peter Sokolowski: I'm Peter Sokolowski. Join me every day for the Word of the Day, a brief look at the history and definition of one word available at merriam-webster.com or wherever you get your podcasts. And for more podcasts from New England Public Media, visit the NEPM podcast hub, at nepm.org.
Emily Brewster: The contents of the citation files aren't just its citations. There are also conversations taking place over a period of years. I'll read you one. The citation files are not a corpus, they are a curated collection-
Peter Sokolowski: That's a good way of putting it.
Emily Brewster: ... of relevant information.
Peter Sokolowski: There's also commentary and even arguments in some of these that you would see little things going back and forth. I wish I could remember an argument. I remember there have been a couple of memorable ones.
Emily Brewster: I have one. Citations that we've been talking about, these excerpts of text, are always on white slips of paper. Definitions that get drafted are on what we call buff pieces of paper, but they're kind of a vanilla, yellowy-
Peter Sokolowski: Beige.
Emily Brewster: ... beige color. And then blue slips of paper were used for author quotations and then pink pieces of paper, we just call them pinks, and they're used for starting a fight.
Peter Sokolowski: Yes. Questions, commentaries, corrections.
Emily Brewster: Yes. Questions, commentaries and corrections.
Peter Sokolowski: If you see an error, for example, if you're in house and you find a typo in one of our dictionaries, in the old days, there were 40, 50 editors on the floor. You didn't jump up and say, "Hey, I found a mistake." You wrote it down on the slip and the page number would go in one corner. And the edition would go in the other corner and the head word would go in the middle and then you'd simply make your commentary.
Emily Brewster: That's right. And then it was in the pipeline and it would be acted on as publishing schedules permitted.
Peter Sokolowski: The next time that page was opened, for example, there might be a case where a new definition could be added to a word that fell on that page. And that means if they open that page, in the typesetting sense, they could make an easy correction for very little money.
Emily Brewster: That's right. Now we just put it in a shared spreadsheet. Yes, it's very boring.
Peter Sokolowski: It's less exciting. But what was the one you're talking about?
Emily Brewster: I have a pink from when W2, Webster's 2nd, was being revised. This is referring to a definition in Webster's Second International Dictionary, which was published in 1934. And Oaks has written a question, or maybe it's Sheffield, it's not clear. There are a bunch of names and date stamps on here.
Peter Sokolowski: Every editor had a stamp that also gave the date, with their name and date. Yeah.
Emily Brewster: So this question says (the head word is preposition), "Does the fact that a preposition may follow in position its object seem important enough to telling what a preposition is to take four lines? Better omit. We have already said above, 'usually placed before the word,'" et cetera. But then there is a reply from someone with the last name Thomas: "There is still abroad in the land a good deal of schoolmasterly feeling that such locutions are a bit off color as violating the etymology of prior position. I should retain the statement."
Peter Sokolowski: Wow. So you have these debates that take place. I have one in front of me from the '50s also, I assume that was from the production time of-
Emily Brewster: '59.
Peter Sokolowski: So lots of them from the '50s during the heavy production of Webster's 3rd. This one is handwritten, fuddy-duddy. And by the way, if you write these by hand, it has a squiggly line under fuddy-duddy, because that indicates boldface. In this case, he's referring to a head word, fuddy-duddy, then noun with just a single line under it, which means it's italic, and then the comment: "With the world so full of them, we can't let this stay out through another edition." In other words, it was a recommendation in this case that we add a new entry for the term fuddy-duddy.
Emily Brewster: But the term is not defined.
Peter Sokolowski: It's funny, it's a suggestion without a draft definition.
Emily Brewster: I have to say that is not my favorite kind of note to be on the receiving end of, someone saying, "We really need to define this word", but having made no stab at all.
Peter Sokolowski: Right. We just recently entered the term fluffernutter into the dictionary, and it got a lot of attention.
Emily Brewster: It's on white bread, it's peanut butter and a marshmallow cream.
Peter Sokolowski: Marshmallow cream, right. It is apparently regional to the Northeast. It's a term I never didn't know so I didn't know that it was regional myself. But I looked it up in our citation files and I found one from 1980, The New York Times. It's a newspaper clipping glued to the card with fluffernutter keyed at the top so that it can be alphabetized. But then, in the purple ink stamped heavily, it says rejected for 9 coll, C-O-L-L, rejected for the 9th Collegiate Dictionary, 1983. And then right above it, it says rejected for C-10, the 10th Collegiate in 1993. And then on the next slip, it says rejected for C-11, for the 11th Collegiate. And this is really what happens. It's kind of like a conveyor belt. Words come and go. But some words grow in frequency, and grow in currency, grow in popularity. And this is one we'd been watching for literally decades. And it finally got in, in 2021.
Emily Brewster: That's right. But for each of those books leading up to just very recently-
Peter Sokolowski: It was assessed.
Emily Brewster: It was assessed and it was deemed to have not yet met our criteria; it did not have substantial enough widespread use.
Peter Sokolowski: And we really did depend upon this evidence. The evidence was the citations. There's something that Fred Mish used to say, and Fred Mish was the editor-in-chief of three editions of the Collegiate, the 9th, the 10th and the 11th. And he's someone that we both knew and worked with. But he used to say, "Nothing is so humbling as a trip to the citation files." And that's kind of interesting. He would also say that the prerequisite to get a job at Merriam-Webster is you have to submit your linguistic prejudice to the evidence that's before you, and really what he's talking about is the citations.
Emily Brewster: Let us know what you think about Word Matters. Review us, wherever you get your podcasts, or email us at email@example.com. You can also visit us at nepm.org. And for the Word of the Day and all your general dictionary needs, visit merriam-webster.com. Our theme music is by Tobias Voigt, artwork by Annie Jacobson. Word Matters is produced by John Voci and Adam Maid. For Peter Sokolowski and Ammon Shea, I'm Emily Brewster. Word Matters is produced by Merriam-Webster in collaboration with New England Public Media.