The DCC Core Vocabulary Lists (6/28/13)
Sources of frequency data for the Latin core list:
1. L. Delatte, Et. Evrard, S. Govaerts and J. Denooz, Dictionnaire fréquentiel et Index inverse de la langue latine (Liège: Laboratoire d'Analyse Statistique des Langues Anciennes, 1981). The "LASLA" list is available in .pdf form here.
2. Paul B. Diederich, “The Frequency of Latin Words and Their Endings” (Dissertation, University of Chicago, 1939), as digitized by Carolus Raeticus in 2011.
Definitions and quantities were adapted from various sources, including Lodge 1922 and the Oxford Latin Dictionary.
Sources of frequency data for the Greek core list:
1. A subset of the comprehensive Thesaurus Linguae Graecae database, frequency data kindly provided by Maria Pantelia of TLG. The subset included all texts in the database up to AD 200, a total of 20.003 million words, of which the period AD 100-200 accounts for 10.235 million. The point of the chronological limitation was to minimize any possible distortions that would be caused by the large amounts of later Christian and Byzantine Greek in the TLG, texts that are not typically read by most students of ancient Greek.
2. The corpus of Greek authors at Perseus Chicago, which at the time our list was developed (summer 2012) was approximately 5 million words. This frequency data was kindly provided by Helma Dik of the University of Chicago.
In both cases many judgment calls were made about which words to include and how to list them. This work was carried out by Chris Francese in 2012-13, with valuable help from Willie Major, Eric Casey, Meghan Reedy, and Marc Mastrangelo. Alice Ettling edited early versions of the lists. Derek Frymark did outstanding work formatting the lists, developing spreadsheets for them, and entering the LASLA frequency data, among other things. James Martin and Meredith Wilson organized the lists by semantic groupings and by parts of speech. Qingyu Wang created the searchable database versions of the lists in Drupal. Alex Lee created Mnemosyne flashcard sets for both the Latin and Greek lists. More details about how the lists were created can be found below.
Why Core Vocabulary
The main point of core vocabulary lists such as these is to help prioritize the learning of vocabulary. Assuming the goal is to read extant Greek and Latin texts, one should learn these words first. The lists can be used to distinguish which words in a given text are very common and which are not. Students can then be held responsible for only the most common ones, and gradually build to a mastery of the whole core.
We also use them in this project to determine which Latin and Greek words are NOT normally glossed in the running vocabulary lists that go with each section of the texts themselves. Words on these lists may well occur on the running lists also, if they are used in an unusual sense.
The texts in DCC are intended to be readable by people at an intermediate level of study: those who have completed the introductory textbook, and have had some, though not necessarily very much, reading experience thereafter. On the one hand it is wrong to assume that those who have worked through an introductory textbook have complete command of all the vocabulary in it. On the other hand, the running lists need to be kept to a manageable size to avoid glossing very common words in a way that would simply obscure the more difficult ones. So these lists are meant both for study, to help build a core vocabulary, and as a method of keeping the running lists in the DCC series from becoming too big.
Experts in vocabulary acquisition in Latin and Greek agree that it is desirable to work toward a goal at the end of the second year of college study of mastering all the vocabulary around the 80% level, that is, the number of lemmas that generate 80% of the words in the corpus of available texts (Muccigrosso, 2004, p. 416; Major, 2008, p. 7). In Greek, this means about 1100 items, in Latin, about 1500. A typical introductory textbook in Latin gives around a thousand separate vocabulary items to be learned, words that may or may not correspond well with those on the statistically derived lists. Commonly used Latin textbooks range from 843 lemmas (Wheelock’s Latin) to 1811 (in Ørberg’s Lingua Latina). These figures are derived from John Muccigrosso, 2004, p. 431.
A 50% list of vocabulary, that is, all the distinct lexical items that account for half of the available corpus, amounts to about 250 words in Latin, fewer than 100 words in Greek (in English the figure is about 100) (Major, 2008, pp. 1-2). This constitutes the very basic core vocabulary that must be thoroughly mastered before any comfortable reading can occur. But using only this very small list here would make the running lists quite unwieldy, and possibly less useful for intermediate students, who should already be familiar with many more words than these.
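The idea of a "50% list" or an "80% list" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `lemmas_needed` is hypothetical, and the counts are invented toy figures, not data from any of the corpora discussed here. Given a count of occurrences per lemma, the sketch walks down the lemmas from most to least frequent until the running total reaches the desired share of the corpus.

```python
from collections import Counter


def lemmas_needed(lemma_counts, coverage):
    """Return how many of the most frequent lemmas are needed to
    account for the given fraction (0 to 1) of all word occurrences."""
    total = sum(lemma_counts.values())
    running = 0
    for n, (_, count) in enumerate(lemma_counts.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return n
    return len(lemma_counts)


# Invented toy counts, for illustration only (not real corpus data).
toy_counts = Counter(
    {"et": 40, "sum": 25, "qui": 12, "in": 10, "hortor": 8, "genu": 5}
)

print(lemmas_needed(toy_counts, 0.50))  # 2
print(lemmas_needed(toy_counts, 0.80))  # 4
```

On a real frequency table the same loop would yield the size of the list needed for any chosen coverage level.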
It seemed too much to expect every user of these commentaries to have fully mastered two years’ worth of college Latin or Greek vocabulary, since these texts may well be used during the second year. Lists of 1000 words in Latin and 500 words in Greek correspond to the 65-70% level, which we consider to be a good basic working vocabulary, and a challenging but realistic goal for three semesters of college study.
For ease of study, each list is divided in several ways, not just alphabetically, but by frequency group, by part of speech and morphological category, and by semantic grouping.
The DCC Latin List
The Latin list contains about 1000 of the most common words in Latin. These are the lemmas or dictionary headwords that generate approximately 70% of the word forms in a typical Latin text.
Diederich used a database of 202,158 words (194,378 without proper names) from more than 200 authors "from Ennius to Erasmus" (iii, 4) who appear in the Oxford Book of Latin Verse, Avery's Latin Prose Literature, and Beeson's Primer of Medieval Latin. His explicit goal was to provide readers with vocabulary going beyond the basic texts (like Vergil, Cicero, Caesar) that Lodge and others relied on. His final tabulation included the number of times a given word occurred in classical prose, poetry, and medieval Latin, and all three together. A full list of the authors and works in his sample can be found here.
The Dictionnaire fréquentiel used a database of 794,662 words (582,411 from prose authors, 212,251 from poetry), grouped into 13,077 separate lemmas. The texts used (listed in full on p. i) include many golden age and canonical authors, such as Catullus, Caesar, some speeches of Cicero, Horace's Odes, Juvenal, Tacitus, Seneca, Vergil, Ovid, Tibullus, along with some less commonly read authors such as Persius, Quintus Curtius, and Vitruvius.
Lists based on the Word Study Tool at Perseus were found to be distinctly inferior to these two, despite the much larger sample available (10.5 million words as of 2012). The Word Study Tool is incapable of distinguishing between homographs, of which there are many in Latin. Occurrences of genibus, for example, are automatically assigned, half to genus and half to genu, which inappropriately makes the word for "knee" rank extremely high. This problem is endemic to machine-generated lists, and pushes out perhaps a hundred or more words that, on the evidence of Diederich and the LASLA samples, not to mention one's subjective impressions of Latin, should be included (e.g. potens, hortor).
Nonetheless, in an effort to include a wide chronological spectrum of Latin, I constructed a list of post-classical authors from the Perseus database, and used the top thousand lemmas from that sample as a check on the more classically oriented lists of Diederich and LASLA. This sample included authors such as Ammianus Marcellinus, Apuleius, Ausonius, The Venerable Bede, Boccaccio, Boethius, Claudian, St. Jerome, Christoforo Landino, Minucius Felix, Petrarch, Poliziano, Prudentius, the Scriptores Historiae Augustae, and Marco Girolamo Vida. The main contribution of this sample was to show the importance of certain words in Christian Latin, such as dominicus, episcopus or monasterium, which were not prominent in the other lists. Since most of these are easily recognizable from English cognates, they were not included here. The Perseus post-classical sample also helped to make decisions in cases where Diederich and the LASLA list differed.
All such lists tend to agree for the most part on the top 500-600 lemmas, but beyond that the vagaries of individual samples dictate what lemmas make it into the top thousand, and which lie just outside of that cut-off. There thus enters an element of subjectivity, which however is limited by using more than one large hand-collated sample, along with the machine-generated lists of Perseus. In some cases also there is room for difference of opinion as to what exactly constitutes a lemma. The LASLA group, for example, distinguished carefully between cum used as a conjunction, and cum used as a preposition, and assigned separate lemmas to bonus -a -um, bonum -i n., and boni -orum.
In deciding what constitutes a lemma for this list Diederich has been the main guide. Lodge’s book provided most of the definitions, quantities, and principal parts. Some words that did not seem to merit a separate lemma, such as regularly formed adverbs, were occasionally appended to another (for example, celeriter is appended to celer). This allowed the inclusion of relevant information without unduly increasing the total number of lemmas. I have, however, limited this practice to cases in which all components of the lemma are very common, and occur high on Diederich’s list and the LASLA list. In the database presentation all these "extra" adverbs had to be disaggregated from their parent lemmas, so that it may seem that certain words are included which did not earn the distinction on grounds of pure frequency.
The DCC Greek List
This list contains about 500 of the most common words in ancient Greek. These are the lemmas or dictionary headwords that generate approximately 65% of the word forms in a typical Greek text.
The TLG texts were analyzed automatically by the TLG’s lemmatizer tool, which attempts to determine from what lemma or dictionary headword a given form derives. The lemmas were then ranked by frequency (ὁ, αὐτός, καί, δέ, and τίς coming in as the top 5, for example, with ὁ, i.e. the definite article, at 3,280,309 instances). The lemmatizer tool, however, is far from perfect, and considerable editing of the raw results was necessary to catch “rogue” words that tended to rise high because they share a form or forms with an extremely common word. The adjective ἥμερος (“tame”), for example, is relatively rare, but its rank is inflated in the raw data because it shares some forms with the very common noun ἡμέρα (“day”). Confronted with an ambiguous form, the lemmatizer splits the occurrences of that form equally between the two (or more) possible lemmas, resulting in the promotion of some rarer lemmas, and the relative demotion of commoner ones. Some of these problems are easy to spot, but many are not.
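The effect of splitting an ambiguous form equally among its possible lemmas can be sketched as follows. This is not the TLG's actual lemmatizer code, only a minimal illustration of the behavior described above; the function name `split_equally` is hypothetical and the counts are invented for the example.

```python
from collections import defaultdict

NOUN = "ἡμέρα"   # "day", very common
ADJ = "ἥμερος"   # "tame", relatively rare


def split_equally(form_counts, candidate_lemmas):
    """Assign each form's occurrences in equal shares to every lemma
    the form could belong to (the naive policy described in the text)."""
    totals = defaultdict(float)
    for form, count in form_counts.items():
        lemmas = candidate_lemmas[form]
        share = count / len(lemmas)
        for lemma in lemmas:
            totals[lemma] += share
    return dict(totals)


# Invented counts: suppose nearly all 1000 occurrences of the shared
# form really belong to the noun, and the adjective has only 10
# unambiguous occurrences of its own.
form_counts = {"ἡμέρα": 1000, "ἥμερον": 10}
candidate_lemmas = {
    "ἡμέρα": [NOUN, ADJ],   # ambiguous: noun, or feminine of the adjective
    "ἥμερον": [ADJ],        # unambiguous
}

result = split_equally(form_counts, candidate_lemmas)
print(result[NOUN])  # 500.0
print(result[ADJ])   # 510.0
```

With these toy figures the rare adjective ends up ranked above the common noun, which is the kind of "rogue" result that had to be caught by hand.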
The Perseus Chicago data was analyzed using the Logeion tool, developed at the University of Chicago by a team under the direction of Helma Dik. Although the TLG data set is larger, the Logeion frequency data is superior in that its Greek lemmatizer tool has been significantly improved by the intervention of humans who disambiguated some ambiguous forms. The number of disambiguated words in the Perseus Chicago Greek corpus at the time of the creation of our list was over 360,000 and included all of Homer, Hesiod, and Aeschylus, and several thousand-word samples from the rest of the corpus. For some authors, such as Lysias and Plato, the figure was closer to 12,000.
Differences in the lemmatizer tools, and differences in the samples being analyzed, led to considerable disagreement between the two “top 500” lists thus produced, the TLG list and the Logeion list. Another complicating factor had to do with lemmatization, that is, the process of deciding what exactly counts as a dictionary headword in ancient Greek. Words treated as a lemma by TLG but not by Logeion include, for example, εἶδον and εἶπον; words treated as a lemma by Logeion but not by TLG include εἰκός and ἔξεστι. A lemma like ἄκρος, while not in the top 500 in its own right based on the TLG data, might be included if the figures for the adverb ἄκρον, lemmatized separately by the TLG, were added.
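How folding a separately lemmatized form into its parent headword can change a ranking may be sketched like this. The helper names `merge_lemmas` and `rank` are hypothetical, the entry `other_lemma` is a placeholder, and all counts are invented; the sketch shows only the arithmetic, not how either tool works internally.

```python
ADJ = "ἄκρος"
ADV = "ἄκρον"


def merge_lemmas(counts, merge_into):
    """Add the count of each entry listed in merge_into to its parent
    headword; entries not listed are kept unchanged."""
    merged = {}
    for lemma, count in counts.items():
        parent = merge_into.get(lemma, lemma)
        merged[parent] = merged.get(parent, 0) + count
    return merged


def rank(counts):
    """Map each lemma to its rank (1 = most frequent)."""
    ordered = sorted(counts, key=lambda lemma: (-counts[lemma], lemma))
    return {lemma: position for position, lemma in enumerate(ordered, start=1)}


# Invented counts, for illustration only.
raw = {"other_lemma": 100, ADJ: 60, ADV: 50}
merged = merge_lemmas(raw, {ADV: ADJ})

print(rank(raw)[ADJ])     # 2
print(rank(merged)[ADJ])  # 1
```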
In making the innumerable judgment calls of this type it was enormously helpful to have figures from two different samples. Where they agreed, we could accept the results with some confidence; where they disagreed, we could investigate the details in the TLG corpus itself and make a determination. A word like κίνησις, for example, rises very high in the TLG, but not in Logeion. On further examination in the TLG corpus it emerges as very important in Aristotle and certain other philosophical texts, but not terribly common elsewhere. So it was omitted. παρασκευάζω fails to make the cut in the TLG, but does in Logeion. It turns out to be very common in classical prose, and so we kept it. Such examples could be multiplied. In doubtful cases, we preferred words common in classical Greek across genres, and tried to avoid words not extremely common in the standard authors, but whose frequency data was inflated due to being very common in specialized mathematical, medical, or Aristotelian texts. ψυχρός is an example of a word that makes the top 500 in TLG thanks to its prominence in medical texts (and was thus omitted here); the statistical prominence of γωνία is due to its appearance in very repetitive mathematical texts, and was thus also rejected.
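The first step of this comparison, separating the lemmas on which two lists agree from those that need a closer look, can be sketched with simple set operations. The function name `compare_lists` is hypothetical and the two short lists are toy stand-ins for the real top-500 lists; the judgment about each disputed word still has to be made by a person.

```python
KAI = "καί"
DE = "δέ"
KINESIS = "κίνησις"
PARASKEUAZO = "παρασκευάζω"


def compare_lists(tlg_top, logeion_top):
    """Separate lemmas found on both lists from those found on only one."""
    tlg, logeion = set(tlg_top), set(logeion_top)
    return {
        "both": tlg & logeion,          # accept with some confidence
        "only_tlg": tlg - logeion,      # investigate further
        "only_logeion": logeion - tlg,  # investigate further
    }


# Toy stand-ins for the two lists, for illustration only.
tlg_top = [KAI, DE, KINESIS]
logeion_top = [KAI, DE, PARASKEUAZO]

comparison = compare_lists(tlg_top, logeion_top)
for group, lemmas in comparison.items():
    print(group, sorted(lemmas))
```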
Occasional editing was also done to avoid the appearance of many words based on the same root. φιλία, included in Logeion’s top 500, was omitted here, since we already have the adjective φίλος. δικαστής, a possible candidate, was rejected because we already had δίκαιος and δίκη. A few words that didn’t quite make the cut statistically were included because of their cultural importance, such as ἐλεύθερος and δαίμων. Some verbs whose simplex forms do not rise to the level of the top 500 on their own were included because they are extremely common as compounds of similar meaning: βαίνω and ἀγγέλλω, for instance. In general, compound verbs were only included in addition to the simplex forms when the compounds either a) were extremely common on their own, b) had a substantially different definition from the simplex, or c) were vastly more common than the simplex so that the simplex itself could be omitted. Thus παρασκευάζω but not σκευάζω.
The goal throughout was to achieve a balance between the goals of (unrealizable) statistical perfection on the one hand, and pedagogical utility on the other. Definitions are intended to cover most or all of the principal meanings, and were adapted by Chris Francese and Eric Casey from the various dictionaries made available on the Logeion dictionary tool.
We at DCC are extremely grateful to the following people for generous and substantial help in the creation of this list: Helma Dik, Associate Professor of Classics at the University of Chicago and the developer of the Logeion interface; Maria Pantelia, Professor of Classics at University of California, Irvine and Director of the TLG; Wilfred Major, Assistant Professor in the Department of Foreign Language and Literature at Louisiana State University; and Eric Casey, Associate Professor of Classics at Sweet Briar College. This list would have been of far lower quality without their help. None of them, however, is in any way responsible for what errors remain in the list. We would very much appreciate being apprised of any errors users might discover.
Several Dickinson students also made invaluable contributions, typing in Greek, helping to compare the lists and statistics, and organizing the lists by parts of speech and semantic categories. We are particularly grateful to Alice Ettling (’11), James Martin (’12), Merri Wilson (’12), Derek Frymark (’11), and Qingyu Wang (’15).
For a Greek 50% list, see Major, 2008, p. 4. For Latin, Paul Diederich's list of 300 words, and Mahoney's 200 Essential Latin Words perform a similar function, though neither was created using Perseus or has quite the precision or sample size of Major's Greek list.
A Greek 80% list is also available in Major's excellent article (Major, 2008, pp. 12-24). Useful larger lists of high frequency Latin words include those of Lodge (1922), 2000 words, based on a small corpus of classical school authors; and Diederich (1939), 1,500 words, based on a wider selection of classical and medieval Latin found in anthologies, a database of about 200,000 words. Diederich's list of 1,500 words, which represents about 85% of the words in a typical Latin text, has been very usefully edited and presented by Carolus Raeticus on his site, Hiberna Caroli Raetici.
In late 2011, the Perseus Digital Library included 10.5+ million words of Latin, from antiquity through the Renaissance, and 13+ million words of Greek, from antiquity through the Byzantine period. The Perseus Vocabulary Tool can be used to create various kinds of custom lists on the basis of this corpus, but see above on the work required to make these usable.
Finally, anyone interested in the issue of Latin vocabulary and learning should consult Muccigrosso's important article (2004).
Diederich, Paul Bernard. (1939). The Frequency of Latin Words and Their Endings (Dissertation, University of Chicago).
Lodge, Gonzalez. (1922). The Vocabulary of High School Latin (New York: Teachers College, Columbia University).
Major, Wilfred E. (2008). It’s Not the Size, It’s the Frequency: The Value of Using a Core Vocabulary in Beginning and Intermediate Greek. CPL Online, 4.1, 1-24.
Muccigrosso, John. (2004). Frequent Vocabulary in Latin Instruction. Classical World, 97, 409-433.
Raeticus, Carolus. (2011). Topical Vocabulary. Retrieved December 19, 2011.