The file english.words in the current directory is part of a multi-language wordlist package which I originally assembled around March 1992, at DEC SRC, and distributed through public FTP. The original README file for the English part of the package is reproduced below. The other files mentioned in the original README are not easily available at the moment, sorry. If you really need them, send me mail. *** IMPORTANT NOTE *** My current address is Jorge Stolfi Computer Science Department (DCC - IMECC) Universidade Estadual de Campinas (UNICAMP) Campinas, SP, Brazil. ------------------------------------------------------------------------ FILE: english.words VERSION: DEC-SRC-92-04-05 EDITOR Jorge Stolfi DEC Systems Research Center AUTHORS OF ORIGIONAL WORDLISTS Andy Tanenbaum Barry Brachman Geoff Kuenning Henk Smit Walt Buehring DESCRIPTION The file english.words is a list of over 104,000 English words compiled from several public domain wordlists. The file has one word per line, and is sorted with sort(1) in plain ASCII collating sequence. The file is supposed to include all verb forms ("-s", "-ed", "-ing"), noun plurals and possesives, and forms derived by various prefixes and suffixes ("un-", "re-", "-ly", "-er", "-ation", etc.) However, the list is still highly incomplete and inconsistent: not all stems have all forms, and some forms (notably possesive plural) are missing altogether. The file is NOT supposed to contain any "proper" names, such as the names of ordinary persons, corporations and organizations; nations, countries and other geographical names; mythological figures; biological genera; and trademarked products. It is also not supposed to contain abbreviations, measurement symbols, and acronyms. (Some of these are available in separate files; see below). The pronoun "I" and its contractions ("I'm", "I've") are capitalized as usual; the other words are all in lowercase. Besides the letters [a-zA-Z], the file uses only hyphen apostrophe, and newline. AUXILIARY LISTS In the same directory as englis.words there are a few complementary word lists, all derived from the same sources [1--8] as the main list: english.names A list of common English proper names and their derivatives. The list includes: person names ("John", "Abigail", "Barrymore"); countries, nations, and cities ("Germany", "Gypsies", "Moscow"); historical, biblical and mythological figures ("Columbus", "Isaiah", "Ulysses"); important trademarked products ("Xerox", "Teflon"); biological genera ("Aerobacter"); and some of their derivatives ("Germans", "Xeroxed", "Newtonian"). misc.names A list of foreign-sounding names of persons and places ("Antonio", "Albuquerque", "Balzac", "Stravinski"), extracted from the lists [1--8]. (The distinction betweeen "English-sounding" and "foreign-sounding" is of course rather arbitrary). org.names A short lists names of corporations and other institutions ("Pepsico", "Amtrak", "Medicare"), and a few derivatives. The file also includes some initialisms --- acronyms and abbreviations that are generally pronounced as words rather than spelled out ("NASA", "UNESCO"). english.abbrs A list of common abbreviations ("etc.", "Dr.", "Wed."), acronyms ("A&M", "CPU", "IEEE"), and measurement symbols ("ft", "cm", "ns", "kHz"). english.trash A list of words from the original wordlists that I decided were either wrong or unsuitable for inclusion in the file english.words or any of the other auxiliary lists. It includes typos ("accupy", "aquariia", "automatontons") spelling errors ("abcissa", "alleviater", "analagous") bogus derived forms ("homeown", "unfavorablies", "catched") uncapitalized proper names ("afghanistan", "algol", "decnet") uncapitalized acronyms ("apl", "ccw", "ibm") unpunctuated abbreviations ("amp", "approx", "etc") British spellings ("advertize", "archaeology") archaic words ("bedight") rare variants ("babirousa") unassimilated foreign words ("bambino", "oui", "caballero") mis-hyphenated compounds ("babylike", "backarrows") computer keywords and slang ("lconvert", "noecho", "prog"), (I apologize for excluding British spellings. I should have split the list in three sublists--- common English, British, American---as ispell does. But there are only so many hours in a day...) english.maybe A list of about 5,000 lowercase words from the "mts.dict" wordlist [6] that weren't included in english.words. This list seems to include lots of "trash", like uncapitalized proper names and weird words. It would take me several days to sort this mess, so I decided to leave it as a separate file. Use at your own risk... ORIGINAL LISTS The original wordlists from which those files were compiled are listed below. They were obtained by anonymous FTP on 92-Feb-10. [1] file: ispell/ispell/english.lrg size: 690778 bytes contact: Walt Buehring from: phloem.uoregon.edu: /pub/src/ispell.3.0.tar.Z * The (unexpanded) "large" english wordlist for ispell 3.0. [2] file: ispell/ispell/english.sml+ size: 575226 bytes contact: Walt Buehring from: phloem.uoregon.edu: /pub/src/ispell.3.0.tar.Z * The (expanded) "small" english wordlist for ispell 3.0. [3] file: words.english.Z size: 217119 bytes (479261 bytes uncompressed) contact: Henk Smit from: donau.et.tudelft.nl: /pub/words/ * From the README file on ftp.cs.vu.nl: This list is made out of 2 lists, the normal /usr/dict/words on most Unix systems, TeX english wordlist (available at archive.cs.ruu.nl) [4] file: dict.2 size: 274848 bytes contact: H Morrow Long from: bulldog.cs.yale.edu: /pub/dict.shar * According to H. Morrow, it came with some version of the "ispell" package. [5] file: minix.dict size: 357226 bytes author: Andy Tanenbaum from: cs.ubc.ca: /pub/wordlists-1.0.tar.Z * From the README file: Article 1997 of comp.os.minix: From: ast@botter.UUCP Subject: A spelling checker for MINIX Date: 6 Jan 88 22:28:22 GMT Reply-To: ast@cs.vu.nl (Andy Tanenbaum) Organization: VU Informatica, Amsterdam This dictionary is NOT based on the UNIX dictionary so it is free of AT&T copyright. I built the dictionary from three sources. First, I started by sorting and uniq'ing some public domain dictionaries. Second, as some of you probably know, I have written somewhere between 3 and 6 books (depending on precisely what you count) and an additional 50 published papers on operating systems, networks, compilers, languages, etc. This data base, which is online, is nonnegligible :-) Finally, I added a number of words that I thought ought to be in the dictionary including all the U.S. states, all the European and some other major countries, principal U.S. and world cities, and a bunch of technical terms. I don't want my spelling checker to barf on arpanet, diskless, modem, login, internetwork, subdirectory, superuser, vlsi, or winchester just because Webster wouldn't approve of them. All in all, the dictionary is over 40,000 words. If you have any suggestions for additions or deletions, please post them. But please be sure you are not infringing on anyone's copyright in doing so. Andy Tanenbaum (ast@cs.vu.nl) [6] file: mts.dict size: 346983 bytes contact: Barry Brachman from: cs.ubc.ca: /pub/wordlists-1.0.tar.Z * From the README file: These word lists were collected by Barry Brachman at the University of British Columbia. They may be freely distributed as long as this notice accompanies them. mts.dict contains only words that are not in /usr/dict/words. [But note that your version of /usr/dict/words may be different from mine! Use "sort -u" to get a list of unique words. ] From wc: 24259 24259 198596 /usr/dict/words 35475 35475 346992 mts.dict ----- ----- ------- 59734 59734 545588 total [7] file: words.english.Z size: 288385 bytes (644217 bytes uncompressed) from: ftp.hawaii.edu: /pub/editors/LEXICAL/word-lists/ author: unknown. COMMENTS: The "large" list from ispell 3.0 [1] is the most complete, and contains almost all the words of the "small" ispell list [2], of Andy Tannenbaum's list minix.dict [5], and of the lists from Delft and Yale [3, 4], as well as /usr/dict/words. It leaves out some 500--1000 words from each of these lists. On the other hand, the file mts.dict from UBC [6] contains some 7000 words that are not in the ispell list [1]. Therefore, mts.dict seems to be largely orthogonal to the list [1--5]. The file words.english from Hawaii [7] seems to be the union of mts.dict [6], Andy's file minix.dict [5], and /usr/dict/words, except that it omits some 250 words from the latter. COMPILATION PROCESS The file english.words is a slightly cleaned-up version of the "large" english wordlist [1] that comes with the ispell 3.0 package, which is available from phloem.uoregon.edu. First, I expanded the prefixes and suffixes using "isexpand" and some Gnuemacs hacking, and removed all words with capitals or periods. Then I compared the result with other publicly available wordlists [2--7], and did a little bit of manual cleanup. That meant removing some 8500 words that were obviously wrong or inappropriate, and adding about 4800 new words. Those 8500 words were largely distributed among the other lists. The table below gives the number of lowercase words in each original list ("lcase"), and how many of such words were included ("accept") and not included ("reject") in the final file english.words: ref site: file lcase accept reject --- ---------------------- ------- ------ ------ [1] uoregon: english.lrg 103124 102000 1124 [2] uoregon: english.sml+ 56694 56223 471 [3] tudelft: words.english 48150 47305 845 [4] yale: dict.2 47355 46577 778 [5] ubc: minix.dict 38699 38394 305 [6] ubc: mts.dict 35215 28874 6341 [7] hawaii: words.english 65165 57558 7607 (NON-)COPYRIGHT STATUS To the best of my knowledge, all the files I used to build these wordlists were available for public distribution and use, at least for non-commercial purposes. I have confirmed this assumption with the authors of the lists, whenever they were known. Therefore, it is safe to assume that the wordlists in this package can also be freely copied, distributed, modified, and used for personal, educational, and research purposes. (Use of these files in commercial products may require written permission from DEC and/or the authors of the original lists.) Whenever you distribute any of these wordlists, please distribute also the accompanying README file. If you distribute a modified copy of one of these wordlists, please include the original README file with a note explaining your modifications. Your users will surely appreciate that. (NO-)WARRANTY DISCLAIMER These files, like the original wordlists on which they are based, are still very incomplete, uneven, and inconsitent, and probably contain many errors. They are offered "as is" without any warranty of correctness or fitness for any particular purpose. Neither I nor my employer can be held responsible for any losses or damages that may result from their use.