Concordancer

A concordancer is a computer program that automatically constructs a concordance—an alphabetised index of every occurrence of a word or phrase in a body of text, each entry displayed with its surrounding context. Concordancers are primary tools in corpus linguistics, lexicography, computer-assisted translation, and language teaching. The most common display format is the key word in context (KWIC) layout, in which each hit appears centred on a line with a fixed span of words to its left and right, enabling rapid scanning of usage patterns across many occurrences.

History

Pre-computational concordances

The compilation of concordances predates computers by many centuries. Around 1230, the French Dominican cardinal Hugh of Saint-Cher directed a team of friars in assembling a concordance of the Latin Vulgate Bible, generally regarded as the first systematic concordance of any text.¹ To help readers locate passages, Hugh divided each biblical chapter into lettered sections. Later milestones include a Hebrew Old Testament concordance compiled by Rabbi Mordecai Nathan (1448), Alexander Cruden's Complete Concordance to the Holy Scriptures (1737), and the manuscript Asaf ha-Mazkir, an unfinished concordance to the Babylonian Talmud compiled by Moses Rigotz around the turn of the 19th century.²

First computer concordance

The first concordance produced with computing assistance was the Index Thomisticus, a comprehensive lexical index of the writings of and around Thomas Aquinas, totalling approximately 10.6 million Latin words. The Italian Jesuit priest Roberto Busa conceived the project in 1946 and secured the sponsorship of IBM in 1949 after a meeting with chairman Thomas J. Watson.³ Keypunch operators in Gallarate, Italy, encoded the texts onto punched cards from around 1950. IBM executive Paul Tasman developed the processing methods. The full 56-volume printed edition was completed around 1980, followed by a CD-ROM edition in 1989 and a web-accessible version in 2005.

The KWIC format

The key word in context (KWIC) display was formalised as a computational technique by Hans Peter Luhn, a researcher at IBM, in a 1960 paper in American Documentation.⁴ In KWIC output, each instance of the search term (the node word) is centred on a line with a fixed window of words to each side; sorting the resulting lines alphabetically by the immediately adjacent word reveals collocational and phraseological patterns at a glance.⁵
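The layout described above can be illustrated with a short Python sketch. It is a minimal illustration, not the method of Luhn's paper or of any particular concordancer: the function names (`kwic`, `sort_by_r1`, `format_lines`), the regex tokeniser, and the sample text are all assumptions made for the example.

```python
import re

def kwic(text, node, window=4):
    """Return (left, node, right) token lists for every hit of `node`."""
    # Naive tokeniser (an assumption): lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits

def sort_by_r1(hits):
    """Sort concordance lines by the first word to the right of the node."""
    return sorted(hits, key=lambda h: h[2][0] if h[2] else "")

def format_lines(hits, width=30):
    """Right-align the left context so the node word forms one column."""
    return [
        f"{' '.join(left):>{width}}  {node}  {' '.join(right)}"
        for left, node, right in hits
    ]

sample = ("The strong tea was served hot. She prefers strong coffee in the "
          "morning. A strong wind blew all night.")
lines = format_lines(sort_by_r1(kwic(sample, "strong", window=3)))
for line in lines:
    print(line)
```

Sorting on the first right-hand word groups the lines by what follows the node, which is what makes recurring combinations easy to spot in a long concordance.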

COCOA

One of the first dedicated concordancing programs was COCOA (COunt and COncordance Generation on Atlas), created in 1965 by D. B. Russell at University College London and the Atlas Computer Laboratory in Harwell, Oxfordshire.⁶ Written in FORTRAN and occupying approximately 4,000 punched cards, it processed text annotated with flat, non-hierarchical markup tags and could produce word counts and concordances in multiple languages. Within its first six months COCOA had been applied to texts in at least six languages. A second version designed for multiple mainframe platforms was distributed to British computing centres in the mid-1970s. Growing dissatisfaction with its interface and the eventual withdrawal of Atlas Laboratory support prompted British funding bodies to commission a successor program.

Oxford Concordance Program

The Oxford Concordance Program (OCP) was designed and written in FORTRAN by Susan Hockey and Ian Marriott at Oxford University Computing Services (OUCS) between 1979 and 1980 and first released in 1981.⁷ Hockey and Marriott acknowledged that OCP owed much to COCOA and the CLOC system at the University of Birmingham. OCP accepted COCOA-format markup to encode metadata such as author, act, scene, and line number, and was described by its authors as "a machine-independent text analysis program for producing word lists, indices and concordances in a variety of languages and alphabets." By the mid-1980s it had been licensed to approximately 240 institutions in 23 countries.⁸ Version 2, a rewrite carried out in 1985–86, was documented in a 1987 article by Hockey and John Martin.⁷ A personal computer version, Micro-OCP, was developed for the IBM PC and sold by Oxford University Press from the late 1980s.

Personal computer era

The availability of affordable personal computers in the 1980s and 1990s enabled standalone concordancing applications that analysts could run locally without specialist computing facilities. MicroConcord, developed by Mike Scott and Tim Johns and published by Oxford University Press in 1993 for MS-DOS, was among the first concordancers designed specifically for classroom language teaching.⁹ WordSmith Tools, also developed by Mike Scott, was first released in 1996 and became one of the most widely used corpus analysis suites in academic linguistics research.¹⁰ Other tools from this era include TACT (University of Toronto, 1989), a suite of MS-DOS freeware programs for literary text analysis, and MonoConc, a Windows concordancer created by Michael Barlow.

Web-based concordancers

From the late 1990s onwards, web-based concordancers hosted on remote servers gave researchers browser access to large preloaded corpora without requiring local storage or processing. The Sketch Engine, developed by Adam Kilgarriff and Pavel Rychlý (Masaryk University), was launched commercially in July 2003 by Lexical Computing Limited and introduced word sketches: automatically generated one-page profiles of a word's typical grammatical relations and collocations.¹¹ Freely distributed desktop tools continued to appear alongside these services. AntConc, created by Laurence Anthony at Waseda University, Tokyo, was first released in 2002 as freeware for Windows, macOS, and Linux.¹²

Features

Modern concordancers typically offer a range of analytical functions beyond basic KWIC display.⁵ ¹³ These commonly include:

KWIC display with the node word centred and context words in aligned columns, sortable by the word one, two, or three positions to the left or right of the node (L1–L3 and R1–R3)
Concordance plots, visualising the distribution of hits as marks along a scaled bar representing each text in the corpus
Frequency and word lists, both alphabetical and ranked by frequency
Collocation statistics, identifying words that co-occur with the search term more often than chance, quantified by measures such as mutual information, the t-score, or log-likelihood
Keyword analysis, comparing word frequencies between a study corpus and a reference corpus to identify statistically distinctive items
N-gram analysis, finding frequently recurring word sequences of a specified length
Part-of-speech tagging integration, allowing searches filtered to particular grammatical categories
Unicode support for multilingual text
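As a rough illustration of the collocation statistics listed above, the following Python sketch scores the words found near a node word using mutual information and the t-score. It is a simplified sketch under stated assumptions: the function name `collocates`, the symmetric window, and the way the expected frequency is scaled by window size are choices made for this example, and individual tools compute these measures with differing formulae.

```python
import math
from collections import Counter

def collocates(tokens, node, span=2):
    """Score words occurring within `span` tokens either side of `node`."""
    n = len(tokens)
    freq = Counter(tokens)
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            observed.update(window)
    scores = {}
    for word, o in observed.items():
        # Expected co-occurrences if node and word were independent,
        # scaled by the 2 * span window slots around each node hit
        # (an assumption; some tools omit or vary this scaling).
        e = freq[node] * freq[word] * 2 * span / n
        scores[word] = {
            "observed": o,
            "expected": e,
            "mi": math.log2(o / e),          # mutual information
            "t": (o - e) / math.sqrt(o),     # t-score
        }
    return scores

tokens = "strong tea and strong coffee but weak tea and strong tea".split()
for word, s in sorted(collocates(tokens, "strong", span=1).items(),
                      key=lambda kv: kv[1]["mi"], reverse=True):
    print(word, s["observed"], round(s["mi"], 2), round(s["t"], 2))
```

Mutual information tends to favour rare words that occur almost exclusively beside the node, while the t-score favours frequent collocates, so the two measures are often read side by side.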

Bilingual and parallel concordancers additionally display aligned text in two or more languages side by side, enabling comparison of translation equivalents across language pairs.

Notable concordancers

WordSmith Tools

Created by Mike Scott and first released in 1996, WordSmith Tools is a Windows corpus analysis suite that evolved from MicroConcord.¹⁰ ¹⁴ Its three core modules are Concord (KWIC concordances), WordList (frequency and alphabetical word lists), and Keywords (statistical keyword identification relative to a reference corpus). Oxford University Press used WordSmith Tools for dictionary preparation work. Version 4.0 is freely available; later versions are sold by Lexical Analysis Software Limited.

AntConc

AntConc is a freeware, multiplatform concordancing toolkit created by Laurence Anthony, Professor of Applied Linguistics at Waseda University, Tokyo.¹⁵ First released in 2002 and formally described in a 2005 academic paper, it runs on Windows, macOS, and Linux. Its tools include a KWIC concordancer, a concordance plot for visualising distribution across texts, a collocates tool, a keyword list, and an n-gram analysis module. Because it is free and requires only plain text files, AntConc is widely used in linguistics courses and independent research worldwide.

Sketch Engine

The Sketch Engine is a corpus management and query system co-created by Adam Kilgarriff and Pavel Rychlý and launched in 2003 by Lexical Computing Limited.¹¹ ¹⁶ It provides browser-based access to over 800 corpora in more than 100 languages. Beyond concordance searching, it offers word sketches, collocation analysis, distributional thesaurus construction, keyword and terminology extraction, and diachronic analysis. It is used by major publishers including Macmillan and Oxford University Press for lexicographic research. A subset tool, SKELL (Sketch Engine for Language Learning), is freely accessible to individual learners.

Wmatrix

Wmatrix is a web-based corpus processing environment developed by Paul Rayson at the University Centre for Computer Corpus Research on Language (UCREL), Lancaster University.¹⁷ Alongside concordances and frequency lists, Wmatrix integrates CLAWS part-of-speech tagging and the USAS semantic tagger, enabling keyword analysis simultaneously at the levels of individual words, grammatical categories, and semantic domains—an approach that extends standard keyword methods beyond simple lexical comparison.
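Keyword analysis of this kind is commonly based on a log-likelihood comparison of an item's frequency in a study corpus and a reference corpus. The Python sketch below shows one widely used form of that calculation. It is not taken from Wmatrix or any other tool: the function names and the toy data are assumptions. The input lists may hold words, part-of-speech tags, or semantic tags, which is how the same comparison can be run at each level.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for an item seen `a` times in a study corpus of
    `c` tokens and `b` times in a reference corpus of `d` tokens."""
    e1 = c * (a + b) / (c + d)   # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_items, reference_items):
    """Rank items that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_items), Counter(reference_items)
    c, d = len(study_items), len(reference_items)
    rows = []
    for item, a in study.items():
        b = ref[item]
        if a / c > b / d:        # keep positive keywords only
            rows.append((item, log_likelihood(a, b, c, d)))
    return sorted(rows, key=lambda r: r[1], reverse=True)

study = ["corpus"] * 5 + ["the"] * 5
reference = ["the"] * 5 + ["a"] * 5
print(keywords(study, reference))
```

An item with identical relative frequency in both corpora scores zero; the larger the value, the less plausible it is that the difference in frequency arose by chance.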

ParaConc

ParaConc, developed by Michael Barlow, is a Windows concordancer for parallel (multilingual) corpora that accepts up to four aligned texts in different languages.¹⁸ Designed for contrastive analysis, translation studies, and language learning research, it includes a "Hot words" feature that uses relative frequency data to suggest likely translation equivalents of a search word.

LancsBox

LancsBox is a free, cross-platform corpus analysis tool developed at Lancaster University under the direction of Vaclav Brezina.¹⁹ Released in 2015, it supports more than 15 languages and includes a KWIC concordancer, frequency analysis, and a GRAPH tool that renders collocations as an interactive network diagram. It integrates the TreeTagger for part-of-speech annotation and was designed to lower barriers to corpus analysis in teaching and research contexts.

Applications

Corpus linguistics

Concordancers are the primary analytical instrument in corpus linguistics, providing systematic access to patterns of use across large samples of authentic text. Common research uses include studying collocations and phraseology, analysing semantic prosody, comparing language varieties, and tracking lexical and grammatical change over time. Large reference corpora such as the British National Corpus (approximately 100 million words) and the Corpus of Contemporary American English (over one billion words) are typically queried through dedicated web concordancers.

Lexicography

John Sinclair at the University of Birmingham pioneered the systematic use of concordance data in dictionary making through the COBUILD project, funded by Collins from the early 1980s. The project produced the Collins COBUILD English Language Dictionary (1987), generally considered the first major English dictionary compiled entirely from corpus evidence rather than invented illustrative examples.²⁰ Concordance lines allowed lexicographers to observe authentic collocates, typical syntactic environments, and register distinctions that introspection-based methods had tended to miss. Corpus-driven methods have since become standard practice in commercial lexicography.

Computer-assisted translation

In computer-assisted translation (CAT) software, a concordancer search allows translators to query a translation memory for all previously translated instances of a word or phrase in context, enabling consistency across a document or project. Bilingual concordancers—which search sentence-aligned parallel corpora in two languages simultaneously—are also used to locate translation equivalents in existing translated texts. Web-based bilingual concordancers such as Linguee and Reverso Context extend this capability to large publicly accessible multilingual corpora.
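A bilingual concordance search over sentence-aligned text can be sketched in a few lines of Python. This is a minimal illustration only: the function name `bilingual_concordance`, the representation of the translation memory as a list of (source, target) pairs, and the sample sentences are assumptions, and real CAT tools add fuzzy matching, segmentation, and indexing.

```python
import re

def bilingual_concordance(pairs, query):
    """Return aligned (source, target) pairs whose source contains `query`
    as a whole word or phrase, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(query)}\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]

# Hypothetical translation memory: English source, French target.
memory = [
    ("The contract enters into force on signature.",
     "Le contrat entre en vigueur dès sa signature."),
    ("Either party may terminate the contract.",
     "Chacune des parties peut résilier le contrat."),
    ("Payment is due within thirty days.",
     "Le paiement est dû dans un délai de trente jours."),
]

for src, tgt in bilingual_concordance(memory, "contract"):
    print(src)
    print("    " + tgt)
```

Showing each hit together with its aligned translation lets the translator see how a term was rendered previously and reuse the same equivalent.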

Language teaching

Tim Johns at the University of Birmingham coined the term data-driven learning (DDL) around 1990 to describe a pedagogical approach in which language learners use concordancers to explore corpus evidence and discover grammatical and lexical patterns inductively, acting as "language detectives" rather than passive recipients of pre-stated rules. Johns and Mike Scott developed MicroConcord (1993) specifically for classroom use. DDL has since been studied extensively across second and foreign language teaching contexts and has been found to support learner autonomy and awareness of collocational patterns.

References

"Hugh of St. Cher's Concordance". Christianity.com. Retrieved 2025-04-01.
"Concordance". Jewish Encyclopedia. 1906. Retrieved 2026-05-04.
"Father Roberto Busa Conceives the Index Thomisticus". History of Information. Retrieved 2025-04-01.
Luhn, H. P. (1960). "Key word-in-context index for technical literature (kwic index)". American Documentation. 11 (4): 288–295. doi:10.1002/asi.5090110403.
"Concordancing". Corpus Linguistics: Method, Theory and Practice. Lancaster University. Retrieved 2025-04-01.
"COCOA: Count and Concordance Generation on Atlas". Chilton Computing. Retrieved 2025-04-01.
Hockey, Susan; Martin, John (1987). "The Oxford Concordance Program Version 2". Literary and Linguistic Computing. 2 (2): 125–131. doi:10.1093/llc/2.2.125.
"Oxford Concordance Program". CTI Centre for Textual Studies, Oxford University. Retrieved 2025-04-01.
"Tim Johns: concordancing in the language classroom". lexically.net. Retrieved 2025-04-01.
"WordSmith Tools". Lexical Analysis Software. Retrieved 2025-04-01.
Kilgarriff, Adam; Rychlý, Pavel; Smrž, Pavel; Tugwell, David (2004). "The Sketch Engine" (PDF). Proceedings of the 11th EURALEX International Congress. Lorient. pp. 105–116.
Anthony, Laurence (2005). "AntConc: Design and development of a freeware corpus analysis toolkit for the technical writing classroom". Proceedings of the IEEE International Professional Communication Conference. pp. 729–737. doi:10.1109/IPCC.2005.1494244.
Weisser, Martin. "Concordancers: An Overview". Retrieved 2025-04-01.
Scott, Mike. "WordSmith Tools Version 4 Manual" (PDF). Lexical Analysis Software. Retrieved 2025-04-01.
Anthony, Laurence. "AntConc". Waseda University. Retrieved 2025-04-01.
Kilgarriff, Adam; Baisa, Vít; Bušta, Jan; Jakubíček, Miloš; Kovář, Vojtěch; Michelfeit, Jan; Rychlý, Pavel; Suchomel, Vít (2014). "The Sketch Engine: Ten Years On". Lexicography. 1 (1): 7–36. doi:10.1007/s40607-014-0009-9.
Rayson, Paul. "Wmatrix: A web-based corpus processing environment". UCREL, Lancaster University. Retrieved 2025-04-01.
Barlow, Michael. "ParaConc". Retrieved 2025-04-01.
"LancsBox". Lancaster University. Retrieved 2025-04-01.
"The History of COBUILD". Collins Dictionary. Retrieved 2025-04-01.
