Testing vocabulary

 

Do we need to test vocabulary? Yes, indeed! And we need more research on the size, the use, and the acquisition of vocabulary.

Here is one example:

At universities around the world, non-native speakers of English face a heavy burden of long and difficult English texts containing vocabulary they never met in high school. Many are slow readers and think they understand much more than they actually do.

 

A Study of TEFL Vocabulary by Prof. Magnus Ljung (1990) at Stockholm University compares a corpus of texts intended for Swedish upper secondary schools with a corpus of modern English (the Birmingham Corpus, or B'ham Corpus), which was compiled at the University of Birmingham. The investigation found that words denoting concrete objects and physical actions predominate in the TEFL corpus, while words denoting abstractions and mental processes are under-represented.

Magnus Ljung says in his concluding remarks that “there is reason to be critical of the TEFL texts on at least two major counts, i.e. the low general level of lexical sophistication and the absence of a clear increase in vocabulary difficulty as we move from the early to the later school years. The words which are missing or under-represented in the TEFL texts are not, on the whole, particularly rare or abstruse. In most cases, they are precisely those words which it is necessary to know in order to read British or American (quality) newspapers and magazines, or to understand news broadcasts and discussions of current events on radio and TV.”

Background to the ForumEducation vocabulary test

In the late 1980s the Birmingham Corpus was lemmatised by EFL teachers in a school project in Goteborg, Sweden. The corpus contained about 20 million running words (tokens) and nearly 250,000 word types, which yielded 34,000 useful lemmas.

The B'ham Corpus contents:

Book authorship:     75% male, 25% female
Language varieties:  70% British, 20% American, 5% other
Language mode:       75% written, 25% spoken

Corpora had been used for vocabulary testing at the University of Goteborg since the 1960s, but none had been lemmatised. Words were picked from six frequency bands. Each of the 20 words from each band was presented with five alternatives in Swedish (the correct answer plus distractors). Many thousands of students were tested, and the test was considered relatively reliable.

With increasing numbers of immigrants in the secondary schools and the university, there was a clear demand for a similar test without translations. The lemmatised list from the B'ham Corpus was used to create a test that relied on associations and synonyms instead of translations. It was used extensively throughout the 1990s as a pen-and-paper test in adult secondary education. In 1998 ForumEducation used the lemmatised list to create a database that could randomly generate tests at three levels for on-line testing.

For the level 1 test, four frequency bands are used; for levels 2 and 3, six bands are used. Twenty words are randomly chosen from each band.

As the test can be taken on-line, it is ideal for testing large populations. More than 500,000 people were tested between 1998 and 2011. During a six-week period in 2005, 12,000 CET test takers were tested in China. Several thousand ESL/EFL teachers have also taken the test.

The lemmatisation of the B'ham Corpus

The B'ham frequency word list has about 250,000 different word types. Every inflected form of a word has its own frequency number, denoting how many times that particular form occurs in the whole corpus (20 million words). When estimating vocabulary size, it is the lexical form of the word that is of interest, so the frequencies of the various word types have to be added up and entered under a single lexical form (lemma). For practical reasons and to save time, it was decided that the lemmatised word list should be based on the word types that appeared three times or more in the corpus. This limit reduced the number of word types to c. 85,000.

The lemmatisation of the word "record" may serve as an example of how the lemmatisation was done:

record 1336
record's 4
recorded 527
recording 257
recordings 91
records 625
record 2840 n v

(n and v refer to parts of speech and were added to all lemmas in accordance with the categorisation in CED, the Collins English Dictionary.)
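
The addition in the "record" example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the original lemmatisation was done by hand, and the mapping type_to_lemma and the function lemmatise are hypothetical names introduced here, not part of the ForumEducation database. The frequencies are the ones listed above, and the cut-off of 3 is the limit described in the text.

```python
from collections import defaultdict

# Word-type frequencies taken from the "record" example in the text.
type_freqs = {
    "record": 1336,
    "record's": 4,
    "recorded": 527,
    "recording": 257,
    "recordings": 91,
    "records": 625,
}

# Hypothetical stand-in for the manual dictionary look-up:
# every word type is assigned to the lemma it belongs to.
type_to_lemma = {word_type: "record" for word_type in type_freqs}

MIN_FREQ = 3  # word types occurring fewer than 3 times are ignored


def lemmatise(type_freqs, type_to_lemma, min_freq=MIN_FREQ):
    """Add up word-type frequencies under their lemma."""
    lemma_freqs = defaultdict(int)
    for word_type, freq in type_freqs.items():
        if freq < min_freq:
            continue  # below the frequency limit
        lemma = type_to_lemma.get(word_type)
        if lemma is None:
            continue  # no dictionary entry found: the "word" is dropped
        lemma_freqs[lemma] += freq
    return dict(lemma_freqs)


lemma_freqs = lemmatise(type_freqs, type_to_lemma)
print(lemma_freqs)  # {'record': 2840}
```

The six word types add up to 2840, the figure given for the lemma "record" above.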

The Collins English Dictionary (CED), 2nd edition, 1986, was used as the reference dictionary. It contains 170,000 references, which in almost all cases cover the word types in the B'ham word list down to a frequency of 3 (F=3). Collins Cobuild is too small (70,000 references). If the existence of a word was in doubt and it was not found in CED, the "word" was deleted. During the manual lemmatisation, word categories and parts of speech were added after the frequency number.

When all 85,000 word types had been processed, the database contained 34,000 lemmas.

Presentation of tests for estimating vocabulary size

The ForumEducation vocabulary tests are based on association and, to a lesser degree, on synonyms. As a compromise between reliability and the time needed to complete the test, each test uses 80-120 test words: 20 words from each of four to six main frequency ranges. The first range contains the most common words and the last range the least common words. Each frequency range is further subdivided into ten smaller ranges in order to spread the test words over as wide a range as possible. There are always five alternatives. It should be pointed out that, in order to estimate a person's approximate vocabulary size, the test must be designed so that the test taker knows most of the words in the first test range and only a small percentage of the words in the last test range.
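
The selection of test words described above can be sketched as stratified random sampling. This is a minimal sketch and not the actual ForumEducation generator: the function names are hypothetical, the word lists are synthetic placeholders, and drawing two words from each of the ten sub-ranges (20 / 10) is an assumption, since the text does not say how the 20 words are distributed over the sub-ranges.

```python
import random


def pick_test_words(band, per_band=20, n_subranges=10, rng=random):
    """Pick per_band words from one frequency band.

    band is a list of lemmas sorted from most to least frequent.
    It is cut into n_subranges equal slices and the same number of
    words is drawn at random from each slice (an assumption).
    """
    if per_band % n_subranges != 0:
        raise ValueError("per_band must be a multiple of n_subranges")
    per_sub = per_band // n_subranges
    size = len(band)
    chosen = []
    for i in range(n_subranges):
        start = i * size // n_subranges
        end = (i + 1) * size // n_subranges
        # rng.sample draws without replacement inside the slice
        chosen.extend(rng.sample(band[start:end], per_sub))
    return chosen


def generate_test(bands, per_band=20, n_subranges=10, seed=None):
    """Return one list of test words per frequency band."""
    rng = random.Random(seed)
    return [pick_test_words(b, per_band, n_subranges, rng) for b in bands]


# Synthetic placeholder data: four bands (as in a level 1 test),
# each holding 100 made-up lemma labels in frequency order.
bands = [[f"band{b}_word{i:03d}" for i in range(100)] for b in range(4)]
test_words = generate_test(bands, seed=1)
print(len(test_words), [len(ws) for ws in test_words])  # 4 [20, 20, 20, 20]
```

With four bands this gives 80 test words and with six bands 120, which matches the range of 80-120 words per test.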

Example

Test word   Alt. 1    Alt. 2      Alt. 3    Alt. 4     Alt. 5
surgery     anger     celebrate   defeat    hospital   wave
novelty     booklet   Christmas   diamond   new        poem


Further use of the frequency database: vocabulary acquisition

Learners who participate in a ForumEducation English language course have full access to many exercises and vocabulary acquisition guides. There is also a tool that allows them to save any word they encounter to a personal glossary. To test their progress, they can generate their own tests on those words.