Abstract

This paper provides a practical solution to the problem of generating (good) pseudowords, which are commonly used in vocabulary testing and experimental research in applied linguistics, and introduces an empirically founded approach to evaluating the suitability of pseudowords for different tasks. In the first part of the paper, we propose a novel way of generating pseudowords: a character-gram chaining algorithm. A major advantage of the algorithm is that it does not require any knowledge of the language, thereby facilitating the generation of pseudowords in any language. In the second part of the paper, we address the current lack of formal criteria for evaluating pseudowords, both in terms of (i) their orthographic fit in the target language they are intended for and (ii) their suitability for use in various lexical processing and language teaching tasks. We argue for the need to evaluate pseudowords, propose a set of linguistic criteria for evaluating the generated pseudowords, and provide a comparison with other current pseudoword lists using these criteria.

INTRODUCTION

The concept of creating and using pseudowords is not new. Applied linguists, linguists, and indirectly, language teachers, have been using pseudowords in lexical decision tasks and vocabulary tests for a very long time. As such, there are several existing techniques for generating pseudowords, and several databases and software applications that utilize these techniques. However, there are also, of course, some limitations to the existing techniques and the software that uses them, primarily their reliance on existing knowledge and understanding of the language, such as knowledge of syllabification, sub-syllabification, or simply which letters can be inserted into or deleted from a word while still resulting in a phonetically or orthographically legal form.

Furthermore, to the best of our knowledge, there does not appear to be any empirical evidence confirming the quality of the pseudowords being created. Generating good pseudowords is a fine balancing act: on the one hand, the generated forms must, of course, not be existing words, while at the same time, they need to sufficiently resemble possible words already in existence (yet not so closely that they could be mistaken for real words in a testing situation; we return to this point later). Quality of pseudowords, in this context, could refer both to (i) the legal structure of pseudowords, that is, whether they conform to language-internal rules, and (ii) their suitability for use in various lexical tasks, such as whether they are more suited to morphology experiments, vocabulary testing, incidental vocabulary learning experiments, and so on.

A pseudoword (nonword, imaginary word, or disguised form) is a unit of text or speech that has no real meaning in a language but has linguistically appropriate orthographic and phonological structure (Nordquist 2018). It has the form of a word and is spelled in a predictable way, but does not exist in the lexicon (Groff 2003). Pseudowords play an essential role in lexical processing research and language teaching, and are used by linguistics researchers to test language production processes. In morphology experiments, for example, the most famous pseudoword of all, wug, is used to test plural rule productivity (Berko 1958). Berko (1958) demonstrated the implicit knowledge of linguistic morphology in children. Her research used pseudowords to evaluate children’s knowledge of morphological rules, showing that children were capable of forming suitable endings, producing plurals, possessives, and past tense forms. Pseudowords are also utilized in phonetic decoding (Cardenas 2009), in measuring pronunciation latency in learners (Schwartz 2013), and in visual word recognition through lexical decision tasks and naming tasks (Balota et al. 2007). Applied linguists and language testers have used pseudowords to assess the credibility of learners’ responses in non-native vocabulary tests. Meara’s (2010) English as a Foreign Language (EFL) vocabulary test incorporates pseudowords into the assessment of learners’ vocabulary size, using them to judge a learner’s self-evaluation of vocabulary knowledge. Another large-scale example comes from Keuleers et al. (2015), who used Wuggy (Keuleers and Brysbaert 2010) to create 20,653 nonwords (along with 52,849 words) to include in a large-scale online vocabulary size study in Dutch. Keuleers et al. (2015) shuffled the final list of 73,500 nonwords and words and broke them into sublists of 100 items, which were then used to create a yes/no test.
Psycholinguists and applied linguists (e.g., Arndt and Woore 2018; Elgort and Warren 2014; Saragi et al. 1978; Webb 2005, 2007) have used pseudowords to assess factors in incidental lexical learning, such as the extent to which vocabulary learning can happen informally, including what and how much is retained (phonological information, grammatical function, and semantic content) and how much exposure is required for such learning to take place.

In this paper, we do not wish to enter into a debate on the merits or otherwise of pseudowords in research. Instead, we first consider existing pseudoword creation techniques and the systems that use these techniques, as well as the limitations of both. Then we introduce the character-gram chaining algorithm (CGCA), our newly proposed method; a set of criteria for evaluating the legality of pseudowords; and suitability considerations which researchers may want to use when choosing pseudowords. Finally, we consider the usefulness of this new approach for languages other than English and in different domains.

Existing techniques and systems for generating pseudowords

Three main techniques are currently used to generate pseudowords: (i) manipulating a stimulus, (ii) using high frequency bi-grams, and (iii) combining sub-syllabic elements. The first technique, manipulating a stimulus, takes a stimulus word (a word that exists in the lexicon) and manipulates it in some way to create a pseudoword (see Table 1, which uses the stimulus pilot). The stimulus can be altered by changing one or two characters, either by insertion, deletion, transposition, or replacement. Alternatively, a composite pseudoword can be created by adding a prefix or suffix to the stimulus, as long as the result is not a word that exists in the lexicon (Baayen and Schreuder 2011). The English Lexicon Project (ELP) (Balota et al. 2007) is a behavioural database of 40,481 stimulus words, derived from the Kučera and Francis norms (1967) and the CELEX database (Baayen et al. 1993), and 40,481 pseudowords that were created by changing one or two characters in each of the stimulus words, alternating the location of the manipulated characters between words.

Table 1:

Creating pseudowords by manipulating a stimulus

    Insertion (h)   Deletion (o)   Transposition (i)   Composition (ation)
    Piloth          Pilt           Pliot               Pilotation
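As an illustration, the manipulations in Table 1 can be sketched in a few lines of Python. The three-word lexicon and the suffix set below are illustrative stand-ins only, not the materials used by ELP or any other published system.

```python
def manipulate(stimulus, lexicon):
    """Generate candidate pseudowords from a stimulus word via
    single-character insertion, deletion, and transposition, plus
    affixation, keeping only forms absent from the lexicon."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    candidates = set()
    # Insertion: add one letter at every position
    for i in range(len(stimulus) + 1):
        for ch in letters:
            candidates.add(stimulus[:i] + ch + stimulus[i:])
    # Deletion: drop one character
    for i in range(len(stimulus)):
        candidates.add(stimulus[:i] + stimulus[i + 1:])
    # Transposition: swap adjacent characters
    for i in range(len(stimulus) - 1):
        s = list(stimulus)
        s[i], s[i + 1] = s[i + 1], s[i]
        candidates.add("".join(s))
    # Composition: attach a suffix (illustrative affix set)
    for suffix in ("ation", "ish", "ly"):
        candidates.add(stimulus + suffix)
    return {c for c in candidates if c != stimulus and c not in lexicon}

lexicon = {"pilot", "pilots", "plot"}
pseudo = manipulate("pilot", lexicon)
print(sorted(w in pseudo for w in ("piloth", "pilt", "pliot", "pilotation")))
```

Note that the lexicon filter is essential: the deletion variant plot and the insertion variant pilots are real words and are discarded.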

The second technique, using high frequency bi-grams, involves combining two-letter sequences (bi-grams) that frequently appear together to form pseudowords. Programs that use this technique, such as WordGen (Duyck et al. 2004), tend to also consider neighbourhood size and orthographic relatedness. Table 2 shows an example for the pseudoword reroin. WordGen uses bigram frequency, in part, for pseudoword generation in Dutch, English, German, and French. It generates a random sequence of letters which is validated as a pseudoword against its number of letters and a set of seven constraints: neighbourhood size, frequency, summated bigram frequency, minimum bigram frequency, initial bigram frequency, final bigram frequency, and orthographic relatedness (Duyck et al. 2004). Depending on the language, WordGen uses either CELEX or Lexique (New et al. 2004) as its lexicon. A random sequence of letters is accepted as a pseudoword only if it is not an existing word in the lexicon and it meets all seven criteria.

Table 2:

Creating a pseudoword (reroin) using high frequency bi-grams. Generated using WordGen (Duyck et al. 2004)

    Bigram:      re      er      ro      oi     in
    Frequency:   4,760   7,279   2,840   468    7,156
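The bigram frequencies that feed constraints like the summated bigram frequency can be illustrated with a minimal count over a wordlist. The four-word list here is a toy stand-in for a real lexicon such as CELEX:

```python
from collections import Counter

def bigram_frequencies(wordlist):
    """Count how often each adjacent character pair occurs across a wordlist."""
    counts = Counter()
    for word in wordlist:
        for a, b in zip(word, word[1:]):
            counts[a + b] += 1
    return counts

freqs = bigram_frequencies(["rein", "hero", "coin", "loin"])
print(freqs["in"])  # 3: occurs in rein, coin, and loin
```

A generator in the WordGen style would compare such counts against thresholds (e.g., a minimum bigram frequency) before accepting a candidate letter sequence.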

The third technique for creating pseudowords involves combining sub-syllabic elements within a language by breaking existing syllables down into their sub-syllabic elements and then joining them back together. A syllable is a unit of sound, typically made up of a nucleus (usually a vowel) and an optional onset (initial sound) and coda (final sound). This approach takes legal sub-syllabic elements (onset, nucleus, and coda) from existing words and combines them to form a pseudoword (Keuleers and Brysbaert 2010). Table 3 shows an example for the pseudoword shib. The ARC Nonword Database (Rastle et al. 2002) is an example of a system which creates pseudowords by joining sub-syllabic elements. The database holds over 350,000 monosyllabic nonwords which combine onsets with rhymes (nucleus and coda) from sound relationships derived from the CELEX database. Another system that combines sub-syllabic elements is Wuggy (Keuleers and Brysbaert 2010). Given a list of syllabified words, it segments each word into sub-syllabic elements and builds a tree of all possible legal sub-syllabic combinations. The tree is then traversed to retrieve all possible pseudowords. Wuggy uses five lexicons (CELEX, Lexique, E-HITZ, B-PAL, and the Frequency Dictionary of Contemporary Serbian Language) to support pseudoword generation in seven languages: Dutch, English, German, French, Spanish, Serbian, and Basque.

Table 3:

Creating a pseudoword (shib) by combining sub-syllabic elements

    Onset             Nucleus         Coda            Pseudoword
    sh (as in show)   i (as in tin)   b (as in bib)   Sh-i-b
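A minimal sketch of the combination step, assuming the onset, nucleus, and coda inventories have already been extracted (the inventories and toy lexicon below are illustrative, not drawn from CELEX):

```python
from itertools import product

def combine_subsyllabic(onsets, nuclei, codas, lexicon):
    """Combine attested onsets, nuclei, and codas into candidate
    monosyllabic pseudowords, discarding forms that are real words."""
    return sorted(o + n + c
                  for o, n, c in product(onsets, nuclei, codas)
                  if o + n + c not in lexicon)

onsets = ["sh", "t"]
nuclei = ["i", "a"]
codas = ["b", "n"]
lexicon = {"tin", "tan", "shin", "tab"}
print(combine_subsyllabic(onsets, nuclei, codas, lexicon))
# ['shab', 'shan', 'shib', 'tib']
```

Even this toy inventory shows why the combination count explodes: the candidate space is the product of the three inventory sizes, which is the problem Wuggy’s constrained tree traversal is designed to manage.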

Problems with existing methods for creating pseudowords

Each of the existing techniques and systems for creating pseudowords comes with its own set of limitations, which we discuss in this section in turn. Broadly speaking, these limitations have to do with one or more of the following problems: (i) in order to create a list of pseudowords, the user needs ample knowledge of the language for which pseudowords are being created, for instance, in-depth knowledge of the syllable structure rules of that language; (ii) only a handful of the major languages are represented; and (iii) there is no way of evaluating whether the pseudowords created could actually pass as legal words of that language, or whether they might be suitable for a given task.

Manipulating a stimulus requires knowledge of which characters can be inserted, deleted, or transposed while still resulting in a legal pseudoword. Using high frequency bi-grams requires knowledge of which combinations are legal within the language. Combining sub-syllabic elements requires knowledge of syllabification, and if phonetic syllables are used (as in the ARC Nonword Database) then understanding conversion from phonotactic to orthographic forms is also required.

The sheer number of possible pseudoword combinations is a risk when combining letter sequences or sub-syllabic elements. In the case of combining sub-syllabic elements, the number of combinations increases exponentially: monosyllabic words have hundreds of thousands of possible combinations, while polysyllabic words have billions (Keuleers and Brysbaert 2010). As a solution, software like WordGen and Wuggy provides building constraints or search criteria which restrict the pseudowords that are returned, for example, number of neighbours, word frequency, and summated bigram frequency. This makes searching for pseudowords much more tractable, but can result in more complex software applications that researchers may find confusing to interact with.

Two lexicons tend to be most widely used for generating pseudowords: CELEX and Lexique. CELEX (Baayen et al. 1993) is a lexical database that contains information on orthography, phonology, morphology, syntax, and word frequency for words in English, German, and Dutch. Lexique (New et al. 2004) is a lexical database for the French language which contains information on gender, number, grammatical category, and word frequency; it is used by both of the applications that support French, WordGen and Wuggy. Both CELEX and Lexique are general-purpose lexicons, meaning that the pseudowords generated from them are also general-purpose. Support for domain-specific pseudowords appears to be lacking, which could be problematic in cases where second language learners are being tested on their knowledge of academic or discipline-specific vocabulary [e.g., from Coxhead’s Academic Word List (2000), or from disciplines such as biological sciences or engineering]. Pseudowords, in these instances, would need to resemble Graeco-Latin words and possibly be longer than general-purpose pseudowords, to make the test more realistic for learners.

The lexicons that are used by pseudoword generating software also limit the language support that is provided. ARC and ELP support English alone, WordGen supports four languages including English, and Wuggy supports seven. Wuggy also has the capacity to be extended to support any alphabet-based language. However, extending Wuggy to support other languages requires a list of syllabified words in the desired language, and information about how the syllables are segmented. There is, therefore, a need to develop a system that can be applied to a wider range of languages without requiring the in-depth knowledge of each (such as syllabification and sub-syllabification).

Each of the four existing pseudoword-generating applications has some form of criteria that determines how pseudowords are created. However, to our knowledge, none has any formal criteria for evaluating the legal structure or suitability of pseudowords post-production. Each application takes a different approach to generating pseudowords, with different criteria (grammars or principles) for restricting the forms of the generated pseudowords. Perhaps more importantly, they hold differing views, even if only slightly, on what constitutes a legal pseudoword. The ARC database focuses on phonological principles (allowing illegal bigrams and orthographic onsets and bodies), while Wuggy focuses on orthographic forms (Keuleers and Brysbaert 2010: 629).

Furthermore, although each approach mentions the importance of suitable pseudowords, in terms of their use in lexical processing, incidental language learning, and lexical decision tasks, to our knowledge there is no evidence of their suitability having been tested or evaluated. A lack of suitability could have implications for how useful the pseudowords are and for the generalization of the results found in such studies. The main problem is that different tasks have different suitability requirements. For instance, in incidental lexical learning experiments, it is desirable that pseudowords closely resemble existing simplex words. In language testing situations, by contrast, it may be undesirable to have only simplex words (depending on the type of vocabulary being tested, simplex words may stand out from the rest of the words used), and forms which resemble existing words too closely could be mistaken for real words under time pressure, which may lead to the L2 learner being tested incurring an incorrect penalty.

So while legality is a type of criterion that is largely desirable in all applications of pseudowords, suitability considerations relate to requirements which will, by their very nature, vary across tasks. We therefore propose that a set of pseudoword evaluation considerations, covering both legality and suitability, would be beneficial to the field.

TOWARDS SOME SOLUTIONS

The present study proposes a new approach to creating pseudowords that avoids the limitations of existing approaches outlined above, and explores ways of conducting post-creation evaluations on pseudowords. The study makes two contributions:

  1. An automated pseudoword generation technique that can be extended cross-linguistically.

  2. An introduction to novel formal pseudoword evaluation techniques, both in terms of their legal form and suitability for various lexical tasks.

We start by introducing our CGCA, which is a computationally simple approach to generating pseudowords. Next, we describe a set of evaluation criteria that we designed to evaluate the legal form of pseudowords, and some possible considerations for assessing their suitability in various lexical tasks. Finally, we demonstrate how the CGCA has been designed to work with any alphabet-based language, without requiring any knowledge of that language, and how it can be used to create domain or language-specific pseudowords, based solely on an input wordlist.1

CGCA: character-gram chaining algorithm

In 1990, Bell et al. discussed using statistical analysis of n-gram frequencies to model natural language; an early example of how this might be done can be found in Miller and Selfridge (1950). Bell et al. (1990: 80–81) suggested that “frequencies of n-grams can be used to construct order n−1 models”, in which the first n−1 characters of an n-gram are used to predict the nth character. The examples in Table 4 show the first 100 characters that Bell et al. (1990) generated using natural language modelling with different sized n-gram models. As the table shows, although the 12-gram model is not perfect, the resemblance to natural language improves noticeably each time the n-gram size increases. Bell et al. (1990) used n-gram models to reconstruct sections of text within a language, but their work led us to ask whether a similar technique could be used to construct individual pseudowords.

Table 4:

Natural language modelling with n-gram models (Bell et al. 1990)

    Order-0 text (single character):   fsn’iaad ir lntns hynci,..aais oayimh t n, at oeotc fheotyi t afrtgt oidtsO, wrr thraeoe rdaFr ce.g
    Order-5 text (6-gram model):       Number diness, and it also light of still try and amoung Presidential discussion is department-trans
    Order-11 text (12-gram model):     Papal pronounced to the appeal, said that he’d left the lighter fluid, ha, ha”? asked the same numbe
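The order n−1 idea behind Table 4 can be sketched as follows: count n-grams in a training text, then repeatedly sample the next character conditioned on the previous n−1 characters. The training string below is a toy stand-in for the corpora Bell et al. (1990) used.

```python
import random
from collections import Counter, defaultdict

def build_model(text, n):
    """Map each (n-1)-character context to a Counter of following characters."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        model[context][nxt] += 1
    return model

def generate(model, n, seed, length, rng):
    """Extend a seed by sampling each next character from its context's counts."""
    out = seed
    for _ in range(length):
        counts = model.get(out[-(n - 1):])
        if not counts:  # unseen context: stop generating
            break
        chars, weights = zip(*counts.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

text = "the theory then thoroughly tested the thesis "
model = build_model(text, 3)
sample = generate(model, 3, "th", 40, random.Random(1))
print(sample)
```

By construction, every 3-gram in the generated string is attested in the training text, which is exactly the property that makes higher-order models read more like the source language.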

Both the n-gram model of Bell et al. (1990) and the sub-syllabic pseudoword technique used by Keuleers and Brysbaert (2010) and Rastle et al. (2002) motivated us to create the CGCA for generating pseudowords. Figure 1 gives an overview of the algorithm in four modularized steps, while Online Supplementary Appendix A gives the full details.

Figure 1:

The modular steps involved in the character-gram chaining algorithm

The CGCA can accept either a wordlist or a corpus as input. First, it extracts all unique tokens from the input, creating the origin wordlist2 that is used to generate pseudowords. Next, it extracts all possible character-grams from the origin wordlist. Finally, it iteratively generates and validates each chain of character-grams, resulting in a list of pseudowords specific to the origin that was used to generate them. This allows researchers to generate either general-purpose or domain-specific pseudowords, depending on the input they pass to the algorithm.
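Under the assumptions that chaining joins character-grams overlapping by n−1 characters and that validation simply rejects real words, the steps above can be sketched as follows. This is only our minimal reading of the four modules; the full algorithm in Online Supplementary Appendix A may differ in its details.

```python
import random

def extract_grams(words, n):
    """Collect every character n-gram that occurs in the origin wordlist."""
    return {w[i:i + n] for w in words for i in range(len(w) - n + 1)}

def chain_pseudoword(words, n, max_len, rng):
    """Chain n-grams that overlap by n-1 characters into a candidate form,
    rejecting the candidate if it is a real word from the origin list."""
    grams = sorted(extract_grams(words, n))
    real = set(words)
    current = rng.choice(grams)
    while len(current) < max_len:
        tail = current[-(n - 1):]
        options = [g for g in grams if g.startswith(tail)]
        if not options:  # no gram continues this chain
            break
        current += rng.choice(options)[n - 1:]
    return None if current in real else current

words = ["pilot", "planet", "pocket", "ticket", "tablet"]
candidate = chain_pseudoword(words, 3, 7, random.Random(2))
print(candidate)
```

Every n-gram inside a generated candidate is, by construction, attested in the origin wordlist; whether the candidate as a whole is well-formed is precisely what the post-production legality criteria below are for.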

The CGCA can generate pseudowords from an input of as few as 100 unique words, up to large general-purpose corpora or wordlists such as the Range Programme lists (Nation et al. 2002) or Nation’s (2012) British National Corpus and Corpus of Contemporary American English (BNC/COCA) lists. However, the number of pseudowords that can be generated depends on the character-gram size used; for example, a list of 100 unique words can be used to generate 100 pseudowords when 2- or 3-grams are used, but only half as many using 4-grams.

We used the CGCA and the BNC/COCA lists (Nation 2012) to create pseudowords of various n-gram sizes. Table 5 gives the first 10 such pseudowords for 2-, 3-, 5-, and 8-grams. We also generated pseudowords using a combination of n-gram lengths (r-grams). The maximum size is 8-grams because, for sizes larger than 8, there would not be enough character-grams extracted to chain together to generate pseudowords.

Table 5:

A sample of 10 pseudowords generated by CGCA (prior to evaluation)

    2-Gram   3-Gram        5-Gram          8-Gram            r-Gram
    Scon     Punit         Untalentleman   Uncertification   Eightist
    Cens     Recollusted   Unlabelling     Representably     Braveller
    Nes      Cree          Registract      Unstructure       Unexception
    Vois     Dward         Injusting       Undifference      Disbehaviour
    Sunt     Witle         Orches          Intergovernment   Ninthood
    Zer      Captime       Heritancy       Unconsolidate     Apartmentalizing
    Stro     Hydraft       Easters         Uncirculates      Lettes
    Ghol     Natigating    Unsenting       Semanticise       Gotters
    Acive    Ouncing       Essentee        Undistinguish     Greeness
    Weat     Runnius       Impatibly       Reaffirmative     Whitecturalisation

Designing post-production evaluation criteria and requirements

We propose that once generated, pseudowords should undergo evaluation. We argue that the evaluation process needs to involve two separate types of criteria, namely legality and suitability. Determining whether a pseudoword conforms to the rules of a language, and determining whether that pseudoword is suitable for a particular type of lexical task, require two different types of evaluation criteria.

Legal elements are those that we can prove to be legal, in terms of the character combinations that exist in the language (in a wordlist, lexicon, or corpus), rather than all legal elements within the language. Elements that we measure as not legal are not necessarily illegal in the language. They simply cannot be proven to be legal in a sub-collection of the language, in an origin wordlist for example. Our approach to pseudoword generation makes use of the premise that chaining legal character-grams together will sometimes result in a legal pseudoword. Of course, there are other restrictions on the process of building a well-formed word, such as which sequences are being chained and their position in the word. These restrictions will thus lead to some ill-formed generated pseudowords. In particular, using bi-grams (character-grams of size 2) results in legal character sequences of size 2, but when chained together, these bi-grams can result in larger unseen or potentially illegal character combinations, which George Bernard Shaw’s well-known example GHOTI illustrates.3 We have designed a set of three criteria that should be used to evaluate the legal form of pseudowords for English, outlined in Table 6.

Table 6:

Criteria for assessing pseudoword legality (for English)

    C+ (one or more consecutive consonants): Extract sequences of consecutive consonants from a pseudoword and validate them only if they appear within a token in the origin wordlist. For example, for the pseudoword conferious, the consecutive consonants are: c, nf, r, and s.

    V+ (one or more consecutive vowels): Extract sequences of consecutive vowels from a pseudoword and validate them only if they appear within a token in the origin wordlist. For example, for the pseudoword conferious, the consecutive vowels are: o, e, and iou.

    CV+C (a consonant, followed by one or more consecutive vowels, followed by a consonant): Extract sequences of consecutive vowels including one leading and one trailing consonant from a pseudoword and validate them only if they appear within a token in the origin wordlist. For example, for the pseudoword conferious, the CV+C patterns are: con, fer, and rious.

If a sequence of characters appears in a pseudoword but does not appear in a real word (from the origin wordlist), then the pseudoword cannot be proven legal (it might still conform to the word formation rules of the target language, but given that no words in the original wordlist contain the sequence, we cannot be sure whether it does or not). Conversely, if all the character sequences that appear in a pseudoword also appear in real words in the origin wordlist, then the pseudoword can be said to conform to the rules of the target language. This is what we hope for when assessing the legality of a given pseudoword.
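The three criteria in Table 6 can be implemented with simple regular expressions. The sketch below treats a, e, i, o, u as the vowel set (treating y as a consonant is a simplifying assumption) and checks each extracted sequence against a small illustrative origin wordlist:

```python
import re

VOWELS = "aeiou"  # simplifying assumption: 'y' is treated as a consonant
C, V = f"[^{VOWELS}]", f"[{VOWELS}]"

def legality_sequences(word):
    """Extract the C+, V+, and (overlap-aware) CV+C sequences of Table 6."""
    return (re.findall(f"{C}+", word)                                 # C+
            + re.findall(f"{V}+", word)                               # V+
            + [m.group(1)                                             # CV+C
               for m in re.finditer(f"(?=({C}{V}+{C}))", word)])

def provably_legal(pseudoword, origin_words):
    """A pseudoword passes only if every extracted sequence occurs
    inside some token of the origin wordlist."""
    return all(any(seq in w for w in origin_words)
               for seq in legality_sequences(pseudoword))

origin = ["confer", "curious", "consonant"]
print(legality_sequences("conferious"))
# ['c', 'nf', 'r', 's', 'o', 'e', 'iou', 'con', 'fer', 'rious']
print(provably_legal("conferious", origin))  # True
print(provably_legal("ghoti", origin))       # False
```

Note that the CV+C extraction uses a lookahead so that overlapping patterns such as fer and rious (which share the r) are both captured, matching the examples in Table 6. Shaw’s GHOTI fails the check against this toy origin list because, among other sequences, gh never occurs in any origin token.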

Our version of legality is not too strict in that it is flexible enough to allow words which may break phonotactic constraints of a given language, for example, un + table would be classified as legal using the above criteria, even though un would never attach to a noun in English. However, as one anonymous reviewer points out, the legality criteria could be tightened much further to adhere to phonotactic constraints, but this comes with disadvantages in that there will be fewer pseudowords accepted and more language-specific knowledge required to evaluate legality of this type.

Measuring the suitability of a given pseudoword is not a matter of whether or not the form could be a legitimate word in a given language, but rather of whether it is fit for a particular use. In lexical decision tasks, for example, the more dissimilar a pseudoword is to a word, the faster the reaction time (Keuleers and Brysbaert 2010). Incidental lexical learning experiments require shorter words with at most one productive affix, which appear similar to existing forms (Webb 2007). Vocabulary tests (Meara 2010) and identification-as-retrieval tasks (Rueckl and Olds 1993) require pseudowords that do not have their own existing inferred meaning. Decoding tasks use pseudowords of varying difficulty (Proença et al. 2017). The primary question for suitability appears to be whether the pseudowords being generated are too similar to existing words, not similar enough, or within the range of what a given lexical task requires. We propose four considerations for evaluating the suitability of pseudowords (English-specific), as given in Table 7.

Table 7:

Criteria for assessing pseudoword suitability (for English)

Compound: A pseudoword that is made up of two or more real words within the language. For example, captime (cap-time).
Polymorphic: A pseudoword that consists of a real root plus one or more affixes. For example, indetermines (in-determine-s). Note that a compound can also be polymorphic. For example, captimed (cap-time-ed).
Near polymorphic: A pseudoword whose root does not exist in the language but which includes one or more affixes. For example, alphise (alph-ise). Note that a pseudoword can be either polymorphic or near polymorphic, but not both: it either has a real root or it does not.
One-character dissimilarity: A pseudoword that is easily identifiable as one character away from a real word within the language. For example, overses (overseas).

Using these considerations, each pseudoword could be given a binary score of either 0 or 1 for each of the above criteria. These scores could then be used to determine whether a particular pseudoword is (or is not) suited towards a particular lexical task. For example, for lexical decision tasks, a pseudoword that is only one character away from a real word, such as ‘dauntings’ (a rushed or slightly absent-minded participant may not even hear or see the plural ‘s’ and assume the form is a real word, ‘daunting’), may produce different reaction times than one that is not; pseudowords that score 1 in the polymorphic category (such as ‘unbehave’) may be more suited to incidental lexical learning experiments; and pseudowords that score 0 in the polymorphic category (such as ‘hydraft’) may be more suited to vocabulary tests and identification-as-retrieval tasks as they do not include a real root and affix that meaning can be inferred from.
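As a minimal illustration of such binary scoring and task-based selection (the scores below are hypothetical codings for the example forms, and the filter is just one possible selection rule, not part of our evaluation):

```python
# Hypothetical binary scores following the four criteria in Table 7
# (compound, polymorphic, near polymorphic, one-character dissimilarity).
scores = {
    "dauntings": {"compound": 0, "polymorphic": 1, "near_polymorphic": 0, "one_char": 1},
    "unbehave":  {"compound": 0, "polymorphic": 1, "near_polymorphic": 0, "one_char": 0},
    "hydraft":   {"compound": 0, "polymorphic": 0, "near_polymorphic": 0, "one_char": 0},
}

def suited_for_vocab_test(s):
    """Vocabulary tests favour pseudowords with no real root + affix to
    infer meaning from, and not one character away from a real word."""
    return s["polymorphic"] == 0 and s["one_char"] == 0

candidates = [w for w, s in scores.items() if suited_for_vocab_test(s)]
print(candidates)  # ['hydraft']
```

Different tasks would swap in different filters, e.g. requiring polymorphic == 1 for incidental learning experiments.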

While the CGCA method is cross-linguistically applicable, evaluating the generated pseudowords needs to be done according to language-specific means. That is, applying the rules of each target language separately and constructing suitability criteria separately for each language, as we have done above for English.

We applied each of the criteria outlined above to the first 100 pseudowords generated using each n-gram size (from 2 to 8). Table 8 shows the results of the individual legal evaluation, which was carried out automatically using a Python script. When the CGCA pseudowords were evaluated against their origin wordlist (Nation’s BNC/COCA lists), 12 pseudowords could not be proven legal: 10 (out of 100) created using 2-g, 1 (out of 100) created using 3-g, and 1 (out of 100) created using r-grams. Table 9 shows all of the CGCA pseudowords that contained a character sequence that did not exist in the origin wordlist. Four out of the five C+ errors contained a y, which should be considered a vowel in most cases but was not treated as one for this evaluation; had y been treated as a vowel, these four pseudowords might not have incurred errors.

Table 8:

Results for the individual legal evaluation (per 100 pseudowords)

Pseudowords   C+ errors   V+ errors   CV+C errors   Non-legal words
2-g           4           0           6             10
3-g           0           0           1             1
4-g           0           0           0             0
5-g           0           0           0             0
6-g           0           0           0             0
7-g           0           0           0             0
8-g           0           0           0             0
r-g           1           0           0             1
Table 9:

Error examples for the individual legal evaluation

Category   Pseudoword   C+ errors   V+ errors   CV+C errors
2-g        Yies         0           0           1
2-g        Yied         0           0           1
2-g        Vois         0           0           1
2-g        Gymma        1           0           0
2-g        Tbscrap      1           0           0
2-g        Faugh        0           0           1
2-g        Eiguit       0           0           1
2-g        Jous         0           0           1
2-g        Dyntin       1           0           0
2-g        Gympart      1           0           0
3-g        Reuniour     0           0           1
r-g        Rhydrate     1           0           0

The individual suitability evaluation was conducted manually by two of the researchers independently, and the results were then compared. The results are shown in Table 10. For consistency, we agreed on using the affixes reported by Bauer and Nation (1993) to be either inflectional suffixes, among the most frequently occurring and regular derivational affixes (pp. 258–259), or frequent and orthographically regular affixes (pp. 259–260); see Table 11.

Table 10:

Results from the suitability evaluation (per 100 pseudowords)

Category   Compound   Polymorphic   Near polymorphic   Char dissimilarity
2-g        3          2             22                 43
3-g        5          10            40                 39
4-g        10         22            40                 20
5-g        5          44            40                 18
6-g        1          71            18                 16
7-g        1          80            16                 21
8-g        4          85            9                  20
r-g        5          47            33                 14
Table 11:

Chosen coded affixes, derived from levels 2, 3, and 4 of Bauer and Nation (1993)

Inflectional suffixes: -s, -ies, -es, -ed/-d/-t, -en, -ing, -er, -es
Frequent and regular derivational affixes: -able, -er, -ish, -less, -ly, -ness, -th, -y, non-, un-
Frequent orthographically regular affixes: -al, -ation, -ess, -ful, -ism, -ist, -ity, -ize/-ise, -ment, -ous, in-/im-

The manual coding was completed by two researchers separately. For the first three criteria (compound, polymorphic, and near polymorphic), any discrepancies were discussed and resolved. For the fourth criterion (one-character dissimilarity), only pseudowords marked positive by both researchers were counted as positive in the final results. In coding one-character dissimilarity, we did not conduct exhaustive dictionary searches; rather, we used our knowledge of English to see whether any real word would immediately come to mind, without thinking too long (we return to this point later).
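Our coding of this criterion was thus intuition-based rather than automated, but a stricter, exhaustive alternative would be to check whether a pseudoword is within one edit (substitution, insertion, or deletion) of any word in a reference list. The sketch below is illustrative only (the function names are ours); note that such an automated check would also flag forms that our intuition-based coding did not.

```python
def within_one_edit(a, b):
    """True if a and b differ by exactly one substitution, insertion, or deletion."""
    if a == b:
        return False
    la, lb = len(a), len(b)
    if abs(la - lb) > 1:
        return False
    if la == lb:  # same length: exactly one substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    # lengths differ by one: try deleting each character of the longer string
    longer, shorter = (a, b) if la > lb else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def one_char_dissimilar(pseudoword, wordlist):
    """True if the pseudoword is one character away from any real word."""
    return any(within_one_edit(pseudoword, w) for w in wordlist)

print(one_char_dissimilar("weat", ["wheat", "table"]))  # True (wheat minus 'h')
print(one_char_dissimilar("eightist", ["eight"]))       # False (three characters away)
```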

DISCUSSION

We now turn to a discussion of the pseudowords generated using CGCA and their evaluation, provide a comparison of the same evaluation criteria applied to existing pseudoword generation systems currently in use, and discuss how these systems’ pseudowords compare with those generated by CGCA. Finally, we discuss the cross-linguistic applicability of the CGCA system and conclude the section with some limitations and scope for future work.

Evaluation of the pseudowords generated by the CGCA algorithm

As regards the legality of the pseudowords, we found that bi-grams performed the worst in our dataset, with 10 per cent of the 2-g data being non-legal pseudowords. This suggests caution in using systems that rely exclusively on bi-grams as a means of generating pseudowords (e.g., WordGen). The next worst n-gram size was the 3-g group, but this was comparatively better, with only 1 per cent of the forms being non-legal. The system performed equally well for all larger n-gram sizes (the few non-legal elements in the r-gram set are likely to be bi-grams).

Turning to issues of suitability, we suggest that forms which differ from existing words by only a single character can be problematic as pseudowords because participants can misread them as real words, particularly under time pressure. This means that participants might end up being unfairly penalized in a learner vocabulary test. This problem may seriously affect systems that generate pseudowords solely by changing a single character (e.g., ELP). In coding the one-character factor, however, we found that pseudowords which differ from existing words by a single letter are not all equally problematic: some are recognized much faster than others (recall that we coded this factor quickly, and without reference to a dictionary). For example, overses can easily be linked to the existing form overseas, whereas orand may not immediately be associated with grand. It is not straightforward to glean why some forms are immediately recognized while others are not. Words ending in ‘-in’ and missing a ‘-g’ might be recognized quickly due to the productivity of the progressive inflectional affix in English, but the picture is still incomplete. More work is needed to test these forms on a larger population in order to understand the mechanism at work here.

Although compounding is a highly productive strategy for forming new words in English, we found the CGCA pseudowords generated using the BNC/COCA lists included strikingly few pseudo-compound forms, with a peak at 4-g (10 per cent). This suggests that 4-g are the optimal character-gram size for generating compound words. It might be worthwhile to investigate experimentally whether there is a difference in how different pseudowords are viewed, based on structural differences (compound-like forms, versus polymorphic forms, versus morphologically simple forms).

In evaluating suitability, we found interesting correlations between n-gram size and the various factors. For one-character dissimilarity, there is a downward slope across n-grams, dropping from 2-g (43 per cent) to 4-g (20 per cent) before evening out. This trend suggests that the smaller the n-gram size, the more likely the pseudowords are to be identifiable as one character away from existing words. Conversely, the occurrence of polymorphic pseudowords increases steadily from 2-g (2 per cent) to 8-g (85 per cent), suggesting that the larger the n-gram size, the more word-like (real root plus affix) the pseudowords are. Near-polymorphic pseudoword counts rise then fall, climbing from 2-g (22 per cent) up to 3-g, 4-g, and 5-g (40 per cent each) and back down to 8-g (9 per cent). This final drop appears to be due to the corresponding climb in polymorphic occurrences. The combined counts of polymorphic and near-polymorphic pseudowords rise steadily until they make up almost 100 per cent (96 per cent for 7-g, 94 per cent for 8-g). These correlations may be due to how CGCA chains character-grams together to create pseudowords. The smaller the character-gram size, the fewer characters are compared, and therefore the less likely it is that affixes will be generated; likewise, the larger the character-gram size, the more characters are compared, and the more likely it is that affixes will be generated. Furthermore, pseudowords that include affixes may be less likely to be one character away from real words; they may instead be n characters away, where n is the length of the added affix. For example, eightist (polymorphic) is three characters (ist) away from the word eight, whereas weat (not polymorphic) is one character away from wheat.

Comparing CGCA pseudowords with pseudowords from existing systems

We first used the main existing systems to generate 100 pseudowords from each (ELP, the ARC Nonword Database, WordGen, Wuggy), and selected 100 pseudowords from Meara’s EFL tests (20 from each level). The first 10 words from each are given for illustrative purposes in Table 12. We then conducted a comparative legal evaluation, where we used the legal evaluation criteria to evaluate both our pseudowords and each of the 100 pseudowords from the other systems. We decided to create a non-biased wordlist for this evaluation, rather than using our origin list or CELEX, drawing on four corpora: the Corpus of Historical American English (COHA) (Davies 2002), the Wikipedia Corpus (Davies 2015), News on the Web (NOW) (Davies 2013b), and Global Web-Based English (GloWbE) (Davies 2013a). All unique tokens were extracted from the corpora and only words that were validated as real words by Wiktionary were kept. The resulting wordlist contained 72,783 tokens.

Table 12:

A sample of 10 pseudowords generated using ARC, ELP, WordGen, Wuggy, and Meara

ARC        ELP         WordGen   Wuggy     Meara
Grev       Drimaced    Daney     Dre       Berrow
Bloap      Nightkine   Biled     Woubt     Whaley
Shrusks    Sonehead    Ragio     Istye     Contrivial
Zoc        Creemason   Applk     Hu        Detailoring
Spails     Selectove   Hoory     Roud      Eldred
Gir        Nonclude    Loeer     Pliedes   Gumm
Thwiped    Gastrami    Adoke     Onsce     Pocock
Grear      Asjoins     Cheed     Buit      Pernicate
Prirr      Guinbess    Flort     Fims      Eluctant
Crenched   Egocative   Fraze     Sussest   Limidate

Each set of 100 pseudowords was compared against the COHA-Wikipedia-NOW-GloWbE wordlist, and any character combinations (C+, V+, CV+C) that appeared in a pseudoword but not in the wordlist were noted (Table 13). When comparing the CGCA pseudoword errors derived from the origin wordlist with those derived from the COHA-Wikipedia-NOW-GloWbE wordlist, the error counts for the CGCA pseudowords decreased from 10 to 9 for 2-g, increased from 1 to 2 for 3-g, increased from 0 to 1 for 4-g, and stayed the same for all others (Table 13, first nine rows). Comparatively, for the externally generated pseudowords: WordGen had the highest error count (26 out of 100 pseudowords contained errors), Wuggy had the second highest (21 out of 100), ARC and ELP had 16 and 10, respectively, and Meara’s EFL pseudowords had the lowest error count, with only 6 out of 100 pseudowords containing errors. Surprisingly, ARC, WordGen, and Wuggy all included pseudowords that did not contain a single vowel (sprymphs, brft, grrpe, ymn), although in the case of sprymphs (ARC), the y could be considered to be acting as a vowel.

Table 13:

Results from the comparison legal evaluation (per 100 pseudowords)

Pseudowords   C+ errors   V+ errors   CV+C errors   Non-legal words
2-g           4           0           5             9
3-g           0           0           2             2
4-g           0           0           1             1
5-g           0           0           0             0
5-g           0           0           0             0
6-g           0           0           0             0
7-g           0           0           0             0
8-g           0           0           0             0
r-g           1           0           0             1
ARC           3           1           14            16
ELP           4           2           6             10
WordGen       8           5           18            26
Wuggy         3           1           19            21
Meara         0           0           6             6

The next step was to conduct a comparison suitability evaluation with the same sets of pseudowords. Each pseudoword was manually coded following the same procedure as in the section titled ‘Designing post-production evaluation criteria and requirements’ (Table 14). All sets have a relatively low count of compound pseudowords; however, our 4-g have the highest (10 per cent), followed closely by Meara (9 per cent). This suggests that the CGCA algorithm with 4-g should be preferred over other methods when aiming to generate compound pseudowords, though we stress that suitability is highly dependent on the task. All of our polymorphic pseudoword counts, except for those generated using 2-g (2 per cent), are higher than any of the externally generated pseudowords. Our highest come from 7-g (80 per cent) and 8-g (85 per cent), while the highest among the externally generated pseudowords come from Meara (10 per cent) and WordGen (8 per cent), suggesting that the CGCA pseudowords are more word-like in terms of a real stem plus affix. The near-polymorphic pseudoword counts are somewhat better balanced than the polymorphic ones. ARC and ELP have the highest counts (58 per cent each), followed by our 3-g, 4-g, and 5-g (40 per cent each). The lowest counts come from our 8-g (9 per cent), 7-g (16 per cent), and 6-g (18 per cent). Finally, for pseudowords that are easily identifiable as one character away from a real word (character dissimilarity), the highest counts come from WordGen (48 per cent) and Wuggy (47 per cent), followed closely by ELP (43 per cent), 2-g (43 per cent), and 3-g (39 per cent). The lowest counts come from Meara (14 per cent) and our r-grams (14 per cent). Each of these statistics may be seen as either an advantage or a disadvantage, depending on the intended use of the pseudowords and the form or structure required.

Table 14:

Results from the comparison suitability evaluation (per 100 pseudowords)

Category   Compound   Polymorphic   Near polymorphic   Char dissimilarity
2-g        3          2             22                 43
3-g        5          10            40                 39
4-g        10         22            40                 20
5-g        5          44            40                 18
6-g        1          71            18                 16
7-g        1          80            16                 21
8-g        4          85            9                  20
r-g        5          47            33                 14
ARC        1          1             58                 34
ELP        6          3             58                 43
WordGen    1          8             36                 48
Wuggy      3          7             37                 47
Meara      9          10            26                 14

The suitability considerations can be used to compare pseudowords created using different origins. For instance, if researchers wanted to create pseudowords using the CGCA that had the same structure as Meara’s pseudowords, they could select those with compound, polymorphic, near-polymorphic, and character dissimilarity counts similar to his. Furthermore, as argued in the section ‘Designing post-production evaluation criteria and requirements’, the suitability criteria can be used to select more or less word-like pseudowords, for example, polymorphic or near-polymorphic pseudowords for morphology experiments, and non-polymorphic pseudowords for vocabulary testing. The criteria can also be used to draw comparisons between sets of pseudowords, for example, if we wanted to create pseudowords that reflect the same form as some existing set (ELP, ARC, etc.).

Cross-linguistic application of the CGCA algorithm

The final set of observations concerns the linguistic domain of pseudowords. Here, we ask whether pseudowords can be generated to reflect a particular language. Our solution was to develop the CGCA to work with any alphabet-based language, without requiring any knowledge of that language. The CGCA can be used to create pseudowords in any such language because it requires only an origin wordlist. As an example, we have generated a sample of 10 pseudowords each for German, Spanish, and Italian using CGCA (with 4-g), as shown in Table 15. The table also contains a sample of 10 English pseudowords. By specifying the desired language, we were able to use Wiktionary to validate each set of pseudowords. The three foreign-language wordlists were derived from movie and television series subtitles (Buchmeier 2008a, 2008b, 2009). The German origin list was derived from the first 1,000 words in a frequency list of 25 million words (Buchmeier 2009); the Spanish origin list from the first 1,000 words in a frequency list of 27 million words (Buchmeier 2008a); and the Italian origin list from the first 1,000 words in a frequency list of 5.6 million words (Buchmeier 2008b).

Table 15:

Language-specific pseudowords generated using the CGCA

German       Spanish      Italian      English
Bisscheint   Mirande      Abbastardo   Acknowier
Kinden       Puestra      Dicevuto     Reorganic
Viellen      Suficio      Dentre       Sweaten
Wassen       Oporta       Momente      Clinist
Alleich      Histos       Dimente      Inflatting
Scheinlich   Tambiar      Ufficile     Puddiness
Entschuld    Suerto       Pagari       Tonnect
Viellein     Grando       Ottimana     Incling
Entschule    Accidentro   Finalmeno    Epidest
Bisscheiße   Dentra       Lavore       Prograph

Moreover, given that all it requires is a wordlist or corpus, the legal evaluation criteria can be applied to pseudowords from any language, regardless of how they were created. The criteria can measure how well pseudowords fit within the legal orthographic form of any language, and are only limited by the size of the wordlist or corpus that is used. In conducting the legal evaluation on each of the language-specific pseudowords, using their relative origin wordlists as the lexicon, we found that none of them violated any of the legal evaluation criteria.

CGCA can also be used to create domain-specific or frequency-specific pseudowords; it is limited only by the text used to build the origin wordlist. As an example, we have used CGCA (with 4-g) to generate a sample of 10 pseudowords each from two different domains: Academic, derived from the Academic Word List (Coxhead 2000), and Grade School, derived from the Basic Vocabulary Spelling List (Graham et al. 1993) (Table 16).

Table 16:

Domain-specific pseudowords generated using the CGCA

Academic       Grade school
Unconverse     Brough
Enormat        Brothes
Corresponse    Withough
Illustract     Mountries
Emergins       Grandmothes
Primarise      Countain
Majoritise     Cottom
Phasize        Mountry
Undiminution   Clother
Preliminish    Botton

Limitations and future work

CGCA is only as good as its origin wordlist. If, for example, a general-purpose wordlist was used to generate pseudowords for a domain-specific vocabulary test, the resulting pseudowords would reflect the general-purpose language of the wordlist rather than the domain-specific language of the test. Similarly, the legal evaluation is only as good as the wordlist or origin that pseudowords are compared against. The results of the evaluation criteria would be affected by a wordlist that included misspelled words or partial words, for instance. We look forward to applications of CGCA in vocabulary and testing research, for example, using specialized wordlists such as Coxhead and Demecheleer (2018) in English for Specific Purposes, as well as wordlists in languages other than English, such as Jakobsen et al. (2018) in Danish.

Although CGCA does not require a large lexicon, the number of pseudowords that can be generated is proportionally related to the number of unique tokens in the origin wordlist. For example, when using an origin of 100 words, we were able to generate 100 pseudowords (using 3-g), but with an origin of 1,000 words we were able to generate 1,000 pseudowords (using 3-g). However, the larger the character-gram size, the fewer character combinations there are, and therefore, fewer valid pseudowords can be generated. For example, when using an origin of 100 words, we were able to generate 100 pseudowords each using 2-g, 3-g, and r-g, but only half as many using 4-g.

CGCA uses Wiktionary to validate whether a potential pseudoword does or does not exist within the language. Although this is advantageous as it supports 8,000 languages, there are of course languages that it does not support fully (Te Reo Māori, for example). To address this problem, we intend to implement the algorithm as a web-based solution, meaning researchers would be able to specify whether they wish to validate using Wiktionary, or use their own specially supplied wordlist for validation.

CGCA was implemented using the Python programming language and can be downloaded from GitHub.4 The bulk of the planned future work involves porting the Python code for CGCA over into an online web-based solution that can be made publicly available. The online version would allow researchers to upload an input corpus or wordlist, specify whether they want the words in their corpus to be cleaned and validated, specify the size of the character-grams that they wish to use to create their pseudowords, specify the number of pseudowords they wish to generate, and specify whether they want to validate using Wiktionary or an uploaded wordlist. The system would then return the desired pseudowords for researchers to use as they wish.

Finally, the majority of the pseudoword evaluations performed so far have focused on the English language. However, we are very interested in conducting more in-depth evaluations of CGCA pseudowords for other languages as well. Another aspect left for future research is comparing how participants in various tasks perform in relation to pseudowords obtained in different ways, and how their reaction times might vary.

CONCLUSION

This paper has introduced a new way of generating pseudowords that does not require any knowledge of the language and does not rely on a large lexicon. It uses a character-gram chaining approach to create pseudowords that reflect their origin, allowing us to create language or domain-specific pseudowords with varying word-likeness.

We also argue that pseudowords need to be evaluated and propose two sets of criteria to this end: a legal evaluation and a suitability evaluation. The former evaluates character patterns against an origin to determine whether the pseudowords are legal within the language, while the latter allows researchers to evaluate and compare the structure of pseudowords to determine their suitability.

Footnotes

1 One anonymous referee points out that pseudowords generated by manipulating characters within words can be problematic for some languages, particularly those with more rigid (and easily recognizable) syllable structure, such as Italian, Spanish, and Arabic. While the algorithm we introduce here does not specifically take syllabic structure into consideration, the fact that it pays close attention to letter combinations is likely to lead to a close resemblance between the syllabic structure of real words in the language and that of the pseudowords generated. Ultimately, we feel that this further highlights the importance of evaluating any pseudowords generated, regardless of the method used to do so.

2 We use the term origin to refer to the list of unique words that are extracted from the input text.

3 We thank one of the anonymous reviewers for reminding us of this fitting example.

SUPPLEMENTARY DATA

Supplementary material is available at Applied Linguistics online.5

Acknowledgements

All authors thank Emeritus Professor Ian Witten from the University of Waikato for forging collaborative networks between researchers at the University of Waikato and the Victoria University of Wellington. Jemma König thanks the University of Waikato Doctoral Scholarship for their financial support. Andreea S. Calude thanks the Royal Society Marsden Fast Grant for their financial support.

AUTHOR CONTRIBUTIONS

All authors contributed a third of the work each, across all stages of the project.

ETHICAL STATEMENT

No data were collected from participants for this study, so no ethics approval was required. This project did not enlist the help of any participants, the authors have no conflict of interest, and the experimental design was neither shared with anyone else nor subject to external validation.

Conflict of interest statement. None declared.
