How to transcribe a conlang

Last update: 7 November 2007


Introduction

This page provides a series of guidelines for conlangers who are having difficulty working out how to represent their conlangs in the Roman alphabet. Much valuable input, for which I am grateful, has come from various members of the Zompist Bulletin Board.

What romanisation is and is not

The point of romanisation is to provide your conlang with a way of being written in the Roman alphabet, or in more formal terms, to choose an orthographical representation of every phoneme in the language. Obviously, to start with, you will need to know which phonemes your conlang has; this in turn will require a basic knowledge of phonetic terminology (if you know what a "voiceless palato-alveolar affricate" is, you're probably OK) and the concept of the "phoneme". A Unicode font with pages 0 and 1 is also necessary for seeing many of the accented characters.
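If you keep your phoneme inventory and your chosen spellings in machine-readable form, the requirement that every phoneme gets a representation can be checked mechanically. What follows is only a minimal Python sketch under my own assumptions: the function name `check_romanisation` and the toy inventory are hypothetical, and phonemes are written in X-SAMPA.

```python
def check_romanisation(phonemes, scheme):
    """Return (missing, clashes) for a phoneme -> grapheme mapping.

    missing: phonemes with no spelling at all.
    clashes: graphemes that are shared by more than one phoneme.
    """
    missing = [p for p in phonemes if p not in scheme]
    by_grapheme = {}
    for p in phonemes:
        if p in scheme:
            by_grapheme.setdefault(scheme[p], []).append(p)
    clashes = {g: ps for g, ps in by_grapheme.items() if len(ps) > 1}
    return missing, clashes


# Hypothetical inventory and a deliberately flawed scheme.
phonemes = ["p", "t", "k", "s", "S", "tS", "m", "n", "a", "i", "u"]
scheme = {
    "p": "p", "t": "t", "k": "k", "s": "s",
    "S": "c", "tS": "c",          # two phonemes share <c>
    "m": "m", "n": "n", "a": "a", "i": "i",
}                                 # /u/ has been forgotten

missing, clashes = check_romanisation(phonemes, scheme)
print(missing)   # ['u']
print(clashes)   # {'c': ['S', 'tS']}
```

A clash is not necessarily a mistake (natural orthographies are full of them), but it is worth knowing about before you commit to a scheme.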

Languages are ultimately based on sounds, not letters; romanisation is thus not the art of designing your phonology by choosing an interesting use for every single letter and several combinations of letters. We've all been guilty of this; my first conlang used both <s z> and <t d> for /s z/, while /t d/ had to be represented by <pt bd>!

The principle of least surprise

Much of the difficulty in romanising a phonology is in choosing decent representations of tricky phonemes. Unless there's very good reason otherwise, your transcription scheme should contain as few surprises or difficulties as possible for anyone who might read it. Most of the letters of the Roman alphabet have well-established meanings, and it's best to stick to them; if, for example, you want to use <l m n> for /p t k/, go right ahead, but don't be surprised if nobody else can pronounce your words properly :-) Even the great JRRT was caught out; Ralph Bakshi's animated film of The Lord of the Rings was made in ignorance of Appendix E, and the name Celeborn was pronounced with initial /s/, not /k/ as Tolkien intended.

Digraphs or diacritics?

If you're lucky, you'll be able to represent each phoneme in your conlang with a different unadorned lowercase Roman letter. More probably, though, you'll need to resort to digraphs (sequences of two letters, such as <ch>), diacritics (modifying marks on letters, such as <žá>), or extra letters (<þç>). Each of these has its own advantages and disadvantages, not to mention supporters and detractors.

Diacritics should generally be used consistently and sparingly. They are useful if, for example, you have two parallel series of phonemes for which similar transcriptions are clearly desirable - for example long and short vowels, or plain and palatalised consonants. On the other hand, more or less as Katherine "Deverry" Kerr put it, your readers don't necessarily want to be confronted with words which bristle like porcupines; if you decide to use diacritics, it is better to save them for less common phonemes.
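One rough way to judge whether a scheme "bristles" is to measure what fraction of the letters in a sample of vocabulary carry a diacritic. The Python sketch below is an illustration only; the function names and the sample words are hypothetical. It relies on Unicode decomposition, so letters such as <đ> or <ø>, whose stroke is not a separate combining mark, are not counted as accented.

```python
import unicodedata


def has_diacritic(letter):
    """True if the letter decomposes into a base plus combining marks."""
    decomposed = unicodedata.normalize("NFD", letter)
    return any(unicodedata.combining(c) for c in decomposed)


def diacritic_density(words):
    """Fraction of letters in the word list that carry a diacritic."""
    # Compose first so that each accented letter is a single character.
    letters = [
        ch
        for word in words
        for ch in unicodedata.normalize("NFC", word)
        if ch.isalpha()
    ]
    if not letters:
        return 0.0
    return sum(has_diacritic(ch) for ch in letters) / len(letters)


sample = ["žáš", "tana", "kiru"]   # hypothetical vocabulary
print(f"{diacritic_density(sample):.2f}")   # 0.27
```

Run over a realistic word list, a figure like this gives you something concrete to compare between two candidate schemes; what counts as "too many" remains a matter of taste.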

Digraphs, on the other hand, are potentially ambiguous. For example, if you use <s> and <h> for /s/ and /h/ respectively, and decide to use <sh> for /S/, you have to think about how to represent the sequence /sh/. If it doesn't occur at all in your conlang, or the ambiguity doesn't trouble you, there's obviously no problem; otherwise you could separate the letters with an apostrophe, viz. <s'h>.
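The apostrophe rule is easy to apply automatically. The following Python sketch assumes a toy scheme in which every multi-letter grapheme is exactly two letters long; the names `SCHEME` and `romanise` are hypothetical. Whenever the last letter of one grapheme and the first letter of the next would together spell a digraph, it inserts the separator.

```python
# Hypothetical scheme: X-SAMPA phoneme -> grapheme.
SCHEME = {"s": "s", "h": "h", "S": "sh", "a": "a", "i": "i", "m": "m"}


def romanise(phonemes, scheme=SCHEME, separator="'"):
    """Spell a list of phonemes, separating accidental digraphs."""
    digraphs = {g for g in scheme.values() if len(g) == 2}
    pieces = []
    for p in phonemes:
        g = scheme[p]
        # Would the join of the previous grapheme and this one
        # be misread as a digraph?
        if pieces and pieces[-1][-1] + g[0] in digraphs:
            pieces.append(separator)
        pieces.append(g)
    return "".join(pieces)


print(romanise(["a", "S", "a"]))        # asha
print(romanise(["a", "s", "h", "a"]))   # as'ha
```

If your phonotactics never allow /sh/, the check simply never fires, which is a quick way of confirming that the ambiguity is harmless.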

Finally, extra letters are useful for one-off transcriptions of otherwise difficult phonemes, but there aren't very many of them, and they tend to have well-defined uses which may not fit your conlang. They also tend to stand out somewhat as unusual, and they're not always easy to type.

Ultimately it's down to your personal preference which you use; the discussion below will present examples of all three.

Conventions

The descriptions below provide what I consider to be the most obvious, or least surprising, values of the letters of the Roman alphabet, supplemented by the most useful digraphs, diacritics, and a few other letters. Phonemes are given between /slashes like this/, using the X-SAMPA representation of the IPA. Graphemes, or letters, are given in <angle brackets>; <t> thus refers to "the letter t", while /t/ is "the sound t".
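If you're more used to IPA symbols than to X-SAMPA, a small lookup table is enough to convert the notation used on this page. The Python sketch below is an illustration only: the names `XSAMPA` and `to_ipa` are hypothetical, and the table covers just the symbols that appear on this page, not the whole of X-SAMPA. Anything not in the table (such as /p t k a i/) is the same in both notations and is passed through unchanged.

```python
# X-SAMPA symbol -> IPA symbol, for the symbols used on this page only.
XSAMPA = {
    "S": "ʃ", "Z": "ʒ", "T": "θ", "D": "ð", "N": "ŋ", "J": "ɲ",
    "C": "ç", "G": "ɣ", "X": "χ", "R": "ʁ", "K": "ɬ", "W": "ʍ",
    "E": "ɛ", "O": "ɔ", "I": "ɪ", "U": "ʊ", "{": "æ", "2": "ø",
    "9": "œ", "1": "ɨ", "7": "ɤ", "M": "ɯ", "@": "ə", ":": "ː",
    # Two-character symbols must be tried before single characters.
    "J\\": "ɟ", "j\\": "ʝ", "G\\": "ɢ", "N\\": "ɴ",
    "_h": "ʰ", "_j": "ʲ",
}


def to_ipa(text):
    """Convert a string of X-SAMPA symbols to IPA, longest match first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in XSAMPA:
            out.append(XSAMPA[pair])
            i += 2
        else:
            out.append(XSAMPA.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_ipa("tS"))    # tʃ
print(to_ipa("t_h"))   # tʰ
print(to_ipa("2:"))    # øː
```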

Consonants

Single letters

In general, <b d f k m n p s t v z> naturally suggest /b d f k m n p s t v z/, while <l> suggests a lateral and <r> a rhotic.

<c> is probably the most troublesome single consonant letter; it is natural for /ts/ and /k/ in Slavic- and Celtic-flavoured romanisations respectively, and represents /tS/ in Indonesian, Malay, and Sanskrit. Less justifiably, it could be used for /S/, as I did in older transcriptions of some of my other conlangs; it might also be a possibility for /c/. Generally speaking, though, it should be treated with caution.

<g> is best used for plain /g/, unless you really have a rule like in English where /g/ often becomes /dZ/ before front vowels. It could also be used for the velar nasal /N/, or for /dZ/ or /Z/, if your conlang has no /g/.

<h> on its own is useful for /h/, or for /x/ if there is no /h/.

<j> most commonly suggests either /j/ or /dZ/; the first is typically Slavic or Germanic, the second English. It could also be used for /Z/, as in French.

<q> could be used for /kw/, or the uvular stop /q/ if you have one. Generally speaking, <c k q> all suggest back voiceless stops in increasing order of backness.

<w> is reasonable for /w/; /v/, to give a German or Polish flavour, is possible at a pinch.

<x> is awkward, and as with <c> should be treated with caution. Normally it suggests /ks/, for which <ks> is however preferable; <x> is nevertheless useful if you have various clusters of stop + /s/ and want to represent them by single letters where possible. Among the other possibilities are /S/ (Portuguese and Old Spanish) and /x/ (from IPA via Cyrillic).

<y> as a consonant is best used for /j/ only.

Extra letters

Other single letters which may be useful are <þ> ("thorn") and <ð> ("eth"), whose usual values are /T/ and /D/ and which should only be used if you'd rather not use <th> and <dh>; <ŋ> ("eng") for /N/; and <Ʒ> ("ezh") for /Z/.

It is no doubt possible to contrive a use for <ß>, the German "Eszett", but it has no uppercase form, and there's probably a better transcription anyway.

Consonant digraphs

Within a consonant digraph, one letter (usually the second) is typically regarded as "modifying" the meaning of the other. The commonest modifying letters, which in some analyses may be regarded as special diacritics, are probably <h> and <j>.

<h> is best used after stops to indicate aspiration or spirantisation; for example, <th> suggests /t_h/ or /T/. Beware of <ch>, which is perhaps best used for /tS/; it could also suggest any of /S C x/. <sh zh> are obvious choices for /S Z/.

<h> before or after <m n l r w y> generally suggests voicelessness, as with <wh> or <hw> for /W/. Alternatively you could use, for example, <lh> to represent another lateral which contrasts with whatever <l> represents; <nh> for /J/, as in Portuguese; or <rh> for /R/, as Mark Rosenfelder did with his Verdurian.

<j> suggests palatalisation, especially if you already use it on its own for /j/; for example <tj> implies a palatal stop /c/ or a palatalised coronal stop /t_j/. If you already use <j> for /Z/, <tj> could represent /tS/.

<y> could be used with the same meaning as <j>, although it is more likely to be read as a vowel.

<z> might be used to give your orthography a Polish flavour, for example with <sz cz> for /S tS/.

If you really want to use <c> for /S/, <tc> will naturally represent /tS/; and similarly, if <j> is /Z/, then <dj> is /dZ/. I don't recommend either of these unless there's good reason.

<ng>, as in English, is probably the most convenient and least surprising representation of the velar nasal /N/. If you need to distinguish it from /Ng/ or /ng/, you could spell it <ng'>, as in Swahili, or leave it as <ng> and spell the cluster as <n'g> or the rather unwieldy <ngg>.
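Whether a convention like this really removes the ambiguity can be tested by reading spellings back into phonemes. The Python sketch below does so by always taking the longest grapheme that matches, and by treating the apostrophe purely as a separator. The grapheme table and the name `read_back` are hypothetical, and the scheme shown is the one in which <ng> is /N/ and the cluster /ng/ is spelt <n'g>.

```python
# Hypothetical scheme: grapheme -> X-SAMPA phoneme.
GRAPHEMES = {"ng": "N", "n": "n", "g": "g", "s": "s", "a": "a", "i": "i"}


def read_back(word, graphemes=GRAPHEMES, separator="'"):
    """Split a romanised word into phonemes, longest grapheme first."""
    longest = max(len(g) for g in graphemes)
    phonemes = []
    i = 0
    while i < len(word):
        if word[i] == separator:
            i += 1          # the separator only blocks digraph readings
            continue
        for size in range(longest, 0, -1):
            chunk = word[i:i + size]
            if chunk in graphemes:
                phonemes.append(graphemes[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"cannot read {word[i:]!r}")
    return phonemes


print(read_back("singa"))    # ['s', 'i', 'N', 'a']
print(read_back("sin'ga"))   # ['s', 'i', 'n', 'g', 'a']
print(read_back("singga"))   # ['s', 'i', 'N', 'g', 'a']
```

The three spellings come back as three different phoneme strings, so under these assumptions the scheme distinguishes /N/, /ng/, and /Ng/.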

Double letters normally imply gemination ("long consonants"). If you don't have these, you could use the odd double consonant for an otherwise problematic phoneme; Welsh, for example, uses <dd ll> for /D K/.

Diacritics on consonants

As with extra letters, the odd consonant-with-diacritic may be useful to fill a gap. <ñ> ("n-tilde") is a possibility for the palatal nasal /J/, although Tolkien used it for the velar nasal /N/. <ç> ("c-cedilla") might be useful for something like /tS/ or /C/, but is of debatable value; <ș ț> ("s-cedilla" and "t-cedilla") may also have some use somewhere.

Otherwise, the only diacritics which are systematically useful on consonants are the hachek or caron (the little <v> used in many Slavic languages) and the acute accent. Generally, they suggest palatalisation or palato-alveolar consonants; <č š ž> are thus reasonable representations of /tS S Z/, and the corresponding transcription of /dZ/ is then <dž>. <ğ> might do for /dZ/, although this implies that <g> represents /dz/; it is better used for something like /G/.

<ť ď> (lowercase <Ť Ď>), if anything, suggest /t_j d_j/ or /c J\/; they could be used for /T D/ if you don't like <th dh> or <þ ð> and don't otherwise use hacheks. <ň> could be used for /n_j/ or /J/, and <ř> for /r_j/, /R/, or the Czech /r_r/ if you have it.

<ć ĺ ń ŕ ś ź> should have a systematic relationship to <c l n r s z> if possible, of which they suggest palatal or palatalised equivalents. Otherwise you could use, for example, <ś ź> for /S Z/ and <ĺ ŕ> for another lateral and rhotic.

Punctuation marks

It's not a good idea to use punctuation marks as actual letters; they may be confused or conflict with actual punctuation marks, after all. The only exception is the apostrophe, and then in exceptional circumstances only; it is best confined to indicating elided or omitted sounds, but may be justifiable for marking phonemic glottal stops, ejectives, ingressives, and so on. Such consonants are problematic to transcribe whichever scheme you use, however.

Vowels

Transcribing vowels is more difficult and more dependent on the phonology than is transcribing consonants; the broadest generalisation is that the commonest vowel-system, the Latinate /i e a o u/, is obviously best represented by <i e a o u>. The only other normal Roman letters which can be reasonably used to represent vowels are <y>, which on its own suggests either /1/ or /y/, and <w>, which I have used for /2/ and /u/ but which should be avoided unless absolutely necessary.

Extra letters

The only viable extra vowel letters are:

Vowel digraphs

A digraph which represents a single vowel phoneme should ideally be composed of the single letters which represent similar phonemes; a vowel like /{/, for example, is close to both /a/ and /e/, and a transcription like <ae> is thus a good choice for it. Similarly, <ei> more naturally represents /e/ than /E/, and <oe> suggests both /o/ and /e/ and would do for /2/ or /9/.

Digraphs are a natural possibility for representing long vowels, as with <oo> or <ou> for /o:/.

Diacritics on vowels

Vowels admit more diacritics than consonants; exactly which ones you use, and what you use them for, is dependent on your own preferences, on your vowel-system, and on those you find easy to use with your available technology. Here's what the various vowel diacritics suggest to me; in general, it's best to stick to the first five or six of them.

Accent               Examples  Uses
Acute                áéíóúý    Rising tone; length; more close quality ([i e o u])
Grave                àèìòù     Falling tone; more open quality ([I E O U])
Circumflex           âêîôû     Complex tone; length
Tilde                ãĩõũ      Nasality
Diaeresis or umlaut  äëïöü     Systematically modified qualities (see below)
Macron               āēīōū     Length
Breve                ăĕĭŏŭ     Shortness
Ogonek               ąęįų      Nasality
Double umlaut        őű        Long umlauted vowels, typically /2: y:/

The original use of the diaeresis was to indicate the fronting of an original back vowel, as with <ä ö ü> for /{ 2 y/. This has often been extended in conlangs and some phonetic transcription systems to indicate generalised reversal of backness, as with <ë ï> for /7 M/, although this use is not found in any natural languages. <ë> is also a common choice for phonemic schwa /@/.

If you have diphthongs, one option is to represent them as digraphs which indicate their component vowels; thus /ai au/ are best transcribed <ai au> or, for something slightly more exotic, <ay aw> or <ae ao>. But note that if you have, for example, <a ä> for /a {/ and an /{i/ but no /ai/, there's no point in representing the diphthong with <äi>; <ai> is simpler and thus preferable.

Tones are probably best indicated with diacritics, although you may be able to work out a system with digraphs.

A conlang which uses lots of vowels with diacritics, and which uses them well, is Alurhsa.


From phonemes to letters

The table below shows, arranged phonetically, the suggested transcriptions for obstruents and nasals. The X-SAMPA representation of each phoneme is given first, followed by several reasonable transcriptions of it; slashes and angle brackets are omitted for clarity. Transcriptions in (parentheses) should only be used if there's sufficient justification for them; stop-plus-<h>, for example, is reasonable if you have an Irish-style mutation system, and <ť> (the lowercase of <Ť>) for /T/ is maybe not advisable if you use hacheks for something else.

                 Stops and affricates     Fricatives                          Nasals
Place            Voiceless     Voiced     Voiceless         Voiced
Labial           p   p         b   b      f   f (ph)        v   v (bh w)      m   m
Dental           t   t         d   d      T   th þ ŧ (ť)    D   dh ð đ (ď)    n   n
Alveolar         ts  ts c (ţ)  dz  dz     s   s             z   z
Palato-alveolar  tS  ch č      dZ  j dž   S   sh š          Z   zh ž
Palatal          c   tj ć ť    J\  dj ď   C   (ch ś)        j\  (jh ź)        J   nj ny ñ ň ń
Velar            k   k (c)     g   g      x   h kh (x ch)   G   gh ğ          N   ng ŋ (ñ)
Uvular           q   q (k)     G\  ğ      X   qh (x xh)     R   rh            N\

It's much harder to provide a similar table for vowels, so I won't try. Instead, here are some possible transcriptions of a vowel system like that of Vulgar Latin; I'm sure you can invent many more.

Method         i    e    E    a    O    o    u    Notes
Digraphs 1     i    e    ea   a    oa   o    u    1
Digraphs 2     i    ei   e    a    o    ou   u    2
Digraphs 3     i    ei   ea   a    oa   ou   u    3
Diacritics 1   i    e    è    a    ò    o    u    1
Diacritics 2   i    é    e    a    o    ó    u    2
Diacritics 3   i    é    è    a    ò    ó    u    3
Extra letters  i    e    æ    a    å    o    u    -
Roman only 1   i    y    e    a    o    w    u    4
Roman only 2   y    i    e    a    o    u    w    4

Notes:

  1. These are more suitable for this vowel system if /e/ and /o/ are more frequent than /E/ and /O/.
  2. Similarly, these are preferable if /E O/ are commoner.
  3. These are not really recommended, although the idea is useful if you have long versions of each vowel but short /i e a o u/ only.
  4. These are here for giggle value only and are definitely not recommended.
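To get a feel for how these schemes look on the page, it can help to spell the same word under several of them. The Python sketch below encodes four rows of the vowel table above; the sample word and the names `SCHEMES` and `spell` are hypothetical, and consonants are assumed to be spelt as themselves.

```python
# The seven vowels of the system, in X-SAMPA, in table order.
VOWELS = ["i", "e", "E", "a", "O", "o", "u"]

# Four rows of the vowel table, one spelling per vowel.
SCHEMES = {
    "Digraphs 1":    ["i", "e", "ea", "a", "oa", "o", "u"],
    "Digraphs 2":    ["i", "ei", "e", "a", "o", "ou", "u"],
    "Diacritics 1":  ["i", "e", "è", "a", "ò", "o", "u"],
    "Extra letters": ["i", "e", "æ", "a", "å", "o", "u"],
}


def spell(phonemes, method):
    """Spell a list of phonemes using one row of the vowel table."""
    table = dict(zip(VOWELS, SCHEMES[method]))
    # Anything that is not one of the seven vowels passes through unchanged.
    return "".join(table.get(p, p) for p in phonemes)


word = ["t", "E", "l", "o", "n", "O"]   # hypothetical word /tElonO/
for method in SCHEMES:
    print(method, spell(word, method))
# Digraphs 1 tealonoa
# Digraphs 2 telouno
# Diacritics 1 tèlonò
# Extra letters tælonå
```

Seeing a whole paragraph of sample vocabulary rendered each way is usually more persuasive than any general argument about which scheme is best.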