For linguists and programmers who work with data in multilingual computer systems, this free programming book serves as a practical Unicode reference.
We lay out the fundamental concepts needed to understand how writing systems and character encodings work, and how they interact where the Unicode Standard and the International Phonetic Alphabet meet.
Although users frequently express irritation with these standards, the standards still give lexicographers and programmers the uniform computational architecture they need to process, disseminate, and evaluate lexical data from the world's languages. We therefore highlight typical, yet occasionally obscure, pitfalls that researchers using Unicode and IPA encounter.
Having identified and overcome the difficulties of making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we developed a set of open-source Python and R tools for working with languages through orthography profiles, which describe author- or document-specific orthographic conventions.
In this cookbook, we give a formal specification of orthography profiles and offer recipes built on free software that show how users can segment text into its constituent units, check it for errors, and convert it into different written forms for comparative linguistics research.
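
To give a flavor of what such a recipe involves, here is a minimal Python sketch of profile-based segmentation. It is an illustration under stated assumptions, not the tools described in this book: the `PROFILE` mapping and the `segment` and `transliterate` functions are hypothetical names invented for this example, and the greedy longest-match strategy is just one plausible way to apply a profile. Only the standard-library `unicodedata` module is used.

```python
import unicodedata

# Hypothetical orthography profile (an assumption for illustration):
# each grapheme of one source's orthography maps to an IPA value.
PROFILE = {
    "tsch": "tʃ",
    "sch": "ʃ",
    "ch": "x",
    "a": "a",
    "u": "u",
    "t": "t",
    "s": "s",
    "e": "ə",
}


def segment(text, profile):
    """Split text into profile graphemes by greedy longest match.

    Returns (units, errors), where errors lists (index, character)
    pairs for characters that the profile does not cover.
    """
    # Normalize both sides so canonically equivalent strings compare equal.
    text = unicodedata.normalize("NFC", text)
    graphemes = {unicodedata.normalize("NFC", g) for g in profile}
    longest = max(len(g) for g in graphemes)

    units, errors = [], []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in graphemes:
                units.append(chunk)
                i += size
                break
        else:
            # No grapheme matched: keep the character but flag it.
            errors.append((i, text[i]))
            units.append(text[i])
            i += 1
    return units, errors


def transliterate(units, profile):
    """Map each segmented unit to its profile value, leaving unknowns as-is."""
    return [profile.get(unit, unit) for unit in units]


units, errors = segment("tschaus", PROFILE)
print(units, errors)
print(transliterate(units, PROFILE))
```

Trying the longest grapheme first keeps multi-letter units such as "tsch" together instead of splitting them into "t", "s", "ch". Characters missing from the profile are reported rather than silently dropped, which is how a segmentation pass can double as an error check.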