Joe Celko's Data and Databases: Concepts in Practice
By Joe Celko
Chapter 7: Character String Data
Overview
Characters are represented internally in the computer as bits, and there are several different systems for doing this. Almost all schemes use fixed-length bit strings. The three most common ones in the computer trade are ASCII, EBCDIC, and Unicode.
ASCII (American Standard Code for Information Interchange) is standardized internationally as ISO 646. It is a 7-bit code, but each character is normally stored in a byte (8 bits). It is most popular on smaller machines.
EBCDIC (Extended Binary Coded Decimal Interchange Code) was developed by IBM by expanding the old Hollerith punch card codes. It also uses a byte (8 bits) for each character. EBCDIC is fading out of use in favor of ASCII and Unicode.
These two code sets are what is usually meant by CHAR and VARCHAR data in database products because they are what the hardware is built to handle. They differ in both the characters they represent and their collation order, which can cause problems when moving data from one to the other.
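The collation difference is easy to see by sorting the same strings on their encoded bytes. The following is a minimal sketch, not a description of how any particular database product sorts. It assumes Python's built-in "cp037" codec as a stand-in for EBCDIC (cp037 is one common EBCDIC code page; others exist), and the sample strings and the helper name sort_by_encoding are made up for illustration.

```python
# Sample values: one starts with a lowercase letter, one with an
# uppercase letter, and one with a digit.
words = ["apple", "Zebra", "42"]


def sort_by_encoding(items, codec):
    """Sort strings by the raw byte values they have in the given codec."""
    return sorted(items, key=lambda s: s.encode(codec))


# ASCII places digits before uppercase letters before lowercase letters.
ascii_order = sort_by_encoding(words, "ascii")

# EBCDIC (here: code page 037, an assumption) places lowercase letters
# before uppercase letters before digits.
ebcdic_order = sort_by_encoding(words, "cp037")

print(ascii_order)   # ['42', 'Zebra', 'apple']
print(ebcdic_order)  # ['apple', 'Zebra', '42']
```

A query with ORDER BY on such a column could therefore return rows in a different order after the data is moved between an ASCII-based and an EBCDIC-based system, unless an explicit collation is applied.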
7.1 National Character Sets
Unicode is what is usually meant by the NATIONAL CHARACTER and NATIONAL CHARACTER VARYING datatypes in SQL-92. Unlike ASCII and EBCDIC, this standard represents alphabets, syllabaries, and ideograms.
An alphabet is a system of characters in which each symbol has a single sound associated with it. The most common alphabets in use today are Roman, Greek, Arabic, and Cyrillic.
A syllabary is a system of characters in which each symbol has a single syllable associated with it.
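As a rough illustration of one code set covering several kinds of writing systems, the sketch below looks up the official Unicode names of a few characters with Python's standard unicodedata module. The sample characters are chosen here for illustration and do not come from the text: three alphabetic letters (Roman, Greek, Cyrillic), one symbol from the Japanese hiragana syllabary, and one Chinese ideogram.

```python
import unicodedata

# Illustrative sample characters, written as escapes so the source file
# stays plain ASCII.
samples = [
    "A",        # Roman alphabet
    "\u03b1",   # Greek alphabet (alpha)
    "\u0416",   # Cyrillic alphabet (zhe)
    "\u304b",   # hiragana syllabary (ka)
    "\u4e2d",   # CJK ideogram
]

# Map each character to its official Unicode character name.
names = {ch: unicodedata.name(ch) for ch in samples}

for ch, name in names.items():
    # Show the code point in the conventional U+XXXX notation.
    print(f"U+{ord(ch):04X}  {name}")
```

The names returned for the alphabetic characters say LETTER together with the script name, the hiragana name identifies a whole syllable (KA), and the ideogram is named only by its code point.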
Copyright Morgan Kaufmann Publishers 1999 under license agreement with Books24x7