Joe Celko's Data and Databases: Concepts in Practice
By Joe Celko
Chapter 10: Textual Data
Overview
The most obvious nonrelational data in organizations is textual. Manuals, company rules, regulations, contracts, memos and a multitude of other documents actually run the organizations. In fact, most business rules are not in relational databases, but are in word processor files and vertical filing cabinets.
SQL and traditional records are based on a strict syntax and formal rules for extracting data and information. Natural language is a semantic system, and the syntax is not what carries the meaningful information. Unfortunately, while computers are good at formal syntax, they are lousy at semantics.
The result is that textbases use pattern matching operations instead of actually reading and understanding the documents they store. The textbase market is growing at the rate of 50% per year, according to a 1991 study by IDC and Delphi Consulting Group. Most of that growth is in centralized data centers.
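The gap between matching and understanding can be seen in a minimal Python sketch. The document names and texts below are invented for illustration, and `find_matches` is a hypothetical helper, not part of any textbase product. The search finds every document containing the word "bank" but cannot tell a financial institution from a riverside.

```python
import re

# Hypothetical mini-collection; names and texts are invented for illustration.
documents = {
    "memo-001": "The bank approved the loan on Friday.",
    "memo-002": "We had lunch on the river bank.",
    "memo-003": "Payroll rules are unchanged this quarter.",
}

def find_matches(pattern, docs):
    """Return the sorted names of documents whose text matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted(name for name, text in docs.items() if regex.search(text))

# Purely syntactic search: both senses of "bank" are returned,
# because the matcher sees characters, not meaning.
hits = find_matches(r"\bbank\b", documents)
print(hits)  # prints ['memo-001', 'memo-002']
```

Telling the two memos apart would take knowledge of what the words mean, which is exactly what pattern matching lacks.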
10.1 Terminology and the Basics
Two basic terms are used in this field, textbase and document base, and both are gradually being grouped under the single term textbase. Strictly speaking, a textbase is just text: the documents stored in it are blocks of text with a name. This name can be as simple as an accession number or a timestamp on a fax, or it can have some meaning. There is also no requirement that the documents have the same structure.
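As a rough illustration of this strict sense, a textbase can be modeled as nothing more than a mapping from names to blocks of text. The sketch below is an assumption made for illustration, not the design of any particular product; the `Textbase` class, the `ACC-` accession format, and the sample texts are all hypothetical.

```python
from datetime import datetime

class Textbase:
    """A strict-sense textbase: named blocks of text, no imposed structure."""

    def __init__(self):
        self._docs = {}           # name -> text
        self._next_accession = 1  # counter for generated names

    def add(self, text, name=None):
        """Store a block of text; generate an accession number if unnamed."""
        if name is None:
            name = f"ACC-{self._next_accession:06d}"
            self._next_accession += 1
        if name in self._docs:
            raise ValueError(f"duplicate document name: {name}")
        self._docs[name] = text
        return name

    def get(self, name):
        """Return the text stored under the given name."""
        return self._docs[name]

tb = Textbase()
# A name can be a bare accession number ...
first = tb.add("Employees must badge in at the front desk.")
# ... a timestamp, as on an incoming fax ...
fax = tb.add("Please confirm the shipment.",
             name=datetime(1999, 1, 15, 9, 30).isoformat())
# ... or something meaningful. The texts share no common structure.
tb.add("Section 1. Term.\nSection 2. Fees.", name="acme-contract")
```

Nothing in the store inspects or constrains the text itself, which is the point: the only thing every document is guaranteed to have is a name.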
A document base is text that comes arranged in a structured format. Usually the documents have a...
Copyright Morgan Kaufmann Publishers 1999 under license agreement with Books24x7