Joe Celko's Data and Databases: Concepts in Practice

Chapter 10: Textual Data

Overview
The most obvious nonrelational data in organizations is textual. Manuals, company rules, regulations, contracts, memos and a multitude of other documents actually run the organizations. In fact, most business rules are not in relational databases, but are in word processor files and vertical filing cabinets.
SQL and traditional records are based on a strict syntax and formal rules for extracting data and information. Natural language is a semantic system, and the syntax is not what carries the meaningful information. Unfortunately, while computers are good at formal syntax, they are lousy at semantics.
The result is that textbases rely on pattern-matching operations instead of actually reading and understanding the documents they store. According to a 1991 study by IDC and Delphi Consulting Group, the textbase market was growing at a rate of 50% per year, with most of that growth in centralized data centers.
10.1 Terminology and the Basics
There are two basic terms used in this field, textbase and document base, which are gradually being grouped under the term textbase. Strictly speaking, a textbase is just text. The documents stored in a textbase are just blocks of text with a name. This name can be as simple as an accession number or a timestamp on a fax, or it can have some meaning. There is also no requirement that the documents have the same structure.
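As a rough illustration, and not something taken from the book, a textbase in this sense can be sketched as a mapping from document names to unstructured blocks of text, searched by pattern matching rather than by any understanding of the content. The class name `Textbase`, its methods `add` and `search`, and the sample documents are all hypothetical, chosen only for this sketch.

```python
import re


class Textbase:
    """Hypothetical toy textbase: named blocks of text with no shared structure."""

    def __init__(self):
        self._docs = {}  # document name -> raw text

    def add(self, name, text):
        # The name can be an accession number, a timestamp, or something meaningful.
        self._docs[name] = text

    def search(self, pattern):
        """Return the sorted names of documents whose text matches the regex."""
        rx = re.compile(pattern, re.IGNORECASE)
        return sorted(name for name, text in self._docs.items() if rx.search(text))


tb = Textbase()
tb.add("ACC-000123", "Employees must submit expense reports within 30 days.")
tb.add("fax-1991-03-14T09:30:00", "Contract renewal terms are attached.")
tb.add("travel-policy", "Expense limits for travel are set by each department.")

# Pure pattern matching: the search finds the word, not the meaning.
matches = tb.search(r"\bexpense\b")
print(matches)
```

The search only tests whether a pattern occurs in the text. A document that discusses reimbursement without using the word "expense" would not be found, which is the gap between syntax and semantics described in the overview.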
A document base is text that comes arranged in a structured format. Usually the documents have a...
