The CLS Framework: Reltef specification

Daniel Hardman, CMR-TermSoft

Revised 1999-03-18

Reltef™ and MetaREF™ © 1996-2000 Daniel Hardman and CMR-TermSoft. This format is copyrighted to control its evolution. Feel free to use it in your own applications, but please send us a note to let us know that you've found it useful or to get the latest updates.

Background assumptions

This document builds on the termbase data model framework outlined in a separate document entitled "The CLS Framework". The CLS Framework is a logical organization of ISO 12620 data categories that reflects certain tenets of terminology theory and conceptualizes terminology data models using building blocks familiar to professional terminologists.

[Figure: ISO 12620 --> CLS Framework; CLS Framework --SGML--> MARTIF; CLS Framework --MetaREF--> Reltef]

In order to arrive at a final data model for a termbase, we assume the CLS Framework as a point of departure. The data categories and the categorical hierarchy proposed in the CLS Framework can be implemented in various ways. Two possible encoding paradigms are SGML and relational database systems. When the CLS Framework is implemented with SGML, the result is MARTIF (ISO 12200). When the CLS Framework is implemented in relational database systems using MetaREF™ (outlined below), the result is Reltef™. Because MARTIF and Reltef™ share a common framework, they are completely compatible and congruent.

Reltef™ is a specific application of MetaREF™, an abstract data model designed by CMR-TermSoft. The MetaREF™ model consists of an E-R diagram and a set of tables and relationships that can be implemented in any mid-range (e.g., MS Access) to high-range RDBMS (e.g., Oracle, SQL Server, Ingres, Informix, Sybase, etc.). The Reltef™ model consists of the abstract structure of MetaREF™ plus tuples in several MetaREF™ "meta" tables. This data locks the DBMS into the specific terminology model an application requires. A Reltef™ implementation consists of the Reltef™ model plus a coherent body of terminological data stored in non-meta tables.

Reltef™ takes advantage of many strengths of a traditional DBMS, including integrity constraints, data normalization, access to data through traditional and extended SQL, business rules, and so forth. It also provides enormous flexibility and can be customized to address the needs of many data sets in a wide variety of human languages.

Basic Philosophy

Relational systems do certain things very well when it comes to data storage, manipulation, and retrieval. They share a fairly standardized query language (SQL); they are able to impose many constraints to ensure the validity of data; they typically import and export data to other RDBMS engines with a high degree of transparency (assuming congruent data models); they support multiple concurrent access, record locking, etc. However, RDBMSs also have certain weaknesses when it comes to highly linguistic data. They typically index information using a single collating sequence that must be applied to the database as a whole; they have no standardized way of encoding multiple languages (especially if the CJK languages are part of the mix); many validation routines for text fields are English-oriented; their query interface has no built-in linguistic savvy. Also, various relational implementations of terminological data models are highly incompatible with one another because of fundamentally different assumptions about what the atomic units of data and structure in the database should be.

SGML can be considered more properly a document encoding language as opposed to a database system. It allows for detailed "templates" (DTDs) that specify a particular structure; these templates can be populated by an infinite (but bounded) variety of specific instantiations. A well-known SGML DTD is HTML, the language used to encode documents on the Web. Like RDBMSs, SGML has both strengths and weaknesses. Its strengths include true platform independence, extremely flexible handling of multiple languages, automated processing and conversion by SGML-aware software, structural and data integrity constraints, and conformity validation. Its weaknesses include the lack of a query language, no indexing capability, an inability to regulate and distribute access to data, and lack of SGML management engines that provide database-like functionality.

Reltef™ stands for "Relational terminology encoding format", and represents a fusion of these two normally distinct approaches. Reltef™ data is stored in relational tables and joined by standard relationships; it can be queried using SQL, described by an E-R diagram, and edited using RDBMS tools. However, it also parallels the structure of an SGML document and borrows from SGML to deal with character encoding, collating sequences, and other multilingual issues. It can be converted automatically between an SGML document and relational tables to allow information to flow across platform, language, and software boundaries with maximum ease and minimum data loss.

Molecular and atomic views of data

Chemists make important generalizations about matter on at least two very different levels: molecular and atomic. Certain molecules may have common characteristics even though their specific structures are somewhat different (acid~base, organic~inorganic, strand~ring~crystal, etc.). It is valid and useful to think about matter from a molecular perspective. However, an atomic-level classification is also necessary to explain many phenomena, such as why particular molecules form and how they can combine.

Reltef™ views terminological data at two different levels of abstraction. At one level the major objects under consideration are data category instances--pieces of information that can be classified using the CLS Framework, such as concepts, terms, definitions, and the like. At a more elemental level, Reltef™ recognizes the root entities of MetaREF™: languages, picklists, text values, dates, and so forth. These two perspectives could be called "molecular" and "atomic", respectively. To illustrate these perspectives, consider the following piece of fictional terminological data:

**hyperdrive**
POS:	noun
NUM:	singular
SYN:	faster-than-light drive
SYN:	FTL
DOM:	science fiction ~ space travel ~ propulsion
DEF:	Any method of propulsion that achieves travel at speeds that exceed the speed of light. Typically this travel is conceived of as non-linear, meaning that no continuous traversal of space-time occurs.
REF:	Dictionary of Science Fiction and Fantasy, vol. 1, p. 234.
CON:	"Hold on, kid. I'm gonna kick this thing into hyperdrive. Let's see the star cruiser follow us then!"
REF:	Star Wars, Act II, scene 6.

What pieces of data are present in this sample? A terminologist might identify the following data category instances: three terms (one preferred, the other two synonyms), a part of speech, an indication of the grammatical number of a term, a domain, a definition, a context, and two bibliographical references.

This valid and relatively intuitive way of conceptualizing the data (in which the primary unit of data is the data category instance) can be thought of as "molecular", in the sense that the data units can be further reduced or generalized. Terms, definitions, contexts and the like share certain common features, regardless of their specific type: they each modify a single parent element; they each have a single value; they each fall into a single data category. A particular configuration of these more elemental building blocks [atoms] yields a data category instance [molecule]. Molecular perspectives are important both to MARTIF and to relational terminology applications.

Molecular approaches

Because of its origins in SGML, MARTIF is able to address information at an atomic level. For convenience, however, most discussions of MARTIF focus on the molecular perspective, working directly with data categories familiar to a terminologist. Thus the major building block of a MARTIF file is a <termEntry> (container for all information on a given concept). A <termEntry> must consist of one or more <ntig>s (container for all information about a particular term), plus additional (optional) descriptive information (such as a definition). The <ntig>s, in turn, each contain exactly one <term>, followed by optional descriptive information (the part of speech, a contextual example, etc.).

Typical relational terminology models are also molecule-oriented. Concepts, terms, definitions, contexts, bibliographic references, and grammatical information are each conceived of as separate entities, with predefined relationships between the entities encoded as part of the data model: CONCEPT is labeled by one or more TERMs; TERM is demonstrated by one or more CONTEXTs; CONCEPT is identified by one or more DEFINITIONs, etc. In such systems, the concept is usually privileged over all other entities, in the sense that the concept forms a nucleus around which other pieces of information must cluster. Each entity usually corresponds to a separate table, with fields that reflect its particular data characteristics. For example, the term table might have a numeric ID, a 120-character text field to hold the value of the term, and a small field to identify the term's language. It could also have a status field to classify the term as preferred or deprecated, a part of speech field, a grammatical number field, and so forth.

Molecular-only models of terminology data have the advantage of being relatively intuitive to a terminologist. However, they also have certain drawbacks. They are typically built in such a way as to preclude the sharing of information with incompatible models. For example, if a relational model requires that all terms have an associated status, and if terminology from an external system has no status on the terms, then the integrity constraints of the DBMS will make it impossible to import the external data. This is a good thing in terms of internal consistency, but may make it difficult to share information.
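
The import problem described above can be sketched with a small molecule-oriented term table. This is only an illustration in Python with the standard sqlite3 module, not part of Reltef™; the table name, the column names, and the helper import_term are assumptions modeled on the example fields listed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical molecule-oriented table: every term must carry a status.
conn.execute("""
    CREATE TABLE term (
        id     INTEGER PRIMARY KEY,
        value  VARCHAR(120) NOT NULL,
        lang   CHAR(2)      NOT NULL,
        status TEXT         NOT NULL   -- e.g. 'preferred' or 'deprecated'
    )
""")

def import_term(value, lang, status=None):
    """Try to import one external term; report whether the DBMS accepted it."""
    try:
        conn.execute(
            "INSERT INTO term (value, lang, status) VALUES (?, ?, ?)",
            (value, lang, status),
        )
        return True
    except sqlite3.IntegrityError:
        # The NOT NULL constraint rejects terms that arrive without a status.
        return False

accepted_with_status = import_term("hyperdrive", "en", "preferred")
accepted_without_status = import_term("warp drive", "en")
```

The second import fails even though the term itself is perfectly usable, which is the trade-off between internal consistency and data sharing noted above.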

Once a molecular model has been implemented, it also adopts all of the linguistic idiosyncrasies of its RDBMS. For example, Japanese information in an Oracle database may be reliably and safely encoded on a particular flavor of UNIX, but may not transfer at all if that database must be moved to a different machine running a different OS. It almost certainly will not transfer if Oracle is laid aside in favor of another RDBMS somewhere in the future. Some database engines use Unicode internally, but few export it and even fewer allow Unicode-aware queries. If they do allow Unicode queries, there is still the issue of conversion between the OS code page and Unicode during keyboard input, web access, and so forth; any solutions are virtually guaranteed to be implementation-specific and non-transferable. And the RDBMS will probably not sort CJK languages properly. If it does, then it is likely to impose a CJK sort order throughout the database (or possibly allow CJK plus one other order).

A Dual Perspective

Reltef™ is innovative in that it conceives of terminological data at both the atomic and molecular levels of abstraction. It uses the MetaREF™ data model to coordinate data at the atomic level, and gains from MetaREF™ all the inherent power of an RDBMS to manage entities, relationships, and integrity constraints. And it allows a query-driven management of the more familiar molecular-level constructs as well.

How it works

The atomic level in Reltef™ is implemented directly through the MetaREF™ data model. The core of this data model is a set of thirteen entities [atoms]:

Entity name	Description
data category	a particular class of terminological information (e.g., a term, a part of speech, etc.)
data category name	a language-dependent, user-friendly name for a given data category (e.g., in Spanish a term could be called "término")
data category index type	an indexing strategy that corresponds to a data category (e.g., don't index it, index it as a single value, index it word-by-word)
lang	a distinct language, consisting of a uniform encoding scheme that employs a single charset (e.g., French, German, Italian, etc.)
charset	a unique combination of characters that may be used to represent one or more languages (e.g., ISO 8879-1)
picklist	a set of possible values for terminological data from a given data category (e.g., for the data category "part of speech": {noun, verb, adjective})
element	a unique piece of terminological data
date value	a date(-time stamp) that constitutes the value of an element
number value	a number that constitutes the value of an element
picklist value	a member of a picklist that constitutes the value of an element
text value	a sequence of characters that constitutes the value of an element
index value	a string that represents a normalized, indexed form of all or part of the value of a particular element
link	a connection between two elements

The first six items above are "meta" entities; they are created (and their corresponding tables filled with information) before any terminological data is added to the database. It is from these meta entities that "MetaREF" (Meta Relational Encoding Format) gets its name. MetaREF™ defines a very abstract layer of informational possibilities. By populating the six meta entities (plus one meta relationship), the specific data model used by an application (in this case, a termbase) is outlined and enforced. In other words, the meta tables together define the structure that constrains and unifies the terminological data at a molecular level. They might be considered catalyst atoms, necessary to the combination of the other atoms in molecular reactions.

The remaining seven MetaREF™ entities are populated directly through traditional data entry or import and (in Reltef™) hold the actual terminology data visible to an end user of the system. The information these entities contain may be validated at a molecular level using standard SQL queries. Most information-retrieval queries formulated by end users of the DBMS focus almost entirely on information contained by these entities.

[Figure: atomic and molecular perspectives on terminological data]

Sample database

To illustrate how these entities are actually implemented in the tables of a Reltef™ database, we will create a small sample. We begin by encoding the CLS Framework in the seven meta tables inherited from MetaREF™. This process identifies all the data categories that the Reltef™ database will recognize, how instances of the categories may combine and be indexed, what values are valid for a given data category, and so forth. Once we have completed the meta table population, our database will be specifically terminology-oriented, and will be ready for the input of data recognizable to the termbase end user.

The first step is identifying the languages the database will support. This fills the langs table and implements the lang entity. Note that the collating sequence is important for certain kinds of linguistic processing, but is left blank to simplify our example for now.

**Langs** table
langID	description	charsetID	collating sequence	noise list
en	English	ISO 8879-1		the, of, is, are, it, to, be, not, and
es	Spanish	ISO 8879-1		de, y, o, la, el, que, los, las

We must also explain what charset "ISO 8879-1" is. This fills the charsets table and implements the charset entity.

**Charsets** table
charsetID	description	wsd
ISO 8879-1	standard "lower ASCII" plus Western European language chars such as accented vowels, etc. Supports English, French, German, Spanish, Italian, Swedish, Dutch, Norwegian, etc.	ISOLAT1.WSD
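
As a rough sketch of how these two meta tables and the "lang uses charset" relationship might be declared, the following uses Python's standard sqlite3 module. The column types and the underscore spellings (collating_sequence, noise_list) are assumptions; the source only gives the field names shown in the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE charsets (
        charsetID   TEXT PRIMARY KEY,
        description TEXT,
        wsd         TEXT
    );
    CREATE TABLE langs (
        langID             TEXT PRIMARY KEY,
        description        TEXT,
        charsetID          TEXT NOT NULL REFERENCES charsets(charsetID),
        collating_sequence TEXT,   -- left blank in the sample
        noise_list         TEXT
    );
""")

conn.execute("INSERT INTO charsets VALUES (?, ?, ?)",
             ("ISO 8879-1", "Western European characters", "ISOLAT1.WSD"))
conn.executemany("INSERT INTO langs VALUES (?, ?, ?, ?, ?)", [
    ("en", "English", "ISO 8879-1", None, "the, of, is, are, it, to, be, not, and"),
    ("es", "Spanish", "ISO 8879-1", None, "de, y, o, la, el, que, los, las"),
])

# A language that points at an undefined charset should be refused.
try:
    conn.execute(
        "INSERT INTO langs VALUES ('xx', 'Example', 'no-such-charset', NULL, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

lang_count = conn.execute("SELECT COUNT(*) FROM langs").fetchone()[0]
```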

Next we should define some data categories. All Reltef™ databases populate the data categories table (and implement the data category entity) from the CLS Framework (and via CTI, from ISO 12620). Of course, a particular implementation of Reltef™ need not use all or even most of the myriad data categories outlined there; we only define those kinds of data we are interested in. But we must draw our definitions from the standard to ensure compatibility and convertibility.

Here it is vital to note that from an atomic perspective, all data categories are created equal. A basic assumption of most relational terminology databases is the centrality of the concept entity, to which terms, definitions and so forth are subservient. There is no intrinsic quality of the MetaREF™ data model that parallels this assumption. In the data categories table, a concept is just another tuple. It is possible to enforce the concept's core quality in Reltef™, but the enforcing occurs in the valid families table (discussed below), not here.

The structure of the data categories table is relatively straightforward. The value of each categoryID (and GI name + type) derives directly from ISO 12620. Possible values of the "value type" field are: container (element is a shell that contains other elements), none (element has no value at all), text (free-form language-specific information), code (free-form language-independent information), number, date, and picklist. The "forms link" field tells whether this data category is required to point to another element (as in cross references).

**Data categories** table
categoryID	GI name	type	value type	needs lang	forms link
0	termbase	*	container	No	No
0.1	body	*	container	No	No
0.3	termEntry	*	container	No	No
1.1.1	term	*	text	Yes	No
2.1.1	termNote	partOfSpeech	picklist	No	No
2.1.2	termNote	grammaticalGender	picklist	No	No
2.1.3	termNote	grammaticalNumber	picklist	No	No
2.2.2	termNote	geographicalUsage	text	Yes	No
5.1	descrip	definition	text	Yes	No
5.3	descrip	context	text	Yes	No
8	note	*	text	Yes	No
10.18	ptr	cross-reference	none	No	Yes

Now we need to give these data categories (each identified by a somewhat cryptic categoryID) a meaningful, language-specific name. This fills the data category names table and implements the data category name entity.

**Data category names** table
categoryID	langID	name
0	en	termbase
0	es	base de datos de terminología
0.1	en	terminological information
0.1	es	información terminológica
0.3	en	concept
0.3	es	concepto
1.1.1	en	term
1.1.1	es	término
2.1.1	en	part of speech
2.1.1	es	función gramatical
2.1.2	en	gender
2.1.2	es	masculino/femenino
2.1.3	en	number
2.1.3	es	número gramatical
2.2.2	en	geographical usage
2.2.2	es	uso geográfico
5.1	en	definition
5.1	es	definición
5.3	en	contextual example
5.3	es	ejemplo contextual
8	en	note
8	es	nota
10.18	en	cross-reference
10.18	es	véase también
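
One way an application might use this table is to look up a display name for a category in the user's language. The sketch below (Python, standard sqlite3 module) loads a few of the rows above and deliberately leaves out the Spanish name for "note"; the function display_name and its fallback to English are illustrative assumptions, not something the format prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_category_names (
        categoryID TEXT,
        langID     TEXT,
        name       TEXT,
        PRIMARY KEY (categoryID, langID)
    )
""")
conn.executemany("INSERT INTO data_category_names VALUES (?, ?, ?)", [
    ("1.1.1", "en", "term"),       ("1.1.1", "es", "término"),
    ("5.1",   "en", "definition"), ("5.1",   "es", "definición"),
    ("8",     "en", "note"),       # Spanish row omitted on purpose
])

def display_name(category_id, lang_id, fallback_lang="en"):
    """Return the name in lang_id, else in fallback_lang, else the raw categoryID."""
    for lang in (lang_id, fallback_lang):
        row = conn.execute(
            "SELECT name FROM data_category_names"
            " WHERE categoryID = ? AND langID = ?",
            (category_id, lang),
        ).fetchone()
        if row:
            return row[0]
    return category_id

spanish_term = display_name("1.1.1", "es")
spanish_note = display_name("8", "es")      # falls back to the English name
unknown = display_name("9.9", "es")         # no name at all: raw ID is returned
```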

Having given a name to each data category, we need to create picklists for those data categories that require them. This fills the picklists table and implements the picklist entity.

**Picklists** table
categoryID	value
2.1.1	adjective
2.1.1	noun
2.1.1	verb
2.1.2	feminine
2.1.2	masculine
2.1.2	neuter
2.1.3	mass
2.1.3	plural
2.1.3	singular

Now we identify a data category index type for all data categories that can be indexed (value-less data categories cannot be indexed). This fills the data category index types table and implements the data category index type entity. Note that in field "index type", the allowed values are: "none" (the data category is not indexed), "whole" (the entire value of each instance of the data category is indexed as a single value), "chunk" (the value of each instance of the data category is divided into chunks and indexed on a per-chunk basis), and "both" (whole + chunk).

**Data category index types** table
categoryID	index type
1.1.1	both
2.1.1	whole
2.1.2	whole
2.1.3	whole
2.2.2	chunk
5.1	chunk
5.3	chunk
8	chunk

Our last task with the "meta" entities is to specify how data categories may combine. This fills the valid families table and implements the relationship "data category A may be parent of B" (see E-R diagram below). Note that in field "child occurrence rule", the allowed values are: 1 (child must occur exactly once), ? (child may occur 0 or 1 times), * (child may occur 0 or more times), and + (child must occur 1 or more times).

The data we place in this table will be used to enforce so-called "business rules" as well as certain kinds of data constraints. For example, by specifying that all concepts (data category 0.3) must have at least 1 child term (data category 1.1.1), we impose a portion of the concept centrality feature mentioned earlier. If a concept is added, then the valid families table will require that at least one term be added as well. We can check for conformance to this information with a simple query. Many DBMS engines allow this kind of query as a validity check whenever a record is added, making molecular-level integrity constraints straightforward and bulletproof.

**Valid families** table
parent categoryID	child categoryID	child occurrence rule
0	0.1	1
0.1	0.3	+
0.3	1.1.1	+
0.3	5.1	+
0.3	8	*
0.3	10.18	*
1.1.1	2.1.1	?
1.1.1	2.1.2	*
1.1.1	2.1.3	*
1.1.1	2.2.2	*
1.1.1	5.3	*
1.1.1	8	*
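
The conformance check mentioned above could look roughly like the following. This is a sketch in Python with the standard sqlite3 module rather than the query the author had in mind; the underscore column names and the shortened name "rule" are assumptions, and the element rows are a subset of the sample entered later in this document. It checks occurrence counts only, not whether a child category is permitted at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE valid_families (
        parent_categoryID TEXT, child_categoryID TEXT, rule TEXT);
    CREATE TABLE elements (
        ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT);
""")
conn.executemany("INSERT INTO valid_families VALUES (?, ?, ?)", [
    ("0.3",   "1.1.1", "+"),   # a concept needs at least one term
    ("0.3",   "5.1",   "+"),   # ... and at least one definition
    ("1.1.1", "2.1.1", "?"),   # a term has at most one part of speech
])
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    ("concept1",    "0.3",   "body1"),
    ("concept2",    "0.3",   "body1"),
    ("definition1", "5.1",   "concept1"),
    ("term1",       "1.1.1", "concept1"),
    ("term2",       "1.1.1", "concept1"),
    ("term3",       "1.1.1", "concept2"),
    ("pos1",        "2.1.1", "term2"),
])

# For every (parent element, rule) pair, count matching children and keep
# only the pairs whose count breaks the occurrence rule.
VIOLATIONS = """
    SELECT p.ID, vf.child_categoryID, vf.rule, COUNT(c.ID)
    FROM elements AS p
    JOIN valid_families AS vf ON vf.parent_categoryID = p.categoryID
    LEFT JOIN elements AS c
           ON c.parentID = p.ID AND c.categoryID = vf.child_categoryID
    GROUP BY p.ID, vf.child_categoryID, vf.rule
    HAVING (vf.rule = '1' AND COUNT(c.ID) <> 1)
        OR (vf.rule = '?' AND COUNT(c.ID) > 1)
        OR (vf.rule = '+' AND COUNT(c.ID) < 1)
    ORDER BY p.ID, vf.child_categoryID
"""
violations = conn.execute(VIOLATIONS).fetchall()
```

With these rows the check reports concept2, which has a term but no definition, against the rule that every concept carries at least one definition.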

Once these seven tables contain data, we have a sufficient framework to begin actually entering terminology data. Let's create two concepts, three terms, and some simple supporting information, using the following structure:

[Figure: data tree. The database is the root, concepts are children, terms are grandchildren, and so on.]

Here we have uniquely labeled each piece of information using a somewhat arbitrary ID ("term1", "concept2", etc.). Reltef™ imposes some restrictions on these IDs. They are strings of at most 32 characters. They must begin with an alphabetic character, and can contain alphabetic characters, digits, hyphens, and periods ("."). But they need not reflect their associated data category. The terminologist who manages a Reltef™ database decides whether the IDs should follow a particular convention; alternatively, the IDs could be generated automatically during data entry or import. In this case, we have chosen a reasonable convention that seems easy to follow. We begin by adding each data category instance to our elements table:

**ELEMENTS** table
ID	categoryID	parentID
testdb	0	testdb
body1	0.1	testdb
concept1	0.3	body1
concept2	0.3	body1
definition1	5.1	concept1
term1	1.1.1	concept1
ptr1	10.18	concept1
term2	1.1.1	concept1
term3	1.1.1	concept2
context1	5.3	term1
geousage1	2.2.2	term1
note1	8	term2
pos1	2.1.1	term2
gender1	2.1.2	term2
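
The ID restrictions stated before this table (at most 32 characters, a leading letter, then letters, digits, hyphens, or periods) can be expressed as a single regular expression. The Python sketch below assumes that "alphabetic" means the ASCII letters A-Z and a-z; the source does not say whether accented or non-Latin letters are allowed.

```python
import re

# One ASCII letter, then up to 31 letters, digits, periods or hyphens
# (32 characters in total at most).
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,31}")

def is_valid_id(candidate: str) -> bool:
    """Return True if candidate satisfies the ID restrictions as read above."""
    return ID_PATTERN.fullmatch(candidate) is not None

sample_ids = ["testdb", "concept1", "geousage1", "geo-usage.1"]
all_sample_ids_valid = all(is_valid_id(i) for i in sample_ids)
```

IDs such as "1term" (leading digit), "term_1" (underscore), or any string longer than 32 characters are rejected by this check.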

By adding these tuples to the elements table, we have created instances of various terminological (molecular) data categories. However, at a molecular level some of these newly-created objects are incomplete. For example, the terms, the definition, the context, the note, and the geographical usage note all have text values. In Reltef™'s atomic perspective, the text values are separate entities associated through the element has text value relationship by a common ID. Thus we add the following information to the text values table:

**TEXT VALUES** table
ID	value
definition1	Any method of propulsion that achieves travel at speeds that exceed the speed of light. Typically this travel is conceived of as non-linear, meaning that no continuous traversal of space-time occurs.
term1	hyperdrive
term2	sistema de propulsión más rápido que la luz
term3	warp drive
note1	this particular equivalent is not likely to be reliable
geousage1	USA
context1	"Hold on, kid. I'm gonna kick this thing into hyperdrive. Let's see the star cruiser follow us then!"

Each element may also have an explicitly associated lang in the elements use lang table (if the element's data category is defined to take a lang attribute):

**ELEMENTS USE LANG** table
ID	langID
definition1	en
term1	en
term2	es
term3	en
note1	en
geousage1	en
context1	en
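
To show how an element, its text value, and its language fit together, the following sketch joins the three tables for a single element. It uses Python's standard sqlite3 module; the underscore table names and the helper describe are assumptions. Outer joins are used because not every element has a text value or a language (pos1, for instance, takes its value from a picklist).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elements (ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT);
    CREATE TABLE text_values (ID TEXT PRIMARY KEY, value TEXT);
    CREATE TABLE elements_use_lang (ID TEXT PRIMARY KEY, langID TEXT);
""")
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    ("term2", "1.1.1", "concept1"),
    ("pos1",  "2.1.1", "term2"),
])
conn.execute("INSERT INTO text_values VALUES (?, ?)",
             ("term2", "sistema de propulsión más rápido que la luz"))
conn.execute("INSERT INTO elements_use_lang VALUES (?, ?)", ("term2", "es"))

def describe(element_id):
    """Return (ID, categoryID, text value, langID) for one element."""
    return conn.execute("""
        SELECT e.ID, e.categoryID, t.value, l.langID
        FROM elements AS e
        LEFT JOIN text_values       AS t ON t.ID = e.ID
        LEFT JOIN elements_use_lang AS l ON l.ID = e.ID
        WHERE e.ID = ?
    """, (element_id,)).fetchone()

term2_row = describe("term2")
pos1_row = describe("pos1")   # no text value and no language recorded
```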

The gender and part of speech data category instances both draw their value from picklists. Consulting the earlier information we entered in the picklists table, we choose a valid value for each data category instance and enter it in the picklist values table:

**PICKLIST VALUES** table
ID	value
pos1	noun
gender1	masculine
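
Whether a chosen value really belongs to the picklist of the element's data category can be checked with a query along the following lines. This is a sketch in Python with the standard sqlite3 module; the underscore table names are assumptions, and the element pos2 with its mismatched value is a hypothetical row added only to trigger the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elements (ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT);
    CREATE TABLE picklists (
        categoryID TEXT, value TEXT, PRIMARY KEY (categoryID, value));
    CREATE TABLE picklist_values (ID TEXT PRIMARY KEY, value TEXT);
""")
conn.executemany("INSERT INTO picklists VALUES (?, ?)", [
    ("2.1.1", "adjective"), ("2.1.1", "noun"),      ("2.1.1", "verb"),
    ("2.1.2", "feminine"),  ("2.1.2", "masculine"), ("2.1.2", "neuter"),
])
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    ("pos1",    "2.1.1", "term2"),
    ("gender1", "2.1.2", "term2"),
    ("pos2",    "2.1.1", "term1"),      # hypothetical extra element
])
conn.executemany("INSERT INTO picklist_values VALUES (?, ?)", [
    ("pos1",    "noun"),
    ("gender1", "masculine"),
    ("pos2",    "masculine"),           # hypothetical: gender value on a part of speech
])

# A picklist value is invalid when no picklist row exists for the
# combination of the element's data category and the chosen value.
INVALID = """
    SELECT pv.ID, pv.value
    FROM picklist_values AS pv
    JOIN elements AS e ON e.ID = pv.ID
    LEFT JOIN picklists AS p
           ON p.categoryID = e.categoryID AND p.value = pv.value
    WHERE p.value IS NULL
    ORDER BY pv.ID
"""
invalid = conn.execute(INVALID).fetchall()
```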

Now we can implement the link created by ptr1. We do this in the links table:

**LINKS** table
ID	targetID
ptr1	concept2

Our final step is to index the values of various elements, according to the instructions in the data category index types table. Indexed values are stored in the index values table.

**INDEX VALUES** table
ID	offset	value
definition1	1	any
definition1	2	method
definition1	3	propulsion
definition1	4	achieves
definition1	5	travel
definition1	6	speeds
definition1	7	exceed
definition1	8	speed
definition1	9	light
definition1	10	typically
definition1	11	travel
definition1	12	conceived
definition1	13	non-linear
definition1	14	meaning
definition1	15	continuous
definition1	16	traversal
definition1	17	space-time
definition1	18	occurs
context1	1	hold
context1	2	kid
context1	3	gonna
context1	4	kick
context1	5	thing
context1	6	hyperdrive
context1	7	lets
context1	8	see
context1	9	star
context1	10	cruiser
context1	11	follow
context1	12	us
context1	13	then
term1	0	hyperdrive
term1	1	hyperdrive
term2	0	sistema de propulsion mas rapido que la luz
term2	1	sistema
term2	2	propulsion
term2	3	mas
term2	4	rapido
term2	5	luz
term3	0	warp drive
term3	1	warp
term3	2	drive
note1	1	particular
note1	2	equivalent
note1	3	likely
note1	4	reliable
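
The rows for the three terms above can be reproduced by a simple routine that lower-cases the value, strips accents, stores the whole value at offset 0, and numbers the remaining non-noise words from 1. The Python sketch below is one possible reading of that process, not the author's implementation; the function names and the tokenizing rule are assumptions. The definition, context, and note rows also omit words such as "that" and "this", which are not in the sample English noise list, so this sketch would not reproduce those rows exactly.

```python
import re
import unicodedata

def normalize(text):
    """Lower-case the text and strip accents (for example 'más' becomes 'mas')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_rows(element_id, value, index_type, noise_words):
    """Build (ID, offset, value) rows for one element.

    index_type is 'none', 'whole', 'chunk' or 'both'.
    """
    rows = []
    norm = normalize(value)
    if index_type in ("whole", "both"):
        rows.append((element_id, 0, norm))          # whole value at offset 0
    if index_type in ("chunk", "both"):
        # Words may contain internal hyphens, as in 'non-linear'.
        words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", norm)
        kept = [w for w in words if w not in noise_words]
        rows.extend((element_id, n, w) for n, w in enumerate(kept, start=1))
    return rows

ES_NOISE = {"de", "y", "o", "la", "el", "que", "los", "las"}
term2_rows = index_rows(
    "term2", "sistema de propulsión más rápido que la luz", "both", ES_NOISE)
```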

At this point all of our data has been entered. Several Reltef™ tables have not been used (date values, code values, number values). This is because our tiny sample has no data category instances that have date, code, or number values. The structure of these tables and the method of data entry is relatively similar to that of the text values table.
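
With the sample data entered, the parent-child structure stored in the elements table can be walked with a recursive query. The following is a sketch in Python with the standard sqlite3 module (SQLite supports recursive common table expressions); the helper descendants is an assumption for illustration. Because the root element testdb is listed as its own parent, the query skips self-parented rows to avoid looping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE elements (ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT)")
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    ("testdb", "0", "testdb"),           ("body1", "0.1", "testdb"),
    ("concept1", "0.3", "body1"),        ("concept2", "0.3", "body1"),
    ("definition1", "5.1", "concept1"),  ("term1", "1.1.1", "concept1"),
    ("ptr1", "10.18", "concept1"),       ("term2", "1.1.1", "concept1"),
    ("term3", "1.1.1", "concept2"),      ("context1", "5.3", "term1"),
    ("geousage1", "2.2.2", "term1"),     ("note1", "8", "term2"),
    ("pos1", "2.1.1", "term2"),          ("gender1", "2.1.2", "term2"),
])

SUBTREE = """
    WITH RECURSIVE subtree(ID, depth) AS (
        SELECT ID, 0 FROM elements WHERE ID = ?
        UNION ALL
        SELECT e.ID, s.depth + 1
        FROM elements AS e
        JOIN subtree AS s ON e.parentID = s.ID
        WHERE e.ID <> e.parentID      -- the root is its own parent; skip it
    )
    SELECT ID, depth FROM subtree ORDER BY depth, ID
"""

def descendants(root_id):
    """Return (ID, depth) for root_id and everything beneath it."""
    return conn.execute(SUBTREE, (root_id,)).fetchall()

concept1_tree = descendants("concept1")
whole_tree = descendants("testdb")
```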

Molecular-level interface

Once the tables and relationships have been created in a Reltef™ database, the DBMS will automatically maintain what we have called the "atomic" or MetaREF™ level components, the most abstract layer of entities and relationships. For example, the database engine will prevent a text value from being associated with a non-existent language; it will only allow elements to be assigned to valid data categories; it will force picklist values to come from pre-defined picklists; when a particular element is deleted, its associated value, index value, and so forth will also be deleted. This provides one important layer of validation.
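
Two of these guarantees, rejecting a value for a non-existent element and deleting dependent rows along with their element, can be sketched with foreign keys and cascading deletes. The example below uses Python's standard sqlite3 module; the underscore table names and the column name word_offset (standing in for "offset") are assumptions, and other engines declare such constraints with their own syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE elements (ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT);
    CREATE TABLE text_values (
        ID    TEXT PRIMARY KEY REFERENCES elements(ID) ON DELETE CASCADE,
        value TEXT NOT NULL
    );
    CREATE TABLE index_values (
        ID          TEXT REFERENCES elements(ID) ON DELETE CASCADE,
        word_offset INTEGER,
        value       TEXT,
        PRIMARY KEY (ID, word_offset)
    );
""")
conn.execute("INSERT INTO elements VALUES ('term3', '1.1.1', 'concept2')")
conn.execute("INSERT INTO text_values VALUES ('term3', 'warp drive')")
conn.executemany("INSERT INTO index_values VALUES ('term3', ?, ?)",
                 [(0, "warp drive"), (1, "warp"), (2, "drive")])

# A text value for an element that does not exist is refused.
try:
    conn.execute("INSERT INTO text_values VALUES ('term99', 'orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the element removes its text value and index values as well.
conn.execute("DELETE FROM elements WHERE ID = 'term3'")
remaining = conn.execute(
    "SELECT (SELECT COUNT(*) FROM text_values)"
    "     + (SELECT COUNT(*) FROM index_values)"
).fetchone()[0]
```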

However, the atomic level is highly abstract, and does not parallel the typical needs of an end user who knows about terms and definitions but not elements and data categories. The typical end user will need to answer queries such as "find all French software engineering terms that are only appropriate in Canada" or "list Spanish equivalents for English terms containing 'polymorphic'". And the database needs to perform validation at the molecular level as well, to implement rules like "all English terms must have a contextual example" or "each definition must have a bibliographic reference". Both of these kinds of operations are possible by running SQL queries against the Reltef™ database:

To find:	Use SQL:
all concepts	SELECT DISTINCTROW elements.ID FROM elements WHERE (((elements.categoryID)="0.3"));
all contexts	SELECT DISTINCTROW elements.ID, [text values].value, [elements use lang].langID, elements.parentID FROM elements INNER JOIN ([text values] INNER JOIN [elements use lang] ON [text values].ID = [elements use lang].ID) ON (elements.ID = [text values].ID) AND (elements.ID = [elements use lang].ID) WHERE (((elements.categoryID)="5.3"));
all terms	SELECT DISTINCTROW elements.ID, [text values].value, [elements use lang].langID, elements.parentID FROM elements INNER JOIN ([text values] INNER JOIN [elements use lang] ON [text values].ID = [elements use lang].ID) ON (elements.ID = [text values].ID) AND (elements.ID = [elements use lang].ID) WHERE (((elements.categoryID)="1.1.1"));
all terms matching pattern __	SELECT DISTINCTROW terms.ID, terms.value, terms.langID, terms.parentID FROM terms INNER JOIN [index values] ON terms.ID = [index values].ID WHERE ((([index values].value) Like [search pattern:]) AND (([index values].offset)=0));
all equivalents for term __ in lang __	SELECT DISTINCTROW terms_1.value, terms_1.langID, terms_1.parentID FROM [terms matching pattern __] INNER JOIN (terms AS terms_1 INNER JOIN concepts ON terms_1.parentID = concepts.ID) ON [terms matching pattern __].parentID = concepts.ID WHERE (((terms_1.langID)=[target language code:]));

Of course, many more queries could be written. The queries make molecular-level constructs accessible to the end-user, and also provide a mechanism for molecular-level validation of the database.
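
The queries above are written in the dialect of one desktop DBMS and build on one another as saved queries. As a rough portable counterpart, the sketch below expresses the "all terms" and "all equivalents" lookups with Python's standard sqlite3 module. The underscore table names, the column name word_offset (standing in for "offset"), the view terms, and the helper equivalents are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elements (ID TEXT PRIMARY KEY, categoryID TEXT, parentID TEXT);
    CREATE TABLE text_values (ID TEXT PRIMARY KEY, value TEXT);
    CREATE TABLE elements_use_lang (ID TEXT PRIMARY KEY, langID TEXT);
    CREATE TABLE index_values (ID TEXT, word_offset INTEGER, value TEXT);

    -- Molecular view: one row per term, assembled from the atomic tables.
    CREATE VIEW terms AS
        SELECT e.ID, t.value, l.langID, e.parentID
        FROM elements AS e
        JOIN text_values       AS t ON t.ID = e.ID
        JOIN elements_use_lang AS l ON l.ID = e.ID
        WHERE e.categoryID = '1.1.1';
""")
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", [
    ("concept1", "0.3", "body1"),    ("concept2", "0.3", "body1"),
    ("term1", "1.1.1", "concept1"),  ("term2", "1.1.1", "concept1"),
    ("term3", "1.1.1", "concept2"),
])
conn.executemany("INSERT INTO text_values VALUES (?, ?)", [
    ("term1", "hyperdrive"),
    ("term2", "sistema de propulsión más rápido que la luz"),
    ("term3", "warp drive"),
])
conn.executemany("INSERT INTO elements_use_lang VALUES (?, ?)", [
    ("term1", "en"), ("term2", "es"), ("term3", "en"),
])
conn.executemany("INSERT INTO index_values VALUES (?, 0, ?)", [
    ("term1", "hyperdrive"),
    ("term2", "sistema de propulsion mas rapido que la luz"),
    ("term3", "warp drive"),
])

def equivalents(pattern, target_lang):
    """Terms in target_lang that share a concept with a term matching pattern."""
    return conn.execute("""
        SELECT other.value, other.langID, other.parentID
        FROM terms AS hit
        JOIN index_values AS iv ON iv.ID = hit.ID AND iv.word_offset = 0
        JOIN elements AS c ON c.ID = hit.parentID AND c.categoryID = '0.3'
        JOIN terms AS other ON other.parentID = c.ID
        WHERE iv.value LIKE ? AND other.langID = ?
    """, (pattern, target_lang)).fetchall()

spanish_for_hyper = equivalents("hyper%", "es")
spanish_for_warp = equivalents("warp%", "es")   # concept2 has no Spanish term
```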

Correspondence between tables and entities/relationships

Table name	Entity (relationship) name
charsets	charset
code values	code value; also relationship "element has code value"
data categories	data category
data category index types	data category index type; also relationship "data category takes index type"
data category names	data category name; also relationships "data category name uses lang" and "data category has lang-specific name"
date values	date value; also relationship "element has date value"
elements	element; also relationship "element A is parent of element B"
index values	index value; also relationship "element value is indexed"
langs	lang; also relationship "lang uses charset"
links	associative relationship "link"
number values	number value; also relationship "element has number value"
picklist values	picklist value; also relationship "element has picklist value"
picklists	picklist; also relationships "data category uses picklist" and "picklist value derives from picklist"
text values	text value; also relationship "element has text value"
elements use lang	relationship "element uses lang"
valid families	relationship "data category A may be parent of B"