ISO 12620 data categories
Copyright © 2000
Translation Research Group
Send comments to firstname.lastname@example.org
Last updated: January 27, 2001
About the CLS Framework
The CLS Framework is the result of a joint effort of the Brigham Young University Translation Research Group (BYU TRG) and the Kent State University Institute for Applied Linguistics (KSU IAL). The framework deals with the structure and content of terminological databases, which we will call termbases. The Framework can be used for representation of existing termbases, design of new termbases, and sharing, dissemination, and interchange of terminological data.
Terminological data can be represented in various ways, for example, as a relational database or as a file of structured text marked up using SGML. Holmes-Higgen and Ahmad (1996: 215-2241) point out the importance of providing explicit data models for all types of terminological data. ISO 12620 provides an inventory of types of data items, each type being called a "data category". A terminological concept entry (term entry, for short) is composed of data items (an item being a field, a cell, or an element, depending on the representation); each item is an instance of a data category. However, 12620 does not specify the structure of a term entry, i.e., it does not specify the relationships among data items in an entry. The present framework provides (1) an approach to structuring the items in a term entry in a manner consistent with current theory and practice in concept-oriented terminology and (2) a set of data categories taken from 12620.
This particular framework is called the CLS Framework, i.e., Concept-oriented with Links and Shared references. Shared references and links will be explained later.
The CLS Framework could be used to create a family of compatible data models, where data model is used to refer to a specific conceptual schema within a general approach to data modeling, such as the relational approach. It already ties together an SGML approach called MARTIF and a relational approach called Reltef. MARTIF includes an SGML DTD, and Reltef includes an entity-relationship (E-R) diagram, but they both are consistent with the overall structure specified by the CLS Framework and the data categories specified by 12620. MARTIF is used for interchange among termbases, and Reltef is used for retrieval and maintenance of termbases. Neither would be suitable for the other's purpose, but since they are both consistent with the CLS Framework, it is possible to write automatic bi-directional conversion routines between them with little or no loss of critical information.
Currently there is much talk of object-oriented databases. As suggested by Holmes-Higgen and Ahmad, it should be possible to build an object-oriented termbase as a layer on top of the relational database. We hope that such an object-oriented database can be consistent with the CLS Framework as well.
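The rest of this overview consists of a description of the termbase data model framework and an introduction to the ISO 12620 data categories.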
We hope the CLS Framework, along with 12620, MARTIF, and Reltef, will be useful to you in exchanging terminological data between existing termbases and in designing new ones.
A termbase data model framework
A graphical representation of the structure described below is available.
ISO FDIS 12620, which is based on the content of many actual termbases, consists of an organized inventory of many data categories for use in various environments, including interchange and retrieval, but, as mentioned above, it does not specify the structure of a concept-oriented terminology entry.
ISO 12620 explicitly catalogues four major classes of data categories: terms, term-related data categories, descriptive data categories, and administrative data categories. ISO 12200 adds a fifth class for links, which are nevertheless implicit in 12620. We will describe how these classes of data categories can be structured in entries and used to link with global and shared information needed in a termbase. Clearly, the following framework is not the only possible application of 12620, but we feel it is a flexible and powerful one.
Within the CLS Framework, a termbase consists of (a) global information about the termbase, (b) a set of concept entries (called ConceptEntrys in this document), and (c) a set of references (here called SharedRefs) that can be shared among multiple ConceptEntrys or parts of ConceptEntrys, as shown in the graphical representation.
Global information can include publication information, such as the name of the database, the name of the copyright holder, information about the creators of the database, dates, and version numbers. In addition, it can include information about the languages and writing systems used in the termbase, which user group the termbase is intended for, and other administrative information. In some cases, some or all of the global information may be external to the termbase itself, but this is not ideal, especially in interchange environments.
Each ConceptEntry consists of information about one concept in a specified subject field (also called a domain) and one or more terms that are each assigned as a language-specific designation of that concept. In accordance with the principle of term autonomy articulated by Schmitz (Schmitz 1997), each term is accompanied by various pieces of term-related, descriptive, and administrative information. Usually, the bulk of the information in a termbase will reside in the set of ConceptEntrys, which in this context can be called the body of the termbase.
It is understood that, ideally, each term in a given ConceptEntry designates the same concept, at least within a particular subject field. Nevertheless, the terms in various languages may still not be perfect equivalents of each other, since the relevant concepts in various languages may not be identical. Transfer comments may be needed to detail degrees of non-equivalence between certain pairs of terms in some direction. If the lack of equivalence exceeds some threshold determined by the judgment of the terminologist, it may be more appropriate to define two distinct concepts and to split the ConceptEntry into two entries, one for each concept. The ideal concept entry has the following properties: the entry is based on a single well-defined concept, and all the terms in the entry are equivalent to each other. In collections of standardized terminology, the goal may be to document one term in each language for each concept, but if synonyms exist, they should also be documented and identified to reflect their acceptability or status.
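The three-part structure described above (global information, ConceptEntrys, and SharedRefs) can be sketched as a small set of Python data classes. This is only an illustration under stated assumptions: the class and field names (Termbase, LanguageSection, term_info, and so on) are hypothetical and are not taken from 12620, MARTIF, or Reltef, and the sample entry is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    text: str
    # Term autonomy: each term carries its own term-related,
    # descriptive, and administrative information.
    term_info: dict[str, str] = field(default_factory=dict)


@dataclass
class LanguageSection:
    lang: str  # language code, e.g. "en"
    terms: list[Term] = field(default_factory=list)


@dataclass
class ConceptEntry:
    entry_id: str       # unique within the termbase
    subject_field: str  # also called a domain
    languages: list[LanguageSection] = field(default_factory=list)


@dataclass
class SharedRef:
    ref_id: str   # unique within the termbase
    content: str  # e.g. a full bibliographic entry


@dataclass
class Termbase:
    global_info: dict[str, str]
    entries: list[ConceptEntry] = field(default_factory=list)
    shared_refs: list[SharedRef] = field(default_factory=list)


# Invented sample data: one concept, designated by one term per language.
entry = ConceptEntry(
    entry_id="ce1",
    subject_field="vehicles",
    languages=[
        LanguageSection("en", [Term("handlebars", {"partOfSpeech": "noun"})]),
        LanguageSection("fr", [Term("guidon", {"partOfSpeech": "noun"})]),
    ],
)
termbase = Termbase(
    global_info={"title": "Sample termbase", "version": "0.1"},
    entries=[entry],
)
```

Each Term keeps its own dictionary of properties rather than sharing one per entry, which is one simple way to respect term autonomy.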
SharedRefs and Links
In addition to the types of information listed above, ConceptEntrys and individual data items in ConceptEntrys can include links to other items in the ConceptEntry, to other ConceptEntrys, and to SharedRefs. SharedRefs consist of pieces of information that are potentially shared by many ConceptEntrys. An example of a SharedRef would be a complete bibliographic entry or a responsibility entry listing the biodata about a person who created or updated entries in the termbase. Individual items scattered throughout the termbase can include links or pointers to the relevant SharedRefs as needed. SharedRefs can also include graphics, external video or audio resources, charts, tables, and the like.
The use of SharedRefs is grounded in the well-known principle of database management that redundant information should be avoided. There are several reasons for this rule:
The links used to connect data items with SharedRefs can take two forms:
SharedRefs may be implemented differently in various data models. In a structured text data model, the SharedRefs may reside in an SGML back matter element. In a relational data model, the SharedRefs may be in a separate table or tables.
In addition to their association with SharedRefs, links can also be used between ConceptEntrys, such as to indicate that one ConceptEntry has a generic-specific relation to another. For a link to be unambiguous, the targeted item must include an identifier that is unique throughout the database.
These three components, i.e., global information, ConceptEntrys, and SharedRefs, along with unique item identifiers and a link mechanism, constitute the minimal features of a modern termbase.
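The requirement that every link target carry a database-wide unique identifier lends itself to a simple automated check. The following is a minimal sketch, assuming that items and links are available as plain (identifier, item) and (source, target) pairs; the function names and sample identifiers are hypothetical.

```python
def build_index(items):
    """Map each identifier to its item, rejecting duplicate identifiers."""
    index = {}
    for item_id, item in items:
        if item_id in index:
            raise ValueError(f"duplicate identifier: {item_id}")
        index[item_id] = item
    return index


def unresolved_links(links, index):
    """Return the (source, target) pairs whose target is not in the index."""
    return [(src, tgt) for src, tgt in links if tgt not in index]


# Invented sample data: two ConceptEntrys and one SharedRef.
items = [
    ("ce1", "ConceptEntry: bicycle"),
    ("ce2", "ConceptEntry: handlebars"),
    ("bib1", "SharedRef: bibliographic entry"),
]
links = [
    ("ce2", "ce1"),   # link between ConceptEntrys
    ("ce1", "bib1"),  # link to a SharedRef
    ("ce1", "bib9"),  # dangling link: no such identifier
]

index = build_index(items)
dangling = unresolved_links(links, index)  # [("ce1", "bib9")]
```

Because ConceptEntrys and SharedRefs share a single identifier space in this sketch, one index is enough to resolve both kinds of link.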
Other Data Element Attributes
Other aspects of data categories implicit in 12620 are language and data type. Pieces of textual information in the database require language codes to specify the language (and writing system, including the associated computer representation). Some items in the database may have a date as a value. A date is a data type distinct from normal text. Other items may have text as a value. Still others may be restricted to one or more options taken from a list of permissible instances. The data category specification for each data category must indicate the data type associated with that item. That data type can then be used to write automatic validation routines for the database.
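As a rough illustration of how a declared data type could drive automatic validation, the sketch below checks a value against a small specification table. The table layout, the category names used as keys, and the picklist values are assumptions made for this example; they are not quoted from 12620.

```python
from datetime import date

# Hypothetical data category specifications: each names a data type,
# and picklist categories also list their permissible values.
SPEC = {
    "date": {"type": "date"},
    "definition": {"type": "text"},
    "partOfSpeech": {"type": "picklist", "values": {"noun", "verb", "adjective"}},
}


def is_valid(category, value):
    """Check a string value against the data type declared for its category."""
    spec = SPEC.get(category)
    if spec is None or not isinstance(value, str):
        return False
    if spec["type"] == "date":
        try:
            date.fromisoformat(value)  # expects e.g. "2001-01-27"
            return True
        except ValueError:
            return False
    if spec["type"] == "picklist":
        return value in spec["values"]
    # Plain text: any non-empty string is accepted.
    return value != ""
```

For instance, is_valid("date", "2001-01-27") is accepted, while is_valid("partOfSpeech", "nounn") is rejected because the value is not in the permitted list.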
Within a ConceptEntry the various terms are grouped by language. Each piece of textual information has a language code associated with it. A descripNote can be attached to any descriptive element (descrip), an adminNote can be attached to any administrative element (admin) and a general note can be attached to any item except another general note. A link can be attached to any item and can target another ConceptEntry or a SharedRef in the back matter. The back matter consists of a number of expanded SharedRefs or external references to foreign data residing in the system or even on a network, such as the World Wide Web or a proprietary intranet.
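The attachment rules for notes stated above can be written as a small predicate. This is a sketch only; the string labels follow the names used in the text (descripNote, adminNote, note, descrip, admin), while the function name is hypothetical.

```python
def can_attach(note_kind, target_kind):
    """Return True if a note of note_kind may be attached to target_kind."""
    if note_kind == "descripNote":
        return target_kind == "descrip"   # only on descriptive elements
    if note_kind == "adminNote":
        return target_kind == "admin"     # only on administrative elements
    if note_kind == "note":
        return target_kind != "note"      # anywhere except another general note
    raise ValueError(f"unknown note kind: {note_kind}")
```

A validation routine could call can_attach for every note in an entry and report any combination that returns False.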
Traditionally, termbases have exhibited fundamentally different entry structures from those used in lexical databases, i.e., from the structures familiar to the users of standard dictionaries or from the structures required for many machine translation or natural language processing environments. Whereas termbase entries are concept oriented, these databases are headword-oriented and list all the meanings associated with a headword. If a database were extended to include lexicographical information linked to terminological ConceptEntrys, then it might be called a lex-termbase. The design of such a database is the objective of the MARCLIF project (MAchine-Readable Concept- and Lexicographically oriented Interchange Format) being pursued by the AMTA SIG (Association for Machine Translation in the Americas Special Interest Group) on data exchange standards (DXS). The MARCLIF project was initially described at the LISA Forum in Mainz, Germany, in March, 1997.
We have now described the termbase data model framework based on 12620 as elaborated by the BYU TRG and the KSU IAL. It should now be somewhat clearer why we have called it the CLS Framework: it consists primarily of Concept-oriented entries, and these entries can be Linked to each other or to Shared references as needed.
The reader is now encouraged to explore the rest of the CLS Framework such as the graphical representation, the paper on Blind MARTIF, and the CLS Framework data categories. These data categories are a subset of those in ISO 12620 and their XML representation is more restricted than the SGML representation allowed in ISO 12200.
An introduction to the ISO 12620 data categories
The following explanations are designed to illustrate how the ISO 12620 data categories fit into the above approach to encoding terminological data. When we state that an item is "attached" to another item, we leave open how that attachment is implemented. In an SGML representation, attachment may be shown by containment, i.e., by embedding an item or items inside another element. In a relational database, it may be shown by using pointers between tables. Some less stringent data models attempt to achieve association between elements through the principle of adjacency, but we do not recommend or try to support this methodology because it is subject to ambiguity and does not ensure robust behavior, as pointed out by TC 37 French delegate Andre LeMeur.
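The two implementations of attachment mentioned above, containment and pointers, can be compared side by side. In this sketch, XML parsed with Python's standard library stands in for SGML, and an in-memory SQLite database stands in for a relational termbase. The element, table, and column names are hypothetical and are not those of MARTIF or Reltef; the definition text is invented.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Containment: the definition is embedded inside the entry element.
doc = ET.fromstring(
    '<conceptEntry id="ce1">'
    "<definition>steering bar of a bicycle</definition>"
    "</conceptEntry>"
)
contained = doc.find("definition").text

# Pointers: the definition row refers to its entry by key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE concept_entry (id TEXT PRIMARY KEY)")
db.execute(
    "CREATE TABLE definition ("
    "entry_id TEXT REFERENCES concept_entry(id), text TEXT)"
)
db.execute("INSERT INTO concept_entry VALUES ('ce1')")
db.execute("INSERT INTO definition VALUES ('ce1', 'steering bar of a bicycle')")
pointed = db.execute(
    "SELECT d.text FROM definition d "
    "JOIN concept_entry c ON d.entry_id = c.id WHERE c.id = ?",
    ("ce1",),
).fetchone()[0]
db.close()
```

In both cases the definition is unambiguously attached to entry ce1, which is what adjacency alone cannot guarantee.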
Section 1 consists of the data category term. Each term is attached to a LanguageSection, which is attached to the ConceptEntry. This attachment indicates that the term designates, in a particular language, the concept associated with the ConceptEntry. As noted above, each term is autonomous, but it can be modified and categorized using the items listed in Section 2, term-related information.
Section 2 consists of various pieces of term-related information such as term type, part of speech, geographical usage, register, etymology, syllabification, and administrative status. These data categories are simply attached to a term, indicating that they are properties of the term.
Section 3 consists of pieces of information about degrees of equivalence. These data categories are attached to a term and refer to another term in the same entry or a closely-related concept entry.
Section 4 consists of data categories that tell how a concept relates to a subject field, also called a domain. There are two methods provided by Section 4. One method is to simply attach the name of a subject field to a ConceptEntry. The other method is to place a classification system listing the subject fields in the back matter and to attach a link from the ConceptEntry to its respective node in the classification system.
Section 5 consists of descriptions of the concept, such as definitions and contextual examples. Often, a definition will be attached directly to the ConceptEntry and apply to all its terms. Sometimes, definitions will be available in multiple languages and will be attached to language sections or individual terms. Contextual examples will, of course, be attached to the term they exemplify. A graphic image or non-textual item may be used to describe a concept. In such cases, the non-textual item, or at least an external reference to it, will be placed in the back matter and a link to the back matter item will be attached to the ConceptEntry.
Section 7 consists of data categories that relate a ConceptEntry to its position in a concept system. The ConceptEntry can include an indication of, for instance, its superordinate concept, together with a link to the appropriate entry. Another option is to link the ConceptEntry to a node in a concept system that has been placed in the back matter. For instance, in a concept system for vehicles, the ConceptEntry for handlebars could include a link to an entire concept system and an indication that a certain node in that concept system is the whole for which handlebars is a part.
Section 8 consists simply of the data category note, a general note, and a suggestion that more specific data categories should be used when possible. Some of these more specific data categories are TermNote, DescripNote, and AdminNote, each of various types. A general note could attach to a ConceptEntry, a language section, a term, or even to some item attached to a ConceptEntry or term, such as a contextual example. However, a general note cannot be attached to a general note, as this would result in recursive attachment.
Section 9 consists of data categories that relate a ConceptEntry to a node in a thesaurus or documentation language.
It may appear to some users that Sections 4, 7 and 9 are all very much alike. It even may look logical to combine these elements somehow. It is important to bear in mind that these elements represent different traditions and frequently serve different objectives in terminology management.
Thesauri and concept systems come from different traditions. A concept system attempts to organize the concepts in a particular subject field, frequently, but not necessarily, with a narrow focus. An effort is made to include all concepts within the hierarchical structure.
A thesaurus, on the other hand, is usually built to facilitate document retrieval by attaching thesaurus items to documents and indexing on those items. Thesauri typically identify so-called descriptors, which serve as primary search elements for information retrieval. All nondescriptors, which may be synonyms, subordinate terms, or even just related terms, are mapped to the descriptors that are used for that group of concepts. Consequently, whereas concept systems attempt to account for all concepts in a system, thesauri and documentation languages are streamlined retrieval tools that use a smaller designated set of terms. Furthermore, concept systems are just that, i.e., they are strictly concept oriented, whereas thesauri and documentation languages are based on terms.
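The mapping of nondescriptors to descriptors can be pictured as a simple lookup. The sketch below uses invented sample terms and hypothetical names (DESCRIPTORS, USE_INSTEAD, to_descriptor); it is not drawn from any particular thesaurus.

```python
# Invented sample data: one descriptor and two nondescriptors mapped to it.
DESCRIPTORS = {"automobile"}
USE_INSTEAD = {"car": "automobile", "motorcar": "automobile"}


def to_descriptor(term):
    """Return the descriptor used for retrieval, or None if the term is unknown."""
    if term in DESCRIPTORS:
        return term
    return USE_INSTEAD.get(term)
```

A search for "car" would thus be run against the descriptor "automobile", which is how a thesaurus gets by with a smaller designated set of terms than a full concept system.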
Library classification systems are, of course, generally broader than the typical concept system or terminological thesaurus, both of which are restricted to a particular subject field. On the other hand, some thesauri, such as the British Root Thesaurus or the NASA thesaurus, cover such a global range of fields that it may be more accurate to say that the complexity of these systems varies with their intended range of application.
Section 10 consists of administrative data categories that indicate when things happened, such as creation and updating, who is responsible for certain sections of an entry, and what subset or subsets an entry is part of. These data categories are attached where they apply, that is, to a ConceptEntry, a language section, a term, or an item attached to a term. However, here again, recursive attachment of a date to a date or other administrative item is not allowed. Some administrative data categories may include a link to a SharedRef item in the back matter, such as a biodata entry for a responsible person.
Holmes-Higgen, P. & Ahmad, K. (1996). "Is your terminology in safe hands? Data analysis, data modeling and term banks." Fourth International Congress on Terminology and Knowledge Engineering Proceedings. Frankfurt: Indeks Verlag.