Translation, Theory and Technology Homepage


MARTIF - Putting Complexity in Perspective

3. Conflicting Views on the Complexity of the MARTIF Standard

All natural systems are fraught with complexity, and that complexity cannot be eliminated. It is only possible to shift the burden or effort involved in dealing with complexity from one facet of a system to another, i.e., simplification in one area produces inevitable complication elsewhere (Budin 1996). The purpose of this section is to address the ways in which MARTIF is complex and to explain why these complexities exist. The apparent complexity of the standard involves data category proliferation, data modeling variants, tag naming conventions, and structural elements. (Data modeling as discussed in this article pertains to the way that information in the term entry is assigned to specific kinds of data fields and permissible instances, and the way that elements are related structurally to other elements in the database using hypertext links and other pointing mechanisms.)
3.1 Data Category Proliferation
The companion data category standard (ISO FDIS 12620) defines approximately 150 data categories and a total of approximately 120 permissible instances, i.e., items where the content of a data category must be one of several specified values. At first examination, this seems excessive considering the fact that most terminology databases (TDBs) use a fairly small subset of data fields.

The developers of the standard began their work by discussing the relative merits of fashioning a terminology interchange format around an ideal data entry model. At that time, some experts argued that creating an interchange medium for converting data between different database systems with different data architectures was unnecessary (Stellbrink 1993). Instead, it seemed to make good sense for everyone to adopt a uniform entry format analogous to the MARC record used in library management. There were, for instance, many attempts in Germany to define a standard (or at least a minimum) terminological entry, but all of these efforts failed when confronted with the fact that different work groups need to fashion different data models in response to individual needs (Hohnhold 1988; Mossmann 1988; COTSOWES 1990).

The GTW Report Guidelines for the Design and Implementation of Terminology Data Banks was in development at that time, and the former Text Encoding Initiative (TEI) work group was encouraged and even strongly tempted to adopt the GTW model as a basic superstructure for the interchange format (GTW 1996). If they had done so, their document would be considerably simpler today: simpler, but probably less inclusive and less powerful.

One of the first steps that led the group down the road to increased complexity was taken when the decision was made to examine a wide range of data models actually being used in the industry rather than to think strictly in terms of models they personally liked (in which case the GTW model would have suited quite well). This study revealed which data categories actually occur in real databases and how these elements are modeled (Wright and Budin 1994).

Budin and Wright found that most database environments function with a fairly modest set of data categories, although some well-funded national term banks such as Termium use a richer mix of information. The new COLTERM (Colombian) database project, whose entry format was modeled to a great extent on ISO 12620, provides for over 50 categories of information (Wright 1997, in press). The limited number of categories used in individual databases did not, however, result in an overall short list. Although the core set of categories that was repeated from system to system was indeed short (term and definition, plus variants of subject field, context, source, and responsibility), the list rapidly grows long when all the categories that occur only occasionally are included. In other words, few applications use very many categories, but the number is huge when a tally is made of all the categories that are used somewhere in some system. This is true even when all the synonymous and polysemic data category names have been properly accounted for.

As noted above, the significant factor in the proliferation of data categories is that various users or user groups have very specific information management needs, and terminology management is being applied in increasingly divergent environments, from language planning to inventory control, from safety to quality assurance, and from monolingual to bilingual to broadly multilingual venues. Each unique combination of language and information management needs is coupled with resource management considerations that dictate either a simple or a more granular data entry, resulting in custom-tailored sets of labels designed for the storage and retrieval of information in that specific environment.

Repeated efforts have been made to shorten the list in order to make it more manageable. At one point, the work group extracted individual data categories whose use is limited to certain restrictive environments and relegated them to a document annex, but critical comment within the standards process quickly moved them back into the standard proper because the move to the annex disassociated these items from their position with respect to closely related data elements. The GTW Report compiled an "exemplary selection" of categories designed to simplify the category selection process (GTW 1996, Annex B). This list provides an excellent overview of basic categories, but it was never intended to replace the full ISO 12620 list as a comprehensive resource for modeling data.

The obvious lesson to be learned from these efforts is that any kind of global attempt to shorten the list within the standard itself is also likely to be subject to dissatisfaction. The selection of pragmatic subset lists designed to meet the needs of specific users or groups of users (such as inside the LISA organization) must remain the task and the right of the groups in question rather than the purview of any standardizing body. Nevertheless, there remains a definite need for greater clarification and explanation of the 12620 standard and procedures for matching data element names used in local applications to the data category names and definitions spelled out in the standard.

term (main entry term):	Automobil
synonym (full form):	Personenkraftwagen
short form (1):	Auto
short form (2):	Wagen
abbreviation:	PKW

Figure 2: Sample Data Model 1

3.2 Data Category Variation

Examination of numerous data modeling conventions has also revealed that different needs result in different data structures, a phenomenon that developers of the standards have sometimes called "bimodality", but which is perhaps more clearly called "data modeling variance". Some evaluators have voiced the view that it would be appropriate to simply concentrate on the "normal" way of structuring terminological data, but examination of the full spectrum of existing systems reveals that there is no one "normal" way. Consequently, the standard has been designed to ensure the compatibility of the various approaches. The apparent complexity that results is, however, a reflection of real-world conditions.

Deviations in data modeling manifest themselves at two levels. In some cases, variation involves simply the choice of divergent data category names, which results in polysemy and synonymy. This phenomenon is not difficult to deal with because such names can be reassigned fairly easily during the data conversion process, provided database developers have followed consistent procedures in assigning data to any given field, which may not always be the case.
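The name-level reassignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the local field names (Benennung, Def, Quelle, Fachgebiet, Bemerkung) and the mapping table are hypothetical and are not taken from any real database or from the standard itself.

```python
# Hypothetical mapping from one work group's local field names to
# data category names in the style of ISO 12620. The local names
# on the left are invented for illustration.
LOCAL_TO_STANDARD = {
    "Benennung": "term",
    "Def": "definition",
    "Quelle": "source",
    "Fachgebiet": "subjectField",
}


def rename_fields(record, mapping):
    """Rename the fields of one record according to the mapping.

    Returns the converted record plus a list of local field names that
    have no counterpart in the mapping, so that they can be reviewed by
    a person instead of being silently dropped.
    """
    converted = {}
    unmapped = []
    for name, value in record.items():
        if name in mapping:
            converted[mapping[name]] = value
        else:
            unmapped.append(name)
    return converted, unmapped


local_record = {
    "Benennung": "Automobil",
    "Fachgebiet": "vehicles",
    "Bemerkung": "sample entry",
}
converted, unmapped = rename_fields(local_record, LOCAL_TO_STANDARD)
print(converted)
print(unmapped)
```

Reporting unmapped names separately reflects the caveat in the text: renaming is easy only when fields have been used consistently, so anything the table does not cover needs human attention.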

The second type of modeling variation involves the way that categories behave in individual databases. Some databases, especially those influenced by paper-management traditions or lexicographical approaches, use data categories for different types of terms as field names in the data entry and enter terms as the content of these fields, for instance as follows in the German language group of an entry for the concept automobile (Figure 2).

		Attributes
term:	Automobil	full form
term:	Personenkraftwagen	full form
term:	Auto	short form
term:	Wagen	short form
term:	PKW	abbreviation

Figure 3: Sample Data Model 2

Other databases simply use a single data category, term, to which all terms, regardless of term type, are assigned. The philosophy behind these types of databases is that all terms in a terminological entry are "created equal" as synonyms or equivalents (or at least quasi-synonyms and equivalents). This system has several practical advantages. For instance, a particular term in a given terminological entry will sometimes be considered a main entry term or preferred term by one customer, but only a synonym or even a deprecated term by another. This approach also simplifies look-up routines because queries only have to search for one data category (term), and automatic cut-and-paste routines for translation purposes do not have to be designed to accommodate any other data categories. If additional information is added to categorize these terms, it is supplied as individual attributes describing the term field, as shown in Figure 3.

Figure 4 represents yet a third, less common modeling structure. In Figures 2 and 3, only those items that specifically apply in a given instance appear in the data entry. In contrast, the model in Figure 4 can lead to an accumulation of redundant data because each permissible instance is enumerated for every item and given a yes or no value instead of treating negative instances as unstated default values.

A fourth approach to data "modeling" in an import environment involves merely dumping all information as text data into an undifferentiated text file (i.e., without assigning individual items to data categories at all). Information is retrieved using all-word indexes. Interchange back to any articulated database environment from such a system would, of course, be impossible, and if the system is large, human users will be burdened with an unmanageable number of hits for many queries. Hence, this "solution" will not be treated here as a serious alternative.

term:	Automobil
full form:	yes
short form:	no
abbreviation:	no
term:	Personenkraftwagen
full form:	yes
short form:	no
abbreviation:	no
… etc.

Figure 4: Sample Data Model 3

Based on these examples, it is obvious that even simple data structures pose complex problems if there is a desire to combine entries with different structures. At one point in the evolution of the standard, an attempt was made to allow for the expression of all possible data modeling variations at the interchange level, but this led to elaborate levels of complexity that also threatened to create difficulties in parsing the resulting data. The choice was made to simplify the interchange format by providing for a single way of expressing these relations.

The final version of the MARTIF standard favors Sample Data Model 2 (Figure 3), primarily because it is the most concept-oriented of the three and because it allows for the inclusion of multiple attributes associated with a given term. In this, as in any other information-management situation, problems of complexity cannot be eliminated simply by not taking cognizance of them. In those cases where source or target databases differ from the chosen model, the import and export routines used to convert those databases to and from MARTIF will have to be more complex. This degree of complexity is not simply the problem of MARTIF, however; it is an expression of the complexities existing in the "real world". Defining a simpler MARTIF would not resolve the problems posed by complexity; it would only hide them and potentially lead to lost or misinterpreted data.
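To make the extra work in such import routines concrete, the following Python sketch normalizes rows shaped like Figure 2 (typed field names) and Figure 4 (yes/no flags) into the structure of Figure 3 (one term category plus attributes). The dictionary layout, the function names, and the mapping table are assumptions made for illustration only; they are not part of MARTIF.

```python
# Figure 2 field labels mapped to the attribute values shown in Figure 3.
# This table is an illustrative assumption. Note that it discards the
# "main entry term" versus "synonym" distinction, which a real routine
# would have to carry over as an additional attribute.
MODEL1_FIELD_TO_ATTRIBUTE = {
    "term (main entry term)": "full form",
    "synonym (full form)": "full form",
    "short form (1)": "short form",
    "short form (2)": "short form",
    "abbreviation": "abbreviation",
}


def from_model1(rows):
    """Convert (field label, term string) pairs in the style of Figure 2."""
    return [
        {"term": value, "attributes": [MODEL1_FIELD_TO_ATTRIBUTE[label]]}
        for label, value in rows
    ]


def from_model3(rows):
    """Convert yes/no rows in the style of Figure 4, keeping only 'yes' flags."""
    result = []
    for row in rows:
        attributes = [
            name for name, flag in row.items()
            if name != "term" and flag == "yes"
        ]
        result.append({"term": row["term"], "attributes": attributes})
    return result


FIGURE_2 = [
    ("term (main entry term)", "Automobil"),
    ("synonym (full form)", "Personenkraftwagen"),
    ("short form (1)", "Auto"),
    ("short form (2)", "Wagen"),
    ("abbreviation", "PKW"),
]
FIGURE_4 = [
    {"term": "Automobil", "full form": "yes", "short form": "no", "abbreviation": "no"},
    {"term": "Personenkraftwagen", "full form": "yes", "short form": "no", "abbreviation": "no"},
]

model2_from_1 = from_model1(FIGURE_2)
model2_from_3 = from_model3(FIGURE_4)
for record in model2_from_1:
    print(record)
```

Both converters end in the same target shape, which is the point of choosing a single model at the interchange level: each source system needs one routine into and out of that shape rather than one routine per partner system.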

3.3 Generic Identifiers and Compound Tags

Some evaluators have registered the criticism that the full "tag" naming convention used in MARTIF appears to be excessively verbose. Instead of giving each element a simple generic identifier (GI) in Standard Generalized Markup Language (SGML), composite names are used. These composites consist of a GI taken from the small set of GIs defined in the standard, followed by the attribute type plus a value. It is at this value level that the full tag is uniquely identified. For example, the common field name "Definition" becomes <descrip type='definition'> in MARTIF.

Early in the development of the standard, the collected data categories were classified and proposed as a list of GIs to be used to tag the individual elements of the MARTIF entry. The initial assumption was that each item would constitute the basis for a GI in the MARTIF standard. For instance, a short list of GIs might read: <term>, <definition>, <source>, <context>, <responsibility>.

A major consideration within the TEI environment, however, was whether the data categories identified as GIs would be a closed set (such as can be defined for a standard format like the MARC record) or whether the list should remain open so that it can be augmented as the need arises. This decision is a critical one, for GIs and attributes in SGML must be declared in the Document Type Definition (DTD), which serves as the master "program" that specifies the rules for parsing the SGML document. Once a standard DTD is set, it is not feasible to change the GIs.

At the time this problem was considered in the TEI group, many new data categories were being added to the list every year. Although new additions have become rare in the last few years, it is simply not realistic to assume that terminologists have now discovered all the terminological applications that can exist. Consequently, the decision was made to declare a relatively short list of GIs that would be truly generic and to modify these tags using the value of the attribute type.

The resulting composite tags for the data categories cited above now are as follows: <term> (which remains unchanged and reflects the data modeling structure illustrated in Figure 3), <descrip type='definition'>, <ref type='source'>, <descrip type='context'>, and <admin type='responsibility'>. On the surface, this decision indeed introduces wordier (but not necessarily more complicated) full tag names. What is gained is first of all the ability to add data categories as needed because only the GIs and the attributes have to be declared in the standard DTD, whereas values assigned to the attribute type do not. (They can, however, be listed in the header of the DTD. Standard SGML parsers would not catch anomalies at the level of the type value, but work groups would be free to design their own subsidiary validation routines to check the actual data categories and, in some cases, even their content.) Furthermore, one also gains the ability to classify the behavior of whole groups of data categories by defining the behavior of the GIs associated with them.
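The composite-tag scheme and the kind of subsidiary validation mentioned above can be sketched as follows. The GI and type pairs are the five given in the text; the function names and the idea of storing a work group's permitted values in a Python dictionary are assumptions for illustration, not something the standard prescribes.

```python
from xml.sax.saxutils import escape

# Data category name -> (GI, value of the type attribute), as listed in the text.
CATEGORY_TO_TAG = {
    "term": ("term", None),
    "definition": ("descrip", "definition"),
    "source": ("ref", "source"),
    "context": ("descrip", "context"),
    "responsibility": ("admin", "responsibility"),
}

# A work group's own subset of permitted type values per GI (hypothetical).
# A parser working only from the DTD would not check these values.
ALLOWED_TYPES = {
    "descrip": {"definition", "context"},
    "ref": {"source"},
    "admin": {"responsibility"},
}


def composite_element(category, content):
    """Build one element using the short GI plus its type value."""
    gi, type_value = CATEGORY_TO_TAG[category]
    attribute = "" if type_value is None else f" type='{type_value}'"
    return f"<{gi}{attribute}>{escape(content)}</{gi}>"


def type_is_allowed(gi, type_value):
    """Subsidiary check of a type value against the work group's subset."""
    return type_value in ALLOWED_TYPES.get(gi, set())


print(composite_element("definition", "degree of opaqueness"))
print(type_is_allowed("descrip", "definition"))
print(type_is_allowed("descrip", "defintion"))  # misspelled value is rejected
```

Adding a new data category in this arrangement means adding a row to the tables, while the set of GIs stays fixed, which mirrors the argument for keeping type values out of the DTD.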

The decision to use the long multi-component tags is certainly an added complexity so far as human, manual markup is concerned, for the omission of one small element can result in parsing errors, but the objective of the standard is not to produce a mechanism that will be easy for humans to produce and read. The goal is rather to provide a powerful tool for automatic conversion of data. What is "simple" to a stupid machine in terms of SGML logic is not necessarily the same thing that will be simple to a human being. The length of the tags is not particularly relevant to the machine; only the fact that they have unique values matters. Nor does the number of values associated with any given GI constitute a problem so long as they are clearly identifiable to the machine, which has no trouble remembering all of them the way a human would. Furthermore, limiting the GIs significantly simplifies and shortens the DTD, which constitutes one way in which MARTIF is not as complex as it might be had it been designed differently. Listing each value of type as a separate GI would make the DTD unwieldy even if the list did not change. Doing so would also discourage the definition of specific subsets in individual work groups.

The aspects in this section lead us to the recurring question: MARTIF is potentially complex, but for whom is it complex? Looking at the raw data is complex for the average user, but not necessarily for a system designer or programmer. The ideal situation would be to retain the complexity of MARTIF, together with its built-in flexibility and power, but to present potential users with an interface that would drastically simplify the reuse of MARTIF documents (see Section 6).

3.4 Embedding and GI Behavior

Early on, some criticisms questioned the idea of using SGML as a language for expressing terminological data structures. These critics argued that SGML is fundamentally designed as a document exchange format that is only capable of expressing flat structures and is unsuitable for implementing relational links. Today these comments seem ironic when we consider how SGML-related formats such as HTML (Hypertext Markup Language) and HYTIME are being used to create highly flexible, multi-layered information systems capable of facilitating powerful links across computer platforms on a world-wide scale.

It is, of course, possible to use SGML to produce very flat, simple structures. Figure 5a represents an example of what a hypothetical flat model might look like, whereas Figure 5b shows a similar simple entry expressed in MARTIF format. The examples are arranged side-by-side for easy comparison of parallel structures.

<entry>
<concept>
<domain>internet</domain>
<note>sample entry</note>
</concept>

<termInfo lang=fr>
<term>réseau principal</term>
<grammar>noun</grammar>
<termType>preferred term</termType>
<termSource>Doc.No. 73 54</termSource>
<definition>Désigne les inforoutes d'Internet qui servent de principaux points d'accès à partir desquels les autres réseaux se connectent.</definition>
<defSource>URL: http://wwli.com/translation/netglos/glossary/french</defSource>
</termInfo>

<equivInfo lang=es>
<term>red de columna vertebral</term>
<termType>noun</termType>
<termSource>Sanchez95</termSource>
</equivInfo>

<equivInfo lang=en>
<term>backbone</term>
<termType>preferred term</termType>
<termSource>Hardwired96</termSource>
<context>The term is relative, as backbones in small networks smaller than those in large networks.</context>
<ctxtSource>URL:www.matisse.net/files/glos </ctxtSource>
</equivInfo>
</entry>

Figure 5a: Flat SGML Representation

<termEntry id='IDxyz'>
<descrip type='subjectField'>appearance of materials</descrip>
<admin type='projectSubset'>MARTIF test suite</admin>

<ntig lang=de>
<termGrp>
<term>Opazität</term>
<termNote type='partOfSpeech'>noun</termNote>
<termNote type='gender'>f</termNote>
<termNote type='termType'>preferred term</termNote>
</termGrp>
<descripGrp>
<descrip type='definition'>Maß für Lichtundurchlässigkeit</descrip>
<ref type='sourceID' target='DIN67.1992'>p. 8</ref>
</descripGrp>
</ntig>

<ntig lang=fr>
<termGrp>
<term>opacité</term>
<termNote type='partOfSpeech'>noun</termNote>
<termNote type='gender'>f</termNote>
<termNote type='termType'>preferred term</termNote>
</termGrp>
</ntig>

<ntig lang=en>
<termGrp>
<term>opacity</term>
<termNote type='partOfSpeech'>noun</termNote>
<termNote type='termType'>preferred term</termNote>
</termGrp>
</ntig>
</termEntry>

Figure 5b: MARTIF Representation

Both models follow the typical SGML practice of placing related information into containers, identified here by full tag names that appear in boldface. At first glance, both models appear to express equally simple data structures, although obviously the MARTIF tags are slightly longer. Both entries begin by including concept-related information that refers to the entire terminological entry. Figure 5a includes a container for concept-level information, but MARTIF does not. Such a container is unnecessary, however, since the <termEntry> contains all the information that belongs to the concept-oriented terminological entry, and all the data categories that are not grouped in the <ntig>s pertain to the concept itself.

Both models place term-related information inside containers. In Figure 5a, information associated with the main entry term is packaged in <termInfo> tags and that for equivalent terms in <equivInfo> tags. This procedure works well if a single language is given preference, but it remains unclear how such a system would work if it were desirable to present terms without preferencing a specific language.

Figure 5b places all information having to do with a given term inside <ntig> tags, i.e., nested terminology information groups. In the MARTIF model, all terms are treated equally. The equivalence of these terms is implicit in the fact that they appear in the same entry. Not only is their equal treatment in the entry more "politically correct" (it does not give preference to any one language), it also facilitates directional change in using the database or in importing the data and readily allows for the interchange of data to or from other data models where different languages are given preference. If there is a clear need to preference a single term in a single source language, the MARTIF notation <termNote type='termType'>main entry term</termNote> can be used.

Example 5a expresses dependency between items by just placing them adjacent to each other: <grammar>, <termType>, and <termSource> simply follow <term>, and <defSource> follows <definition>, thus producing an essentially flat structure. These relations are uncomplicated enough for humans to intuit based on their knowledge of terminological behavior, but such dependencies are not easy for dumb computers to keep straight: Is each of the items inside <termInfo> equally and independently reliant on the <term>, or is <definition> somehow dependent on <termSource>, perhaps? How would an import routine manage to package the data correctly in order to redistribute it in a target system that looks very different from the source database structure? It would appear that this model makes explicit some information that is already clearly implicit in the entry (i.e., term equivalence), while at the same time failing to mark important dependencies that are not at all implicit.

In contrast, the MARTIF entry is structured rather like a Russian matryoshka doll with small elements tucked inside of bigger elements tucked inside of bigger elements, i.e., the dependency of items on other items is very clearly and securely expressed by use of the powerful embedding capability that is native to SGML. Packaging information in MARTIF is also like organizing a kitchen cupboard: each individual herb or spice goes first into a separate labeled container. No sensible cook would just dump them in piles one after the other in the corner of the kitchen and hope to be able to retrieve them based on the fact that they are merely adjacent to one another.

Data, like condiments, are easier to retrieve and manipulate if they are clearly packaged and marked than if one has to search through the whole collection to find a single small item. For the parser to "know" that certain elements belong together, these relations must be explicitly stated in a way that the SGML parser can understand; otherwise ambiguities enter the parsing process and pose serious dangers of data loss, contamination, or misinterpretation.

The markup inside the <ntig> (shown here in boldface) provides a powerful tool for indicating that certain pieces of data are closely related to one another. By putting all the information pertaining to a term (in this case, the term string itself, its part of speech, gender, and term status) into a <termGrp> container and all the information pertaining to the definition into a description container (<descripGrp>), the interchange format begins to look much more complex than a simple hardcopy entry designed for use by the human reader, and yet in so doing, it provides a mechanism for transporting that information into a new environment, perhaps even into a data architecture where these elements of information are organized in a very dissimilar way. The introduction of what looks like complexity to the human reader is actually a powerful tool for making data manipulation much simpler for the computer.
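A short Python sketch can show why the containers make life simpler for the machine. It rests on a loudly stated assumption: the entry below is a well-formed XML approximation of part of Figure 5b (attribute values quoted, one language group shortened), parsed with Python's standard xml.etree.ElementTree module rather than with an SGML parser. The function name and the shape of the returned records are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML approximation of part of the entry in Figure 5b (an assumption:
# the figure itself is SGML with unquoted attribute values).
ENTRY = """
<termEntry id='IDxyz'>
  <descrip type='subjectField'>appearance of materials</descrip>
  <ntig lang='de'>
    <termGrp>
      <term>Opazität</term>
      <termNote type='partOfSpeech'>noun</termNote>
      <termNote type='gender'>f</termNote>
    </termGrp>
    <descripGrp>
      <descrip type='definition'>Maß für Lichtundurchlässigkeit</descrip>
      <ref type='sourceID' target='DIN67.1992'>p. 8</ref>
    </descripGrp>
  </ntig>
  <ntig lang='en'>
    <termGrp>
      <term>opacity</term>
      <termNote type='partOfSpeech'>noun</termNote>
    </termGrp>
  </ntig>
</termEntry>
"""


def term_records(entry):
    """Collect one record per <ntig>, relying only on containment."""
    records = []
    for ntig in entry.findall("ntig"):
        term_grp = ntig.find("termGrp")
        record = {
            "lang": ntig.get("lang"),
            "term": term_grp.findtext("term"),
            # Notes belong to the term because they sit inside <termGrp>.
            "notes": {n.get("type"): n.text for n in term_grp.findall("termNote")},
            "definitions": [],
        }
        for grp in ntig.findall("descripGrp"):
            definition = grp.find("descrip[@type='definition']")
            ref = grp.find("ref")
            if definition is not None:
                # The source belongs to the definition because both
                # share the same <descripGrp> container.
                record["definitions"].append({
                    "text": definition.text,
                    "source": ref.get("target") if ref is not None else None,
                })
        records.append(record)
    return records


root = ET.fromstring(ENTRY.strip())
records = term_records(root)
for record in records:
    print(record["lang"], record["term"], record["notes"], record["definitions"])
```

Nothing in the routine has to guess which source goes with which item; the grouping elements state it. In a flat layout the same routine would need rules about element order, and those rules would differ from one source database to the next.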

The MARTIF entry departs from the flat model in yet another powerful way. The <termSource> tags in Example 5a appear simply to provide the information concerning the source of the terms. Even if other information resides elsewhere in the database, there is no indication in terms of SGML notation that "Sanchez95" or "Hardwired96" could serve as relational links to that information. Thus the markup again remains flat and is confined to the entry at hand. In MARTIF, the <ptr> and <ref> tags enable the user to program links to targeted locations elsewhere in the body or backmatter of the document or even to facilitate links to external references beyond the document. This feature, coupled with the security offered by the use of SGML embedding features, does indeed make MARTIF more complex, but in exchange the user is liberated from the flat world of Example 5a to an environment where powerful hypertext links can express a wide variety of relations within the entry, between entries, and between entries and other types of data. Figure 5a constitutes the result of a self-fulfilling prophecy that proclaims that SGML is a flat system, incapable of fully representing terminological data, whereas MARTIF demonstrates the full power of SGML to accommodate knowledge on a multi-level scale. The difference between the flat structure of 5a and MARTIF is even more pronounced when one examines more complex examples and more sophisticated databases. This contrast is as dramatic as the difference between a flat piece of hardcopy printed on paper and the dynamic hypertext environment of the World Wide Web.
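How a target attribute can act as a relational link may be sketched as follows. The document skeleton here is hypothetical: the element names <doc>, <body>, <back>, and <source>, and the placeholder text of the backmatter record, are invented for illustration and should not be read as the structure MARTIF actually defines. Only the <ref type='sourceID' target='DIN67.1992'> element is taken from Figure 5b.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML skeleton: an entry in the body and a record in the
# backmatter whose id matches the target of the <ref> in the entry.
DOC = """
<doc>
  <body>
    <termEntry id='IDxyz'>
      <ntig lang='de'>
        <descripGrp>
          <descrip type='definition'>...</descrip>
          <ref type='sourceID' target='DIN67.1992'>p. 8</ref>
        </descripGrp>
      </ntig>
    </termEntry>
  </body>
  <back>
    <source id='DIN67.1992'>placeholder bibliographic record</source>
  </back>
</doc>
"""

root = ET.fromstring(DOC.strip())

# Index every element that carries an id so that targets can be looked up.
id_index = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve(ref):
    """Return the element that a <ref> points to, or None if it dangles."""
    return id_index.get(ref.get("target"))


resolved = [(ref.get("target"), resolve(ref)) for ref in root.iter("ref")]
dangling = [target for target, element in resolved if element is None]

for target, element in resolved:
    print(target, "->", None if element is None else element.tag)
print("dangling targets:", dangling)
```

A plain string such as the content of <termSource> in Example 5a offers no such handle: a program cannot tell whether it is free text or a key into another record, so a check for dangling links of this kind is not possible.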
