0.1 Intended audience

This SALT document defines an XML-based application referred to as the Default XLT Format (DXLT). DXLT is the primary member of the XLT family of formats. This document also provides the basis for defining other members of the XLT family. The intended audience for this document consists of three groups: (1) programmers and analysts who desire to develop software applications that process XLT-compliant data streams, for example, by converting them to data streams in some other format or by deriving XLT-compliant data streams from some other format; (2) terminologists and other language specialists who desire to analyze a terminological data collection for representation in some XLT format, in particular in DXLT, or to define either a user-group subset of DXLT or some other XLT format, and (3) managers who desire to obtain an overview of the XLT family and its default format, DXLT.

Each of these three groups should be familiar with this Introduction. In addition to an understanding of this Introduction, terminologists and other language specialists need a basic understanding of the structure of XML documents and the data categories in ISO 12620. Besides having or obtaining this background information, they should study the body of this SALT document (sections 1-8) and annexes C and D, but they do not need the ability to write or modify XML DTDs or schemas. An introduction to the data categories of ISO 12620 is available through www.ttt.org. Programmers and analysts developing software applications to process DXLT and other XLT formats must have a thorough knowledge of XML and familiarity with the entirety of this SALT document and the various standards on which it is based.

0.2 A family of formats

The XLT family of formats is based on various international standards. The X in XLT stands for XML, indicating that each member of the XLT family is an XML application. The L in XLT stands for Lexicons, indicating that information from human-oriented lexicons and NLP lexicons (especially machine translation lexicons) can be incorporated into XLT. The NLP aspect of XLT is based on OLIF (see Otelo project, http://www.olif.net/olif/OLIF1.html). The T in XLT stands for Terminologies. The terminological approach of XLT is based on two ISO standards (ISO 12620 and ISO 12200). ISO 12620 provides an inventory of data categories (i.e., data element types, often implemented as column names in a table or field names in a record). ISO 12200, also known as MARTIF, provides the basis for the core structure for the family of formats. Thus, XLT is a standards-based family of formats for representing, manipulating, and sharing terminological data.

Each member of the XLT family differs from others only in which data categories are allowed and what values they can take. These choices are represented in a Data Constraint Specification (DCS) file. The following figure shows how XLT is based on the classic form-content distinction. Each combination of the core DTD/schema (which defines the structure) and a particular DCS file (which defines the allowed content) results in a format that is a member of the XLT family of formats.

[Figure: XLT Family of Formats. Form: the core DTD/schema. Content: DCS 1, DCS 2. Each combination yields one format: Format 1, Format 2, … Format n.]

0.3 Distinction between DXLT and other XLT formats

Default-XLT (DXLT) is one member of the XLT family of formats. The DCS file that defines DXLT is naturally called the Default DCS file of XLT. It is anticipated that the data categories in the Default DCS file will suffice for most dissemination and interchange tasks. It is thus expected that most members of the XLT family of formats will be defined using strict subsets of the Default DCS file. However, it is possible that some particular application will require data categories or data-category values not allowed by the Default DCS file. In that case, a DCS file can be defined that is not a subset of the Default DCS file. Subsets of the Default DCS file define "children" of DXLT, and custom DCS files that are not subsets of the Default DCS file define "siblings" of DXLT. XLT is simply the family of formats defined by the XLT core structure and all the various DCS files that combine with it.
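The child/sibling distinction can be illustrated with a minimal sketch. It assumes a DCS can be viewed as a mapping from data-category names to sets of allowed values; that in-memory view, the function names, and the sample values other than "fullForm" are hypothetical and are not taken from the actual Default DCS file.

```python
# Hypothetical in-memory view of a DCS: data category -> allowed picklist values.
# Only "fullForm" appears in this document; the other values are illustrative.
DEFAULT_DCS = {
    "termType": {"fullForm", "abbreviation", "acronym"},
    "partOfSpeech": {"noun", "verb", "adjective"},
}

def is_subset(candidate, default):
    """True if every category and every value in candidate is allowed by default."""
    return all(
        category in default and values <= default[category]
        for category, values in candidate.items()
    )

def relation_to_dxlt(candidate, default=DEFAULT_DCS):
    """Classify a DCS as a 'child' (subset of the default) or a 'sibling'."""
    return "child" if is_subset(candidate, default) else "sibling"

print(relation_to_dxlt({"termType": {"fullForm"}}))
print(relation_to_dxlt({"termType": {"fullForm", "nickname"}}))
```

Under these assumptions the first call reports a child and the second a sibling, because "nickname" is not among the values allowed by the stand-in default. A DCS identical to the default would simply be DXLT itself.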

The data models underlying terminology resources can be very complex, and therefore XLT formats can also be complex. Complexity is managed by identifying generalizations and breaking down complex objects into simpler modules that can each be understood on its own. The XLT approach abstracts away the structure found in a variety of formats and places it in the core structure module that contains very general data elements such as <descrip> (descriptive information) and <admin> (administrative information). The specialization of the core structure to specific data categories is represented in a DCS file, which may include the data category definition as a particular type of <descrip> element. This allows XLT-aware software to deal with a relatively simple core structure and adapt automatically to various members of the XLT family by consulting a DCS file, which has a very simple structure. Complexity is not magically eliminated, since the logical combination of the core structure and a particular DCS file can indeed be rather complex. But in XLT each of the two modules (form and content) can be dealt with separately, in accordance with basic principles of object-oriented design. No one terminology format can satisfy the needs of all user groups; however, based on experiments to date, most user groups can use the same core structure and accommodate their particular needs using a user-group-specific DCS file.

It is anticipated that the LISA OSCAR TermBase eXchange format (TBX) will be a subset of DXLT. Also, the European Union project called IATE is using an intermediate format (IATE-XLT) that is a subset of DXLT. Any two members of the XLT family are interoperable in so far as their respective DCS files are compatible.

SALT project — XML representations of Lexicons and Terminologies (XLT) — Default XLT Format (DXLT)

1 Scope

For various types of machine processing, including transmission over the Internet, terminological data can be represented using XML. The format defined by this SALT document is an XML application designed to support machine processing of terminological data in various computer environments, including standalone computers, the Internet, and intranets.

The format defined in this SALT document is designed to represent terminological data in a relatively "blind", that is, neutralized fashion for purposes of (a) interchange, (b) dissemination, and (c) data analysis. This SALT document is based on (1) an XML-compliant core structure compatible with “Negotiated MARTIF” (ISO 12200) and (2) an XML formalism called the Data Constraint Specification (DCS) schema for specifying constraints on the core structure. In addition this SALT document contains one set of constraints, the Default set of constraints, expressed in that formalism. Each set of constraints specifies (a) which data categories, primarily from ISO 12620, are allowed as instantiations of the meta data categories in the core structure, (b) which values the data categories can take, and (c) at which levels in the core structure data-category elements can appear. In addition, this set of constraints can de-activate selected modules and options of the core structure, such as which languages are allowed, whether certain text markup tags are allowed, and whether particular types of Complementary Information are allowed in the current family member. The format defined by the core structure and data-category specification included in this SALT document is called DXLT (the Default-XLT format).

This SALT document further provides guidelines for specifying user subsets of DXLT. The specification of a user subset does not involve modification of any XML DTDs or schemas. Other members of the XLT family of formats can be defined using the core structure and DCS formalism included in this document. XLT formats include no recursive XML elements, thus reducing the processing burden on import routines.

XLT formats are members of the collection of formats intended to be compliant with the ISO Technical Committee project called TMF (ISO/CD 16642 – Terminology Markup Framework). XLT is being developed in parallel with TMF (see Annex D). It is intended that DXLT and its subsets, in particular, will qualify as Terminology Markup Languages (TMLs) within TMF.

2 Relevant ISO standards

The following ISO standards are relevant. For dated references, subsequent amendments to, or revisions of, any of these publications do not apply. However, parties to agreements based on this document are encouraged to investigate the possibility of applying the most recent editions of the standards indicated below. For undated references, the latest edition of the standard referred to applies. Members of ISO and IEC maintain registers of currently valid International Standards.

The key ISO standards and projects upon which this document is based are: (1) ISO/CD 16642 (the TMF project), (2) ISO 12200:1999 (Negotiated MARTIF) as amended by TC37/SC3 NWI 318, (3) ISO 12620:1999 (Data Categories), (4) ISO 8879:1986 (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N 029:1998-12-06) to allow for the definition of XML, and (5) ISO 10646-1 (commonly known as Unicode).

Expanded list of relevant ISO standards (not including projects which are not yet International Standards):

- ISO 639:1988, Code for the representation of names of languages.

- ISO 639-2:1998, Codes for the representation of names of languages – Part 2: Alpha-3 code.

- ISO/IEC 646:1991, Information technology – ISO 7-bit coded character set for information interchange.

- ISO 1087:1990, Terminology – Vocabulary.

- ISO 1087-2:1999, Terminology work – Vocabulary – Part 2: Computer applications.

- ISO 3166-1:1997, Code for the representation of names of countries and their subdivisions – Part 1: Country codes

- ISO 8601:1988, Data elements and interchange formats – Information interchange – Representation of dates and times.

- ISO 8879:1986 (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N 029:1998-12-06) to allow for XML.

- ISO/IEC 10646-1:1993, Information technology—Universal Multiple-Octet Coded Character Set (UCS)—Part 1: Architecture and basic multilingual plane.

- ISO 12200 as amended, Computer applications in terminology – Machine-readable terminology interchange format (MARTIF) – Negotiated interchange.

- ISO 12620, Terminology – Computer applications – Data categories.

3 Terms and definitions

For the purposes of this SALT document, the following terms and definitions apply:

3.1

analysis

identification of the elements and structure of a terminological data collection so that the data fields, their types, and their relationships are made explicit

3.2

blindness

property of a data format indicating the degree to which the data are so rigorously defined that it is unnecessary for the importer to establish contact with the originator of the data in order to interpret them

NOTE: The property of blindness is achieved through the process of neutralization of differences between original formats. The metaphor behind the term blindness, which has its origin in the engineering phrase “blind transmission”, is that on the receiving end of a transmission, it is unnecessary to “see” who is sending the information in order to process it. Blindness is not an absolute property but is a matter of degree.

3.3

core-structure module

component of a format’s definition that specifies some elements as meta data categories and indicates which structural relations are allowed among elements

3.4

data category

result of the specification of a given data field [ISO 1087-2:2000] (i.e., a type of data field, such as definition)

NOTE: ISO 12620 is an inventory of data categories.

3.5

data stream

a sequence of bytes that corresponds to the contents of a document or file

NOTE: an XML document can be called a “document”, a “file”, or a “data stream” interchangeably

3.6

data constraint module

component of a format’s definition that constrains the core-structure module, e.g., by specifying which data categories are allowed and how each data category can be used

3.7

dissemination

representation of data in an intermediate format that allows a wide range of potential users to access and reuse the data

3.8

pre-negotiation

property of an intermediate format indicating that it is adapted to maximizing the preservation of both content and structural nuances found in the source data, even at the expense of blindness

NOTE: Pre-negotiation and blindness, although sometimes at odds with each other, should not be considered antonyms, but rather choices imposed by the tension between complete neutralization and complete preservation of information in a data collection.

3.9

interchange

transaction involving exporting data from and importing data into a terminological data collection where those data are represented in some intermediate format for the purpose of facilitating access to the data by computer programs

3.10

meta data category

a name used to group similar data categories together; thus, a category of data categories

NOTE: Meta data categories in XLT include descrip, admin and termNote.

3.11

modularity

property of an electronic format whereby the complexity of the structure and content treated by the format is managed by defining sub-components that can be studied separately, side by side, and then logically combined

NOTE: In XLT, one module defines the core structure using meta data categories, and the other module specifies constraints on the core structure module, including which data categories can instantiate each meta data category.

3.12

metadata registry

description of the fields in a database for the purpose of facilitating understanding by outside parties [cf. definition in ISO 11179].

3.13

neutralization

process whereby the differences between the representation of data elements from various original data collections are reduced by re-expressing them using the pre-specified structural features, data categories, and data-category values of an intermediate format

3.14

representation

expression of data content and structural relationships in an intermediate format outside the environment of the originating data collection

NOTE: Representation may involve the retention of all or part of the information from the originating data collection; in addition, it can involve various degrees of neutralization and thus tend toward either blindness or pre-negotiation.

3.15

XML™ (eXtensible Markup Language)

universal format for structured documents and data on the World Wide Web (WWW); a particular subset of SGML.

NOTE: XSLT is a programming language specifically designed for manipulating XML documents

4 Requirements for DXLT documents

For an XML document to be considered DXLT-compliant, it must qualify on three counts: (1) It must be a well-formed XML document. (Well-formedness is a purely formal XML notion based on such criteria as all elements being explicitly empty or explicitly terminated and not overlapping.) (2) It must be valid according to the XLT core-structure module (described informally in section 6 and defined formally by the XML DTD in Annex A). (Validity is also a formal XML notion.) (3) It must adhere to the constraints in the Default data constraint specification (DCS) module or user-defined subset thereof currently applicable. These three counts are levels of conformance to the DXLT specification. Requirements for other members of the XLT family are similar, the only difference being that the third count requires adherence to the particular DCS module associated with that family member.

In practice, DXLT documents are typically created by an export routine in some piece of HLT (Human Language Technology) software, and they can either be displayed using a tool such as XSLT or be processed by an import routine that is part of some other piece of HLT software. So long as the XML documents that are created and processed are DXLT-compliant, it is not necessary for a human to inspect them and no formal conformance check is necessary. However, in some circumstances, such as dealing with suspected data corruption, DXLT-compliance can be checked using DXLT-validation software.

The first two aspects of DXLT-compliance can be checked by validating the DXLT document against the DTD of the core structure using a validating XML parser, and the third aspect can be checked using a custom software application that checks for adherence to the constraints in the DCS module.

As noted above, it is possible to validate whether any given well-formed XML data stream is DXLT compliant. However, this validation is a formal process and does not ensure that appropriate terminological methods have been used to create the data or that the content of the data categories is accurate. Validation may determine, for instance, that the value of an XML element such as term type is not one of the allowed values, but validation cannot detect a poorly written definition. See Figure 4.1 for examples of these distinctions in DXLT. The first part is not well-formed, since the first <descrip> element has a spelling error in the end tag and since the second <descrip> has no closing tag at all. The second part is well-formed but not valid, since the core-structure module of DXLT does not allow for a <desskrip> tag. The third part conforms to the XLT DTD but not to the Default DCS of DXLT, since there is no DXLT data category called "conflagration". The fourth part is valid but not accurate, since a kitten is not a dog or wolf.

Not well-formed:

<term>kitten</term>

<descrip type="definition">content</decrip>

<descrip type="definition">other content

Well-formed but not valid:

<term>kitten</term>

<desskrip type="definition">content</desskrip>

<descrip type="definition">other content</descrip>

Valid but not DCS-adherent:

<term>kitten</term>

<descrip type="conflagration">content</descrip>

Valid and DCS-adherent but not accurate:

<term>kitten</term>

<descrip type="definition">a young dog (canis lupus)</descrip>

Figure 4.1 — Well-formedness, validity, adherence, and accuracy
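The levels distinguished in Figure 4.1 can be sketched in code. The following is a minimal illustration, not DXLT-validation software: the Python standard library has no DTD validator, so validity is approximated by a hypothetical set of element names (CORE_ELEMENTS) standing in for the core DTD of Annex A, and DCS adherence by a hypothetical table (DCS_TYPES) standing in for the Default DCS. Each fragment is wrapped in a tig element so that it has a single root. Accuracy of content cannot be checked this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins: the real inventories live in the core DTD (Annex A)
# and in the Default DCS file.
CORE_ELEMENTS = {"tig", "term", "descrip", "termNote"}
DCS_TYPES = {
    "descrip": {"definition", "subjectField"},
    "termNote": {"termType"},
}

def conformance_level(fragment):
    """Report the first level at which a fragment fails, as in Figure 4.1."""
    try:
        # Count 1: well-formedness (the parser rejects mismatched or unclosed tags).
        root = ET.fromstring("<tig>" + fragment + "</tig>")
    except ET.ParseError:
        return "not well-formed"
    # Count 2 (approximated): only element names known to the core structure.
    if any(el.tag not in CORE_ELEMENTS for el in root.iter()):
        return "well-formed but not valid"
    # Count 3: the type attribute must name a data category allowed by the DCS.
    for el in root.iter():
        allowed = DCS_TYPES.get(el.tag)
        if allowed is not None and el.get("type") not in allowed:
            return "valid but not DCS-adherent"
    return "valid and DCS-adherent"

print(conformance_level('<term>kitten</term><descrip type="conflagration">content</descrip>'))
```

The checks run in the order of the three counts, so a fragment is reported at the first level it fails. The last fragment of Figure 4.1 passes all three checks even though its definition is wrong.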

5 An example of a DXLT document

The following is an example of a simple but complete DXLT document. The numbers in square brackets to the left of certain lines are not part of the DXLT document. They serve as footnote numbers to the comments below.

[1] <?xml version='1.0'?>

<!DOCTYPE martif PUBLIC "ISO 12200:1999A//DTD MARTIF core (XLTcdV04)//EN">

[2] <martif type='DXLT' lang='en' >

[3] <martifHeader>

<fileDesc><sourceDesc>from an Oracle corporation termBase</sourceDesc></fileDesc>

<encodingDesc>DXLTdV04</encodingDesc>

</martifHeader>

[4] <text> <body>

[5] <termEntry id='ID67'>

[6] <descrip type='subjectField'>manufacturing</descrip>

[7] <descrip type='definition'>A value between 0 and 1 used in …</descrip>

[8] <langSet lang='en'>

[9] <tig>

<term>alpha smoothing factor</term>

[10] <termNote type='termType' >fullForm</termNote>

[11] </tig>

[12] </langSet>

[13] <langSet lang='hu'>

[14] <tig><term>Alfa simítási tényezõ </term></tig>

[15] </langSet>

[16] </termEntry>

[17] </body> </text>

[18] </martif>

Only a minimal acquaintance with XML is assumed in the following explanation. Indeed an acquaintance with HTML from building simple web pages, along with the knowledge that XML allows user-defined tag names whereas HTML comes with a set of pre-defined tag names, should be sufficient to allow understanding of the following explanation. For key DXLT elements, the correspondence to the structural component of the meta-model in the ISO TC 37 TMF project is given.

[1] <?xml ... : These lines state that the following lines constitute an XML document that conforms to version 1.0 of the definition of XML by the World Wide Web Consortium (W3C) and to the DXLT DTD.

[2] <martif ...: This line states that this particular XML document is a DXLT document and thus, along with other members of the XLT family, can be validated against a specification of the XLT core structure, which, for this document, is called XLTcdV04, and can be checked for adherence against the master Default DCS module. The lang attribute indicates that the default language for text in this document is English (ISO 639 code 'en').

[3] <martifHeader ...: These lines provide global information about the collection: specifically, a file description indicating that the example was derived from an entry in a termbase used at Oracle corporation and that the DXLT DCS (DXLTdV04, 'd' for DCS), not to be confused with the XLT core DTD (XLTcdV04, 'cd' for core DTD), is being used.

[4] <text> <body>: The text element surrounds the body element, which contains the collection of concept-oriented "Terminological Entry" (<termEntry>) elements.

[5] <termEntry ...: Each termEntry element is one instance of the "Terminological Entry" object class. The id attribute has a value that is unique throughout the document, making it possible for other elements to point unambiguously to this element.

[6] <descrip type='subjectField' ...: The subject field data category is authorized by the DCS (Data Constraint Specification) mentioned above. It consists of a meta data category element (descrip) with the specific data category indicated in the value of the type attribute.

[7] <descrip type='definition' ...: This piece of descriptive information is also associated with the concept.

[8] <langSet lang='en'>: The langSet element corresponds to a "Language Section" object class, according to which a Terminological Entry consists of associated information and language sections. This line begins the English Language Section.

[9] <tig><term> ...: The meta-model states that a Language Section consists of instances of a "Term Section" object class, which, in DXLT, corresponds to a <tig> (or <ntig>) element. An instance of a Term Section consists of a term and associated information, which in this case is the term type. The name tig stands for term information group.

[10] <termNote type='termType' ...: This piece of descriptive information associated with the term is the 12620 data category "term type". Its value is "fullForm". A termNote tag is used instead of descrip since the information is closely associated with the term itself rather than the concept being described.

[11] </tig>: This element simply ends the current Term Section.

[12] </langSet>: This element ends the English Language Section.

[13] <langSet lang='hu'>: This element begins the Hungarian Language Section.

[14] <tig> ...: This line consists of a Term Section with a Hungarian term but no definition and no explicit term type. Each character of the term that is not found in ISO 646 is represented as a hex character reference corresponding directly to a Unicode character. The actual Hungarian term is "Alfa simítási tényezõ". Note that the final character "õ" (o-tilde) should more properly be an o-double-acute, which is represented by the following Unicode hex character reference: "&#x0151;", a character not available in a typical font. In XML, a Unicode hex character reference consists of "&#x" + four hex digits from the Unicode standard + a semicolon.

[15] </langSet>: This element ends the Hungarian Language Section.

[16] </termEntry>: This element ends the current Terminological Entry.

[17] </body> </text>: These elements end the set of terminological entries, which in this case consists of only one entry, and the XLT text element, which is the composite of terminological entries and other resources called Complementary Information in the meta-model. In this DXLT document, there are no resources outside the terminological entry. If there were, they would be in the XLT element back.

[18] </martif>: This element ends the entire DXLT document.
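The hex character references described in comment [14] can be produced mechanically. The sketch below assumes that "not found in ISO 646" can be approximated by "outside the 7-bit ASCII range"; the function name to_char_refs is hypothetical. The standard-library function html.unescape is used only to show that the references decode back to the same characters.

```python
import html

def to_char_refs(text):
    """Replace each character outside 7-bit ASCII with an XML hex character reference."""
    # Four hex digits, zero-padded; code points above U+FFFF would need more digits.
    return "".join(ch if ord(ch) < 128 else "&#x%04X;" % ord(ch) for ch in text)

term = "Alfa sim\u00edt\u00e1si t\u00e9nyez\u0151"   # ends in o-double-acute (U+0151)
encoded = to_char_refs(term)
print(encoded)                    # Alfa sim&#x00ED;t&#x00E1;si t&#x00E9;nyez&#x0151;
print(html.unescape(encoded) == term)
```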

This sample DXLT entry has several properties:

1. It corresponds directly to the meta-model in the TMF project.

2. It is a well-formed XML document.

3. It conforms to DXLT, by being well-formed as well as being valid according to the core structure and by adhering to the master data constraint specification (DCS) module of DXLT.
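As an illustration of how an import routine might walk the sample entry, the following sketch parses a trimmed copy of it and collects the terms of each Language Section. The DOCTYPE line and the definition are left out so that the snippet is self-contained, the Hungarian term is written with hex character references, and only tig (not ntig) Term Sections are handled. The function name terms_by_language is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document above (DOCTYPE and definition omitted).
SAMPLE = """<martif type='DXLT' lang='en'>
 <martifHeader>
  <fileDesc><sourceDesc>from an Oracle corporation termBase</sourceDesc></fileDesc>
  <encodingDesc>DXLTdV04</encodingDesc>
 </martifHeader>
 <text><body>
  <termEntry id='ID67'>
   <descrip type='subjectField'>manufacturing</descrip>
   <langSet lang='en'>
    <tig>
     <term>alpha smoothing factor</term>
     <termNote type='termType'>fullForm</termNote>
    </tig>
   </langSet>
   <langSet lang='hu'>
    <tig><term>Alfa sim&#x00ED;t&#x00E1;si t&#x00E9;nyez&#x0151;</term></tig>
   </langSet>
  </termEntry>
 </body></text>
</martif>"""

def terms_by_language(xml_text):
    """Map each termEntry id to {language code: [terms]}, following the entry hierarchy."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.iter("termEntry"):              # Terminological Entry level
        for lang_set in entry.findall("langSet"):      # Language Section level
            terms = [tig.findtext("term") for tig in lang_set.findall("tig")]  # Term Sections
            result.setdefault(entry.get("id"), {})[lang_set.get("lang")] = terms
    return result

print(terms_by_language(SAMPLE))
```

The three nested steps mirror the Terminological Entry, Language Section, and Term Section levels of the meta-model.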

6 Definition of the core-structure module

6.1 General

This section defines the core structure of XLT informally, particularly for a human analyst who is seeking either to understand an XLT document or to analyze source or target terminological data in order to prepare a mapping that a programmer can use to write an automatic conversion routine from the source format to, for example, DXLT or from DXLT to the target format.

6.2 Hierarchical overview

The highest-level XML element in an XLT document is the "martif" element, which consists of a "martifHeader" element and a "text" element. (See Figure 6.1.)

The text element in Figure 6.1 consists of terminological entries (that together make up the XLT body element) and "Complementary Information" (a meta-model object class) that are found in the front and back elements.

The martifHeader element corresponds to "Global Information" in the meta-model and consists of a description of the whole terminological data collection (in the fileDesc element), information about the data-category specification and character encoding (in the encodingDesc element), and a history of major revisions to the collection (in the revisionDesc element).

A question mark after an element in the box-and-line diagrams below indicates that it is optional.

See Annex A for more detail on these elements.

Figure 6.1 — The highest-level elements

Each terminological concept entry in the body element is called a termEntry (see Figure 6.2) and follows the structure of the meta-model.

The "auxInfo" element in Figure 6.2 corresponds to "Terminology-related Information" in the meta-model, and each piece of terminology-related information can be associated with any one of three levels: the Terminological Entry level (termEntry in XLT, i.e. the concept level), the Language-section level (langSet in XLT), and the Term-section level (ntig, or its simplified version, tig, in XLT). The termNote and termNoteGrp elements at the Term-section level are also part of Terminology-related Information in the meta-model and consist of term-related descriptive elements that can only appear at the Term-section level and below. The termCompList element corresponds to the "Term Component Section" object class of the meta-model.

[Figure 6.2 labels: entry-level, language-level, term-level]

Figure 6.2 — The structure of a terminological entry in body

In XLT, auxInfo consists of any combination of the following elements:

descrip, descripGrp, admin, adminGrp, transacGrp, note, ref, and xref.

A ref element is a cross-reference that points somewhere inside the martif element. An xref element is a cross-reference that points to an external object using a URI (a URL or other Web address). A note element, as expected, is a note. These three elements appear at various levels to allow the creation of links and the recording of supplementary information.

A transacGrp element gives information about a transaction. ISO 12620 (A.10.2) states that the two terminology management functions concerning a transaction are date and responsibility. A date is specified by a date element, and a responsibility is specified by a transacNote element. Thus, a transacGrp contains a transac element that describes the transaction, accompanied by any combination of transacNote, date, note, ref, and xref elements that apply to the transaction. Any date in XLT must appear within a transacGrp, even if an implicit transaction must be made explicit.

An adminGrp element is similar to a transacGrp in that it groups information pertaining to another element, in this case an admin rather than a transac, specifically, a combination of adminNote, note, ref, and xref. An admin is a simplified adminGrp in which there is just a single admin element and the adminGrp container has been omitted.

A descripGrp element consists of a descrip element followed by any combination of descripNote, admin, adminGrp, transacGrp, note, ref, and xref elements.

The descrip and admin elements are examples of meta data categories in XLT. Each instance of a meta data category in XLT is an element that is specialized by the value of its type attribute. The various instantiations of the meta data categories are given in section 7. The DXLT DCS file restricts each instantiation of a descrip to certain levels.
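A level restriction of this kind could be checked roughly as follows. This is a sketch under stated assumptions: the table ALLOWED_LEVELS is hypothetical and does not reproduce the actual constraints of the DXLT DCS file, and the level of a descrip is taken to be its nearest enclosing termEntry, langSet, tig, or ntig element (so that a descrip inside a descripGrp inherits the level of the group).

```python
import xml.etree.ElementTree as ET

LEVEL_ELEMENTS = {"termEntry", "langSet", "tig", "ntig"}

# Hypothetical level constraints; the real ones are stated in the DXLT DCS file.
ALLOWED_LEVELS = {
    "subjectField": {"termEntry"},
    "definition": {"termEntry", "langSet", "tig", "ntig"},
}

def misplaced_descrips(entry_xml):
    """Return (data category, level) pairs for descrip elements at a disallowed level."""
    root = ET.fromstring(entry_xml)
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent = {child: par for par in root.iter() for child in par}
    problems = []
    for descrip in root.iter("descrip"):
        node = parent.get(descrip)
        while node is not None and node.tag not in LEVEL_ELEMENTS:
            node = parent.get(node)      # climb past descripGrp and similar wrappers
        level = node.tag if node is not None else None
        if level not in ALLOWED_LEVELS.get(descrip.get("type"), set()):
            problems.append((descrip.get("type"), level))
    return problems

entry = ("<termEntry id='ID67'><descrip type='subjectField'>manufacturing</descrip>"
         "<langSet lang='en'><tig><term>alpha smoothing factor</term>"
         "<descrip type='subjectField'>manufacturing</descrip></tig></langSet></termEntry>")
print(misplaced_descrips(entry))
```

With the stand-in table, only the second subjectField (inside the tig) is reported.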

A termNoteGrp element, like other …Grp elements, consists of a base element, in this case a termNote, and auxiliary information, in this case, admin, adminGrp, transacGrp, note, ref, and xref elements. A comparison with descripGrp shows that the difference is that there are no descrip elements in a termNoteGrp. This is because descrip contains concept-related data categories that do not apply to the term itself.

A termCompList element shows the internal composition of a term and consists of a combination of termCompGrp and, in the simplified case, termComp elements. A termCompGrp, consistent with the pattern set by other ...Grp elements, consists of a termComp element and a combination of termNote, termNoteGrp, admin, adminGrp, transacGrp, note, ref, and xref elements that apply to it. Each termComp element contains some component of a term, such as one of the words of which it is composed.
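The content patterns of the ...Grp elements described above can be gathered into one table and checked with a few lines of code. This is a rough sketch, not a substitute for validation against the DTD in Annex A: it assumes that each group starts with exactly one base element and that order among the remaining children does not matter, and the names GROUPS and check_group are hypothetical.

```python
import xml.etree.ElementTree as ET

COMMON = {"note", "ref", "xref"}

# Group element -> (base element, elements that may accompany it), per section 6.2.
GROUPS = {
    "transacGrp": ("transac", COMMON | {"transacNote", "date"}),
    "adminGrp": ("admin", COMMON | {"adminNote"}),
    "descripGrp": ("descrip", COMMON | {"descripNote", "admin", "adminGrp", "transacGrp"}),
    "termNoteGrp": ("termNote", COMMON | {"admin", "adminGrp", "transacGrp"}),
    "termCompGrp": ("termComp", COMMON | {"termNote", "termNoteGrp", "admin",
                                          "adminGrp", "transacGrp"}),
}

def check_group(xml_text):
    """True if a ...Grp element has its base element first and only allowed companions."""
    element = ET.fromstring(xml_text)
    base, companions = GROUPS[element.tag]
    children = [child.tag for child in element]
    if not children or children[0] != base:
        return False
    return all(tag in companions for tag in children[1:])

print(check_group("<descripGrp><descrip type='definition'>x</descrip><note>n</note></descripGrp>"))
print(check_group("<termNoteGrp><termNote type='termType'>fullForm</termNote>"
                  "<descrip type='definition'>x</descrip></termNoteGrp>"))
```

The second call fails because a termNoteGrp may not contain a descrip, as noted above.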

6.3 Text elements, i.e., elements that contain plain, basic or note text

In XLT, elements such as descrip, descripNote, admin, adminNote, transac, transacNote, termNote, note, ref, and xref, contain text. Sometimes, the permissible values of the element are restricted to a picklist. In other cases, the element can contain free text. There are three types of free text in XLT: plain, basic, and note. Plain text (#PCDATA) is defined by the XML specification. It contains no elements, only characters and character entities. Basic text is plain text with the addition of optional embedded hi elements. A hi element highlights a segment of text and optionally points to another element. One use of hi is to mark an entailed term inside a definition. A term element contains basic text. Note text, which is used in definitions and contextual examples and similar elements, allows the following additional embedded elements besides hi: foreign, bpt, ept, it, ph, and ut. The foreign element is used to mark a segment of text that is in a different language from the surrounding text, e.g. "a <foreign lang='fr'> pamplemousse </foreign> is a grapefruit."
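The three kinds of free text can be told apart by the embedded elements they contain. The following minimal sketch classifies the content of a single element; the function name text_type and the fallback label are hypothetical, and picklist-valued elements are not considered.

```python
import xml.etree.ElementTree as ET

BASIC_ELEMENTS = {"hi"}
NOTE_ELEMENTS = BASIC_ELEMENTS | {"foreign", "bpt", "ept", "it", "ph", "ut"}

def text_type(element_xml):
    """Classify an element's content as plain, basic, or note text."""
    element = ET.fromstring(element_xml)
    embedded = {child.tag for child in element.iter() if child is not element}
    if not embedded:
        return "plain"            # characters only, no embedded elements
    if embedded <= BASIC_ELEMENTS:
        return "basic"            # plain text plus optional hi elements
    if embedded <= NOTE_ELEMENTS:
        return "note"             # hi, foreign, and the meta-markup tags
    return "not XLT text"

print(text_type("<term>alpha <hi>smoothing</hi> factor</term>"))
print(text_type("<descrip type='definition'>a <foreign lang='fr'> pamplemousse </foreign>"
                " is a grapefruit.</descrip>"))
```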

The five elements, bpt, ept, it, ph, and ut, are meta-markup tags that are used to mark up (i.e., encapsulate) markup to distinguish it from text. They allow XLT elements to contain various kinds of markup that needs to be retained but not necessarily processed during terminology management functions. Any such enclosed markup is modified so that start-tag characters ('<') become entities (&lt;) and ampersands become entities (&amp;). If a piece of markup to be encapsulated consists of two paired pieces of markup, such as the markup used to show that a piece of text is to be in bold or italics, then bpt and ept (begin and end paired tags) are used. If the markup to be encapsulated consists of one piece that would be paired except that the other piece was cut off and appears outside the current element, then an it (isolated tag) is used. If the piece of markup to be encapsulated stands on its own, marking a place such as a footnote, then ph (placeholder) is used. If the categorization of the piece of markup is unknown, then ut (unknown tag) is used.

Suppose one has the following segment of text to put into an XML element in XLT:

"We need a big dog."

The marked-up text underlying this presentation might be:

"We need a <bold> big </bold> dog."

This is not a problem for meta-markup tags. One can put it into an XLT element as follows:

"We need a <bpt i='1'>&lt;bold></bpt> big <ept i='1'>&lt;/bold></ept> dog."

Then one can get the original segment back by taking out the meta-markup tags and converting any "&lt;" inside a meta-markup tag back to "<".

Now consider the following segment (which uses SGML markup):

"We need a big but < 50 pound dog", which might have the following underlying SGML markup:

"We need a <bold> big but &lt; 50 pound </bold> dog"

(i.e. a "big but less-than-fifty-pound dog" in which the less-than sign "<" has already been converted to an SGML entity in the source segment before placing it into XLT, since in this case the less-than sign is a literal rather than an escape character).

One would put it into an XLT segment as follows:

We need a <bpt i='1'>&lt;bold></bpt> big but &amp;lt; 50 pound <ept i='1'>&lt;/bold></ept> dog.

Then, when we try to reconstruct the original segment, we will get what we started with, since the &amp; will be converted back to an ampersand.

HTML tags are one kind of markup that may be enclosed inside meta-markup elements. This allows the markup to be retained and processed during display or import without unduly complicating the XLT core structure by including the XHTML DTD in it. Any kind of markup, including RTF, can be encapsulated in meta-markup tags and later retrieved without loss of information. The XLT approach to meta-markup is borrowed from the TMX format of LISA, an ISO/TC 37 liaison organization and supporter of the SALT project.
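The round trip described above can be sketched in Python. This is only an illustrative sketch, not part of the XLT specification: the function names escape_native, wrap_pair, and restore are hypothetical, and restore flattens every embedded element, so it is suitable only for note text whose embedded elements are all meta-markup tags.

```python
import xml.etree.ElementTree as ET


def escape_native(markup: str) -> str:
    """Escape native markup so it can sit inside an XLT element.

    Ampersands are escaped first so that the '&' introduced by '&lt;'
    is not escaped a second time.
    """
    return markup.replace("&", "&amp;").replace("<", "&lt;")


def wrap_pair(open_tag: str, inner: str, close_tag: str, i: int = 1) -> str:
    """Encapsulate a paired piece of native markup with bpt and ept."""
    return (
        f"<bpt i='{i}'>{escape_native(open_tag)}</bpt>"
        f"{escape_native(inner)}"
        f"<ept i='{i}'>{escape_native(close_tag)}</ept>"
    )


def restore(note_text: str) -> str:
    """Remove the meta-markup tags and undo the escaping.

    Simplification (assumption): all embedded elements are flattened,
    so hi and foreign elements would lose their tags as well.
    """
    root = ET.fromstring(f"<wrapper>{note_text}</wrapper>")
    return "".join(root.itertext())


# The SGML segment from the second example, with its literal less-than
# sign already written as an SGML entity.
original = "We need a <bold> big but &lt; 50 pound </bold> dog."
encoded = (
    "We need a "
    + wrap_pair("<bold>", " big but &lt; 50 pound ", "</bold>")
    + " dog."
)
```

Parsing the encoded string with an XML parser performs the entity conversion automatically, which is why restore needs no explicit replacement step.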

6.4 Meta data categories

The meta data categories of DXLT are as follows. Each of them can potentially be given multiple instantiations in a DCS module, each instantiation specifying one data category. In DXLT, the specific data category instantiation is indicated by the value of a type attribute (e.g., <descrip type='definition'>).

a) termNote
(A termNoteGrp element receives the data category of its termNote element.)

b) termComp
(Each termComp element in a termCompList inherits the data category of the list; then each termCompGrp element receives the data category of its termComp element.)

c) admin
(An adminGrp element receives the data category of its admin.)

d) adminNote

e) transac
(A transacGrp element receives the data category of its transac element.)

f) transacNote

g) descrip
(A descripGrp element receives the data category of its descrip element.)

h) descripNote

i) ref

j) xref

k) refObject
(Each refObject element in a refObjectList inherits the data category of the list.)

In general, a …Grp element in DXLT receives the data category of the first element of the group, and all the elements of a …List element inherit the data category of the list. If the …Grp elements were not optional in the simple case of a single element, then the data category would be specified on the …Grp element directly.
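The rule that a group receives the data category of its first element can be sketched as follows. This is a minimal sketch under stated assumptions: the function name data_category is hypothetical, the sample fragment is invented for illustration, and inheritance from a enclosing list element is not handled.

```python
import xml.etree.ElementTree as ET


def data_category(element):
    """Return the data category of an element.

    For an element whose tag ends in 'Grp', the category is taken from
    the type attribute of its first child; otherwise it is read from
    the element's own type attribute. Returns None if no type is found.
    """
    if element.tag.endswith("Grp"):
        first = next(iter(element), None)
        return None if first is None else first.get("type")
    return element.get("type")


# Invented sample: a definition qualified by a descripNote.
group = ET.fromstring(
    "<descripGrp>"
    "<descrip type='definition'>a dog bred for hunting</descrip>"
    "<descripNote type='definitionType'>intensionalDefinition</descripNote>"
    "</descripGrp>"
)
```

Here data_category(group) yields the category of the descrip element, while the descripNote keeps its own category.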

A term is not formally a meta data category in DXLT, but the termType data category used with a termNote element is used to specify term type, thus rendering a term element an indirect meta data category.

6.5 Attributes

The main attributes used in DXLT are lang (language), type, id (to identify an element uniquely), and target (to point to an ID). Additional attributes are found in Annex A.

The value of the lang attribute inherits downward through the implied tree structure of the XML document unless overridden by another lang attribute. The martif element is required to have a lang attribute. The language specified in the martif element becomes the working language of the entire DXLT file. Each langSet element must also specify a language that applies to that Language Section. Thus, a definition at Terminological Entry level is assumed to be in the working language of the martif file unless otherwise specified, and a note in a Language Section is assumed to be in the language of that Language Section unless otherwise specified.
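Downward inheritance of the lang attribute can be sketched by walking from an element up to the root until a lang attribute is found. The sketch below is illustrative only: effective_lang is a hypothetical helper, and the sample document is deliberately minimal (it omits the martifHeader and would not validate against the core-structure DTD).

```python
import xml.etree.ElementTree as ET


def effective_lang(root, element):
    """Return the language that applies to element, or None.

    ElementTree elements do not know their parent, so a child-to-parent
    map is built first; the nearest lang attribute on the path to the
    root wins.
    """
    parents = {child: parent for parent in root.iter() for child in parent}
    node = element
    while node is not None:
        lang = node.get("lang")
        if lang is not None:
            return lang
        node = parents.get(node)
    return None


# Invented, minimal sample: working language 'en', one French langSet.
SAMPLE = (
    "<martif type='DXLT' lang='en'><text><body>"
    "<termEntry id='e1'>"
    "<descrip type='definition'>a citrus fruit</descrip>"
    "<langSet lang='fr'><tig><term>pamplemousse</term>"
    "<note>usage varies by region</note></tig></langSet>"
    "</termEntry>"
    "</body></text></martif>"
)

root = ET.fromstring(SAMPLE)
definition = root.find(".//descrip")
note = root.find(".//note")
```

The definition at entry level resolves to the working language of the file, and the note inside the Language Section resolves to the language of that section.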

The allowed values of the lang attribute in XLT are the same as the allowed values of the lang attribute in TMX.

The id and target attributes work together to point unambiguously between elements in the same martif file. For example, one entry:

<termEntry id="c5574">

...(entry for "hunting dog")

</termEntry>

could be pointed to by another entry:

<descrip type="superordinateConceptGeneric" target="c5574">hunting dog</descrip>

…(entry for "Retriever" [a type of hunting dog])

</termEntry>

The redundant content "hunting dog" in the second entry is for display purposes. It provides a name for the link to the other entry that can be viewed by a human who is deciding whether to follow the link.
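Resolving such a pointer amounts to looking up the element whose id equals the value of the target attribute. The following sketch assumes a simplified fragment: the id c5575 for the second entry, the surrounding body element, and the helper names build_id_index and resolve are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified fragment; only the id c5574 and the descrip element come
# from the example above, the rest is assumed.
SAMPLE = """
<body>
  <termEntry id="c5574">
    <langSet lang="en"><tig><term>hunting dog</term></tig></langSet>
  </termEntry>
  <termEntry id="c5575">
    <descrip type="superordinateConceptGeneric" target="c5574">hunting dog</descrip>
    <langSet lang="en"><tig><term>retriever</term></tig></langSet>
  </termEntry>
</body>
"""


def build_id_index(root):
    """Map every id value in the document to its element."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve(root, pointer):
    """Return the element that pointer's target attribute refers to, or None."""
    return build_id_index(root).get(pointer.get("target"))


root = ET.fromstring(SAMPLE)
pointer = root.find(".//descrip[@target]")
target = resolve(root, pointer)
```

A real application would probably build the index once per file rather than once per lookup.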

7 Definition of the data-category module selected from ISO 12620

7.1 General

This section describes the default data constraint specification (DCS) module for DXLT, which is based on data categories selected from ISO 12620 to support somewhat blind interchange. The formal, machine-processable version of the DXLT master DCS module can be found in Annex B. It is referred to as the master DCS of DXLT when distinguishing it from a particular user-group subset DCS.

NOTE: The list orders the data categories according to the section of ISO 12620 in which they are described. It is also the order in which they appear in the master DCS module.

7.2 Systematic listing of data categories in DXLT

The following tables define the DXLT master DCS (data constraint specification), which describes the data categories in DXLT that are implemented as XML elements that instantiate a meta data category. The remaining data categories are implemented as the term element, the note element, the date element, the lang attribute, the id attribute, the hi element, and the foreign element. These basic data categories are mentioned in section 6, since they are part of the core structure.

Guidelines for encoding particular data categories in DXLT as XML are given in Annex C.

Each data category other than the basic data categories is related to the meta-model by being classified as either administrative or descriptive. Descriptive data categories may describe either a concept or a term. All data categories that use the Martif tag name descrip are concept-related descriptive data categories. All data categories that use the Martif tag names termNote or termComp are term-related descriptive data categories. All data categories that use the tag name admin are administrative. Descriptive and administrative data categories are further divided into properties and relations. In DXLT, a data category is a relation if the target attribute is allowed by the DCS file. Notes can be either administrative or descriptive.
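The classification rules in the preceding paragraph can be sketched as a small function. This is only an illustration: the function name classify and the label "unclassified" (used for tag names the paragraph does not assign to a family, such as notes) are assumptions, not part of DXLT.

```python
def classify(tag_name: str, target_allowed: bool) -> str:
    """Classify a data category from its Martif tag name.

    target_allowed states whether the DCS permits a target attribute,
    which is what makes a data category a relation rather than a
    property.
    """
    if tag_name == "descrip":
        family = "concept-related descriptive"
    elif tag_name in ("termNote", "termComp"):
        family = "term-related descriptive"
    elif tag_name == "admin":
        family = "administrative"
    else:
        # Assumption: anything else is left unclassified in this sketch.
        family = "unclassified"
    kind = "relation" if target_allowed else "property"
    return f"{family} {kind}"
```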

In the following table (split into parts for convenience), the first column (ISO 12620) is the position code of the data category in ISO 12620. The second column (Martif Data Category Name) is the name of that data category when given as the value of the type attribute. Typically, it consists of the name in ISO 12620 with spaces removed and the first letter of the second and subsequent words upper-cased. The third column (TextType) tells what kind of text is allowed in the element. The fourth column (Target) tells whether this element can take a target attribute, in which case it indicates what kind of element can be targeted. The fifth column (Martif Tag Name) tells which meta data category is used in DXLT for this data category. The sixth column (Level) gives any exceptional information about the levels in the meta-model at which a particular data category can appear. Admin elements can appear at any level. Descrip elements can appear at the entry, language, or term levels unless otherwise restricted (using codes TE for Terminological Entry, LS for Language Section, and TM for term). TermNote elements can appear only at the term level, unless authorized (by a TC code) to appear also at the Term Component level.

The last column (column seven) contains various comments. The code PA means that this data category is not yet officially in ISO 12620 and is thus Pending Approval (PA). Picklists are found after the tables in footnotes. If the comment column contains a position code of a data category, this indicates that the listed data category has been combined with the data category of the current row.

Data categories that do not have a picklist in the DXLT master DCS can have a picklist in a user-group subset DCS of DXLT (see section 8) if the user-group in question can agree on a picklist for that data category. One obvious candidate for a user-group picklist is partOfSpeech, for which there is no agreed-on picklist when all the languages of the world and all linguistic theories are to be taken into account.
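The typical naming convention for the second column (spaces removed, first letter of the second and subsequent words upper-cased) can be sketched as follows. The function name martif_name is hypothetical, and because the convention is only typical, some names in the tables will not be reproduced by it.

```python
def martif_name(iso_name: str) -> str:
    """Derive a Martif data category name from an ISO 12620 name.

    The first word is kept as written; each later word has its first
    letter upper-cased, and all spaces are removed.
    """
    words = iso_name.split()
    if not words:
        return ""
    return words[0] + "".join(w[0].upper() + w[1:] for w in words[1:])
```

For instance, martif_name("part of speech") gives partOfSpeech.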

List 7.2

Basic data categories:

- term [A.01]: <term>

- highlighted text: <hi>

- foreign language text: <foreign>

- language [A.10.07]: the lang attribute, e.g., lang="es" on an element

- element identifier [A.10.15]: the id attribute, e.g. <termEntry id="eid-45631">

- date [A.10.02.01]: <date>

- comment [A.08]: <note>

Table 7.2a-n

Table 7.2a — Types of terms (12620: A.2.1)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.02.01	termType	picklist	none	termNote		f1
A.02.01.05	commonNameFor	basicText	term	termNote
A.02.01.08	abbreviatedFormFor	basicText	term	termNote

f1: picklist: mainEntryTerm, synonym, internationalScientificTerm, commonName, internationalism, fullForm, shortForm, abbreviatedForm, variant, transliteratedForm, transcribedForm, symbol, formula, equation, logicalExpression, sku, partNumber, phraseologicalUnit, standardText

Table 7.2b — Grammar, Usage, and Origin (12620: A.2.2 – A.2.4)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.02.02.01	partOfSpeech	plainText	none	termNote	TC
A.02.02.02	grammaticalGender	picklist	none	termNote	TC	f2
A.02.02.03	grammaticalNumber	picklist	none	termNote	TC	f3
A.02.02.04	animacy	picklist	none	termNote	TC	f4
A.02.02.07	grammaticalValency	plainText	none	termNote	TC	PA
A.02.03.01	usageNote	noteText	none	termNote
A.02.03.02	geographicalUsage	picklist	none	termNote		f5
A.02.03.03	register	picklist	none	termNote		f6
A.02.03.04	frequency	picklist	none	termNote		f7
A.02.03.05	temporalQualifier	picklist	none	termNote		f8
A.02.03.06	timeRestriction	noteText	none	termNote
A.02.03.07	proprietaryRestriction	picklist	none	termNote		f9
A.02.04.01	termProvenance	picklist	none	termNote		f10
A.02.04.02	etymology	basicText	none	termNote	TC

f2: picklist: masculine, feminine, neuter, other

f3: picklist: singular, plural, dual, mass, other

f4: picklist: animate, inanimate, other

f5: picklist: SF, CH, FR etc from ISO 3166 (country codes)

f6: picklist: neutralRegister, technicalRegister, in-houseRegister, bench-levelRegister, slangRegister, vulgarRegister

f7: picklist: commonlyUsed, infrequentlyUsed, rarelyUsed

f8: picklist: archaicTerm, outdatedTerm, obsoleteTerm

f9: picklist: trademark, tradeName

f10: picklist: transdisciplinaryBorrowing, translingualBorrowing, loan, translation, neologism

PA = Pending Approval

Table 7.2c — Term components (12620: A.2.5 – A.2.8)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.02.05	pronunciation	basicText	none	termNote
A.02.06	syllabification	basicText	none	termCompList
A.02.07	hyphenation	basicText	none	termCompList
A.02.08.01	morphologicalElement	basicText	none	termCompList
A.02.08.02	termElement	basicText	none	termCompList
A.02.08.03	termStructure	noteText	none	termNote		PA

PA = Pending Approval

Table 7.2d — Term status (12620: A.2.9)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.02.09.01	normativeAuthorization	picklist	none	termNote		f11
A.02.09.02	languagePlanningQualifier	picklist	none	termNote		f12
A.02.09.03	administrativeStatus	picklist	none	termNote		f13
A.02.09.04	processStatus	picklist	none	termNote		f14

Note: discussion is needed as to whether processStatus should have a picklist or be plainText

f11: picklist: standardizedTerm, preferredTerm, admittedTerm, deprecatedTerm, supercededTerm, legalTerm, regulatedTerm

f12: picklist: recommendedTerm, nonstandardizedTerm, proposedTerm, newTerm

f13: picklist: standardizedTerm, preferredTerm, admittedTerm, deprecatedTerm, supercededTerm, legalTerm, regulatedTerm

f14: picklist: unprocessed, provisionallyProcessed, finalized

Table 7.2e — Equivalence (12620: A.3)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.03.02	falseFriend	basicText	term	termNote
A.03.04	reliabilityCode	picklist	none	descrip		f15
A.03.05	transferComment	noteText	term	termNote

f15: picklist: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Table 7.2f — Classification System (12620: A.4)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.04	subjectField	plainText	none	descrip	LS, TM
A.04.02	classificationCode	plainText	classSysDescrip	descrip	LS, TM	A.4.1+

Note: In 12620, A.04.02 is called classificationNumber.

Table 7.2g — Concept-related descriptions (12620: A.5)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.05.01	definition	noteText	none	descrip	LS, TM
A.05.02	explanation	noteText	none	descrip	LS, TM
A.05.03	context	noteText	none	descrip	LS, TM
A.05.04	example	noteText	none	descrip	LS, TM
A.05.05.01	figure	noteText	binaryData	descrip	TE, LS, TM
A.05.05.02	audio	noteText	binaryData	descrip	TE, LS, TM
A.05.05.03	video	noteText	binaryData	descrip	TE, LS, TM
A.05.05.04	table	noteText	binaryData	descrip	TE, LS, TM
A.05.05.05	otherBinaryData	noteText	binaryData	descrip	TE, LS, TM
A.05.06	unit	noteText	none	descrip	TM
A.05.07	range	noteText	none	descrip	TM
A.05.08	characteristic	noteText	none	descrip	TM

Table 7.2h — Concept relations (12620: A.6 and A.7 combined)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.07.02	conceptPosition	plainText	conceptSysDescrip	descrip	TE, LS	A.07.01+
A.07.02.01	broaderConceptGeneric	basicText	entry	descrip	TE, LS	A.06.01+
A.07.02.01	broaderConceptPartitive	basicText	entry	descrip	TE, LS	A.06.02+
A.07.02.02	superordinateConceptGeneric	basicText	entry	descrip	TE, LS	A.06.01+
A.07.02.02	superordinateConceptPartitive	basicText	entry	descrip	TE, LS	A.06.02+
A.07.02.03	subordinateConceptGeneric	basicText	entry	descrip	TE, LS	A.06.01+
A.07.02.03	subordinateConceptPartitive	basicText	entry	descrip	TE, LS	A.06.02+
A.07.02.04	coordinateConceptGeneric	basicText	entry	descrip	TE, LS	A.06.01+
A.07.02.04	coordinateConceptPartitive	basicText	entry	descrip	TE, LS	A.06.02+
A.07.02.05.01	relatedConceptBroader	basicText	entry	descrip	TE, LS
A.07.02.05.02	relatedConceptNarrower	basicText	entry	descrip	TE, LS
A.07.02.05	relatedConcept	basicText	entry	descrip	TE, LS
A.07.02.06	sequentialRelatedConcept	basicText	entry	descrip	TE, LS	A.06.03
A.07.02.07	temporallyRelatedConcept	basicText	entry	descrip	TE, LS	A.06.03.01
A.07.02.08	spatiallyRelatedConcept	basicText	entry	descrip	TE, LS	A.06.03.02
A.07.02.09	associatedConcept	basicText	entry	descrip	TE, LS	A.06.04
A.10.18.06	antonymTerm	basicText	term	descrip	TM
A.10.18.06	antonymConcept	basicText	entry	descrip	TE

Note: further discussion is needed concerning whether antonyms are term-relations, concept-relations, or both

Table 7.2i — Specialized notes (12620: A.8)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.08.01	descripType	picklist	element	descripNote		PA, f16
A.08.02	definitionType	picklist	element	descripNote		PA, f17
A.08.03	contextType	picklist	element	descripNote		PA, f18

Note: General note is not a meta data category

f16: picklist: translation

f17: picklist: intensionalDefinition, extensionalDefinition, partitiveDefinition

f18: picklist: definingContext, explanatoryContext, associativeContext, linguisticContext, metalinguisticContext

Table 7.2j — Documentary language (e.g., thesaurus) (12620: A.9)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.09.02	thesaurusDescriptor	basicText	thesaurusDescrip	descrip	TE	A.09.01+
A.09.04	keyword	plainText	none	admin
A.09.05	indexHeading	plainText	none	admin

Table 7.2k — Transactions (12620: A.10.1 – A:10.2-3)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.10.01	transactionType	picklist	none	transac		f19
A.10.02.02	responsibility	basicText	Person/Org	transacNote		two data categories
A.10.02.03	count	plainText	none	transacNote
A.10.02.10	subsetOwner	basicText	personOrg	admin

Notes: <date> can also appear in a <transacGrp>; A.10.02.02 data categories are: responsiblePerson and responsibleOrg

f19: creation [formerly origination], input, modification, check, approval, withdrawal, standardization, exportation, importation, proposal, userAccess

Table 7.2l — Subsets (12620: A.10.3)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.10.03.01	customerSubset	plainText	none	admin
A.10.03.03	projectSubset	plainText	none	admin
A.10.03.05	productSubset	plainText	none	admin
A.10.03.06	applicationSubset	plainText	none	admin
A.10.03.07	environmentSubset	plainText	none	admin
A.10.03.08	businessUnitSubset	plainText	none	admin
A.10.03.09	securitySubset	picklist	none	admin		f20

Note: entailed term (10.6.1) is implemented in DXLT using the hi element, and foreign (10.8) is implemented in DXLT using the foreign element.

Note: (1 = public; 10 = highly confidential)

f20: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Table 7.2m — Other administrative information (12620: A.10.4 – A.10.21, except antonym)

ISO 12620	Martif Data Category Name	TextType	Target	Martif Tag Name	Level	Comments
A.10.06.02	sortKey	plainText	none	admin
A.10.06.03	searchTerm	basicText	none	admin
A.10.13	entrySource	noteText	?	admin	TE, LS, TM
A.10.14	conceptIdentifier	noteText	?	admin	TE, LS, TM
A.10.18	see	noteText	element	ref	TE, LS, TM
A.10.18	crossReference	noteText	any element	ref	TE, LS, TM
A.10.18	xCrossReference	noteText	external	xref	TE, LS, TM
A.10.18.05	homograph	basicText	term	termNote
A.10.18.06	antonym	basicText	term	descrip
A.10.19	source	noteText	none	admin	TE, LS, TM
A.10.20	sourceIdentifier	noteText	bibl	admin	TE, LS, TM
A.10.22.01	originatingInstitution	noteText	org	admin	TE, LS, TM
A.10.22.02	originatingPerson	noteText	per	admin	TE, LS, TM
A.10.22.03	originatingDatabase	noteText	?	admin	TE, LS, TM
A.10.23	sourceLanguage	picklist		admin	TE
A.10.24	domainExpert	plainText		admin	TE

Table 7.2n — Types of refObjects (Other Resources) in front and back elements

Note: These data categories are used to describe Other Resources and are not used inside a termEntry element.

Meta data category	Instantiation	Content/Comments
refObjectList	bibl	structured bibliographic reference (see 12620 Annex B)
refObjectList	binaryData	encoded as two hex digits per byte or as Mime
refObjectList	conceptSysDescrip	description of an external system
refObjectList	person/Org	details about a person or an organization
refObjectList	subjectFieldSet	set of subject field names and descriptions of them
refObjectList	colSequenceDescrip	description of collating sequence

In <back>, <refObject> elements are placed in <refObjectList> elements to group similar types of reference objects.

In <front>, namespaces for various types of objects

8 Defining user-group subsets

8.1 General

As indicated in the Introduction, user-group subsets can be defined for DXLT. This section describes one such subset, which we shall call the Supplier subset. It allows only the minimal terminological information that is provided, along with the source text to be translated, to a fictitious translation supplier working in a very restrictive environment involving manufacturing and finance only.

We shall allow only two types of terms, full forms and abbreviated forms. This is done by specifying a picklist of allowed values for the data category termType, which is an instantiation of the meta data category termNote. Here is the information that is placed in the DCS module concerning term type:

• meta data category: termNote

• data category: termType

• picklist: fullForm abbreviatedForm

This specification is a strict subset of the specification for termType in the master DCS module. The only difference is that the master DCS module allows more options in the picklist. Clearly, any DXLT document that conforms to the subset specification shall also conform to the master specification.
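The subset relationship for termType can be checked mechanically. The sketch below is illustrative only: the set and function names are hypothetical, the master values are copied from the termType picklist (f1) in Table 7.2a, and a real check would read both picklists from the DCS files.

```python
# termType picklist of the master DCS, copied from footnote f1.
MASTER_TERM_TYPES = {
    "mainEntryTerm", "synonym", "internationalScientificTerm",
    "commonName", "internationalism", "fullForm", "shortForm",
    "abbreviatedForm", "variant", "transliteratedForm",
    "transcribedForm", "symbol", "formula", "equation",
    "logicalExpression", "sku", "partNumber", "phraseologicalUnit",
    "standardText",
}

# termType picklist of the Supplier subset.
SUPPLIER_TERM_TYPES = {"fullForm", "abbreviatedForm"}


def is_subset_picklist(subset, master) -> bool:
    """True if every value allowed by the subset is allowed by the master."""
    return set(subset) <= set(master)


def is_allowed(value: str, picklist) -> bool:
    """True if value is a permissible picklist value."""
    return value in picklist
```

Because the Supplier picklist is contained in the master picklist, any termType value accepted by the subset is also accepted by the master DCS.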

We shall allow only two types of descriptive information: a subject field and a definition. The subject fields allowed in this subset are manufacturing and finance, and we only allow subject field specifications at the terminological entry level:

• meta data category: descrip

• data category: subjectField

• picklist: manufacturing finance

• levels: termEntry

The master DCS module allows any plain text value for a subject field, so a subset module can specify a picklist. Obviously, there can be no picklist of possible definitions, so we specify that a definition can contain the same type of text found in general notes (noteText). We shall allow definitions at two levels, entry and language.

This is done by placing the following information in the DCS module:

• meta data category: descrip

• data category: definition

• content: noteText

• levels: termEntry langSet

8.2 An example of a user-group DCS file

An actual machine-processable DCS file for the Supplier subset of DXLT would look like this:

<?xml version="1.0"?>

<header><title>subset DCS file for the Supplier example</title></header>

<contents datatype="picklist" targetType="none">fullForm abbreviatedForm</contents>

</termNoteSpec>

<contents datatype="picklist" targetType="none">manufacturing finance</contents>

<levels>termEntry</levels>

</descripSpec>

<levels>termEntry langSet</levels>

</descripSpec>

</datCatSet>

</martifDCS>

It should be noted that the machine-processable DCS module corresponds in a straightforward manner to the information listed for the three data categories presented in the above examples. The position numbers identify the location of the description of the data category in ISO 12620. Creating a DCS module does not require an extensive knowledge of XML nor expertise in writing XML DTDs or schemas.

Clearly, specifying only three data categories (term type, subject field, and definition) as instances of meta data categories defines a very limited subset of DXLT; nevertheless, this limited data-category module can be logically combined with the core-structure module of DXLT to allow such DXLT-compliant documents as the example in section 5. Elements that are not meta data categories, i.e., basic XLT data categories such as <term> and <note>, are allowed without explicit mention in the DCS module because they are part of the core structure.

As previously mentioned, it is anticipated that the TBX format of OSCAR will be a subset of DXLT.

Annex A
(normative)

Core structure module

A.1. The core-structure DTD for XLT

The entities allow mnemonic names to be given to pieces of text, especially text used in several places. The elements of XLT are divided into three groups: (a) the low-level elements used to mark up text, such as markup inside definitions and contextual examples; (b) elements needed to constitute a terminological entry (<termEntry>); and (c) high-level elements and other elements not used in a terminological entry, e.g. header elements. After these three groups of elements, attribute lists for all elements are given in alphabetical order. Each meta data category is given a comment indicating that its specializations and the other constraints on it are found in the DCS module.

The following DTD is a formal XML representation of the core structure defined informally in Section 6. It can be found as a text file in the DXLT zip file as XLTCDv04.xml (XLT Core-structure DTD version 0.4).

There is also an XDR schema version of the core structure that works with Internet Explorer 5. It can be found in the DXLT zip file as XLTCSv04.xml (XLT Core-structure Schema version 0.4).

<!-- =================================================================================

SOME USEFUL ENTITIES THAT ARE REFERENCED BELOW

================================================================================== -->

<!ENTITY % basicText '(#PCDATA|hi)*'>

<!ENTITY % noteText '(#PCDATA|hi|foreign|bpt|ept|it|ph|ut)*' >

<!ENTITY % impIDLang 'id ID #IMPLIED lang CDATA #IMPLIED' >

<!ENTITY % impIDType 'id ID #IMPLIED type CDATA #IMPLIED' >

<!ENTITY % impIDLangTypTgtDtyp 'id ID #IMPLIED lang CDATA #IMPLIED

type CDATA #REQUIRED target IDREF #IMPLIED datatype CDATA #IMPLIED' >

<!-- ================================================================================

ELEMENTS USED FOR TEXT MARKUP

================================================================================ -->

<!ELEMENT hi (#PCDATA) >

<!ELEMENT foreign (%basicText;) >

<!ELEMENT bpt (#PCDATA)* >

<!ELEMENT ept (#PCDATA)* >

<!ELEMENT it (#PCDATA)* >

<!ELEMENT ph (#PCDATA)* >

<!ELEMENT ut (#PCDATA) >

<!-- ================================================================================

ELEMENTS NEEDED FOR TERMINOLOGICAL ENTRIES (IN ALPHABETICAL ORDER)

================================================================================ -->

<!ELEMENT admin (%basicText;) >

<!ELEMENT adminGrp (admin, (adminNote|note|ref|xref)*) >

<!ELEMENT adminNote (%noteText;) >

<!ELEMENT date (#PCDATA) >

<!ELEMENT descrip (%noteText;) >

<!ELEMENT descripNote (%noteText;) >

<!ELEMENT langSet ((%auxInfo;), (tig | ntig)+) >

<!ELEMENT note (%noteText;) >

<!ELEMENT ntig (termGrp, %auxInfo;) >

<!ELEMENT ref (#PCDATA) >

<!ELEMENT term (%basicText;) >

<!ELEMENT termComp (%basicText;) >

<!ELEMENT termCompGrp (termComp, (termNote|termNoteGrp)*, %noteLinkInfo;) >

<!ELEMENT termCompList ((%auxInfo;),(termComp | termCompGrp)+) >

<!ELEMENT termEntry ((%auxInfo;),(langSet+)) >

<!ELEMENT termGrp (term, (termNote|termNoteGrp)*, (termCompList)* ) >

<!ELEMENT termNote (%noteText;) >

<!ELEMENT termNoteGrp (termNote, %noteLinkInfo;) >

<!ELEMENT tig (term, (termNote)*, %auxInfo;) >

<!ELEMENT transac (%basicText;) >

<!ELEMENT transacGrp (transac, (transacNote|date|note|ref|xref)* ) >

<!ELEMENT transacNote (%noteText;) >

<!ELEMENT xref (#PCDATA) >

<!-- ===================================================================================

OTHER ELEMENTS (in hierarchical order)

=================================================================================== -->

<!ELEMENT martif (martifHeader, text) >

<!ELEMENT martifHeader (fileDesc, encodingDesc?, revisionDesc?) >

<!ELEMENT p (%noteText;) >

<!ELEMENT fileDesc (titleStmt?, publicationStmt?, sourceDesc+) >

<!ELEMENT titleStmt (title, note*) >

<!ELEMENT title (#PCDATA) >

<!ELEMENT publicationStmt (p+) >

<!ELEMENT sourceDesc (p+) >

<!ELEMENT encodingDesc (ude?, p+) >

<!ELEMENT ude (map+) >

<!ELEMENT map EMPTY >

<!ELEMENT revisionDesc (change+) >

<!ELEMENT change (p+) >

<!ELEMENT text (front?, body, back?) >

<!ELEMENT front (#PCDATA) >

<!ELEMENT body (termEntry+) >

<!ELEMENT back ((refObjectList)*) >

<!ELEMENT refObjectList (refObject+) >

<!ELEMENT refObject ((itemSet | itemGrp | item)+) >

<!ELEMENT item (%basicText;) >

<!ELEMENT itemGrp (item, %noteLinkInfo;)>

<!ELEMENT itemSet ((item | itemGrp)+)>

<!-- =================================================================================

ELEMENT ATTRIBUTES

================================================================================= -->

<!ATTLIST admin %impIDLangTypTgtDtyp; >

<!ATTLIST adminGrp id ID #IMPLIED >

<!ATTLIST adminNote %impIDLangTypTgtDtyp; >

<!ATTLIST back id ID #IMPLIED >

<!ATTLIST body id ID #IMPLIED >

<!ATTLIST bpt i CDATA #IMPLIED x CDATA #IMPLIED type CDATA #IMPLIED >

<!ATTLIST change %impIDLang; >

<!ATTLIST date id ID #IMPLIED >

<!ATTLIST descrip %impIDLangTypTgtDtyp; >

<!ATTLIST descripGrp id ID #IMPLIED >

<!ATTLIST descripNote %impIDLangTypTgtDtyp; >

<!ATTLIST encodingDesc id ID #IMPLIED >

<!ATTLIST ept i CDATA #IMPLIED >

<!ATTLIST fileDesc id ID #IMPLIED >

<!ATTLIST foreign id ID #IMPLIED lang CDATA #REQUIRED >

<!ATTLIST front id ID #IMPLIED >

<!ATTLIST hi type (entailedTerm | xlink) #IMPLIED

target IDREF #IMPLIED

lang CDATA #IMPLIED

href CDATA #IMPLIED

show CDATA #IMPLIED

actuate CDATA #IMPLIED

role CDATA #IMPLIED

behavior CDATA #IMPLIED >

<!ATTLIST it pos (begin|end) #REQUIRED x CDATA #IMPLIED type CDATA #IMPLIED >

<!ATTLIST item %impIDType; >

<!ATTLIST itemGrp id ID #IMPLIED>

<!ATTLIST itemSet %impIDType; >

<!ATTLIST langSet id ID #IMPLIED lang CDATA #REQUIRED >

<!ATTLIST map unicode CDATA #REQUIRED

code CDATA #REQUIRED

ent CDATA #REQUIRED

subst CDATA #REQUIRED >

<!ATTLIST martif type (DXLT) #REQUIRED lang CDATA #REQUIRED >

<!ATTLIST martifHeader id ID #IMPLIED >

<!ATTLIST note %impIDLang; >

<!ATTLIST ntig id ID #IMPLIED >

<!ATTLIST p id ID #IMPLIED

type (langDeclaration|DCSName) #IMPLIED

lang CDATA #IMPLIED >

<!ATTLIST ph assoc CDATA #IMPLIED x CDATA #IMPLIED type CDATA #IMPLIED >

<!ATTLIST publicationStmt id ID #IMPLIED >

<!ATTLIST ref %impIDLangTypTgtDtyp; >

<!ATTLIST refObject id ID #IMPLIED >

<!ATTLIST refObjectList id ID #IMPLIED

type CDATA #REQUIRED >

<!ATTLIST revisionDesc %impIDLang; >

<!ATTLIST sourceDesc %impIDLang; >

<!ATTLIST term id ID #IMPLIED >

<!ATTLIST termComp %impIDLang; >

<!ATTLIST termCompGrp id ID #IMPLIED >

<!ATTLIST termCompList id ID #IMPLIED

type CDATA #REQUIRED >

<!ATTLIST termEntry id ID #IMPLIED >

<!ATTLIST termGrp id ID #IMPLIED >

<!ATTLIST termNote %impIDLangTypTgtDtyp;>

<!ATTLIST termNoteGrp id ID #IMPLIED >

<!ATTLIST text id ID #IMPLIED >

<!ATTLIST tig id ID #IMPLIED >

<!ATTLIST title %impIDLang; >

<!ATTLIST titleStmt %impIDLang; >

<!ATTLIST transac type CDATA #REQUIRED lang CDATA #IMPLIED target IDREF #IMPLIED

datatype CDATA #IMPLIED >

<!ATTLIST transacGrp id ID #IMPLIED >

<!ATTLIST transacNote %impIDLangTypTgtDtyp; >

<!ATTLIST ude id ID #IMPLIED

name CDATA #REQUIRED

base CDATA #IMPLIED >

<!ATTLIST ut x CDATA #IMPLIED >

<!ATTLIST xref %impIDType;

target CDATA #REQUIRED >

A.2 The schema version of the DXLT core structure

(See XLTdsV04.xml in the DXLT zip file: XLT DCS [XDR] schema version 0.4)

Annex B
(normative)

Master data category specification module

NOTE: This annex presents the master DCS file for DXLT expressed as an XML document. The informal version of this DCS file is found in Section 7.

(See DXLTDv04.xml in the DXLT zip file: DXLT DCS version 0.4)

Annex C
(normative)

Examples

This Annex consists of examples of how to encode DXLT data streams. Some examples are complete XML documents; others are fragments.

C.1 Low-level encoding (characters, dates, locales, etc) in DXLT

Tentative example of representing binary data:

The following example shows how binary data might be included in a refObjectList. This example will be expanded to include base64 encoding with MIME types. The W3C proposal on representation of binary data in XML is not yet final.

</refObject>

Character set issues:

The Latin1 entities and other entities are allowed according to ISO 12200, but DXLT is more restrictive. Only hex character references are used. This is to reduce the burden on a blind import routine, which cannot anticipate all the mnemonic character entities that might be used. DXLT has adopted the LISA OSCAR convention that all “blind” data streams shall be in one of three encodings of Unicode: (a) UCS-2, (b) UTF-8, or (c) pure 7-bit ASCII in which non-ASCII characters are encoded as eight ASCII characters using an XML hex character reference, e.g. "&#x3421;". Such hex character references are automatically converted to readable characters when an XML data stream containing them is displayed in various types of XML-aware software, such as certain web browsers, without any need for special programming. This third type of encoding can formally be considered to be UTF-8, although it does not use the UTF-8 method of encoding characters whose code point is above 127.
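The third encoding (pure 7-bit ASCII with hex character references) can be produced with a few lines of Python. This is a sketch under stated assumptions: to_ascii_refs is a hypothetical helper, the input is assumed to be already valid XML text (markup characters are not escaped here), and code points are padded to four hex digits so that each reference for a character up to U+FFFF occupies eight ASCII characters; characters beyond that range yield longer references.

```python
import html


def to_ascii_refs(text: str) -> str:
    """Replace every non-ASCII character with an XML hex character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )


# A term from the examples in this annex.
encoded = to_ascii_refs("table des transitions d'états")

# html.unescape resolves numeric character references, which is enough
# to show that the conversion is reversible for this input.
decoded = html.unescape(encoded)
```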

Note about dates:

Dates as standalone tags without a type attribute would not express a sufficiently explicit meaning. Thus they occur embedded in <transacGrp> elements, adding date information to the action described by the transaction.

C.2 Representing DXLT data categories in terminological entries

The following sample demonstrates the description of term components. In addition it also shows the possibility of using a tig in place of an ntig in simple cases:

fr : table des transitions d'états

en : state transition table

The tig usage is as follows:

<langSet lang="fr">

<ntig>

<termGrp>

<term>table des transitions d'états</term>

<termCompList type="termElement">

<termCompGrp>

<termComp>table</termComp>

<termNote type="grammaticalGender">feminine</termNote>

</termCompGrp>

<termCompGrp>

<termComp>des</termComp>

<termNote type="partOfSpeech">preposition+article</termNote>

</termCompGrp>

<termCompGrp>

<termComp>transitions</termComp>

<termNote type="grammaticalNumber">plural</termNote>

<termNote type="grammaticalGender">feminine</termNote>

</termCompGrp>

<termCompGrp>

<termComp>d'</termComp>

<termNote type="partOfSpeech">preposition</termNote>

</termCompGrp>

<termComp>états</termComp>

</termCompList>

</termGrp>

</ntig>

</langSet>

<langSet lang="en">

<tig><term>state transition table</term></tig>

</langSet>

The following ntig is equivalent to the English tig given above:

<langSet lang="en">

<ntig>

<termGrp>

<term>state transition table</term>

</termGrp>

</ntig>

</langSet>
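Because a tig and an ntig of this kind carry the same information, a processing routine may find it convenient to normalize every tig to an ntig before further handling. The following Python sketch shows one way this might be done with the standard library. The function name `tig_to_ntig` is hypothetical, and the sketch assumes that the term and its term notes belong inside the termGrp while any other content of the tig stays at the term level of the ntig.

```python
import xml.etree.ElementTree as ET


def tig_to_ntig(tig: ET.Element) -> ET.Element:
    """Build an ntig that nests the term of a tig inside a termGrp."""
    ntig = ET.Element("ntig", dict(tig.attrib))
    term_grp = ET.SubElement(ntig, "termGrp")
    for child in tig:
        if child.tag in ("term", "termNote"):
            term_grp.append(child)  # assumption: these belong in termGrp
        else:
            ntig.append(child)  # assumption: other info stays at term level
    return ntig


tig = ET.fromstring("<tig><term>state transition table</term></tig>")
print(ET.tostring(tig_to_ntig(tig), encoding="unicode"))
```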

The following sample shows how synonyms can be represented in DXLT. The following terminological data sample indicates that there is a synonym for the German term “Abtastglied”:

fr : échantillonneur

en : sampling element

de : Abtastglied; Abtaster

The French and German information is represented as follows:

<termEntry>

[here put descriptive and administrative info at termEntry level]

<langSet lang="fr">

[here put descriptive and administrative info at langSet level]

<ntig>

<termGrp>

<term>échantillonneur</term>

</termGrp>

[here put descriptive and administrative info at term level]

</ntig>

</langSet>

<langSet lang="de">

[here put descriptive and administrative info at langSet level]

<ntig>

<termGrp>

<term>Abtastglied</term>

</termGrp>

[here put descriptive and administrative info at term level]

</ntig>

<ntig>

<termGrp>

<term>Abtaster</term>

<termNote type="termType">synonym</termNote>

</termGrp>

[here put descriptive and administrative info at term level]

</ntig>

</langSet>

</termEntry>

Note: The English langSet has been omitted for brevity, as has descriptive and administrative information. Character references are shown converted to glyphs for readability (&#x00E9; = é).

The following samples show two different ways of representing abbreviations in DXLT. In the following terminological data sample the German term has an abbreviation:

fr : élément à action proportionnelle et par intégration

en : proportional plus integral element

de : Proportionalglied plus Integrierglied; PI-Glied

The German langSet can be represented in DXLT as:

<langSet lang="de">

[here put descriptive and administrative info at langSet level]

<ntig>

<termGrp>

<term>Proportionalglied plus Integrierglied</term>

</termGrp>

[here put descriptive and administrative info at term level]

</ntig>

<ntig>

<termGrp>

<term>PI-Glied</term>

<termNote type="termType">abbreviation</termNote>

</termGrp>

[here put descriptive and administrative info at term level]

</ntig>

</langSet>

It can also be represented using "abbreviatedFormFor" as follows, when it is desirable to show the relationship between the abbreviated form and the full form explicitly:

<langSet lang="de">

[here put descriptive and administrative info at langSet level]

<ntig>

<termGrp>

<term ID="n337">Proportionalglied plus Integrierglied</term>

</termGrp>

[here put descriptive and administrative info at term level]

</ntig>

<ntig>

<termGrp>

<term>PI-Glied</term>

<termNote type="abbreviatedFormFor" target="n337">Proportionalglied plus Integrierglied</termNote>

</termGrp>

[here put descriptive and administrative info at term level]

</ntig>

</langSet>

<term>

transactions

<transacGrp>

<transac type="modification">marketing department requested change from gizmo to thinger</transac>

<transacNote type="responsiblePerson">John Harris</transacNote>

</transacGrp>

Note: more examples of encoding transactions are given in the following subsection.

C.3 Encoding guidelines

The following legend will be helpful in interpreting the encoding guidelines for the tables in section 7.2:

- <xxx>: an XML element in the core structure

- bib-descrip: description in natural language of a bibliographic reference; no formal structure is required but it should be a complete reference, not just a code for a reference somewhere else.

- bibl-idref: reference to the ID of a bibliographic reference in Other Resources

- bindat-descrip: a description in natural language of an Other Resource item that is binary data (e.g. a description of a figure)

- bindat-idref: a reference to the ID of a binary data element in Other Resources

- classSysDescrip-idref: a reference to the ID of the description of a classification system in Other Resources (e.g. for the Lenoch system, how to obtain a copy of the full set of codes; what version is being used, etc)

- concept-display: a description in natural language of the concept entry being linked to (e.g. to be displayed as a hypertext link to be followed if desired)

- conceptSysDescrip-idref: similar to classSysDescrip-idref, but for a concept system

- content: element content restricted according to the TextType column in the table (e.g. basicText)

- datcat-name: the name of the data category as found in column two of the table

- descriptor: a particular thesaurus descriptor appropriate to the current concept entry

- display-text: description in natural language to display in conjunction with a hypertext link

- element-idref: reference to the ID of any element

- iso-date: date in ISO format, using the options specified in the XML schemas proposal

- link: a formal link using a target attribute that points to the IDREF of another element; or an informal link accomplished by specifying the term to link to (which may be ambiguous but is sometimes all that is available in a real-life termbase)

- page-etc: the page number or other identifier of a place in a document

- part-of-speech: a user-chosen name for a part of speech

- personOrg-descrip: description in natural language of the person or organization responsible

- personOrg-idref: a reference to the ID of a person or organization element in Other Resources

- picklist-value: any one of the items in the picklist associated with that data category

- position-code: a number or other code that identifies the position of a concept in a classification system, concept system, or thesaurus (e.g. a Lenoch code)

- pronunciation: a guide to the pronunciation of a term

- subject-field: the name of a subject field (chosen by the user)

- term-component: a component of a term, e.g. one of its words (termElement)

- term-idref: a reference (link) to the ID of a <term> element

- term-structure: the internal structure of a term (used only when a tree structure is needed), e.g. [[Empire State][Building]] (New York is known as the Empire State) as opposed to [[Empire] [State Building]]; termElement can be used in conjunction with termStructure to show the part of speech of each element

- term: a term (usually a term being pointed to)

- thesaurusDescrip-idref: similar to classSysDescrip-idref, but for a thesaurus

- transac-descrip: description in natural language of the transaction (e.g. why it was done)

- transfer-comment: an explanation in natural language of some lack of perfect equivalence between two terms (typically in two languages in the same terminological concept entry)

- URI: URL etc (see W3C spec for URIs)

In general: any item in italics in one of the patterns below is a variable that is a placeholder for a variety of specific possible values. Consult the legend for the meaning of a variable.

XML encoding guidelines for Table 7.2a (Types of terms, 12620: A.2.1):

A.02.01: termType without link

<termNote type='termType'>picklist-value</termNote>

A.02.01.05: commonNameFor;

A.02.01.08: abbreviatedFormFor

(types of term with link to another <term>)

<termNote type='datcat-name' target='term-idref'>term</termNote>

note: The term that is the content of the element is redundant with the value of the <term> element being linked to; it is there only as a display value for the convenience of the user who has not yet followed the link, or in case the link gets broken, or in case there is no link at all. The target attribute is strongly suggested but not required; the display content (the term being pointed to) is not optional. However, this is not to be interpreted as a license to violate term autonomy by placing an abbreviated form, for example, as the content of an abbreviatedFormFor element and not giving it its own tig or ntig.
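The redundancy described in this note can be checked mechanically. The Python sketch below, which is illustrative only, reports linked term notes whose target is missing or whose display content differs from the term pointed to. The function name `check_term_links` and the sample data are hypothetical; because the examples in this annex write the identifier attribute both as ID and as id, the sketch accepts either spelling.

```python
import xml.etree.ElementTree as ET

# Term types that link to another term (A.02.01.05 and A.02.01.08).
LINKED_TERM_TYPES = {"commonNameFor", "abbreviatedFormFor"}

SAMPLE = """
<langSet lang="de">
  <ntig><termGrp>
    <term ID="n337">Proportionalglied plus Integrierglied</term>
  </termGrp></ntig>
  <ntig><termGrp>
    <term>PI-Glied</term>
    <termNote type="abbreviatedFormFor" target="n337">Proportionalglied plus Integrierglied</termNote>
  </termGrp></ntig>
</langSet>
"""


def check_term_links(root):
    """Return a list of problems found in termNote links to terms."""
    terms_by_id = {}
    for term in root.iter("term"):
        term_id = term.get("id") or term.get("ID")
        if term_id:
            terms_by_id[term_id] = (term.text or "").strip()
    problems = []
    for note in root.iter("termNote"):
        if note.get("type") not in LINKED_TERM_TYPES:
            continue
        target = note.get("target")
        if target is None:
            continue  # the target attribute is suggested, not required
        if target not in terms_by_id:
            problems.append(f"broken link: {target}")
        elif (note.text or "").strip() != terms_by_id[target]:
            problems.append(f"display text differs from term {target}")
    return problems


print(check_term_links(ET.fromstring(SAMPLE)))
```

An empty list indicates that every linked term note in the sample points to an existing term and repeats that term as its display content.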

XML encoding guidelines for Table 7.2b (Grammar, Usage, and Origin, 12620: A.2.2 - A.2.4):

A.02.02.01: partOfSpeech

<termNote type='partOfSpeech'>part-of-speech</termNote>

note: In a user-group subset DCS, the content could become a picklist value.

A.02.02.02 - A.02.04.02: grammar, usage, and origin data categories (rest of the table)

<termNote type='datcat-name'>content</termNote>

note: this very abstract pattern works as follows: the value of the type attribute is one of the data category (datcat) names in Table 7.2b in the range from A.02.02.02 through A.02.04.02; the content of the element is a picklist value or free text (plain, basic, or note text), depending on what is in the TextType column for that row. For example, the datcat on row A.02.02.02 (grammatical gender) has picklist in the TextType column of Table 7.2b, so for grammatical gender the general pattern becomes:

<termNote type='grammaticalGender'>picklist-value</termNote>

XML encoding guidelines for Table 7.2c (Term components, 12620: A.2.5 - A.2.8):

A.02.05: pronunciation

<termNote type='pronunciation'>pronunciation</termNote>

note: For the pronunciation to be considered blind, it must be written in IPA.

A.02.06: syllabification

A.02.07: hyphenation

A.02.08.01: morphologicalElement

A.02.08.02: termElement

(ways of breaking down a term into components)

<termCompList type='datcat-name'>

<termComp>term-component</termComp>

<termCompGrp>

<termComp>term-component</termComp>

<termNote type='datcat-name'>content</termNote> (e.g. part of speech of this component)

</termCompGrp>

</termCompList>

A.02.08.03: termStructure

<termNote type='termStructure'>term-structure</termNote>

The structure of the components of a term is indicated using square brackets that are formally equivalent to a tree, e.g.:

[bank statement] [total] vs. [bank] [statement total]
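Since the bracket notation is formally equivalent to a tree, it can be converted to a nested structure with a short recursive routine. The following Python sketch is one possible reading of the notation, not a normative definition: the function name `parse_term_structure` is hypothetical, and the sketch assumes that every bracket pair delimits either a run of words or a sequence of nested brackets.

```python
def parse_term_structure(text: str):
    """Parse bracket notation into nested lists; leaves are strings."""
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1  # skip the opening bracket
        children = []
        words = ""
        while pos < len(text) and text[pos] != "]":
            if text[pos] == "[":
                children.append(parse_node())
            else:
                words += text[pos]
                pos += 1
        if pos >= len(text):
            raise ValueError("unbalanced brackets")
        pos += 1  # skip the closing bracket
        words = words.strip()
        if children and words:
            raise ValueError("a node may hold words or brackets, not both")
        return children if children else words

    nodes = []
    while pos < len(text):
        if text[pos] == "[":
            nodes.append(parse_node())
        elif text[pos].isspace():
            pos += 1
        else:
            raise ValueError(f"unexpected character at position {pos}")
    return nodes


print(parse_term_structure("[[Empire State][Building]]"))
print(parse_term_structure("[bank statement] [total]"))
```

Two bracketings of the same words yield different nested lists, which is exactly the distinction the termStructure data category is meant to record.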

XML encoding guidelines for Table 7.2d (Term status, 12620: A.2.9):

A.02.09.01 - A.02.09.04 term status data categories (the whole table)

<termNote type='datcat-name'>picklist-value</termNote>

XML encoding guidelines for Table 7.2e (Equivalence, 12620: A.3):

A.03.02: falseFriend

(e.g. in a French <ntig> for librairie, a false friend would be 'library':

<termNote type='falseFriend' lang='en' target='T567-e3'>library</termNote>

where <term id='T567-e3'>library</term>)

A.03.04: reliabilityCode

<termNote type='reliabilityCode'>picklist-value</termNote>

(where source reliability codes are mapped to the 'blind' range 1-10)

A.03.05: transferComment

<termNote type='transferComment' target='term-idref'>transfer-comment</termNote>

XML encoding guidelines for Table 7.2f (Classification system, 12620: A.4):

A.04 subjectField

<descrip type='subjectField'>subject-field</descrip>

note: In a user-group subset, the subject field can be restricted to a picklist value.

A.04.02 classificationCode

<descrip type='classificationCode' target='classSysDescrip-idref'>position-code</descrip>

note: In 12200, this is two elements, a descrip and a ref or ptr to the description.

XML encoding guidelines for Table 7.2g (Concept-related descriptions, 12620: A.5):

(A.05.01-A.05.08)

<descrip type='datcat-name'>content</descrip>

binary data (A.05.05.01-A.05.05.05):

<descrip type='datcat-name' target='bindat-idref'>bindat-descrip</descrip>

XML encoding guidelines for Table 7.2h (Concept relations, 12620: A.6 and A.7 combined):

A.07.02: conceptPosition

<descrip type='conceptPosition' target='conceptSysDescrip-idref'>position-code</descrip>

rest of the table (concept relations between two entries), except antonym

<descrip type='datcat-name' target='entry-idref'>concept-display</descrip>

A.10.18.06: antonym

note: Antonym is a little different from the other datcats in Table 7.2h, since it points to a term, not a concept entry. Antonym is hard to classify; it could even be thought of as similar to false-friend, which is a term note. It is currently in 12620 under administrative elements, but most outsiders view that classification as a mistake.

XML encoding guidelines for Table 7.2i (Specialized notes, 12620: A.8):

A.08.01-A.08.03:

<descripNote type='datcat-name'>picklist-value</descripNote>

XML encoding guidelines for Table 7.2j (Documentary language, 12620: A.9):

A.09.02 thesaurusDescriptor

<descrip type='thesaurusDescriptor' target='thesaurusDescrip-idref'>descriptor</descrip>

A.09.04 keyword

<descrip type='keyword'>plaintext</descrip>

note: a keyword is a word or group of words that are to be placed in some kind of index and used for retrieval; this keyword is not a descriptor but actually appears in the text of some element of the <termEntry> in which the keyword element is placed. It has been changed from an admin to a descrip so that its level can be controlled. If it turns out that it applies only to terms, then it may become a termNote. A keyword need not be in the inflected form found in text; instead, it may be lemmatized. This is one reason to use a keyword even when a field is indexed on all words, since the user may not know to search on the inflected form.

A.09.05 indexHeading

<descrip type='indexHeading'>plaintext</descrip>

note: an indexHeading is like a keyword, except that it does not appear anywhere in the <termEntry> yet is still useful for retrieval.

XML encoding guidelines for Table 7.2k (Transactions, 12620: A.10.1 - A.10.2):

<transacGrp>

<transac type='datcat-name'>transac-descrip</transac>

<transacNote type='responsibility' target='personOrg-idref'>personOrg-descrip</transacNote>

<date>iso-date</date>

</transacGrp>

If neither responsibility nor date is associated with the transaction, then a transac can be used without being in a transacGrp.

Note that the creation date and the author are considered in XLT to be the date and responsibility parts of a transaction. Thus a modification is represented as a modification transaction accompanied by a responsiblePerson [A.10.02.02.12] in a transacNote and a date [A.10.02.01]:

<transacGrp>

<transac type="modification"> comment on the transaction, e.g., reason for doing it </transac>

<transacNote type="responsiblePerson" target="per-idref"> person-descrip </transacNote>

<date>iso-date</date>

</transacGrp>
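A routine that exports transactions from a termbase might assemble this structure programmatically. The Python sketch below is illustrative only. The function name `make_transac_grp`, the sample ID, and the sample date are hypothetical, and carrying the date in a `<date>` element placed after the transacNote is an assumption rather than something this annex prescribes.

```python
import xml.etree.ElementTree as ET
from datetime import date


def make_transac_grp(kind, comment, person, person_id, when):
    """Build a transacGrp describing one transaction."""
    grp = ET.Element("transacGrp")
    transac = ET.SubElement(grp, "transac", {"type": kind})
    transac.text = comment
    note = ET.SubElement(
        grp, "transacNote", {"type": "responsiblePerson", "target": person_id}
    )
    note.text = person
    # Assumption: the date is carried in a <date> element in ISO format.
    ET.SubElement(grp, "date").text = when.isoformat()
    return grp


grp = make_transac_grp(
    "modification",
    "marketing department requested change from gizmo to thinger",
    "John Harris",
    "per-0001",  # hypothetical ID of a person element
    date(2000, 8, 15),  # hypothetical date
)
print(ET.tostring(grp, encoding="unicode"))
```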

XML encoding guidelines for Table 7.2l (Subsets, 12620: A.10.3):

A.10.03.01-A.10.03.08

<admin type='datcat-name'>content</admin>

A.10.03.10 security subset

<admin type='securitySubset'>picklist-value</admin>

XML encoding guidelines for Table 7.2m (Other administrative information, 12620: A.10.4 - A.10.21, except antonym):

A.10.18 crossReference

<ref type='crossReference' target='element-idref'>display-text</ref>

note: element-idref must be inside the current XML document

A.10.18 xCrossReference

<xref type='xCrossReference' target='URI'>display-text</xref>

A.10.18.05 homograph

<termNote type='homograph' target='term-idref'>term-description</termNote>

- antonym: treated with concept relations

- source and sourceIdentifier:

- source is self-contained in element and self-explanatory

<admin type='source'>bib-descrip</admin>

- source identifier:

Annex D
(informative)

Design and application of XLT

D.1 Design principles

The following design principles apply to all members of the XLT family:

1) It is assumed that the reader already accepts the need to share terminological data.

2) It is also assumed that the reader accepts the need for both formats that are relatively blind (for cases where unknown parties will be processing files in those formats) and formats that are relatively less blind (for cases where parties negotiate details of a format in advance in order to reduce information loss when passing through those formats). Thus, in the remainder of these principles, "blind" should be read as "tending toward the blind side of the blind-negotiated spectrum".

3) It is further assumed that the reader acknowledges the wisdom of using Unicode in a blind format to avoid the problem of having to allow for many different code pages when importing data. Of course, this imposes a burden on the party placing data in the intermediate format, but this burden is offset by the increased predictability of the format, which permits the design of one routine to process data from any source so long as it conforms to the blind format specification.

4) In addition to the above assumptions, it is hoped that the reader accepts the following design principles for a blind interchange format (or is willing to discuss any objections):

a) A viable blind interchange proposal shall be formulated in the context of an overall framework for representing, designing, and sharing terminological data, namely, the ISO/TC37 meta-model and associated projects, whether these three tasks are addressed in separate standards or a multipart standard.

b) A viable blind interchange format should be an XML application, not just an SGML application, since XML is gaining momentum very rapidly worldwide and is specifically designed for data representation and sharing. The next generation of web browsers will even allow the viewing of any XML document, even if its DTD has not been seen before, but not arbitrary SGML documents. XML is substantially easier to process than full SGML. This will be a particularly important point for a translation technology vendor deciding whether or not to implement import and export routines from and to the format.

c) A viable blind interchange format should be customizable, since it is unlikely that any one particular interchange format could be acceptable to all user groups in all its details. That is, a single, monolithic format without the possibility of creating user-group subsets is out of the question. The ISO principle of providing methods of solving classes of problems, rather than overly specific problems, applies here. A viable blind interchange format is easy to grasp by a software developer, which means it should be modular, just as a software application should be divided into manageable modules. A blind interchange format will probably not be simple, but if it is broken into modules, each of which can be easily grasped, then the complexity of the format as a whole becomes more manageable.

d) A viable blind interchange format is suitable for sharing information over the Internet, using such protocols as http (browser), ftp (file transfer), and e-mail.

e) In the XML world, XML may be embedded in HTML, but HTML is not usually embedded in XML. If it is essential to embed a piece of HTML in an XML document, it can be treated as a piece of data rather than as XML markup, with the pointed brackets and ampersands of the HTML represented as entities. The HTML data can then be embedded inside meta-markup tags. The addition of meta-markup tags allows the embedding of various kinds of markup inside such elements as definitions and contextual examples. If the TMF project includes a module that embeds XHTML inside a TML, that module should be optional, just as the meta-markup tags should be optional.

f) It is important to maintain the distinction between descriptive and presentational markup. A blind interchange document should consist primarily of descriptive markup. If the presentational look of a termbase needs to be preserved, then pure HTML could be used, but this approach is incompatible with the tasks of exporting content from one termbase and importing content into a termbase. If a blind interchange format is to be viewed other than as raw XML, it can be automatically converted to a visually appealing presentation in HTML.

g) Links in a blind interchange format that point to another concept entry or to a piece of Complementary Information should be unambiguous. Merging data from sources need not result in ambiguity if the ids are preceded by distinct prefixes, in the spirit of the namespace specification for XML that allows for co-existence of elements from multiple sources without ambiguity through the use of GI prefixes.

h) The use of DTDs to declare constraints on the use of markup is being replaced by the use of XML itself to express constraints; that is, constraints are written in XML syntax rather than in the "bang" statements of DTDs, such as !ELEMENT (read "bang element") and !ATTLIST. The XML activity page (www.w3.org/XML/Activity) says "While XML 1.0 supplies a mechanism, the DTD, for declaring constraints on the use of markup, automated processing of XML documents requires more rigorous and comprehensive facilities in this area." Therefore, we should not make decisions on the design of a blind interchange format based on the unfortunate limitations of DTD bang statements, such as the limitation that there is no formalism for constraining the content of an element to the values in a picklist. Instead of converting an element to an attribute, we could use a constraint system expressed in XML. When the W3C publishes its XML-Schema proposal, not to be confused with the RDF-Schema proposal, we should look at using it or else something in the spirit of it. Regardless of which particular type of schema is eventually used for a terminology blind sharing format, it is clear that the trend is toward using new XML schemas rather than relying exclusively on traditional SGML bang statements.

i) The task of representing, that is, making explicit, the structure of an existing termbase, need not involve the same degree of normalization that is needed for blind interchange, yet it would be highly desirable for the representation task to be done within the same framework as the sharing task. This means, for example, that it would be desirable to define a core structure for a family of formats. This core can then be successively constrained for design, by adding a list of data categories from ISO 12620, and for sharing, by defining user-group subsets of the available data categories and additional constraints on content and point of attachment of elements as needed for blind interchange.

j) Term autonomy (i.e. placing each term on an equal basis in a language section, with the possibility of additional information on each term) as opposed to simple synonyms dependent on a particular term, should be required in blind interchange but not in the representation task.
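Principle g) can be illustrated with a short Python sketch that makes the IDs of each source distinct before data from several sources are merged. The sketch is only one possible approach: the function name `prefix_ids`, the prefix values, the `body` container element, and the assumption that identifiers are carried in id attributes and referenced by target attributes are all illustrative.

```python
import xml.etree.ElementTree as ET


def prefix_ids(root: ET.Element, prefix: str) -> None:
    """Prepend prefix to every id and to every target that points to one."""
    local_ids = {el.get("id") for el in root.iter() if el.get("id")}
    for el in root.iter():
        if el.get("id"):
            el.set("id", prefix + el.get("id"))
        # Only rewrite targets that refer to IDs in this document;
        # external targets such as URIs are left untouched.
        if el.get("target") in local_ids:
            el.set("target", prefix + el.get("target"))


source_a = ET.fromstring(
    '<termEntry id="e1"><langSet lang="en"><tig>'
    '<term id="t1">gizmo</term>'
    '<ref type="crossReference" target="t1">gizmo</ref>'
    "</tig></langSet></termEntry>"
)
source_b = ET.fromstring(
    '<termEntry id="e1"><langSet lang="en"><tig>'
    '<term id="t1">thinger</term>'
    "</tig></langSet></termEntry>"
)

prefix_ids(source_a, "srcA-")
prefix_ids(source_b, "srcB-")
merged = ET.Element("body")  # hypothetical container for the merged entries
merged.extend([source_a, source_b])
print(ET.tostring(merged, encoding="unicode"))
```

Both sources use the same IDs internally, yet after prefixing every ID in the merged result is unique and each link still points to the element it pointed to before the merge.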

D.2 Applications of DXLT

The following list indicates various potential applications of the DXLT format:

A. blind interchange, including:

1. on-going data flow from one translation technology module to another with a different function in a complete business solution to the document production chain, which involves communication between authors, and translation requesters and suppliers,

2. integration of terminological data from multiple sources, and

3. data conversion necessitated by a shift from one application to another for the same function;

B. dissemination, including:

1. querying multiple termbases through a single user interface by passing data through a common intermediate format on a batch or dynamic basis,

2. placing data on an FTP site for download by interested parties, and

3. asking for suggestions from interested colleagues by making available entries in which some terms are tentative and asking for feedback;

C. analysis, including:

1. comparing the contents of various termbases [along the lines of the Interval project];

2. studying how lossless a roundtrip conversion with a given termbase can be; and

3. designing a new termbase intended to minimize loss during conversion.

D.3 Connections between XLT and TMF

Overview

A member of the XLT family is defined by the logical combination of three components, which can be summarized as structure, content, and style.

The structure is a constrained version of the core structure found in ISO 12200 (as to be amended) in the form of an XML DTD. The amendment is not final but was approved as a New Work Item (ISO 37.3N318) and the changes it proposed from the October 1999 version of ISO 12200 are relatively minor. This revision provides the basis for a work item that has been called Martif with Specified Constraints or MSC. Significantly, the proposed amendment to 12200 consists of optional extensions to the DTD, so that valid pre-amendment Martif files remain valid according to the revised DTD. In August 2000 ISO TC 37/WG 4 resolved to place MSC on hold for the time being in order to develop a more pressing standard, ISO 16642, which will define a high-level meta-model designed to provide a uniform environment in which MSC, Geneter, and other possible interchange formats can function with a commitment to universal interoperability. At the same time (August 2000), the latest draft of the MSC specification (and the first draft to propose XLT) was withdrawn from ISO and placed in the SALT project. The XLT core structure will be mapped to the ISO TC/37 meta-model abstract structure when the meta-model TMF (Terminology Markup Framework) project is further along. The XLT core structure can also be expressed as an XML schema.

The content of a member of the XLT family consists of the set of the data categories (from ISO 12620 wherever possible) needed for a particular user group and various constraints on those data categories and their appearance in the core structure. For a particular XLT family member, the various constraints that define it are gathered together into a module called a Data Constraint Specification (DCS) file. All DCS files have the same layout, which is defined by the XLT-DCS XML schema and will be equivalent to a subset of the RDF constraint format being developed in the ISO TC/37 Meta-model TMF project. One DCS file is privileged by being designated the Default XLT-DCS file. It functions as the “master file” containing the full list of data categories allowed in standard XLT. Various user groups are encouraged to define their formats as subsets of DXLT if possible. Note that DXLT, in the SALT environment, takes the place of the MSC format formerly under development in ISO TC/37.

In the XLT environment, style indicates the form of markup used in a format, particularly the approach to tag names and structures. The primary style of XML expression for XLT is the style of the XLT core structure, whereby broad tag names, sometimes called meta data categories, such as descrip, are specialized by the value of the type attribute, which is actually a data category name taken from the DCS file, e.g., <descrip type="definition">. The formal definition of an XLT family member, e.g., the features that distinguish it from other formats, thus depends simply on its DCS file. That is, each distinct DCS file defines a particular member of the XLT family. In XLT, alternative styles will only be used for internal processing.

Theoretical Context for the TMF Project

Multilingual terminological resources have been developed in machine-readable form since at least the 1960s. Being an integral part of the language industries and the information economy, these resources are processed using various kinds of database management systems and represented using various data models. In addition, these resources (and other types of resources, such as source and target texts and translation memory data) often need to be shared in various contexts, including interchange between computer systems and dissemination to human users. This sharing is usually accomplished using intermediate formats. Effective use and re-use of a variety of multilingual terminological resources is facilitated by a single high-level data model that supports analysis and design of both databases and representation formats for analysis and sharing.

Project 16642/CD of ISO Technical Committee 37 (consisting of a Terminology Markup Framework called TMF) provides such a model. The abstract data model in ISO 16642 is called a meta-model since it is a model of models, that is, a high-level model of what more specific data models have in common, thus providing the basis for a framework within which to define various compatible representations. The Meta-model fits into an integrated approach to be used in analyzing existing terminological data collections and in designing (a) new databases (which are typically processed using a relational or text-based data management system) and (b) structured documents in a representation format (which are typically defined using XML markup). An integrated approach eases the task of importing information from a structured document into a database and the task of exporting information from a database to a structured document. Yet another motivation for an integrated approach, as opposed to entirely separate approaches for databases and structured documents, is that XML-based formats are now being processed in new ways that infringe on the traditional role of database management systems. For example, XML documents are being queried and updated directly.

This integrated approach to analysis and design consists of three levels of data modeling: the meta-model level (level 1), the data-model requirements level (level 2), and the actual format (or database) level (level 3).

Level 1: The first (and most abstract) level of the integrated approach is the meta-model level. This level consists of three components: (a) the structural component, (b) the content component, and (c) the links between (a) and (b). The structural component is the meta-model from ISO 16642 and a diagram of it is found in Figure D.3.1. The content component is a meta-data registry based on ISO 12620, which is an inventory of data categories used in terminological data collections. These data categories are sometimes called "data element types" from a general information technology perspective. The links between the structural component and the metadata registry are (1) that the "term" object class in the structural component is the central data category called term in the metadata registry and (2) that all other data categories in the meta-data registry fall under "terminology-related information" in the structural meta-model. Thus, each specific type of terminology-related information is a data category in the meta-data registry.

Figure D.3.1 – The Structural Component of the Meta-model (part of level 1)

Note: The Associated Information box has been renamed "Terminology-related Information" and the "Other Resources" box has been renamed "Complementary Information". Also, some types of Terminology-related information can also apply to term components.

The meta-model level supports analysis and design at a very general level. The meta-model says that a terminological data collection consists of (a) global information about the collection (such as owner, title, and origin), (b) a number of entries (each entry performing three functions: describing a concept, listing the terms in each treated language that designate that concept, and describing the terms themselves), and (c) “Complementary Information” that is outside the entries. Each entry can have multiple language sections, and each language section can have multiple term sections. Each term section has exactly one term. Each level (entry, language, and term) can have associated with it various kinds of terminology-related information. The various types of Complementary Information are not part of any one entry but can be linked to from any element within one or more entries, even though those links are not shown explicitly in the diagram. Such Complementary Information includes bibliographic references, descriptions of ontologies, and binary data such as images that illustrate concepts.
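The containment relations described above can be summarized in a small Python sketch. The class names, field names, and sample values are hypothetical and are chosen only to mirror the wording of this annex; they are not taken from ISO 16642.

```python
from dataclasses import dataclass, field


@dataclass
class TermSection:
    term: str  # each term section has exactly one term
    info: dict = field(default_factory=dict)  # terminology-related information


@dataclass
class LanguageSection:
    language: str
    term_sections: list = field(default_factory=list)
    info: dict = field(default_factory=dict)


@dataclass
class Entry:
    language_sections: list = field(default_factory=list)
    info: dict = field(default_factory=dict)


@dataclass
class DataCollection:
    global_info: dict = field(default_factory=dict)  # owner, title, origin
    entries: list = field(default_factory=list)
    complementary_info: dict = field(default_factory=dict)  # keyed by ID


collection = DataCollection(global_info={"title": "Sample termbase"})
entry = Entry()
german = LanguageSection("de")
german.term_sections.append(TermSection("Abtastglied"))
german.term_sections.append(TermSection("Abtaster", {"termType": "synonym"}))
entry.language_sections.append(german)
collection.entries.append(entry)

for section in collection.entries[0].language_sections:
    print(section.language, [ts.term for ts in section.term_sections])
```

Complementary Information is held outside the entries and keyed by ID, so that any element in any entry can refer to it by that ID rather than containing it.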

Level 2: The second level is the data-model requirements level. At this level, the user of the integrated approach (who may be an analyst or designer) must make various choices, based on real-life needs. First, there is the question of the modality of the representation of terminological data: database or structured document. For the analyst, there is no real choice. Instead, it is a matter of identifying the nature of the terminological data collection to be analyzed. For the designer, there is a fundamental choice to be made. Will the terminological data be used primarily for queries and updates and be represented in some database management system? If so, which system? Or will the data be used primarily for sharing and be represented in a structured document with markup?

Once the choice between database and XML modality has been made, a data model must be chosen that is slightly more concrete than the meta-model at level 1. The data model consists of a structural component and a data constraint component compatible with and parallel to the meta-model and meta-data registry at level 1. For a relational database, a typical method of describing a data model is an entity-relationship diagram. Analysis and design of relational databases is a complex topic treated elsewhere. In the SALT project, the default relational-database approach is RelTerm(tm). The ISO TC/37 TMF project deals with XML data models and representation formats. For an XML format, a typical method of describing a data model is an XML document type definition (DTD). An alternative method, using what is called an "XML schema", is provided by the World Wide Web Consortium (W3C). TMF is developing an abstract structure and a constraint component at level 2. The abstract XML structure of TMF can function as an intermediate representation between various other representations, such as MSC (and other members of the XLT family) and GeneterXLT. The core structure and DCS modules of XLT are expected to be compatible with the TMF project as it evolves, with the understanding that the XLT core structure may include an NLP component even if the TMF project does not.

The level 2 structural component of XLT is provided by the core structure in ISO 12200 (as amended by the project initiated by New Work Item SC3/N318), which includes an XML DTD that is compatible with the structural component of the meta-model. There is an optional module in the core structure that supports Natural Language Processing elements. By including in the level 2 data constraint component the possibility of de-activating selected modules in the core structure, the same core structure module can provide the basis for all members of the XLT family of formats. It is anticipated that the XLT core structure will be generated from the TMF abstract structure through an automatic procedure.

At level 2, one conceptual data model is distinguished from another by the content of its data constraint specification. Each specification is guided by the real-world needs of some user group and consists of a list of data categories from the metadata registry (i.e., ISO 12620 and additional data categories that may become part of 12620), constraints on each of those data categories, and various other constraints on the core structure.

Constraints on data categories include restrictions on the values each data category can take (ranging from "text with markup" for contextual examples to a “picklist” for grammatical gender). Constraints on descriptive data categories also include restrictions on where a particular data category can appear in an entry, selected from the options provided by the core structure (including entry level, language-section level, and term level). Other possible constraints on the core structure include whether natural language processing elements will be allowed, whether meta-markup tags will be allowed, which languages will be allowed, and which types of Complementary Information will be allowed.
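
The two kinds of data-category constraint described above (permitted values and permitted structural levels) can be illustrated with a small check. This is a minimal sketch under stated assumptions: the real data constraint specifications are XML documents, whereas the in-memory dictionary below, the level names "entry", "language" and "term", the picklist contents, and the function name check are all hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical in-memory form of a data constraint specification (DCS).
# "picklist": None means that any text value is accepted.
sample_dcs: Dict[str, dict] = {
    "definition": {"levels": {"entry", "language"}, "picklist": None},
    "grammaticalGender": {
        "levels": {"term"},
        "picklist": {"masculine", "feminine", "neuter"},  # illustrative values
    },
}


def check(dcs: Dict[str, dict], name: str, level: str, value: str) -> List[str]:
    """Return the constraint violations for one data-category instance."""
    spec = dcs.get(name)
    if spec is None:
        return [f"data category '{name}' is not in the DCS"]
    problems: List[str] = []
    if level not in spec["levels"]:
        problems.append(f"'{name}' is not allowed at the {level} level")
    picklist: Optional[Set[str]] = spec["picklist"]
    if picklist is not None and value not in picklist:
        problems.append(f"'{value}' is not a permitted value of '{name}'")
    return problems
```

An empty result means the instance satisfies the sketched constraints; the other constraints mentioned above (permitted languages, NLP elements, meta-markup tags) are not modelled here.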

The XLT data constraint specifications in this document are XML documents that conform to an XDR-style schema found in Annex A. This schema is compatible with Internet Explorer 5. It is anticipated that this schema will be fully compatible with the RDF-based constraint specification format being developed in the TMF project.

An additional kind of choice that is expressed in a data-category specification is level of granularity. At the meta-model level, the data category inventory is given as a hierarchy. For example, under term-related descriptive information, the coarsest level of granularity is "term type". Under term type, we find "full form" and "abbreviated form of term". Under "abbreviated form of term", we find finer distinctions, such as "acronym", "initialism", and "clipped term". In creating each data-category specification at level 2, the user must choose which degree of granularity in the ISO 12620 data categories will be basic. The basic degree provides the name, while finer degrees become values and coarser degrees become headings. For example, a user could choose a coarser granularity as follows (where the code after each item is the position code in the ISO 12620 hierarchy):

heading: term-related descriptive information (A.2)

name: term type (A.2.1)

value: abbreviated form of term (A.2.1.8)

or a finer granularity as follows:

heading: term type (A.2.1)

name: abbreviated form of term (A.2.1.8)

value: acronym (A.2.1.8.4)
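
The granularity choice shown in the two examples above can be expressed as a small derivation over the position codes. The following Python sketch is illustrative only: the function and type names are assumptions, only the four position codes quoted in the text are listed, and "finer degrees" is simplified to the immediate children of the chosen code.

```python
from typing import Dict, List, NamedTuple, Optional

# Only the position codes quoted in the text are included here; a real
# inventory would hold the full ISO 12620 hierarchy.
HIERARCHY: Dict[str, str] = {
    "A.2": "term-related descriptive information",
    "A.2.1": "term type",
    "A.2.1.8": "abbreviated form of term",
    "A.2.1.8.4": "acronym",
}


class Granularity(NamedTuple):
    heading: Optional[str]  # the next coarser degree, if it is listed
    name: str               # the degree chosen as basic
    values: List[str]       # the next finer degrees


def parent_code(code: str) -> str:
    """Drop the last dotted component of a position code."""
    return code.rsplit(".", 1)[0]


def choose_basic(hierarchy: Dict[str, str], basic_code: str) -> Granularity:
    """Derive heading, name and values for the chosen basic degree."""
    heading = hierarchy.get(parent_code(basic_code))
    values = [
        label
        for code, label in sorted(hierarchy.items())
        if code != basic_code and parent_code(code) == basic_code
    ]
    return Granularity(heading, hierarchy[basic_code], values)
```

Calling choose_basic(HIERARCHY, "A.2.1") corresponds to the coarser choice above, and choose_basic(HIERARCHY, "A.2.1.8") to the finer one.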

Another aspect of the data-model requirements level is subsetting of data constraint specifications. Each “master” data constraint specification can have many valid subsets, each chosen by a particular user group in order to exercise more control over the choice of data categories and their values. Each user-group subset must be a strict subset of the master data constraint specification from which it was derived. That means that each data category name in the subset must appear in the master, that the structural level restrictions must be the same or tighter, and that the restrictions on values must be the same or tighter. For example, a data category in the master specification that is allowed to have any plain text value might be restricted to a picklist in a user-group subset.
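
The subset rule just stated lends itself to a mechanical check. The sketch below is an assumption-laden illustration, not the SALT validation procedure: it represents each specification as a Python dictionary (the real specifications are XML documents), uses hypothetical level names and picklist values, and implements the "same or tighter" conditions listed above.

```python
from typing import Dict


def is_valid_subset(master: Dict[str, dict], subset: Dict[str, dict]) -> bool:
    """Check the three subset conditions against a master specification.

    Each specification maps a data category name to
    {"levels": set of level names, "picklist": set of values or None},
    where None stands for "any plain text value".
    """
    for name, sub in subset.items():
        mas = master.get(name)
        if mas is None:
            return False  # name does not appear in the master
        if not sub["levels"] <= mas["levels"]:
            return False  # structural level restrictions were loosened
        m_vals, s_vals = mas["picklist"], sub["picklist"]
        if m_vals is not None and (s_vals is None or not s_vals <= m_vals):
            return False  # value restrictions were loosened
    return True


# Hypothetical master specification and candidate subsets.
master = {
    "subjectField": {"levels": {"entry", "term"}, "picklist": None},
    "termType": {"levels": {"term"}, "picklist": {"fullForm", "acronym"}},
}
tighter = {"subjectField": {"levels": {"entry"}, "picklist": {"manufacturing"}}}
looser = {"termType": {"levels": {"term"}, "picklist": None}}
extra_value = {"termType": {"levels": {"term"}, "picklist": {"fullForm", "initialism"}}}
```

Here tighter restricts a plain-text category to a picklist, as in the example in the text, and is accepted; looser and extra_value each relax a value restriction and are rejected.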

A particular XML data model at level 2 is defined by the logical combination of (a) a core structure and (b) some constraint specification. That is, together, the core structure and a data constraint specification define a data model for an XML format. One way to visualize this is to think of the structural meta-model as providing for a vast range of possible data models. A real-world application particularizes it by constraining the core structure, which includes restricting it to certain data categories and data-category values applied to the entry, language-section, term-section, and term-component-section object classes in the structural meta-model. The level 2 abstract structure, together with the general layout of a constraint specification, will be found in the next draft of ISO/CD 16642. The XLT core structure will be derived, perhaps automatically, from the abstract structure of TMF. Thus, XLT can be thought of as a subset of the TMLs (terminological markup languages) that can be generated from the ISO TMF (terminological markup framework), with NLP extensions.

Level 3: The third level of the integrated approach is where an actual XML format (or database layout) finally appears. At this level the user designing an intermediate format must make one additional choice: an XML representation style. One aspect of representation style is how to represent a data category from the metadata registry. The primary XLT style, based on ISO 12200, represents an instance of a data category as an XML element with the tag name being a class of data categories (descrip, admin, etc.), the name of the data category, e.g. definition, being the value of the type attribute, and the content of the element being the value of the data category (i.e. the content of the element is the definition itself). This is not the only possible representation style, of course, and TMF will allow many different but interoperable styles.

For example, a definition might be represented as:

<descrip type='definition'>A piston with three grooves...</descrip>,

while an alternative representation style might place the data category name as the tag name, as follows:

<definition>A piston with three grooves...</definition>.

Whatever the style, the name of the data category must appear somewhere in the element. It can be represented as a tag or attribute name or as an attribute value or element content. All members of the XLT family of formats have the same primary representation style. Thus, a particular XLT format is derived simply by selecting a particular data constraint specification (DCS). All these particular formats taken together form a "family" of related formats.
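
Because the data category name is always present somewhere in the element, a reader can recover the same name and value from either of the two styles shown above. The following Python sketch illustrates this with the standard xml.etree.ElementTree module; the function name read_category and the CLASS_TAGS set are assumptions, and the set is illustrative rather than a complete list of XLT class elements.

```python
import xml.etree.ElementTree as ET

# Tag names that stand for a class of data categories. Illustrative only.
CLASS_TAGS = {"descrip", "admin", "termNote"}


def read_category(xml_text: str):
    """Return (data category name, value) for a single element."""
    elem = ET.fromstring(xml_text)
    if elem.tag in CLASS_TAGS:
        # Primary style: the name is the value of the type attribute.
        return elem.get("type"), elem.text
    # Alternative style: the name is the tag name itself.
    return elem.tag, elem.text


primary = "<descrip type='definition'>A piston with three grooves...</descrip>"
alternative = "<definition>A piston with three grooves...</definition>"
```

Both read_category(primary) and read_category(alternative) yield the name definition with the same content, which is the sense in which the styles are interoperable.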

The other formats defined within the TMF project can be thought of as an "extended family" related to the XLT family. Since all core structures in the TMF family are equivalent and since all styles in TMF are equivalent, any two TMF formats that use the same DCS will be interoperable; that is, an instance of either format can be converted to the other and back through a lossless (i.e. bijective) mapping.

Figure D.3.2 is a diagram of how the XLT family of formats fits together. The database modality is only given a token box labeled “DB mode”, since the focus of this diagram is XML formats. A particular XML format at level 3 is defined by the combination of a choice of core structure (at level 2), a data constraint specification (at level 2) and representation style (at level 3).

Figure D.3.2: A family of formats defined by three levels of abstraction

Note: The XML-mode core structure is split into the TMF abstract structure and various compatible core structures, such as one for MSC and other XLT formats and one for Geneter.

Note: primary and secondary representations are alternative representation styles.

Note: Data-category specification in this figure should read "data constraint specification (DCS)".

Applications of analysis: Interchange and Dissemination

Analysis is normally performed with some purpose in mind. Typically, analysis will result in conversion of information between an original data collection and an intermediate format for the purpose of interchange or dissemination.

Interchange involves a transfer of information between two computer systems and is typically bi-directional. Dissemination is uni-directional and can be either for use by another computer system or for human viewing. Both interchange and dissemination can be performed either in batch mode or one entry at a time. Data analysis may lead to either dissemination or interchange or may be focused solely on improving understanding of the source data collection.

In a representation, for any purpose, there can be various degrees of "blindness", which involves neutralization of certain details of the source format, so that differences between formats are reduced, as represented in the intermediate format. The more blind a format is, the less an interpreter of data in that format needs to interact with the originators of the data or know about the original format of the data. When there are only two interchange partners and they are known to each other, blindness is not an issue. But when there are multiple sources of terminological data that must be imported by a single routine, especially if it is desirable to add more sources without modifying the import routine, blindness becomes very important. It is important to note that blindness is relative to the knowledge of the receiver of an instance document. There is no such thing as an absolutely blind format.

In interchange, the objective is usually to maximize preservation of information. But in the case of dissemination, the representation can be intentionally partial, leaving out some information that was in the original data collection. Such is the case in a map that preserves only certain aspects of geographical reality, such as elevation, or rivers, or roads, or buildings or political boundaries but not necessarily all of these. For example, a dissemination representation for human translators does not necessarily include some administrative information that is only relevant to terminologists maintaining the database.
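
An intentionally partial dissemination representation of this kind can be produced by filtering an entry before it is delivered. The sketch below is illustrative only and rests on an assumption: it treats every element tagged admin as administrative information to be withheld, and the function name for_dissemination and the sample entry content are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def for_dissemination(entry: ET.Element, drop_tags=("admin",)) -> ET.Element:
    """Return a copy of the entry without elements whose tag is in drop_tags."""
    result = copy.deepcopy(entry)  # leave the source entry untouched
    # Take a snapshot of the elements first, because the tree is modified.
    for parent in list(result.iter()):
        for child in list(parent):
            if child.tag in drop_tags:
                parent.remove(child)
    return result


# Hypothetical entry; attribute values are placeholders.
entry = ET.fromstring(
    "<termEntry>"
    "<descrip type='definition'>A value between 0 and 1 used in ...</descrip>"
    "<admin type='note'>internal</admin>"
    "<langSet><tig><term>alpha smoothing factor</term>"
    "<admin type='note'>internal</admin></tig></langSet>"
    "</termEntry>"
)
translator_view = for_dissemination(entry)
```

The copy is what makes this a dissemination step rather than an edit: the original data collection keeps its administrative information.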

Thus, as shown in Figure D.3.3, analysis is involved in the design of all representations of terminological data. Some analysis is for the purpose of dissemination of terminological data to people and to computer systems, and some analysis is for bi-directional interchange between computer systems. The specifics of a particular XML format will be influenced by the purpose of the format (simply for analysis, for dissemination, or for interchange) and by the degree of blindness that is required.

Whatever the purposes and real-world needs that guided the design of an XML intermediate format at level 2, the format, once implemented at level 3, takes on a life of its own as it is used to represent a variety of data, some of which may be unanticipated by the designer. By following the integrated approach described here, the resulting format will be more likely to be adaptable to varied circumstances and to be compatible with other formats that are part of the same family of formats.

Figure D.3.3 – Relationship between interchange, dissemination and data analysis

What is DXLT, in the context of other formats?

DXLT, the Default XLT Format, is a privileged member of the SALT XLT family of formats. The purpose of this family of formats is to represent terminological data. A particular member of the XLT family is distinguished from other members by one choice: its data-constraint specification (DCS) file. DXLT is defined by choosing a particular data-constraint specification with constraints designed to support relatively blind interchange. The DXLT core structure module and DXLT data-category specification module have formal definitions specified in Annex A and Annex B, respectively, of this SALT document. Note that Annex A may not currently include the NLP module because of on-going development in the OLIF project. Other members of the SALT XLT family can be defined by various user-groups. It is expected that for most purposes, a member of the XLT family can be defined by a subset of the Default DCS file. However, it is understood that unusual user needs may require the definition of a DCS file that is not a subset of the Default XLT DCS file.

There is another format mentioned in Annex E of this SALT document. It is called DXLT-SRa and it differs from DXLT only in representation style. Therefore DXLT and DXLT-SRa are fully interoperable.

As mentioned above, user-groups can define various subsets of DXLT. All subsets use the XLT data-constraint specification layout. A subset of DXLT is defined by the combination of the XLT core structure and a subset of the Default XLT data constraint specification. When dealing with subsets of DXLT, the full data-category specification is called the DXLT master data-category-specification. Any XML data stream that is a valid instance of a DXLT-subset format will also be a valid DXLT data stream, but some valid DXLT data streams are not sufficiently constrained to be valid DXLT-subset data streams.

A significant benefit of this approach of defining DXLT subsets is that the person defining the subset does not need to understand or even see an XML DTD or schema, only the data-category specification itself. Moreover, through the use of a software application, the user can be buffered not only from DTDs and schemas but also from the internal representation of the DCS itself.

How does DXLT relate to other formats?

A format defined by a subset of the Default XLT DCS could be called a child of DXLT, and an XLT format not defined by the Default XLT or a subset of it could be called a sibling of DXLT. A format within the TMF project that is not part of the XLT family could be called a cousin of DXLT. All formats in the extended family defined by TMF that share the same DCS are interoperable.
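
The family vocabulary introduced above can be summarized as a simple classification. This Python sketch is illustrative and makes several simplifying assumptions: every format considered is taken to be a TMF format, a DCS is reduced to the set of data category names it allows (ignoring value and level constraints), and the names Format, relation_to_dxlt, shares_dcs and customField are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Format(NamedTuple):
    name: str
    family: str          # "XLT", or another family within TMF
    dcs: FrozenSet[str]  # simplified DCS: allowed data category names


def relation_to_dxlt(fmt: Format, dxlt: Format) -> str:
    """Classify a TMF format as child, sibling or cousin of DXLT."""
    if fmt == dxlt:
        return "self"
    if fmt.family != "XLT":
        return "cousin"   # in TMF but outside the XLT family
    if fmt.dcs <= dxlt.dcs:
        return "child"    # defined by a subset of the Default XLT DCS
    return "sibling"      # an XLT format with some other DCS


def shares_dcs(a: Format, b: Format) -> bool:
    """The condition the text gives for interoperability: the same DCS."""
    return a.dcs == b.dcs


dxlt = Format("DXLT", "XLT", frozenset({"subjectField", "definition", "termType"}))
subset_format = Format("user-group subset", "XLT", frozenset({"definition", "termType"}))
other_xlt = Format("other XLT format", "XLT", frozenset({"definition", "customField"}))
geneter_like = Format("Geneter-style format", "Geneter", dxlt.dcs)
```

With these definitions, subset_format is a child, other_xlt a sibling, and geneter_like a cousin that nevertheless shares the DXLT DCS.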

It remains to be seen whether the TBX format being developed by the LISA OSCAR group will be a child or sibling of DXLT. However, it has been agreed that TBX will, if at all possible, be a member of the SALT XLT family of formats. Various OSCAR member companies will then use TBX, or some subset of it, for exchanging terminological data within their organization and with outside contractors and other groups as appropriate. It has also been agreed that the OLIF2 consortium will supply the OLIF2 format to the SALT project so that a software utility can be developed for merging an OLIF2 file into an XLT file and subsequently recreating the original OLIF file, thus facilitating the coordination of machine-translation and human-translation terminological resources, for example. It is anticipated that at some point XLT will replace the current chapter on terminological data in the Text Encoding Initiative guidelines. Finally, as noted above, it is anticipated that XLT, without the NLP option, will be compatible with the ISO TC/37 TMF project. The DXLT format should provide the basis for the ISO TC/37 MSC project when it is re-activated. This means that it will be possible to generate the XLT core structure from the TMF abstract structure by specifying a vocabulary of tag names and an appropriate set of representation choices. The same will be true of the Geneter format. Thus, most of the work being done world-wide in the area of representation formats for terminological data will be interoperable so long as compatible data constraints are being used.

Annex E
(informative)

Alternate representations

DXLT is a format that is a member of the ISO/TC37 TMF set of TMLs (formats). It is defined by choosing (1) the core-structure module of DXLT (that is, the DTD and DCS layout in Annex A), (2) the DXLT master DCS in Annex B, and (3) the MARTIF representation style.

Other formats, known as DXLT subset formats, are defined by creating a subset of the DXLT master DCS.

Another format, called DXLT-SRa, is defined by choosing the same core structure and DCS as DXLT but a different representation style. DXLT and DXLT-SRa are fully interoperable formats. Indeed, a fairly simple algorithm can convert between them.

Here is the sample DXLT document from section 5 of this specification. It has been slightly adapted to work with the XDR-schema version of the DXLT core structure. This schema is almost exactly equivalent to the DTD.

<?xml version='1.0'?>

<fileDesc><sourceDesc>from an Oracle corporation termBase</sourceDesc></fileDesc>

</martifHeader>

<descrip type='subjectField' datatype="noteText">manufacturing</descrip>

<descrip type='definition' datatype="noteText">A value between 0 and 1 used in ...</descrip>

<tig>

<term>alpha smoothing factor</term>

<termNote type='termType' datatype="picklistVal">fullForm</termNote>

</tig>

</langSet>

<tig><term>Alfa simítási tényező </term></tig>

</langSet>

</termEntry>

</body> </text>

</martif>

Now we give the same information in DXLT-SRa format (except that some comments were deleted):

<?xml version="1.0"?>

from an Oracle corporation termBase

</sourceDesc>

</fileDesc>

DXFd-mwk

</encodingDesc>

</martifHeader>

<text>

<body>

<subjectField metaType="descrip" datatype="noteText">manufacturing</subjectField>

<definition metaType="descrip" datatype="noteText">A value between 0 and 1 used in ...</definition>

<tig>

<term>alpha smoothing factor</term>

</tig>

</langSet>

<tig>

<term>Alfa simítási tényező </term>

</tig>

</langSet>

</termEntry>

</body>

</text>

</martif>

The DXLT-SRa data stream is completely equivalent to the primary representation of DXLT, on an element-for-element basis. A simple algorithm can convert back and forth between the primary and secondary representations without loss of information. Indeed, the DXLT-SRa version was created from the DXLT version by a webpage using embedded JavaScript to access the XML DOM built into Internet Explorer 5.

A careful examination of the DXLT version shows that the optional datatype attribute in the DXLT core structure has been used to indicate the datatype of the text found in each instantiation of a DXLT meta data category (subjectField and definition are note text and termType is a picklist value). This information, which can be retrieved from the DCS file, guides the algorithm that converts from DXLT to DXLT-SRa format.
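
The round trip between the two representations can be sketched as follows. This is not the JavaScript conversion mentioned above but a minimal Python illustration inferred from the two samples: an element whose datatype is noteText swaps its tag name with its type attribute, and the original tag name is kept in a metaType attribute. Elements with other datatypes (such as picklistVal) are left unchanged, because the samples do not show how they are rendered in DXLT-SRa. The function names are hypothetical, and attributes not visible in the samples are omitted from the fragment.

```python
import xml.etree.ElementTree as ET


def to_secondary(root: ET.Element) -> ET.Element:
    """Rewrite noteText elements from the primary to the secondary style."""
    for elem in root.iter():
        if elem.get("datatype") == "noteText" and "type" in elem.attrib:
            meta = elem.tag
            elem.tag = elem.attrib.pop("type")  # category name becomes the tag
            elem.set("metaType", meta)          # class is kept as an attribute
    return root


def to_primary(root: ET.Element) -> ET.Element:
    """Invert to_secondary for every element carrying a metaType attribute."""
    for elem in root.iter():
        if "metaType" in elem.attrib:
            name = elem.tag
            elem.tag = elem.attrib.pop("metaType")
            elem.set("type", name)
    return root


def summary(root: ET.Element):
    """Order-independent view of the tree, used to compare two versions."""
    return [(e.tag, dict(e.attrib), e.text) for e in root.iter()]


fragment = (
    "<termEntry>"
    "<descrip type='subjectField' datatype='noteText'>manufacturing</descrip>"
    "<descrip type='definition' datatype='noteText'>A value between 0 and 1 used in ...</descrip>"
    "<langSet><tig><term>alpha smoothing factor</term>"
    "<termNote type='termType' datatype='picklistVal'>fullForm</termNote>"
    "</tig></langSet>"
    "</termEntry>"
)
```

Comparing summary() before and after a to_secondary and to_primary round trip is one way to confirm that no information is lost; attribute order may differ in the serialized text, which is why the comparison uses dictionaries.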

Bibliography

[1] ISO 639:1988, Code for the representation of names of languages.

[2] ISO 639-2:1998, Code for the representation of names of languages – Part 2: Alpha-3 code.

[3] W3C, Extensible Markup Language (XML) 1.0 (W3C recommendation 10-February-1998) (http://www.w3.org/TR/1998/REC-xml-19980210)

[4] W3C, XML Schema Part 1: Structures (W3C Working Draft 7 April 2000) (http://www.w3.org/TR/xmlschema-1)

[5] W3C, XML Schema Part 2: Datatypes (W3C Working Draft 7 April 2000) (http://www.w3.org/TR/xmlschema-2)