Reference name of working document: DXLT specification draft 1b
Date: 2000-09-16
SALT project — XML representations of Lexicons and Terminologies
(XLT) — Default XLT Format (DXLT)
Document type: SALT working draft
Document language: en
Warning
This document is not an ISO standard. It is distributed
for review and comment. It is subject to change without notice and may not be
referred to as an International Standard. It is in the format of an
International Standard but is produced by the SALT project. It is derived from
the MSC project of ISO TC/37, which was superseded by the MTF project.
Recipients of this document are
invited to submit, with their comments, notification of any relevant patent rights
of which they are aware and to provide supporting documentation.
Copyright notice
This SALT document is a project draft and is
copyright-protected by SALT, with the Brigham Young University Linguistics
Department Translation Research Group, Provo 84602, USA, as its agent. While
the reproduction of project drafts in any form for use by official participants
in the SALT project is permitted without prior permission from SALT, neither
this document nor any extract from it may be reproduced, stored or transmitted
for any other purpose without prior
written permission from the SALT project.
The SALT project (see, for example,
www.ttt.org/salt/ for more information) is an international project funded by
the European Union and by private industry and supported by a number of
universities and industry groups.
Please send comments on this document to the SALT
project c/o Alan K. Melby:
e-mail: <akm@byu.edu>
telephone number: +1 801 378-2144 (Provo, USA)
Reproduction for sales purposes may be subject
to royalty payments or a licensing agreement. Violators may be prosecuted.
Foreword
Introduction
0.1 Intended audience
0.2 A family of formats
0.3 Distinction between DXLT and other XLT formats
1 Scope
2 Relevant ISO Standards
3 Terms and definitions
4 Requirements for DXLT documents
5 An example of a DXLT document
6 Definition of the core-structure component
6.1 General
6.2 Hierarchical overview
6.3 Text elements, i.e., elements that contain plain, basic or note text
6.4 Meta data categories
7 Definition of the default data-constraint specification (DCS) component
7.1 General
7.2 Systematic listing of XML-element data categories in DXLT
8 Defining user-group subsets
8.1 General
8.2 An example of a user-group DCS file
Annex A Core structure component
A.1 The core-structure DTD for DXLT
A.2 The schema version of the DXLT core structure
Annex B The data-constraint component
B.1 The DCS schema
B.2 The Default DCS file
Annex C Examples
C.1 Low-level encoding (characters, dates, locales, etc.) in DXLT
C.2 Representing DXLT data categories in terminological entries
C.3 Encoding guidelines
Annex D Design, application, and context of XLT
D.1 Design principles
D.2 Applications of DXLT and other XLT formats
D.3 Connections between XLT and TMF
Annex E Conformance checking
Bibliography
SALT is the acronym for "Standards-based
Access to Lexicons and Terminologies". The SALT project is working in
co-operation with ISO Technical Committee 37, the LISA OSCAR group, the OLIF2
consortium, the Text Encoding Initiative, the ISLES project, and other entities
with common interests. As the name implies, SALT is based on various existing
standards.
A principal objective of the SALT project is to
facilitate the representation, dissemination, and exchange of highly-structured
information from both human-oriented terminological data collections
(terminologies) and machine-translation lexicons.
XLT, which is being developed within the SALT
project, stands for XML-based formats for Lexicons and Terminologies. It is
anticipated that XLT will (1) support
the merging and extraction of OLIF2 files, (2) provide the basis for the
OSCAR TBX format, and (3), when restricted to the Terminologies side, fall
within the Terminology Markup Framework (TMF) currently being developed by ISO
Technical Committee 37.
This SALT document defines an XML-based
application referred to as the Default XLT Format (DXLT). DXLT is the
primary member of the XLT family of
formats. This document also provides the basis for defining other members of
the XLT family. The intended audience for this document consists of three
groups: (1) programmers and analysts
who desire to develop software applications that process XLT-compliant data
streams, for example, by converting them to
data streams in some other format or by deriving XLT-compliant data streams from some other format; (2) terminologists and other language
specialists who desire to analyze a terminological data collection for
representation in some XLT format, in particular in DXLT, or to define either a
user-group subset of DXLT or some other XLT format, and (3) managers who desire to obtain an
overview of the XLT family and its default format, DXLT.
Each of these three groups should be familiar with
this Introduction. In addition to an understanding of this Introduction,
terminologists and other language specialists need a basic understanding of the
structure of XML documents and the data categories in ISO 12620. Besides having
or obtaining this background information, they should study the body of this
SALT document (sections 1-8) and annexes C and D, but they do not need the ability to write or modify
XML DTDs or schemas. An introduction to the data categories of ISO 12620 is
available through www.ttt.org. Programmers and analysts developing software
applications to process DXLT and other XLT formats must have a thorough
knowledge of XML and familiarity with the entirety of this SALT document and
the various standards on which it is based.
The XLT family of formats is based on various international
standards. The X in XLT stands for XML, indicating that each member of the
XLT family is an XML application. The L
in XLT stands for Lexicons,
indicating that information from human-oriented lexicons and NLP lexicons
(especially machine translation lexicons) can be incorporated into XLT. The NLP
aspect of XLT is based on OLIF (see Otelo project,
http://www.olif.net/olif/OLIF1.html). The T
in XLT stands for Terminologies. The
terminological approach of XLT is based on two ISO standards (ISO 12620 and
12200). ISO 12620 provides an inventory of data categories (i.e., data element
types, often implemented as column names in a table or field names in a
record). ISO 12200, also known as Martif, provides the basis for the core
structure for the family of formats. Thus, XLT is a standards-based family of
formats for representing, manipulating, and sharing terminological data.
Each member of the XLT family differs from others only in which data
categories are allowed and what values they can take. These choices are
represented in a Data Constraint Specification (DCS) file. The following figure
shows how XLT is based on the classic form-content distinction. Each
combination of the core DTD/schema (which defines the structure) and a
particular DCS file (which defines the allowed content) results in a format
that is a member of the XLT family of formats.
Figure: XLT Family of Formats. Form (the core DTD/schema) combined with Content (DCS 1, DCS 2, …) yields Format 1, Format 2, …, Format n.
Default-XLT (DXLT) is one member of the XLT family of formats. The DCS file that defines DXLT is naturally called the Default DCS file of XLT. It is anticipated that the data categories in the Default DCS file will suffice for most dissemination and interchange tasks. It is thus expected that most members of the XLT family of formats will be defined using strict subsets of the Default DCS file. However, it is possible that some particular application will require data categories or data-category values not allowed by the Default DCS file. In that case, a DCS file can be defined that is not a subset of the Default DCS file. Subsets of the Default DCS file define "children" of DXLT, and custom DCS files that are not subsets of the Default DCS file define "siblings" of DXLT. XLT is simply the family of formats defined by the XLT core structure and all the various DCS files that combine with it.
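The child/sibling distinction can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a DCS is reduced to a mapping from data-category names to sets of allowed values (an empty set meaning free text). The names DEFAULT_DCS, is_subset_of and relation_to_dxlt are hypothetical, and the picklist values shown are examples rather than the content of the actual Default DCS file, which is an XML document carrying more detail.

```python
from typing import Dict, Set

# Hypothetical in-memory model of a DCS: data-category name -> allowed values.
# An empty set stands for "free text"; the real DCS file is XML and richer.
DCS = Dict[str, Set[str]]

# Illustrative stand-in for the Default DCS, not its actual content.
DEFAULT_DCS: DCS = {
    "definition": set(),
    "subjectField": set(),
    "termType": {"fullForm", "abbreviation", "acronym"},
}


def is_subset_of(candidate: DCS, default: DCS) -> bool:
    """True if everything the candidate allows is also allowed by the default."""
    for category, values in candidate.items():
        if category not in default:
            return False  # data category unknown to the default DCS
        allowed = default[category]
        if allowed and not values:
            return False  # candidate allows free text where default has a picklist
        if allowed and not values <= allowed:
            return False  # candidate allows a value the default does not
    return True


def relation_to_dxlt(candidate: DCS, default: DCS = DEFAULT_DCS) -> str:
    """Classify a DCS as DXLT itself, a child of DXLT, or a sibling of DXLT."""
    if candidate == default:
        return "DXLT"
    return "child" if is_subset_of(candidate, default) else "sibling"
```

Under these assumptions, a DCS that keeps only "definition" and restricts "termType" to "fullForm" is classified as a child, while a DCS that introduces a category absent from the default is classified as a sibling.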
The data models underlying terminology resources can be very complex, and therefore XLT formats can also be complex. Complexity is managed by identifying generalizations and breaking down complex objects into simpler modules that can each be understood on its own. The XLT approach abstracts away the structure found in a variety of formats and places it in the core structure module that contains very general data elements such as <descrip> (descriptive information) and <admin> (administrative information). The specialization of the core structure to specific data categories is represented in a DCS file, which may include the data category definition as a particular type of <descrip> element. This allows XLT-aware software to deal with a relatively simple core structure and adapt automatically to various members of the XLT family by consulting a DCS file, which has a very simple structure. Complexity is not magically eliminated, since the logical combination of the core structure and a particular DCS file can indeed be rather complex. But in XLT each of the two modules (form and content) can be dealt with separately, in accordance with basic principles of object-oriented design. No one terminology format can satisfy the needs of all user groups; however, based on experiments to date, most user groups can use the same core structure and accommodate their particular needs using a user-group-specific DCS file.
It is anticipated that the LISA OSCAR TermBase eXchange format (TBX) will be a subset of DXLT. Also, the European Union project called IATE is using an intermediate format (IATE-XLT) that is a subset of DXLT. Any two members of the XLT family are interoperable in so far as their respective DCS files are compatible.
SALT
project — XML representations of Lexicons and Terminologies (XLT) —
Default XLT Format (DXLT)
1 Scope
For various types of machine processing,
including transmission over the Internet, terminological data can be
represented using XML. The format defined by this SALT document is an XML
application designed to support machine processing of terminological data in
various computer environments, including standalone computers, the Internet,
and intranets.
The format defined in this SALT document is
designed to represent terminological data in a relatively "blind",
that is, neutralized fashion for purposes of (a) interchange, (b)
dissemination, and (c) data analysis. This SALT document is based on (1) an
XML-compliant core structure compatible with “Negotiated MARTIF” (ISO 12200)
and (2) an XML formalism called the Data Constraint Specification (DCS) schema
for specifying constraints on the core structure. In addition this SALT
document contains one set of constraints, the Default set of constraints,
expressed in that formalism. Each set of constraints specifies (a) which data
categories, primarily from ISO 12620, are allowed as instantiations of the meta
data categories in the core structure, (b) which values the data categories can
take, and (c) at which levels in the core structure data-category elements can
appear. In addition, this set of constraints can de-activate selected modules
and options of the core structure, such as which languages are allowed, whether
certain text markup tags are allowed, and whether particular types of
Complementary Information are allowed in the current family member. The format
defined by the core structure and data-category specification included in this
SALT document is called DXLT (the Default-XLT format).
This SALT document further provides guidelines
for specifying user subsets of DXLT. The specification of a user subset does
not involve modification of any XML DTDs or schemas. Other members of the XLT
family of formats can be defined using the core structure and DCS formalism
included in this document. XLT formats include no recursive XML elements, thus
reducing the processing burden on import routines.
XLT formats are members of the collection of
formats intended to be compliant with the ISO Technical Committee 37 project called TMF
(ISO/CD 16642 – Terminology Markup Framework). XLT is being developed in
parallel with TMF (see Annex D). It is intended that DXLT and its subsets, in
particular, will qualify as Terminology Markup Languages (TMLs) within TMF.
The following ISO standards are relevant. For dated
references, subsequent amendments to, or revisions of, any of these
publications do not apply. However, parties to agreements based on documents
are encouraged to investigate the possibility of applying the most recent
editions of the standards indicated below. For undated references, the latest
edition of the standard referred to applies. Members of ISO and IEC maintain
registers of currently valid International Standards.
The key ISO standards and projects upon which this document is based
are: (1) ISO/CD 16642 (the TMF project), (2) ISO 12200:1999 (Negotiated MARTIF)
as amended by TC37/SC3 NWI 318, (3) ISO 12620:1999 (Data Categories), (4) ISO
8879:1986 (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N 029:1998-12-06) to
allow for the definition of XML, and (5) ISO 10646-1 (commonly known as
Unicode).
Expanded list of relevant ISO standards (not including projects which are not yet International Standards):
- ISO 639:1988, Code for the representation of names of languages.
- ISO 639-2:1998, Code for the representation of names of languages – Part 2: Alpha-3 code.
- ISO/IEC 646:1991, Information technology – ISO 7-bit coded character set for information interchange.
- ISO 1087:1990, Terminology – Vocabulary.
- ISO 1087-2:1999, Terminology work – Vocabulary – Part 2: Computer applications.
- ISO 3166-1:1997, Code for the representation of names of countries and their subdivisions – Part 1: Country codes.
- ISO 8601:1988, Data elements and interchange formats – Information interchange – Representation of dates and times.
- ISO 8879:1986 (SGML) as extended by TC2 (ISO/IEC JTC 1/SC 34 N 029:1998-12-06) to allow for XML.
- ISO/IEC 10646-1:1993, Information technology – Universal Multiple-Octet Coded Character Set (UCS) – Part 1: Architecture and basic multilingual plane.
- ISO 12200 as amended, Computer applications in terminology – Machine-readable terminology interchange format (MARTIF) – Negotiated interchange.
- ISO 12620, Terminology – Computer applications – Data categories.
For the purposes of this SALT document, the
following terms and definitions apply:
3.1
analysis
identification of the elements and structure of
a terminological data collection so that the data fields, their types, and
their relationships are made explicit
3.2
blindness
property of a data format indicating the degree
to which the data are so rigorously defined that it is unnecessary for the
importer to establish contact with the originator of the data in order to
interpret them
NOTE: The property of blindness is achieved through
the process of neutralization of differences between original formats. The
metaphor behind the term blindness, which has its origin in the engineering
phrase “blind transmission”, is that on the receiving end of a transmission, it
is unnecessary to “see” who is sending the information in order to process it.
Blindness is not an absolute property but is a matter of degree.
3.3
core-structure module
component of a format’s definition that
specifies some elements as meta data categories and indicates which structural
relations are allowed among elements
3.4
data category
result of the specification of a given data
field [ISO 1087-2:2000], (i.e. a type of data field, such as definition)
NOTE: ISO 12620 is an inventory of data categories.
3.5
data stream
a sequence of bytes that correspond to the
contents of a document or file
NOTE: An XML document can be called a “document”, a “file”,
or a “data stream” interchangeably.
3.6
data constraint module
component of a format’s definition that
constrains the core-structure module, e.g., by specifying which data categories
are allowed and how each data category can be used
3.7
dissemination
representation of data in an intermediate
format that allows a wide range of potential users to access and reuse the data
3.8
pre-negotiation
property of an intermediate format indicating
that it is adapted to maximizing the preservation of both content and
structural nuances found in the source data, even at the expense of blindness
NOTE: Pre-negotiation and blindness, although
sometimes at odds with each other, should not be considered antonyms, but
rather choices imposed by the tension between complete neutralization and
complete preservation of information in a data collection.
3.9
interchange
transaction involving exporting data from and
importing data into a terminological data collection where those data are
represented in some intermediate format for the purpose of facilitating access
to the data by computer programs
3.10
meta data category
a name used to group similar data categories
together; thus, a category of data categories
NOTE: Meta data categories in XLT include descrip, admin and termNote.
3.11
modularity
property of an
electronic format whereby the complexity of the structure and content treated
by the format is managed by defining
sub-components that can be studied separately, side by side, and then logically
combined
NOTE: In XLT, one module defines the core structure
using meta data categories, and the other module specifies constraints on the
core structure module, including which data categories can instantiate each
meta data category.
3.12
metadata registry
description of the
fields in a database for the purpose of facilitating understanding by outside
parties [cf. definition in ISO 11179].
3.13
neutralization
process whereby the differences between the
representation of data elements from various original data collections are reduced
by re-expressing them using the pre-specified structural features, data
categories, and data-category values of an intermediate format
3.14
representation
expression of data content and structural
relationships in an intermediate format outside the environment of the
originating data collection
NOTE: Representation may involve the retention of all
or part of the information from the originating data collection; in addition,
it can involve various degrees of
neutralization and thus tend toward either blindness or pre-negotiation.
3.15
XML™ (eXtensible Markup Language)
universal format for structured documents and
data on the World Wide Web (WWW); a particular subset of SGML.
NOTE: XSLT is a programming language
specifically designed for manipulating XML documents
For an XML document to be considered
DXLT-compliant, it must qualify on three counts: (1) It must be a well-formed
XML document. (Well-formedness is a purely formal XML notion based on such
criteria as all elements being explicitly empty or explicitly terminated and
not overlapping.) (2) It must be valid according to the XLT core-structure
module (described informally in section 6 and defined formally by the XML DTD
in Annex A). (Validity is also a formal XML notion.) (3) It must adhere to the
constraints in the Default data constraint specification (DCS) module or
user-defined subset thereof currently applicable. These three counts are levels
of conformance to the DXLT specification. Requirements for other members of the
XLT family are similar, the only difference being that the third count requires
adherence to the particular DCS module associated with that family member.
In practice, DXLT documents are typically
created by an export routine in some piece of HLT (Human Language Technology)
software, and they can either be displayed using a tool such as XSLT or be
processed by an import routine that is part of some other piece of HLT
software. So long as the XML documents that are created and processed are DXLT-compliant,
it is not necessary for a human to inspect them and no formal conformance check
is necessary. However, in some circumstances, such as dealing with suspected
data corruption, DXLT-compliance can be checked using DXLT-validation software.
The first two aspects of DXLT-compliance can be
checked by validating the DXLT document against the DTD of the core structure
using a validating XML parser, and the third aspect can be checked using a
custom software application that checks for adherence to the constraints in the
DCS module.
As noted above, it is possible to validate
whether any given well-formed XML data stream is DXLT compliant. However, this
validation is a formal process and does not ensure that appropriate
terminological methods have been used to create the data or that the content of
the data categories is accurate. Validation may determine, for instance, that
the value of an XML element such as term type is not one of the allowed values,
but validation cannot detect a poorly written definition. See Figure 4.1 for
examples of these distinctions in DXLT. The first part is not well-formed,
since the first <descrip> element has a spelling error in the end tag and
since the second <descrip> has no closing tag at all. The second part is
well-formed but not valid, since the core-structure module of DXLT does not
allow for a <desskrip> tag. The third part conforms to the XLT DTD but
not to the Default DCS of DXLT, since there is no DXLT data category called
"conflagration". The fourth part is valid but not accurate, since a
kitten is not a dog or wolf.
Not well-formed:
<term>kitten</term>
<descrip type='definition'>content</decrip>
<descrip type='definition'>other content
Well-formed but not valid:
<term>kitten</term>
<desskrip type="definition">content</desskrip>
<descrip type="definition">other content</descrip>
Valid but not DCS-adherent:
<term>kitten</term>
<descrip type="conflagration">content</descrip>
Valid and DCS-adherent but not accurate:
<term>kitten</term>
<descrip type="definition">a young dog (canis lupus)</descrip>
Figure
4.1 — Well-formedness, validity, adherence, and accuracy
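The first three distinctions in Figure 4.1 can be approximated mechanically, as in the following Python sketch. It is an illustration under stated assumptions, not DXLT-validation software: the set CORE_ELEMENTS is a tiny hypothetical stand-in for validation against the core-structure DTD (which in practice requires a validating XML parser), DCS_DESCRIP_TYPES is a hypothetical stand-in for a DCS file, and the fragment is wrapped in a made-up <entry> root so that it has a single root element. Accuracy, the fourth distinction, cannot be checked by software of this kind.

```python
import xml.etree.ElementTree as ET

# Assumptions for illustration only: a minimal stand-in for the element names
# allowed by the core structure and for the descrip types allowed by a DCS.
CORE_ELEMENTS = {"entry", "term", "descrip"}
DCS_DESCRIP_TYPES = {"definition", "subjectField"}


def conformance_level(fragment: str) -> str:
    """Classify a fragment by the first of the three checks that it fails."""
    try:
        # Wrap in a hypothetical <entry> root so the fragment has one root.
        root = ET.fromstring("<entry>" + fragment + "</entry>")
    except ET.ParseError:
        return "not well-formed"
    # Stand-in for DTD validation: every element name must be known.
    for element in root.iter():
        if element.tag not in CORE_ELEMENTS:
            return "well-formed but not valid"
    # Stand-in for the DCS check: every descrip type must be allowed.
    for descrip in root.iter("descrip"):
        if descrip.get("type") not in DCS_DESCRIP_TYPES:
            return "valid but not DCS-adherent"
    return "valid and DCS-adherent"
```

Applied to the four parts of Figure 4.1, the sketch reports the first three failures in turn and accepts the fourth part, even though its definition of "kitten" is wrong.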
The following is an example of a simple but
complete DXLT document. The numbers in square brackets to the left of certain
lines are not part of the DXLT document. They serve as footnote numbers to the
comments below.
[1] <?xml version='1.0'?>
<!DOCTYPE martif PUBLIC "ISO 12200:1999A//DTD MARTIF core (XLTcdV04)//EN">
[2] <martif type='DXLT' lang='en'>
[3] <martifHeader>
<fileDesc><sourceDesc><p>from an Oracle corporation termBase</p></sourceDesc></fileDesc>
<encodingDesc><p type='DCSName'>DXLTdV04</p></encodingDesc>
</martifHeader>
[4] <text> <body>
[5] <termEntry id='ID67'>
[6] <descrip type='subjectField'>manufacturing</descrip>
[7] <descrip type='definition'>A value between 0 and 1 used in …</descrip>
[8] <langSet lang='en'>
[9] <tig>
<term>alpha smoothing factor</term>
[10] <termNote type='termType'>fullForm</termNote>
[11] </tig>
[12] </langSet>
[13] <langSet lang='hu'>
[14] <tig><term>Alfa sim&#x00ED;t&#x00E1;si t&#x00E9;nyez&#x00F5;</term></tig>
[15] </langSet>
[16] </termEntry>
[17] </body> </text>
[18] </martif>
Only a minimal acquaintance with XML is assumed
in the following explanation. Indeed an acquaintance with HTML from building
simple web pages, along with the knowledge that XML allows user-defined tag
names whereas HTML comes with a set of pre-defined tag names, should be
sufficient to allow understanding of the following explanation. For key DXLT
elements, the correspondence to the structural component of the meta-model in
the ISO TC 37 TMF project is given.
[1] <?xml ... : These lines
state that the following lines constitute an XML document that conforms to
version 1.0 of the definition of XML by the World Wide Web Consortium (W3C) and
to the DXLT DTD.
[2] <martif ...: This line
states that this particular XML document is a DXLT document and thus, along
with other members of the XLT family, can be validated against a specification
of the XLT core structure, which, for this document, is called XLTcdV04, and
can be checked for adherence against the master Default DCS module. The lang attribute indicates that the
default language for text in this document is English (ISO 639 code 'en').
[3] <martifHeader ...: These
lines provide global information about the collection: specifically, a file
description indicating that the example was derived from an entry in a
termbase used at Oracle corporation and that the DXLT DCS (DXLTdV04, 'd' for
DCS), not to be confused with the XLT core DTD (XLTcdV04, 'cd' for core DTD), is
being used.
[4] <text> <body>:
The text element surrounds the body element, which contains the collection of
concept-oriented "Terminological
Entry" (<termEntry>) elements.
[5] <termEntry ...: Each
termEntry element is one instance of the "Terminological Entry"
object class. The id attribute has a
value that is unique throughout the document, making it possible for other
elements to point unambiguously to this element.
[6] <descrip
type='subjectField' ...: The subject field data category is authorized by the
DCS (Data Constraint Specification) mentioned above. It consists of a meta data
category element (descrip) with the
specific data category indicated in the value of the type attribute.
[7] <descrip type='definition'
...: This piece of descriptive information is also associated with the concept.
[8] <langSet lang='en'>:
The langSet element corresponds to a "Language Section" object class,
according to which a Terminological Entry consists of associated information
and language sections. This line begins the English Language Section.
[9] <tig><term> ...:
The meta-model states that a Language Section consists of instances of a
"Term Section" object class, which, in DXLT corresponds to a
<tig> (or <ntig>) element. An instance of a Term Section consists
of a term and associated information, which in this case is the term type. The
name tig stands for term information group.
[10] <termNote type='termType'
...: This piece of descriptive information associated with the term is the
12620 data category "term type". Its value is "fullForm". A
termNote tag is used instead of descrip since the information is closely
associated with the term itself rather than the concept being described.
[11] </tig>: This element
simply ends the current Term Section.
[12] </langSet>: This element
ends the English Language Section.
[13] <langSet lang='hu'>:
This element begins the Hungarian Language Section.
[14] <tig> ...: This line
consists of a Term Section with a Hungarian term but no definition and no
explicit term type. Each character of the term that is not found in ISO 646 is
represented as a hex character reference corresponding directly to a Unicode
character. The actual Hungarian term is "Alfa
simítási tényező". Note that the final
character "&#x00F5;" (o-tilde)
should more properly be an o-double-acute, which is represented by the following
Unicode hex character reference: "&#x0151;", a character not
available in a typical font. In XML, a
Unicode hex character reference consists of "&#x" + the hex digits of the
character's Unicode code point (four digits in this example) + a semicolon.
[15] </langSet>: This element
ends the Hungarian Language Section.
[16] </termEntry>: This
element ends the current Terminological Entry.
[17] </body> </text>:
These elements end the set of terminological entries, which in this case
consists of only one entry, and the XLT text
element, which is the composite of terminological entries and other resources
called Complementary Information in the meta-model. In this DXLT document,
there are no resources outside the terminological entry. If there were, they
would be in the XLT element back.
[18] </martif>: This element
ends the entire DXLT document.
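The character-reference convention described in note [14] can be sketched in Python as follows. The function name to_hex_char_refs is hypothetical, and the sketch assumes that "not found in ISO 646" can be approximated as "outside the 7-bit range 0 to 127". It pads references to four hex digits, matching the convention used in this example, and it does not escape markup characters such as "&" or "<", which an export routine would also have to handle.

```python
def to_hex_char_refs(text: str) -> str:
    """Replace each character outside the 7-bit range with a hex character reference."""
    return "".join(
        # Keep 7-bit characters as they are; write others as &#xHHHH;
        ch if ord(ch) < 128 else "&#x{:04X};".format(ord(ch))
        for ch in text
    )
```

For instance, the o-double-acute (code point 0151) at the end of the Hungarian term becomes "&#x0151;", while the unaccented letters are left unchanged.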
This sample DXLT entry has several properties:
1. It corresponds directly to
the meta-model in the TMF project.
2. It
is a well-formed XML document.
3. It
conforms to DXLT, by being well-formed as well as being valid according to the
core structure and by adhering to the master data constraint specification
(DCS) module of DXLT.
This section defines the core structure of XLT informally,
particularly for a human analyst who seeks either to understand an XLT
document or to analyze source or target terminological data in order to prepare
a mapping that a programmer can use to write an automatic conversion routine
from the source format to, for example, DXLT or from DXLT to the target format.
The highest-level XML element in an XLT
document is the "martif" element, which consists of a
"martifHeader" element and a "text" element. (See Figure
6.1.)
The text
element in Figure 6.1 consists of terminological entries (that together make up
the XLT body element) and
"Complementary Information" (a meta-model object class) that are
found in the front and back elements.
The martifHeader
element corresponds to "Global Information" in the meta-model and
consists of a description of the whole terminological data collection (in the fileDesc element), information about the
data-category specification and character encoding (in the encodingDesc element), and a history of
major revisions to the collection (in the revisionDesc
element).
A question mark after an element in the
box-and-line diagrams below indicates that it is optional.
See Annex A for more detail on these elements.
Figure
6.1 — The highest-level elements
Each terminological concept entry in the body
element is called a termEntry (see
Figure 6.2) and follows the structure of the meta-model.
The "auxInfo" element in Figure 6.2
corresponds to "Terminology-related Information" in the meta-model,
and each piece of terminology-related information can be associated with any one
of three levels: the Terminological Entry
level (termEntry in XLT, i.e. the
concept level), the Language-section
level (langSet in XLT), and the Term-section level (ntig, or its simplified version, tig, in XLT). The termNote and
termNoteGrp elements at the Term-section level are also part of
Terminology-related Information in the meta-model and consist of term-related
descriptive elements that can only appear at the Term-section level and
below. The termCompList element
corresponds to the "Term Component Section" object class of the
meta-model.
(Figure 6.2 column labels: entry-level, language-level, term-level)
Figure
6.2 — The structure of a terminological entry in body
In XLT, auxInfo
consists of any combination of the following elements:
descrip,
descripGrp, admin, adminGrp, transacGrp, note, ref, and xref.
A ref
element is a cross-reference that points somewhere inside the martif element. An xref element is a cross-reference that points to an external object
using a URI (a URL or other Web address). A note
element, as expected, is a note. These three elements appear at various levels
to allow the creation of links and the recording of supplementary information.
A transacGrp
element gives information about a transaction. ISO 12620 (A.10.2) states that
the two terminology management functions concerning a transaction are date and responsibility. A date is specified by a date element, and a responsibility is specified by a transacNote
element. Thus, a transacGrp contains a transac
element that describes the transaction, accompanied by any combination of transacNote, date, note, ref, and xref elements that apply to the transaction. Any date in XLT must
appear within a transacGrp, even if an
implicit transaction must be made explicit.
An adminGrp
element is similar to a transacGrp in
that it groups information pertaining to another element, in this case an admin rather than a transac, specifically, a combination of adminNote, note, ref, and xref. An admin is a
simplified adminGrp in which there is
just a single admin element and the
adminGrp container has been omitted.
A descripGrp
element consists of a descrip element followed by any combination of descripNote, admin, adminGrp, transacGrp, note, ref, and xref elements.
The descrip
and admin elements are examples of
meta data categories in XLT. Each instance of a meta data category in XLT is an
element that is specialized by the value of its type attribute. The various instantiations of the meta data
categories are given in section 7. The
DXLT DCS file restricts each instantiation of a descrip to certain levels.
A termNoteGrp element, like other …Grp
elements, consists of a base element, in this case a termNote, and auxiliary information,
in this case, admin, adminGrp, transacGrp, note, ref, and xref elements. A comparison with descripGrp shows that the
difference is that there are no descrip elements in a termNoteGrp. This is because descrip contains concept-related data categories that do not apply
to the term itself.
A termCompList
element shows the internal composition of a term and consists of a combination
of termCompGrp and, in the simplified
case, termComp elements. A
termCompGrp, consistent with the pattern set by other ...Grp elements, consists of a termComp element and a combination of termNote, termNoteGrp, admin, adminGrp, transacGrp, note, ref, and xref elements that apply to it. Each termComp element contains some component of a term, such as one of
the words of which it is composed.
In XLT, elements such as descrip, descripNote, admin, adminNote, transac, transacNote, termNote, note, ref, and xref contain text. Sometimes, the permissible values of the
element are restricted to a picklist. In other cases, the element can contain
free text. There are three types of free text in XLT: plain, basic, and note. Plain text (#PCDATA) is defined by
the XML specification. It contains no elements, only characters and character
entities. Basic text is plain text with the addition of optional embedded hi elements. A hi element highlights a segment of text and optionally points to
another element. One use of hi is to mark an entailed term inside a
definition. A term element contains
basic text. Note text, which is used in definitions and contextual examples and
similar elements, allows the following additional embedded elements besides hi: foreign,
bpt, ept, it, ph, and ut. The foreign element
is used to mark a segment of text that is in a different language from the
surrounding text, e.g. "a <foreign lang='fr'> pamplemousse
</foreign> is a grapefruit."
The five elements, bpt, ept, it, ph,
and ut, are meta-markup tags that are
used to mark up (i.e., encapsulate) markup to distinguish it from text. They
allow XLT elements to contain various kinds of markup that needs to be retained
but not necessarily processed during terminology management functions. Any such
enclosed markup is modified so that start-tag characters ('<') become
entities (&lt;) and ampersands become entities (&amp;). If a piece of
markup to be encapsulated consists of two paired pieces of markup, such as the
markup used to show that a piece of text is to be in bold or italics, then bpt and ept (begin and end paired tags) are used. If the markup to be encapsulated consists of one piece that would
be paired except that the other piece was cut off and appears outside the
current element, then an it (isolated
tag) is used. If the piece of markup to be encapsulated stands on its own,
marking a place such as a footnote, then ph
(placeholder) is used. If the
categorization of the piece of markup is unknown, then ut (unknown tag) is used.
Suppose one has the following segment of text, in which the word "big" is
displayed in bold, to put into an XML element in XLT:
"We need a big dog."
The marked-up text underlying this presentation might be:
"We need a <bold> big </bold> dog."
This is not a problem for meta-markup tags. One can put it into an XLT element as
follows:
"We need a <bpt i='1'>&lt;bold></bpt> big <ept i='1'>&lt;/bold></ept> dog."
Then one can get the original segment back by taking out the meta-markup tags and
converting any "&lt;" inside a meta-markup tag back to "<".
Now consider the following segment (that uses SGML markup):
"We need a big but < 50 pound dog", which might have the following underlying
SGML markup:
"We need a <bold> big but &lt; 50 pound </bold> dog"
(i.e. a "big but less-than-fifty-pound dog" in which the less-than sign
"<" has already been converted to an SGML entity in the source
segment before placing it into XLT, since in this case the less-than sign is a
literal rather than an escape character).
One would put it into an XLT segment as follows:
"We need a <bpt i='1'>&lt;bold></bpt> big but &amp;lt; 50 pound
<ept i='1'>&lt;/bold></ept> dog."
Then, when we try to reconstruct the original segment, we will get what we started
with, since the &amp; will be converted back to an ampersand.
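The round trip described in the two examples above can be sketched in Python. This is an illustration under stated assumptions, not a normative algorithm: the function names escape, encapsulate_pair, and restore are hypothetical, only the bpt/ept case is built, and the native markup is assumed to have been split into its parts by the caller.

```python
import re

# Matches start and end tags of the five meta-markup elements.
META_TAG = re.compile(r"</?(?:bpt|ept|it|ph|ut)\b[^>]*>")


def escape(native: str) -> str:
    """Turn ampersands and start-tag characters into entities."""
    # Ampersands go first so the '&' of a newly written '&lt;' is not escaped again.
    return native.replace("&", "&amp;").replace("<", "&lt;")


def encapsulate_pair(before, open_tag, inner, close_tag, after, i=1):
    """Wrap one pair of native tags in bpt/ept and escape everything else."""
    return (
        f"{escape(before)}<bpt i='{i}'>{escape(open_tag)}</bpt>"
        f"{escape(inner)}<ept i='{i}'>{escape(close_tag)}</ept>{escape(after)}"
    )


def restore(xlt_text: str) -> str:
    """Recover the native segment from encapsulated XLT text."""
    stripped = META_TAG.sub("", xlt_text)
    # '&lt;' is restored before '&amp;' so that '&amp;lt;' yields '&lt;', not '<'.
    return stripped.replace("&lt;", "<").replace("&amp;", "&")


if __name__ == "__main__":
    xlt = encapsulate_pair(
        "We need a ", "<bold>", " big but &lt; 50 pound ", "</bold>", " dog."
    )
    print(xlt)
    print(restore(xlt))
```

The order of the two replacements matters in both directions; reversing either one would corrupt a literal less-than entity that was already present in the native segment.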
HTML tags are one kind of markup that may be
enclosed inside meta-markup elements. This allows the markup to be retained and
processed during display or import without unduly complicating the core
structure by including the XHTML DTD in the XLT core structure. Any kind
of markup, including RTF, can be encapsulated in meta-markup tags and later
retrieved without loss of information. The XLT approach to meta markup is
borrowed from the TMX format of LISA, an ISO/TC 37 liaison organization and
supporter of the SALT project.
The meta data categories of DXLT are as
follows. Each of them can potentially be given multiple instantiations in a DCS
module, each instantiation specifying one data category. In DXLT, the specific data
category instantiation is indicated by the value of a type attribute (e.g.
<descrip type='definition'>).
a) termNote (A termNoteGrp element receives the data category of its termNote element.)
b) termComp (Each termComp element in a termCompList inherits the data category of the list; then each termCompGrp element receives the data category of its termComp element.)
c) admin (An adminGrp element receives the data category of its admin.)
d) adminNote
e) transac (A transacGrp element receives the data category of its transac element.)
f) transacNote
g) descrip (A descripGrp element receives the data category of its descrip element.)
h) descripNote
i) ref
j) xref
k) refObject (Each refObject element in a refObjectList inherits the data category of the list.)
In general, a
…Grp element in DXLT receives
the data category of the first element of the group, and all the elements of a …List element inherit the data category
of the list. If the …Grp elements
were not optional in the simple case of a single element, then the data
category would be specified on the …Grp
element directly.
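The rule that a …Grp element receives the data category of its first element can be sketched as follows. This Python fragment is an illustration only; the function name data_category is hypothetical, and the inheritance from a …List element is not covered.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def data_category(element: ET.Element) -> Optional[str]:
    """Return the data category of a meta data category element or of a ...Grp element."""
    if element.tag.endswith("Grp"):
        # A ...Grp element has no type attribute of its own:
        # it receives the data category of its first child element.
        first_child = next(iter(element), None)
        return None if first_child is None else first_child.get("type")
    return element.get("type")


if __name__ == "__main__":
    grp = ET.fromstring(
        "<descripGrp><descrip type='definition'>...</descrip>"
        "<note>...</note></descripGrp>"
    )
    print(data_category(grp))
```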
A term is not formally a meta data category in
DXLT, but the termType data category
used with a termNote element is used
to specify term type, thus rendering a term element an indirect meta data
category.
6.5 Attributes
The
main attributes used in DXLT are lang
(language), type, id (to identify an
element uniquely), and target (to
point to an ID). Additional attributes are found in Annex A.
The value of the lang attribute inherits
downward through the implied tree structure of the XML document unless
overridden by another lang attribute. The martif element is required to have a
lang attribute. The language specified in the martif element becomes the
working language of the entire DXLT file. Each langSet element must also
specify a language that applies to that Language Section. Thus, a definition at
Terminological Entry level is assumed to be in the working language of the
martif file unless otherwise specified, and a note in a Language Section is
assumed to be in the language of that Language Section unless otherwise
specified.
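The downward inheritance of the lang attribute can be sketched in Python with the standard xml.etree.ElementTree module. The sample below is a heavily abbreviated, hypothetical martif fragment (no martifHeader), and the function name effective_lang is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated fragment: working language 'en', one Hungarian langSet.
SAMPLE = """<martif lang='en'><text><body><termEntry>
  <descrip type='definition'>...</descrip>
  <langSet lang='hu'><note>...</note></langSet>
</termEntry></body></text></martif>"""


def effective_lang(root, element):
    """Return the language that applies to element: its own lang attribute,
    or else the lang attribute of the nearest ancestor that has one."""
    # ElementTree keeps no parent pointers, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = element
    while node is not None:
        lang = node.get("lang")
        if lang is not None:
            return lang
        node = parents.get(node)
    return None


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(effective_lang(root, root.find(".//descrip")))  # entry-level definition
    print(effective_lang(root, root.find(".//note")))     # note in the langSet
```

The entry-level definition resolves to the working language of the martif element, and the note resolves to the language of its Language Section, as described above.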
The allowed values of the lang attribute in
XLT are the same as the allowed values of the lang attribute in TMX.
The id
and target attributes work together to
point unambiguously between elements in the same martif file. For example, one
entry:
<termEntry id="c5574">
...(entry for "hunting dog")
</termEntry>
could be pointed to by another entry:
<termEntry>
<descrip
type="superordinateConceptGeneric" target="c5574">hunting
dog</descrip>
…(entry for "Retriever" [a type of hunting
dog])
</termEntry>
The redundant content "hunting dog" in the
second entry is for display purposes. It provides a name for the link to the
other entry that can be viewed by a human who is deciding whether to follow the
link.
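How a processing application might follow such links can be sketched in Python. The sample data below is a hypothetical expansion of the two entries above (the elided entry contents are replaced by invented minimal terms), and the function name resolve_targets is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical expansion of the two entries shown above.
SAMPLE = """<body>
  <termEntry id="c5574">
    <langSet lang="en"><tig><term>hunting dog</term></tig></langSet>
  </termEntry>
  <termEntry>
    <descrip type="superordinateConceptGeneric" target="c5574">hunting dog</descrip>
    <langSet lang="en"><tig><term>retriever</term></tig></langSet>
  </termEntry>
</body>"""


def resolve_targets(root):
    """Pair every element that has a target attribute with the element
    whose id matches it (None if no element has that id)."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    links = []
    for el in root.iter():
        target = el.get("target")
        if target is not None:
            links.append((el, by_id.get(target)))
    return links


if __name__ == "__main__":
    for source, destination in resolve_targets(ET.fromstring(SAMPLE)):
        found = "unresolved" if destination is None else destination.tag
        print(source.get("type"), "->", source.get("target"), found)
```

A pair whose second member is None indicates a dangling target, which an importing application would probably want to report.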
This section describes the Default data
constraint specification (DCS) module for DXLT, which is based on data categories
selected from ISO 12620 to support somewhat blind
interchange. The formal, machine-processable version of the DXLT master DCS
module can be found in Annex B. It is
referred to as the master DCS of DXLT when distinguishing it from a particular
user-group subset DCS.
NOTE: The list orders the data categories according to
the section of ISO 12620 in which they are described. It is also the order in
which they appear in the master DCS module.
The following tables define the DXLT master DCS
(data constraint specification), which describes the data categories in DXLT
that are implemented as XML elements that instantiate a meta data category. The
remaining data categories are implemented as the term element, the note
element, the date element, the lang attribute, the id attribute, the hi
element, and the foreign element.
These basic data categories are mentioned in section 6, since they are part of
the core structure.
Guidelines for encoding particular data
categories in DXLT as XML are given in Annex C.
Each data category other than the basic data
categories is related to the meta-model by being classified as either
administrative or descriptive. Descriptive data categories may describe either
a concept or a term. All data categories that use the Martif tag name descrip are concept-related descriptive
data categories. All data categories
that use the Martif tag names termNote
or termComp are term-related
descriptive data categories. All data categories that use the tag name admin are administrative. Descriptive
and administrative data categories are further divided into properties and
relations. In DXLT, a data category is
a relation if the target attribute is allowed by the DCS file. Notes can be
either administrative or descriptive.
In the following table (split into parts for
convenience), the first column (ISO 12620) is the position code of the data
category in ISO 12620. The second column (Martif Data Category Name) is the
name of that data category when given as the value of the type attribute. Typically, it consists of the name in ISO 12620
with spaces removed and the first letter of the second and subsequent words
upper-cased. The third column (TextType) tells what kind of text is allowed in
the element. The fourth column tells whether this element can take a target
attribute, in which case it indicates what kind of element can be targeted. The
fifth column (Martif Tag Name) tells which meta data category is used in DXLT
for this data category. The sixth column (Level) gives any exceptional
information about the levels in the meta-model at which a particular data
category can appear. Admin elements can appear at any level. Descrip elements can appear at the
entry, language, or term levels unless otherwise restricted (using codes TE for
Terminological Entry, LS for Language Section, and TM for term). TermNote elements can appear only at
the term level, unless authorized (by a TC code) to appear also at the Term Component level.
The last column (column seven) contains various
comments. The code PA means that this data category is not yet officially in
ISO 12620 and is thus Pending Approval (PA). Picklists are found after the
tables in footnotes. If the comment column contains a position code of a data
category, this indicates that the listed data category has been combined with
the data category of the current row.
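The typical derivation of a Martif data category name from its ISO 12620 name (spaces removed, first letter of the second and subsequent words upper-cased) can be sketched as follows. The function name martif_name is hypothetical, and because the rule is only typical, the tables remain the authority for the actual names.

```python
def martif_name(iso_name: str) -> str:
    """Join the words of an ISO 12620 data category name, upper-casing the
    first letter of every word after the first."""
    words = iso_name.split()
    if not words:
        return ""
    first, rest = words[0], words[1:]
    # Only the first letter of each later word is changed; the rest is kept as is.
    return first + "".join(word[:1].upper() + word[1:] for word in rest)


if __name__ == "__main__":
    for name in ("term type", "part of speech", "grammatical gender"):
        print(name, "->", martif_name(name))
```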
Data
categories that do not have a picklist in the DXLT master DCS can have a
picklist in a user-group subset DCS of DXLT (see section 8) if the user-group
in question can agree on a picklist for that data category. One obvious
candidate for a user-group picklist is partOfSpeech, for which there is no
agreed-on picklist when all the languages of the world and all linguistic
theories are to be taken into account.
List 7.2
Basic data categories:
- term [A.01]: <term>
- highlighted text: <hi>
- foreign language text: <foreign>
- language [A.10.07]: the lang attribute, e.g. lang="es" on an element
- element identifier [A.10.15]: the id attribute, e.g. <termEntry id="eid-45631">
- date [A.10.02.01]: <date>
- comment [A.08]: <note>
Table 7.2a-n
Table
7.2a — Types of terms (12620: A.2.1)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.02.01 | termType | picklist | none | termNote | | f1 |
A.02.01.05 | commonNameFor | basicText | term | termNote | | |
A.02.01.08 | abbreviatedFormFor | basicText | term | termNote | | |
f1: picklist: mainEntryTerm, synonym, internationalScientificTerm, commonName,
internationalism, fullForm, shortForm, abbreviatedForm, variant,
transliteratedForm, transcribedForm, symbol, formula, equation,
logicalExpression, sku, partNumber, phraseologicalUnit, standardText
Table
7.2b — Grammar, Usage, and Origin (12620: A.2.2 – A.2.4)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.02.02.01 | partOfSpeech | plainText | none | termNote | TC | |
A.02.02.02 | grammaticalGender | picklist | none | termNote | TC | f2 |
A.02.02.03 | grammaticalNumber | picklist | none | termNote | TC | f3 |
A.02.02.04 | animacy | picklist | none | termNote | TC | f4 |
A.02.02.07 | grammaticalValency | plainText | none | termNote | TC | PA |
A.02.03.01 | usageNote | noteText | none | termNote | | |
A.02.03.02 | geographicalUsage | picklist | none | termNote | | f5 |
A.02.03.03 | register | picklist | none | termNote | | f6 |
A.02.03.04 | frequency | picklist | none | termNote | | f7 |
A.02.03.05 | temporalQualifier | picklist | none | termNote | | f8 |
A.02.03.06 | timeRestriction | noteText | none | termNote | | |
A.02.03.07 | proprietaryRestriction | picklist | none | termNote | | f9 |
A.02.04.01 | termProvenance | picklist | none | termNote | | f10 |
A.02.04.02 | etymology | basicText | none | termNote | TC | |
f2: picklist: masculine, feminine, neuter, other
f3: picklist: singular, plural, dual, mass, other
f4: picklist: animate, inanimate, other
f5: picklist: SF, CH, FR etc. from ISO 3166 (country codes)
f6: picklist: neutralRegister, technicalRegister, in-houseRegister, bench-levelRegister,
slangRegister, vulgarRegister
f7: picklist: commonlyUsed, infrequentlyUsed, rarelyUsed
f8: picklist: archaicTerm, outdatedTerm, obsoleteTerm
f9: picklist: trademark, tradeName
f10: picklist: transdisciplinaryBorrowing, translingualBorrowing, loan, translation, neologism
PA = Pending Approval
Table
7.2c — Term components (12620: A.2.5 – A.2.8)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.02.05 | pronunciation | basicText | none | termNote | | |
A.02.06 | syllabification | basicText | none | termCompList | | |
A.02.07 | hyphenation | basicText | none | termCompList | | |
A.02.08.01 | morphologicalElement | basicText | none | termCompList | | |
A.02.08.02 | termElement | basicText | none | termCompList | | |
A.02.08.03 | termStructure | noteText | none | termNote | | PA |
PA = Pending Approval
Table
7.2d — Term status (12620: A.2.9)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.02.09.01 | normativeAuthorization | picklist | none | termNote | | f11 |
A.02.09.02 | languagePlanningQualifier | picklist | none | termNote | | f12 |
A.02.09.03 | administrativeStatus | picklist | none | termNote | | f13 |
A.02.09.04 | processStatus | picklist | none | termNote | | f14 |
Note: discussion is needed as to whether processStatus should have a picklist or be plainText
f11: picklist: standardizedTerm, preferredTerm, admittedTerm, deprecatedTerm, supercededTerm,
legalTerm, regulatedTerm
f12: picklist: recommendedTerm, nonstandardizedTerm, proposedTerm, newTerm
f13: picklist: standardizedTerm, preferredTerm, admittedTerm, deprecatedTerm, supercededTerm,
legalTerm, regulatedTerm
f14: picklist: unprocessed, provisionallyProcessed, finalized
Table
7.2e — Equivalence (12620: A.3)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.03.02 | falseFriend | basicText | term | termNote | | |
A.03.04 | reliabilityCode | picklist | none | descrip | | f15 |
A.03.05 | transferComment | noteText | term | termNote | | |
f15: picklist: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Table
7.2f — Classification System (12620: A.4)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.04 | subjectField | plainText | none | descrip | LS, TM | |
A.04.02 | classificationCode | plainText | classSysDescrip | descrip | LS, TM | A.4.1+ |
Note: In 12620, A.04.02 is called classificationNumber.
Table 7.2g — Concept-related descriptions (12620: A.5)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.05.01 | definition | noteText | none | descrip | LS, TM | |
A.05.02 | explanation | noteText | none | descrip | LS, TM | |
A.05.03 | context | noteText | none | descrip | LS, TM | |
A.05.04 | example | noteText | none | descrip | LS, TM | |
A.05.05.01 | figure | noteText | binaryData | descrip | TE, LS, TM | |
A.05.05.02 | audio | noteText | binaryData | descrip | TE, LS, TM | |
A.05.05.03 | video | noteText | binaryData | descrip | TE, LS, TM | |
A.05.05.04 | table | noteText | binaryData | descrip | TE, LS, TM | |
A.05.05.05 | otherBinaryData | noteText | binaryData | descrip | TE, LS, TM | |
A.05.06 | unit | noteText | none | descrip | TM | |
A.05.07 | range | noteText | none | descrip | TM | |
A.05.08 | characteristic | noteText | none | descrip | TM | |
Table
7.2h — Concept relations (12620: A.6 and A.7 combined)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.07.02 | conceptPosition | plainText | conceptSysDescrip | descrip | TE, LS | A.07.01+ |
A.07.02.01 | broaderConceptGeneric | basicText | entry | descrip | TE, LS | A.06.01+ |
A.07.02.01 | broaderConceptPartitive | basicText | entry | descrip | TE, LS | A.06.02+ |
A.07.02.02 | superordinateConceptGeneric | basicText | entry | descrip | TE, LS | A.06.01+ |
A.07.02.02 | superordinateConceptPartitive | basicText | entry | descrip | TE, LS | A.06.02+ |
A.07.02.03 | subordinateConceptGeneric | basicText | entry | descrip | TE, LS | A.06.01+ |
A.07.02.03 | subordinateConceptPartitive | basicText | entry | descrip | TE, LS | A.06.02+ |
A.07.02.04 | coordinateConceptGeneric | basicText | entry | descrip | TE, LS | A.06.01+ |
A.07.02.04 | coordinateConceptPartitive | basicText | entry | descrip | TE, LS | A.06.02+ |
A.07.02.05.01 | relatedConceptBroader | basicText | entry | descrip | TE, LS | |
A.07.02.05.02 | relatedConceptNarrower | basicText | entry | descrip | TE, LS | |
A.07.02.05 | relatedConcept | basicText | entry | descrip | TE, LS | |
A.07.02.06 | sequentialRelatedConcept | basicText | entry | descrip | TE, LS | A.06.03 |
A.07.02.07 | temporallyRelatedConcept | basicText | entry | descrip | TE, LS | A.06.03.01 |
A.07.02.08 | spatiallyRelatedConcept | basicText | entry | descrip | TE, LS | A.06.03.02 |
A.07.02.09 | associatedConcept | basicText | entry | descrip | TE, LS | A.06.04 |
A.10.18.06 | antonymTerm | basicText | term | descrip | TM | |
A.10.18.06 | antonymConcept | basicText | entry | descrip | TE | |
Note: further discussion is needed concerning
whether antonyms are term-relations, concept-relations, or both
Table
7.2i — Specialized notes (12620: A.8)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.08.01 | descripType | picklist | element | descripNote | | PA, f16 |
A.08.02 | definitionType | picklist | element | descripNote | | PA, f17 |
A.08.03 | contextType | picklist | element | descripNote | | PA, f18 |
Note: General note is not a meta data category
f16: picklist: translation
f17: picklist: intensionalDefinition, extensionalDefinition, partitiveDefinition
f18: picklist: definingContext, explanatoryContext, associativeContext, linguisticContext,
metalinguisticContext
Table 7.2j — Documentary language (e.g., thesaurus) (12620: A.9)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.09.02 | thesaurusDescriptor | basicText | thesaurusDescrip | descrip | TE | A.09.01+ |
A.09.04 | keyword | plainText | none | admin | | |
A.09.05 | indexHeading | plainText | none | admin | | |
Table
7.2k — Transactions (12620: A.10.1 – A:10.2-3)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.10.01 | transactionType | picklist | none | transac | | f19 |
A.10.02.02 | responsibility | basicText | Person/Org | transacNote | | two data categories |
A.10.02.03 | count | plainText | none | transacNote | | |
A.10.02.10 | subsetOwner | basicText | personOrg | admin | | |
Notes: <date> can also appear in a <transacGrp>; the two A.10.02.02
data categories are responsiblePerson and responsibleOrg
f19: picklist: creation [formerly origination], input, modification, check, approval, withdrawal,
standardization, exportation, importation, proposal, userAccess
Table
7.2l — Subsets (12620: A.10.3)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.10.03.01 | customerSubset | plainText | none | admin | | |
A.10.03.03 | projectSubset | plainText | none | admin | | |
A.10.03.05 | productSubset | plainText | none | admin | | |
A.10.03.06 | applicationSubset | plainText | none | admin | | |
A.10.03.07 | environmentSubset | plainText | none | admin | | |
A.10.03.08 | businessUnitSubset | plainText | none | admin | | |
A.10.03.09 | securitySubset | picklist | none | admin | | f20 |
Note: entailed term (10.6.1) is implemented in DXLT using the hi
element, and foreign (10.8) is implemented in DXLT using the foreign
element.
f20: picklist: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (1 = public; 10 = highly confidential)
Table 7.2m — Other administrative information (12620: A.10.4 – A.10.21, except antonym)
ISO 12620 | Martif Data Category Name | TextType | Target | Martif Tag Name | Level | Comments |
A.10.06.02 | sortKey | plainText | none | admin | | |
A.10.06.03 | searchTerm | basicText | none | admin | | |
A.10.13 | entrySource | noteText | ? | admin | TE, LS, TM | |
A.10.14 | conceptIdentifier | noteText | ? | admin | TE, LS, TM | |
A.10.18 | see | noteText | element | ref | TE, LS, TM | |
A.10.18 | crossReference | noteText | any element | ref | TE, LS, TM | |
A.10.18 | xCrossReference | noteText | external | xref | TE, LS, TM | |
A.10.18.05 | homograph | basicText | term | termNote | | |
A.10.18.06 | antonym | basicText | term | descrip | | |
A.10.19 | source | noteText | none | admin | TE, LS, TM | |
A.10.20 | sourceIdentifier | noteText | bibl | admin | TE, LS, TM | |
A.10.22.01 | originatingInstitution | noteText | org | admin | TE, LS, TM | |
A.10.22.02 | originatingPerson | noteText | per | admin | TE, LS, TM | |
A.10.22.03 | originatingDatabase | noteText | ? | admin | TE, LS, TM | |
A.10.23 | sourceLanguage | picklist | | admin | TE | |
A.10.24 | domainExpert | plainText | | admin | TE | |
Table
7.2n — Types of refObjects (Other Resources) in front and back elements
Note: These data categories are used to describe Other Resources and are
not used inside a termEntry element.
Meta data category | Instantiation | Content/Comments |
refObjectList | bibl | structured bibliographic reference (see 12620 Annex B) |
refObjectList | binaryData | encoded as two hex digits per byte or as MIME |
refObjectList | conceptSysDescrip | description of an external system |
refObjectList | person/Org | details about a person or an organization |
refObjectList | subjectFieldSet | set of subject field names and descriptions of them |
refObjectList | colSequenceDescrip | description of collating sequence |
In <back>, <refObject> elements are placed in
<refObjectList> elements to group similar types of reference objects.
In <front>, namespaces for various types of
objects
As indicated in the
Introduction, user-group subsets can be defined for DXLT. This section shall
describe one such subset. In this subset, which we shall call the Supplier
subset, we only allow minimal terminological information provided to a fictive
supplier of translation along with the source text to be translated, in a very
restrictive environment involving manufacturing and finance only.
We shall allow only two
types of terms, full forms and abbreviated forms. This is done by specifying a
picklist of allowed values for the data category termType, which is an instantiation of the meta data category termNote. Here is the information that
is placed in the DCS module concerning term type:
• meta data category: termNote
• data category: termType
• picklist: fullForm abbreviatedForm
This specification is a
strict subset of the specification for termType
in the master DCS module. The only difference is that the master DCS module
allows more options in the picklist. Clearly, any DXLT document that conforms
to the subset specification shall also conform to the master specification.
We shall allow only two
types of descriptive information: a subject field and a definition. The subject
fields allowed in this subset are manufacturing and finance, and we only allow
subject field specifications at the terminological entry level:
• meta data category: descrip
• data category: subjectField
• picklist: manufacturing finance
• levels: termEntry
The master DCS module
allows any plain text value for a subject field, so a subset module can specify
a picklist. Obviously, there can be no picklist of possible definitions, so we
specify that a definition can contain the same type of text found in general
notes (noteText). We shall allow definitions at two levels, entry and language.
This is done by placing
the following information in the DCS module:
• meta data category: descrip
• data category: definition
• content: noteText
• levels: termEntry langSet
An actual machine-processable
DCS file for the Supplier subset of DXLT would look like this:
<?xml version="1.0"?>
<martifDCS name='DXFd-supplier' version="0.4" lang='en' xmlns="x-schema:MTFssV04.xml">
  <header><title>subset DCS file for the Supplier example</title></header>
  <datCatSet>
    <termNoteSpec name="termType" position="2.1">
      <contents datatype="picklist" targetType="none">fullForm abbreviatedForm</contents>
    </termNoteSpec>
    <descripSpec name="subjectField" position="4">
      <contents datatype="picklist" targetType="none">manufacturing finance</contents>
      <levels>termEntry</levels>
    </descripSpec>
    <descripSpec name="definition" position="5.1">
      <contents datatype="noteText" targetType="none"/>
      <levels>termEntry langSet</levels>
    </descripSpec>
  </datCatSet>
</martifDCS>
It should be noted that the
machine-processable DCS module corresponds in a straightforward manner to the
information listed for the three data categories presented in the above
examples. The position numbers identify the location of the description of the
data category in ISO 12620. Creating a DCS module does not require an extensive
knowledge of XML nor expertise in writing XML DTDs or schemas.
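To illustrate how an application might use such a DCS file, the following Python sketch reads the picklists from an abbreviated copy of the Supplier DCS and checks candidate values against them. It is an illustration under stated assumptions, not a DXLT validator: the function names local_name, load_picklists, and value_in_picklist are hypothetical, and picklist values are assumed to be separated by white space as in the file above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the Supplier DCS file shown above.
DCS = """<martifDCS name='DXFd-supplier' version="0.4" lang='en' xmlns="x-schema:MTFssV04.xml">
  <datCatSet>
    <termNoteSpec name="termType" position="2.1">
      <contents datatype="picklist" targetType="none">fullForm abbreviatedForm</contents>
    </termNoteSpec>
    <descripSpec name="subjectField" position="4">
      <contents datatype="picklist" targetType="none">manufacturing finance</contents>
      <levels>termEntry</levels>
    </descripSpec>
  </datCatSet>
</martifDCS>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to element names."""
    return tag.rsplit("}", 1)[-1]


def load_picklists(dcs_xml):
    """Map each data category name to its set of picklist values."""
    picklists = {}
    for spec in ET.fromstring(dcs_xml).iter():
        if not local_name(spec.tag).endswith("Spec"):
            continue
        for child in spec:
            if local_name(child.tag) == "contents" and child.get("datatype") == "picklist":
                picklists[spec.get("name")] = set((child.text or "").split())
    return picklists


def value_in_picklist(picklists, data_category, value):
    """Return True if value is allowed; categories without a picklist are not checked."""
    allowed = picklists.get(data_category)
    return allowed is None or value in allowed


if __name__ == "__main__":
    picklists = load_picklists(DCS)
    print(value_in_picklist(picklists, "termType", "fullForm"))
    print(value_in_picklist(picklists, "termType", "synonym"))
```

The sketch checks picklist membership only; the levels at which a data category may appear would need a separate check.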
Clearly, specifying only
three data categories (term type, subject field, and definition) as instances
of meta data categories defines a very limited subset of DXLT; nevertheless,
this limited data-category module can be logically combined with the
core-structure module of DXLT to allow such DXLT-compliant documents as the
example in section 5. Elements that are not meta data categories, i.e., basic
XLT data categories such as <term> and <note>, are allowed without
explicit mention in the DCS module because they are part of the core structure
and need not be mentioned in the DCS.
As previously mentioned,
it is anticipated that the TBX format of OSCAR will be a subset of DXLT.
Annex A
(normative)
Core structure module
The entities allow mnemonic names to be given to pieces of text, especially text used in several places. The elements of XLT are divided into three groups: (a) the low-level elements used to mark up text, such as markup inside definitions and contextual examples; (b) the elements needed to constitute a terminological entry (<termEntry>); and (c) high-level elements and other elements not used in a terminological entry, e.g. header elements. After these three groups of elements, attribute lists for all elements are given in alphabetical order. Each meta data category is given a comment indicating that its specializations and other constraints on it are found in the DCS module.
The following DTD is a formal XML representation of the core structure defined informally in Section 6. It can be found as a text file in the DXLT zip file as XLTCDv04.xml (XLT Core-structure DTD version 0.4).
There is also an XDR schema version of the core structure that works with Internet Explorer 5. It can be found in the DXLT zip file as XLTCSv04.xml (XLT Core-structure Schema version 0.4).
<!--
=================================================================================
SOME USEFUL ENTITIES THAT ARE
REFERENCED BELOW
==================================================================================
-->
<!ENTITY % basicText '(#PCDATA|hi)*' >
<!ENTITY % noteText '(#PCDATA|hi|foreign|bpt|ept|it|ph|ut)*' >
<!ENTITY % auxInfo '(descrip|descripGrp|admin|adminGrp|transac|transacGrp|note|ref|xref)*' >
<!ENTITY % noteLinkInfo '(admin|adminGrp|transac|transacGrp|note|ref|xref)*' >
<!-- Entities that define common sets of attributes -->
<!ENTITY % impIDLang 'id ID #IMPLIED lang CDATA #IMPLIED' >
<!ENTITY % impIDType 'id ID #IMPLIED type CDATA #IMPLIED' >
<!ENTITY % impIDLangTypTgtDtyp 'id ID #IMPLIED lang CDATA #IMPLIED
    type CDATA #REQUIRED target IDREF #IMPLIED datatype CDATA #IMPLIED' >
<!-- ================================================================================
ELEMENTS USED FOR TEXT MARKUP
================================================================================ -->
<!ELEMENT hi (#PCDATA) >
<!ELEMENT foreign (%basicText;) >
<!-- meta-markup elements
borrowed from OSCAR -->
<!ELEMENT bpt (#PCDATA)* >
<!ELEMENT ept (#PCDATA)* >
<!ELEMENT it (#PCDATA)* >
<!ELEMENT ph (#PCDATA)* >
<!ELEMENT ut (#PCDATA) >
<!-- ================================================================================
ELEMENTS NEEDED FOR
TERMINOLOGICAL ENTRIES (IN ALPHABETICAL ORDER)
================================================================================
-->
<!ELEMENT admin (%basicText;) >
<!ELEMENT adminGrp (admin, (adminNote|note|ref|xref)*) >
<!ELEMENT adminNote (%noteText;) >
<!ELEMENT date (#PCDATA) >
<!ELEMENT descrip (%noteText;) >
<!ELEMENT descripGrp (descrip, (descripNote|admin|adminGrp|transacGrp|note|ref|xref)*) >
<!ELEMENT descripNote (%noteText;) >
<!ELEMENT langSet ((%auxInfo;), (tig | ntig)+) >
<!ELEMENT note (%noteText;) >
<!ELEMENT ntig (termGrp, %auxInfo;) >
<!ELEMENT ref (#PCDATA) >
<!ELEMENT term (%basicText;) >
<!ELEMENT termComp (%basicText;) >
<!ELEMENT termCompGrp (termComp, (termNote|termNoteGrp)*, %noteLinkInfo;) >
<!ELEMENT termCompList ((%auxInfo;), (termComp | termCompGrp)+) >
<!ELEMENT termEntry ((%auxInfo;), (langSet+)) >
<!ELEMENT termGrp (term, (termNote|termNoteGrp)*, (termCompList)*) >
<!ELEMENT termNote (%noteText;) >
<!ELEMENT termNoteGrp (termNote, %noteLinkInfo;) >
<!ELEMENT tig (term, (termNote)*, %auxInfo;) >
<!ELEMENT transac (%basicText;) >
<!ELEMENT transacGrp (transac, (transacNote|date|note|ref|xref)*) >
<!ELEMENT transacNote (%noteText;) >
<!ELEMENT xref (#PCDATA) >
<!--
===================================================================================
OTHER ELEMENTS (in
hierarchical order)
===================================================================================
-->
<!ELEMENT martif (martifHeader, text) > <!-- *** starting element *** -->
<!ELEMENT martifHeader (fileDesc, encodingDesc?, revisionDesc?) >
<!ELEMENT p (%noteText;) > <!-- p is used in several header elements -->
<!ELEMENT fileDesc (titleStmt?, publicationStmt?, sourceDesc+) >
<!ELEMENT titleStmt (title, note*) >
<!ELEMENT title (#PCDATA) >
<!ELEMENT publicationStmt (p+) >
<!ELEMENT sourceDesc (p+) >
<!ELEMENT encodingDesc (ude?, p+) >
<!ELEMENT ude (map+) >
<!ELEMENT map EMPTY >
<!ELEMENT revisionDesc (change+) >
<!ELEMENT change (p+) >
<!ELEMENT text (front?, body, back?) >
<!ELEMENT front (#PCDATA) > <!-- here put Other Resources, each in a namespace -->
<!ELEMENT body (termEntry+) >
<!ELEMENT back ((refObjectList)*) >
<!ELEMENT refObjectList (refObject+) >
<!ELEMENT refObject ((itemSet | itemGrp | item)+) >
<!ELEMENT item (%basicText;) >
<!ELEMENT itemGrp (item, %noteLinkInfo;) >
<!ELEMENT itemSet ((item | itemGrp)+) >
<!--
=================================================================================
ELEMENT ATTRIBUTES
================================================================================= -->
<!-- note: see DCS for
values of type on meta data categories and for values of lang -->
<!ATTLIST admin %impIDLangTypTgtDtyp; > <!-- meta: see DCS for values of type -->
<!ATTLIST adminGrp id ID #IMPLIED >
<!ATTLIST adminNote %impIDLangTypTgtDtyp; > <!-- meta: see DCS for values of type -->
<!ATTLIST back id ID #IMPLIED >
<!ATTLIST body id ID #IMPLIED >
<!ATTLIST bpt i CDATA #IMPLIED x CDATA #IMPLIED type CDATA #IMPLIED >
<!ATTLIST change %impIDLang; >
<!ATTLIST date id ID #IMPLIED >
<!ATTLIST descrip %impIDLangTypTgtDtyp; > <!-- meta: see DCS for values of type -->
<!ATTLIST descripGrp id ID #IMPLIED >
<!ATTLIST descripNote %impIDLangTypTgtDtyp; > <!-- meta: see DCS for values of type -->
<!ATTLIST encodingDesc id ID #IMPLIED >
<!ATTLIST ept i CDATA #IMPLIED >
<!ATTLIST fileDesc id ID #IMPLIED >
<!ATTLIST foreign id ID #IMPLIED lang CDATA #REQUIRED >
<!ATTLIST front id ID #IMPLIED >
<!ATTLIST hi type (entailedTerm | xlink) #IMPLIED
    target IDREF #IMPLIED
    lang CDATA #IMPLIED
    href CDATA #IMPLIED
    show CDATA #IMPLIED
    actuate CDATA #IMPLIED
    role CDATA #IMPLIED
    behavior CDATA #IMPLIED >
<!ATTLIST it pos (begin|end) #REQUIRED x CDATA #IMPLIED type CDATA #IMPLIED >
<!ATTLIST item %impIDType; >
<!ATTLIST itemGrp id ID #IMPLIED >
<!ATTLIST itemSet %impIDType; >
<!ATTLIST langSet id ID #IMPLIED lang CDATA #REQUIRED >
<!ATTLIST map unicode CDATA #REQUIRED
    code CDATA #REQUIRED
    ent CDATA #REQUIRED
    subst CDATA #REQUIRED >
<!ATTLIST martif type (DXLT) #REQUIRED lang CDATA #REQUIRED >
<!ATTLIST martifHeader id ID #IMPLIED >
<!ATTLIST note %impIDLang; >
<!ATTLIST ntig id ID #IMPLIED >
<!ATTLIST p id ID #IMPLIED
    type (langDeclaration|DCSName) #IMPLIED
    lang CDATA #IMPLIED >
<!ATTLIST ph assoc CDATA #IMPLIED x CDATA #IMPLIED type CDATA #IMPLIED >
<!-- ptr: not currently used in DXLT -->
<!ATTLIST publicationStmt id ID #IMPLIED >
<!ATTLIST ref %impIDLangTypTgtDtyp; > <!-- meta: see DCS for values of type -->
<!ATTLIST refObject id ID #IMPLIED >
<!ATTLIST refObjectList id ID #IMPLIED
    type CDATA #REQUIRED > <!-- meta: see DCS for values of type -->
<!ATTLIST revisionDesc %impIDLang; >
<!ATTLIST sourceDesc %impIDLang; >
<!ATTLIST term id ID #IMPLIED >
<!ATTLIST termComp %impIDLang; >
<!ATTLIST termCompGrp id ID #IMPLIED >
<!ATTLIST termCompList id ID #IMPLIED
    type CDATA #REQUIRED > <!-- meta: see DCS for values of type -->
<!ATTLIST termEntry id ID #IMPLIED >
<!ATTLIST termGrp id ID #IMPLIED >
<!ATTLIST termNote %impIDLangTypTgtDtyp; > <!-- meta: see DCS for values of type -->
<!ATTLIST termNoteGrp id ID #IMPLIED >
<!ATTLIST text id ID #IMPLIED >
<!ATTLIST tig id ID #IMPLIED >
<!ATTLIST title %impIDLang; >
<!ATTLIST titleStmt %impIDLang; >
<!ATTLIST transac type CDATA #REQUIRED lang CDATA #IMPLIED
    target IDREF #IMPLIED datatype CDATA #IMPLIED > <!-- meta: see DCS for values of type -->
<!ATTLIST transacGrp id ID #IMPLIED >
<!ATTLIST transacNote %impIDLangTypTgtDtyp; > <!-- meta: see DCS for values of type -->
<!ATTLIST ude id ID #IMPLIED
    name CDATA #REQUIRED
    base CDATA #IMPLIED >
<!ATTLIST ut x CDATA #IMPLIED >
<!ATTLIST xref %impIDType;
    target CDATA #REQUIRED > <!-- meta: see DCS for values of type -->
<!-- end -->
(See XLTdsV04.xml in the DXLT zip file: XLT DCS
[XDR] schema version 0.4)
Annex B
(normative)
Master data category specification module
NOTE: This annex presents the
master DCS file for DXLT expressed as an XML document. The informal version of
this DCS file is found in Section 7.
(See DXLTDv04.xml in the DXLT zip file: DXLT DCS version 0.4)
Annex C
(normative)
Examples
This Annex consists of
examples of how to encode DXLT data streams. Some examples are complete XML
documents; others are fragments.
Tentative example of
representing binary data:
The following example shows
how binary data might be included in a refObjectList. This example will be
expanded to include base64 encoding with MIME types. The W3C proposal on
representation of binary data in XML is not yet final.
<refObjectList type="binaryData">
  <refObject id="B432">
    <item type="encoding">hex</item>
    <item type="format">GIF</item>
    <item type="data">4F 55 6E F2 3B 43 07 ...</item>
  </refObject>
</refObjectList>
Character set issues:
The Latin1 entities and other
entities are allowed according to ISO 12200, but DXLT is more restrictive.
Only hex character references are used. This is to reduce the burden on a blind
import routine, which cannot anticipate all the mnemonic character entities
that might be used. DXLT has adopted the LISA OSCAR convention that all “blind”
data streams shall be in one of three encodings of Unicode: (a) UCS-2, (b) UTF-8,
or (c) pure 7-bit ASCII in which non-ASCII characters are encoded as eight
ASCII characters using an XML hex character reference, e.g.
"&#x3421;". Such hex character references are automatically
converted to readable characters when an XML data stream containing them is
displayed in various types of XML-aware software, such as certain web
browsers, without any need for special programming. This third type of
encoding can formally be considered to be UTF-8, although it does not use the
UTF-8 method of encoding characters whose code point is above 127.
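The third encoding option can be sketched as follows. This is an illustration only, not a normative algorithm: the function name to_ascii_hex_refs is hypothetical, the padding to four hex digits is an assumption chosen to match the eight-character form mentioned above, and the input is assumed to have had its markup-significant characters (ampersands and angle brackets) escaped already.

```python
def to_ascii_hex_refs(text):
    """Replace every non-ASCII character with an XML hex character reference."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:04X};".format(ord(ch))
        for ch in text
    )
```

The result contains only 7-bit ASCII characters, and any XML parser restores the original characters when it reads the references.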
Note about dates:
Dates as standalone tags
without a type attribute would not express a sufficiently explicit meaning. Thus
they occur embedded in
<transacGrp> elements, adding date information to the action described by
the transaction.
The following sample
demonstrates the description of term components. In addition it also shows the
possibility of using a tig in place of an ntig in simple cases:
fr : table des transitions
d'états
en : state transition table
The tig usage is as follows:
<langSet lang="fr">
  <ntig>
    <termGrp>
      <term>table des transitions d'états</term>
      <termCompList type="termElements">
        <termCompGrp>
          <termComp>table</termComp>
          <termNote type="grammaticalGender">feminine</termNote>
        </termCompGrp>
        <termCompGrp>
          <termComp>des</termComp>
          <termNote type="partOfSpeech">preposition+article</termNote>
        </termCompGrp>
        <termCompGrp>
          <termComp>transitions</termComp>
          <termNote type="grammaticalNumber">plural</termNote>
          <termNote type="grammaticalGender">feminine</termNote>
        </termCompGrp>
        <termCompGrp>
          <termComp>de</termComp>
          <termNote type="partOfSpeech">preposition</termNote>
        </termCompGrp>
        <termComp>états</termComp>
      </termCompList>
    </termGrp>
  </ntig>
</langSet>
<langSet lang="en">
  <tig><term>state transition table</term></tig>
</langSet>
The following ntig is equivalent to the English tig given above:
<langSet lang="en">
  <ntig>
    <termGrp>
      <term>state transition table</term>
    </termGrp>
  </ntig>
</langSet>
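The equivalence between a tig and an ntig follows from the content models in Annex A: a tig holds a term and its termNote elements directly, while an ntig wraps them in a termGrp. A minimal sketch of the conversion is given below; the function name tig_to_ntig is hypothetical and the sketch is offered as an illustration, not as a required procedure.

```python
import xml.etree.ElementTree as ET


def tig_to_ntig(tig):
    """Return a new <ntig> element equivalent to the given <tig> element."""
    ntig = ET.Element("ntig", dict(tig.attrib))
    term_grp = ET.SubElement(ntig, "termGrp")
    for child in tig:
        if child.tag in ("term", "termNote"):
            # The term and its notes move inside termGrp.
            term_grp.append(child)
        else:
            # Auxiliary information (descrip, admin, note, ...) stays under ntig.
            ntig.append(child)
    return ntig
```

The reverse conversion is possible only in simple cases, since a tig cannot hold a termNoteGrp or a termCompList.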
The following sample shows
how synonyms can be represented in DXLT. The following terminological data
sample indicates that there is a synonym for the German term “Abtastglied”:
fr : échantillonneur
en : sampling element
de : Abtastglied; Abtaster
The French and German
information is represented as follows:
<termEntry>
  [here put descriptive and administrative info at termEntry level]
  <langSet lang="fr">
    [here put descriptive and administrative info at langSet level]
    <ntig>
      <termGrp>
        <term>échantillonneur</term>
      </termGrp>
      [here put descriptive and administrative info at term level]
    </ntig>
  </langSet>
  <langSet lang="de">
    [here put descriptive and administrative info at langSet level]
    <ntig>
      <termGrp>
        <term>Abtastglied</term>
      </termGrp>
      [here put descriptive and administrative info at term level]
    </ntig>
    <ntig>
      <termGrp>
        <term>Abtaster</term>
        <termNote type="termType">synonym</termNote>
      </termGrp>
      [here put descriptive and administrative info at term level]
    </ntig>
  </langSet>
</termEntry>
Note: The English
langSet has been omitted for brevity, as has descriptive and administrative
information. Character references are shown converted to glyphs for
readability (&#xE9; = é).
The following samples show
two different ways of representing abbreviations in DXLT. In the
following terminological data sample the German term has an abbreviation:
fr : élément à action
proportionnelle et par intégration
en : proportional plus integral
element
de : Proportionalglied plus
Integrierglied; PI-Glied
The German langSet can be
represented in DXLT as:
<langSet lang="de">
  [here put descriptive and administrative info at langSet level]
  <ntig>
    <termGrp>
      <term>Proportionalglied plus Integrierglied</term>
    </termGrp>
    [here put descriptive and administrative info at term level]
  </ntig>
  <ntig>
    <termGrp>
      <term>PI-Glied</term>
      <termNote type="termType">abbreviation</termNote>
    </termGrp>
    [here put descriptive and administrative info at term level]
  </ntig>
</langSet>
It can also be represented
using "abbreviatedFormOf" as follows, when it is desirable to show
the relationship between the abbreviated form and the full form
explicitly:
<langSet lang="de">
  [here put descriptive and administrative info at langSet level]
  <ntig>
    <termGrp>
      <term id="n337">Proportionalglied plus Integrierglied</term>
    </termGrp>
    [here put descriptive and administrative info at term level]
  </ntig>
  <ntig>
    <termGrp>
      <term>PI-Glied</term>
      <termNote type="abbreviatedFormOf" target="n337">Proportionalglied plus Integrierglied</termNote>
    </termGrp>
    [here put descriptive and administrative info at term level]
  </ntig>
</langSet>
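A consumer of such data may wish to confirm that each target on a termNote resolves to a <term> carrying that id and that the display content agrees with the linked term. The following sketch shows one way to do so; the function name check_term_links and the SAMPLE constant are hypothetical, and the bracketed placeholders of the example above are left out so that the sample parses.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the German langSet with an explicit link (assumption:
# placeholders removed, term id written in lower case as in the core DTD).
SAMPLE = """<langSet lang="de">
  <ntig><termGrp>
    <term id="n337">Proportionalglied plus Integrierglied</term>
  </termGrp></ntig>
  <ntig><termGrp>
    <term>PI-Glied</term>
    <termNote type="abbreviatedFormOf" target="n337">Proportionalglied plus Integrierglied</termNote>
  </termGrp></ntig>
</langSet>"""


def check_term_links(root):
    """List (target, display text, linked term text or None) for linked termNotes."""
    terms_by_id = {t.get("id"): t for t in root.iter("term") if t.get("id")}
    results = []
    for note in root.iter("termNote"):
        target = note.get("target")
        if target is None:
            continue  # informal link: only the display text is available
        linked = terms_by_id.get(target)
        results.append(
            (target, note.text, linked.text if linked is not None else None)
        )
    return results
```

A third item of None signals a broken link, in which case the display content of the termNote is the only remaining indication of the term pointed to.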
<term>
transactions
<transacGrp>
  <transac type="modification">marketing department requested change from gizmo to thinger</transac>
  <date>1999-11-12</date>
  <transacNote type="responsiblePerson">John Harris</transacNote>
</transacGrp>
Note: more examples of
encoding transactions are given in the following subsection.
The
following legend will be helpful in interpreting the encoding guidelines for
the tables in section 7.2:
- <xxx>: an XML element in the core structure
- bib-descrip:
description in natural language of a bibliographic reference; no formal
structure is required but it should be a complete reference, not just a code
for a reference somewhere else.
- bibl-idref:
reference to the ID of a bibliographic reference in Other Resources
- bindat-descrip:
a description in natural language of an Other Resource item that is binary data
(e.g. a description of a figure)
- bindat-idref:
a reference to the ID of a binary data element in Other Resources
- classSysDescrip-idref:
a reference to the ID of the description of a classification system in Other
Resources (e.g. for the Lenoch system, how to obtain a copy of the full set of
codes; what version is being used, etc)
- concept-display:
a description in natural language of the concept entry being linked to (e.g. to
be displayed as a hypertext link to be followed if desired)
- conceptSysDescrip-idref:
similar to above, but for a concept system
- content:
element content restricted according to the TextType column in the table (e.g.
basicText)
- datcat-name:
the name of the data category as found in column two of the table
- descriptor:
a particular thesaurus descriptor appropriate to the current concept entry
- display-text:
description in natural language to display in conjunction with a hypertext link
- element-idref:
reference to the ID of any element
- iso-date:
date in ISO format, using the options specified in the XML schemas proposal
- link: a
formal link using a target attribute (an IDREF) that points to the ID of another
element; or an informal link accomplished by specifying the term to link to
(which may be ambiguous but is sometimes all that is available in a real-life
termbase)
- page-etc:
the page number or other identifier of a place in a document
- part-of-speech:
a user-chosen name for a part of speech
- personOrg-descrip:
description in natural language of the person or organization responsible
- personOrg-idref:
a reference to the ID of a person or organization element in Other Resources
- picklist-value:
any one of the items in the picklist associated with that data category
- position-code:
a number or other code that identifies the position of a concept in a
classification system, concept system, or thesaurus (e.g. a Lenoch code)
- pronunciation:
a guide to the pronunciation of a term
- subject-field:
the name of a subject field (chosen by the user)
- term-component:
a component of a term, e.g. one of its words (termElement)
- term-idref:
a reference (link) to the ID of a <term> element
- term-structure: the internal structure of a term (used only when a
tree structure is needed), e.g. [[Empire State][Building]] (New York is known as the
Empire State) as opposed to [[Empire] [State Building]] (termElement can be used in
conjunction with termStructure to show the part of speech of each element)
- term: a term
(usually a term being pointed to)
- thesaurusDescrip-idref:
similar to classSysDescrip-idref, but for a thesaurus
- transac-descrip:
description in natural language of the transaction (e.g. why it was done)
- transfer-comment:
an explanation in natural language of some lack of perfect equivalence between
two terms (typically in two languages in the same terminological concept entry)
- URI: URL etc
(see W3C spec for URIs)
In general: any item in italics in one of the patterns
below is a variable that is a placeholder for a variety of specific possible
values. Consult the legend for the meaning of a variable.
XML encoding guidelines for Table 7.2a (Types of terms, 12620:
A.2.1):
A.02.01:
termType without link
<termNote type='termType'>picklist-value</termNote>
A.02.01.05:
commonNameFor;
A.02.01.08: abbreviatedFormFor
(types of term with link
to another <term>)
<termNote type='datcat-name'
target='term-idref'>term</termNote>
note: The term
that is the content of the element is redundant with the value of the
<term> element being linked to; it is there only as a display value for
the convenience of the user who has not yet followed the link, or in case the
link gets broken, or in case there is no link at all. The target attribute is
strongly suggested but not required; the display content (the term being
pointed to) is not optional. However, this is not to be interpreted as a
license to violate term autonomy by placing an abbreviated form, for example,
as the content of an abbreviatedFormFor element and not giving it its own tig
or ntig.
XML encoding guidelines for Table 7.2b (Grammar, Usage, and Origin, 12620: A.2.2 - A.2.4):
A.02.02.01:
partOfSpeech
<termNote type='partOfSpeech'>part-of-speech</termNote>
note: In a user-group subset DCS, the content could
become a picklist value.
A.02.02.02
- A.02.04.02: grammar, usage, and origin data categories (rest
of the table)
<termNote type='datcat-name'>content</termNote>
note: this very abstract pattern works as follows: the
value of the type attribute is one of the data category (datcat) names in table
2 in the range from A.02.02.02 through A.02.04.02; the content of the element
is a picklist value or free text (plain, basic, or note text), depending on
what is in the TextType column for that row. For example, the datcat on row
A.02.02.02 (grammatical gender) has picklist in the TextType column of Table 2,
so for this grammatical gender, the general pattern becomes:
<termNote
type='grammaticalGender'>picklist-value</termNote>
XML encoding guidelines for Table 7.2c (Term components, 12620: A.2.5 - A.2.8):
A.02.05: pronunciation
<termNote
type='pronunciation'>pronunciation</termNote>
note:
For the pronunciation to be considered blind, it must be written in IPA.
A.02.06: syllabification
A.02.07: hyphenation
A.02.08.01: morphologicalElement
A.02.08.02: termElement
(ways
of breaking down a term into components)
<termCompList
type='datcat-name'>
<termComp>term-component</termComp>
<termCompGrp>
<termComp>term-component</termComp>
<termNote type='datcat-name'>content</termNote>
(e.g. part of speech of this component)
</termCompGrp>
</termCompList>
A.02.08.03: termStructure
<termNote
type='termStructure'>term-structure</termNote>
The structure of the components of a term is indicated using square brackets that are formally equivalent to a tree, e.g.:
"[bank statement] [total]" vs. "[bank] [statement total]"
XML encoding guidelines for Table 7.2d (Term status, 12620: A.2.9):
A.02.09.01 - A.02.09.04 term status data categories
(the whole table)
<termNote
type='datcat-name'>picklist-value</termNote>
XML encoding guidelines for
Table 7.2e (Equivalence, 12620: A.3):
A.03.02: falseFriend
<termNote type='falseFriend' target='term-idref'>term</termNote>
(e.g. in a French <ntig> for librairie, a false friend would be 'library':
<termNote type='falseFriend' lang='en' target='T567-e3'>library</termNote>
where <term id='T567-e3'>library</term>)
A.03.04: reliabilityCode
<termNote type='reliabilityCode'>picklist-value</termNote>
(where source reliability codes are mapped to the 'blind' range 1-10)
A.03.05: transferComment
<termNote
type='transferComment' target='term-idref'>transfer-comment</termNote>
XML encoding guidelines for
Table 7.2f (Classification system,
12620: A.4):
A.04 subjectField
<descrip
type='subjectField'>subject-field</descrip>
note:
In a user-group subset, the subject field can be restricted to a picklist
value.
A.04.02 classificationCode
<descrip
type='classificationCode'
target='classSysDescrip-idref'>position-code</descrip>
note:
In 12200, this is two elements, a descrip and a ref or ptr to the
description.
XML encoding guidelines for
Table 7.2g (Concept-related
descriptions, 12620: A.5):
(A.05.01-A.05.08)
<descrip
type='datcat-name'>content</descrip>
binary
data (A.05.05.01-A.05.05.05):
<descrip
type='datcat-name' target='bindat-idref'>bindat-descrip</descrip>
XML encoding guidelines for
Table 7.2h (Concept relations, 12620:
A.6 and A.7 combined):
A.07.02: conceptPosition
<descrip
type='conceptPosition' target='conceptSysDescrip-idref'>position-code</descrip>
rest of the table (concept
relations between two entries), except antonym
<descrip type='datcat-name' target='entry-idref'>concept-display</descrip>
A.10.18.06: antonym
<descrip type='antonym' target='term-idref'>term</descrip>
note:
Antonym is a little different from the other datcats in Table 8, since it
points to a term, not a concept entry. Antonym is hard to classify; it could
even be thought of as similar to false-friend, which is a term note. It is
currently in 12620 under administrative elements, but most outsiders view that
classification as a mistake.
XML encoding guidelines for
Table 7.2i (Specialized notes, 12620:
A.8):
A.08.01-A.08.03:
<descripNote type='datcat-name'>picklist-value</descripNote>
XML encoding guidelines for
Table 7.2j (Documentary language,
12620: A.9):
A.09.02 thesaurusDescriptor
<descrip type='thesaurusDescriptor'
target='thesaurusDescrip-idref'>descriptor</descrip>
A.09.04 keyword
<descrip
type='keyword'>plaintext</descrip>
note:
a keyword is a word or group of words that is to be placed in some kind of
index and used for retrieval; this keyword is not a descriptor but actually
appears in the text of some element of the <termEntry> in which the keyword
element is placed. It has been changed from an admin to a descrip so that its
level can be controlled. If it turns
out that it applies only to terms, then it may become a termNote. A keyword
need not be in the inflected form found in text; instead, it may be lemmatized.
This is one reason to use a keyword even when a field is indexed on all words,
since the user may not know to search on the inflected form.
A.09.05 indexHeading
<descrip
type='indexHeading'>plaintext</descrip>
note:
an indexHeading is like a keyword, except that it does not appear anywhere in
the <termEntry> yet is still useful for retrieval.
XML encoding guidelines for
Table 7.2k (Transactions, 12620: A.10.1 - A.10.2):
<transacGrp>
<transac
type='datcat-name'>transac-descrip</transac>
<transacNote type='responsibility' target='personOrg-idref'>personOrg-descrip</transacNote>
<date>iso-date</date>
</transacGrp>
If
neither responsibility nor date is associated with the transaction, then a
transac can be used without being in a transacGrp.
Note that the creation date and the author are considered in XLT to be the date and responsibility parts of a transaction; thus the representation of a modification is a modification transaction containing a responsiblePerson [A.10.02.02.12] in a transacNote and a date [A.10.02.01]:
<transacGrp>
<transac type="modification"> comment on the transaction, e.g., reason for doing it </transac>
<transacNote type="responsiblePerson" target="per-idref"> person-descrip </transacNote>
<date> 1998-01-28 </date>
</transacGrp>
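An export routine could assemble such a transaction group programmatically. The sketch below is illustrative only: the function name make_transac_grp and its parameters are hypothetical, and the date is written in the ISO form YYYY-MM-DD used in the examples of this annex.

```python
import xml.etree.ElementTree as ET
from datetime import date


def make_transac_grp(transac_type, comment, person, when, person_id=None):
    """Build a <transacGrp> holding a transaction, its responsible person, and its date."""
    grp = ET.Element("transacGrp")
    transac = ET.SubElement(grp, "transac", {"type": transac_type})
    transac.text = comment
    note = ET.SubElement(grp, "transacNote", {"type": "responsiblePerson"})
    if person_id is not None:
        # Optional formal link to a person element in Other Resources.
        note.set("target", person_id)
    note.text = person
    ET.SubElement(grp, "date").text = when.isoformat()
    return grp
```

For example, make_transac_grp("modification", "requested change", "John Harris", date(1999, 11, 12)) yields a group with the same shape as the modification example above.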
XML encoding guidelines
for Table 7.2l (Subsets, 12620: A.10.3):
A.10.03.01-A.10.03.08
<admin type='datcat-name'>content</admin>
A.10.03.10 security subset
<admin type='securitySubset'>picklist-value</admin>
XML
encoding guidelines for Table 7.2m (Other administrative
information, 12620: A.10.4 - A.10.21, except antonym):
A.10.18 crossReference
<ref type='crossReference'
target='element-idref'>display-text</ref>
note: element-idref must be inside the current
XML document
A.10.18 xCrossReference
<xref type='xCrossReference'
target='URI'>display-text</xref>
A.10.18.05 homograph
<termNote type='homograph'
target='term-idref'>term-description</termNote>
- antonym: treated with concept relations
- source and sourceIdentifier:
-
source is self-contained in element and self-explanatory
<admin
type='source'>bib-descrip</admin>
- source
identifier:
<admin type='sourceIdentifier'
target='bibl-idref'>page-etc</admin>
Annex D
(informative)
Design and application of XLT
The following design principles apply to all members
of the XLT family:
1)
It is assumed that the reader already accepts the need to share
terminological data.
2)
It is also assumed that the reader accepts the need for both formats
that are relatively blind (for cases where unknown parties will be processing
files in those formats) and formats that are relatively less blind (for cases
where parties negotiate details of a format in advance in order to reduce
information loss when passing through those formats). Thus, in the remainder of these principles, "blind"
should be read as "tending toward the blind side of the blind-negotiated
spectrum".
3)
It is further assumed that the reader acknowledges the wisdom of using
Unicode in a blind format to avoid the problem of having to allow for many
different code pages when importing data. Of course, this imposes a burden on
the party placing data in the intermediate format, but this burden is offset by
the increased predictability of the format, which permits the design of one
routine to process data from any source so long as it conforms to the blind
format specification.
4)
In addition to the above assumptions, it is hoped that the reader
accepts the following design principles for a blind interchange format (or is
willing to discuss any objections):
a)
A viable blind interchange proposal shall be formulated in the context
of an overall framework for representing, designing, and sharing terminological
data, namely, the ISO/TC37 meta-model and associated projects, whether these three tasks are addressed in separate
standards or a multipart standard.
b)
A viable blind interchange format should be an XML application, not
just an SGML application, since XML is gaining momentum very rapidly worldwide
and is specifically designed for data representation and sharing. The next
generation of web browsers will even allow the viewing of any XML document,
even if its DTD has not been seen before, but not arbitrary SGML
documents. XML is substantially easier
to process than full SGML. This will be a particularly important point for a
translation technology vendor deciding whether or not to implement import and
export routines from and to the format.
c)
A viable blind interchange
format should be customizable, since it is unlikely that any one particular
interchange format could be acceptable to all user groups in all its details.
That is, a single, monolithic format without the possibility of creating
user-group subsets is out of the question. The ISO principle of providing
methods of solving classes of problems, rather than overly specific problems,
applies here. A viable blind interchange format is easy to grasp by a software
developer, which means it should be modular, just as a software application
should be divided into manageable modules. A blind interchange format will
probably not be simple, but if it is broken into modules, each of which can be
easily grasped, then the complexity of the format as a whole becomes more
manageable.
d)
A viable blind interchange
format is suitable for sharing information over the Internet, using such
protocols as http (browser), ftp (file transfer), and e-mail.
e)
In the XML world, XML may be embedded in HTML, but HTML is not usually
embedded in XML. If it is essential to embed a piece of HTML in an XML
document, it can be treated as a piece of data rather than as XML markup, with
the pointed brackets and ampersands of the HTML represented as entities. The
HTML data can then be embedded inside meta-markup tags. The addition of
meta-markup tags allows the embedding of various kinds of markup inside such
elements as definitions and contextual examples. If the TMF project includes a
module that embeds XHTML inside a TML, that module should be optional, just as
the meta-markup tags should be optional.
f)
It is important to maintain the distinction between descriptive and
presentational markup. A blind interchange document should consist primarily of
descriptive markup. If the presentational look of a termbase needs to be
preserved, then pure HTML could be used, but this approach is incompatible with
the tasks of exporting content from one termbase and importing content into a
termbase. If a blind interchange format is to be viewed other than as raw XML,
it can be automatically converted to a visually appealing presentation in HTML.
g)
Links in a blind interchange format that point to another concept entry
or to a piece of Complementary Information should be unambiguous. Merging data
from sources need not result in ambiguity if the ids are preceded by distinct
prefixes, in the spirit of the namespace specification for XML that allows for
co-existence of elements from multiple sources without ambiguity through the
use of GI prefixes.
h)
The use of DTDs to declare constraints on the use of markup is being
replaced by the use of XML itself to express constraints, i.e. by schemas
written in XML rather than the "bang" statements of DTDs, such as !ELEMENT (read
"bang" element) and !ATTLIST. The XML activity page
(www.w3.org/XML/Activity) says "While XML 1.0 supplies a mechanism, the DTD, for declaring
constraints on the use of markup, automated processing of XML documents
requires more rigorous and comprehensive facilities in this area."
Therefore, we should not make decisions on the design of a blind interchange
format based on the unfortunate limitations of DTD bang statements, such as the
limitation that there is no formalism for constraining the content of an
element to the values in a picklist. Instead of converting an element to an
attribute, we could use a constraint system expressed in XML. When the W3C publishes
its XML-Schema proposal, not to be confused with the RDF-Schema proposal, we
should look at using it or else something in the spirit of it. Regardless of
which particular type of schema is eventually used for a terminology blind
sharing format, it is clear that the trend is toward using new XML schemas
rather than relying exclusively on traditional SGML bang statements.
i)
The task of representing, that is, making explicit, the structure of an
existing termbase, need not involve the same degree of normalization that is
needed for blind interchange, yet it would be highly desirable for the
representation task to be done within the same framework as the sharing task.
This means, for example, that it would be desirable to define a core structure
for a family of formats. This core can then be successively constrained for
design, by adding a list of data categories from ISO 12620, and for sharing, by
defining user-group subsets of the available data categories and additional
constraints on content and point of attachment of elements as needed for blind
interchange.
j)
Term autonomy (i.e. placing each term
on an equal basis in a language section, with the possibility of additional
information on each term) as opposed to simple synonyms dependent on a
particular term, should be required in blind interchange but not in the representation task.
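Point g) above can be illustrated with a minimal sketch of merging two sources while keeping ids and links unambiguous. It assumes that ids are carried in an id attribute (as in the sample entries in this document) and that links are carried in a target attribute on a ref element; the target attribute, the ref element, the prefixes and the function names are assumptions made for illustration only and are not defined by this draft.

```python
import xml.etree.ElementTree as ET


def prefix_ids(root, prefix, id_attr="id", link_attr="target"):
    """Prepend `prefix` to every id and to every link that points at a local id."""
    # Collect the ids first so that only links to ids in this file are rewritten.
    local_ids = {el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None}
    for el in root.iter():
        if el.get(link_attr) in local_ids:
            el.set(link_attr, prefix + el.get(link_attr))
        if el.get(id_attr) is not None:
            el.set(id_attr, prefix + el.get(id_attr))
    return root


def merge_bodies(sources):
    """Merge (prefix, xml_text) pairs into one body element without id clashes."""
    merged = ET.Element("body")
    for prefix, xml_text in sources:
        root = prefix_ids(ET.fromstring(xml_text), prefix)
        merged.extend(list(root))
    return merged


# Two hypothetical sources that both use the id "ID67".
SOURCE_A = ("<body><termEntry id='ID67'><ref target='ID68'/></termEntry>"
            "<termEntry id='ID68'/></body>")
SOURCE_B = "<body><termEntry id='ID67'/></body>"

merged = merge_bodies([("a-", SOURCE_A), ("b-", SOURCE_B)])
ids = [entry.get("id") for entry in merged.iter("termEntry")]
print(ids)
```

Because each source receives its own prefix, the two entries that were both called ID67 remain distinct after the merge, and the link inside the first source still resolves to an entry from that same source.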
The following list indicates various potential applications of the DXLT format:
A. blind interchange, including:
1. on-going data flow from one translation technology module to another with a different function in a complete business solution to the document production chain, which involves communication between authors, and translation requesters and suppliers,
2. integration of terminological data from multiple sources, and
3. data conversion necessitated by a shift from one application to another for the same function;
B. dissemination, including:
1. querying multiple termbases through a single user interface by passing data through a common intermediate format on a batch or dynamic basis,
2. placing data on an FTP site for download by interested parties, and
3. asking for suggestions from interested colleagues by making available entries in which some terms are tentative and asking for feedback;
C. analysis, including:
1. comparing the contents of various termbases [along the lines of the Interval project];
2. studying how lossless a roundtrip conversion with a given termbase can be; and
3. designing a new termbase intended to minimize loss during conversion.
A member of the XLT family is defined by the logical
combination of three components, which can be summarized as structure, content, and style.
The structure is
a constrained version of the core structure found in ISO 12200 (as to be
amended) in the form of an XML DTD. The amendment is not final but was approved
as a New Work Item (ISO 37.3N318) and the changes it proposed from the October
1999 version of ISO 12200 are relatively minor. This revision provides the
basis for a work item that has been called Martif
with Specified Constraints or MSC. Significantly,
the proposed amendment to 12200 consists of optional extensions to the DTD, so that
valid pre-amendment Martif files remain valid according to the revised DTD. In
August 2000 ISO TC 37/WG 4 resolved to place MSC on hold for the time being in
order to develop a more pressing standard, ISO 16642, which will define a
high-level meta-model designed to provide a uniform environment in which MSC,
Geneter, and other possible interchange formats can function with a commitment
to universal interoperability. At the same time (August 2000), the latest draft
of the MSC specification (and the first draft to propose XLT) was withdrawn
from ISO and placed in the SALT project. The XLT core structure will be mapped
to the ISO TC/37 meta-model abstract structure when the meta-model TMF
(Terminology Markup Framework) project is further along. The XLT core structure
can also be expressed as an XML schema.
The content of a
member of the XLT family consists of the set of the data categories (from ISO
12620 wherever possible) needed for a particular user group and various
constraints on those data categories and their appearance in the core
structure. For a particular XLT family member, the various constraints that
define it are gathered together into a module called a Data Constraint
Specification (DCS) file. All DCS files have the same layout, which is defined
by the XLT-DCS XML schema and will be equivalent to a subset of the RDF
constraint format being developed in the ISO TC/37 Meta-model TMF project. One
DCS file is privileged by being designated the Default XLT-DCS file. It
functions as the “master file” containing the full list of data categories
allowed in standard XLT. Various user groups are encouraged to define their
formats as subsets of DXLT if possible. Note that
DXLT, in the SALT environment, takes the place of the MSC format formerly under
development in ISO TC/37.
In the XLT environment, style indicates the form of markup used in a format, particularly
the approach to tag names and structures. The primary style of XML
expression for XLT is the style of the XLT core structure, whereby broad tag
names, sometimes called meta data categories, such as descrip, are specialized by the value of the type attribute, which is actually a data category name taken from
the DCS file, e.g., <descrip type="definition">. The formal
definition of an XLT family member, e.g., the features that distinguish it from
other formats, thus depends simply on its DCS file. That is, each distinct DCS
file defines a particular member of the XLT family. In XLT, alternative styles
will only be used for internal processing.
Multilingual
terminological resources have been developed in machine-readable form since at
least the 1960s. Being an integral part of the language industries and the
information economy, these resources are processed using various kinds of
database management systems and represented using various data models. In
addition, these resources (and other types of resources, such as source and
target texts and translation memory data) often need to be shared in various
contexts, including interchange between computer systems and dissemination to
human users. This sharing is usually accomplished using intermediate formats.
Effective use and re-use of a variety of multilingual terminological resources
is facilitated by a single high-level data model that supports analysis and
design of both databases and representation formats for analysis and sharing.
Project 16642/CD of ISO
Technical Committee 37 (consisting of a Terminology Markup Framework called
TMF) provides such a model. The abstract data model in ISO 16642 is called a
meta-model since it is a model of models, that is, a high-level model of what
more specific data models have in common, thus providing the basis for a framework
within which to define various compatible representations. The meta-model fits into an integrated
approach to be used in analyzing existing terminological data collections and
in designing (a) new databases (which are typically processed using a
relational or text-based data management system) and (b) structured documents
in a representation format (which are typically defined using XML markup). An
integrated approach eases the task of importing information from a structured
document into a database and the task of exporting information from a database
to a structured document. Yet another motivation for an integrated approach, as
opposed to entirely separate approaches for databases and structured documents,
is that XML-based formats are now being processed in new ways that infringe on
the traditional role of database management systems. For example, XML documents
are being queried and updated directly.
This integrated approach
to analysis and design consists of three levels of data modeling: the meta-model
level (level 1), the data-model requirements level (level 2), and the actual
format (or database) level (level 3).
Level 1:
The first (and most abstract) level of the integrated approach is the
meta-model level. This level consists of three components: (a) the structural
component, (b) the content component, and (c) the links between (a) and (b). The structural component is the meta-model
from ISO 16642 and a diagram of it is found in Figure D.3.1. The content component is a meta-data
registry based on ISO 12620, which is an inventory of data categories used in
terminological data collections. These data categories are sometimes called
"data element types" from a general information technology
perspective. The links between the structural component and the metadata
registry are (1) that the "term" object class in the structural
component is the central data category called term in the metadata registry and (2) that all other data
categories in the meta-data registry fall under "terminology-related information"
in the structural meta-model. Thus, each specific type of terminology-related
information is a data category in the meta-data registry.
Figure D.3.1 – The Structural Component of the
Meta-model (part of level 1)
Note: The Associated
Information box has been renamed "Terminology-related Information"
and the "Other Resources" box has been renamed "Complementary
Information". Also, some types of Terminology-related information can also
apply to term components.
The meta-model level
supports analysis and design at a very general level. The meta-model says that
a terminological data collection consists of (a) global information about the
collection (such as owner, title, and origin), (b) a number of entries (each
entry performing three functions: describing a concept, listing the terms in
each treated language that designate that concept, and describing the terms
themselves), and (c) “Complementary Information” that is outside the entries.
Each entry can have multiple language sections, and each language section can
have multiple term sections. Each term section has exactly one term. Each level
(entry, language, and term) can have associated with it various kinds of
terminology-related information. The various types of Complementary Information
are not part of any one entry but can be linked to from any element
within one or more entries, even though those links are not shown explicitly in
the diagram. Such Complementary Information includes bibliographic references,
descriptions of ontologies, and binary data such as images that illustrate
concepts.
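The containment relations just described can be pictured as a small data structure. The following is only a sketch: the class and field names are illustrative choices and are not taken from ISO 16642 or from the XLT core structure, and terminology-related information is reduced to simple name/value pairs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TermSection:
    term: str  # each term section has exactly one term
    info: Dict[str, str] = field(default_factory=dict)  # terminology-related information


@dataclass
class LanguageSection:
    lang: str
    term_sections: List[TermSection] = field(default_factory=list)
    info: Dict[str, str] = field(default_factory=dict)


@dataclass
class Entry:
    entry_id: str
    language_sections: List[LanguageSection] = field(default_factory=list)
    info: Dict[str, str] = field(default_factory=dict)


@dataclass
class Collection:
    global_info: Dict[str, str] = field(default_factory=dict)
    entries: List[Entry] = field(default_factory=list)
    # Complementary Information lives outside the entries and is linked to by id.
    complementary_info: Dict[str, str] = field(default_factory=dict)


def terms_by_language(entry):
    """List the terms of an entry, grouped by language section."""
    return {ls.lang: [ts.term for ts in ls.term_sections]
            for ls in entry.language_sections}


entry = Entry(
    entry_id="ID67",
    info={"subjectField": "manufacturing"},
    language_sections=[
        LanguageSection("en", [TermSection("alpha smoothing factor",
                                           {"termType": "fullForm"})]),
        LanguageSection("hu", [TermSection("Alfa simítási tényező")]),
    ],
)
collection = Collection(global_info={"source": "from an Oracle corporation termBase"},
                        entries=[entry])
print(terms_by_language(entry))
```

The example instance reuses the values of the sample entry given in Annex E; every level (entry, language section, term section) carries its own dictionary of terminology-related information, mirroring the statement that such information can be attached at each level.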
Level 2:
The second level is the data-model requirements level. At this level, the user
of the integrated approach (who may be an analyst or designer) must make various
choices, based on real-life needs. First, there is the question of the modality
of the representation of terminological data: database or structured document.
For the analyst, there is no real choice. Instead, it is a matter of
identifying the nature of the terminological data collection to be analyzed.
For the designer, there is a fundamental choice to be made. Will the terminological data be used
primarily for queries and updates and be represented in some database
management system? If so, which system?
Or will the data be used primarily for sharing and be represented in a
structured document with markup?
Once the choice between
database and XML modality has been made, a data model must be chosen that is
slightly more concrete than the meta-model at level 1. The data model consists
of a structural component and a data constraint component compatible with and
parallel to the meta-model and meta-data registry at level 1. For a relational
database, a typical method of describing a data model is an entity-relationship
diagram. Analysis and design of relational databases is a complex topic treated
elsewhere. In the SALT project, the default relational-database approach is
RelTerm(tm). The ISO TC/37 TMF project deals with XML data models and
representation formats. For an XML format, a typical method of describing a
data model is an XML document type definition (DTD). An alternative method,
using what is called an "XML schema", is provided by the World Wide
Web Consortium (W3C). TMF is developing an abstract structure and a constraint
component at level 2. The abstract XML structure of TMF can function as an
intermediate representation between various other representations, such as MSC
(and other members of the XLT family) and Geneter. The core structure and
DCS modules of XLT are expected to be compatible with the TMF project as it
evolves, with the understanding that the XLT core structure may include an NLP
component even if the TMF project does not.
The level 2 structural
component of XLT is provided by the core structure in ISO 12200 (as amended by
the project initiated by New Work Item SC3/N318), which includes an XML DTD
that is compatible with the structural component of the meta-model. There is an
optional module in the core structure that supports Natural Language Processing
elements. By including in the level 2 data constraint component the possibility
of de-activating selected modules in the core structure, the same core
structure module can provide the basis for all members of the XLT family of formats.
It is anticipated that the XLT core structure will be generated from the TMF
abstract structure through an automatic procedure.
At level 2, one
conceptual data model is distinguished from another by the content of its data
constraint specification. Each specification is guided by the real-world needs
of some user group and consists of a list of data categories from the metadata
registry (i.e., ISO 12620 and additional data categories that may become part
of 12620), constraints on each of those data categories, and various other
constraints on the core structure.
Constraints on data
categories include restrictions on the values each data category can take
(ranging from "text with markup" for contextual examples to a
“picklist” for grammatical gender). Constraints on descriptive data categories also include
restrictions on where a particular data category can appear in an entry,
selected from the options provided by the core structure (including entry
level, language-section level, and term level). Other possible constraints on
the core structure include whether natural language processing elements will be
allowed, whether meta-markup tags will be allowed, which languages will be
allowed, and which types of Complementary Information will be allowed.
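As a rough illustration of how such constraints might be checked against an instance, the sketch below walks one termEntry and reports data categories that are unknown, that appear at a level where they are not allowed, or whose value is not in the picklist. The dictionary layout, the level names and the picklist values are assumptions made for this example; they are not the XLT-DCS layout of Annex A, and the constraints shown are not those of the Default XLT DCS file.

```python
import xml.etree.ElementTree as ET

# Structural elements of the sample entries, mapped to illustrative level names.
LEVEL_OF = {"termEntry": "entry", "langSet": "language", "tig": "term"}

# Hypothetical constraints: allowed levels, and a picklist (or None for free text).
DCS = {
    "subjectField": {"levels": {"entry"}, "values": None},
    "definition": {"levels": {"entry", "language", "term"}, "values": None},
    "termType": {"levels": {"term"}, "values": {"fullForm", "acronym"}},
}


def violations(element, dcs, level=None):
    """Return a list of constraint violations found in `element` and its children."""
    level = LEVEL_OF.get(element.tag, level)
    problems = []
    category = element.get("type")
    if category is not None and element.tag not in LEVEL_OF:
        rule = dcs.get(category)
        value = (element.text or "").strip()
        if rule is None:
            problems.append(f"{category}: not in the DCS")
        else:
            if level not in rule["levels"]:
                problems.append(f"{category}: not allowed at {level} level")
            if rule["values"] is not None and value not in rule["values"]:
                problems.append(f"{category}: value '{value}' not in picklist")
    for child in element:
        problems.extend(violations(child, dcs, level))
    return problems


GOOD = ET.fromstring(
    "<termEntry id='ID67'><descrip type='subjectField'>manufacturing</descrip>"
    "<langSet lang='en'><tig><term>alpha smoothing factor</term>"
    "<termNote type='termType'>fullForm</termNote></tig></langSet></termEntry>")
BAD = ET.fromstring(
    "<termEntry id='ID68'><termNote type='termType'>slang</termNote>"
    "<descrip type='usageRegion'>x</descrip></termEntry>")

good_problems = violations(GOOD, DCS)
bad_problems = violations(BAD, DCS)
print(good_problems, bad_problems)
```

The check starts at a termEntry so that unrelated uses of the type attribute elsewhere in a file (for example on the root element) are not mistaken for data categories.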
The XLT data constraint
specifications in this document are XML documents that conform to an XDR-style
schema found in Annex A. This schema is compatible with Internet Explorer 5. It
is anticipated that this schema will be fully compatible with the RDF-based constraint
specification format being developed in the TMF project.
An additional kind of
choice that is expressed in a data-category specification is level of
granularity. At the meta-model level, the data category inventory is given as a
hierarchy. For example, under term-related descriptive information, the
coarsest level of granularity is "term type". Under term type, we
find "full form" and "abbreviated form of term". Under
"abbreviated form of term", we find finer distinctions, such as
"acronym", "initialism", and "clipped term". In creating each data-category specification
at level 2, the user must choose which degree of granularity in the ISO 12620
data categories will be basic. The basic degree provides the name, while finer
degrees become values and coarser degrees become headings. For example, a user could choose a coarser
granularity as follows (where the code after each item is the position code in
the ISO 12620 hierarchy):
heading: term-related descriptive information (A.2)
name: term type (A.2.1)
value: abbreviated form of term (A.2.1.8)
or a finer granularity as follows:
heading: term type (A.2.1)
name: abbreviated form of term (A.2.1.8)
value: acronym (A.2.1.8.4)
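The two choices above follow mechanically from the position codes: the chosen basic degree supplies the name, its parent supplies the heading, and any finer degree beneath it can serve as a value. The sketch below shows this under the assumption that the hierarchy is available as a mapping from position code to data category name; only the four codes cited in the example are included, and the function names are illustrative.

```python
# Fragment of the ISO 12620 hierarchy, limited to the codes cited in the text.
HIERARCHY = {
    "A.2": "term-related descriptive information",
    "A.2.1": "term type",
    "A.2.1.8": "abbreviated form of term",
    "A.2.1.8.4": "acronym",
}


def parent(code):
    """Return the position code one step coarser, e.g. 'A.2.1.8' -> 'A.2.1'."""
    return code.rsplit(".", 1)[0]


def roles(basic_code, value_code):
    """Derive heading, name and value once the basic degree has been chosen."""
    if not value_code.startswith(basic_code + "."):
        raise ValueError("a value must be a finer degree of the basic data category")
    return {
        "heading": HIERARCHY[parent(basic_code)],
        "name": HIERARCHY[basic_code],
        "value": HIERARCHY[value_code],
    }


coarser = roles("A.2.1", "A.2.1.8")
finer = roles("A.2.1.8", "A.2.1.8.4")
print(coarser)
print(finer)
```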
Another aspect of the
data-model requirements level is subsetting of data constraint specifications.
Each “master” data constraint specification can have many valid subsets, each
chosen by a particular user group in order to exercise more control over the
choice of data categories and their values.
Each user-group subset must be a strict subset of the master data
constraint specification from which it was derived. That means that each data
category name in the subset must appear in the master, that the structural
level restrictions must be the same or tighter, and that the restrictions on
values must be the same or tighter. For
example, a data category in the master specification that is allowed to have
any plain text value might be restricted to a picklist in a user-group subset.
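The three conditions for a valid user-group subset can be expressed as a short check. This is a sketch only: each specification is modeled as a plain dictionary with a set of allowed levels and either a picklist or None (meaning any plain text), which is an assumed simplification and not the XLT-DCS layout; the category constraints shown are invented for the example.

```python
def is_valid_subset(subset, master):
    """Check that `subset` only tightens, and never loosens, the `master` specification."""
    for name, sub in subset.items():
        mas = master.get(name)
        if mas is None:
            return False  # every data category name must appear in the master
        if not sub["levels"] <= mas["levels"]:
            return False  # structural level restrictions must be the same or tighter
        if mas["values"] is not None:
            # The master has a picklist, so the subset needs one no larger than it.
            if sub["values"] is None or not sub["values"] <= mas["values"]:
                return False
    return True


# Hypothetical master specification (not the Default XLT DCS file).
MASTER = {
    "subjectField": {"levels": {"entry"}, "values": None},
    "definition": {"levels": {"entry", "language", "term"}, "values": None},
    "termType": {"levels": {"term"}, "values": {"fullForm", "abbreviatedForm", "acronym"}},
}

# Plain text in the master is restricted to a picklist; definition is kept at entry level.
TIGHTER = {
    "subjectField": {"levels": {"entry"}, "values": {"manufacturing", "finance"}},
    "definition": {"levels": {"entry"}, "values": None},
}

# Invalid: loosens the termType picklist to any plain text.
LOOSER = {"termType": {"levels": {"term"}, "values": None}}

# Invalid: introduces a data category that the master does not contain.
EXTRA = {"usageRegion": {"levels": {"term"}, "values": None}}

print(is_valid_subset(TIGHTER, MASTER), is_valid_subset(LOOSER, MASTER),
      is_valid_subset(EXTRA, MASTER))
```

A subset that omits a data category altogether is still valid, since dropping a category is the tightest restriction available.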
A particular XML data
model at level 2 is defined by the logical combination of (a) a core structure
and (b) some constraint specification. That is, together, the core structure
and a data constraint specification define a data model for an XML format. This
can be visualized by thinking of the structural meta-model, which provides for
a vast range of possible data models, particularized to a real-world
application by constraining the core
structure, which includes restricting it
to certain data categories and data-category values applied to the entry, language-section,
term-section, and term-component-section object classes in the structural
meta-model. The level 2 abstract structure, together with the general layout of
a constraint specification, will be
found in the next draft of ISO/CD 16642. The XLT core structure will be
derived, perhaps automatically, from the abstract structure of TMF. Thus, XLT
can be thought of as a subset of the TMLs (terminological markup languages) that
can be generated from the ISO TMF (terminological markup framework), with NLP extensions.
Level
3: The third level of the
integrated approach is where an actual XML format (or database layout) finally
appears. At this level the user designing an intermediate format must make one
additional choice: an XML representation style. One aspect of representation
style is how to represent a data category from the metadata registry. The
primary XLT style, based on ISO 12200, represents an instance of a data
category as an XML element with the tag name being a class of data categories (descrip, admin, etc.), the name of the data category, e.g. definition, being
the value of the type attribute, and
the content of the element being the value of the data category (i.e. the
content of the element is the definition itself). This is not the only possible
representation style, of course, and TMF will allow many different but
interoperable styles.
For example, a definition might be represented as:
<descrip type='definition'>A piston with
three grooves...</descrip>,
while an alternative representation style might
place the data category name as the tag name, as follows:
<definition>A piston with
three grooves...</definition>.
Whatever the style, the
name of the data category must appear somewhere in the element. It can be represented
as a tag or attribute name or as an attribute value or element content. All
members of the XLT family of formats have the same primary representation
style. Thus, a particular XLT format is derived simply by selecting a
particular data constraint specification (DCS). All these particular formats taken together form a
"family" of related formats.
The other formats defined
within the TMF project can be thought of as "extended family" related
to the XLT family. Since all core structures in the TMF family are equivalent
and since all styles in TMF are equivalent, any two TMF formats that use the
same DCS will be interoperable, that is, an instance of either format can be
converted to the other and back through a lossless (i.e. bijective) mapping.
Figure D.3.2 is a diagram of how the XLT family of
formats fits together. The database modality is
only given a token box labeled “DB mode”, since the focus of this diagram is
XML formats. A particular XML format at
level 3 is defined by the combination of a choice of core structure (at level
2), a data constraint specification (at level 2) and representation style (at
level 3).
Figure D.3.2: A family of
formats defined by three levels of abstraction
Note: the XML-mode
core structure is split into the TMF abstract structure and various compatible
core structures, such as one for MSC and other XLT formats and one for Geneter.
Note: primary and
secondary representations are alternative representation styles.
Note: Data-category
specification in this figure should read "data constraint specification
(DCS)".
Analysis is normally performed with some
purpose in mind. Typically, analysis will result in conversion of information
between an original data collection and an intermediate format for the purpose
of interchange or dissemination.
Interchange involves a transfer of information
between two computer systems and is typically bi-directional. Dissemination is
uni-directional and can be either for use by another computer system or for
human viewing. Both interchange and dissemination can be performed either in
batch mode or one entry at a time. Data analysis may lead to either
dissemination or interchange or may be focused solely on improving
understanding of the source data collection.
In a representation, for any purpose, there can
be various degrees of "blindness", which involves neutralization of
certain details of the source format, so that differences between formats are
reduced, as represented in the intermediate format. The more blind a format is,
the less an interpreter of data in that format needs to interact with the
originators of the data or know about the original format of the data. When
there are only two interchange partners and they are known to each other,
blindness is not an issue. But when
there are multiple sources of terminological data that must be imported by a
single routine, especially if it is desirable to add more sources without
modifying the import routine, blindness becomes very important. It is important
to note that blindness is relative to the knowledge of the receiver of an
instance document. There is no such thing as an absolutely blind format.
In interchange, the objective is usually to
maximize preservation of information. But in the case of dissemination,
the representation can be intentionally
partial, leaving out some information that was in the original data collection.
Such is the case in a map that preserves only certain aspects of geographical
reality, such as elevation, or rivers, or roads, or buildings or political
boundaries but not necessarily all of these. For example, a dissemination
representation for human translators does not necessarily include some
administrative information that is only relevant to terminologists maintaining
the database.
Thus, as shown in Figure D.3.3, analysis is
involved in the design of all representations of terminological data. Some analysis is for the purpose of
dissemination of terminological data to people and to computer systems, and
some analysis is for bi-directional interchange between computer systems. The
specifics of a particular XML format will be influenced by the purpose of the
format (simply for analysis, for dissemination, or for interchange) and by the
degree of blindness that is required.
Whatever the purposes and real-world needs that
guided the design of an XML intermediate format at level 2, the format, once
implemented at level 3, takes on a life of its own as it is used to represent a
variety of data, some of which may be unanticipated by the designer. By
following the integrated approach described here, the resulting format will be
more likely to be adaptable to varied circumstances and to be compatible with
other formats that are part of the same family of formats.
Figure D.3.3 – Relationship between interchange,
dissemination and data analysis
DXLT, the Default XLT Format, is a privileged
member of the SALT XLT family of formats. The purpose of this family of formats
is to represent terminological data. A particular member of the XLT family is
distinguished from other members by one choice: data-constraint specification
(DCS) file. DXLT is defined by choosing a particular data-constraint
specification with constraints designed to support relatively blind
interchange. The DXLT core structure module and DXLT data-category
specification module have formal definitions specified in Annex A and Annex B,
respectively, of this SALT document. Note that Annex A may not currently
include the NLP module because of on-going development in the OLIF project.
Other members of the SALT XLT family can be defined by various user-groups. It
is expected that for most purposes, a member of the XLT family can be defined
by a subset of the Default DCS file. However, it is understood that unusual
user needs may require the definition of a DCS file that is not a subset of the
Default XLT DCS file.
There is another format mentioned in Annex E of
this SALT document. It is called DXLT-SRa and it differs from DXLT only in
representation style. Therefore, DXLT and DXLT-SRa are fully interoperable.
As mentioned above, user-groups can define
various subsets of DXLT. All subsets use the XLT data-constraint specification
layout. A subset of DXLT is defined by the combination of the XLT core
structure and a subset of the Default XLT data constraint specification. When
dealing with subsets of DXLT, the full data-category specification is called
the DXLT master data-category-specification. Any XML data stream that is a
valid instance of a DXLT-subset format will also be a valid DXLT data stream,
but some valid DXLT data streams are not sufficiently constrained to be valid
DXLT-subset data streams.
A significant benefit of this approach of
defining DXLT subsets is that the person defining the subset does not need to
understand or even see an XML DTD or schema, only the data-category
specification itself. Moreover, the user can be buffered not only from DTDs and
schemas but also from the internal representation of the DCS itself through the
use of a software application.
A format defined by a subset of the Default XLT
DCS could be called a child of DXLT, and an XLT format not defined by the Default
XLT or a subset of it could be called a sibling of DXLT. A format within the
TMF project that is not part of the XLT family could be called a cousin of
DXLT. All formats in the extended
family defined by TMF that share the same DCS are interoperable.
It remains to be seen whether the TBX format
being developed by the LISA OSCAR group will be a child or sibling of DXLT.
However, it has been agreed that TBX will, if at all possible, be a member of the
SALT XLT family of formats. Various OSCAR member companies will then use TBX,
or some subset of it, for exchanging terminological data within their
organization and with outside contractors and other groups as appropriate. It has also been agreed that the OLIF2
consortium will supply the OLIF2 format to the SALT project so that a software
utility can be developed for merging an OLIF2 file into an XLT file and
subsequently recreating the original OLIF file, thus facilitating the
coordination of machine-translation and human-translation terminological resources,
for example. It is anticipated that at some point XLT will replace the current
chapter on terminological data in the Text Encoding Initiative guidelines.
Finally, as noted above, it is anticipated that XLT, without the NLP option,
will be compatible with the ISO TC/37 TMF project. The DXLT format should
provide the basis for the ISO TC/37 MSC project when it is re-activated. This
means that it will be possible to generate the XLT core structure from the TMF
abstract structure by specifying a vocabulary of tag names and an appropriate
set of representation choices. The same will be true of the Geneter format.
Thus, most of the work being done world-wide in the area of representation
formats for terminological data will be interoperable so long as compatible
data constraints are being used.
Annex E
(informative)
Alternate representations
DXLT is a member
of the ISO/TC37 TMF set of TMLs (formats), defined by choosing (1) the core
structure-module of DXLT (that is, the DTD and DCS layout in Annex A), (2) the
DXLT master DCS in Annex B, and (3) MARTIF representation style.
Other formats, known as DXLT
subset formats, are defined by creating a subset of the DXLT master DCS.
Another format, called
DXLT-SRa, is defined by choosing the same core structure and DCS as DXLT but a
different representation style. DXLT
and DXLT-SRa are fully interoperable formats.
Indeed, a fairly simple algorithm can convert between them.
Here is the sample DXLT
document from section 5 of this specification. It has been slightly adapted to
work with the XDR-schema version of the DXLT core-structure. This schema is
almost exactly equivalent to the DTD.
<?xml version='1.0'?>
<!-- copied from DXLT doc (16503) and adapted for schemas -->
<!-- hide doctype: martif PUBLIC "ISO 12200:1999A//DTD MARTIF core (DXFcdV04)//EN" -->
<martif type='DXLT' lang='en' xmlns="x-schema:DXFcsV04.xml">
  <martifHeader>
    <fileDesc><sourceDesc><p>from an Oracle corporation termBase</p></sourceDesc></fileDesc>
    <encodingDesc><p type='DCSName'>DXFd-mwk</p></encodingDesc>
  </martifHeader>
  <text>
    <body>
      <termEntry id='ID67'>
        <descrip type='subjectField' datatype="noteText">manufacturing</descrip>
        <descrip type='definition' datatype="noteText">A value between 0 and 1 used in ...</descrip>
        <langSet lang='en'>
          <tig>
            <term>alpha smoothing factor</term>
            <termNote type='termType' datatype="picklistVal">fullForm</termNote>
          </tig>
        </langSet>
        <langSet lang='hu'>
          <tig><term>Alfa simítási tényező</term></tig>
        </langSet>
      </termEntry>
    </body>
  </text>
</martif>
Now
we give the same information in DXLT-SRa format (except that some comments were
deleted):
<?xml version="1.0"?>
<martif type="DXLT-SRa" lang="en"
xmlns="x-schema:DXFga-mwk.xml">
<martifHeader>
<fileDesc>
<sourceDesc>
<p>from an Oracle corporation termBase</p>
</sourceDesc>
</fileDesc>
<encodingDesc>
<p type="DCSName">DXFd-mwk</p>
</encodingDesc>
</martifHeader>
<text>
<body>
<termEntry id="ID67">
<subjectField metaType="descrip"
datatype="noteText">manufacturing</subjectField>
<definition metaType="descrip"
datatype="noteText">A value between 0 and 1 used in
...</definition>
<langSet lang="en">
<tig>
<term>alpha smoothing factor</term>
<termType metaType="termNote"
value="fullForm" datatype="picklistVal"/>
</tig>
</langSet>
<langSet lang="hu">
<tig>
<term>Alfa simítási tényező</term>
</tig>
</langSet>
</termEntry>
</body>
</text>
</martif>
The DXLT-SRa data stream is
completely equivalent to the primary representation of DXLT, on an element for
element basis. A simple algorithm can
convert back and forth between primary and secondary representation without
loss of information. Indeed, the
DXLT-SRa version was created from the DXLT version by a webpage using embedded
JavaScript to access the XML DOM built into Internet Explorer 5.
A careful examination of the
DXLT version shows that the optional datatype attribute in the DXLT core
structure has been used to indicate the datatype of the text found in each
instantiation of a DXLT meta data category (subjectField and definition are
note text and termType is a picklist value). This information, which can be
retrieved from the DCS file, guides the algorithm that converts from DXLT to
DXLT-SRa format.
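The conversion described above can be sketched in a few lines. This is not the JavaScript used to produce the sample; it is an illustrative reconstruction that assumes the meta data category tags are limited to descrip, termNote and admin, that the datatype attribute is present on each such element, and that the input is an entry fragment without the x-schema namespace declaration. The function names are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

META_TAGS = {"descrip", "termNote", "admin"}  # assumed set of meta data category tags


def to_sra(root):
    """Primary style -> DXLT-SRa: the data category name becomes the tag name."""
    for el in root.iter():
        if el.tag in META_TAGS and "type" in el.attrib:
            meta_tag = el.tag
            el.tag = el.attrib.pop("type")
            el.set("metaType", meta_tag)
            if el.get("datatype") == "picklistVal":
                # Picklist values move from element content to a value attribute.
                el.set("value", (el.text or "").strip())
                el.text = None
    return root


def to_primary(root):
    """DXLT-SRa -> primary style: the inverse of to_sra."""
    for el in root.iter():
        if "metaType" in el.attrib:
            category = el.tag
            el.tag = el.attrib.pop("metaType")
            el.set("type", category)
            if "value" in el.attrib:
                el.text = el.attrib.pop("value")
    return root


def shape(el):
    """Order-insensitive view of an element, used to compare two trees."""
    return (el.tag, tuple(sorted(el.attrib.items())), (el.text or "").strip(),
            tuple(shape(child) for child in el))


primary = ET.fromstring(
    '<termEntry id="ID67">'
    '<descrip type="subjectField" datatype="noteText">manufacturing</descrip>'
    '<langSet lang="en"><tig><term>alpha smoothing factor</term>'
    '<termNote type="termType" datatype="picklistVal">fullForm</termNote>'
    '</tig></langSet></termEntry>')

sra = to_sra(copy.deepcopy(primary))
roundtrip = to_primary(copy.deepcopy(sra))
print(ET.tostring(sra, encoding="unicode"))
```

Comparing shape(roundtrip) with shape(primary) is one way to confirm, for a given input, that the roundtrip loses no information.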
[1] ISO 639:1988, Code for the representation of names of languages.
[2] ISO 639-2:1998, Code for the representation of names of languages – Part 2: Alpha-3 code.
[3] W3C, Extensible Markup Language (XML) 1.0 (W3C Recommendation 10 February 1998) (http://www.w3.org/TR/1998/REC-xml-19980210)
[4] W3C, XML Schema Part 1: Structures (W3C Working Draft 7 April 2000) (http://www.w3.org/TR/xmlschema-1)
[5] W3C, XML Schema Part 2: Datatypes (W3C Working Draft 7 April 2000) (http://www.w3.org/TR/xmlschema-2)