Copyright © 2001
Translation Research Group
Send comments to
comments@ttt.org
Last updated: February 7, 2001
SALT: Standards-based Access
to multilingual Lexicons and Terminologies
Alan K. Melby
SALT has begun! For information
on the SALT Project in Europe please visit http://www.loria.fr/projets/SALT.
-
What is SALT?
SALT is a consortium
of academic, government, association, and commercial groups in the
U.S. and Europe that are working together to test, refine,
and implement a single, universal format for the interchange
of terminology databases and machine-translation lexicons. This universal
"lex/term" format is based on the recently-adopted MARTIF
standard (ISO 12200, which is in turn based on ISO 12620) for human-oriented
terminology database ("term" for short) exchange (for further
info see www.ttt.org) and the OLIF
format for machine-translation dictionary and other NLP lexicon ("lex"
for short) exchange (see www.otelo.lu),
along with some Unicode and meta-markup features of the TMX standard
for translation-memory database exchange (see www.lisa.unige.ch
in the SIG section) and, finally, coordination with results from other
related projects, such as Transterm and Geneter. The Transterm project
has been finished for some time, but we intend to coordinate the MARTIF
conceptual data model with the Transterm format conceptual data model,
which is similar. Another project with a similar conceptual data model
is the University of Rennes project with the Geneter format, which
is used in the Inesterm project. We are working on an automatic conversion
between MARTIF and a version of the Geneter format.
As stated, the SALT
project itself involves:
- testing and refining
an XML-based lex/term-data interchange format combining MARTIF and
OLIF and called XLT,
- development of
a website for people to try out various XLT utilities, and
- development of
an XLT toolkit for lex/term-related product developers
The utilities will include
conversion routines between OLIF and XLT, between Geneter and XLT, and
between several other formats and XLT, as well as guidelines for those
who want to develop their own conversion routines.
For many people in
the language industries, the benefits of having one widely used term-data
interchange format are obvious. Indeed, the following typical comment
was made at the Localization Industry Standards Association Forum
in Boston (February 1999) in response to the idea of combining MARTIF
and OLIF: "This is what we have been waiting twenty years for!"
However, to facilitate
your discussions with colleagues who have not been thinking about
these issues continually over the past twenty years, let me repeat
some of the projected benefits of a universal term-data exchange format:
- Faster
Insertion of New Terms into a Database
The language industries are rapidly embracing the use of translation
tools such as automatic terminology lookup, terminology mining,
terminology consistency checkers, and machine translation. Authoring
tools that provide access to a termbase are also appearing, at least
in the context of controlled language, but will hopefully soon be
applied to the control of terminology in the authoring process even
when the syntax is less controlled. Each of these technologies is
driven by or drives terms in databases. No longer is a paper glossary
sufficient or even a word processing file that is useable only by
a human. With each database potentially using a different internal
format, receiving term information from multiple sources and incorporating
that information into your database can require either hand re-keying (ugly!)
or custom programming of format-to-format filters (expensive!).
Once one lex/term-data exchange format (say, for example, XLT) is
widely accepted, every translation tool developer can include just
one import/export filter (to and from XLT) into their application.
Then, everyone can request term data in the XLT format appropriate
to their user-group and incorporate it into their database without
re-keying or writing custom filters.
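The economics of this hub-and-spoke arrangement can be sketched in a few lines of code. The record layouts and tool names below are invented for illustration; they are not taken from the OLIF, Geneter, or XLT specifications.

```python
# Hub-and-spoke conversion through a single pivot format: each tool needs
# only one import filter and one export filter, so N formats need 2N
# filters instead of N*(N-1) pairwise converters.
# The record layouts below are hypothetical.

def tool_a_to_pivot(record):
    # Hypothetical Tool A stores (term, language) tuples.
    term, lang = record
    return {"term": term, "lang": lang}

def pivot_to_tool_b(entry):
    # Hypothetical Tool B expects "lang:term" strings.
    return f"{entry['lang']}:{entry['term']}"

def convert_a_to_b(record):
    # Any-to-any conversion is just a composition of two pivot filters.
    return pivot_to_tool_b(tool_a_to_pivot(record))

print(convert_a_to_b(("terminology", "en")))  # prints "en:terminology"
```

Adding a new format to this scheme means writing two new filters, not one per existing tool.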
- More Consistency
Across Documents
It is well known that, given professional competence on the part
of an author, translator, or reviewer, the single most significant
controllable factor in documentation/translation quality is consistency
in the use of terms. Whenever multiple authors or translators are
working on a large document or set of related documents, there is
the chance for inconsistency in the use of terms, even if each person
in the document production chain is using some kind of translation
tool that includes a termbase. XLT will facilitate the dissemination
of current versions of the constantly updated master term database
to multiple authors/translators at multiple locations around the
globe using multiple translation tools, so long as they all include
an appropriate XLT import filter. Without a universal exchange format,
expensive custom programming may be required to maintain consistency.
- Synchronization
of Human and Machine Translation
An increasingly common scenario in large organizations is the use
of both human translation (with technology assistance such as translation
memory lookup and automatic terminology lookup) and machine translation
(with human revision of raw output). It is imperative that the human
and machine translation sides of such an operation use terms consistently.
That is why XLT includes both a human-translation aspect (MARTIF)
and a machine-translation aspect (OLIF) that have been integrated
into a single format framework (XLT). The SALT project will include
the development of freeware tools for merging human-translation
term data and machine-translation term data into a single database,
with automatic reporting of potential holes on either side and potential
conflicts (such as the same concept in the same domain being designated
by different terms on the human and machine translation sides), to
be brought to the attention of a human terminologist for evaluation.
The merged database could even be used as the master repository
for noun and noun-noun compound terms for both the human translator
tools and the machine translation system in your organization. The
cost of synchronizing may be greatly reduced by using the SALT utilities.
The cost of not synchronizing may be enormous in terms of the deleterious
effects of inconsistency. It is reported that the internal costs
of developing and implementing a human/machine translation synchronization
system in one very large organization ran into the millions of dollars
yet was expected to pass the break-even point within a few years.
When they heard about the SALT project, they wished it had been
done sooner so that they could have reaped a return on investment
sooner.
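A minimal sketch of the kind of merge-and-report utility described above, under the simplifying (and hypothetical) assumption that each side stores one preferred term per concept identifier:

```python
# Merging a human-translation termbase with a machine-translation lexicon,
# reporting holes (concepts present on only one side) and conflicts (the
# same concept designated by different terms on the two sides).
# Hypothetical simplification: one preferred term per concept identifier.

def merge_termbases(human, machine):
    holes_on_human_side = sorted(set(machine) - set(human))
    holes_on_machine_side = sorted(set(human) - set(machine))
    conflicts = {c: (human[c], machine[c])
                 for c in set(human) & set(machine)
                 if human[c] != machine[c]}
    # The human side wins provisionally; conflicts are reported for a
    # human terminologist to resolve.
    merged = {**machine, **human}
    return merged, holes_on_human_side, holes_on_machine_side, conflicts

human = {"C001": "hard disk", "C002": "printer"}
machine = {"C001": "hard drive", "C003": "scanner"}
merged, hh, hm, conflicts = merge_termbases(human, machine)
print(conflicts)  # prints "{'C001': ('hard disk', 'hard drive')}"
```

A real utility would of course carry full entries rather than bare terms, but the hole and conflict reports would be computed in essentially this way.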
-
Who is SALT?
The US coordinator
is Alan K. Melby (Brigham Young University (BYU) and LinguaTech International);
the US team also includes others from academia, such as terminologist Sue
Ellen Wright at Kent State University and Deryle Lonsdale at BYU,
who will head up the ontology aspects of SALT. The European coordinator
is Gerhard Budin at the University of Vienna. The following commercial
translation tools developers have been contacted and have expressed
interest in the project: Trados, Star, EP, Logos, Systran, and L&H.
Additional potential main partners that have expressed interest are:
the University of Applied Sciences (Cologne), IAI (Saarbrucken), the
European Academy (Bolzano, Italy), the Institute for Business Informatics
(Kolding, Denmark), Loria Labs and Termisti (France and Belgium),
the University of Surrey (UK), and SAP [representing Otelo] (Germany).
[Clearly, the participation of key members of the Otelo project, particularly
those focused on OLIF, is important. The Otelo final user's group
meeting includes possible SALT collaboration as an agenda item.] An
advisory group is being formed, consisting of governmental and non-governmental
organizations that have a vested interest in term data sharing. So
far, we have received positive responses from LISA (headquartered near Geneva),
Infoterm (Vienna), and AMTA (Ed Hovy). We will also be coordinating with
two relevant EC agencies that cannot be formally listed in the proposal
but are willing to advise informally. We have also contacted various
LISA member companies for letters of support. We have also enlisted
additional corporate partners, such as Medtronic and HP. A number
of other companies, including Microsoft, are involved through LISA,
as IT developers, Localization service providers, and as LISA-OSCAR
Steering Committee members.
-
What are the duties
and benefits of being a SALT partner?
There are essentially
two levels of participation possible in the SALT project:
- main partner:
- perform data
collection and analysis
- develop/test
demo website featuring utilities for validation, conversion,
merging, etc.
- develop and
test the software development toolkit (which will be used in
the website utilities and made available to developers for integration
into their applications)
- advisory
partner:
- provide sample
term data to data collectors
- test website
as it is developed
- provide end-user
feedback
The principal benefit
from being a SALT partner is not funding. Partners should be motivated
by a desire to satisfy the strongly-felt need within the language
industry for a universal term data interchange format. Should NSF/EU
funding not be approved (which would be extremely unfortunate), the
project will proceed on a much smaller scale anyway, especially with
the industrial partners, since it is in their interest to be involved
in the refinement and promotion of the primary interchange format
for human-translator and machine-translation term data. However, a
strong show of support now will increase the chances of funding. Specific
benefits to partners include the prestige of early adoption, the advantage
of influencing the refinement process, and the potential for eventual
consulting work to help others implement XLT.
Of course, the partners
will also share in the same benefits that will be reaped by everyone
in the language industries. These benefits are listed in section 1.
-
How do MARTIF, OLIF,
Geneter, TBX, XML, and XLT fit together?
Many projects over
the years have worked on the problem of term data exchange. There
are two sides: human-oriented, concept-oriented terminology databases
(termbases, hence "term"), and the lexicons and other lexicographically
oriented resources used by machine translation and other NLP applications
(hence "lex").
On the termbase side,
we have seen the MATER project, the MicroMATER project, and the terminological
data aspect of the TEI project that have culminated in the MARTIF
standard (ISO 12200, based on ISO 12620), both of which were published as
ISO standards in the third quarter of 1999. In the August 1999 ISO
meetings in Berlin it was decided to pursue
- an application
of MARTIF called MARTIF with Standardized Constraints (MSC)
- another intermediate
format called Geneter, which, like MARTIF, is based on ISO 12620
- a meta-model encompassing
both MARTIF and Geneter
In addition, a resolution
was passed mandating an effort to make both MSC and Geneter interoperable.
An expected outcome of this effort is that Geneter will be brought into
the family of MARTIF-compatible formats in view of market needs for
a single lex-term interchange format.
On the machine-translation
lexicon side, we have seen a series of EC projects, including Eurotra-7,
Multilex/MLEXd, and Genelex projects, and, more recently, the Transterm
project, along with several commercial exchange formats, such as MLIF
(from METAL) and LEF (from Logos). The most recent MT-lexicon exchange
format, OLIF, emerged from the Otelo project, which ended in the 2nd
quarter of 1999. One OLIF paper specifically acknowledges the following
previous formats: MARTIF, Transterm, Interval, MLIF, and LEF.
Another important
historical thread in the development of term-data exchange standards
is LISA (the Localization Industry Standards Association). This important
trade association started work in 1997 on a standard format for exchanging
translation memory database data. The result, called TMX, was one
of the first applications of XML and is being implemented by major
commercial translation technology vendors. The next data exchange
standards project of LISA is to define TBX for term database exchange.
Suddenly these various
threads (MARTIF, OLIF, and LISA) came together starting in February
1999. While chairing an OSCAR meeting in Boston in February (OSCAR
is the LISA data exchange standards body), Alan Melby got feedback
that the previous OSCAR plan to look at termbase and MT-lexicon data
exchange separately was unacceptable to the localization industry
(and probably the wider language industries). An integrated standard
was needed now. While working on an integrated MARTIF-OLIF format
for the TBX proposal during February and March, he noticed a call
for joint USA-EU proposals from the National Science Foundation on
the USA side and from the 5th Framework/IST/HLT program on the EU
side. The title of the call is "Multilingual Access and Management",
and the description of the call includes terminology management, human
and machine translation, and data exchange standards. It seemed a
natural step to propose a project that combines further work on MARTIF
and OLIF, along the lines of the TBX proposal but going beyond LISA.
That step was taken
and the SALT project and its XLT format have been launched. XLT is
an XML-compliant framework for defining a family of closely related
term data exchange formats tailored to specific user groups. MARTIF
is an SGML application that has been adapted to the XML world in anticipation
of the adoption of an XML-Schema standard and has become the heart
of XLT. The essence of OLIF, which is a tagged but not SGML application,
has been integrated into XLT by inserting the OLIF header into the
XLT header, merging the OLIF Central Entry into the corresponding
element of XLT taken from MARTIF, and adding to each XLT concept entry
an optional NLP feature-value pair list that corresponds exactly to the
feature-value pair list of OLIF but is recast in MARTIF-style XML.
In addition, the TMX method of documenting user-defined Unicode characters
and the TMX meta-markup method of including presentational markup
in running text (for contextual examples, etc.) have been incorporated
into XLT. The resulting format, XLT, is described in a paper presented
at the TKE conference in August 1999. That draft is available through
the www.ttt.org homepage as a PDF
file called "Leveraging
Terminological Data for Use in Conjunction with Lexicographical Resources".
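To make the integration concrete, here is a hypothetical, much-simplified sketch of what such a combined entry might look like and how a filter could read it: a MARTIF-style concept entry carrying an optional OLIF-derived feature-value list recast as XML. The element names (`termEntry`, `langSet`, `nlpFeatures`, `feature`) are illustrative assumptions, not the actual XLT markup.

```python
# A hypothetical, much-simplified XLT-style concept entry: a MARTIF-style
# core plus an optional OLIF-derived list of NLP feature-value pairs
# recast as XML. Element names are illustrative, not the real XLT markup.
import xml.etree.ElementTree as ET

entry_xml = """
<termEntry id="c42">
  <langSet lang="en">
    <term>hard disk</term>
    <nlpFeatures>
      <feature name="pos" value="noun"/>
      <feature name="number" value="singular"/>
    </nlpFeatures>
  </langSet>
</termEntry>
"""

root = ET.fromstring(entry_xml)
term = root.findtext("langSet/term")
features = {f.get("name"): f.get("value")
            for f in root.findall("langSet/nlpFeatures/feature")}
print(term, features)
```

A human-translation tool could simply ignore the `nlpFeatures` element, while an MT lexicon import filter would consume it; that is the sense in which one framework serves both sides.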
-
What are the Objectives,
Goals, and Relevance of SALT?
The objective of SALT
is to build on the work that has been done in several projects dealing
with sharing what we call lex/term-data (including Otelo, Transterm,
and Martif). Specific goals include (a) the testing and refinement
of a unified XML-based format called XLT (of which TBX is the LISA
subset), (b) the development of a demonstration website for end-users
to submit files in various formats and validate them, merge them,
and get them back in another format, using XLT as the intermediate
format, and (c) the development of a toolkit for translation technology
developers who want to integrate XLT filters into their software applications.
A specific research
goal for SALT is to investigate the difficult problem of mapping positions
from one ontology into another. This problem arises necessarily when
attempting to minimize information loss going from one termbase to
another when the two termbases do not use the same ontologies (classification
systems, concept systems, and thesauri). Other less challenging but
useful goals are the tasks of extracting a concept system from an
existing termbase and grafting an existing ontology onto a termbase
that does not yet have links to one.
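The ontology-mapping problem can be illustrated with a deliberately naive sketch: positions in one subject-field classification are mapped through an explicit table, and any position without a counterpart is flagged as potential information loss for a human terminologist. All codes here are invented for illustration.

```python
# Naive sketch of mapping subject-field positions from one ontology to
# another via an explicit mapping table. Positions with no counterpart
# are flagged as potential information loss. All codes are hypothetical.

ONTOLOGY_A_TO_B = {
    "IT.storage": "computing/hardware",
    "IT.print": "computing/peripherals",
}

def map_positions(positions):
    mapped, unmapped = [], []
    for pos in positions:
        if pos in ONTOLOGY_A_TO_B:
            mapped.append(ONTOLOGY_A_TO_B[pos])
        else:
            unmapped.append(pos)  # needs a human terminologist's decision
    return mapped, unmapped

mapped, unmapped = map_positions(["IT.storage", "IT.scan"])
print(unmapped)  # prints "['IT.scan']"
```

The hard research problem, of course, is constructing the mapping table itself when the two concept systems partition a domain differently; the table here stands in for that work.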
The overall goal of
SALT, however, is extremely practical. It is to reach "critical
mass" with XLT so that tools developers, such as Star, Trados,
EP, Logos, Systran, L&H, and Xerox, will incorporate some level of
XLT support in their products and so that various companies will provide
on-going consulting services to anyone who wants to get their proprietary
lex/term-data into XLT format or XLT data into their proprietary format.
The demonstration website will of course use the XLT toolkit. Developers
and consultants will all use the detailed specifications, sample files,
and tools for XLT that will be made universally available as freeware
with the only restriction being strict adherence to the standard in
order to use the SALT/XLT logo.
Without such a "jump
start" I fear that widespread use of data exchange standards
for lexi[cographical]/term[inological]-data will be unnecessarily
delayed. Given the work that has been done to date on lex/term-data
exchange, we do not need to search for the "ideal" or absolutely
perfect format. The OLIF and MARTIF formats are good enough and the
need is growing. Let's get them widely enough known in their integrated
form as XLT (with various user-group-specific subsets such as TBX)
so that no fragmentation can take place. The language industries need
one format that is good enough, not multiple competing formats. And
they need it now.
As for the timetable, SALT has essentially already begun: the XLT framework
is ready for testing and initial data collection is under way. We hope
to hold a major SALT conference in conjunction with the TAMA 2000
conferences.
Please direct comments
and questions concerning SALT to:
Alan K. Melby <akm@byu.edu> (+1
801 378-2144)
with a cc: to Arle Lommel <fenevad@ttt.org>
(+1 801 378-4414)
Brief letters of support for the SALT project should be written on organization
letterhead and mailed to:
Alan K. Melby, Dept. of Linguistics
2129 JKHB BYU
Provo, Utah 84602
USA
and faxed to +1 801 377-3704