term (main entry term): | Automobil
synonym (full form):    | Personenkraftwagen
short form (1):         | Auto
short form (2):         | Wagen
abbreviation:           | PKW
Figure 2: Sample Data Model 1
3.2 Data Category Variation
Examination of numerous data modeling conventions has also revealed
that different needs result in different data structures, a phenomenon
that developers of the standards have sometimes called "bimodality",
but which is perhaps more clearly called "data modeling variance".
Some evaluators have voiced the view that it would be appropriate
to simply concentrate on the "normal" way of structuring
terminological data, but examination of the full spectrum of existing
systems reveals that there is no one "normal" way. Consequently,
the standard has been designed to ensure the compatibility of
the various approaches. The apparent complexity that results is,
however, a reflection of real-world conditions.
Deviations in data modeling manifest themselves at two levels.
In some cases, variation involves simply the choice of divergent
data category names, which results in polysemy and synonymy. This
phenomenon is not difficult to deal with because such names can
be reassigned fairly easily during the data conversion process,
provided database developers have followed consistent procedures
in assigning data to any given field, which may not always be
the case.
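The renaming step described above can be sketched as a simple lookup applied during conversion. This is only an illustration: the field names in the mapping and the contents of the sample record are hypothetical and are not taken from any particular database or from the standard.

```python
# Hypothetical mapping from one database's field names to normalized
# data category names. None of these names is prescribed by MARTIF.
FIELD_NAME_MAP = {
    "Definition": "definition",
    "Def.": "definition",   # synonymy: two field names, one data category
    "Quelle": "source",
    "Source": "source",
}


def rename_fields(record, mapping):
    """Return a copy of `record` with field names normalized via `mapping`.

    Unknown field names are kept unchanged so that no data is silently lost.
    """
    renamed = {}
    for name, value in record.items():
        target = mapping.get(name, name)
        if target in renamed:
            # Two source fields collapsing into one needs a human decision.
            raise ValueError(f"two source fields map to {target!r}")
        renamed[target] = value
    return renamed


# Invented sample record, for illustration only.
record = {"Def.": "(definition text)", "Quelle": "Sanchez 45", "Note": "checked"}
normalized = rename_fields(record, FIELD_NAME_MAP)
```

Such a lookup only works if each source field has been used consistently; a field that sometimes holds definitions and sometimes contexts cannot be repaired by renaming alone.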
The second type of modeling variation involves the way that categories
behave in individual databases. Some databases, especially those
influenced by paper-management traditions or lexicographical approaches,
use data categories for different types of terms as field names
in the data entry and enter terms as the content of these fields,
for instance as follows in the German language group of an entry
for the concept automobile (Figure 2).
        |                    | Attributes
term:   | Automobil          | full form
term:   | Personenkraftwagen | full form
term:   | Auto               | short form
term:   | Wagen              | short form
term:   | PKW                | abbreviation
Figure 3: Sample Data Model 2
Other databases simply use a data category term to which
all terms, regardless of term type, are assigned. The philosophy
behind these types of databases is that all terms in a terminological
entry are "created equal" as synonyms or equivalents
(or at least quasi-synonyms and equivalents). There are several
practical advantages of this system. For instance, sometimes in
any given terminological entry, a particular term will be considered
to be a main entry term or preferred term by one
customer, but only a synonym or even a deprecated
term by another. This approach also simplifies look-up
routines because queries only have to search for one data category
(term), and automatic cut-and-paste routines for translation
purposes do not have to be designed to accommodate any other data
categories. If additional information is added to categorize
these terms, it is supplied as individual attributes describing
the term field as shown in Figure 3.
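To make the difference between the first two models concrete, the following sketch converts the entry of Figure 2 into the shape of Figure 3. The function names and the dictionary layout are assumptions made for this illustration; the mapping of the main entry term and its full-form synonym to "full form" simply mirrors the attribute values shown in Figure 3.

```python
import re

# Sample Data Model 1 (Figure 2): the term type is carried by the field name.
model1_entry = [
    ("term (main entry term)", "Automobil"),
    ("synonym (full form)", "Personenkraftwagen"),
    ("short form (1)", "Auto"),
    ("short form (2)", "Wagen"),
    ("abbreviation", "PKW"),
]


def field_to_attribute(field_name):
    """Map a Model 1 field name to a Model 2 attribute value (assumed rule)."""
    base = re.sub(r"\s*\(\d+\)$", "", field_name)  # drop counters such as "(1)"
    if base in ("short form", "abbreviation"):
        return base
    # Main entry term and full-form synonym both appear as "full form" in Figure 3.
    return "full form"


def model1_to_model2(entry):
    """Give every term the same field name and move its type into an attribute."""
    return [
        {"term": value, "attribute": field_to_attribute(name)}
        for name, value in entry
    ]


model2_entry = model1_to_model2(model1_entry)
```

Note that the distinction between main entry term and synonym is dropped by this particular rule; a converter that needs to keep it would have to record it as a further attribute.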
Figure 4 represents yet a third, less common modeling structure.
In Figures 2 and 3, only those items that specifically apply in
a given instance appear in the data entry. In contrast, the model
in Figure 4 can lead to an accumulation of redundant data because
each permissible instance is enumerated for every item and given
a yes or no value instead of treating negative instances as unstated
default values.
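The redundancy of the third model can be seen by collapsing its yes/no values into the attributes that actually apply. The sketch below assumes a hypothetical in-memory layout for the rows of Figure 4; only the term strings and the flag names come from the figure.

```python
# Sample Data Model 3 (Figure 4): every permissible attribute is listed
# for every term and given an explicit yes or no value.
model3_rows = [
    {"term": "Automobil",
     "flags": {"full form": "yes", "short form": "no", "abbreviation": "no"}},
    {"term": "Personenkraftwagen",
     "flags": {"full form": "yes", "short form": "no", "abbreviation": "no"}},
]


def model3_to_model2(rows):
    """Keep only the attributes flagged "yes"; "no" becomes an unstated default."""
    converted = []
    for row in rows:
        attributes = [name for name, flag in row["flags"].items() if flag == "yes"]
        converted.append({"term": row["term"], "attributes": attributes})
    return converted


model2_rows = model3_to_model2(model3_rows)
```

In this small example, six stored values reduce to two. The reverse conversion requires the full list of permissible attributes in order to regenerate the negative values.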
A fourth approach to data "modeling" in an import environment
involves merely dumping all information as text data into an undifferentiated
text file (i.e., without assigning individual items to data categories
at all). Information is retrieved using all-word indexes. Interchange
back to any articulated database environment from such a system
would, of course, be impossible, and if the system is large, human
users will be burdened with an unmanageable number of hits for
many queries. Hence, this "solution" will not be treated
here as a serious alternative.
term:         | Automobil
full form:    | yes
short form:   | no
abbreviation: | no
term:         | Personenkraftwagen
full form:    | yes
short form:   | no
abbreviation: | no
etc.
Figure 4: Sample Data Model 3
Based on these examples, it is obvious that even simple data
structures pose complex problems if there is a desire to combine
entries with different structures. At one point in the evolution
of the standard, an attempt was made to allow for the expression
of all possible data modeling variations at the interchange level,
but this led to elaborate levels of complexity that also threatened
to create difficulties in parsing the resulting data. The choice
was made to simplify the interchange format by providing for a
single way of expressing these relations.
The final version of the MARTIF standard favors Sample Data Model
2 (Figure 3), primarily because it is the most concept-oriented
of the three and because it allows for the inclusion of multiple
attributes associated with a given term. In this, as in any other
information-management situation, problems of complexity cannot
be eliminated simply by not taking cognizance of them. In those
cases where source or target databases differ from the chosen
model, the import and export routines used to convert those databases
to and from MARTIF will have to be more complex. This degree of
complexity is not simply the problem of MARTIF, however; it is
an expression of the complexities existing in the "real world".
Defining a simpler MARTIF would not resolve the problems posed
by complexity; it would only hide them and potentially lead
to lost or misinterpreted data.
3.3 Generic Identifiers and Compound Tags
Some evaluators have registered the criticism that the full "tag"
naming convention used in MARTIF appears to be excessively verbose.
Instead of giving each element a simple generic identifier (GI)
in Standard Generalized Markup Language (SGML), composite names
are used. These composites consist of a GI taken from the small
set of GIs defined in the standard, followed by the attribute
type plus a value. It is at this value level that the full
tag is uniquely identified. For example, the common field name
"Definition" becomes <descrip type='definition'>
in MARTIF.
Early in the development of the standard, the collected data
categories were classified and proposed as a list of GIs to be
used to tag the individual elements of the MARTIF entry. The initial
assumption was that each item would constitute the basis for a
GI in the MARTIF standard. For instance, a short list of GIs might
read: <term>, <definition>, <source>, <context>,
<responsibility>.
A major consideration within the TEI environment, however, was
whether the data categories identified as GIs would be a closed
set (such as can be defined for a standard format like the MARC
record) or whether it should remain an open list that can be augmented
as need arose. This decision is a critical one, for GIs and attributes
in SGML must be declared in the Document Type Definition (DTD),
which serves as the master "program" that specifies the rules
for parsing the SGML document. Once a standard DTD is set, it
is not feasible to change the GIs.
At the time this problem was considered in the TEI group, many
new data categories were being added to the list every year. Although
new additions have become rare in the last few years, it is simply
not realistic to assume that terminologists have now discovered
all the terminological applications that can exist. Consequently,
the decision was made to declare a relatively short list of GIs
that would be truly generic and to modify these tags using the
value of the attribute type.
The resulting composite tags for the data categories cited above
now are as follows: <term> (which remains unchanged and
reflects the data modeling structure illustrated in Figure 3),
<descrip type='definition'>, <ref type='source'>,
<descrip type='context'>, and <admin type='responsibility'>.
On the surface, this decision indeed introduces wordier (but not
necessarily more complicated) full tag names. What is gained is
first of all the ability to add data categories as needed because
only the GIs and the attributes have to be declared in the standard
DTD, whereas values assigned to the attribute type do not. (They
can, however, be listed in the header of the DTD. Standard SGML
parsers would not catch anomalies at the level of the type
value, but work groups would be free to design their own subsidiary
validation routines to check the actual data categories and in
some cases, even their content.) Furthermore, one also gains the
ability to classify the behavior for whole groups of data categories
by defining the behavior of the GIs associated with them.
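A subsidiary validation routine of the kind mentioned above might look like the following sketch. It assumes that the fragment to be checked is well-formed XML (real MARTIF documents are SGML and would first need an SGML-aware parser), and the list of permitted type values is a hypothetical work-group list, not the list defined by the standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical work-group list of permitted `type` values per generic identifier.
ALLOWED_TYPES = {
    "descrip": {"definition", "context"},
    "ref": {"source"},
    "admin": {"responsibility"},
    "termNote": {"termType"},
}


def find_unknown_types(xml_text, allowed):
    """Return (tag, type) pairs whose type value is not on the permitted list."""
    root = ET.fromstring(xml_text)
    problems = []
    for element in root.iter():
        type_value = element.get("type")
        if type_value is None:
            continue  # element carries no type attribute, nothing to check
        if type_value not in allowed.get(element.tag, set()):
            problems.append((element.tag, type_value))
    return problems


# Invented fragment; the second <descrip> has a misspelled type value that a
# parser checking only GIs and attribute names would accept.
sample = """
<termEntry>
  <descrip type='definition'>...</descrip>
  <descrip type='defintion'>...</descrip>
  <admin type='responsibility'>...</admin>
</termEntry>
"""

unknown = find_unknown_types(sample, ALLOWED_TYPES)
```

Because the permitted values live in a table rather than in the DTD, a work group can add a data category by editing that table without touching the declared GIs.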
The decision to use the long multi-component tags is certainly
an added complexity so far as human, manual markup is concerned,
for the omission of one small element can result in parsing errors,
but the objective of the standard is not to produce a mechanism
that will be easy for humans to produce and read. The goal is
rather to provide a powerful tool for automatic conversion of
data. What is "simple" to a stupid machine in terms
of SGML logic is not necessarily the same thing that will be simple
to a human being. The length of the tags is not particularly relevant
to the machine; only the fact that they have unique values matters.
Nor does the number of values associated with any given GI constitute
a problem so long as they are clearly identifiable to the machine,
which has no trouble remembering all of them the way a human would.
Furthermore, limiting the GIs significantly simplifies and shortens
the DTD, which constitutes one way in which MARTIF is not
as complex as it might be had it been designed differently. Listing
each value of type as a separate GI would make the DTD unwieldy
even if the list did not change. Doing so would also discourage
the definition of specific subsets in individual work groups.
The aspects in this section lead us to the recurring question:
MARTIF is potentially complex, but for whom is it complex? Looking
at the raw data is complex for the average user, but not necessarily
for a system designer or programmer. The ideal situation would
be to retain the complexity of MARTIF, together with its built-in
flexibility and power, but to present potential users with an
interface that would drastically simplify the reuse of MARTIF
documents (see Section 6).
3.4 Embedding and GI Behavior
Early on, some critics questioned the idea of using SGML as
a language for expressing terminological data structures. These
critics argued that SGML is fundamentally designed as a document
exchange format that is only capable of expressing flat structures
and is unsuitable for implementing relational links. Today these
comments seem ironic when we consider how SGML-related formats
such as HTML (Hypertext Markup Language) and HyTime are being
used to create highly flexible, multi-layered information systems
capable of facilitating powerful links across computer platforms
on a world-wide scale.
It is, of course, possible to use SGML to produce very flat,
simple structures. Figure 5a represents an example of what a hypothetical
flat model might look like, whereas Figure 5b shows a similar
simple entry expressed in MARTIF format. The examples are arranged
side-by-side for easy comparison of parallel structures.
Both models follow the typical SGML practice of placing related
information into containers, identified here by full tag names
that appear in boldface. At first glance, both models appear to
express equally simple data structures, although obviously the
MARTIF tags are slightly longer. Both entries begin by including
concept-related information that refers to the entire terminological
entry. 5a includes a container for concept-level information,
but MARTIF does not. Such a container is unnecessary, however,
since the <termEntry> contains all the information that
belongs to the concept-oriented terminological entry, and all
the data categories that are not grouped in the <ntig>s
pertain to the concept itself.
Both models place term-related information inside containers.
In Figure 5a, information associated with the main entry term
is packaged in <termInfo> tags and that for equivalent terms
in <equivInfo> tags. This procedure works well if a single
language is given preference, but it remains unclear how such
a system would work if it were desirable to present terms without
preferencing a specific language.
Figure 5b places all information having to do with a given term
inside <ntig> tags, i.e., nested terminology information
groups. In the MARTIF model, all terms are treated equally.
The equivalence of these terms is implicit in the fact that they
appear in the same entry. Not only is their equal treatment in
the entry more "politically correct" (it does not give
preference to any one language), it also facilitates directional
change in using the database or in importing the data and readily
allows for the interchange of data to or from other data models
where different languages are given preference. If there is a
clear need to preference a single term in a single source language,
the MARTIF notation <termNote type='termType'>main entry
term</termNote> can be used.
Example 5a expresses dependency between items by just placing
them adjacent to each other: <grammar>, <termType>,
and <termSource> simply follow <term>, and <defSource>
follows <definition>, thus producing an essentially flat
structure. These relations are uncomplicated enough for humans
to intuit based on their knowledge of terminological behavior,
but such dependencies are not easy for dumb computers
to keep straight: Is each of the items inside <term> equally
and independently reliant on the <term>, or is <def>
somehow dependent on <termSource> perhaps? How would an
import routine manage to package the data correctly in order to
redistribute it in a target system that looks very different from
the source database structure? It would appear that this model
makes explicit some information that is already clearly implicit
in the entry (i.e., term equivalence), while at the same time failing
to mark important dependencies that are not at all implicit.
In contrast, the MARTIF entry is structured rather like a Russian
matryoshka doll, with small elements tucked inside of bigger
elements tucked inside of bigger elements, i.e., the dependency
of items on other items is very clearly and securely expressed
by use of the powerful embedding capability that is native to
SGML. Packaging information in MARTIF is also like organizing
a kitchen cupboard: each individual herb or spice goes first into
a separate labeled container. No sensible cook would just dump
them in piles one after the other in the corner of the kitchen
and hope to be able to retrieve them based on the fact that they
are merely adjacent to one another.
Data, like condiments, are easier to retrieve and manipulate
if they are clearly packaged and marked than if one has to search
through the whole collection to find a single small item. In order
to "know" that certain elements belong together, these
relations must be explicitly stated in a way that the SGML parser
can understand in order to avoid introducing ambiguities into
the parsing process that pose serious dangers of data loss, contamination,
or misinterpretation.
The markup inside the <ntig> (shown here in boldface) provides
a powerful tool for indicating that certain pieces of data are
closely related to one another. By putting all the information
pertaining to a term (in this case, the term string itself, its
part of speech, gender, and term status) into a <termGrp>
container and all the information pertaining to the definition
into a description container (<descripGrp>), the interchange
format begins to look much more complex than a simple hardcopy
entry designed for use by the human reader, and yet in so doing,
it provides a mechanism for transporting that information into
a new environment, perhaps even into a data architecture where
these elements of information are organized in a very dissimilar
way. The introduction of what looks like complexity to the human
reader is actually a powerful tool for making data manipulation
much simpler for the computer.
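The benefit of explicit containers can be illustrated with a short sketch that reads a nested entry. The fragment is invented for this illustration and is written as well-formed XML so that a standard XML parser can read it; the element names come from the discussion above, but their exact arrangement here is an assumption rather than a reproduction of Figure 5b.

```python
import xml.etree.ElementTree as ET

# Invented, simplified entry. Each <termGrp> packages a term with the notes
# that depend on it, so that no dependency has to be guessed from adjacency.
entry_xml = """
<termEntry>
  <ntig>
    <termGrp>
      <term>Automobil</term>
      <termNote type='termType'>full form</termNote>
    </termGrp>
    <descripGrp>
      <descrip type='definition'>(definition text)</descrip>
    </descripGrp>
  </ntig>
  <ntig>
    <termGrp>
      <term>PKW</term>
      <termNote type='termType'>abbreviation</termNote>
    </termGrp>
  </ntig>
</termEntry>
"""


def terms_with_notes(xml_text):
    """Map each term string to the notes found inside the same <termGrp>."""
    root = ET.fromstring(xml_text)
    result = {}
    for term_grp in root.iter("termGrp"):
        term = term_grp.findtext("term")
        notes = {note.get("type"): note.text for note in term_grp.findall("termNote")}
        result[term] = notes
    return result


grouped = terms_with_notes(entry_xml)
```

The routine never has to decide which note belongs to which term: the container boundaries state it. In a flat model the same routine would have to rely on the order of the elements.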
The MARTIF entry departs from the flat model in yet another powerful
way. The <termSource> tags in Example 5a appear simply to
provide the information concerning the source of the terms. Even
if other information resides elsewhere in the database, there
is no indication in terms of SGML notation that "Sanchez
45" or "Hardwired96" could serve as relational
links to that information. Thus the markup again remains flat
and is confined to the entry at hand. In MARTIF, the <ptr>
and <ref> tags enable the user to program links to targeted
locations elsewhere in the body or backmatter of the document
or even to facilitate links to external references beyond the
document. This feature, coupled with the security offered by the
use of SGML embedding features, does indeed make MARTIF more complex,
but in exchange the user is liberated from the flat world of Example
5a to an environment where powerful hypertext links can express
a wide variety of relations within the entry, between entries,
and between entries and other types of data. Figure 5a constitutes
the results of a self-fulfilling prophecy that proclaims that
SGML is a flat system, incapable of fully representing terminological
data, whereas MARTIF demonstrates the full power of SGML to accommodate
knowledge on a multi-level scale. The difference between the flat
structure of 5a and MARTIF is even more pronounced when one examines
more complex examples and more sophisticated databases. This contrast
is as dramatic as the difference between a flat piece of hardcopy
printed on paper and the dynamic hypertext environment of the
World Wide Web.
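How a reference can act as a link rather than as a mere string is sketched below. The attribute names target and id, the <refObject> element, and the overall document skeleton are assumptions made for this illustration; the fragment is written as well-formed XML, and the bibliographic content is a placeholder.

```python
import xml.etree.ElementTree as ET

# Invented document skeleton: a <ref> in the entry points to an object
# stored in the back matter. Names of attributes and of <refObject> are assumed.
document = """
<martif>
  <body>
    <termEntry>
      <ntig>
        <termGrp>
          <term>Automobil</term>
          <ref type='source' target='Sanchez45'>Sanchez 45</ref>
        </termGrp>
      </ntig>
    </termEntry>
  </body>
  <back>
    <refObject id='Sanchez45'>(bibliographic details)</refObject>
  </back>
</martif>
"""


def resolve_refs(xml_text):
    """Map the text of each <ref> to the text of the element it points to."""
    root = ET.fromstring(xml_text)
    # Index every element that carries an id so that targets can be looked up.
    targets = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    resolved = {}
    for ref in root.iter("ref"):
        target = targets.get(ref.get("target"))
        resolved[ref.text] = None if target is None else target.text
    return resolved


links = resolve_refs(document)
```

A reference whose target is missing resolves to None, which gives an import routine a simple way to report dangling links instead of silently carrying them over.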