[2011-07-01 – This document is reproduced with permission of the Localization Industry Standards Association. All emendations to the original text are indicated by striking through the original text and inserting new text in red bold face in square brackets.]
[This version:
http://www.ttt.org/oscarStandards/srx/srx10.html
Latest version:
http://www.ttt.org/oscarStandards/srx/]
Editor:
David Pooley
Copyright © The Localisation Industry Standards
Association [LISA] 2004. All Rights
Reserved.
This document and translations of it may
be copied and furnished to others, and derivative works that comment on or
otherwise explain it or assist in its implementation may be prepared, copied,
published and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are included
on all such copies and derivative works. However, this document itself may not
be modified in any way, such as by removing the copyright notice or references
to LISA.
The limited permissions granted above are
perpetual and will not be revoked by LISA or its successors or assigns.
This document and the information
contained herein is provided on an "AS IS" basis and LISA DISCLAIMS
ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY
THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
This document defines the Segmentation
Rules eXchange format (SRX). The purpose of the SRX format is to provide a
standard method to describe segmentation rules that are being exchanged among
tools and/or translation vendors.
This document is the current draft for the
SRX format. Comments may be sent to segmentation_wg@lists.lisa-open.org.
1. Introduction
1.1. XML compliance
1.2. Regular expressions
2.1. Language rules
2.2. Map rules
2.3. Example document
3.1. Elements
3.2. Attributes
A. References
D. Document Type Definition for SRX
SRX is intended to enhance the TMX
standard so that translation memory (TM) data that is exchanged between
applications can be used more effectively. Having the segmentation rules that
were used when a TM was created will increase the leverage that can be achieved
when deploying the TM data. SRX does not, however, address the following
procedural issues, which would cause loss of leverage when TMX data is deployed
in another environment:
· The use of segmentation rules that are different from those that were
originally used to create the TM data
· TM data that was generated using a variety of segmentation rules
SRX is defined in two parts:
· The segmentation rules that apply to each language, represented by the
<languagerules> element.
· The mapping that determines which segmentation rules are used for each
language, represented by the <maprules> element.
An SRX document is a companion to a TMX document. It allows the application
receiving the TMX data to determine how the original text was segmented before
being inserted into the translation memory. As such, it is assumed that any
text that is being segmented with the rules defined in the SRX document is
already in TMX format and contains only the standard TMX-defined formatting
tags <bpt>, <ept>, <ph> and <it>.
In its current
implementation, SRX is focused primarily on sentence segmentation. The reason
behind this is that TM tools that currently support (or intend to support) TMX
are also primarily focused on sentence segmentation. Future versions of SRX may
address more complex segmentation such as phrases and terms.
SRX is XML-compliant. SRX files are
intended to be created automatically by export routines and processed
automatically by import routines. SRX files are "well-formed" XML
documents that can be processed without explicit reference to the SRX DTD.
However, a "valid" SRX file must conform to the SRX DTD, and any
suspicious SRX file should be verified against the SRX DTD using a validating
XML parser.
Since XML syntax is case sensitive, any XML application must define casing
conventions. All element and attribute names of SRX are defined in lowercase.
The SRX namespace is defined as "http://www.lisa.org/srx10". The following XML
sample includes the SRX namespace definition.
<?xml version="1.0"?>
<myformat>
  <data>
    <srx xmlns="http://www.lisa.org/srx10" version="1.0">
      ... SRX data ...
    </srx>
  </data>
</myformat>
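As a minimal sketch of how a receiving application might locate SRX data embedded in another format, the following Python fragment uses the standard library's ElementTree parser. The host elements <myformat> and <data> are taken from the sample above; the function name find_srx and the contents of the embedded <srx> element are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

SRX_NS = "http://www.lisa.org/srx10"  # namespace URI given in this specification

# Hypothetical host document modelled on the sample above.
DOCUMENT = """<?xml version="1.0"?>
<myformat>
  <data>
    <srx xmlns="http://www.lisa.org/srx10" version="1.0">
      <header segmentsubflows="yes"/>
      <body/>
    </srx>
  </data>
</myformat>"""


def find_srx(xml_text):
    """Return the first namespace-qualified <srx> descendant, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified names as {namespace-uri}localname.
    return root.find(f".//{{{SRX_NS}}}srx")


srx = find_srx(DOCUMENT)
header = srx.find(f"{{{SRX_NS}}}header")
print(srx.get("version"), header.get("segmentsubflows"))
```

Because the default namespace declared on <srx> is inherited by its children, <header> has to be looked up with the same qualified name.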
The segmentation rules themselves are represented
using regular expressions. This allows for maximum flexibility in the
definition of the rules. The following definitions are a subset of the current
definition for the ICU regular expressions. Applications using other engines
will need to adapt this format for use with their own parser.
Character | Description
\a | Match a BELL, \u0007.
\A | Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input.
\b, outside of a [Set] | Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters.
\b, within a [Set] | Match a BACKSPACE, \u0008.
\B | Match if the current position is not a word boundary.
\cX | Match a control-X character.
\d | Match any character with the Unicode General Category of Nd (Number, Decimal Digit).
\D | Match any character that is not a decimal digit.
\e | Match an ESCAPE, \u001B.
\E | Terminates a \Q ... \E quoted sequence.
\f | Match a FORM FEED, \u000C.
\G | Match if the current position is at the end of the previous match.
\n | Match a LINE FEED, \u000A.
\N{UNICODE CHARACTER NAME} | Match the named character.
\p{UNICODE PROPERTY NAME} | Match any character with the specified Unicode Property.
\P{UNICODE PROPERTY NAME} | Match any character not having the specified Unicode Property.
\Q | Quotes all following characters until \E.
\r | Match a CARRIAGE RETURN, \u000D.
\s | Match a white space character. White space is defined as [\t\n\f\r\p{Z}].
\S | Match a non-white space character.
\t | Match a HORIZONTAL TABULATION, \u0009.
\uhhhh | Match the character with the hex value hhhh.
\Uhhhhhhhh | Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
\w | Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
\W | Match a non-word character.
\x{hhhh} | Match the character with hex value hhhh.
\xhh | Match the character with two digit hex value hh.
\X | Match a Grapheme Cluster.
\Z | Match if the current position is at the end of input, but before the final line terminator, if one exists.
\z | Match if the current position is at the end of input.
\0nnn | Match the character with octal value nnn.
\n (n is a number) | Back Reference. Match whatever the nth capturing group matched. n must be >= 1 and <= the total number of capture groups in the pattern.
[pattern] | Match any one character from the set. See UnicodeSet for a full description of what may appear in the pattern.
. | Match any character.
^ | Match at the beginning of a line.
$ | Match at the end of a line.
\ | Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . /
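Applications whose regular expression engine is not ICU have to translate some of these escapes. The following Python sketch shows one such adaptation under stated assumptions: Python's standard re module rejects the \x{hhhh} form, so the hypothetical helper adapt_icu_escapes rewrites each occurrence as the literal character. It does not handle a backslash that is itself escaped, nor any other ICU-only construct such as \p{...}.

```python
import re


def adapt_icu_escapes(pattern):
    """Replace ICU-style \\x{hhhh} escapes with the escaped literal character."""
    def to_literal(match):
        # Convert the hex digits to a character and escape it for safe reuse.
        return re.escape(chr(int(match.group(1), 16)))

    return re.sub(r"\\x\{([0-9A-Fa-f]{1,6})\}", to_literal, pattern)


# U+3002 is the ideographic full stop, U+FF01 the fullwidth exclamation mark.
icu_pattern = r"[\x{3002}\x{ff01}]+"
python_pattern = adapt_icu_escapes(icu_pattern)
print(re.fullmatch(python_pattern, "\u3002\uff01") is not None)
```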
The following operators can be used.
Operator | Description
|      | Alternation. A|B matches either A or B.
*      | Match 0 or more times. Match as many times as possible.
+      | Match 1 or more times. Match as many times as possible.
?      | Match zero or one times. Prefer one.
{n}    | Match exactly n times.
{n,}   | Match at least n times. Match as many times as possible.
{n,m}  | Match between n and m times. Match as many times as possible, but not more than m.
*?     | Match 0 or more times. Match as few times as possible.
+?     | Match 1 or more times. Match as few times as possible.
??     | Match zero or one times. Prefer zero.
{n}?   | Match exactly n times.
{n,}?  | Match at least n times, but no more than required for an overall pattern match.
{n,m}? | Match between n and m times. Match as few times as possible, but not less than n.
*+     | Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match).
++     | Match 1 or more times. Possessive Match.
?+     | Match zero or one times. Possessive Match.
{n}+   | Match exactly n times.
{n,}+  | Match at least n times. Possessive Match.
{n,m}+ | Match between n and m times. Possessive Match.
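The difference between the "as many times as possible" and "as few times as possible" operators can be seen in a short experiment. The sketch below uses Python's re module rather than ICU, which is an assumption of this example; the two engines agree on these two operator families. Possessive operators are only accepted by Python from version 3.11 onwards, so they are not exercised here. The sample text is invented for illustration.

```python
import re

TEXT = "See sect. 1. and sect. 2. now"

# Greedy: .* takes as much as it can, then backs off to the last full stop.
greedy = re.search(r"sect\..*\.", TEXT).group()

# Reluctant: .*? takes as little as it can, stopping at the first full stop.
reluctant = re.search(r"sect\..*?\.", TEXT).group()

print(greedy)
print(reluctant)
```

In a <beforebreak> expression the choice matters mostly for readability, but in engines that report match positions it decides which full stop the match ends on.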
An SRX document is enclosed in an <srx> root element. The <srx> element
contains two elements: <header> and <body>. The <header> element contains zero
or more <formathandle> elements. The <body> element contains two elements:
<languagerules> and <maprules>.
The <languagerules> element contains information about the segmentation rules
for each particular language. It is a collection of <languagerule> elements.
Each one of these contains a collection of <rule> elements.
Each <rule> element contains zero or one <beforebreak> element and zero or one
<afterbreak> element which provide details of the regular expressions for the
rules themselves. The break attribute indicates whether this is a segment break
or an exception rule. The rules are applied in the order that they are
specified within the <languagerule> element.
Note that this approach is an adaptation of the method
described in Unicode Technical
Report 29 which covers text boundaries. Readers are encouraged to study
this report with particular attention being given to the "sentence
boundaries" section.
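One possible reading of "applied in the order that they are specified" is sketched below in Python. It is an interpretation, not the normative algorithm: at every position in the text the rules are tried in order, and the first rule whose <beforebreak> expression matches the text ending at that position and whose <afterbreak> expression matches the text starting there decides whether a break occurs. The rule list, the function names and the sample sentence are hypothetical, and Python's re module stands in for ICU.

```python
import re

# Hypothetical rules in document order: (is_break, beforebreak, afterbreak).
RULES = [
    (False, r"\sMr\.", r"\s"),    # exception: no break after " Mr."
    (True, r"[\.\?!]+", r"\s"),   # break after sentence-final punctuation
]


def find_breaks(text, rules):
    """Return the character offsets at which the text is split."""
    compiled = [
        # \Z anchors beforebreak to the end of the text seen so far.
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    breaks = []
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    breaks.append(pos)
                break  # the first matching rule decides; later rules are skipped
    return breaks


def segment(text, rules):
    """Cut the text at every break offset, keeping all characters."""
    offsets = [0] + find_breaks(text, rules) + [len(text)]
    return [text[a:b] for a, b in zip(offsets, offsets[1:])]


print(segment("I saw Mr. Smith. He waved.", RULES))
```

The sketch keeps the white space that follows a break at the start of the next segment; how such white space is assigned is left to the implementation. Scanning every position this way is quadratic in the text length, which is acceptable for an illustration but not for production use.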
The <maprules> element contains information as to how each language should be
segmented. It is a collection of <maprule> elements. Each one of these contains
a collection of <languagemap> elements that describe which rules should be used
for each language.
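A lookup over the language maps could be sketched as follows. The sketch makes three assumptions that the text does not state outright: the maps are tried in document order and the first match wins, the pattern has to match the whole language code, and matching ignores case (language tags as defined in RFC 3066 are not case sensitive). The function name rule_name_for is hypothetical; the two patterns are those of the example document.

```python
import re

# (languagepattern, languagerulename) pairs in document order.
LANGUAGE_MAPS = [
    ("JA.*", "Japanese"),
    (".*", "Default"),
]


def rule_name_for(language_code, language_maps):
    """Return the languagerulename of the first map matching the code, or None."""
    for pattern, rule_name in language_maps:
        if re.fullmatch(pattern, language_code, re.IGNORECASE):
            return rule_name
    return None


print(rule_name_for("ja-JP", LANGUAGE_MAPS))
print(rule_name_for("en-GB", LANGUAGE_MAPS))
```

Placing the catch-all pattern ".*" last gives every other language a fallback without hiding the more specific entries.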
See the sample document section for an example of an SRX document.
This section lists the various elements used in the SRX document.
<afterbreak>, <beforebreak>, <body>, <formathandle>, <header>, <languagemap>,
<languagerule>, <languagerules>, <maprule>, <maprules>, <rule>, <srx>.
After break - The <afterbreak> element encloses a regular expression.
Required attributes: None.
Optional attributes: None.
Contents: A regular expression which represents the text that appears after a
segment break.
Before break - The <beforebreak> element encloses a regular expression.
Required attributes: None.
Optional attributes: None.
Contents: A regular expression which represents the text that appears before a
segment break.
Body - The <body> element encloses the language rules and language maps that
are contained within the file.
Required attributes: None.
Optional attributes: None.
Contents: Zero or one <languagerules> element and zero or one <maprules>
element.
Format handling - The <formathandle> element determines how formatting that
falls on a segment boundary should be handled. The type attribute determines
the type of formatting and the include attribute indicates how this formatting
should be handled. As these elements are optional in the <header> element, the
following defaults will apply:
<formathandle type="start" include="no"/>
<formathandle type="end" include="yes"/>
<formathandle type="isolated" include="no"/>
Required attributes: type, include.
Optional attributes: None.
Contents: None.
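The defaults for missing <formathandle> elements can be applied when the header is read. The following Python sketch illustrates one way to do so; the function name format_handling and the sample header are assumptions, and the header is parsed without a namespace, as in the example document of this specification.

```python
import xml.etree.ElementTree as ET

# Defaults stated above for <formathandle> elements missing from <header>.
DEFAULT_INCLUDE = {"start": "no", "end": "yes", "isolated": "no"}


def format_handling(header_xml):
    """Return the include setting per formatting type, defaults filled in."""
    header = ET.fromstring(header_xml)
    settings = dict(DEFAULT_INCLUDE)
    for handle in header.findall("formathandle"):
        # An explicit element overrides the default for its type.
        settings[handle.get("type")] = handle.get("include")
    return settings


HEADER = (
    '<header segmentsubflows="yes">'
    '<formathandle type="isolated" include="yes"/>'
    "</header>"
)
print(format_handling(HEADER))
```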
Header - The <header> element contains information that is relevant to the
whole document.
Required attributes: segmentsubflows.
Optional attributes: None.
Contents: Zero, one, two or three <formathandle> elements.
Language map - The <languagemap> element maps one or more languages to a
language rule.
Required attributes: languagepattern, languagerulename.
Optional attributes: None.
Contents: None.
Language rule - The <languagerule> element encloses one instance of language
rule data, a set of <rule> elements.
Required attributes: languagerulename.
Optional attributes: None.
Contents: One or more <rule> elements.
Language rules - The <languagerules> element encloses the language rules data,
the set of <languagerule> elements.
Required attributes: None.
Optional attributes: None.
Contents: One or more <languagerule> elements.
Map rule - The <maprule> element encloses one instance of map rule data, a set
of <languagemap> elements.
Required attributes: maprulename.
Optional attributes: None.
Contents: One or more <languagemap> elements.
Map rules - The <maprules> element encloses the map rules data, the set of
<maprule> elements. The order of the <maprule> elements determines the logical
order in which these rules should be applied.
Required attributes: None.
Optional attributes: None.
Contents: One or more <maprule> elements.
Break or exception rule - The <rule> element defines a segmentation rule for a
language using the <beforebreak> and <afterbreak> elements. The break attribute
specifies whether the rule defines a break or an exception. If the break
attribute is missing, the rule is assumed to be a break rule.
Required attributes: None.
Optional attributes: break.
Contents: Zero or one <beforebreak> element and zero or one <afterbreak>
element.
Root element - The <srx> element is the root element of the document. It
encloses the header and body information for the file.
Required attributes: version.
Optional attributes: None.
Contents: One <header> element and one <body> element.
This section lists the various attributes used in the SRX elements.
break, include, languagepattern, languagerulename, maprulename,
segmentsubflows, type, version.
Break indicator - Specifies whether a rule is a break or an exception.
Value description: A value of "no" indicates that the rule is an exception
rule. A value of "yes" indicates that the rule is a break rule.
Default value: "yes"
Used in: <rule>.
Formatting code behaviour - The include attribute indicates whether formatting
is included in the segment being created.
Value description: A value of "no" indicates that the format code does not
belong to the segment being created. A value of "yes" indicates that the format
code belongs to the segment being created.
Default value: "no"
Used in: <formathandle>.
Language pattern - Identifies a language pattern.
Value description: Specifies a regular expression for the language codes that
map to the given language rule. Language codes are defined as in [RFC 3066].
Default value: Undefined.
Used in: <languagemap>.
Language rule name - Specifies a unique name for a language rule.
Value description: Used to link a language rule between the <languagerule> and
<languagemap> elements.
Default value: Undefined.
Used in: <languagerule>, <languagemap>.
Map rule name - Specifies a unique name for a mapping rule.
Value description: Used to uniquely identify a mapping rule.
Default value: Undefined.
Used in: <maprule>.
Subflow segmentation behaviour - The segmentsubflows attribute indicates how
subflows should be segmented.
Value description: A value of "no" indicates that subflows within a segment
should not be segmented. A value of "yes" indicates that subflows should be
segmented according to the rules. A subflow is defined as being a piece of text
that appears within another segment but which should be handled separately. For
example, in the following HTML snippet:
<p>Click <img src="..\button.gif" alt="Toolbar button. Click to preview."/> to preview the document.</p>
The text "Toolbar button. Click to preview." is a subflow. The segmentsubflows
attribute determines whether this text should be segmented according to the
rules.
Default value: "yes"
Used in: <header>.
Formatting code type - The type attribute indicates the type of formatting for
which the <formathandle> is being applied.
Value description: This attribute can have one of three values. These are:
· "start" to indicate the start of a pair of formatting codes
· "end" to indicate the end of a pair of formatting codes
· "isolated" to indicate a format that has no partner
Default value: Undefined.
Used in: <formathandle>.
SRX version - The version attribute indicates the version of the SRX format to
which the document conforms.
Value description: Fixed text: the major version number, a period, and the
minor version number. For example: version="1.0".
Default value: "1.0"
Used in: <srx>.
[Unicode Character Database 4.0.0] Unicode Character Database 4.0.0. Unicode
Consortium, Apr 2003.
[ISO 639] Codes for the Representation of Names of Languages. ISO
(International Organization for Standardization), Nov 2001.
[ISO 3166] Codes for the representation of names of countries and their
subdivisions. ISO (International Organization for Standardization), Jun 2000.
[RFC 3066] Tags for the Identification of Languages. IETF (Internet Engineering
Task Force), Jan 2001.
[XML 1.0] Extensible Markup Language (XML) 1.0 Second Edition. W3C (World Wide
Web Consortium), Oct 2000.
[ICU Regular Expressions] ICU Regular Expressions User Guide. IBM, 2003.
[UAX #29] UAX #29: Text Boundaries. Unicode Consortium, 2003.
In this example of an SRX document indentations are
added for ease of reading, and the different types of notation are mixed to
illustrate the various possibilities.
<?xml version="1.0"?>
<!DOCTYPE srx PUBLIC "-//SRX//DTD SRX//EN" "srx.dtd">
<srx version="1.0">
  <header segmentsubflows="yes">
    <formathandle type="start" include="no"/>
    <formathandle type="end" include="yes"/>
    <formathandle type="isolated" include="yes"/>
  </header>
  <body>
    <languagerules>
      <languagerule languagerulename="Default">
        <rule break="no">
          <beforebreak>^\s*[0-9]+\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="no">
          <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
          <afterbreak>\s[a-z]</afterbreak>
        </rule>
        <rule break="no">
          <beforebreak>\sMr\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[\.\?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak></beforebreak>
          <afterbreak>\n</afterbreak>
        </rule>
      </languagerule>
      <languagerule languagerulename="Japanese">
        <rule break="no">
          <beforebreak>^\s*[0-9]+\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="no">
          <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
          <afterbreak></afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[\.\?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[\uff61\u3002\uff0e\uff1f\uff01]+</beforebreak>
          <afterbreak></afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak></beforebreak>
          <afterbreak>\n</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <maprule maprulename="Default">
        <languagemap languagepattern="JA.*" languagerulename="Japanese"/>
        <languagemap languagepattern=".*" languagerulename="Default"/>
      </maprule>
    </maprules>
  </body>
</srx>
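An import routine has to turn such a document into rule lists it can apply. The sketch below, using Python's ElementTree, shows one way to do that; it is an illustration rather than a reference implementation. SRX_TEXT is a shortened, hypothetical document in the same style as the example above, load_rules is an assumed function name, and a missing break attribute is read as "yes" in line with the stated default.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical SRX document (no namespace, as in the example above).
SRX_TEXT = r"""<srx version="1.0">
  <header segmentsubflows="yes"/>
  <body>
    <languagerules>
      <languagerule languagerulename="Default">
        <rule break="no">
          <beforebreak>\sMr\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule>
          <beforebreak>[\.\?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak></beforebreak>
          <afterbreak>\n</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <maprule maprulename="Default">
        <languagemap languagepattern=".*" languagerulename="Default"/>
      </maprule>
    </maprules>
  </body>
</srx>"""


def load_rules(srx_text):
    """Map each languagerulename to its (is_break, before, after) tuples."""
    root = ET.fromstring(srx_text)
    rules = {}
    for language_rule in root.iter("languagerule"):
        entries = []
        for rule in language_rule.findall("rule"):
            entries.append((
                rule.get("break", "yes") == "yes",    # missing attribute: break rule
                rule.findtext("beforebreak") or "",   # absent or empty: empty pattern
                rule.findtext("afterbreak") or "",
            ))
        rules[language_rule.get("languagerulename")] = entries
    return rules


for entry in load_rules(SRX_TEXT)["Default"]:
    print(entry)
```

Keeping the tuples in a list preserves the document order of the <rule> elements, which matters because the rules are applied in that order.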
This section provides some examples of
how segmentation rules might be applied to fragments of text. These are simple
examples and are by no means a complete reference to segmentation.
Text to segment | Result | Notes
The U.K. Prime Minister, Mr. Blair, was seen out with his family today. | (1) The U.K. (2) Prime Minister, Mr. (3) Blair, was seen out with his family today. | The simple rule of a full stop followed by a space, showing its limitations
The U.K. Prime Minister, Mr. Blair, was seen out with his family today. | (1) The U.K. Prime Minister, Mr. (2) Blair, was seen out with his family today. | Partially corrected with an exception for "U.K."
The U.K. Prime Minister, Mr. Blair, was seen out with his family today. | (1) The U.K. Prime Minister, Mr. Blair, was seen out with his family today. | Sufficient exceptions to prevent segmentation on "U.K." and "Mr."
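The three cases can be reproduced with a small Python sketch. It rests on the same assumptions as any first-match reading of the rules: rules are tried in order at each position, the first match decides, and Python's re module stands in for ICU. The rule tuples and the segment function are hypothetical, and white space around each segment is trimmed for display only.

```python
import re

TEXT = "The U.K. Prime Minister, Mr. Blair, was seen out with his family today."

# Hypothetical rules: (is_break, beforebreak, afterbreak).
BREAK_RULE = (True, r"[\.\?!]+", r"\s")
UK_EXCEPTION = (False, r"U\.K\.", r"\s")
MR_EXCEPTION = (False, r"Mr\.", r"\s")


def segment(text, rules):
    """Split text at break positions; the first matching rule decides."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            ends_here = re.search("(?:" + before + r")\Z", text[:pos])
            if ends_here and re.match(after, text[pos:]):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # an exception rule listed first suppresses the break
    segments.append(text[start:].strip())
    return segments


print(segment(TEXT, [BREAK_RULE]))
print(segment(TEXT, [UK_EXCEPTION, BREAK_RULE]))
print(segment(TEXT, [UK_EXCEPTION, MR_EXCEPTION, BREAK_RULE]))
```

The exception rules only take effect because they appear before the break rule; with the order reversed, the break rule would match first at every full stop.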
<!-- SRX
Public Identifier: "-//SRX//DTD SRX//EN"
History of modifications (latest first):
Apr-21-2004 by DRP: Convert to version 1.0.
Mar-22-2004 by DRP: Eighth draft version.
Ensure the <excludeexception> element is removed
Update version number
Mar-17-2004 by DRP: Seventh draft version.
Remove <exceptions>, <exception>, <endrules>, <endrule> and <excludeexception> elements
Add <rule> element
Update version number
Feb-02-2004 by DRP: Sixth draft version.
Update version number
Oct-27-2003 by DRP: Fifth draft version.
Removed includeformatting attribute from <header> element
Added <formathandle> element to the <header>
Removed priority attribute from <endrule> and <exception> elements
Added name attribute to <exception> element
Added <excludeexception> element to the <endrule> element
Oct-10-2003 by DRP: Fourth draft version.
Removed <classdefinitions> and <classdefinition> elements
Removed classdefinitionname attribute
Removed <digitcharacters>, <whitespacecharacters> and <wordcharacters>
Added priority attribute to <endrule> and <exception> elements
Added includeformatting attribute to <header> element
Jul-24-2003 by DRP: Third draft version.
Removed <charsets> and <charset> to be replaced with <classdefinitions> and <classdefinition>
Renamed <digits> to <digitcharacters>
Renamed <whitespace> to <whitespacecharacters>
Renamed <wordchars> to <wordcharacters>
<digitcharacters>, <whitespacecharacters> and <wordcharacters> are now optional
Renamed <langrules> to <languagerules>
Renamed <langrule> to <languagerule>
Renamed <langmap> to <languagemap>
Renamed langrulename to languagerulename
Renamed langpattern to languagepattern
Jun-19-2003 by DRP: Second draft version.
Removed the <codepage> element.
Added <header> and <body> elements.
Nov-22-2002 by DRP: First draft version
-->
<!ELEMENT srx (header, body) >
<!ATTLIST srx
version CDATA #FIXED "1.0"
>
<!ELEMENT header (formathandle*) >
<!ATTLIST header
segmentsubflows CDATA #REQUIRED
>
<!ELEMENT formathandle EMPTY >
<!ATTLIST formathandle
type CDATA #REQUIRED
include CDATA #REQUIRED
>
<!ELEMENT body (languagerules?, maprules?) >
<!ELEMENT languagerules (languagerule+) >
<!ELEMENT languagerule (rule+) >
<!ATTLIST languagerule
languagerulename CDATA #REQUIRED
>
<!ELEMENT rule (beforebreak?, afterbreak?) >
<!ATTLIST rule
break CDATA #IMPLIED
>
<!ELEMENT beforebreak (#PCDATA) >
<!ELEMENT afterbreak (#PCDATA) >
<!ELEMENT maprules (maprule+) >
<!ELEMENT maprule (languagemap+) >
<!ATTLIST maprule
maprulename CDATA #REQUIRED
>
<!ELEMENT languagemap EMPTY >
<!ATTLIST languagemap
languagepattern CDATA #REQUIRED
languagerulename CDATA #REQUIRED
>
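Where no validating XML parser is at hand, a lightweight check of the #REQUIRED attributes declared above can catch the most common export errors. The Python sketch below is not a substitute for DTD validation: it ignores content models and the fixed version value. The table REQUIRED is transcribed from the ATTLIST declarations above, while the function name missing_attributes and the sample input are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Attributes declared #REQUIRED in the DTD above, per element.
REQUIRED = {
    "header": ["segmentsubflows"],
    "formathandle": ["type", "include"],
    "languagerule": ["languagerulename"],
    "maprule": ["maprulename"],
    "languagemap": ["languagepattern", "languagerulename"],
}


def missing_attributes(srx_text):
    """Return (element, attribute) pairs for each missing required attribute."""
    problems = []
    for element in ET.fromstring(srx_text).iter():
        for name in REQUIRED.get(element.tag, []):
            if name not in element.attrib:
                problems.append((element.tag, name))
    return problems


print(missing_attributes('<srx version="1.0"><header/><body/></srx>'))
```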
<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema">
<element name="srx">
<complexType>
<sequence>
<element ref="header" minOccurs="1" maxOccurs="1" />
<element ref="body" minOccurs="1" maxOccurs="1" />
</sequence>
<attribute name="version" type="string" use="required" fixed="1.0" />
</complexType>
</element>
<element name="header">
<complexType>
<sequence>
<element ref="formathandle" minOccurs="0" maxOccurs="3" />
</sequence>
<attribute name="segmentsubflows" type="string" use="required" />
</complexType>
</element>
<element name="formathandle">
<complexType>
<attribute name="type" type="string" use="required" />
<attribute name="include" type="string" use="required" />
</complexType>
</element>
<element name="body">
<complexType>
<sequence>
<element ref="languagerules" minOccurs="0" maxOccurs="1" />
<element ref="maprules" minOccurs="0" maxOccurs="1" />
</sequence>
</complexType>
</element>
<element name="languagerules">
<complexType>
<sequence>
<element ref="languagerule" minOccurs="1" maxOccurs="unbounded" />
</sequence>
</complexType>
</element>
<element name="languagerule">
<complexType>
<sequence>
<element ref="rule" minOccurs="1" maxOccurs="unbounded" />
</sequence>
<attribute name="languagerulename" type="string" use="required" />
</complexType>
</element>
<element name="rule">
<complexType>
<sequence>
<element ref="beforebreak" minOccurs="0" maxOccurs="1" />
<element ref="afterbreak" minOccurs="0" maxOccurs="1" />
</sequence>
<attribute name="break" type="string" use="optional" />
</complexType>
</element>
<element name="beforebreak">
<complexType mixed="true" />
</element>
<element name="afterbreak">
<complexType mixed="true" />
</element>
<element name="maprules">
<complexType>
<sequence>
<element ref="maprule" minOccurs="1" maxOccurs="unbounded" />
</sequence>
</complexType>
</element>
<element name="maprule">
<complexType>
<sequence>
<element ref="languagemap" minOccurs="1" maxOccurs="unbounded" />
</sequence>
<attribute name="maprulename" type="string" use="required" />
</complexType>
</element>
<element name="languagemap">
<complexType>
<attribute name="languagepattern" type="string" use="required" />
<attribute name="languagerulename" type="string" use="required" />
</complexType>
</element>
</schema>