
A novel 3-tier XML schematic approach for web page translation

Ubiquity, Volume 2005 Issue November | BY Goutam Kumar Saha 


The proposed 3-tier XML schema approach demonstrates how to embed syntactic, semantic and computational-linguistic metadata in the structure of an XML document, and how the various markups help the internationalization and localization processes achieve faster and more meaningful machine translation of web content from one human language to another. In this approach, an XML content author embeds source-language-specific metadata in an XML document. The markups used in this approach are intended to assist machine translation of web content and to support better internationalization and localization of web pages.


Introduction

 

An XML schema is a model document that defines the structure of an XML document, and XML schemas can themselves be written in XML. In human languages, a word often has several meanings (word-sense ambiguity) depending on the context, that is, the content domain of a paragraph of a web page. Similarly, a word may belong to several parts of speech (POS ambiguity). For example, the word "light" can be a verb, an adjective or a noun. Metadata about a sentence likewise helps parsing during the machine translation of web content.

The proposed 3-tier XML schema approach uses three schemas for web content. The first schema covers the content domain, the second covers sentence-level metadata and the third covers word-level metadata. An XML document is validated against the three schemas to check that it is well formed and conforms to them. The approach aims to mark up both syntactic and semantic metadata in the structure of an XML document, so that a translator receives the information needed for a more meaningful translation. Such embedded information is important to both the internationalization and localization processes. The scheme is also useful for translation memory processes, which can retain context markups when internationalization and localization developers use it for both source and target text.

The approach consists of three basic steps. We develop a first XML schema that contains the content-domain categories, a second XML schema that contains the sentence categories and a third XML schema that contains the parts-of-speech categories for words.

 

Using Three-Level Markups

A web content author does not need to mark up every part of a document. An author should use such markups only at the most language-specific parts, so that the content is not overloaded with extra markup. XML schema authors should prefer attributes for adding metadata because of their better flexibility and portability. The proposed scheme uses three XML elements, namely content domain, sentence category and POS category. The schematic block diagram of the proposed 3-tier (3-layer) XML schema approach is shown in Figure 1.

 

 

 

 

 

 


 

Content domains include contexts such as information technology, medicine, travel, personal, sports, mathematics and romance. Sentence categories include simple, compound, complex, proverbial, taunt, suspicion, active and passive voice, and direct and indirect speech. Parts-of-speech categories include noun, pronoun, verb, adjective, adverb, preposition, postposition, interjection, conjunction and indeclinable. A content author with school-level grammatical knowledge should have no difficulty using such markups, because the scheme does not restrict which appropriate markup may be added as an attribute. A content author need not apply the three-level markups to every word and sentence of a document; they are needed only at the sensitive, difficult or ambiguous parts. For some languages, a content author may not even need to add finer sub-category markups.
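The three category vocabularies can be pictured as plain sets against which an author's markup is checked. The following Python sketch is only an illustration under assumptions: the element names content_domain, sentence_cat and pos_cat come from the examples in this article, while the helper unknown_categories and the abbreviated category sets are hypothetical and not part of the proposed schemas.

```python
import xml.etree.ElementTree as ET

# Abbreviated category sets drawn from names used in this article
# (not exhaustive; a real deployment would take them from the schemas).
CATEGORIES = {
    "content_domain": {"it", "medicine", "travel", "personal", "sports",
                       "mathematics", "romance"},
    "sentence_cat": {"simple", "compound", "complex", "proverbial",
                     "imperative", "assertive"},
    "pos_cat": {"noun", "pronoun", "verb", "adjective", "adverb",
                "preposition", "postposition", "interjection",
                "conjunction", "indeclinable"},
}

def unknown_categories(xml_text):
    """Return (tag, name) pairs whose name attribute is not a known category."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter():
        allowed = CATEGORIES.get(elem.tag)
        if allowed is not None and elem.get("name") not in allowed:
            problems.append((elem.tag, elem.get("name")))
    return problems

sample = (
    '<content_domain name="travel">'
    '<sentence_cat name="imperative">'
    'Give <pos_cat name="noun">oil</pos_cat>.'
    '</sentence_cat>'
    '<pos_cat name="nuon">light</pos_cat>'  # deliberate misspelling
    '</content_domain>'
)
print(unknown_categories(sample))
```

Only the marked-up parts are checked, so sparsely annotated content passes through untouched, which matches the advice to mark up only the ambiguous parts of a document.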

Metadata about the domain, sentence type or specific words helps translators to do better-quality work or to work more quickly. If translators know that a word belongs to a specific domain, they can look the word up in a terminology database; thus, the 3-tier schema is helpful even for human translators. Accurate translation is difficult without such information. For example:

 

<!-- Markup for Word/Phrase Sense Disambiguation or for Context Dependent Usage -->

<content_domain name="factory">

   Rabin works in a factory.  There are many electromechanical machines in this factory.

 

...sentence_i

<sentence_cat name="imperative">

   Give  <pos_cat name="noun" type="material"> oil </pos_cat> .

</sentence_cat>

...sentence_k

</content_domain>

….

<content_domain name="office">

   Rabin works in a government office.  He advises a new employee.

  <pos_cat name="verb" type="joining" meaning="to please"> Give oil </pos_cat>

    to your senior.

<!--such information via attributes aims to provide semantic as well as language

specific transformation constructs to a translator-->

...sentence_m

</content_domain>
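How a translation front end might read this markup can be sketched in Python. This is a minimal illustration, not the author's implementation: the doc wrapper element and the function senses_by_domain are assumptions added so that a condensed, well-formed version of the two fragments can be parsed with the standard library.

```python
import xml.etree.ElementTree as ET

def senses_by_domain(xml_text):
    """Return (domain, marked text, type, meaning) for every pos_cat element."""
    root = ET.fromstring(xml_text)
    results = []
    for domain in root.iter("content_domain"):
        for pos in domain.iter("pos_cat"):
            results.append((
                domain.get("name"),
                (pos.text or "").strip(),
                pos.get("type"),
                pos.get("meaning"),  # None when the author gave no meaning
            ))
    return results

# Condensed from the two examples; <doc> is a hypothetical wrapper that
# gives the fragment a single root element.
doc = """
<doc>
  <content_domain name="factory">
    <sentence_cat name="imperative">
      Give <pos_cat name="noun" type="material">oil</pos_cat>.
    </sentence_cat>
  </content_domain>
  <content_domain name="office">
    <pos_cat name="verb" type="joining" meaning="to please">Give oil</pos_cat>
    to your senior.
  </content_domain>
</doc>
"""
for row in senses_by_domain(doc):
    print(row)
```

The same surface words "Give oil" arrive at the translator with different domain and meaning metadata, which is the disambiguation the markup is meant to provide.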

 

Another example is stated below.

 

<content_domain name="literature" type="drama">

    ....

   <sentence_cat name="semantic" type="demonstrative">

       He  <pos_cat name="verb" meaning="to act"> played </pos_cat> in Odyssey.

      <!-- here, "played"  implies the verb "acted" -->

   </sentence_cat>

      .........

</content_domain>

 

We also use markup to indicate phrases and idioms in various human languages. For example, the English idiom "cats and dogs" can be marked up in the following way.

 

<!-- Markup for Phrases and Idioms -->

  <sentence_cat name="phrases_idioms" meaning="heavily">

     cats and dogs

</sentence_cat>

 

Such metadata is of great help to a localization process, which can find an appropriate phrase or idiom in the target language without knowing the source language (here, English) well. Similarly, for the Bangla idiom "Dumurer Fool" (literally "fig's flower"), we use the following markup.

 

<!-- Markup for phrases and idioms -->

<sentence_cat name="phrases_idioms" meaning="rarely visible">

           Dumurer Fool

</sentence_cat>
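A localization step could use the meaning attribute as a language-neutral key. The Python sketch below assumes a hypothetical table IDIOMS_BY_MEANING and a hypothetical helper localize_idiom; the table entry is illustrative only and is not taken from any real terminology database.

```python
import xml.etree.ElementTree as ET

# Hypothetical target-language idiom table keyed by the "meaning" attribute.
IDIOMS_BY_MEANING = {
    "en": {"rarely visible": "once in a blue moon"},  # illustrative entry
}

def localize_idiom(fragment, target_lang):
    """Pick a target-language idiom by meaning, else fall back to the meaning."""
    elem = ET.fromstring(fragment)
    if elem.tag != "sentence_cat" or elem.get("name") != "phrases_idioms":
        return (elem.text or "").strip()  # not marked as an idiom: unchanged
    meaning = elem.get("meaning")
    table = IDIOMS_BY_MEANING.get(target_lang, {})
    return table.get(meaning, meaning)

bangla = ('<sentence_cat name="phrases_idioms" meaning="rarely visible">'
          'Dumurer Fool</sentence_cat>')
english = ('<sentence_cat name="phrases_idioms" meaning="heavily">'
           'cats and dogs</sentence_cat>')
print(localize_idiom(bangla, "en"))
print(localize_idiom(english, "bn"))
```

When no idiom is registered for the target language, the plain meaning is returned, so the translation loses the figure of speech but keeps the sense.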

 

Another example shows how the proposed approach helps to resolve a word-sense ambiguity in the Bengali religious prayer "Hari, Din To Gelo, Sandhya Holo, Paar Karo Aamare." Although the common meaning of "Paar Karo" is "to cross", here it means "give me death".


<content_domain name="religious">

 <!-- Hari (God's name), Din (day), To (an indeclinable), Gelo (passed), Sandhya (evening), Holo (becomes), Paar Karo (to cross) and Aamare (me) -->

 <sentence_cat name="compound">

   Hari, Din To Gelo, Sandhya Holo,

   <pos_cat name="verb" type="joining" meaning="to give death">Paar Karo </pos_cat>

         Aamare.

 </sentence_cat>

</content_domain>

  

This article shows how to embed syntactic, semantic and computational-linguistic metadata in the structure of an XML document for better translation.

XML content authors with school-level grammar knowledge should have no difficulty marking up such language-specific information. It is not mandatory for an author to add finely classified metadata everywhere; metadata is needed only at those parts of the content that are exceptional with respect to the source language. Such metadata is useful as semantic markup for a localization process, irrespective of the target language.

For a link inside XML text content (PCDATA), we can distinguish the link from the surrounding text with the following markup so that it is treated separately, for example, "Click Here for Sign Up":

 

<!-- Markup for a Link -->

<sentence_cat name="link">Click Here for Sign Up</sentence_cat>

 

Or, for the link word alone (here, "Here"), we might mark it up in the following way:

<!-- Markup for a Link Word -->

<pos_cat name="link">Here </pos_cat>

 

For the following sentence in a Bengali (Bangla) dialect, "Kaam Saira Falo" ("Kaam" is "Kaaj" in standard Bangla, "work" in English; "Saira Falo" is "Shesh Koro" in standard Bangla, "complete" in English), we mark up the text with the three-layer metadata in the following way:

 

<!-- Markup for Dialect -->

<text xml:lang="bn">

<content_domain name="dialect">

<!-- content domain metadata -->

.... other sentences

 <sentence_cat name="imperative">

 <!-- sentence level metadata is optional here -->

     <pos_cat name="noun" meaning="work"> Kaam </pos_cat>

     <pos_cat name="verb" meaning="to complete"> Saira Falo </pos_cat>

 <!-- word level parts-of-speech -->

 </sentence_cat>

......

</content_domain>

</text>

 

An anaphoric "to" (that is, "to" without a verb) occurs in sentences such as "Yes, I would like to." The omitted verb after "to" (here, "go") must be recovered from the previous context by means of discourse analysis. Such information is very important to the translator module; otherwise, the sentence may be flagged as incorrect during parsing.

 

<!-- Markup for Anaphoric -->

<!-- Markup for "Yes, I Would like to. " -->

<!-- sentence level metadata is optional here -->

Would you like to go with him?

<sentence_cat name="assertive">

  Yes, I would like <pos_cat name="verb" type="anaphoric"> to </pos_cat>

</sentence_cat>

 

Another example shows how the proposed 3-tier schema helps to resolve both word-sense and POS ambiguities. The sentence "Light the light light." can be marked up with word-level parts-of-speech metadata in the following way, without using finer parts-of-speech categories (depending on the requirements of the translation parser for a specific language pair).

 

<!-- Markup for word-level sense and POS Disambiguation -->

<pos_cat name="verb">Light </pos_cat> the

<pos_cat name="adjective"> light </pos_cat>

<pos_cat name="noun"> light </pos_cat> .
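A parser can recover the tagged and untagged tokens of such a sentence in document order. The Python sketch below is a minimal illustration; the s wrapper element and the function tagged_tokens are assumptions introduced here, and only pos_cat children one level deep are handled.

```python
import xml.etree.ElementTree as ET

def tagged_tokens(xml_text):
    """Return (token, pos) pairs in document order; pos is None if unmarked."""
    root = ET.fromstring(xml_text)
    tokens = [(w, None) for w in (root.text or "").split()]
    for child in root:
        pos = child.get("name") if child.tag == "pos_cat" else None
        tokens += [(w, pos) for w in (child.text or "").split()]
        # Text following the closing tag belongs to the parent: unmarked.
        tokens += [(w, None) for w in (child.tail or "").split()]
    return tokens

# <s> is a hypothetical wrapper giving the fragment a single root element.
sentence = (
    '<s><pos_cat name="verb">Light </pos_cat> the '
    '<pos_cat name="adjective"> light </pos_cat> '
    '<pos_cat name="noun"> light </pos_cat> .</s>'
)
print(tagged_tokens(sentence))
```

The three occurrences of "light" reach the translation parser with three different POS labels, while "the" and the full stop remain unmarked.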

 

These markups also let a content author indicate that a piece of text between "<" and ">" is character data (PCDATA) rather than an element tag. Consider the text "Readers may refer to work in <GKSaha2005> for more information." Although <GKSaha2005> looks like an element tag, it is not intended as one; it is only a reference for readers. How can this be conveyed to an XML parser? One solution is to mark up the text in the following way, writing the angle brackets as the entity references &lt; and &gt; so that the document stays well formed, to denote that <GKSaha2005> is not an element tag.

 

<!-- Markup to Disambiguate between an element-tag and a text/PCDATA in between "<" and ">" -->

  Readers may refer to work in

 <pos_cat name="punctuation" type="left_parenthesis"> &lt; </pos_cat> GKSaha2005

 <pos_cat name="punctuation" type="right_parenthesis"> &gt; </pos_cat>

   for more information.
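Why a literal angle bracket in character data is a problem for a parser can be checked directly. The Python sketch below uses only the standard library; the para wrapper element is an assumption added for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Readers may refer to work in <GKSaha2005> for more information."

# Unescaped, <GKSaha2005> is read as a start tag that is never closed.
try:
    ET.fromstring("<para>" + raw + "</para>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False

# escape() replaces &, < and > with entity references.
escaped = escape(raw)
parsed = ET.fromstring("<para>" + escaped + "</para>")

print(unescaped_ok)        # False: the raw text is not well formed
print(escaped)
print(parsed.text == raw)  # True: the parser restores the original text
```

Escaping keeps the document well formed; the pos_cat markup then adds the information that the brackets are punctuation and not tag delimiters.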

 

Many calendars are in use, e.g., the Bangla calendar (Bangabdo), the English calendar and Shakabdo. A date encountered in a document is therefore not necessarily an English year and date. In such cases it is better to indicate the kind of calendar first, so that the date can be internationalized. We may use the following markup for date-type data, together with the date format.

 

<!-- Markup for Date type data Internationalization & Localization -->

<pos_cat name="date" type="yy/mm/dd" meaning="english_date"> 05/10/17 </pos_cat>

<!-- default may be English_date -->

<pos_cat name="date" type="dd/mm/yy" meaning="bangla_date"> 26/06/12 </pos_cat>

<!-- Present year is 1412 in Bangabdo -->

<pos_cat name="date" type="mm/dd/yyyy" meaning="bangla_date"> 06/27/1412 </pos_cat>

<pos_cat name="date" type="dd-mm-yyyy" meaning="english_date"> 17-10-2005 </pos_cat>

<pos_cat name="date" type="dd MMM, yyyy" meaning="english_date"> 18 Oct, 2005 </pos_cat>

<pos_cat name="date" type="MMM dd, yyyy" meaning="english_date"> Oct 18, 2005 </pos_cat>

<pos_cat name="date" type="dd MMM, yyyy" meaning="bangla_date"> 29 Ash, 1412 </pos_cat>

<!-- "Ash" stands for the Bangla Calendar Month: Ashwin -->

<pos_cat name="date" type="dd She MMMM, yyyy" meaning="bangla_date"> 22 She Ashwin, 1412 </pos_cat>

<!-- 22nd Ashwin -->

<pos_cat name="date" type="dd i MMMM, yyyy" meaning="bangla_date"> 12 i Ashwin, 1412 </pos_cat>

<!-- 12th Ashwin -->

<pos_cat name="date" type="dd MMMM, yyyy" meaning="malayalam_date"> 1 Madam, 1181 </pos_cat>

<!-- 1st Madam (that is, the 1st month in the Malayalam Calendar, current year is 1181) -->
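A consumer of this markup could translate the type pattern into a parsing format and normalize the date. The Python sketch below is an illustration under assumptions: the function to_iso and the token table are hypothetical, only english_date values are handled, and month abbreviations such as "Oct" rely on the default English month names of strptime. Converting Bangla or Malayalam dates would need a calendar conversion that is not attempted here.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Mapping from the pattern tokens used in the markup to strptime directives.
_TOKENS = {"yyyy": "%Y", "yy": "%y", "MMMM": "%B", "MMM": "%b",
           "mm": "%m", "dd": "%d"}

def to_iso(fragment):
    """Return an ISO 8601 date for english_date markup, else None."""
    elem = ET.fromstring(fragment)
    # The markup comments suggest english_date as the default calendar.
    if elem.get("meaning", "english_date") != "english_date":
        return None  # other calendars need a real conversion table
    fmt = re.sub(r"yyyy|yy|MMMM|MMM|mm|dd",
                 lambda m: _TOKENS[m.group()], elem.get("type"))
    return datetime.strptime(elem.text.strip(), fmt).date().isoformat()

a = '<pos_cat name="date" type="dd-mm-yyyy" meaning="english_date"> 17-10-2005 </pos_cat>'
b = '<pos_cat name="date" type="yy/mm/dd" meaning="english_date"> 05/10/17 </pos_cat>'
c = '<pos_cat name="date" type="dd/mm/yy" meaning="bangla_date"> 26/06/12 </pos_cat>'
print(to_iso(a), to_iso(b), to_iso(c))
```

Once the date is in a neutral form, a localization step can re-render it in whatever format and calendar the target locale expects.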

 

An image, together with its embedded tooltip text, is often inserted in a sentence. We want to translate the sentence as well as the tooltip text (in this example, "begin"). We may use the following markup.

 

<!-- Word-Level Markup for ToolTip text word embedded inside an Image -->

 <para> Click here

             <image source="begin.jpg" alt="begin" /> 

             <pos_cat name="alt_value"> begin </pos_cat>

                           to play now.

 </para>

 

The following word-level markups can be used to handle personal names written in various conventions. This metadata is also useful for sorting personal names.

 

<pos_cat name="noun" type="proper" meaning="person_first_middle_surname">

Goutam Kumar Saha

</pos_cat>

<pos_cat name="noun" type="proper" meaning="person_surname_first_middle">

Saha Goutam Kumar

</pos_cat>

<!-- a typical example of a person's name in south India (i.e., in the Kannada, Malayalam, Tamil and

Telugu languages) -->

<!-- the first part is the initial of the ancestor's place, the second is the initial of the father's given name, the third is the given name of the person and the fourth is the surname or family name of the person -->

<pos_cat name="noun" type="proper"

meaning="person_ancestorplaceInit__fathernameInit_first_surname">

K M Rama Rao

</pos_cat>

<!-- a typical example of  a person's name (a modern convention). Here, first name stands for the

 given name of a person and the second one for his wife -->

<pos_cat name="noun" type="proper" meaning="person_first_wife">

 Rodger Ami

</pos_cat>
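The sorting use mentioned for name metadata can be sketched in Python. The function surname_key is hypothetical; it assumes, as in the examples, that the meaning attribute lists one role per name part after the person prefix, separated by underscores.

```python
import xml.etree.ElementTree as ET

def surname_key(fragment):
    """Return a lowercase sort key: the surname, else the first (given) name."""
    elem = ET.fromstring(fragment)
    # "person_first_middle_surname" -> ["first", "middle", "surname"];
    # empty pieces from doubled underscores are dropped.
    roles = [r for r in elem.get("meaning").split("_")[1:] if r]
    parts = elem.text.split()
    by_role = dict(zip(roles, parts))
    return by_role.get("surname", by_role.get("first", parts[0])).lower()

names = [
    '<pos_cat name="noun" type="proper" meaning="person_surname_first_middle">'
    'Saha Goutam Kumar</pos_cat>',
    '<pos_cat name="noun" type="proper" meaning="person_first_wife">'
    'Rodger Ami</pos_cat>',
    '<pos_cat name="noun" type="proper" '
    'meaning="person_ancestorplaceInit__fathernameInit_first_surname">'
    'K M Rama Rao</pos_cat>',
]
for n in sorted(names, key=surname_key):
    print(surname_key(n), "<-", "".join(ET.fromstring(n).itertext()))
```

Because the role of each name part is explicit, names written in different conventions can be ordered by the same key without guessing which token is the family name.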

 

Sentences often contain words from more than one language. For example, consider the Hindi sentence "Kaam joldi start kijiye" (in English, "Start the work immediately"; lexicon: Kaam = work, Joldi = immediately, Kijiye = do). Note that the English word "start" appears in the source-language (Hindi) sentence. Such multilingual wording is very common in urban areas. Because the markup provides the meaning of the foreign word (here, "start") within the source-language sentence, a translation parser should have no problem understanding a sentence that contains multilingual words.

 

<!-- Markup for a Sentence having Multilingual Words -->

<!-- Markup for the Hindi sentence "Kaam Joldi Start Kijiye" -->

<text xml:lang="hi">

<sentence_cat name="imperative">

Kaam Joldi

<pos_cat name="verb" type="compound" meaning="start"> start kijiye </pos_cat>

</sentence_cat>

</text>

 

Sentence-Level Markup for Translating the HTML Title Attribute Value:

A "title" attribute can be inserted in almost any HTML tag. It gives the element a tooltip that pops up when the mouse moves over it (in the example here, over "W3C"). For internationalization and localization, the value of the HTML title attribute should be translated. Similar markup is useful for translating VBScript or JavaScript tooltip text attached to events such as ONMOUSEOVER.

 

<!-- Sentence-Level Markup for HTML Title Attribute -->

<sentence_cat name="title_value">

<a href="http://www.w3.org" title="Click here for the W3C ">W3C

</a>

</sentence_cat>

 

Another example of Markup for "HTML Title Attribute" for a form using the "Input Text Box" is stated below.

 

<!-- Markup for HTML Title Attribute using Input Text Box -->

<sentence_cat name="title_value">

<form>

<input type="text" size="20" title="Enter your email address here" />

<input type="button" value="Submit" />

</form>

</sentence_cat>
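A translation tool could collect the title values flagged by this markup as follows. The Python sketch is a minimal illustration: the function translatable_titles and the doc wrapper are assumptions, and the embedded HTML is written in well-formed (XHTML-style) syntax so that a standard XML parser accepts it.

```python
import xml.etree.ElementTree as ET

def translatable_titles(xml_text):
    """Collect title attribute values inside sentence_cat name="title_value"."""
    root = ET.fromstring(xml_text)
    titles = []
    for block in root.iter("sentence_cat"):
        if block.get("name") != "title_value":
            continue
        for elem in block.iter():
            title = elem.get("title")
            if title:
                titles.append(title.strip())
    return titles

# <doc> is a hypothetical wrapper; attributes are quoted and empty
# elements are self-closed so that the fragment is well-formed XML.
doc = """
<doc>
  <sentence_cat name="title_value">
    <a href="http://www.w3.org" title="Click here for the W3C ">W3C</a>
  </sentence_cat>
  <sentence_cat name="title_value">
    <form>
      <input type="text" size="20" title="Enter your email address here" />
      <input type="button" value="Submit" />
    </form>
  </sentence_cat>
</doc>
"""
print(translatable_titles(doc))
```

Only attribute values inside blocks explicitly flagged as title_value are collected, so other attributes such as href are left untouched.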

 

An example of markup for JavaScript tooltip text that also needs to be translated into the target language is given below.

 

 <sentence_cat name="scripttitle_value">

 <!-- Markup for Javascript Tooltips text on events like ONMOUSEOVER -->

 <A HREF="/tips/page2.asp"

     ONMOUSEOVER="this._tip='It <FONT COLOR=red>

                              <B>simplifies</B></FONT>

                               XHTML is powerful'">

                           DHTML XHTML

</A>

</sentence_cat>

 

An example of markup for JavaScript document.write text that needs to be translated is given below.

 

<!-- Markup for Javascript document.write text  -->

<sentence_cat name="script_document_write_value">

<script type="text/javascript">

var d = new Date()

var time = d.getHours()

if (time>12)

{

document.write("<b>Good Afternoon</b>")

}

</script>

</sentence_cat>
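Extracting the translatable string from such a script can be approximated with a regular expression. The Python sketch below is a rough illustration only: the function document_write_strings is hypothetical, it handles only double-quoted string literals passed directly to document.write, and a real tool would need a proper JavaScript parser.

```python
import re

# Matches document.write("...") with a double-quoted literal argument.
_WRITE = re.compile(r'document\.write\(\s*"((?:[^"\\]|\\.)*)"\s*\)')
_TAG = re.compile(r"<[^>]+>")

def document_write_strings(script):
    """Return the text of each document.write literal with HTML tags removed."""
    return [_TAG.sub("", literal) for literal in _WRITE.findall(script)]

script = '''
var d = new Date()
var time = d.getHours()
if (time>12)
{
document.write("<b>Good Afternoon</b>")
}
'''
print(document_write_strings(script))
```

The inline formatting tags are stripped so that only the human-readable text is handed to the translator.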

 

XML Schema

 

A typical XML schema for content domain is stated below.

 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"   elementFormDefault="qualified">

  <xs:complexType name="catType">

    <xs:attribute name="name" use="required">

        <xs:simpleType>

          <xs:restriction base="xs:string">

                        <xs:enumeration value="administrative"/>

                        <xs:enumeration value="advertise"/>

                        <xs:enumeration value="agriculture"/>

                        <xs:enumeration value="astrology"/>

                        <xs:enumeration value="businesstrade"/>

                        <xs:enumeration value="citation"/>

                        <xs:enumeration value="communications"/>

                        <xs:enumeration value="defence"/>

                        <xs:enumeration value="dialect"/>

                        <xs:enumeration value="economics"/>

                        <xs:enumeration value="education"/>

                        <xs:enumeration value="emotion"/>

                        <xs:enumeration value="enggtech"/>

                        <xs:enumeration value="entertainment"/>

                        <xs:enumeration value="environment"/>

                        <xs:enumeration value="figure"/>

                        <xs:enumeration value="finance"/>

                        <xs:enumeration value="geography"/>

                        <xs:enumeration value="gossip"/>

                        <xs:enumeration value="history"/>

                        <xs:enumeration value="it"/>

                        <xs:enumeration value="law"/>

                        <xs:enumeration value="literature"/>

                        <xs:enumeration value="mathematics"/>

                        <xs:enumeration value="medical"/>

                        <xs:enumeration value="news"/>

                        <xs:enumeration value="occupation"/>

                        <xs:enumeration value="philosophy"/>

                        <xs:enumeration value="politics"/>

                        <xs:enumeration value="religion"/>

                        <xs:enumeration value="review"/>

                        <xs:enumeration value="science"/>

                        <xs:enumeration value="sex"/>

                        <xs:enumeration value="society"/>

                        <xs:enumeration value="speech"/>

                        <xs:enumeration value="sports"/>

                        <xs:enumeration value="travel"/>

                        <xs:enumeration value="violence"/>

                        <xs:enumeration value="weather"/>

          </xs:restriction>

        </xs:simpleType>

                </xs:attribute>

                <xs:attribute name="type">

                        <xs:simpleType>

                         <xs:restriction base="xs:string">

                                <xs:enumeration value="civic"/>

                                <xs:enumeration value="cultural"/>

                                <xs:enumeration value="drama"/>

                                <xs:enumeration value="humanities"/>

                                <xs:enumeration value="international"/>

                                <xs:enumeration value="lyrics"/>

                                <xs:enumeration value="national"/>

                                <xs:enumeration value="poetry"/>

                                <xs:enumeration value="story"/>

                </xs:restriction>

             </xs:simpleType>

                </xs:attribute>

        </xs:complexType>

        <xs:element name="content_domain">

                <xs:complexType>

                        <xs:sequence>

                                <xs:element name="cat" type="catType"/>

                        </xs:sequence>

                </xs:complexType>

        </xs:element>

</xs:schema>

 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:c="http://www.kolkatacdac.in/w3ci18ncd" elementFormDefault="qualified">

 <xs:import namespace="http://www.kolkatacdac.in/w3ci18ncd" schemaLocation="C:\Documents and  Settings\Administrator\My Documents\auth-contdom-13091.xsd"/>

    <xs:element name="content_domain">

       <xs:complexType>

        <xs:sequence>

                        <xs:element ref="c:cat" maxOccurs="unbounded"/>

        </xs:sequence>

      </xs:complexType>

   </xs:element>

</xs:schema>
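The enumerations in such a schema can be read programmatically, for instance to check an author's category name before full validation. The Python sketch below uses only the standard library, which does not perform XSD validation itself; the function enumerated_values is hypothetical and the schema text is a shortened stand-in for the one shown in this section.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def enumerated_values(xsd_text, attribute_name):
    """Return the enumeration values declared for the named xs:attribute."""
    root = ET.fromstring(xsd_text)
    for attr in root.iter(XS + "attribute"):
        if attr.get("name") == attribute_name:
            return [e.get("value") for e in attr.iter(XS + "enumeration")]
    return []

# Shortened stand-in for the content-domain schema (three values only).
xsd = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="catType">
    <xs:attribute name="name" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="it"/>
          <xs:enumeration value="literature"/>
          <xs:enumeration value="travel"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:schema>
"""
domains = enumerated_values(xsd, "name")
print(domains)
print("literature" in domains, "factory" in domains)
```

A check like this only covers attribute vocabularies; confirming that a whole document conforms to the three schemas still requires a validating XML Schema processor.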

 

 

Conclusion

 

The proposed 3-tier XML schema is useful for adding both language-specific and other translation-related markups to XML content, enabling more meaningful and faster translation with an affordable markup overhead. The approach is a step forward for machine translation of web content from one human language to another. Web content can be translated even without extensive source-language resources (for example, a bilingual lexicon) or source-language knowledge, provided that such markups are added to most of the source-language words in an XML document.

 

Acknowledgement: The author is thankful to Dr. A.B. Saha, Executive Director, CDAC Kolkata, for his general encouragement, and to Mr. S. Pahari for his help.

 


 

 

Author’s Biography: In seventeen years of research and development, the author has worked as a scientist at LRDE, Defence Research & Development Organisation, Bangalore, and at the Electronics Research & Development Centre of India, Calcutta. At present, he is with the Centre for Development of Advanced Computing, Kolkata, India, as a Scientist-F. He has authored about a hundred research papers in various international journals and proceedings. He is a senior member of IEEE, the Computer Society of India and ACM, and a fellow of IETE, MSPI, IMS, etc. He is also a member of the W3C ITS Working Group. He has received various awards, scholarships and grants from national and international organizations. He is a referee for AMSE journals (France), JZUS, CSI journals, IJCPOL and IEEE Potentials magazine. His fields of interest are dependable, fault-tolerant computing and natural language engineering. He is an associate editor of ACM Ubiquity. He can be reached via sahagk@gmail.com or gksaha@rediffmail.com.

 


 

 

 
