Cogprints: no conditions; results ordered by date (descending), then title. Feed generated 2018-01-17.

Sentence syntax trees should be made from morphemes. Semantically ordered trees
A critique of the use of sentence parse trees in modern linguistics, with two propositions on constructing trees, as mentioned in the title; an introduction to an English-to-Tatar translator program being developed by the author; precedence by specificity.
Author: Dinar Qurbanov <qdinar@gmail.com>
http://cogprints.org/id/eprint/9827 (deposited 2015-02-24)

Indonesian Innovations on Information Technology 2013: Between Syntactic and Semantic Textual Network
Network and graph models are a good alternative for analysing huge collections of textual data, since they reduce the dimensionality of the data. Texts can be seen as syntactic and semantic networks among words, with phrases seen as concepts. The model is applied to the proposals of Indonesian innovators for the implementation of information technology, and some interesting insights from the analysis are outlined.
Author: Hokky Situngkir <hs@compsoc.bandungfe.net>
http://cogprints.org/id/eprint/9094 (deposited 2013-11-18)

Improving the quality of Gujarati-Hindi Machine Translation through part-of-speech tagging and stemmer-assisted transliteration
Machine translation for Indian languages is an emerging research area. Transliteration, the mapping of source-language text into the target language, is one of the modules designed as part of a translation system. Simple mapping decreases the efficiency of the overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration: the effectiveness of translation can be improved if transliteration is assisted by part-of-speech tagging and stemming. We show that much of the content in Gujarati gets transliterated while being processed for translation to Hindi.
Authors: Juhi Ameta; Nisheeth Joshi; Iti Mathur
http://cogprints.org/id/eprint/9068 (deposited 2013-11-18)

Development of a Hindi Lemmatizer
We live in a translingual society: to communicate with people from different parts of the world we would need expertise in their respective languages. Learning all these languages is not feasible, so we need a mechanism that can do this task for us. Machine translators have emerged as a tool that can perform this task.
To develop a machine translator we need to develop several different sets of rules. The first module in a machine translation pipeline is morphological analysis, under which stemming and lemmatization fall. In this paper we present a lemmatizer that generates rules for removing affixes, along with rules for restoring a proper root word.
Authors: Snigdha Paul <snigdha.pal18@gmail.com>; Nisheeth Joshi <nisheeth.joshi@rediffmail.com>; Iti Mathur <mathur_iti@rediffmail.com>
http://cogprints.org/id/eprint/9058 (deposited 2013-11-18)

Part of Speech Tagging of Marathi Text Using Trigram Method
In this paper we present a part-of-speech tagger for Marathi, a morphologically rich language spoken by the native people of Maharashtra. The tagger is statistical, using the trigram method: the most likely POS tag for a token is chosen on the basis of the previous two tags, by calculating probabilities to determine the best tag sequence. We describe the development of the tagger and its evaluation.
Authors: Jyoti Singh <jyoti.singh132@gmail.com>; Nisheeth Joshi <nisheeth.joshi@rediffmail.com>; Iti Mathur <mathur_iti@rediffmail.com>
http://cogprints.org/id/eprint/9069 (deposited 2013-11-18)

Rule Based Transliteration Scheme for English to Punjabi
Machine transliteration has emerged as an important research area in the field of machine translation. Transliteration aims to preserve the phonological structure of words; proper transliteration of named entities plays a significant role in improving the quality of machine translation.
In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We construct rules for syllabification, the process of separating the syllables of a word. Probabilities are calculated for named entities (proper names and locations); for words that are not named entities, probabilities are calculated by relative frequency using the statistical machine translation toolkit MOSES. Using these probabilities we transliterate the input text from English to Punjabi.
Authors: Deepti Bhalla <deeptibhalla0600@gmail.com>; Nisheeth Joshi <nisheeth.joshi@rediffmail.com>; Iti Mathur <mathur_iti@rediffmail.com>
http://cogprints.org/id/eprint/9070 (deposited 2013-11-18)

Getting the most from a surname study: semantics, DNA and computer modelling
We address such questions as: what does a surname mean; is it single-origin; and why do some surnames grow abnormally large? Though most surnames are rare, most people have populous surnames. In 1881, for example, 90% of the population of England and Wales had the most populous 4% of surnames; in 1998, 80% had the most populous 1%. We consider the evidence that some frequent surnames could be single-source, which would imply that a single family has grown abnormally large. Some populous surnames have a geographical distribution that might be thought consistent with a single origin, though as yet such supposition generally lacks support from adequate DNA evidence.
With the onset of DNA testing, some scientists are becoming more active in surname studies, and they may be more reluctant than some traditionalists to infer too much from categories of surname meaning. Instead, they are likely to maintain that statistical analyses of the data should be properly performed. For example, King and Jobling (2009) considered forty English surnames and found no statistically significant correlation between the supposed semantic category of a surname and its degree of DNA matching into single male-line families. As a specific example described here in some detail, little can be deduced about the inter-relatedness of those called Plant from the assumption of a semantic category, such as by arguing that the name is locative and hence single-origin, or occupational and hence multi-origin. More surely, we discuss the DNA evidence that this name's main family grew unusually. Though motivated initially by the evidence of unusual growth for Plant, we extend our deliberations to other surnames.
Guided by the empirical evidence, our computer simulations identify various reasons for a surname family's prolific growth. Chance is a main factor, along with favourable conditions during the Industrial Age, when overall population growth took off, evidently earlier in some regions than in others. The modelling also suggests that some additional factor, such as polygyny, resilience to plague, or favourable economic circumstance after an early start to a hereditary surname, helps see a family through its initially precarious times, sustaining its survival through to a small but real chance of subsequent proliferation in favourable Industrial Age conditions.
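The simulations described above are, in spirit, branching-process models of male-line descent. A minimal sketch (not the authors' actual model; the Galton-Watson setup, the Poisson offspring distribution and the growth rate below are assumptions chosen only for illustration) shows how chance alone makes most single-founder lineages die out while a few proliferate:

```python
import math
import random

def surname_lineage(generations, mean_sons=1.05, seed=None):
    """Count male-line surname carriers after a number of generations.

    Each carrier fathers a Poisson(mean_sons) number of sons; the surname
    is extinct once the count hits zero. Parameters are illustrative only.
    """
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's multiplication method; adequate for small lambda.
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    carriers = 1
    for _ in range(generations):
        carriers = sum(poisson(mean_sons) for _ in range(carriers))
        if carriers == 0:
            break
    return carriers

# Chance dominates: most single-founder lineages die out, a few grow large.
sizes = [surname_lineage(20, seed=s) for s in range(1000)]
extinct = sum(1 for n in sizes if n == 0)
```

Raising `mean_sons` for the first few generations of a run mimics the "additional factor" (polygyny, plague resilience, economic advantage) that the authors invoke to carry a family through its precarious early period.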
Authors: Dr John S. Plant <plant@one-name.org>; Prof Richard E. Plant <replant@ucdavis.edu>
http://cogprints.org/id/eprint/8267 (deposited 2012-11-09)

Human and Automatic Evaluation of English to Hindi Machine Translation Systems
Machine translation evaluation is one of the most formidable activities in machine translation development. We present evaluation results for several machine translators available online for English-Hindi translation. The systems are measured on automatic evaluation metrics and on human subjectivity measures.
Authors: Nisheeth Joshi <nisheeth.joshi@rediffmail.com>; Hemant Darbari <darbari@cdac.in>; Iti Mathur <mathur_iti@rediffmail.com>
http://cogprints.org/id/eprint/9057 (deposited 2013-09-17)

Design of English-Hindi Translation Memory for Efficient Translation
Developing parallel corpora is an important and difficult activity for machine translation, requiring manual annotation by human translators. Translating the same text again is wasted effort. Tools exist to avoid this for European languages, but no such tool has been available for Indian languages. In this paper we present a tool for Indian languages that not only provides automatic translations of previously translated text but also, where a sentence has multiple translations, presents them as a ranked list of suggestions. The tool also gives translators global and local options for saving their work, so that they may share it with others, which further lightens the task.
Authors: Nisheeth Joshi <nisheeth.joshi@rediffmail.com>; Iti Mathur <mathur_iti@rediffmail.com>
http://cogprints.org/id/eprint/9060 (deposited 2013-11-18)

Expectations eclipsed in foreign language education: learners and educators on an ongoing journey / edited by Hülya Görür-Atabaş, Sharon Turner
From June 2 to 4, 2011, Sabancı University School of Languages welcomed colleagues from 21 countries to a collaborative exploration of the challenging and inspiring journey of learners and educators in the field of language education.
The conference provided an opportunity for all stakeholders to share their views on language education. Colleagues met with world-renowned experts and authors in the fields of education and psychology, faculty and administrators from various universities and institutions, teachers from secondary and higher education, and learners, whose voices are often reported rather than directly shared.
The conference name, Eclipsing Expectations, was inspired by two natural phenomena: a solar eclipse immediately before the conference and a lunar eclipse immediately after. Learners and educators were invited to join a journey to observe, learn and exchange ideas.
Editors: Hülya Görür-Atabaş <hulyag@sabanciuniv.edu>; Sharon Turner <shturner@sabanciuniv.edu>
http://cogprints.org/id/eprint/7881 (deposited 2012-11-09)

Some Inquiries to Spontaneous Opinions: A case with Twitter in Indonesia
The paper discusses opportunities to use the series of micro-blog posts provided by Twitter to observe opinion dynamics. Tweets are all the more spontaneous as the service becomes increasingly tied to mobile communications. The extraction of information from series of tweets is demonstrated through a conceptual map and a mention map; from the latter, stylized properties of social networks, i.e. a power-law distribution, are shown. The methodology is exemplified on the 82nd commemoration of the Indonesian Youth Pledge and on a participatory movement in the Indonesian capital, Jakarta.
Authors: Ardian Maulana <ai@compsoc.bandungfe.net>; Hokky Situngkir <hs@compsoc.bandungfe.net>
http://cogprints.org/id/eprint/7133 (deposited 2010-11-22)

Evaluation of Computational Grammar Formalisms for Indian Languages
Natural language parsing has been a prominent research area since the genesis of natural language processing. Probabilistic parsers are being developed to make parser development easier, more accurate and faster. In the Indian context, the question of which computational grammar formalism to use still needs to be answered. In this paper we focus on this problem and analyse different formalisms for Indian languages.
Authors: Nisheeth Joshi <nisheeth.joshi@rediffmail.com>; Iti Mathur <mathur_iti@rediffmail.com>
http://cogprints.org/id/eprint/9061 (deposited 2013-11-18)

Input Scheme for Hindi Using Phonetic Mapping
Written communication on computers requires knowing how to enter text in the desired language. Most people do not use any language besides English on a computer, which creates a barrier. To resolve this we have developed a scheme for inputting Hindi text using phonetic mapping. Using this scheme we generate intermediate code strings and match them against pronunciations of the input text. Our system shows significant success over other available input systems.
Authors: Nisheeth Joshi <nisheeth.joshi@rediffmail.com>; Iti Mathur <mathur_iti@rediffmail.com>
http://cogprints.org/id/eprint/9062 (deposited 2013-11-18)

Exploring the N-th Dimension of Language
This paper is aimed at exploring the hidden fundamental
computational property of natural language, which has been so elusive that all attempts to characterize it have ultimately failed. Natural language was at first thought to be context-free, but it was gradually realized that this does not hold, given the range of natural language phenomena found to be of non-context-free character. It has instead been suggested that natural language is mildly context-sensitive and to some extent context-free. In all, the issue of the exact computational property has not yet been settled. Against this background it is proposed that this exact computational property is perhaps the N-th dimension of language, if by dimension we mean nothing but a universal (computational) property of natural language.
Author: Prakash Mondal <mndlprksh@yahoo.co.in>
http://cogprints.org/id/eprint/8026 (deposited 2012-11-09)

Representation and computation
This is an encyclopedia entry and does not include an abstract.
Authors: Maurizio Tirassa <maurizio.tirassa@unito.it>; Marianna Vallana
http://cogprints.org/id/eprint/6879 (deposited 2010-07-29)

Research on Social Engagement with a Rabbitic User Interface
Companions serving as interfaces to smart rooms need not only to be easy to interact with, but also to maintain long-term relationships with their users. The FP7-funded project SERA (Social Engagement with Robots and Agents) contributes to knowledge about, and modelling of, such relationships. One focal activity is an iterative field study collecting real-life long-term interaction data with a robotic interface. The first stage of this study has been completed; this paper reports on the set-up and first insights.
Authors: Sabine Payr <sabine.payr@ofai.at>; Peter Wallis; Stuart Cunningham; Mark Hawley
http://cogprints.org/id/eprint/6864 (deposited 2010-07-02)

A Constructive Mathematic approach for Natural Language formal grammars
A mathematical description of natural language grammars was first proposed by Leibniz. After Frege's definition of unsaturated expressions and Husserl's foundation of a logical grammar, the application of logic to treating natural language grammars computationally attracted the interest of linguists, for example through Lambek's categorial calculus.
In recent years the most consolidated formal grammars (e.g. Minimalism, HPSG, TAG, CCG, dependency grammars) have begun to seek a strong psychological interpretation of their formalism, and hence of the natural language data to which they are applied. Nevertheless, little attention seems to have been paid to cognitive linguistics, a branch of linguistics that actively uses concepts and results from the cognitive sciences. Apparently unrelated, the study of computational concepts and formalisms has developed in tandem with constructive formal systems, especially in the branch of logic called proof theory; see, e.g., the Curry-Howard isomorphism and typed functional languages. In this paper we bridge these worlds and present our natural language formalism, Adpositional Grammars (AdGrams), founded on both cognitive linguistics and constructive mathematics.
Authors: Dr Federico Gobbo <federico.gobbo@uninsubria.it>; Dr Marco Benini <marco.benini@uninsubria.it>
http://cogprints.org/id/eprint/8772 (deposited 2012-12-22)

LETEC (Learning and Teaching Corpus) Simuligne
Learning and teaching corpus of the online educational experiment Simuligne (2001). Its scenario is based on a global simulation for learning French as a foreign language, and it includes an intercultural activity, "Interculture", based on the Cultura project. The corpus includes the pedagogical scenario (described in several formats), the research protocol, participants' online interactions and productions (structured in XML), the list of participants, and licences of use.
The associated LETEC corpus (mce.simu.all.all-CP.zip) is organized as an IMS-CP archive. We define a learning and teaching corpus as a structured entity containing all the elements resulting from a communicative online learning situation, whose context is described by an educational scenario and a research protocol. The core data collection includes all the interaction data, the productions of the course participants, and the tracks resulting from the participants' actions in the learning environment, stored according to the research protocol. To be shareable, and to respect participant privacy, these data should be anonymised and a licence for their use provided with the corpus. A derived analysis can be linked to the set of data it considers, uses or computes over. An analysis consisting of data annotation, transcription or transformation, accurately connected to its original data, can be merged into the corpus itself, so that other researchers can compare their own results in a concurrent analysis or build complementary analyses upon these results.
The definition of a learning and teaching corpus as a whole entity stems from the need for explicit links between interaction data, context and analyses. This explicit context is crucial for an external researcher to interpret the data and perform his or her own analyses.
This definition seeks to capture the context of the data stemming from the course, so that a researcher can find, understand and connect this information whether or not he or she was involved in the original course. More details about a LETEC corpus and its structure: http://mulce.univ-fcomte.fr/metadata/LETECorpus-en.pdf
Authors: Thierry Chanier <thierry.chanier@univ-fcomte.fr>; Marie-Noelle Lamy <M.N.Lamy@open.ac.uk>; Christophe Reffay <Christophe.Reffay@univ-fcomte.fr>; Marie-Laure Betbeder <marie-laure.betbeder@univ-fcomte.fr>; Maud Ciekanski <maud.ciekanski@univ-fcomte.fr>
http://cogprints.org/id/eprint/6431 (deposited 2009-04-24)

The Latent Relation Mapping Engine: Algorithm and Experiments
Many AI researchers and cognitive scientists have argued that analogy is the core of cognition. The most influential work on computational modeling of analogy-making is Structure Mapping Theory (SMT) and its implementation in the Structure Mapping Engine (SME). A limitation of SME is its requirement for complex hand-coded representations. We introduce the Latent Relation Mapping Engine (LRME), which combines ideas from SME and Latent Relational Analysis (LRA) to remove the requirement for hand-coded representations. LRME builds analogical mappings between lists of words, using a large corpus of raw text to automatically discover the semantic relations among the words. We evaluate LRME on a set of twenty analogical mapping problems, ten based on scientific analogies and ten on common metaphors. LRME achieves human-level performance on the twenty problems; a variety of alternative approaches we compare it with are not able to reach the same level of performance.
Author: Peter D. Turney <peter.turney@nrc-cnrc.gc.ca>
http://cogprints.org/id/eprint/6305 (deposited 2009-01-05)

A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations
Recognizing analogies, synonyms, antonyms, and associations appear to be four distinct tasks, requiring distinct NLP algorithms. In the past, the four tasks have been treated independently, using a wide variety of algorithms. These four semantic classes, however, are a tiny sample of the full range of semantic phenomena, and we cannot afford to create ad hoc algorithms for each semantic phenomenon; we need to seek a unified approach. We propose to subsume a broad range of phenomena under analogies. To limit the scope of this paper, we restrict our attention to the subsumption of synonyms, antonyms, and associations. We introduce a supervised corpus-based machine learning algorithm for classifying analogous word pairs, and we show that it can solve multiple-choice SAT analogy questions, TOEFL synonym questions, ESL synonym-antonym questions, and similar-associated-both questions from cognitive psychology.
Author: Peter D. Turney <peter.turney@nrc-cnrc.gc.ca>
http://cogprints.org/id/eprint/6181 (deposited 2008-08-31)

A MDL-based Model of Gender Knowledge Acquisition
This paper presents an iterative model of
knowledge acquisition of gender information associated with word endings in French. Gender knowledge is represented as a set of rules containing exceptions. Our model takes noun-gender pairs as input and constantly maintains a list of rules and exceptions that is both coherent with the input data and minimal with respect to a minimum description length (MDL) criterion. The model was compared to human data at various ages and showed a good fit. We also compared the rules discovered by the model with the rules usually extracted by linguists, and found interesting discrepancies.
Authors: Harmony Marchal; Benoit Lemaire; Maryse Bianco; Philippe Dessus
http://cogprints.org/id/eprint/6177 (deposited 2008-08-30)

Boundary effects in a three-state modified voter model for languages
The standard three-state voter model is enlarged by including outside pressure favouring one of the three language choices and by adding some biased internal random noise. The Monte Carlo simulations are motivated by states whose population is divided into three groups with various affinities to each other. We show the crucial influence of the boundaries for moderate lattice sizes such as 500 x 500. By removing the fixed boundary on one side, we demonstrate that this can lead to the victory of a single choice; noise, in contrast, stabilizes the choices of all three populations. In addition, we compute the persistence probability, i.e. the number of sites that have never changed their opinion during the simulation, and we consider the case of "rigid-minded" decision makers.
Authors: Tarik Hadzibeganovic <tarik@edu.uni-graz.at>; Dietrich Stauffer <stauffer@thp.uni-koeln.de>; Christian Schulze
http://cogprints.org/id/eprint/5911 (deposited 2008-01-27)

Ontology and Formal Semantics - Integration Overdue
In this note we suggest that difficulties encountered in natural language semantics are, for the most part, due to the use of mere symbol-manipulation systems that are devoid of any content. In such systems there is hardly any link with our common-sense view of the world, and it is quite difficult to envision how one could formally account for the considerable amount of content that is often implicit, but almost never explicitly stated, in our everyday discourse.
The solution, in our opinion, is a compositional semantics grounded in an ontology that reflects our commonsense view of the world and the way we talk about it in ordinary language. In the compositional logic we envision there are ontological (or first-intension) concepts and logical (or second-intension) concepts, where the ontological concepts include not only Davidsonian events but other abstract objects as well (e.g. states, processes, properties, activities, attributes, etc.).
We demonstrate that in such a framework a number of challenges in the semantics of natural language (e.g. metonymy, intensionality, metaphor) can be properly and uniformly addressed.
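The flavour of such an ontologically grounded semantics can be illustrated with a toy typed ontology; the hierarchy, the coercion table and the treatment of metonymy below are invented for this sketch and are not taken from the paper:

```python
# Toy illustration: a strongly typed ontology can license "type coercion",
# one common treatment of metonymy ("read Shakespeare" = read his writings).
# The hierarchy and the coercion table are invented for this sketch.

ISA = {                     # child -> parent in a tiny ontology
    "novel": "readable", "author": "person", "person": "entity",
    "readable": "entity",
}
COERCE = {("author", "readable"): "writings-of"}  # licensed metonymies

def subsumes(general, specific):
    """True if `specific` is `general` or a descendant of it."""
    while specific is not None:
        if specific == general:
            return True
        specific = ISA.get(specific)
    return False

def check(verb_sig, arg_type):
    """Type-check a verb argument, trying metonymic coercion on failure."""
    if subsumes(verb_sig, arg_type):
        return f"ok: {arg_type} is a {verb_sig}"
    for (src, dst), name in COERCE.items():
        if subsumes(src, arg_type) and subsumes(verb_sig, dst):
            return f"coerced: {arg_type} -> {name}({arg_type})"
    return "type error"

print(check("readable", "novel"))    # direct fit
print(check("readable", "author"))   # metonymy: read the author's writings
```

The point of the sketch is only that once argument positions carry ontological types, phenomena such as metonymy can be handled by uniform rules over the type structure rather than by case-by-case stipulation.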
Author: Walid Saba <walid.saba@gmail.com>
http://cogprints.org/id/eprint/5876 (deposited 2007-12-19)

Empirical Evaluation of Four Tensor Decomposition Algorithms
Higher-order tensor decompositions are analogous to the familiar Singular Value Decomposition (SVD), but they transcend the limitations of matrices (second-order tensors). SVD is a powerful tool that has achieved impressive results in information retrieval, collaborative filtering, computational linguistics, computational vision, and other fields. However, SVD is limited to two-dimensional arrays of data (two modes), and many potential applications have three or more modes, which require higher-order tensor decompositions. This paper evaluates four algorithms for higher-order tensor decomposition: Higher-Order Singular Value Decomposition (HO-SVD), Higher-Order Orthogonal Iteration (HOOI), Slice Projection (SP), and Multislice Projection (MP). We measure the time (elapsed run time), space (RAM and disk requirements), and fit (tensor reconstruction accuracy) of the four algorithms under a variety of conditions. We find that standard implementations of HO-SVD and HOOI do not scale up to larger tensors, due to increasing RAM requirements. We recommend HOOI for tensors that are small enough for the available RAM, and MP for larger tensors.
Author: Peter D. Turney <peter.turney@nrc-cnrc.gc.ca>
http://cogprints.org/id/eprint/5841 (deposited 2007-11-22)

An Alternative Postulate to see Melody as "Language"
The paper proposes a way to see melodic features in music and songs in terms of "letters" constituting "words", while investigating whether they satisfy the Zipf-Mandelbrot law. Some interesting findings are reported, including possible conjectures for classifying melodic and musical artifacts along several aspects of culture. The paper ends with a discussion of further directions, both for musicology and for musical generative art.
Author: Hokky Situngkir
http://cogprints.org/id/eprint/5593 (deposited 2007-07-14)

Conjecture to Statistical Proximity with Tree of Language (?): Report on Few Austronesian Languages of Indonesian Ethnics
We continue earlier steps [3] showing the distinctions and proximities of languages over statistical facts. We construct a homology tree from the distance matrix obtained by transforming some statistical aspects of the empirical observations into binary sequences, in order to conform to the concepts of memetics [2]. The resulting visualizations show interesting facts and may motivate further steps towards a better understanding of languages and ethnicities.
Authors: Hokky Situngkir; Deni Khanafiah
http://cogprints.org/id/eprint/5563 (deposited 2007-05-28)

A Note on Ontology and Ordinary Language
We argue for a compositional semantics grounded in a strongly typed ontology that reflects our commonsense view of the world and the way we talk about it. Assuming such a structure, we show that the semantics of various natural language phenomena may become nearly trivial.
Author: Walid Saba
http://cogprints.org/id/eprint/5544 (deposited 2007-05-19)

Regimes in Babel are Confirmed: Report on Findings in Several Indonesian Ethnic Biblical Texts
The paper describes the presence of three statistical regimes in the Zipfian analysis of texts in quantitative linguistics: the Mandelbrot, original Zipf, and Cancho-Solé-Montemurro regimes. The analysis is carried out over nine languages with the same semantic content: the Bible in several Indonesian ethnic languages and in the national language, with the English Bible analysed for reference. The existence of the three regimes is confirmed, and the length of the texts also proves to be an important issue.
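The rank-frequency machinery underlying this line of work can be sketched as follows; this is a generic illustration of Zipfian analysis (the three regimes and the parameterizations studied in the paper are not modelled here):

```python
from collections import Counter
import math

def rank_frequency(text):
    """Return (rank, frequency) pairs for the words of `text`, most frequent first."""
    counts = Counter(text.lower().split())
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_slope(pairs):
    """Least-squares slope of log(freq) vs log(rank); near -1 for Zipfian text."""
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

pairs = rank_frequency("the cat and the dog and the bird saw the cat")
# "the" ranks first; the fitted log-log slope is negative for any real text
```

Studies of the kind reported here fit such curves separately over low, middle and high ranks, which is where the distinct statistical regimes become visible.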
We outline some further works regarding the quantitative analysis for parameterization used to analyze the three regimes and the task to have broad explanation, especially the microstructure of the language in human decision or linguistic effort – emerging the robustness of them.Hokky Situngkir2007-04-04Z2011-03-11T08:56:49Zhttp://cogprints.org/id/eprint/5481This item is in the repository with the URL: http://cogprints.org/id/eprint/54812007-04-04ZAn Observational Framework to the Zipfian Analysis among Different Languages: Studies to Indonesian Ethnic Biblical Texts
The paper introduces the used of Zipfian statistics to observe the human languages by using the same (meaning) corpus/corpora but different in grammatical and structural utterances. We used biblical texts since they contain corpuses that have been most widely and carefully translated into many languages. The idea is to reduce the possibility of noise came from the meaning of the texts in distinctive language. The result is that the robustness of the Zipfian law is observable and some statistical differences are discovered between English and widely used national and several ethnic languages in Indonesia. The paper ends by modestly propose further possible framework in interdisciplinary approaches to human language evolution.Hokky Situngkir2007-03-16Z2011-03-11T08:56:48Zhttp://cogprints.org/id/eprint/5455This item is in the repository with the URL: http://cogprints.org/id/eprint/54552007-03-16ZDesigning Domain Ontology: A Study in Lexical SemanticsPreparing a multi-purpose lexicon requires a systematic analysis of inter-conceptual relations. These relations are of two types, namely (i) syntactic and (ii) semantic, which can further be decomposed to capture the greater explanatory adequacy. But the exploration of the lexical structure becomes intricate because of the hidden dynamics of the context; since traditionally, language has been viewed as a totality of lexicon and computation system, and major emphasis has been given to the designing of the computational system, considering the designing of the lexicon internal domain ontology as a mere metaphysical game, when in reality it is a serious epistemic concern, because of having the capacity of licensing inferences. Therefore a lexical level representation should have enough scope to incorporate the contextual information.
Designing a domain ontology is important, since it tells us about the conceptual constellation within whose coherent whole the related terms are meaningful. Isolating a term from its constellation will result in the evaporation of its meaning. Furthermore, the ontology provides the basis upon which the entire linguistic structure rests. If so, how is it possible to construct a lexicon while divorcing it from ontological issues? At the same time, ontology by itself is not enough, because the higher-order typifications of those (grounded) concepts, and the interrelations among the types, ultimately yield super-ordinating levels containing the syntactic information pertinent to a symbol-manipulating system.
In this paper I show, on the basis of examples cited from English and Bengali, that the representation of a lexical structure should include both kinds of information, pertinent to closed-class as well as open-class semantics.
Samir Karmakar2007-07-28Z2011-03-11T08:56:55Zhttp://cogprints.org/id/eprint/5626This item is in the repository with the URL: http://cogprints.org/id/eprint/56262007-07-28ZFast & Confident Probabilistic CategorizationWe describe NRC's submission to the Anomaly Detection/Text Mining competition organised at the Text Mining Workshop 2007. This submission relies on a straightforward implementation of the probabilistic categoriser described in (Gaussier et al., ECIR'02). This categoriser is adapted to handle multiple labelling and a piecewise-linear confidence estimation layer is added to provide an estimate of the labelling confidence. This technique achieves a score of 1.689 on the test data.
Cyril Goutte2007-05-08Z2011-03-11T08:56:50Zhttp://cogprints.org/id/eprint/5535This item is in the repository with the URL: http://cogprints.org/id/eprint/55352007-05-08ZLanguage, logic and ontology: uncovering the
structure of commonsense knowledgeThe purpose of this paper is twofold: (i) we argue that the structure of commonsense knowledge must be discovered, rather than invented; and (ii) we argue that natural
language, which is the best known theory of our (shared) commonsense knowledge, should itself be used as a guide to discovering the structure of commonsense knowledge. In addition to suggesting a systematic method to the discovery of the structure of commonsense knowledge, the method we propose seems to also provide an explanation for a number of phenomena in natural language, such as metaphor, intensionality, and the semantics of nominal compounds. Admittedly, our ultimate goal is quite ambitious, and it is no less than the systematic ‘discovery’ of a well-typed
ontology of commonsense knowledge, and the subsequent formulation of the long-awaited goal of a meaning algebra.Walid Saba2007-07-28Z2011-03-11T08:56:56Zhttp://cogprints.org/id/eprint/5627This item is in the repository with the URL: http://cogprints.org/id/eprint/56272007-07-28ZStatistical Phrase-based Post-editingWe propose to use a statistical phrase-based machine translation system in a post-editing task: the system takes as input raw machine translation output (from a commercial rule-based MT system), and produces post-edited target-language text. We report on experiments that were performed on data collected in precisely such a setting: pairs of raw MT output and their manually post-edited versions. In our evaluation, the output of our automatic post-editing (APE) system is not only of better quality than the rule-based MT (both in terms of the BLEU and TER metrics), it is also better than the output of a state-of-the-art phrase-based MT system used in standalone translation mode. These results indicate that automatic post-editing constitutes a simple and efficient way of combining rule-based and statistical MT technologies.
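The post-editing evaluation above relies on the BLEU and TER metrics. TER counts word-level edits, including phrase shifts, against a reference translation; as a rough illustrative sketch of the idea (omitting TER's shift operation, which reduces it to a word error rate; the function name is ours, not from the paper):

```python
# Illustrative sketch: a simplified TER-style score, i.e. word-level
# edit distance without TER's shift operation (equivalent to WER).
def word_edit_rate(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    # Classic dynamic-programming edit distance over word sequences.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return d[len(hyp)][len(ref)] / max(len(ref), 1)
```

Lower is better: a perfect match scores 0.0, and each missing, spurious, or substituted word adds 1/|reference|.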
Michel SimardCyril GouttePierre Isabelle2007-11-13T00:51:03Z2011-03-11T08:57:00Zhttp://cogprints.org/id/eprint/5817This item is in the repository with the URL: http://cogprints.org/id/eprint/58172007-11-13T00:51:03ZExperiments on predictability of word in context and information rate in natural languageBased on data from a large-scale experiment with human subjects, we conclude that the logarithm of the probability of guessing a word in context (unpredictability) depends linearly on the word length. This result holds both for poetry and prose, even though with prose, the subjects don't know the length of the omitted word. We hypothesize that this effect reflects a tendency of natural language to have an even information rate.Dmitrii Maninmanin@pobox.com2006-03-16Z2011-03-11T08:56:21Zhttp://cogprints.org/id/eprint/4764This item is in the repository with the URL: http://cogprints.org/id/eprint/47642006-03-16ZThe Missing Link between Morphemic Assemblies and Behavioral Responses: a Bayesian Information-Theoretical model of lexical processingWe present the Bayesian Information-Theoretical (BIT) model of lexical processing: a mathematical model illustrating a novel approach to the modelling of language processes. The model shows how a neurophysiological theory of lexical processing relying on Hebbian association and neural assemblies can directly account for a variety of effects previously observed in behavioural experiments. We develop two information-theoretical measures of the distribution of usages of a morpheme or word, and use them to predict responses in three visual lexical decision datasets investigating inflectional morphology and polysemy. Our model offers a neurophysiological basis for the effects of
morpho-semantic neighbourhoods. These results demonstrate how distributed patterns of activation naturally result in the emergence of symbolic structures. We conclude by arguing that the modelling framework exemplified here is
a powerful tool for integrating behavioural and neurophysiological results.Dr Fermin Moscoso del Prado MartinProf Aleksandar KosticDusica Filipovic-Djurdjevic2006-03-06Z2011-03-11T08:56:21Zhttp://cogprints.org/id/eprint/4754This item is in the repository with the URL: http://cogprints.org/id/eprint/47542006-03-06ZThe Missing Link between Morphemic Assemblies and Behavioral Responses: a Bayesian Information-Theoretical model of lexical processingWe present the Bayesian Information-Theoretical (BIT) model of lexical processing: a mathematical model illustrating a novel approach to the modelling of language processes. The model shows how a neurophysiological theory of lexical processing relying on Hebbian association and neural assemblies can directly account for a variety of effects previously observed in behavioral experiments. We develop two information-theoretical measures of the distribution of usages of a word or morpheme. These measures are calculated through unsupervised means from corpora. We show that our measures successfully predict responses in three visual lexical decision datasets investigating the processing of inflectional morphology in Serbian and English, and the effects of polysemy and homonymy in English. We discuss how our model provides a neurophysiological grounding for the facilitatory and inhibitory effects of different types of lexical neighborhoods. In addition, our results show how, under a model based on neural assemblies, distributed patterns of activation naturally result in the emergence of discrete symbol-like structures. Therefore, the BIT model offers a point of reconciliation in the debate between distributed connectionist and discrete localist models. Finally, we argue that the modelling framework exemplified by the BIT model is a powerful tool for integrating the different levels of the description of the human language
processing system.Fermin Moscoso del Prado MartinKostic AleksandarFilipovic-Djurdjevic Dusica2006-08-01Z2011-03-11T08:56:33Zhttp://cogprints.org/id/eprint/5039This item is in the repository with the URL: http://cogprints.org/id/eprint/50392006-08-01ZExpressing Implicit Semantic Relations without SupervisionWe present an unsupervised learning algorithm that mines large
text corpora for patterns that express implicit semantic relations.
For a given input word pair X:Y with some unspecified semantic
relations, the corresponding output list of patterns <P1,...,Pm>
is ranked according to how well each pattern Pi expresses the
relations between X and Y. For example, given X=ostrich and
Y=bird, the two highest ranking output patterns are "X is the
largest Y" and "Y such as the X". The output patterns are intended
to be useful for finding further pairs with the same relations, to
support the construction of lexicons, ontologies, and semantic
networks. The patterns are sorted by pertinence, where the pertinence
of a pattern Pi for a word pair X:Y is the expected relational
similarity between the given pair and typical pairs for Pi. The
algorithm is empirically evaluated on two tasks, solving
multiple-choice SAT word analogy questions and classifying semantic
relations in noun-modifier pairs. On both tasks, the algorithm
achieves state-of-the-art results, performing significantly better
than several alternative pattern ranking algorithms based on tf-idf.Peter D. Turney21752006-09-01Z2011-03-11T08:56:35Zhttp://cogprints.org/id/eprint/5098This item is in the repository with the URL: http://cogprints.org/id/eprint/50982006-09-01ZSimilarity of Semantic RelationsThere are at least two kinds of similarity. Relational similarity is
correspondence between relations, in contrast with attributional similarity,
which is correspondence between attributes. When two words have a high
degree of attributional similarity, we call them synonyms. When two pairs
of words have a high degree of relational similarity, we say that their
relations are analogous. For example, the word pair mason:stone is analogous
to the pair carpenter:wood. This paper introduces Latent Relational Analysis (LRA),
a method for measuring relational similarity. LRA has potential applications in many
areas, including information extraction, word sense disambiguation,
and information retrieval. Recently the Vector Space Model (VSM) of information
retrieval has been adapted to measuring relational similarity,
achieving a score of 47% on a collection of 374 college-level multiple-choice
word analogy questions. In the VSM approach, the relation between a pair of words is
characterized by a vector of frequencies of predefined patterns in a large corpus.
LRA extends the VSM approach in three ways: (1) the patterns are derived automatically
from the corpus, (2) the Singular Value Decomposition (SVD) is used to smooth the frequency
data, and (3) automatically generated synonyms are used to explore variations of the
word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the
average human score of 57%. On the related problem of classifying semantic relations, LRA
achieves similar gains over the VSM. Peter D. Turney21752005-08-24Z2011-03-11T08:56:09Zhttp://cogprints.org/id/eprint/4518This item is in the repository with the URL: http://cogprints.org/id/eprint/45182005-08-24ZCorpus-based Learning of Analogies and Semantic RelationsWe present an algorithm for learning from unlabeled text, based on the Vector Space Model (VSM) of information retrieval, that can solve verbal analogy questions of the kind found in the SAT college entrance exam. A verbal analogy has the form A:B::C:D, meaning "A is to B as C is to D"; for example, mason:stone::carpenter:wood. SAT analogy questions provide a word pair, A:B, and the problem is to select the most analogous word pair, C:D, from a set of five choices. The VSM algorithm correctly answers 47% of a collection of 374 college-level analogy questions (random guessing would yield 20% correct; the average college-bound senior high school student answers about 57% correctly). We motivate this research by applying it to a difficult problem in natural language processing, determining semantic relations in noun-modifier pairs. The problem is to classify a noun-modifier pair, such as "laser printer", according to the semantic relation between the noun (printer) and the modifier (laser). We use a supervised nearest-neighbour algorithm that assigns a class to a given noun-modifier pair by finding the most analogous noun-modifier pair in the training data. With 30 classes of semantic relations, on a collection of 600 labeled noun-modifier pairs, the learning algorithm attains an F value of 26.5% (random guessing: 3.3%). With 5 classes of semantic relations, the F value is 43.2% (random: 20%). The performance is state-of-the-art for both verbal analogies and noun-modifier relations. Peter D. Turney2175Michael L. 
Littman2005-08-11Z2011-03-11T08:56:09Zhttp://cogprints.org/id/eprint/4501This item is in the repository with the URL: http://cogprints.org/id/eprint/45012005-08-11ZMeasuring Semantic Similarity by Latent Relational AnalysisThis paper introduces Latent Relational Analysis (LRA), a method for measuring semantic similarity. LRA measures similarity in the semantic relations between two pairs of words. When two pairs have a high degree of relational similarity, they are analogous. For example, the pair cat:meow is analogous to the pair dog:bark. There is evidence from cognitive science that relational similarity is fundamental to many cognitive and linguistic tasks (e.g., analogical reasoning). In the Vector Space Model (VSM) approach to measuring relational similarity, the similarity between two pairs is calculated by the cosine of the angle between the vectors that represent the two pairs. The elements in the vectors are based on the frequencies of manually constructed patterns in a large corpus. LRA extends the VSM approach in three ways: (1) patterns are derived automatically from the corpus, (2) Singular Value Decomposition is used to smooth the frequency data, and (3) synonyms are used to reformulate word pairs. This paper describes the LRA algorithm and experimentally compares LRA to VSM on two tasks, answering college-level multiple-choice word analogy questions and classifying semantic relations in noun-modifier expressions. LRA achieves state-of-the-art results, reaching human-level performance on the analogy questions and significantly exceeding VSM performance on both tasks.Peter D. Turney21752005-04-12Z2011-03-11T08:55:55Zhttp://cogprints.org/id/eprint/4204This item is in the repository with the URL: http://cogprints.org/id/eprint/42042005-04-12ZOn Parsing CHILDESResearch on child language acquisition would benefit from the availability of a large body of syntactically parsed utterances between parents and children. 
We consider the problem of generating such a ``treebank'' from the CHILDES corpus, which currently contains primarily orthographically transcribed speech tagged for lexical category.Aarre Laakso2004-12-11Z2011-03-11T08:55:45Zhttp://cogprints.org/id/eprint/3981This item is in the repository with the URL: http://cogprints.org/id/eprint/39812004-12-11ZHuman-Level Performance on Word Analogy Questions by Latent Relational AnalysisThis paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, machine translation, and information retrieval. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason/stone is analogous to the pair carpenter/wood; the relations between mason and stone are highly similar to the relations between carpenter and wood. Past work on semantic similarity measures has mainly been concerned with attributional similarity. For instance, Latent Semantic Analysis (LSA) can measure the degree of similarity between two words, but not between two relations. Recently the Vector Space Model (VSM) of information retrieval has been adapted to the task of measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. 
LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus (they are not predefined), (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data (it is also used this way in LSA), and (3) automatically generated synonyms are used to explore reformulations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying noun-modifier relations, LRA achieves similar gains over the VSM, while using a smaller corpus.Peter D. Turney21752005-01-10Z2011-03-11T08:55:49Zhttp://cogprints.org/id/eprint/4027This item is in the repository with the URL: http://cogprints.org/id/eprint/40272005-01-10ZCombining Independent Modules in Lexical Multiple-Choice ProblemsExisting statistical approaches to natural language problems are very
coarse approximations to the true complexity of language processing.
As such, no single technique will be best for all problem
instances. Many researchers are examining ensemble methods that
combine the output of multiple modules to
create more accurate solutions. This paper examines three merging
rules for combining probability distributions: the familiar mixture
rule, the logarithmic rule, and a novel product rule.
These rules were applied with state-of-the-art results to two
problems used to assess human mastery of lexical
semantics -- synonym questions and analogy questions. All three
merging rules result in ensembles that are more accurate than any of
their component modules. The differences among the three rules are not statistically
significant, but it is suggestive that the popular mixture rule
is not the best rule for either of the two problems.Peter D. Turney2175Michael L. LittmanJeffrey BighamVictor Shnayder2004-06-05Z2011-03-11T08:55:36Zhttp://cogprints.org/id/eprint/3657This item is in the repository with the URL: http://cogprints.org/id/eprint/36572004-06-05ZFrequency Value Grammar and Information TheoryI previously laid the groundwork for Frequency Value Grammar (FVG) in papers I submitted in the proceedings of the 4th International Conference on Cognitive Science (2003), Sydney Australia, and Corpus Linguistics Conference (2003), Lancaster, UK. FVG is a formal syntax theoretically based in large part on Information Theory principles. FVG relies on dynamic physical principles external to the corpus which shape and mould the corpus whereas generative grammar and other formal syntactic theories are based exclusively on patterns (fractals) found occurring within the well-formed portion of the corpus. However, FVG should not be confused with Probability Syntax, (PS), as described by Manning (2003). PS is a corpus based approach that will yield the probability distribution of possible syntax constructions over a fixed corpus. PS makes no distinction between well and ill formed sentence constructions and assumes everything found in the corpus is well formed. In contrast, FVG’s primary objective is to distinguish between well and ill formed sentence constructions and, in so doing, relies on corpus based parameters which determine sentence competency. In PS, a syntax of high probability will not necessarily yield a well formed sentence. However, in FVG, a syntax or sentence construction of high ‘frequency value’ will yield a well-formed sentence, at least, 95% of the time satisfying most empirical standards. Moreover, in FVG, a sentence construction of ‘high frequency value’ could very well be represented by an underlying syntactic construction of low probability as determined by PS. 
The characteristic ‘frequency values’ calculated in FVG are not measures of probability but rather are fundamentally determined values derived from exogenous principles which impact and determine corpus-based parameters serving as an index of sentence competency. The theoretical framework of FVG has broad applications beyond that of formal syntax and NLP. In this paper, I will demonstrate how FVG can be used as a model for improving the upper-bound calculation of the entropy of written English. Generally speaking, when a function word precedes an open-class word, the backward n-gram analysis will be homomorphic with the information source and will result in frequency values more representative of co-occurrences in the information source.
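The upper-bound claim above rests on a standard information-theoretic fact: the per-symbol cross-entropy of any model of a text upper-bounds the entropy rate of the source, so any model that better exploits regularities (such as function-word/open-class co-occurrence) tightens the bound. A minimal sketch of such an upper-bound estimate, using a smoothed character bigram model (our own illustration, not the FVG method itself):

```python
import math
from collections import Counter

# Illustrative sketch (not the paper's FVG method): score a text with a
# character bigram model trained on that same text, with add-one
# smoothing.  The result, in bits per character, is an upper bound on
# the entropy rate of the source that generated the text.
def bigram_cross_entropy(text: str) -> float:
    pairs = Counter(zip(text, text[1:]))   # counts of adjacent char pairs
    contexts = Counter(text[:-1])          # counts of left-hand contexts
    vocab = len(set(text))                 # alphabet size for smoothing
    total_bits = 0.0
    for (a, b), n in pairs.items():
        p = (n + 1) / (contexts[a] + vocab)  # add-one smoothed P(b | a)
        total_bits += n * -math.log2(p)
    return total_bits / max(len(text) - 1, 1)
```

A fully predictable text like `"aaaa"` scores 0 bits per character, while less predictable texts score higher; a sharper model would push the bound lower.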
Asa M Stepak2004-07-30Z2011-03-11T08:55:39Zhttp://cogprints.org/id/eprint/3732This item is in the repository with the URL: http://cogprints.org/id/eprint/37322004-07-30ZWord Sense Disambiguation by Web Mining for Word Co-occurrence ProbabilitiesThis paper describes the National Research Council (NRC)
Word Sense Disambiguation (WSD) system, as applied to the
English Lexical Sample (ELS) task in Senseval-3. The NRC system
approaches WSD as a classical supervised machine learning problem,
using familiar tools such as the Weka machine learning software
and Brill's rule-based part-of-speech tagger. Head words are
represented as feature vectors with several hundred features.
Approximately half of the features are syntactic and the other
half are semantic. The main novelty in the system is the method for
generating the semantic features, based on word co-occurrence
probabilities. The probabilities are estimated using
the Waterloo MultiText System with a corpus of about one terabyte of
unlabeled text, collected by a web crawler.Peter D. Turney2003-07-16Z2011-03-11T08:55:18Zhttp://cogprints.org/id/eprint/3054This item is in the repository with the URL: http://cogprints.org/id/eprint/30542003-07-16ZAnchoring of semiotic symbolsThis paper presents arguments for approaching the anchoring problem using {\em semiotic symbols}. Semiotic symbols are defined by a triadic relation between forms, meanings and referents, thus having an implicit relation to the real world.Anchors are formed between these three elements rather than between `traditional' symbols and sensory images. This allows an optimization between the form (i.e. the `traditional' symbol) and the referent. A robotic experiment based on adaptive language games illustrates how the anchoring of semiotic symbols can be achieved in a bottom-up fashion. The paper concludes that applying semiotic symbols is a potentially valuable approach toward anchoring.Paul Vogt2003-09-19Z2011-03-11T08:55:20Zhttp://cogprints.org/id/eprint/3163This item is in the repository with the URL: http://cogprints.org/id/eprint/31632003-09-19ZCombining independent modules to solve multiple-choice synonym and analogy problems
Existing statistical approaches to natural language problems are very
coarse approximations to the true complexity of language processing.
As such, no single technique will be best for all problem instances.
Many researchers are examining ensemble methods that combine the
output of successful, separately developed modules to create more
accurate solutions. This paper examines three merging rules for
combining probability distributions: the well known mixture rule, the
logarithmic rule, and a novel product rule. These rules were applied
with state-of-the-art results to two problems commonly used to assess
human mastery of lexical semantics -- synonym questions and analogy
questions. All three merging rules result in ensembles that are more
accurate than any of their component modules. The differences among the
three rules are not statistically significant, but it is suggestive
that the popular mixture rule is not the best rule for either of the
two problems.Peter TurneyMichael LittmanJeffrey BighamVictor Shnayder2003-07-16Z2011-03-11T08:55:19Zhttp://cogprints.org/id/eprint/3059This item is in the repository with the URL: http://cogprints.org/id/eprint/30592003-07-16ZGrounded lexicon formation without explicit reference transfer: who's talking to who?This paper presents a first investigation regarding lexicon grounding and evolution under an iterated learning regime without an explicit transfer of reference. In the original iterated learning framework, a population contains adult speakers and learning hearers. In this paper I investigate the effects of allowing both adults and learners to take up the role of speakers and hearers with varying probabilities. The results indicate that when adults and learners can be selected as speakers and hearers, their lexicons become more similar but at the cost of reduced success in communication.Paul Vogt2003-07-16Z2011-03-11T08:55:18Zhttp://cogprints.org/id/eprint/3053This item is in the repository with the URL: http://cogprints.org/id/eprint/30532003-07-16ZInvestigating social interaction strategies for bootstrapping lexicon developmentThis paper investigates how different modes of social interactions influence the bootstrapping and evolution of lexicons. This is done by comparing three language game models that differ in the type of social interactions they use. The simulations show that the language games which use either joint attention or corrective feedback as a source of contextual input are better capable of bootstrapping a lexicon than the game without such directed interactions. 
The simulation of the latter game, however, does show that it is possible to develop a lexicon without using directed input when the lexicon is transmitted from generation to generation.Paul VogtHans Coumans2003-07-16Z2011-03-11T08:55:18Zhttp://cogprints.org/id/eprint/3057This item is in the repository with the URL: http://cogprints.org/id/eprint/30572003-07-16ZIterated learning and grounding: from holistic to compositional languagesThis paper presents a new computational model for studying the origins and evolution of compositional languages grounded through the interaction between agents and their environment. The model is based on previous work on adaptive grounding of lexicons and the iterated learning model. Although the model is still in a developmental phase, the first results show that a compositional language can emerge in which the structure reflects regularities present in the population's environment.Paul Vogt2003-07-25Z2011-03-11T08:55:19Zhttp://cogprints.org/id/eprint/3084This item is in the repository with the URL: http://cogprints.org/id/eprint/30842003-07-25ZLearning Analogies and Semantic RelationsWe present an algorithm for learning from unlabeled text, based on the
Vector Space Model (VSM) of information retrieval, that can solve verbal
analogy questions of the kind found in the Scholastic Aptitude Test (SAT).
A verbal analogy has the form A:B::C:D, meaning "A is to B as C is to D";
for example, mason:stone::carpenter:wood. SAT analogy questions provide
a word pair, A:B, and the problem is to select the most analogous word
pair, C:D, from a set of five choices. The VSM algorithm correctly
answers 47% of a collection of 374 college-level analogy questions
(random guessing would yield 20% correct). We motivate this research by
relating it to work in cognitive science and linguistics, and by applying
it to a difficult problem in natural language processing, determining
semantic relations in noun-modifier pairs. The problem is to classify a
noun-modifier pair, such as "laser printer", according to the semantic
relation between the noun (printer) and the modifier (laser). We use a
supervised nearest-neighbour algorithm that assigns a class to a given
noun-modifier pair by finding the most analogous noun-modifier pair in
the training data. With 30 classes of semantic relations, on a collection
of 600 labeled noun-modifier pairs, the learning algorithm attains an F
value of 26.5% (random guessing: 3.3%). With 5 classes of semantic
relations, the F value is 43.2% (random: 20%). The performance is
state-of-the-art for these challenging problems.Peter TurneyMichael Littman2003-09-19Z2011-03-11T08:55:20Zhttp://cogprints.org/id/eprint/3164This item is in the repository with the URL: http://cogprints.org/id/eprint/31642003-09-19ZMeasuring praise and criticism: Inference of semantic orientation from associationThe evaluative character of a word is called its semantic orientation. Positive semantic orientation indicates praise (e.g., "honest", "intrepid") and negative semantic orientation indicates criticism (e.g., "disturbing", "superfluous"). Semantic orientation varies in both direction (positive or negative) and degree (mild to strong). An automated system for measuring semantic orientation would have application in text classification, text filtering, tracking opinions in online discussions, analysis of survey responses, and automated chat systems (chatbots). This paper introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words. Two instances of this approach are evaluated, based on two different statistical measures of word association: pointwise mutual information (PMI) and latent semantic analysis (LSA). The method is experimentally tested with 3,596 words (including adjectives, adverbs, nouns, and verbs) that have been manually labeled positive (1,614 words) and negative (1,982 words). The method attains an accuracy of 82.8% on the full test set, but the accuracy rises above 95% when the algorithm is allowed to abstain from classifying mild words. Peter TurneyMichael Littman2003-04-15Z2011-03-11T08:55:15Zhttp://cogprints.org/id/eprint/2874This item is in the repository with the URL: http://cogprints.org/id/eprint/28742003-04-15ZA Proposed Mathematical Theory Explaining Word Order Typology In this paper I attempt to lay the groundwork for an algorithm that measures sentence competency.
Heretofore, competency of sentences was determined by interviewing speakers of the language. The data compiled forms the basis for grammatical rules that establish the generative grammar of a language. However, the generative grammar, once established, does not filter out all incompetent sentences. Chomsky has noted that there are many sentences that are grammatical but do not satisfy the notion of competency and, similarly, many non-grammatical constructions that do.
I propose that generative grammar constructions as well as formal theory frameworks such as Transformational Grammar, Minimalist Theory, and Government and Binding do not represent the most irreducible component of a language that determines sentence competency. I propose a Mathematical Theory governing word order typology that explains not only the established generative grammar rules of a language but also lays the groundwork for understanding sentence competency in terms of irreducible components that have not been accounted for in previous formal theories. I have done so by relying on a mathematical analysis of word frequency relationships, based upon large, representative corpuses, that represents a more basic component of sentence construction, overlooked by current text processing and artificial intelligence parsing systems and unaccounted for by the generative grammar rules of a language.
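The corpus analyses described above, like the Zipfian studies earlier in this list, start from a word-frequency tabulation over a large corpus. A minimal illustration of that first step (our own sketch, not Stepak's theory; Zipf's law predicts frequency roughly proportional to 1/rank):

```python
import re
from collections import Counter

# Minimal illustration: rank-frequency tabulation of a corpus, the raw
# material for Zipf-style and word-frequency analyses.
def rank_frequency(corpus: str):
    words = re.findall(r"[a-z']+", corpus.lower())
    counts = Counter(words)
    # (rank, word, frequency) triples, most frequent first.
    return [(r, w, n)
            for r, (w, n) in enumerate(counts.most_common(), start=1)]
```

On a real corpus, plotting log frequency against log rank for these triples is the usual check of how closely the Zipfian law holds.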
Asa Stepak2003-07-16Z2011-03-11T08:55:18Zhttp://cogprints.org/id/eprint/3058This item is in the repository with the URL: http://cogprints.org/id/eprint/30582003-07-16ZTHSim v3.2: The Talking Heads simulation toolThe field of language evolution and computation may benefit from using efficient and robust simulation tools that are based on widely exploited principles within the field. The tool presented in this paper is one that could fulfil such needs. The paper presents an overview of the tool -- THSim v3.2 -- and discusses some research questions that can be investigated with it.Paul Vogt2003-03-12Z2011-03-11T08:55:07Zhttp://cogprints.org/id/eprint/2658This item is in the repository with the URL: http://cogprints.org/id/eprint/26582003-03-12ZPhonemic Coding Might Result From
Sensory-Motor Coupling DynamicsHuman sound systems are invariably phonemically coded. Furthermore, phoneme inventories follow very particular tendencies. To explain these phenomena, three kinds of approaches have been proposed so far: ``Chomskyan''/cognitive innatism, morpho-perceptual innatism, and the more recent approach of ``language as a complex cultural system which adapts under the pressure of efficient communication''. The first two approaches are clearly unsatisfying, while the third, although much more convincing, makes many speculative assumptions and has not really answered the question of phonemic coding. We propose here a new hypothesis based on a low-level model of sensory-motor interactions. We show that certain very simple and non-language-specific neural devices allow a population of agents to build signalling systems without any functional pressure. Moreover, these systems are phonemically coded. Using a realistic vowel articulatory synthesizer, we show that the inventories of vowels
have striking similarities with human vowel systems.Pierre-Yves Oudeyer2003-07-16Z2011-03-11T08:55:18Zhttp://cogprints.org/id/eprint/3055This item is in the repository with the URL: http://cogprints.org/id/eprint/30552003-07-16ZThe physical symbol grounding problemThis paper presents an approach to solve the symbol grounding problem within the framework of embodied cognitive science. It will be argued that symbolic structures can be used within the paradigm of embodied cognitive science by adopting an alternative definition of a symbol. In this alternative definition, the symbol may be viewed as a structural coupling between an agent's sensorimotor activations and its environment. A robotic experiment is presented in which mobile robots develop a symbolic structure from scratch by engaging in a series of language games. In this experiment it is shown that robots can develop a symbolic structure with which they can communicate the names of a few objects with a remarkable degree of success. It is further shown that, although the referents may be interpreted differently on different occasions, the objects are usually named with only one form.Paul Vogt2002-01-16Z2011-03-11T08:54:52Zhttp://cogprints.org/id/eprint/2036This item is in the repository with the URL: http://cogprints.org/id/eprint/20362002-01-16ZThe adaptive advantage of symbolic theft over sensorimotor toil: Grounding language in perceptual categoriesUsing neural nets to simulate learning and the genetic algorithm to simulate evolution in a toy world of mushrooms and mushroom-foragers, we place two ways of acquiring categories into direct competition with one another: In (1) "sensorimotor toil,” new categories are acquired through real-time, feedback-corrected, trial and error experience in sorting them. In (2) "symbolic theft,” new categories are acquired by hearsay from propositions – boolean combinations of symbols describing them. In competition, symbolic theft always beats sensorimotor toil. 
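The toil-versus-theft contrast just described can be caricatured in a few lines; the hidden category, the cost accounting, and all names below are invented for illustration, not taken from the authors' simulation:

```python
import random

random.seed(0)

# A "mushroom" is three boolean features; it is edible iff
# features 0 and 1 are both present (the hidden category).
def edible(m):
    return m[0] and m[1]

def learn_by_toil(trials=200):
    """Trial-and-error: each sample costs one real-world interaction."""
    memory = {}
    for _ in range(trials):
        m = tuple(random.random() < 0.5 for _ in range(3))
        memory[m] = edible(m)
    return memory, trials

def learn_by_theft():
    """Hearsay: receive the boolean description directly, at unit cost."""
    return (lambda m: m[0] and m[1]), 1

toil_memory, toil_cost = learn_by_toil()
theft_rule, theft_cost = learn_by_theft()
print(toil_cost, theft_cost)  # theft acquires the same category far more cheaply
```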
We hypothesize that this is the basis of the adaptive advantage of language. Entry-level categories must still be learned by toil, however, to avoid an infinite regress (the “symbol grounding problem”). Changes in the internal representations of categories must take place during the course of learning by toil. These changes can be analyzed in terms of the compression of within-category similarities and the expansion of between-category differences. These allow regions of similarity space to be separated, bounded and named, and then the names can be combined and recombined to describe new categories, grounded recursively in the old ones. Such compression/expansion effects, called "categorical perception" (CP), have previously been reported with categories acquired by sensorimotor toil; we show that they can also arise from symbolic theft alone. The picture of natural language and its origins that emerges from this analysis is that of a powerful hybrid symbolic/sensorimotor capacity, infinitely superior to its purely sensorimotor precursors, but still grounded in and dependent on them. It can spare us from untold time and effort learning things the hard way, through direct experience, but it remains anchored in and translatable into the language of experience.Angelo CangelosiStevan Harnad2002-01-11Z2011-03-11T08:54:52Zhttp://cogprints.org/id/eprint/2016This item is in the repository with the URL: http://cogprints.org/id/eprint/20162002-01-11ZEvolution of communication and language using signals, symbols and wordsThis paper describes different types of models for the evolution of communication and language. It uses the distinction between signals, symbols, and words for the analysis of evolutionary models of language. In particular, it shows how evolutionary computation techniques, such as artificial life, can be used to study the emergence of syntax and symbols from simple communication signals.
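A minimal naming-game sketch of such signal-object association models; the agent count, the update rule (the hearer always adopts the speaker's signal), and the signal format are illustrative assumptions, not the paper's algorithm:

```python
import random

random.seed(1)
OBJECTS = ["food_a", "food_b"]  # hypothetical referents

class Agent:
    def __init__(self):
        self.lexicon = {}  # object -> preferred signal

    def name(self, obj):
        # Invent a random signal the first time an object must be named.
        if obj not in self.lexicon:
            self.lexicon[obj] = "s%03d" % random.randrange(1000)
        return self.lexicon[obj]

    def adopt(self, obj, signal):
        self.lexicon[obj] = signal  # align with the speaker

agents = [Agent() for _ in range(5)]
for _ in range(200):  # repeated language games
    speaker, hearer = random.sample(agents, 2)
    obj = random.choice(OBJECTS)
    hearer.adopt(obj, speaker.name(obj))

# Under these dynamics the population tends toward one shared signal per object.
print({obj: {a.lexicon.get(obj) for a in agents} for obj in OBJECTS})
```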
Initially, a computational model that evolves repertoires of isolated signals is presented. This study has simulated the emergence of signals for naming foods in a population of foragers. This type of model studies communication systems based on simple signal-object associations. Subsequently, models that study the emergence of grounded symbols are discussed in general, including a detailed description of a work on the evolution of simple syntactic rules. This model focuses on the emergence of symbol-symbol relationships in evolved languages. Finally, computational models of syntax acquisition and evolution are discussed. These different types of computational models provide an operational definition of the signal/symbol/word distinction. The simulation and analysis of these types of models will help to understand the role of symbols and symbol acquisition in the origin of language.Angelo Cangelosi2001-11-27Z2011-03-11T08:54:50Zhttp://cogprints.org/id/eprint/1926This item is in the repository with the URL: http://cogprints.org/id/eprint/19262001-11-27ZHumanoid Theory GroundingIn this paper we consider the importance of using a humanoid physical form for a certain proposed kind of robotics, that of theory grounding. Theory grounding involves grounding the theory skills and knowledge of an embodied artificially intelligent (AI) system by developing theory skills and knowledge from the bottom up. Theory grounding can potentially occur in a variety of domains, and the particular domain considered here is that of language. Language is taken to be another problem space in which a system can explore and discover solutions. We argue that because theory grounding necessitates robots experiencing domain information, certain behavioral-form aspects, such as abilities to socially smile, point, follow gaze, and generate manual gestures, are necessary for robots grounding a humanoid theory of language.Christopher G. PrinceEric J. 
Mislivec2003-07-16Z2011-03-11T08:55:18Zhttp://cogprints.org/id/eprint/3056This item is in the repository with the URL: http://cogprints.org/id/eprint/30562003-07-16ZBootstrapping grounded symbols by minimal autonomous robotsIn this paper an experiment is presented in which two mobile robots develop a shared lexicon of which the meanings are grounded in the real world. The robots start with neither a lexicon nor shared meanings and play language games in which they generate new meanings and negotiate words for these meanings. The experiment tries to find the minimal conditions under which verbal communication may begin to evolve. The robots are autonomous in terms of computing and cognition, but they are otherwise far simpler than most, if not all, animals. It is demonstrated that a lexicon nevertheless can be made to emerge even though there are strong limits on the size and stability of this lexicon.Paul Vogt2000-10-17Z2011-03-11T08:54:25Zhttp://cogprints.org/id/eprint/1033This item is in the repository with the URL: http://cogprints.org/id/eprint/10332000-10-17ZQuantitative Neural Network Model of the Tip-of-the-Tongue Phenomenon Based on Synthesized Memory-Psycholinguistic-Metacognitive ApproachA new three-stage computer artificial neural network model of the tip-of-the-tongue phenomenon is proposed. Each word node is built from several interconnected, learned auto-associative two-layer neural networks, each of which represents a separate word's semantic, lexical, or phonological components. The model synthesizes memory, psycholinguistic, and metamemory approaches, bridges speech errors and naming chronometry research traditions, and can explain quantitatively many tip-of-the-tongue effects.Petro M.
Gopych2000-03-01Z2011-03-11T08:54:04Zhttp://cogprints.org/id/eprint/554This item is in the repository with the URL: http://cogprints.org/id/eprint/5542000-03-01ZProspects for in-depth story understanding by computerWhile much research on the hard problem of in-depth story understanding by computer was performed starting in the 1970s, interest shifted in the 1990s to information extraction and word sense disambiguation. Now that a degree of success has been achieved on these easier problems, I propose it is time to return to in-depth story understanding. In this paper I examine the shift away from story understanding, discuss some of the major problems in building a story understanding system, present some possible solutions involving a set of interacting understanding agents, and provide pointers to useful tools and resources for building story understanding systems.Erik T. Mueller1999-08-20Z2011-03-11T08:53:44Zhttp://cogprints.org/id/eprint/219This item is in the repository with the URL: http://cogprints.org/id/eprint/2191999-08-20ZBook Review--Ronald Cole (editor-in-chief), Joseph Mariani, Hans Uszkoreit, Annie Zaenen, and Victor Zue, eds., Survey of the State of the Art in Human Language TechnologyThis is a review of Survey of the State of the Art in Human Language Technology, edited by Ronald Cole (editor-in-chief), Joseph Mariani, Hans Uszkoreit, Annie Zaenen, and Victor Zue, published by Cambridge University Press in 1997.Varol Akman1999-04-21Z2011-03-11T08:53:44Zhttp://cogprints.org/id/eprint/218This item is in the repository with the URL: http://cogprints.org/id/eprint/2181999-04-21ZCorrelates of linguistic rhythm in the speech signalSpoken languages have been classified by linguists according to their rhythmic properties, and psycholinguists have relied on this classification to account for infants capacity to discriminate languages. 
Although researchers have measured many speech signal properties, they have failed to identify reliable acoustic characteristics for language classes. This paper presents instrumental measurements based on a consonant/vowel segmentation for eight languages. The measurements suggest that intuitive rhythm types reflect specific phonological properties, which in turn are signaled by the acoustic/phonetic properties of speech. The data support the notion of rhythm classes and also allow the simulation of infant language discrimination, consistent with the hypothesis that newborns rely on a coarse segmentation of speech. A hypothesis is proposed regarding the role of rhythm perception in language acquisition.Franck RamusMarina NesporJacques Mehler2002-01-11Z2011-03-11T08:54:52Zhttp://cogprints.org/id/eprint/2022This item is in the repository with the URL: http://cogprints.org/id/eprint/20222002-01-11ZModeling the evolution of communication: From stimulus associations to grounded symbolic associationsThis paper describes a model for the evolution of communication systems using simple syntactic rules, such as word combinations. It also focuses on the distinction between simple word-object associations and symbolic relationships. The simulation method combines the use of neural networks and genetic algorithms. The behavioral task is influenced by Savage-Rumbaugh & Rumbaughs (1978) ape language experiments. The results show that languages that use combination of words (e.g. verb-object rule) can emerge by auto-organization and cultural transmission. Neural networks are tested to see if evolved languages are based on symbol acquisition. 
The implications of this model for Deacons (1997) hypothesis on the role of symbolic acquisition for the origin of language are discussed.Angelo Cangelosi2000-11-12Z2011-03-11T08:54:26Zhttp://cogprints.org/id/eprint/1092This item is in the repository with the URL: http://cogprints.org/id/eprint/10922000-11-12ZWorking with Constrained Systems: A Review of A. K. Joshi's IJCAI-97 Research Excellence Award Acceptance LectureThis is a brief review of Joshi's award acceptance lecture published in <I>AI Magazine</I>. This review appeared in the AI Watch column in <I>Computers and Society</I>, a quarterly magazine.Joseph S. Fulda1998-11-10Z2011-03-11T08:53:44Zhttp://cogprints.org/id/eprint/214This item is in the repository with the URL: http://cogprints.org/id/eprint/2141998-11-10ZAn Analysis of English Punctuation: The Special Case of CommaPunctuation has usually been ignored by researchers in computational linguistics over the years. Recently, it has been realized that a true understanding of written language will be impossible if punctuation marks are not taken into account. This paper contains the details of a computer-aided exercise to investigate English punctuation practice for the special case of comma (the most significant punctuation mark) in a parsed corpus. The study classifies the various ``structural'' uses of the comma according to the syntax-patterns in which a comma occurs. The corpus (Penn Treebank) consists of syntactically annotated sentences with no part-of-speech tag information about individual words.Murat BayraktarBilge SayVarol Akman1998-07-29Z2011-03-11T08:53:44Zhttp://cogprints.org/id/eprint/213This item is in the repository with the URL: http://cogprints.org/id/eprint/2131998-07-29ZChoice Factors in TranslationIn this article, grammatical forms in context are viewed as processual patterns of choice activity. 
A hierarchy of choice factors is presented, using the example of the present perfect forms in parallel translations from Russian into several languages. To ensure adequacy of comparison, the notions of grammatical contextual complex and universal grammatical integral are introduced and used as the required tertium comparationis. Particular attention is devoted to the interplay of universal and language-specific features in processes of grammatical choice in translation.Vyacheslav B. Kashkin1998-07-06Z2011-03-11T08:53:44Zhttp://cogprints.org/id/eprint/209This item is in the repository with the URL: http://cogprints.org/id/eprint/2091998-07-06ZDashes as Typographical Cues for the Information StructureWe take em-dash as our sample punctuation mark and examine its usage from a discourse perspective, using sentences from well-known corpora. We particularly comment on how dashes can give hints on information structure, focus, and anaphora. Throughout the paper Discourse Representation Theory is used as a framework.Bilge SayVarol Akman2006-08-06Z2011-03-11T08:56:33Zhttp://cogprints.org/id/eprint/5045This item is in the repository with the URL: http://cogprints.org/id/eprint/50452006-08-06ZDescription Theory, LTAGs and Underspecified SemanticsAn attractive way to model
the relation between an underspecified syntactic representation and its completions is to let the underspecified representation correspond to a logical description and the completions to the models of that description. This approach, which underlies the Description Theory of (Marcus et al. 1983), has been integrated in (Vijay-Shanker 1992) with a pure unification approach to Lexicalized Tree-Adjoining Grammars (Joshi et al. 1975, Schabes 1990). We generalize Description Theory by integrating semantic information, that is, we propose to tackle both syntactic and
semantic underspecification using descriptions.Reinhard MuskensEmiel Krahmer1998-06-24Z2011-03-11T08:53:44Zhttp://cogprints.org/id/eprint/205This item is in the repository with the URL: http://cogprints.org/id/eprint/2051998-06-24ZThe evolution of a lexicon and meaning in robotic agents through self-organizationThis paper discusses interdisciplinary experiments, combining robotics and evolutionary computational linguistics. The goal of the experiments is to investigate if robotic agents can originate a language, in particular a lexicon. In the experiments two robots engage in a series of so-called language games. Starting from the assumption that the robots know how to communicate and are able to detect some sensory information from the environment, the agents ground conceptual meaning and develop a lexicon. The experiments show that the robots are able to form a shared communication system. The paper investigates the influence of using non-linguistic information in the formation of the lexicon, which takes the form of pointing (1) to indicate the topic of the language game, and (2) to give feedback on the outcome of the game.Paul Vogt1998-11-12Z2011-03-11T08:53:44Zhttp://cogprints.org/id/eprint/216This item is in the repository with the URL: http://cogprints.org/id/eprint/2161998-11-12ZAn Information-Based Treatment of Punctuation in Discourse Representation TheoryPunctuation has so far attracted attention within the linguistics community mostly from a syntactic perspective. In this paper, we give a preliminary account of the information-based aspects of punctuation, drawing our points from assorted, naturally occurring sentences. We present our formal models of these sentences and the semantic contributions of punctuation marks. 
Our formalism is a simplified analogue of an extension--due to Nicholas Asher--of Discourse Representation Theory.Bilge SayVarol Akman1998-03-24Z2011-03-11T08:53:43Zhttp://cogprints.org/id/eprint/189This item is in the repository with the URL: http://cogprints.org/id/eprint/1891998-03-24ZThe interaction between numerals and nounsThis paper is a descriptive survey of the principal phenomena surrounding cardinal numerals in attribution to nouns, with some concentration on European languages, but within a world-wide perspective. The paper is focussed on describing the syntagmatic distribution and the internal structure of numerals. By contrast, the important topic of the paradigmatic context of numerals, that is how their structure and behavior related to those of quantifiers, determiners, adjectives, and nouns, does not receive systematic discussion here, although many relevant comments are made in passing. A further necessary limitation in scope is the exclusion of forms which are only marginally cardinal numerals, if at all, such as English both, dozen, fourscore, pair, triple and their counterparts in other languages.Jim Hurford1998-05-22Z2011-03-11T08:54:10Zhttp://cogprints.org/id/eprint/665This item is in the repository with the URL: http://cogprints.org/id/eprint/6651998-05-22ZModels of Speaking (To Their Amazement) Meet Speech-Synchronized GesturesThe chapters in this volume have generally accepted the argument that speech-gesture integration is basic to language use. But what explains the integration itself? I will attempt to make the case that it can be understood with the concept of a `growth point' or GP (McNeill & Duncan this volume) It is called a GP since it is a theoretical unit in which principles that explain mental growth -- differentiation, internalization, dialectic, and reorganization -- apply to realtime utterance generation by adults (and children). 
It is also called a GP since it is meant to be the initial form of a thinking-while-speaking unit out of which a dynamic process of organization emerges. The emergence unpacks the GP into a surface utterance and gesture that articulates its meaning implications.David McNeill1999-06-27Z2011-03-11T08:54:02Zhttp://cogprints.org/id/eprint/544This item is in the repository with the URL: http://cogprints.org/id/eprint/5441999-06-27ZThought as word dynamicsA Hebbian model for speech generation opens a number of paths. A cross-linguistic scheme of functional relationships (inspired by Aristotle) dispenses with distraction by the "parts of speech" distinctions, while bridging the gap between "contents" and "structure" words. A gradient model identifies emotional and rational dynamics and shows speech generation as a process where a speaker's dissatisfaction gets minimised.Paul Jorion1998-06-16Z2011-03-11T08:53:58Zhttp://cogprints.org/id/eprint/460This item is in the repository with the URL: http://cogprints.org/id/eprint/4601998-06-16ZThe Use of Situation Theory in Context ModelingAt the heart of natural language processing is the understanding of context dependent meanings. This paper presents a preliminary model of formal contexts based on situation theory. It also gives a worked-out example to show the use of contexts in lifting, i.e., how propositions holding in a particular context transform when they are moved to another context. This is useful in NLP applications where preserving meaning is a desideratum.Varol AkmanMehmet Surav1998-06-16Z2011-03-11T08:53:43Zhttp://cogprints.org/id/eprint/198This item is in the repository with the URL: http://cogprints.org/id/eprint/1981998-06-16ZCurrent Approaches to Punctuation in Computational LinguisticsSome recent studies in computational linguistics have aimed to take advantage of various cues presented by punctuation marks. 
This short survey is intended to summarise these research efforts and additionally, to outline a current perspective for the usage and functions of punctuation marks. We conclude by presenting an information-based framework for punctuation, influenced by treatments of several related phenomena in computational linguistics.Bilge SayVarol Akman2011-12-16T00:11:43Z2011-12-16T00:11:43Zhttp://cogprints.org/id/eprint/7709This item is in the repository with the URL: http://cogprints.org/id/eprint/77092011-12-16T00:11:43ZThe Many Functions of Discourse Particles: A Computational Model of Pragmatic InterpretationWe present a connectionist model for the interpretation of discourse
particles in real dialogues that is based on neuronal principles of categorization (categorical perception, prototype formation, contextual interpretation). It can be shown that discourse particles operate just like other morphological and lexical items with respect to interpretation processes. The description proposed locates discourse particles in an elaborate model of communication which incorporates many different aspects of the communicative situation. We therefore also attempt to explore the content of the category discourse particle. We present a detailed analysis of the meaning assignment problem and show that 80%–90% correctness for unseen discourse particles can be reached with the feature analysis provided. Furthermore, we show that ‘analogical transfer’ from one discourse particle to another is facilitated if prototypes are computed and used as the basis for generalization. We conclude that the interpretation processes which are a part of the human cognitive system are very similar with respect to different linguistic items. However, the analysis of discourse particles shows clearly that any explanatory theory of language
needs to incorporate a theory of communication processes.Gabriele Schelergscheler@gmail.comKerstin Fischer1998-06-16Z2011-03-11T08:53:58Zhttp://cogprints.org/id/eprint/462This item is in the repository with the URL: http://cogprints.org/id/eprint/4621998-06-16ZSituated Nonmonotonic Temporal Reasoning with BABY-SITAfter a review of situation theory and previous attempts at `computational' situation theory, we present a new programming environment, BABY-SIT, which is based on situation theory. We then demonstrate how problems requiring formal temporal reasoning can be solved in this framework. Specifically, the Yale Shooting Problem, which is commonly regarded as a canonical problem for nonmonotonic temporal reasoning, is implemented in BABY-SIT using Yoav Shoham's causal theories.Erkan TinVarol Akman2006-02-05Z2011-03-11T08:56:20Zhttp://cogprints.org/id/eprint/4715This item is in the repository with the URL: http://cogprints.org/id/eprint/47152006-02-05ZCombining Montague Semantics and Discourse RepresentationThis paper embeds the core part of Discourse Representation Theory in the classical theory of types plus a few simple axioms that allow the theory to express key facts about variables and assignments on the object level of the logic. It is shown how the embedding can be used to combine core analyses of natural language phenomena in Discourse Representation Theory with analyses that can be obtained in Montague Semantics.Reinhard Muskens1998-06-15Z2011-03-11T08:53:42Zhttp://cogprints.org/id/eprint/169This item is in the repository with the URL: http://cogprints.org/id/eprint/1691998-06-15ZThe Dilemma of Saussurean CommunicationA Saussurean communication system exists when an entire communicating population uses a single "language" that maps states unambiguously onto symbols and then back into the original states. 
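The defining round-trip property of a Saussurean system (states map unambiguously onto symbols and back) can be checked mechanically; the states and symbols below are invented for illustration:

```python
def is_saussurean(send, receive):
    """True iff encoding is unambiguous and decoding inverts it for every state."""
    unambiguous = len(set(send.values())) == len(send)   # no two states share a symbol
    round_trip = all(receive.get(send[s]) == s for s in send)
    return unambiguous and round_trip

send = {"food": "f", "danger": "d"}        # state -> symbol
receive = {"f": "food", "d": "danger"}     # symbol -> state
print(is_saussurean(send, receive))        # True
print(is_saussurean(send, {"f": "food"}))  # False: 'd' never decodes back
```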
This paper describes a number of simulations performed with a genetic algorithm to investigate the conditions necessary for such communication systems to evolve. The first simulation shows that Saussurean communication evolves in the simple case where direct selective pressure is placed on individuals to be both good transmitters and good receivers. The second simulation demonstrates that, in the more realistic case where selective pressure is only placed on doing well as a receiver, Saussurean communication fails to evolve. Two methods, inspired by research on the Prisoner's Dilemma, are used to attempt to solve this problem. The third simulation shows that, even in the absence of selective pressure on transmission, Saussurean communication can evolve if individuals interact multiple times with the same communication partner and are given the ability to respond differentially based on past interaction. In the fourth simulation, spatially organized populations are used, and it is shown that this allows Saussurean communication to evolve through kin selection.Michael Oliphant1998-07-06Z2011-03-11T08:53:44Zhttp://cogprints.org/id/eprint/208This item is in the repository with the URL: http://cogprints.org/id/eprint/2081998-07-06ZInformation-Based Aspects of PunctuationWe offer a preliminary account of the information-based aspects of punctuation marks. We give our initial treatment within the Discourse Representation Theory and its segmented version. 
We hypothesize that this work will be useful in classifying the informational contributions of punctuation marks and bringing them to bear on the semantic characterization of written discourse.Bilge SayVarol Akman1998-07-07Z2011-03-11T08:53:44Zhttp://cogprints.org/id/eprint/210This item is in the repository with the URL: http://cogprints.org/id/eprint/2101998-07-07ZAn Information-Based Treatment of PunctuationPunctuation marks have recently attracted attention within the linguistics community mostly from a syntactic perspective. In this paper, we aim to give a preliminary account of information-based aspects of punctuation marks, drawing our points from examples and links with related phenomena such as intonation. We give our initial treatment within the Discourse Representation Theory.Bilge SayVarol Akman1998-06-19Z2011-03-11T08:53:44Zhttp://cogprints.org/id/eprint/199This item is in the repository with the URL: http://cogprints.org/id/eprint/1991998-06-19ZInformation-Oriented Computation with BABY-SITWhile situation theory and situation semantics provide an appropriate framework for a realistic model-theoretic treatment of natural language, serious thinking on their `computational' aspects has only recently started. Existing proposals mainly offer a Prolog- or Lisp-like programming environment with varying degrees of divergence from the ontology of situation theory. In this paper, we introduce a computational medium (called BABY-SIT) based on situations. 
The primary motivation underlying BABY-SIT is to facilitate the development and testing of programs in domains ranging from linguistics to artificial intelligence in a unified framework built upon situation-theoretic constructs.Erkan TinVarol Akman1998-06-24Z2011-03-11T08:53:59Zhttp://cogprints.org/id/eprint/478This item is in the repository with the URL: http://cogprints.org/id/eprint/4781998-06-24ZLanguage polygenesis: A probabilistic modelMonogenesis of language is widely accepted, but the conventional argument seems to be mistaken; a simple probabilistic model shows that polygenesis is likely. Other prehistoric inventions are discussed, as are problems in tracing linguistic lineages. Language is a system of representations; within such a system, words can evoke complex and systematic responses. Along with its social functions, language is important to humans as a mental instrument. Indeed, the invention of language, that is, the accumulation of symbols to represent emotions, objects, and acts, may be the most important event in human evolution, because so many developments follow from it. For example, Edward Sapir speculated that some embryonic form of language must have been available to early man to help him fashion tools from stone (Sapir, 1921). Sophisticated biface stone tools date to early Homo erectus some 1.5 million years ago, suggesting a similar age for language. This paper considers whether the invention of language occurred at only one prehistoric site or at several sites. In other words, did language emerge by monogenesis or polygenesis? Early thinkers believed in monogenesis, against a background of divine creation. Perhaps the best known account is the biblical story of Adam giving names to plants and animals in the Garden of Eden. Similar legends are found among many peoples. Modern linguists too assume monogenesis, but on probabilistic grounds (see, for instance, Southworth and Daswani, 1974, p. 314).
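The monogenesis-versus-polygenesis question just posed can be made quantitative. Under a simple binomial model with n independent sites and per-site invention probability p, the chance of invention at two or more sites is 1 - (1-p)^n - n*p*(1-p)^(n-1); a sketch with purely illustrative numbers:

```python
def prob_at_least_two(p, n):
    """P(invention at >= 2 of n independent sites), simple binomial model."""
    none = (1 - p) ** n
    exactly_one = n * p * (1 - p) ** (n - 1)
    return 1 - none - exactly_one

# Fixating on two particular sites gives p*p, which is tiny...
print(prob_at_least_two(0.001, 2))     # = p**2, i.e. 1e-06 up to float rounding
# ...but over many candidate sites the probability is substantial (~0.96 here).
print(prob_at_least_two(0.001, 5000))
```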
The argument seems to be that the invention of language is an extremely unlikely event, because symbolization involves abstraction and requires synchronized insight by several individuals; therefore, the probability of occurrence at more than one site must be vanishingly small. We have found no explicit quantitative treatment of this question in the literature, but the underlying logic has to be the multiplication of probabilities. If p is small at one site, then p × p for two sites is smaller still, and so on. This reasoning is false, as we show here. The fallacy lies in the focus on two particular sites rather than consideration of all pairs of sites.David A. FreedmanWilliam Wang1998-06-17Z2011-03-11T08:53:58Zhttp://cogprints.org/id/eprint/464This item is in the repository with the URL: http://cogprints.org/id/eprint/4641998-06-17ZSteps toward Formalizing ContextThe importance of contextual reasoning is emphasized by various researchers in AI. (A partial list includes John McCarthy and his group, R. V. Guha, Yoav Shoham, Giuseppe Attardi and Maria Simi, and Fausto Giunchiglia and his group.) Here, we survey the problem of formalizing context and explore what is needed for an acceptable account of this abstract notion.Varol AkmanMehmet Surav2001-11-18Z2011-03-11T08:54:49Zhttp://cogprints.org/id/eprint/1897This item is in the repository with the URL: http://cogprints.org/id/eprint/18972001-11-18ZSubsymbolic Case-Role Analysis
of Sentences with Embedded ClausesA distributed neural network model called SPEC for processing sentences with recursive relative clauses is described. The model is based on separating the tasks of segmenting the input word sequence into clauses, forming the case-role representations, and keeping track of the recursive embeddings into different modules. The system needs to be trained only with the basic sentence constructs, and it generalizes not only to new instances of familiar relative clause structures, but to novel structures as well. SPEC exhibits plausible memory degradation as the depth of the center embeddings increases, its memory is primed by earlier constituents, and its performance is aided by semantic constraints between the constituents. The ability to process structure is largely due to a central executive network that monitors and controls the execution of the entire system. This way, in contrast to earlier subsymbolic systems, parsing is modeled as a controlled high-level process rather than one based on automatic reflex responses.
Risto Miikkulainen

This item is in the repository with the URL: http://cogprints.org/id/eprint/885 (deposited 2000-07-21)

A Timing Model for Fast French
Models of speech timing are of both fundamental and applied interest. At the fundamental level, the prediction of time periods occupied by syllables and segments is required for general models of speech prosody and segmental structure. At the applied level, complete models of timing are an essential component of any speech synthesis system.
Previous research has established that a large number of factors influence various levels of speech timing. Statistical analysis and modelling can identify order of importance and mutual influences between such factors. In the present study, a three-tiered model was created by a modified step-wise statistical procedure. It predicts the temporal structure of French, as produced by a single, highly fluent speaker at a fast speech rate (100 phonologically balanced sentences, hand-scored in the acoustic signal). The first tier models segmental influences due to phoneme type and contextual interactions between phoneme types. The second tier models syllable-level influences of lexical vs. grammatical status of the containing word, presence of schwa and the position within the word. The third tier models utterance-final lengthening.
The complete segmental-syllabic model correlated with the original corpus of 1204 syllables at an overall r = 0.846. Residuals were normally distributed. An examination of subsets of the data set revealed some variation in the closeness of fit of the model.
The results are considered to be useful for an initial timing model, particularly in a speech synthesis context. However, further research is required to extend the model to other speech rates and to examine inter-speaker variability in greater detail.
Eric Keller, Brigitte Zellner

This item is in the repository with the URL: http://cogprints.org/id/eprint/207 (deposited 1998-06-26)

Book Review -- Hans Kamp and Uwe Reyle, From Discourse to Logic: Introduction to Model-theoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory
This is a review of From Discourse to Logic: Introduction to Model-theoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory, by Hans Kamp and Uwe Reyle, published by Kluwer Academic Publishers in 1993.
Varol Akman

This item is in the repository with the URL: http://cogprints.org/id/eprint/331 (deposited 1998-06-19)

Situated Modeling of Epistemic Puzzles
Situation theory is a mathematical theory of meaning introduced by Jon Barwise and John Perry. It has evoked great theoretical interest and motivated the framework of a few `computational' systems. PROSIT is the pioneering work in this direction. Unfortunately, there is a lack of real-life applications on these systems and this study is a preliminary attempt to remedy this deficiency. Here, we solve a group of epistemic puzzles using the constructs provided by PROSIT.
Murat Ersan, Varol Akman

This item is in the repository with the URL: http://cogprints.org/id/eprint/467 (deposited 1998-06-19)

Towards Situation-Oriented Programming Languages
Recently, there have been some attempts towards developing programming languages based on situation theory. These languages employ situation-theoretic constructs with varying degrees of divergence from the ontology of the theory.
In this paper, we review three of these programming languages.
Erkan Tin, Varol Akman, Murat Ersan

This item is in the repository with the URL: http://cogprints.org/id/eprint/472 (deposited 1998-06-24)

Modeling Context with Situations
The issue of context arises in assorted areas of Artificial Intelligence. Although its importance is realized by various researchers, there is not much work towards a useful formalization. In this paper, we will present a preliminary model (based on Situation Theory) and give examples to show the use of context in various fields, and the advantages gained by the acceptance of our proposal.
Mehmet Surav, Varol Akman

This item is in the repository with the URL: http://cogprints.org/id/eprint/204 (deposited 1998-06-24)

Situations and Computation: An Overview of Recent Research
Serious thinking about the computational aspects of situation theory is just starting. There have been some recent proposals in this direction (viz. PROSIT and ASTL), with varying degrees of divergence from the ontology of the theory. We believe that a programming environment incorporating bona fide situation-theoretic constructs is needed and describe our very recent BABY-SIT implementation. A detailed critical account of PROSIT and ASTL is also offered in order to compare our system with these pioneering and influential frameworks.
Erkan Tin, Varol Akman

This item is in the repository with the URL: http://cogprints.org/id/eprint/200 (deposited 1998-06-19)

Computational Situation Theory
Situation theory has been developed over the last decade and various versions of the theory have been applied to a number of linguistic issues. However, not much work has been done in regard to its computational aspects.
In this paper, we review the existing approaches towards `computational situation theory' with considerable emphasis on our own research.
Erkan Tin, Varol Akman

This item is in the repository with the URL: http://cogprints.org/id/eprint/4708 (deposited 2006-01-21)

Categorial Grammar and Discourse Representation Theory
In this paper it is shown how simple texts that can be parsed
in a Lambek Categorial Grammar can also automatically be provided
with a semantics in the form of a Discourse Representation Structure
in the sense of Kamp [1981]. The assignment of meanings to texts
uses the Curry-Howard-Van Benthem correspondence.
Reinhard Muskens

This item is in the repository with the URL: http://cogprints.org/id/eprint/884 (deposited 2000-07-21)

Pauses and the temporal structure of speech
Natural-sounding speech synthesis requires close control over the temporal structure of the speech flow. This includes a full predictive scheme for the durational structure, in particular the prolongation of final syllables of lexemes, as well as for the pausal structure in the utterance. In this chapter, a description of the temporal structure and a summary of the numerous factors that modify it are presented. In the second part, predictive schemes for the temporal structure of speech ("performance structures") are introduced, and their potential for characterising the overall prosodic structure of speech is demonstrated.
Brigitte Zellner

This item is in the repository with the URL: http://cogprints.org/id/eprint/202 (deposited 1998-06-23)

Situated Processing of Pronominal Anaphora
We describe a novel approach to the analysis of pronominal anaphora in Turkish. A computational medium which is based on situation theory is used as our implementation tool. The task of resolving pronominal anaphora is demonstrated in this environment which employs situation-theoretic constructs for processing.
Erkan Tin, Varol Akman

This item is in the repository with the URL: http://cogprints.org/id/eprint/203 (deposited 1998-06-23)

BABY-SIT: A Computational Medium Based on Situations
While situation theory and situation semantics provide an appropriate framework for a realistic model-theoretic treatment of natural language, serious thinking on their `computational' aspects has just started.
Existing proposals mainly offer a Prolog- or Lisp-like programming environment with varying degrees of divergence from the ontology of situation theory. In this paper, we introduce a computational medium (called BABY-SIT) based on situations. The primary motivation underlying BABY-SIT is to facilitate the development and testing of programs in domains ranging from linguistics to artificial intelligence in a unified framework built upon situation-theoretic constructs.
Erkan Tin, Varol Akman

This item is in the repository with the URL: http://cogprints.org/id/eprint/1558 (deposited 2001-06-14)

A Contribution to Reference Semantics of Spatial Prepositions: The Visualization Problem and its Solution in VITRA
The cognitive function of mental images with respect to the referential aspect of language is examined and used in the listener model ANTLIMA of the natural language system SOCCER. An operational realization of the reference relation used to recognize instances of spatial concepts in the results of a vision system and also to visualize locative expressions is presented and compared to A. Herskovits' analysis of the semantics of spatial prepositions.
Jörg R.J. Schirra

This item is in the repository with the URL: http://cogprints.org/id/eprint/4704 (deposited 2006-01-21)

Anaphora and the Logic of Change
This paper shows how the dynamic interpretation of natural language introduced in work by Hans Kamp and Irene Heim can be modeled in classical type logic. This provides a synthesis between Richard Montague's theory of natural language semantics and the work by Kamp and Heim.
Reinhard Muskens

This item is in the repository with the URL: http://cogprints.org/id/eprint/1896 (deposited 2001-11-18)

Natural Language Processing with Modular Neural Networks and Distributed Lexicon
An approach to connectionist natural language processing is proposed, which is based on hierarchically organized modular Parallel Distributed Processing (PDP) networks and a central lexicon of distributed input/output representations. The modules communicate using these representations, which are global and publicly available in the system. The representations are developed automatically by all networks while they are learning their processing tasks. The resulting representations reflect the regularities in the subtasks, which facilitates robust processing in the face of noise and damage, supports improved generalization, and provides expectations about possible contexts. The lexicon can be extended by cloning new instances of the items, that is, by generating a number of items with known processing properties and distinct identities. This technique combinatorially increases the processing power of the system. The recurrent FGREP module, together with a central lexicon, is used as a basic building block in modeling higher-level natural language tasks. A single module is used to form case-role representations of sentences from word-by-word sequential natural language input. A hierarchical organization of four recurrent FGREP modules (the DISPAR system) is trained to produce fully expanded paraphrases of script-based stories, where unmentioned events and role fillers are inferred.
Risto Miikkulainen, Michael G. Dyer

This item is in the repository with the URL: http://cogprints.org/id/eprint/443 (deposited 1998-04-30)

Review of Rosenfield's "The Invention of Memory"
Evidence collected by Bartlett, Collingwood, James, Bransford, Jenkins, and Sacks argues against the memory-as-stored-structures hypothesis, the keystone of expert systems and cognitive modeling research.
William J. Clancey

This item is in the repository with the URL: http://cogprints.org/id/eprint/197 (deposited 1998-06-15)

Rethinking the language bottleneck: Why don't animals learn to communicate?
While most work on the evolution of language has been centered on the evolution of syntax, my focus in this paper is instead on more basic features that separate human communication from the systems of communication used by other animals. In particular, I argue that human language is the only existing system of learned arbitrary reference. While innate communication systems are, by definition, directly transmitted genetically, the transmission of a learned system must be indirect. Learners must acquire the system by being exposed to its use in the community. Although it is reasonable to assume that a learner has access to the utterances that are produced, it is less clear how accessible the meaning is that the utterance is intended to convey. This is particularly problematic if the system of communication is symbolic -- where form and meaning are linked in a purely conventional way. Given this, I propose that the ability to transmit a learned symbolic system of communication from one generation to the next represents a key milestone in the evolution of language.
Michael Oliphant

This item is in the repository with the URL: http://cogprints.org/id/eprint/196 (deposited 1998-06-15)

The learning barrier: Moving from innate to learned systems of communication
Human language is a unique ability. It sits apart from other systems of communication in two striking ways: it is syntactic, and it is learned. While most approaches to the evolution of language have focused on the evolution of syntax, this paper explores the computational issues that arise in shifting from a simple innate communication system to an equally simple one that is learned.
Associative network learning within an observational learning paradigm is used to explore the computational difficulties involved in establishing and maintaining a simple learned communication system. Because Hebbian learning is found to be sufficient for this task, it is proposed that the basic computational demands of learning are unlikely to account for the rarity of even simple learned communication systems. Instead, it is the problem of *observing* that is likely to be central -- in particular the problem of determining what meaning a signal is intended to convey.
Michael Oliphant
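The observational-learning setup in this last abstract can be illustrated with a minimal sketch: a learner watches community signal-meaning usage events (meanings are assumed fully observable here, which is exactly the assumption the abstract questions) and accumulates a Hebbian association matrix. The names, sizes, and setup below are illustrative assumptions, not Oliphant's actual implementation.

```python
# Hedged sketch: Hebbian observational learning of a simple signal-meaning
# system. N, perm, observe, and interpret are illustrative choices only.
import random

N = 5  # number of distinct signals and meanings (assumed)

# The community's system: signal s conveys meaning perm[s] (a fixed mapping).
random.seed(0)
perm = list(range(N))
random.shuffle(perm)

# Hebbian association matrix: w[s][m] accumulates signal-meaning co-occurrences.
w = [[0.0] * N for _ in range(N)]

def observe(signal, meaning):
    """Hebbian update: strengthen the association of the observed pair."""
    w[signal][meaning] += 1.0

def interpret(signal):
    """Comprehension: the meaning most strongly associated with the signal."""
    return max(range(N), key=lambda m: w[signal][m])

# Observational learning: the learner sees community usage events,
# with the intended meaning assumed to be visible alongside each signal.
for _ in range(200):
    s = random.randrange(N)
    observe(s, perm[s])

# After exposure, the learner reproduces the community's system.
learned = [interpret(s) for s in range(N)]
print(learned == perm)
```

Under this idealization simple Hebbian updates suffice, which matches the abstract's point: the hard part is not the learning rule but the observability of meanings, removed here by handing the learner `perm[s]` directly.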