---
abstract: "Based on data from a large-scale experiment with human subjects, we conclude that the logarithm of the probability of guessing a word in context (its unpredictability) depends linearly on word length. This result holds for both poetry and prose, even though in the prose experiments the subjects do not know the length of the omitted word. We hypothesize that this effect reflects a tendency of natural language to maintain an even information rate."
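# A minimal sketch of the relation stated in the abstract (notation ours, not
# from the record): if the language maintains a roughly even information rate
# of r bits per letter, a word of \ell letters carries about r\ell bits, so its
# unpredictability is, in LaTeX notation,
#   U(\ell) = -\log_2 P(\mathrm{guess}) \approx a + r\,\ell ,
# i.e., the log-probability of a correct guess decreases linearly with word
# length, with slope r and an intercept a absorbing context effects.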
altloc:
- http://www.jip.ru/2006/229-236-2006.pdf
chapter: ~
commentary: ~
commref: ~
confdates: ~
conference: ~
confloc: ~
contact_email: ~
creators_id:
- manin@pobox.com
creators_name:
- family: Manin
  given: Dmitrii
  honourific: ''
  lineage: ''
date: 2006-12-26
date_type: completed
datestamp: 2007-11-13 00:51:03
department: ~
dir: disk0/00/00/58/17
edit_lock_since: ~
edit_lock_until: ~
edit_lock_user: ~
editors_id: []
editors_name: []
eprint_status: archive
eprintid: 5817
fileinfo: /style/images/fileicons/application_pdf.png;/5817/1/unpred_article_e.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: pub
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: 'Natural language, information theory, information rate, entropy, experiment, word guessing'
lastmod: 2011-03-11 08:57:00
latitude: ~
longitude: ~
metadata_visibility: show
note: The text is somewhat extended compared to the published version.
number: 3
pagerange: 229-236
pubdom: FALSE
publication: Journal of Information Processes
publisher: Keldysh Institute of Applied Mathematics (KIAM) RAS
refereed: TRUE
referencetext: |
  \bibitem{Shan51}{Shannon~C.E. Prediction and entropy of printed English. {\it Bell System Technical Journal}, 1951, vol.~30, pp.~50--64.}
  \bibitem{Shan48}{Shannon~C.E. A mathematical theory of communication. {\it Bell System Technical Journal}, 1948, vol.~27, pp.~379--423.}
  \bibitem{BurLick55}{Burton~N.G., Licklider~J.C.R. Long-range constraints in the statistical structure of printed English. {\it American Journal of Psychology}, 1955, vol.~68, no.~4, pp.~650--653.}
  \bibitem{Fon}{F\'onagy~I. Informationsgehalt von Wort und Laut in der Dichtung. In: {\it Poetics. Poetyka. Поэтика}. Warszawa: Pa\'nstwowe Wydawnictwo Naukowe, 1961, pp.~591--605.}
  \bibitem{Kolm65}{Kolmogorov~A. Three approaches to the quantitative definition of information. {\it Problems Inform. Transmission}, 1965, vol.~1, pp.~1--7.}
  \bibitem{Yaglom2}{Yaglom~A.M., Yaglom~I.M. {\it Probability and information}. Dordrecht: Reidel, 1983.}
  \bibitem{CK78}{Cover~T.M., King~R.C. A convergent gambling estimate of the entropy of English. {\it IEEE Transactions on Information Theory}, 1978, vol.~24, no.~4, pp.~413--421.}
  \bibitem{Moradi98}{Moradi~H., Roberts~J.A., Grzymala-Busse~J.W. Entropy of English text: Experiments with humans and a machine learning system based on rough sets. {\it Inf. Sci.}, 1998, vol.~104, no.~1--2, pp.~31--47.}
  \bibitem{Paisley66}{Paisley~W.J. The effects of authorship, topic structure, and time of composition on letter redundancy in English text. {\it J. Verbal Learning and Verbal Behav.}, 1966, vol.~5, pp.~28--34.}
  \bibitem{BrownEtAl92}{Brown~P.F., Della~Pietra~V.J., Mercer~R.L., Della~Pietra~S.A., Lai~J.C. An estimate of an upper bound for the entropy of English. {\it Comput. Linguist.}, 1992, vol.~18, no.~1, pp.~31--40.}
  \bibitem{Teahan96}{Teahan~W.J., Cleary~J.G. The entropy of English using PPM-based models. In: {\it DCC '96: Proceedings of the Conference on Data Compression}, Washington: IEEE Computer Society, 1996, pp.~53--62.}
  \bibitem{LM1}{Leibov~R.G., Manin~D.Yu. An attempt at experimental poetics [tentative title]. To be published in: {\it Proc. Tartu Univ.} [in Russian], Tartu: Tartu University Press.}
  \bibitem{ChurchMercer93}{Church~K.W., Mercer~R.L. Introduction to the special issue on computational linguistics using large corpora. {\it Comput. Linguist.}, 1993, vol.~19, no.~1, pp.~1--24.}
  \bibitem{SG96}{Sch\"urmann~T., Grassberger~P. Entropy estimation of symbol sequences. {\it Chaos}, 1996, vol.~6, no.~3, pp.~414--427.}
  \bibitem{FreqDict}{Sharoff~S. The frequency dictionary for Russian. {\it http://www.artint.ru/projects/frqlist/frqlist-en.asp}}
  \bibitem{HockJoseph}{Hock~H.H., Joseph~B.D. Language History, Language Change, and Language Relationship. Berlin--New York: Mouton de Gruyter, 1996.}
  \bibitem{GenzelCharniak}{Genzel~D., Charniak~E. Entropy rate constancy in text. In: {\it Proc. 40th Annual Meeting of ACL}, 2002, pp.~199--206.}
  \bibitem{Jaeger06}{Anonymous authors (paper under review). Speakers optimize information density through syntactic reduction. 2006, to be published.}
  \bibitem{AylettTurk}{Aylett~M., Turk~A. The Smooth Signal Redundancy Hypothesis: A Functional Explanation for Relationships between Redundancy, Prosodic Prominence, and Duration in Spontaneous Speech. {\it Language and Speech}, 2004, vol.~47, no.~1, pp.~31--56.}
relation_type: []
relation_uri: []
reportno: ~
rev_number: 29
series: ~
source: ~
status_changed: 2007-11-13 00:51:03
subjects:
- comp-sci-lang
- ling-comput
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: Experiments on predictability of word in context and information rate in natural language
type: journalp
userid: 7373
volume: 6