Natural Language Processing
Jan Daciuk
Department of Intelligent Interactive Systems, ETI Faculty, Gdańsk University of Technology
May 5, 2014

Rules for Receiving Credits

There is one grade for the whole course, covering both the lecture and the lab. Each part accounts for 50% of the points, and at least 50% of the points must be obtained from each part.

Lecture: a test. Each student gets an individual, randomly chosen set of multiple-choice questions with 4 possible answers each. Only one answer is correct. There are no negative points for choosing incorrect answers.

% of points   grade
96-100        5.5
90-95         5
80-89         4.5
70-79         4
60-69         3.5
50-59         3
0-49          2

Additional materials (e.g. course materials) can be used during the test, but not the help of other people (including other students). The result is scaled to 45 points. Up to 5% of points can be received for attending lectures.

Laboratory: the grades for individual exercises are summed and then scaled so that the maximum is 50 points. Each exercise is graded on the traditional grade scale. Exercises are handed in during lab classes. There are penalties for delays: half a grade for each week in which lab classes take place.

Jan Daciuk, DIIS, ETI, GUT Natural Language Processing 1. Introduction, Segmentation

Bibliography

1. Daniel Jurafsky, James Martin, Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Second Edition, Prentice Hall, 2008.
2. Christopher D. Manning, Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 2000.
3. Emmanuel Roche, Yves Schabes, Finite-State Language Processing, MIT Press, 1997.
4. The quarterly journal Computational Linguistics and the proceedings of conferences organized by the ACL (Association for Computational Linguistics). Available from the ACL Anthology: http://acl.ldc.upenn.edu/

Additional Bibliography: Polish Language

1. Alicja Nagórko, Zarys gramatyki polskiej, Wydawnictwo Naukowe PWN, Warszawa, 1996.
2. Zygmunt Saloni, Marcin Woliński, Robert Wołosz, Włodzimierz Gruszczyński, Danuta Skowrońska, Słownik gramatyczny języka polskiego, 2nd edition, Warszawa, 2012.
3. Gramatyka współczesnego języka polskiego. Morfologia, edited by Renata Grzegorczykowa, Roman Laskowski and Henryk Wróbel, Volumes 1 and 2, Wydawnictwo Naukowe PWN, Warszawa, 1998.
4. Mirosław Bańko, Wykłady z polskiej fleksji, Wydawnictwo Naukowe PWN, Warszawa, 2002.
5. Zygmunt Saloni, Czasownik polski. Odmiana. Słownik, Wiedza Powszechna, Warszawa, 2001.
6. Stanisław Mędak, Słownik form koniugacyjnych czasowników polskich, Universitas, Kraków, 2004.
7. Stanisław Mędak, Słownik odmiany rzeczowników polskich, Universitas, Kraków, 2003.

Natural Language

Natural language is a language that emerged from historical development. It is geographically and socially varied. It can be contrasted on the one hand with artificial languages (e.g. Esperanto), and on the other hand with formal and programming languages. It differs from artificial languages by the polysemy of its expressions, and by the fact that it undergoes constant change.
Source: Encyklopedia językoznawstwa ogólnego (shortened), Ossolineum, 1993

Polish, English, Turkish, Arabic and Chinese are examples of natural languages. C++ and the language of first-order predicate calculus are not.

Natural Language Processing

Natural language processing is text processing that makes use of specific properties of natural language.

Counting characters in a text is not natural language processing. Counting sentences is.

Applications of Natural Language Processing

Spelling correction
Machine translation
Document retrieval
Question answering
Running a program/system
Finding authorship
Summarization
Text classification
...

Natural language is a natural format for storing information and for communication between people.

Levels of Natural Language Processing

Pragmatics
Semantics
Syntax
Lexicon
Segmentation

Corpora

A corpus can be tagged or untagged. The markup varies. The latest fashion is XML.

Corpora play a key role in modern natural language processing systems. They make it possible to gather various statistics, and they also make it possible to use machine learning. Tagged corpora are much more useful than untagged ones.

The best known corpora for English are the Wall Street Journal corpus (WSJ) and the British National Corpus (BNC). For Polish, the canonical corpus is the IPI PAN Corpus, available at http://korpus.pl/index.php?page=download

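As a rough illustration of why tagged corpora are more useful than untagged ones, the sketch below gathers simple statistics from a toy corpus. The "word/TAG" format, the tag names and the sample text are assumptions made for this example only; real corpora such as the BNC or the IPI PAN Corpus use their own, richer markup.

```python
from collections import Counter

# Hypothetical toy corpus in a simple "word/TAG" format (an assumption,
# not the format of any corpus named on the slide).
TAGGED = ("the/DET dog/NOUN barks/VERB ./PUNCT "
          "loud/ADJ barks/NOUN annoy/VERB me/PRON ./PUNCT")

def parse_tagged(text):
    """Split 'word/TAG' items into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(TAGGED)

# Statistics available from any corpus, tagged or not.
word_counts = Counter(word for word, _ in pairs)

# Statistics that need the tags: overall tag frequencies and,
# for each word, how often it occurs with each tag.
tag_counts = Counter(tag for _, tag in pairs)
tags_per_word = {}
for word, tag in pairs:
    tags_per_word.setdefault(word, Counter())[tag] += 1

print(word_counts.most_common(3))
print(tags_per_word["barks"])
```

The per-word tag counts show that "barks" occurs both as a noun and as a verb, which is the kind of information an untagged corpus cannot provide directly.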
Text Segmentation (1/3)

1. What is a word? Is it a sequence of letters? Let us look at examples in Polish:
   cóżeś mi uczynił (what have you done to me)
   żebyś zdechł (may you drop dead)
   obym dożył tej chwili (may I live to see that moment)
2. Apostrophes:
   in English: it's a dog, dog's bone, dog's crazy, dogs' house
   in French: qu'est-ce que c'est, aujourd'hui, l'amour, je l'aime
3. Do words joined with a hyphen form a single word?
   W 1900 r. trafił do Niemieckiej Południowo-Zachodniej Afryki. (In 1900 he ended up in German South-West Africa.)
   Zakład Przemysłowo-Drzewny Henryków (the Henryków Industrial Timber Works)
   Żydowskie Stowarzyszenie Kulturalno-Oświatowe Tarbut (the Tarbut Jewish Cultural and Educational Association)
   SS-man Fuss aresztował Jankiela za sabotaż (SS man Fuss arrested Jankiel for sabotage)
   Kazimierz Opel ukrył 6-osobową rodzinę Górskich (Kazimierz Opel hid the six-person Górski family)
   musieli oni nie tylko wykazać się znajomością programu 2-letniej państwowej szkoły elementarnej... (they had to demonstrate not only knowledge of the curriculum of the 2-year state elementary school...)
   Dochodząc w opowieści o PRL-u do takiego punktu,... (Reaching such a point in the story about the PRL,...)
4. Is "po polsku" ("in Polish") a single word, or two words?

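The questions about apostrophes and hyphens can be made concrete with a small tokenizer sketch. The function `tokenize` and its `keep_joined` switch are hypothetical names introduced here for illustration; the two regular expressions are just one possible pair of policies, not a recommended solution.

```python
import re

def tokenize(text, keep_joined=True):
    """Return word tokens under one of two simple policies.

    keep_joined=True:  parts linked by a hyphen or an apostrophe form one token.
    keep_joined=False: every run of letters or digits is a separate token.
    """
    if keep_joined:
        pattern = r"\w+(?:[-']\w+)*"
    else:
        pattern = r"\w+"
    # In Python 3, \w matches Unicode letters, so Polish diacritics are covered.
    return re.findall(pattern, text)

for sample in ["it's a dog", "6-osobową rodzinę", "qu'est-ce que c'est"]:
    print(tokenize(sample, keep_joined=True))
    print(tokenize(sample, keep_joined=False))
```

Neither policy is right everywhere: keeping joined parts turns "qu'est-ce" into a single token, while splitting breaks "6-osobową" into "6" and "osobową". Forms such as "cóżeś" or "żebyś" cannot be segmented by punctuation at all, since the attached element is not marked by any separator.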
Text Segmentation (2/3)

1. Where is the end of a sentence? At a full stop (period)?
   ... nie ma prawdy innej, jak cała prawda; to też wszelkie zatajanie jest popełnianiem kłamstwa. (... there is no truth other than the whole truth; hence any concealment amounts to telling a lie.)
   Czy to nasza wina, że mamy takich władców? Myśmy ich sobie nie wybierali! W tysiącletniej afgańskiej historii żaden z władców nie został wyniesiony na tron z woli poddanych. (Is it our fault that we have such rulers? We did not choose them! In a thousand years of Afghan history, no ruler has been raised to the throne by the will of his subjects.)
2. At a full stop, a semicolon, an exclamation mark and a question mark?
   W 1885 r. znalazł się Stanach Zjednoczonych, następnie w Wielkiej Brytanii; w 1900 r. w Johannesburgu i Kapsztadzie. W 1900 r. trafił do Niemieckiej Południowo-Zachodniej Afryki. Zmarł prawdopodobnie w Brukseli w 1912 r. (In 1885 he found himself in the United States, then in Great Britain; in 1900 in Johannesburg and Cape Town. In 1900 he ended up in German South-West Africa. He probably died in Brussels in 1912.)
3. Does a full stop signal only the end of a sentence? What about abbreviations and ordinal numbers (written with digits)? Does the full stop belong to the abbreviation, or is it a separate symbol? What about abbreviations at the end of a sentence?

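To see why these questions matter, here is a minimal sketch of the naive approach: split after every terminating punctuation mark that is followed by whitespace. The function name `naive_split` and its `terminators` parameter are assumptions made for this example.

```python
import re

def naive_split(text, terminators="."):
    """Split after every terminator character that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(terminators) + r"])\s+"
    return re.split(pattern, text.strip())

# Two sentences, taken from the examples above.
ABBREVIATIONS = ("W 1900 r. trafił do Niemieckiej Południowo-Zachodniej Afryki. "
                 "Zmarł prawdopodobnie w Brukseli w 1912 r.")
# One sentence containing a semicolon.
SEMICOLON = ("nie ma prawdy innej, jak cała prawda; "
             "to też wszelkie zatajanie jest popełnianiem kłamstwa.")

for piece in naive_split(ABBREVIATIONS):
    print(repr(piece))
for piece in naive_split(SEMICOLON, terminators=".;!?"):
    print(repr(piece))
```

The first text is cut into three pieces instead of two, because the full stop of the abbreviation "r." is taken for a sentence end. Adding the semicolon to the terminators cuts the second text, a single sentence, in two.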
Text Segmentation (3/3)

When does a full stop end a sentence?
1. The first approximation: when the next word begins with a capital letter.
2. But no punctuation mark may follow the full stop, and the full stop cannot end a sentence when it ends an abbreviation that requires another word after it (e.g. a proper name).

Proper end-of-sentence recognition requires recognition of abbreviations and named entities as well as part-of-speech tagging, which in turn require good segmentation.

In languages such as Japanese or Chinese, words are written without spaces. When segmentation is done on voice data, the input is a string of phones.

Good results can be achieved with a document-centered approach. Words that end with a full stop are examined in the other contexts in which they appear in the same document. This can make it clear whether, in that particular document, they are abbreviations or not.

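The capital-letter rule and the document-centered idea can be combined in a short sketch. It rests on one simplifying assumption that is not stated on the slide: a word is treated as an abbreviation in a document if, anywhere in that document, it ends with a full stop and is followed by a lower-case word. The names `guess_abbreviations` and `split_sentences` are hypothetical, and the sample is one of the example texts from the previous slide.

```python
import re

# Candidate boundary: a word, a full stop, whitespace, and (in a lookahead)
# the first character of the next word.
CANDIDATE = re.compile(r"(\w+)\.\s+(?=(\w))")

def guess_abbreviations(text):
    """Words that somewhere in this text end with '.' before a lower-case word."""
    abbreviations = set()
    for match in CANDIDATE.finditer(text):
        if match.group(2).islower():
            abbreviations.add(match.group(1).lower())
    return abbreviations

def split_sentences(text):
    """Split at a full stop followed by a capitalized word, unless the word
    before the full stop was recognized as an abbreviation in this text."""
    abbreviations = guess_abbreviations(text)
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        word, next_char = match.group(1), match.group(2)
        if next_char.isupper() and word.lower() not in abbreviations:
            sentences.append(text[start:match.end(1) + 1])
            start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

SAMPLE = ("W 1885 r. znalazł się Stanach Zjednoczonych, następnie w Wielkiej "
          "Brytanii; w 1900 r. w Johannesburgu i Kapsztadzie. W 1900 r. trafił "
          "do Niemieckiej Południowo-Zachodniej Afryki. Zmarł prawdopodobnie "
          "w Brukseli w 1912 r.")

for sentence in split_sentences(SAMPLE):
    print(sentence)
```

This sketch never splits after a word it has classified as an abbreviation, so an abbreviation that really does end a sentence in the middle of a text would be missed. Handling that case is where named-entity recognition and part-of-speech tagging would come in.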
Additional Bibliography on Segmentation

1. Gregory Grefenstette, Pasi Tapanainen, What Is a Word, What Is a Sentence? Problems of Tokenization, in Proceedings of the Third Conference on Computational Lexicography and Text Research COMPLEX '94, Budapest, 1994. Available at: http://iling.torreingenieria.unam.mx/curso2002 2/lecturas/mltt- 004.pdf
2. Andrei Mikheev, Periods, Capitalized Words, etc., Computational Linguistics, Volume 28, Number 3, pp. 289-318, September 2002. Available at: http://acl.ldc.upenn.edu/j/j02/j02-3002.pdf
3. David D. Palmer, Marti A. Hearst, Adaptive Multilingual Sentence Boundary Disambiguation, Computational Linguistics, Volume 23, Number 2, pp. 241-269, June 1997. Available at: http://acl.ldc.upenn.edu/j/j97/j97-2002.pdf