Training-notes

Some PDF teaching files

  • Notes from a course in PERL for linguists held at the Department for Oriental and African Languages, Göteborg
  • Unix for Poets, a course in Unix designed for linguists, not computer scientists, by Kenneth Ward Church, AT&T Research.

Corpus

When I was here in 2005 I collected some texts in order to build a small corpus for demonstration purposes. The texts are:

  1. Malawi Poverty Reduction Strategy (MPRS) (6 languages)
  2. HIV/AIDS Policy (October 2003) (3 languages)

To make a long story short (it will get longer), I have created a small corpus for each of the six languages and made the corpora accessible over the web.

To be even shorter, it is here.

I will elaborate on the search language later.

Corpus driven (or based) lexicography

I brought two books containing collections of articles:

  1. Lexicography and Natural Language Processing: A Festschrift in Honour of B.T.S. Atkins (Euralex 2002)
  2. A Practical Guide to Lexicography, ed. by Piet van Sterkenburg (John Benjamins Publishing Company, 2003)

They are not the last word on the subject, but they give a good, solid start, with an eye to traditional lexicography that is willing to use modern methods when they are productive.

What does a lexicographer do?

  1. Gathers a vocabulary
  2. Divides the vocabulary into senses
  3. Writes definitions
  4. Illustrates the definitions with language examples

That is how it seems to work, but the order is not necessarily the one given above.

There was a linguist, Firth, who was active around the time Chomsky started to become popular. Firth was totally overshadowed by Chomsky, even though both had arrived at equally significant results.

You might have heard:

"You shall know a word by the company it keeps."

It is central to corpus based lexicography, and one could argue that it is central to understanding language as a means of communication. We can break all the syntactic rules we want (well, a lot of them anyway) and we will still be understood. But if we use words the wrong way, the likelihood that we will succeed in communicating, which is what language is all about, is not very great.

The context is central. Even if we use the wrong words, the context can clarify the situation. If I ask you what "foo" means, you will not know. I have pulled it out of context and there are no innate semantic entities residing in strings of letters of the alphabet.

But if I say "Please give me a cup of foo", you will at least know that I am not asking for a glass of coke. I am asking for either coffee or tea. We do not drink cups of many more things. We know a word by the company it keeps.

The above is the guiding principle of corpus based lexicography. That being the case, one does not gather language examples after words have been defined, but defines words according to what the collection of language examples provides evidence for.
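
To make the principle concrete, here is a minimal sketch of a keyword-in-context (KWIC) listing, the kind of view a lexicographer reads to see what company a word keeps. It is not the search tool used for the corpus described on this page: the function name `kwic`, the simple tokenisation and the sample sentences are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    `width` is the number of words kept on each side of the keyword.
    Tokenisation is deliberately naive: lowercase runs of word characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

# Invented sample text, not taken from the corpus.
sample = ("Please give me a cup of foo. "
          "She poured a cup of tea and a cup of coffee.")

for left, keyword, right in kwic(sample, "cup"):
    # Right-align the left context so the keyword forms a column.
    print(f"{left:>15}  [{keyword}]  {right}")
```

Reading down the column to the right of "cup" shows "foo" in the same slot as "tea" and "coffee", which is the kind of evidence a definition would be based on.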

Collecting a vocabulary

Where does the vocabulary of a dictionary come from? What are the most frequent words used in Chichewa?

Here are the most frequent ones in the texts I have used.
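
A frequency list of this kind can be approximated in a few lines. This is only a sketch under stated assumptions, not the procedure actually used for the corpus: the function name `frequency_list`, the tokenisation rule and the English placeholder sentence are all invented for illustration.

```python
import re
from collections import Counter

def frequency_list(text, top=10):
    """Return the `top` most frequent word forms with their counts."""
    # \w matches Unicode letters, so forms containing ŵ stay in one piece.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(top)

# Placeholder text; a real run would read the corpus files instead.
sample = "the cat and the dog and the bird"

for word, count in frequency_list(sample, top=3):
    print(f"{count:5d}  {word}")
```

Note that this counts word forms, not lemmas. Grouping inflected forms under one headword is a separate step.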

Corpus composition

There are various schools of thought when it comes to balancing a corpus. In the examples that we have seen today, you can tell that the texts come from the public domain, and the vocabulary reflects that provenance.

More to come ...

Computer aided lexicography

Another aspect of computational lexicography is the use of specialized editing programs for entering dictionary data in such a way that:

  • Consistency can be guaranteed
  • Quality controls can be implemented automatically
  • Dictionary data can be reused

When I say "reused", I mean that the same database can be used to generate smaller or larger products as required. It can also be used to create the L1 side of a bilingual dictionary, where the second language is L2.
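
The three points above can be sketched in code. This is a hypothetical illustration only and does not describe the editing tool mentioned on this page: the `Entry` record, its fields, the list of allowed parts of speech and the sample entries are all assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical closed list; a real project would define its own.
ALLOWED_POS = {"noun", "verb", "adjective"}

@dataclass
class Entry:
    lemma: str                                   # headword
    pos: str                                     # part of speech
    senses: list = field(default_factory=list)   # one definition per sense
    frequency: int = 0                           # corpus frequency

def check(entry):
    """Automatic quality control: return a list of problems found."""
    problems = []
    if not entry.lemma.strip():
        problems.append("empty lemma")
    if entry.pos not in ALLOWED_POS:
        problems.append(f"unknown part of speech: {entry.pos}")
    if not entry.senses:
        problems.append("no senses")
    return problems

def select(entries, min_frequency):
    """Reuse: derive a smaller product from the same database."""
    return [e for e in entries if e.frequency >= min_frequency]

# Invented sample data.
database = [
    Entry("water", "noun", ["a clear liquid"], frequency=120),
    Entry("drink", "verb", ["to swallow a liquid"], frequency=45),
    Entry("ladle", "nuon", [], frequency=2),     # deliberately faulty
]

for entry in database:
    print(entry.lemma, check(entry))

pocket_edition = select(database, min_frequency=40)
print([e.lemma for e in pocket_edition])
```

Because every product is generated from the same records, a correction made once in the database shows up in all of them.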

There is now a first version of an editing tool for entering lemmas (headwords) and their associated formal features:

Tool for printing dictionary articles

One thing to remember: if you want to enter characters like Ŵŵ, you have to set the encoding for your browser to Unicode.

This is done under View -> Character Encoding -> Unicode (UTF-8).

So if your text looks "clobbered", try setting the encoding manually instead of letting the computer figure it out.
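
What "clobbered" means can be shown in a few lines. The sketch below assumes the wrong guess is Latin-1 (ISO 8859-1), which is one common case; a browser may guess some other encoding, and the result will then look different.

```python
text = "Ŵŵ"

# The bytes that are actually stored or sent over the network.
raw = text.encode("utf-8")

# Reading the same bytes with the wrong encoding clobbers the text.
clobbered = raw.decode("latin-1")

# Reading them as UTF-8 gives the original characters back.
restored = raw.decode("utf-8")

print(raw)        # b'\xc5\xb4\xc5\xb5'
print(clobbered)  # Å´Åµ
print(restored)   # Ŵŵ
```

Each of the two characters takes two bytes in UTF-8, so a decoder that expects one byte per character shows four wrong characters instead of two right ones.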

There will be much more to say about this further on, as the project progresses.

Page last modified on September 05, 2009, at 01:15 PM