Programming for Linguists, Winter 2008
- Professor: Jason Riggle - jriggle around uchicago dot edu
- Illicit Covert TA: Max Bane - bane around uchicago dot edu
- Office hours: After class on Wednesdays, and by appointment.
- Room: SS 009A, Conference room of the Landahl Center.
- Textbook: Natural Language Processing in Python (the NLTK book)
Useful Places
Things People Want to Do
- Corpus searching/statistics
- Simulations; rise of the machines; replicating Liberman 2000,
2002
- HMMs
- Working with R
Check here for updates!
Bulletin
- March 12th Come get your parsing.py here
- March 5th We got even jiggier with finite state
transducers, turning strings of unstressed syllables (0s) into
strings of primarily and secondarily stressed syllables (2s and
1s). Lookie here: recognize4.py. Your
homework over the weekend is to write a function ("composeDet")
that takes two FSMs (which you can assume are deterministic and
complete) and returns their composition; i.e., a new FSM that does
the equivalent of applying both original FSMs simultaneously. Look
for the "HOMEWORK HERE" bit in recognize4.py for a hint.
- March 3rd Getting jiggy with parsing probabilistic
regular languages! recognize3.py
- February 27th Probabilistic finite state machine! Recognize2 here
- February 25th We started talking about parsing,
beginning with the regular languages. We wrote a recognizer for
finite state machines, here.
- February 20th Agents with phonemes rather than just
holistic words: get the code here.
- February 18th We added some additional metrics of how
similar the agents' probability distributions over words are, and
of the extent to which they have the same word for each idea. We
found that the agents tend to match each other's distributions,
rather than settling on common vocabularies in the basic model,
but that the addition of "vocabulary decay" over time results in
rapid lexical convergence. What other factors might lead to
convergence? Code is available here.
- February 13th We continued experimenting with our
agent-based simulations, and implemented a metric of similarity
between the agents' vocabularies. We found that, according to the
metric, the agents do not currently seem to converge on a common
lexicon. The code is here.
For next time, continue experimenting with the agent-based model.
Try to get graphviz installed
and working on your machine so that you can make pretty pictures of
the social network, and see if you can think of changes that might
result in the agents converging on a common lexicon. It may be
useful to take a look at this
paper by Mark Liberman for mental stimulation.
- February 11th Today we went over Jason's
implementation of the n-gram based imitator of Jane Austen
and William Shakespeare (available here),
and we wrote the first pass at an agent-based model of lexicon
convergence (code here). Your mission
for next time is to download the agent-based model code, grok it,
and then make some "linguistically interesting" change(s) to it.
- February 6th The transcript of what was done today on
the projector is available here,
and the ngram-counting functions we wrote are here. For next time, finish chapter 3 if
you haven't already, and consider how you might represent
linguistic (phonological, syntactic, whatever) feature matrices
(i.e., attribute-value matrices) in Python. Try writing some
preliminary code to do so, and see if you can write a function
that takes a unicode (say, IPA) character and returns the
phonological feature matrix that it is traditionally purported to
represent (hint: chapter 3 has some helpful information on using
unicode).
- February 4th Homework for Wednesday: finish chapter 3
of the NLTK book, working through half-moon problems. The
frequency/rank calculating and graphing code we did today is
available here. You'll need matplotlib for the
graphing to work.
- January 30th: Homework for next Monday: write a
program that creates a CSV file containing each word in
some crazy Jane Austen novel (downloadable here), its frequency, and its rank
(i.e., x such that the word is the xth most frequent word). The
transcript of what we did in class today was unfortunately
destroyed.
- January 28th: The transcript of what we did in class
today is available here.
Homework for Wednesday the 30th: read sections 3.1 and 3.2. Do the
half-moon exercises in 3.2. Email your work to Max.
- January 23rd: Homework for Monday the 28th: finish
reading Chapter
2, attempt remaining half-moon exercises (especially the
Pig Latin one), attempt the star exercise (implementing SoundEx).
- Consider signing up for the upcoming "masterclass" on Corpus
Methods in Linguistics. Deadline is February 7th.
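
For the January 30th homework above (a CSV file with each word, its frequency, and its rank), here is one minimal sketch of how it might go. It is not the in-class code, which is not reproduced on this page. The function names word_ranks and write_rank_csv are made up for illustration, the definition of a "word" (a lowercase run of letters and apostrophes) is an assumption, and the code is written for Python 3.

```python
import csv
import io
import re
from collections import Counter


def word_ranks(text):
    """Return (word, frequency, rank) rows, most frequent word first."""
    # Assumption: a "word" is a lowercase run of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency; ties are broken alphabetically so the
    # result is deterministic. Rank 1 is the most frequent word.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, freq, rank)
            for rank, (word, freq) in enumerate(ordered, start=1)]


def write_rank_csv(text, out_file):
    """Write a header row plus one row per word to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["word", "frequency", "rank"])
    writer.writerows(word_ranks(text))


if __name__ == "__main__":
    # A short stand-in for the novel; in practice you would read the
    # downloaded text file instead.
    sample = "the cat and the hat and the bat"
    buffer = io.StringIO()
    write_rank_csv(sample, buffer)
    print(buffer.getvalue())
```

To write a real file, open it with open(path, "w", newline="") and pass the file object as out_file. Tied words get distinct consecutive ranks here; giving ties the same rank would be an equally reasonable choice.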
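
For the March 5th homework above, one way to read "applying both original FSMs simultaneously" is the standard product construction: the new machine's states are pairs of states, and it accepts a string exactly when both original machines do. The sketch below takes that reading, which is an assumption; if the FSMs are transducers, composition would instead feed the output of one into the input of the other. The dictionary representation (keys "start", "finals", "delta") and the names recognize and compose_det are hypothetical and are not taken from recognize4.py.

```python
def recognize(fsm, string):
    """Run a deterministic, complete FSM over string; True if it accepts."""
    state = fsm["start"]
    for symbol in string:
        state = fsm["delta"][(state, symbol)]
    return state in fsm["finals"]


def compose_det(a, b):
    """Product of two deterministic, complete FSMs over the same alphabet.

    Assumed format: {"start": q0, "finals": set_of_states,
                     "delta": {(state, symbol): next_state}}.
    """
    alphabet = {symbol for (_, symbol) in a["delta"]}
    start = (a["start"], b["start"])
    delta = {}
    finals = set()
    agenda = [start]   # pair states still to be expanded
    seen = {start}     # pair states already discovered
    while agenda:
        pair = agenda.pop()
        qa, qb = pair
        # A pair is final only if both components are final.
        if qa in a["finals"] and qb in b["finals"]:
            finals.add(pair)
        for symbol in alphabet:
            # Step both machines at once on the same symbol.
            target = (a["delta"][(qa, symbol)], b["delta"][(qb, symbol)])
            delta[(pair, symbol)] = target
            if target not in seen:
                seen.add(target)
                agenda.append(target)
    return {"start": start, "finals": finals, "delta": delta}


# Toy machines over the alphabet {"0", "1"} (made up for this example).
even_ones = {  # accepts strings with an even number of 1s
    "start": "even", "finals": {"even"},
    "delta": {("even", "0"): "even", ("even", "1"): "odd",
              ("odd", "0"): "odd", ("odd", "1"): "even"},
}
ends_in_zero = {  # accepts strings whose last symbol is 0
    "start": "no", "finals": {"yes"},
    "delta": {("no", "0"): "yes", ("no", "1"): "no",
              ("yes", "0"): "yes", ("yes", "1"): "no"},
}

if __name__ == "__main__":
    both = compose_det(even_ones, ends_in_zero)
    for s in ["110", "10", "11", ""]:
        print(repr(s), recognize(both, s))
```

Only pair states reachable from the start pair are built, so the result can be smaller than the full cross product of the two state sets.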