: My goal is to index wikipedia in order to demonstrate search to a class of
: middle school kids that I've volunteered to teach for a couple of hours.
: Which brings me to my next question...

twitter data is a little easier to ingest than the wikipedia markup 
(the json based streaming API gives you each tweet on its own line in a 
way that's really trivial to convert into CSV with a perl script) and 
might seem more interesting to kids than wikipedia, while still having 
some interesting metadata (user, post date, hash tags) and lexicographic 
challenges (synonyms, abbreviations, @ and # markup, etc...)
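
for example, a minimal sketch of that conversion in perl (assuming the 
classic streaming payload shape with user.screen_name, created_at, and 
text fields; adjust the names to whatever your dump actually contains):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use JSON::PP;   # in core since perl 5.14

    # one JSON tweet per line on stdin, CSV on stdout
    # NOTE: field names assume the classic streaming payload;
    # tweak them to match your data
    my $json = JSON::PP->new;
    print qq{"user","created_at","text"\n};
    while (my $line = <STDIN>) {
        my $t = eval { $json->decode($line) } or next;  # skip bad lines
        my @row = map { $_ // '' }
            $t->{user}{screen_name}, $t->{created_at}, $t->{text};
        s/"/""/g for @row;   # naive CSV escaping; newlines in the
                             # tweet text are not handled here
        print join(",", map { qq{"$_"} } @row), "\n";
    }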


: One idea I have is to bring some actual "documents", say a poster board with a
: sentence written largely on it, have the students physically *tokenize* the
: document by cutting it up and lexicographically building the term dictionary.
: Thoughts on taking it further welcome!

cutting up a paper document is a great way to teach textual analysis, but 
i think the real key is having two copies of multiple documents (3 would 
be enough) ... cut up one copy of each doc to build the term dictionary 
and tape all of those to the wall; then tape the second copy of every doc 
on the wall around them and draw lines from each term to the documents 
it's in  (using a different color of paper for each doc would be an easy 
way to spot which term is in which doc, and what the term frequency is).
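
in index terms, the wall the kids end up with is just an inverted index 
with term frequencies; here's a toy sketch in perl of the same structure 
(the doc contents are made up):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # toy corpus -- each doc is one "poster board"
    my %docs = (
        doc1 => "the quick brown fox",
        doc2 => "the lazy brown dog",
        doc3 => "a quick dog",
    );

    # term => { doc_id => term frequency } ... the lines on the wall
    my %index;
    while (my ($id, $text) = each %docs) {
        $index{lc $_}{$id}++ for split /\W+/, $text;
    }

    # sorted keys give you the lexicographic term dictionary
    for my $term (sort keys %index) {
        my $p = $index{$term};
        print "$term -> ",
              join(", ", map { "$_(tf=$p->{$_})" } sort keys %$p), "\n";
    }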


-Hoss
