Parsing English Language (if you consider IRC logs to qualify)

Haha. I wrote this on October 3rd, 2006. I started logging IRC conversations in 2005. I’ve got lots from #perl and #mono. I wonder if the ne’er-do-wells who frequent those channels will mind me using their public commentary to build an English language parser. I’ll make it “open source,” of course. Erhm, I mean Free Software.

So… I’ve been trying to parse English-language text for a while. I haven’t quite done it yet. My new idea involves IRC bots. Lots and lots of conversations happen in public IRC channels every day. The conversations usually involve transmission of ideas in textual format. These transmissions are quite often (but not always) understood by the receiving party. The transmissions are expected to be readable by the general public, so having a bot intercept and attempt to interpret them should not offend anybody :) It will anyway, of course, but we’ll pretend that this 0.00001% minority doesn’t mind being marginalized for the time being.

So the idea is this: split the words on the standard word boundaries and do a dictionary lookup on each of them. Cache the results locally so a network hit isn’t incurred for each word we look up. The info we will take from the dictionary lookup is the “part of speech” that the word falls into. In the English language, this is often not a scalar value: a word like “run” can be a noun or a verb. Computers are getting fast, so I think that this shouldn’t be too much of a problem.
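Here is a minimal sketch of that lookup step in Python. Everything named here is an assumption for illustration: `TOY_DICTIONARY` stands in for a real dictionary service, the regular expression is just one guess at “standard word boundaries,” and `lru_cache` plays the part of the local cache.

```python
import re
from functools import lru_cache

# Hypothetical stand-in for a real dictionary service; the entries are illustrative only.
TOY_DICTIONARY = {
    "the": {"article"},
    "cats": {"noun"},
    "chase": {"verb", "noun"},
    "mice": {"noun"},
}


@lru_cache(maxsize=None)
def parts_of_speech(word):
    """Return every part of speech the dictionary lists for a word.

    The result is a set, not a scalar, because one word can play several roles.
    lru_cache keeps the answer locally, so a repeated word costs no second lookup.
    """
    return frozenset(TOY_DICTIONARY.get(word.lower(), ()))


def tag_line(line):
    """Split a line into words and pair each word with its possible parts of speech."""
    words = re.findall(r"[A-Za-z']+", line)
    return [(word, parts_of_speech(word)) for word in words]


if __name__ == "__main__":
    for word, tags in tag_line("The cats chase mice"):
        print(word, sorted(tags))
```

A word the dictionary doesn’t know comes back with an empty set, which IRC logs will produce a lot of.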

So now we have a list of words and the parts of speech for each of these words. These words are assumed to combine in such a way that they form a coherent idea.

At this point, we can create an XML document from these word elements, for instance:

  <noun plurality="plural">cats</noun>
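Building such elements could look roughly like this, using Python’s standard `xml.etree.ElementTree`. The `<sentence>` wrapper element and the `sentence_to_xml` helper are my own assumptions; the only element taken from the example above is the `<noun>` one.

```python
import xml.etree.ElementTree as ET


def sentence_to_xml(tagged_words):
    """Turn (text, part_of_speech, attributes) triples into an XML string.

    Each word becomes an element named after its part of speech, wrapped in a
    hypothetical <sentence> root element.
    """
    sentence = ET.Element("sentence")
    for text, part_of_speech, attributes in tagged_words:
        element = ET.SubElement(sentence, part_of_speech, attributes)
        element.text = text
    return ET.tostring(sentence, encoding="unicode")


if __name__ == "__main__":
    print(sentence_to_xml([
        ("cats", "noun", {"plurality": "plural"}),
        ("sleep", "verb", {}),
    ]))
    # prints: <sentence><noun plurality="plural">cats</noun><verb>sleep</verb></sentence>
```

This assumes each word has already been narrowed down to a single part of speech, which is the hard part.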

There would need to be an XSD or DTD against which to test this document in order to determine whether these parts of speech can be combined in this way to make a valid, grammatically correct sentence.
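To get a feel for what such a schema would check, here is a rough stand-in rather than real DTD or XSD validation. A DTD content model is essentially a regular expression over child element names, so this sketch matches the sequence of tags against a Python regular expression. The `<sentence>` root and the toy grammar in `SENTENCE_MODEL` are assumptions of mine, and far too small for real English.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model: an optional article, a noun, a verb, and then
# optionally an object made of an optional article and a noun.
SENTENCE_MODEL = re.compile(r"(article )?noun verb( (article )?noun)?")


def is_valid_sentence(xml_text):
    """Check whether the child elements of the root appear in an allowed order."""
    root = ET.fromstring(xml_text)
    tag_sequence = " ".join(child.tag for child in root)
    return SENTENCE_MODEL.fullmatch(tag_sequence) is not None


if __name__ == "__main__":
    good = "<sentence><noun>cats</noun><verb>sleep</verb></sentence>"
    bad = "<sentence><verb>sleep</verb><noun>cats</noun></sentence>"
    print(is_valid_sentence(good))  # True
    print(is_valid_sentence(bad))   # False
```

A real DTD or XSD would be checked with a validating XML parser instead, but the idea is the same: the order of the parts of speech either fits the model or it doesn’t.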

This entry was posted in Databases, irc, language, mono, perl.

11 Responses to Parsing English Language (if you consider IRC logs to qualify)

  1. I studied this (academically) a bit and found out that this is basically impossible. That is, no one knows how to do it well.

    Lots of parsers exist that can be reused. None are very good.

  2. Ivan says:

    A guy from my MSc course had a similar idea and is doing something similar for his thesis but he is using the huge number of SRT and SUB files out there (subtitle files for videos) instead of spying on people :P and is doing it for the purpose of translation by using some sort of pattern matching based on the huge DB of words and their combination or something.

  3. This XML is a bit strange. Why use a cannon for such a thing?
    Better would be:



    Or something more lightweight than XML.

  4. Vlad says:

    If you are going to use XML, you might as well use RDF. You are trying to reinvent a semantic dictionary. Do a little research. If you use OWL and use the inferencing power therein, you will go quite far.



    • Thanks, Vlad. I will take a look.

      • Vlad says:

        No problem. And if you want something more lightweight than XML, there is a syntax analogous to JSON for RDF triples called N3.

        Warning: You will find applications for this everywhere. You will make disparaging remarks about DTD, XSD, Schematron, and such things. It will get worse. When you learn about SPARQL, you will question why you are using RDBMSs. This stuff is crack. It does not solve every problem though.

        Since you are a Mono guy you should know that others have already paved the way. It is in F-Spot and Beagle. Check this out for who deserves credit. It will need help, though others are independently helping (LINQ to RDF).
