Haha. I wrote this on October 3rd, 2006. I started logging IRC conversations in 2005. I’ve got lots from #perl and #mono. I wonder if the ne’er-do-wells who frequent those channels will mind me using their public commentary to build an English language parser. I’ll make it “open source,” of course. Erhm, I mean Free Software.
So… I’ve been trying to parse the English language (text) for a while. I haven’t quite done it yet. My new idea involves IRC bots. Lots and lots of conversations happen in public IRC channels every day. The conversations usually involve transmission of ideas in textual format. These transmissions are quite often (but not always) understood by the receiving party. The transmissions are expected to have been intercepted by the general public, so having a bot intercept and attempt to interpret them should not offend anybody :) It will anyway, of course, but we’ll pretend that this 0.00001% minority doesn’t mind being marginalized for the time being.
So the idea is this: split the text on the standard word boundaries and do a dictionary lookup on each word. Cache the results locally so a network hit isn’t incurred every time we look up the same word. The info we will take from the dictionary lookup is the “part of speech” that the word falls into. In the English language, this is not often a scalar value — a word like “run” can be a verb or a noun. Computers are getting fast, so I think that this shouldn’t be too much of a problem.
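A minimal sketch of that step, assuming a toy in-memory dictionary standing in for whatever online dictionary service the bot would actually query:

```python
import re

# Toy dictionary standing in for the network lookup. In the real bot this
# would be populated by querying a dictionary service. Note that a word
# can map to more than one part of speech.
DICTIONARY = {
    "the": ["article"],
    "cats": ["noun"],
    "are": ["verb"],
    "cute": ["adjective"],
    "run": ["verb", "noun"],
}

_cache = {}

def lookup_pos(word):
    """Return the parts of speech for a word, caching results locally
    so repeated words don't incur another (hypothetical) network hit."""
    word = word.lower()
    if word not in _cache:
        _cache[word] = DICTIONARY.get(word, ["unknown"])
    return _cache[word]

def tag_sentence(text):
    """Split on standard word boundaries and tag each word."""
    words = re.findall(r"[a-zA-Z']+", text)
    return [(w, lookup_pos(w)) for w in words]

tagged = tag_sentence("The cats are cute")
# [('The', ['article']), ('cats', ['noun']),
#  ('are', ['verb']), ('cute', ['adjective'])]
```

The list-valued lookup is the point: the caller gets every candidate part of speech and has to disambiguate later, which is where the “computers are getting fast” hand-wave comes in.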
So now we have a list of words and the parts of speech for each of these words. These words are assumed to combine in such a way that they form a coherent idea.
At this point, we can create an XML document from these word elements, for instance:

<sentence>
  <article>the</article>
  <noun plurality="plural">cats</noun>
  <verb>are</verb>
  <adjective>cute</adjective>
</sentence>
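Building that document from the tagged word list is mechanical; a sketch using the standard library’s ElementTree, with the part of speech as the tag name and any extra annotations (like plurality) as attributes:

```python
import xml.etree.ElementTree as ET

def build_sentence_xml(tagged_words):
    """Build a <sentence> element from (word, part_of_speech, attributes)
    triples, mirroring the document structure shown above."""
    sentence = ET.Element("sentence")
    for word, pos, attrs in tagged_words:
        element = ET.SubElement(sentence, pos, attrs)
        element.text = word
    return sentence

tagged = [
    ("the", "article", {}),
    ("cats", "noun", {"plurality": "plural"}),
    ("are", "verb", {}),
    ("cute", "adjective", {}),
]
xml_text = ET.tostring(build_sentence_xml(tagged), encoding="unicode")
# <sentence><article>the</article><noun plurality="plural">cats</noun>...
```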
There would need to be an XSD or DTD against which to test this document in order to determine whether these parts of speech can be combined in this way to make a valid, grammatically correct, sentence.
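The Python standard library can’t validate against a DTD or XSD on its own (that usually takes a third-party library such as lxml), but the idea can be approximated by checking the sequence of part-of-speech tags against a set of allowed patterns. This is a crude stand-in for a real schema, with one made-up grammar rule:

```python
import re
import xml.etree.ElementTree as ET

# A crude stand-in for a DTD/XSD: allowed part-of-speech sequences,
# expressed as regexes over the child tag names. A real schema would
# be far richer than this single rule.
GRAMMAR_RULES = [
    re.compile(r"^(article )?noun verb( adjective)?$"),
]

def is_valid_sentence(xml_text):
    """Check whether a <sentence> document's parts of speech combine
    into one of the allowed patterns."""
    root = ET.fromstring(xml_text)
    sequence = " ".join(child.tag for child in root)
    return any(rule.match(sequence) for rule in GRAMMAR_RULES)

doc = ('<sentence><article>the</article>'
       '<noun plurality="plural">cats</noun>'
       '<verb>are</verb><adjective>cute</adjective></sentence>')
# is_valid_sentence(doc) -> True; a <verb><noun> sentence would fail.
```

A real DTD would express the same constraint declaratively, e.g. `<!ELEMENT sentence (article?, noun, verb, adjective?)>`, and a validating parser would do this check for us.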