Sunday, October 16, 2016

I drew very strong connections between the syntactic tree structure of words and parts of speech and a computer science graph theory course I am currently taking. The amount and depth of analysis one can perform on data once its structure and relationships are defined in graph form is immense. For instance, with the hierarchical structure, one can start to think about distances between words or constituents in terms of the path between two leaf nodes in the tree: words that require more hops to get from one to the other are less likely to appear near each other in general. This connects to how we determine part of speech, which is based not on a word's meaning or semantics but on its place in the sentence, i.e., its distribution. In graph theory we learned that it is common to compute distributions over pairs of nodes in a tree, essentially the probability of two nodes linking to each other given their places in the graph. So we can leverage the tree structure, where the position of a word in a sentence corresponds to its place in the graph, and calculate probability distributions over words based on how far apart they are in the tree, allowing us to make sound judgments about the parts of speech of words in a sentence. In addition, it is common to collapse a subtree in a hierarchical structure into a single node that encapsulates the properties the subtree otherwise has as individual nodes, which corresponds to what Carnie mentioned about performing a “replacement test” to identify constituents.
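To make the tree-distance idea concrete, here is a minimal sketch (the toy parse tree, node labels, and function names are my own, not Carnie's notation): the distance between two words is the number of edges on the leaf-to-leaf path, found through their lowest common ancestor.

```python
# Toy parse tree for "the dog chased the cat": each node is a tuple
# (label, children...), and leaves are plain word strings.
tree = ("S",
        ("NP", ("D", "the"), ("N", "dog")),
        ("VP", ("V", "chased"),
               ("NP", ("D", "the"), ("N", "cat"))))

def leaf_paths(node, path=()):
    """Yield (word, path-of-child-indices-from-root) for every leaf."""
    if isinstance(node, str):                 # a leaf word
        yield node, path
    else:
        _label, *children = node
        for i, child in enumerate(children):
            yield from leaf_paths(child, path + (i,))

def tree_distance(tree, w1, w2):
    """Edges on the path between two leaf words (assumes unique words)."""
    paths = dict(leaf_paths(tree))
    p1, p2 = paths[w1], paths[w2]
    # Length of the common prefix = depth of the lowest common ancestor.
    lca = 0
    while lca < min(len(p1), len(p2)) and p1[lca] == p2[lca]:
        lca += 1
    return (len(p1) - lca) + (len(p2) - lca)

print(tree_distance(tree, "dog", "chased"))  # 6
print(tree_distance(tree, "dog", "cat"))     # 7

# Carnie's replacement test, sketched the same way: a constituent subtree
# (here the object NP) can be swapped for a single node and the tree
# remains well-formed.
replaced = ("S", tree[1], ("VP", ("V", "chased"), ("Pro", "it")))
```

Under this metric "dog" is closer to "chased" than to "cat", matching the intuition above that path length in the tree tracks how words relate.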

Tying this back to the artificial intelligence courses I’ve taken and research I’ve done in the past, I was able to draw parallels with the process, mentioned in the book, of gathering data across various languages to create more generalized models and rules that explain language systematically. The data we use to build these generalized models is extremely important because, as Carnie mentioned, the “training data” we use to generate the model largely influences how general and how representative the model is across languages. It is easy to “overfit”, that is, to produce a model that follows a non-representative subset of the population too closely and is therefore extremely biased. It is equally easy to underfit, meaning the model is so general that it doesn’t characterize or explain the population in enough detail. Finally, just as in training neural networks in artificial intelligence, we want to keep taking in more training data (the more data the better) and constantly update the model so that, over time, it better models the true population--in this case, language.
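The overfitting/underfitting analogy can be sketched in language-model terms (toy corpus and function names are my own): a unigram word model estimated from too little text overfits, assigning zero probability to any word it never saw, while add-alpha (Laplace) smoothing hedges the model toward uniformity; a very large alpha underfits, since the model tends toward uniform no matter what the data says.

```python
from collections import Counter

def unigram_model(words, vocab, alpha=1.0):
    """Return P(word) with add-alpha smoothing over a fixed vocabulary."""
    counts = Counter(words)
    total = len(words) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

train_small = "the dog chased the cat".split()
heldout     = "the cat slept".split()
vocab = set(train_small) | set(heldout)

unsmoothed = unigram_model(train_small, vocab, alpha=0.0)
smoothed   = unigram_model(train_small, vocab, alpha=1.0)

# The unsmoothed model gives the held-out word "slept" probability 0, so
# the entire held-out text gets probability 0; smoothing avoids this.
print(unsmoothed["slept"])    # 0.0
print(smoothed["slept"] > 0)  # True
```

Adding more training text shrinks the set of unseen words, which is the "more data the better" point above: the model keeps improving as it is updated against a larger sample of the true population.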

1 comment:

  1. Great post Michelle! I, too, was reminded of the problems of overfitting in machine learning. I was especially interested in the 'underdetermination of the data' problem in the context of machine learning -- while there exist many discriminative models for binary and multi-class classification when the data MUST belong to one of the given classes, there are far fewer techniques for dealing with 'non-valid' examples that belong to none of the given classes. Generative models, on the other hand, which try to construct distributions over the classes rather than simply determining the posterior outright, may be better for characterizing whether a sentence is grammatical.

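The generative-model point in the comment above can be sketched minimally (toy 1-D data; all names and the threshold are hypothetical): fit a distribution per class, then classify by likelihood, rejecting inputs that no class explains -- something a purely discriminative classifier cannot do directly.

```python
import math
from statistics import mean, stdev

def gaussian_logpdf(x, mu, sigma):
    """Log density of a 1-D Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def fit(samples):
    """Fit a Gaussian to one class's training samples."""
    return mean(samples), stdev(samples)

# One Gaussian per class, fit generatively from toy samples.
classes = {
    "short": fit([1.0, 1.2, 0.9, 1.1]),
    "long":  fit([5.0, 5.3, 4.8, 5.1]),
}

def classify(x, threshold=-5.0):
    """Return the most likely class, or None if no class makes x plausible."""
    best, logp = max(((c, gaussian_logpdf(x, mu, s))
                      for c, (mu, s) in classes.items()),
                     key=lambda t: t[1])
    return best if logp > threshold else None

print(classify(1.05))  # "short"
print(classify(20.0))  # None -- implausible under every class
```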