Yale Data Science — Data science collective at Yale
http://yaledatascience.github.io/
Fri, 17 Mar 2017 18:19:32 +0000 (Jekyll v3.4.2)

Neural Networks and Natural Language Processing
<p>by <a href="https://jungokasai.github.io/">Jungo Kasai</a></p>
<h3 id="resources">Resources</h3>
<ul>
<li><a href="https://jungokasai.github.io/docs/midterm_nlp.pdf">My Presentation Slides</a></li>
<li><a href="http://cs224d.stanford.edu/">Richard Socher’s Course at Stanford</a></li>
</ul>
<h3 id="introduction">Introduction</h3>
<p>Neural networks have been successful in many fields of machine learning, such as Computer Vision and Natural Language Processing. In this post, we will go over applications of neural networks in NLP in particular and hopefully give you a big picture of the relationship between neural nets and NLP.
Having such a big picture should help us read papers in the literature and understand their context more deeply.</p>
<h3 id="challenges-and-resolutions">Challenges and Resolutions</h3>
<p>Applying neural networks to problems in Computer Vision is often straightforward. For example, we can represent each image in the <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a> dataset as a square of 28 by 28 pixels, where each pixel takes on a real value. In NLP, by contrast, the question of how to represent words arises. This is where word embedding techniques come into play.
Perhaps the simplest thing we could do is represent each word by a one-hot vector, in which all components are zero except for a single one; in effect, we represent each word by a unique index. This representation is not ideal: the vectors have dimension equal to the vocabulary size, and they do not preserve linguistically sensible structure. Since all one-hot vectors are orthogonal to each other by design, the distances between the vectors carry no information. We therefore want dense vector representations of words instead. Several techniques have been developed to vectorize English words:</p>
<ul>
<li><a href="https://wordnet.princeton.edu/">WordNet at Princeton, starting in 1985</a></li>
<li>Global Approach: (Pointwise Mutual Information) Matrix Factorization</li>
<li>Local Approach: <a href="https://arxiv.org/pdf/1301.3781.pdf">Word2Vec by Mikolov et al. in 2013</a></li>
<li>Global and Local: <a href="https://nlp.stanford.edu/projects/glove/">GloVe by Pennington, Socher, and Manning in 2014</a></li>
</ul>
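<p>To make the contrast concrete, here is a minimal sketch (with made-up toy vectors, not learned embeddings) showing that one-hot vectors are pairwise orthogonal, so their distances carry no information, while dense vectors can encode similarity:</p>

```python
import numpy as np

# One-hot vectors: every pair is orthogonal, so all pairwise distances
# are identical and carry no linguistic information.
vocab = ["cat", "dog", "car"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
assert one_hot["cat"] @ one_hot["dog"] == 0.0

# Dense vectors: toy 2-d embeddings chosen by hand so that "cat" and
# "dog" are closer to each other than either is to "car".
dense = {"cat": np.array([0.9, 0.1]),
         "dog": np.array([0.8, 0.2]),
         "car": np.array([0.1, 0.9])}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Similar words now have higher cosine similarity.
assert cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"])
```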
<p>WordNet is analogous to a dictionary in that humans manually annotate relations among words and group synonyms together. The other three methods all utilize English corpora; instead of manually encoding relations among words, we learn relations from text data.</p>
<p>While the WordNet approach obviously scales badly as we extend it to more words and even other languages, the learning approaches are more flexible. One simple way of learning word vectors from text data is to take some statistics of the data, construct a matrix with each row corresponding to a word and each column corresponding to a “context”, and factorize the matrix. The singular value decomposition (SVD) is a common way to do this. The context can be defined in multiple ways, including the frequency of co-occurrence and the positive pointwise mutual information, building on the association measure proposed by <a href="http://dl.acm.org/citation.cfm?id=89095">Church and Hanks 1990</a>. For a more extensive introduction to such methods of word vectorization, see <a href="http://www.jair.org/media/2934/live-2934-4846-jair.pdf">Turney and Pantel 2010</a>.</p>
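<p>A minimal sketch of this global approach, using made-up co-occurrence counts purely for illustration: build a positive pointwise mutual information (PPMI) matrix and factorize it with a truncated SVD to obtain dense word vectors.</p>

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, cols: contexts).
# The counts are invented, just to show the pipeline.
C = np.array([[8., 2., 0.],
              [6., 3., 1.],
              [0., 1., 9.]])
total = C.sum()
p_wc = C / total
p_w = p_wc.sum(axis=1, keepdims=True)   # marginal over words
p_c = p_wc.sum(axis=0, keepdims=True)   # marginal over contexts

# PPMI: max(log p(w,c) / (p(w) p(c)), 0); zero-count cells map to 0.
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)

# Truncated SVD: keep k dimensions; rows of U[:, :k] * S[:k] are the
# dense word vectors.
U, S, Vt = np.linalg.svd(ppmi)
k = 2
word_vectors = U[:, :k] * S[:k]
print(word_vectors.shape)
```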
<p>On the other hand, the Word2Vec algorithms are based on local information. The idea is to construct a shallow neural network that predicts the words appearing around the word you are looking at (Skipgram) or the other way around (CBOW). Through this process, we obtain a vector representation for each word. We typically train this network by stochastic gradient descent with a decaying learning rate.</p>
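<p>The local training signal can be illustrated by generating (center, context) pairs from a toy sentence; the window size of 2 is a typical but arbitrary choice here.</p>

```python
# Generate (center, context) training pairs as Skipgram would see them.
sentence = "the quick brown fox jumps".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# Skipgram predicts each context word from the center word;
# CBOW would instead predict the center from the averaged context.
print(pairs[:3])
```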
<p>Images for Word2Vec from <a href="https://arxiv.org/pdf/1411.2738.pdf">Rong 2016</a>.</p>
<p><img src="https://jungokasai.github.io/images/SKIP.jpeg" style="float: left; width: 40%; margin-right: 1%; margin-bottom: 0.5em;" /> <img src="https://jungokasai.github.io/images/CBOW.jpeg" style="float: left; width: 38%; margin-right: 1%; margin-bottom: 0.7em;" /></p>
<div style="clear:both"></div>
<!--<img src="/images/SKIP.jpeg" style="float: left; width: 40%; margin-right: 1%; margin-bottom: 0.5em;"> <img src="/images/CBOW.jpeg" style="float: left; width: 38%; margin-right: 1%; margin-bottom: 0.7em;"> <p style="clear: both;"> -->
<p>The Word2Vec algorithms are very successful; notably, Levy and Goldberg proposed a “global” way of looking at the Skipgram algorithm <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, though it requires the strong assumption that the implicit matrix can be reconstructed.</p>
<p><a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> is intended to blend the global and local approaches by adding a global loss term to the objective function. It has achieved state-of-the-art performance on semantic and syntactic tasks.</p>
<p>As we saw above, there is plenty of work on word embeddings. However, there is an alternative to word embeddings as the ingredient fed into neural nets. As neural network architectures based on Convolutional Neural Networks (CNNs) gain attention in NLP <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>, character-based encoding together with a CNN becomes attractive: we let the network figure out which characters form a word instead of tokenizing the data beforehand.
In this scheme, we can simply represent each character as a one-hot vector.
This methodology is particularly appealing for languages whose tokenization is not cheap, such as Japanese and Chinese, unlike English.
In fact, a Japanese company reports that a character-based CNN achieved high accuracy in classifying consumer reviews of restaurants <sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>.
Even for English, incorporating character-based models into machine translation is an active area of research <sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>.</p>
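<p>A sketch of character-level one-hot encoding; the sample text and its alphabet are toy choices, and in practice the alphabet would be fixed in advance.</p>

```python
import numpy as np

# Character-level one-hot encoding: no word tokenizer needed, which is
# why this is attractive for languages where segmentation is costly.
text = "deep nlp"
alphabet = sorted(set(text))               # toy alphabet from the text itself
index = {ch: i for i, ch in enumerate(alphabet)}

encoded = np.zeros((len(text), len(alphabet)))
for t, ch in enumerate(text):
    encoded[t, index[ch]] = 1.0

# Each row is a one-hot character vector; a 1-d CNN over this matrix can
# learn which character spans behave like "words".
print(encoded.shape)
```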
<h3 id="architectural-explorations">Architectural Explorations</h3>
<p>Given the ingredients above, one way to further proceed with neural-network-based approaches in NLP is to explore different architectures.
Recurrent Neural Networks (RNNs) have been the most common architecture in neural nets for NLP since they are designed to deal with sequential data.
In particular, Long Short-Term Memory networks (LSTMs) have served as a key building block. LSTMs are designed to address the vanishing/exploding gradient problem of backpropagation through time by performing an additive update as opposed to a matrix-multiplicative update.
Although LSTMs have been established as a building block for neural-net-based approaches in NLP, Gated Convolutional Neural Networks have recently achieved better performance in language modeling with faster computation than LSTMs <sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>.
Moreover, attention mechanisms have also been explored in machine translation <sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup>, question answering, sentiment analysis, and part-of-speech tagging <sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup>.</p>
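<p>Concretely, the additive cell update behind this behavior can be written as follows, where $f_t$ and $i_t$ denote the forget and input gates (standard LSTM notation, not taken from the papers cited here):</p>
<script type="math/tex; mode=display">c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)</script>
<p>Because $c_{t-1}$ enters through element-wise addition rather than repeated matrix multiplication, gradients can flow across many time steps without vanishing or exploding as quickly.</p>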
<h3 id="decomposition-of-major-problems-in-nlp">Decomposition of Major Problems in NLP</h3>
<p>Personally, I believe that decomposing the major problems based on linguistic insight will play a key role in the future of neural nets in NLP.
For example, for the major task of syntactic parsing, people have proposed multiple formalisms to derive a tree for a sentence.
One direct method is the dependency grammar, and Google has provided a neural-net-based dependency parser called <a href="https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html">Parsey McParseface</a>. However, there are other grammatical formalisms: Tree Adjoining Grammar (TAG), introduced in <a href="http://www.sciencedirect.com/science/article/pii/S0022000075800195?via%3Dihub">Joshi et al. 1975</a>, and Combinatory Categorial Grammar (CCG), introduced in <a href="https://www.jstor.org/stable/4047583?seq=1#page_scan_tab_contents">Steedman 1987</a>. In either of these two formalisms, parsing comprises two phases, unlike the dependency-grammar case. The first phase is supertagging, a classification task into much richer categories than parts of speech. The second phase staples the assigned supertags together to derive a tree for each sentence. Both phases can employ neural networks, and again architectural questions arise. The CCG approach has been successful to some degree <sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup>. I am currently working on the TAG approach in <a href="https://jungokasai.github.io/projects/">my project</a>.</p>
<h3 id="notes">Notes</h3>
<p>In this post, we saw word embeddings as an ingredient for the use of neural nets in NLP. However, word embeddings serve other purposes in NLP and Computational Linguistics, including distributional semantics.</p>
<h3 id="references">References</h3>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>https://levyomer.files.wordpress.com/2014/09/neural-word-embeddings-as-implicit-matrix-factorization.pdf <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>https://arxiv.org/pdf/1612.08083.pdf <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>https://speakerdeck.com/bokeneko/tfug-number-3-rettyniokerudeep-learningfalsezi-ran-yan-yu-chu-li-hefalseying-yong-shi-li?slide=14 (Content in Japanese) <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>https://github.com/lmthang/thesis/blob/master/thesis.pdf <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>https://michaelauli.github.io/papers/gcnn.pdf <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>https://arxiv.org/pdf/1409.0473.pdf <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>https://arxiv.org/pdf/1506.07285.pdf (Dynamic Memory Networks) <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>https://aclweb.org/anthology/D16-1181 <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 17 Mar 2017 07:00:00 +0000
http://yaledatascience.github.io/2017/03/17/nnnlp.html
Tags: neural networks, natural language processing

Meta Resource
<p>The <a href="https://github.com/YaleDataScience/Resources">YDS Meta Resource</a> was born today.</p>
<p><strong>What is a meta resource?</strong></p>
<p>A meta resource is a list of useful resources.</p>
<p><strong>What is the YDS Meta Resource?</strong></p>
<p>The YDS Meta Resource is an open list of resources, editable by anyone, that can be used to start/further your development in the areas of data science and machine learning. We are currently using Github to maintain this resource, as it is a convenient place for such lists. This list will include introductory and advanced links, so if you are just being introduced to a lot of this stuff, look for the unbolded links.</p>
<p><strong>How can I edit the YDS Meta Resource?</strong></p>
<p>Start by joining the <a href="https://yds.slack.com/messages/datasets/">#datasets</a> and <a href="https://yds.slack.com/messages/workshops/">#workshops</a> Slack channels, to post links to datasources/tutorials/other things you find online. You can also join the <a href="https://flipboard.com/@krishpop/yds-jjjtk8f9y">YDS Flipboard</a> for a nice interface to some of the blogposts we find. Finally, you can make contributions to the Meta Resource yourself by filling out a pull request on Github (right now, the write-access to the repository is protected so nobody can delete the whole list).</p>
<p>Please make contributions to this list! Its usefulness is entirely contingent on the collective effort of all of us.</p>
Tue, 21 Feb 2017 17:51:00 +0000
http://yaledatascience.github.io/2017/02/21/meta-resource.html
Tags: machine learning, data science, meta-learning

Regularization and Dropout
<p>by <a href="https://github.com/jungokasai/">Jungo Kasai</a></p>
<p>(WIP)</p>
<h3 id="useful-resources">Useful Resources</h3>
<ul>
<li><a href="https://www.dropbox.com/s/l0w3sgtwlywnx7b/dropouttraining.pdf?dl=0">My Presentation Slides</a></li>
</ul>
<h2 id="dropout-as-regularization">Dropout as Regularization</h2>
<h3 id="introduction">Introduction</h3>
<p>One of the important challenges in the use of neural networks is generalization. Since neural networks have a huge hypothesis space, maximum likelihood estimation of the parameters almost always suffers from over-fitting. The most popular workaround is dropout <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.
Though it clearly causes the network to fit the training data less closely, it is not at all clear what mechanism lies behind dropout and how it relates to classical methods such as L2-norm regularization and the Lasso. With regard to this theoretical issue, Wager et al. <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> present a new view of dropout applied to Generalized Linear Models.
They view dropout as a process that artificially corrupts the data with multiplicative Bernoulli noise.
They prove that dropout applied to Generalized Linear Models amounts to adding an adaptive L2-norm regularization (penalty) term to the negative log-likelihood, up to a quadratic approximation.
It turns out that this L2 penalty works in a way that favors rare but discriminative features.</p>
<h3 id="background">Background</h3>
<h4 id="generalized-linear-models-and-exponential-family-distributions">Generalized Linear Models and Exponential Family Distributions</h4>
<p>The exponential family distribution is defined as the following:</p>
<script type="math/tex; mode=display">p(y ; \theta) = h(y)\exp(\theta^T T(y) - A(\theta))</script>
<p>And the Generalized Linear Model assumption is to set $\theta= \beta^T x$ where $x$ is an explanatory variable.
$\theta$ is called the canonical parameter. The function $A$ is a normalizer, and</p>
<script type="math/tex; mode=display">\nabla A (\theta) = E[T(Y)]\\
\nabla^2 A(\theta) = Cov[T(Y)]</script>
<p>You can prove this, and this will come in handy later.
For those of you who are interested in more details, see <a href="http://cs229.stanford.edu/notes/cs229-notes1.pdf">Andrew Ng’s Notes</a>.
Assuming that we have independent samples in the training set, the negative log likelihood is</p>
<script type="math/tex; mode=display">l(\beta) = - \sum\limits_i \log p(y^{(i)} \vert x^{(i)} ; \beta)</script>
<p>and</p>
<script type="math/tex; mode=display">\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\hat{\beta}_{MLE} = \argmin\limits_\beta l(\beta)</script>
<h4 id="robustness-and-constrained-optimization">Robustness and Constrained Optimization</h4>
<p>It is a terrible idea to fit an $n$th-degree polynomial to $n$ data points. It does a perfect job on the given data points, but it fails to generalize to new data points: this is over-fitting.
Putting it in more statistical language, the parameter estimate varies too much with the particular set of realizations drawn from the true distribution. This insight encourages us to constrain the values the parameter can take, so that the estimate does not vary so much with the realizations we happen to observe. L1- and L2-norm constraints on the estimated parameter are both common choices. This is the trade-off between the bias and variance of a model.
For instance, instead of the ordinary linear regression problem, we can pose an optimization problem constrained by the L2 norm.</p>
<script type="math/tex; mode=display">\min\limits_\beta \lVert y - X\beta \rVert_2^2 \\ \text{ such that }
\lVert \beta \rVert_2^2 \leq s</script>
<p>However, by the strong duality of this convex problem, we can equivalently solve the penalized (Lagrangian) form:</p>
<p><script type="math/tex">\min\limits_\beta \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2</script>
where there is a one-to-one relationship between $\lambda$ and $s$.</p>
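<p>As a sketch of the penalized form, ridge regression has the closed-form solution $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$; the data below is synthetic, purely for illustration.</p>

```python
import numpy as np

# Synthetic regression data with a known coefficient vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Ridge estimate via the normal equations (X'X + lam I) beta = X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Larger lambda shrinks the estimate toward zero: a smaller L2 norm,
# i.e. a tighter constraint set s in the primal formulation.
b_small = ridge(X, y, 0.01)
b_large = ridge(X, y, 100.0)
assert np.linalg.norm(b_large) < np.linalg.norm(b_small)
```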
<h3 id="adaptive-regularization-and-dropout">Adaptive Regularization and Dropout</h3>
<p>Vanilla regularization schemes, such as the Lasso and Ridge Regression, penalize big parameters uniformly.
Namely, the constraint is based solely on the parameter. However, it can sometimes be useful to prefer some features over others. For instance, when recognizing a handwritten digit in the MNIST dataset, we might want to look at rare features that are specific to each digit. In such cases, adaptive regularization comes into play.
Wager et al.<sup id="fnref:2:1"><a href="#fn:2" class="footnote">2</a></sup> prove that dropout works in a way that effects adaptive regularization. First, let’s think of dropout training as a process that artificially corrupts the data with noise.</p>
<script type="math/tex; mode=display">\widetilde x^{(i)} = x^{(i)} \odot \xi_i \quad \text{where} \quad \xi_i \sim \frac{\text{Bern}(1-\delta)}{1-\delta}</script>
<p>Then, the negative log likelihood (the cost function) becomes:</p>
<script type="math/tex; mode=display">l(\beta) = \sum_{i=1}^m - \log h(y^{(i)}) + A(\beta^T x^{(i)}) - (\beta^T x^{(i)} ) y^{(i)} \\
E[\widetilde l(\beta)]
= E\left[\sum_{i=1}^m - \log h(y^{(i)}) + A(\beta^T \widetilde x^{(i)}) - (\beta^T \widetilde x^{(i)} ) y^{(i)}\right]\\
= l(\beta) + \sum_{i=1}^m E[A(\beta^T \widetilde x^{(i)})] - A(\beta^T x^{(i)})\\
= l(\beta) + R(\beta)</script>
<p>We can approximate the R penalty term up to the quadratic Taylor expansion.</p>
<script type="math/tex; mode=display">R(\beta) = \sum_{i=1}^m E[A(\beta^T \widetilde x^{(i)})] - A(\beta^T x^{(i)}) \\
\approx \sum_{i=1}^m A^{\prime} (\beta^T x^{(i)}) \, \beta^T E[\widetilde x^{(i)}- x^{(i)}] + \frac{1}{2} A^{\prime\prime}(\beta^T x^{(i)}) Var(\beta^T \widetilde x^{(i)}) \\
= \sum_{i=1}^m \frac{1}{2} A^{\prime\prime}(\beta^T x^{(i)}) Var(\beta^T \widetilde x^{(i)}) \doteq R^q(\beta)</script>
<p>Now recall that $A^{\prime\prime}(\theta) = Cov[T(Y)]$. So for logistic regression (the Bernoulli distribution),</p>
<script type="math/tex; mode=display">R^q(\beta) = \sum_{i=1}^m \frac{1}{2} Var[\beta^T \widetilde x^{(i)}]\sigma(\beta^T x^{(i)})(1-\sigma(\beta^T x^{(i)})) \\
= \sum_{i=1}^m \frac{1}{2} Var[\beta^T \widetilde x^{(i)}]p^{(i)} (1-p^{(i)})</script>
<p>where $\sigma$ denotes the sigmoid function.</p>
<p>Also,</p>
<script type="math/tex; mode=display">Var(\xi) = \frac{\delta (1-\delta)}{(1-\delta)^2} I = \frac{\delta}{1-\delta}I\\
Var(\widetilde x^{(i)}) = Var[x^{(i)} \odot \xi] = \frac{\delta}{1-\delta} diag\left(x_j^{(i)^2}\right)\\
Var(\beta^T \widetilde x^{(i)}) = \beta^T Var(\widetilde x^{(i)}) \beta = \frac{\delta}{1-\delta} \sum_{j=1}^d x_j^{(i)^2} \beta_j^2</script>
<p>So intuitively, the penalty is large when the model is not confident ($p^{(i)}$ near $1/2$) and the corresponding features fire strongly, while features that fire rarely incur little penalty. In other words, dropout training favors rare but discriminative features.</p>
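<p>The corruption process can be checked numerically: the scaled Bernoulli noise leaves the mean of $\widetilde x$ unchanged while injecting variance $\frac{\delta}{1-\delta} x_j^2$ per coordinate, matching the derivation above. A minimal sketch:</p>

```python
import numpy as np

# Multiplicative Bernoulli dropout noise, scaled by 1/(1-delta) so that
# E[x_tilde] = x (the corruption is unbiased).
rng = np.random.default_rng(0)
delta = 0.5                      # dropout probability
x = np.array([1.0, 2.0, 3.0])

n = 200_000
xi = rng.binomial(1, 1 - delta, size=(n, x.size)) / (1 - delta)
x_tilde = x * xi                 # element-wise corruption of n copies of x

# Empirical mean stays near x; empirical variance per coordinate is near
# delta/(1-delta) * x_j**2.
print(x_tilde.mean(axis=0))
print(x_tilde.var(axis=0))
```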
<h3 id="references">References</h3>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>https://arxiv.org/pdf/1207.0580.pdf <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf <a href="#fnref:2" class="reversefootnote">↩</a> <a href="#fnref:2:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
</ol>
</div>
Mon, 21 Nov 2016 08:45:00 +0000
http://yaledatascience.github.io/2016/11/21/dropout.html
Tags: neural networks, learning theory

Guide to Autoencoders
<p>by <a href="http://krishpop.xyz/">Krishnan Srinivasan</a></p>
<p>(A tutorial on autoencoders)</p>
<h3 id="useful-resources">Useful Resources</h3>
<ul>
<li><a href="https://arxiv.org/abs/1606.05908">VAE tutorial</a></li>
<li><a href="https://blog.keras.io/building-autoencoders-in-keras.html">keras tutorial</a></li>
</ul>
<h2 id="autoencoders">Autoencoders</h2>
<h3 id="introduction">Introduction</h3>
<p>We aren’t going to spend too much time on just autoencoders because they are not as widely used today due to the
development of better models. However, we will cover them because they are essential to understanding the later topics
of this guide.</p>
<p>The premise: you are trying to create a neural network that can efficiently encode your input data in a lower-dimensional encoding,
which it is then able to decode back into the original input, losing as little of the original input as
possible. This is useful for the following reason. Imagine your input data is very high-dimensional, but in reality
the only valid inputs you would ever receive lie in a subspace
of this high-dimensional space. In fact, they lie on a manifold of this space, which can be spanned using fewer dimensions, and these dimensions
can have properties that are useful to learn, as they capture some intrinsic/invariant aspect of the input space.</p>
<p>To achieve this dimensionality reduction, the autoencoder was introduced as an unsupervised learning way of attempting
to reconstruct a given input with fewer bits of information.</p>
<h3 id="basic-architecture">Basic Architecture</h3>
<p>Now at this point, the theory starts to involve an understanding of what neural networks are. The prototypical
autoencoder is a neural network which has input and output layers identical in width, and has the property of
“funneling” the input, after a sequence of hidden layers, into a hidden layer less wide than the input, and then
“fanning out” back to the original input dimension, and constructing the output. Typically, the sequence of layers to
the middle layer is repeated in reverse order to scale back up to the output layer. The sequence of funneling layers
is referred to as the “encoder,” and the fanning-out layers are called the “decoder.”</p>
<p>The loss function <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> typically used in these architectures is mean squared error $J(x,z) = \lVert x - z\rVert^2$,
which measures how close the reconstructed input $z$ is to the original input $x$. When the data resembles a vector of
binary values or a vector of probabilities (both in the range $[0,1]$), you can also use the
cross-entropy reconstruction loss function, which measures how many “bits” of information are preserved in the
reconstruction compared to the original. This loss function is <script type="math/tex">J(x, z) = -\sum_k^d[x_k \log z_k +
(1-x_k)\log(1-z_k)].</script></p>
<p>Once you’ve picked a loss function, you need to consider which activation functions to use on the hidden layers of the
autoencoder. In practice, if using the reconstruction cross-entropy loss, it is important to make sure that</p>
<p>(a) your data is binary or scaled from 0 to 1, and
(b) you are using a sigmoid activation in the last layer.</p>
<p>You can also optionally use sigmoid activations for each hidden layer, as that will keep the activation values between
0 and 1 and make it easier to perform linear transformations that keep the data in the range of values it
was provided in.</p>
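<p>To make the shapes and losses concrete, here is a minimal one-hidden-layer autoencoder forward pass in plain numpy. The 784-to-32 widths echo MNIST-style inputs, and the weights are random, so this shows structure only, not a trained model.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 784, 32                         # input width, code width
W_enc = rng.normal(0, 0.01, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(0, 0.01, size=(k, d))
b_dec = np.zeros(d)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.uniform(0, 1, size=d)          # input scaled to [0, 1]
code = sigmoid(x @ W_enc + b_enc)      # encoder: funnel to k dims
z = sigmoid(code @ W_dec + b_dec)      # decoder: fan back out to d dims

# Both losses discussed above, computed on the reconstruction z.
mse = np.mean((x - z) ** 2)
xent = -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))
print(code.shape, mse)
```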
<h3 id="application-to-pre-training-networks">Application to pre-training networks</h3>
<p>There are many ways to select the initial weights to a neural network architecture. A common initialization scheme is
random initialization, which sets the biases and weights of all the nodes in each hidden layer randomly, so they are in
a random point of the space, and objective function, and then find a nearby local minima using an algorithm like
<a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">SGD</a> or <a href="https://arxiv.org/abs/1412.6980">Adam</a>. In
2006-2007, autoencoders were discovered to be a useful way to pre-train networks (in 2012 this was applied to conv
nets), in effect initializing the weights of the network to values that would be closer to the optimal, and therefore
require less epochs to train. While I could try re-explaining how that works here, Quoc Le’s explanation from his
series of Stanford lectures is much better, so I’ll include the links to that below.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> <sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> In particular, look at
section 2.2 of the deep learning tutorial for the part about pre-training with autoencoders.</p>
<p>However, other random initialization schemes have been found more recently to work better than pre-training with
autoencoders. For more on this, see <a href="http://www.jmlr.org/proceedings/papers/v28/sutskever13.pdf">Martens</a> for Hessian-free optimization as one of these methods, and
<a href="http://www.jmlr.org/proceedings/papers/v28/sutskever13.pdf">Sutskever, Martens et al</a> for an overview of initialization and momentum.</p>
<h3 id="sparsity">Sparsity</h3>
<p>One of the things that I am currently experimenting with is the construction of sparse autoencoders. These can be
implemented in a number of ways, one of which uses sparse, wide hidden layers before the middle layer to make the
network discover properties in the data that are useful for “clustering” and visualization. Typically, however, a
sparse autoencoder creates a sparse encoding by enforcing an l1 constraint on the middle layer. It does this by
including the l1 penalty in the cost function, so, if we are using MSE, the cost function becomes</p>
<script type="math/tex; mode=display">J(x,z,s) = \lVert x - z \rVert^2 + \lambda\lVert s \rVert_1</script>
<p>where $s$ is the sparse coding in the middle layer, and $\lambda$ is a regularization parameter that weights the
influence of the l1 constraint over the entire cost function. For more on these, see <a href="http://deeplearning.stanford.edu/wiki/index.php/Sparse_Coding:_Autoencoder_Interpretation">sparse coding</a>.</p>
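<p>A toy computation of the objective above; the reconstruction, sparse code, and $\lambda$ are made-up values, just to show how the two terms combine.</p>

```python
import numpy as np

# J(x, z, s) = ||x - z||^2 + lambda * ||s||_1
x = np.array([0.2, 0.7, 0.1])    # original input
z = np.array([0.25, 0.6, 0.2])   # hypothetical reconstruction
s = np.array([0.0, 0.9, 0.0])    # sparse code: mostly zeros
lam = 0.1                        # regularization weight

J = np.sum((x - z) ** 2) + lam * np.sum(np.abs(s))
print(J)
```

Raising `lam` pushes more entries of the learned code toward exactly zero, which is why the l1 penalty (rather than l2) is used to enforce sparsity.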
<h2 id="denoising-autoencoders">Denoising Autoencoders</h2>
<h3 id="introduction-1">Introduction</h3>
<ul>
<li>autoencoders to reconstruct noisy data</li>
<li>Useful for weight initialization
<ul>
<li>unsupervised learning criterion for <strong>layer-by-layer initialization</strong> <sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>:
<ul>
<li>each layer is trained to produce higher level representation</li>
<li>with successive layers, representation becomes more abstract</li>
</ul>
</li>
<li>then, <strong>global fine-tuning</strong> of parameters with another training criterion
<ul>
<li><strong>robustness to partial destruction of input</strong></li>
</ul>
</li>
</ul>
</li>
</ul>
<h3 id="denoising-approach">Denoising Approach</h3>
<ul>
<li>introduce noise into the observed input:
<ul>
<li>to yield almost the same representation</li>
<li>guided by the fact that <em>a good representation captures stable structures in the form of dependencies and regularities characteristic of the unknown distribution of the input</em></li>
</ul>
</li>
<li>goal:
<ul>
<li>minimize <em>average reconstruction error</em> <span></span></li>
</ul>
</li>
</ul>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\theta^*, \theta'^{*} &= \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^n L(x^i, z^i) \\
&= \arg\min_{\theta, \theta'} \frac{1}{n}\sum_{i=1}^n L(x^i, g_\theta (f_\theta (x^i)))
\end{align} %]]></script>
<ul>
<li>where $L$ is loss func like squared error</li>
<li>An alternative loss is reconstruction cross entropy, for vectors of bit probabilities</li>
</ul>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
L_H(x,z) &= H(B_x \lVert B_z) \\
&= - \sum_{k=1}^d[x_k \log z_k + (1-x_k)\log(1-z_k)]
\end{align*} %]]></script>
<ul>
<li>if $x$ is a binary vector, the binary cross-entropy becomes the negative log-likelihood for $x$ given the vector of Bernoulli parameters $z$ (Eq. 1 in the paper)</li>
</ul>
<h3 id="dae-objective-function">DAE objective function</h3>
<ul>
<li>one way to destroy components of the input is to zero out a random subset of them, yielding the corrupted input $\widetilde{x}$</li>
<li>$\widetilde{x}$ is then mapped to a hidden representation $y = f_\theta(\widetilde{x}) = s(W\widetilde{x} + b)$, from which we reconstruct $z = g_{\theta'}(y) = s(W'y + b')$</li>
<li>define the joint distribution
<script type="math/tex">q^0(X, \tilde{X}, Y) = q^0(X)q_D(\tilde{X}|X)\delta_{f_\theta(\tilde{X})}(Y)</script></li>
<li>$\delta_u(v)$ puts mass $0$ when $u \neq v$, Y is a deterministic function of $\tilde{X}$.</li>
<li>objective function minimized by SGD is:
<script type="math/tex">\arg\min_{\theta, \theta'} \mathbb{E}_{q^0(X,\tilde{X})} L_H(X, g_{\theta'}(f_\theta(\tilde{X}))) \tag{3}</script></li>
</ul>
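<p>The zero-masking corruption $q_D(\widetilde{x} \mid x)$ can be sketched as follows; the destruction fraction $\nu = 0.3$ is an arbitrary illustrative choice.</p>

```python
import numpy as np

# Zero-masking corruption: each component of x is destroyed (set to 0)
# independently with probability nu; the denoising autoencoder is then
# trained to reconstruct the clean x from x_tilde.
rng = np.random.default_rng(0)
nu = 0.3
x = rng.uniform(0, 1, size=100)

mask = rng.random(x.size) >= nu    # keep each component with prob 1 - nu
x_tilde = x * mask

# Roughly a fraction nu of the components end up zeroed.
print((x_tilde == 0).mean())
```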
<h3 id="layer-wise-initialization-and-fine-tuning">Layer-wise initialization and fine-tuning</h3>
<ul>
<li>representation of the $k$-th layer used to train $(k+1)$-th layer <sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>
<ul>
<li>used as initialization for network opt wrt supervised training criterion</li>
<li>the greedy layer-wise approach finds better local minima than random initialization does</li>
</ul>
</li>
</ul>
<h3 id="practical-considerations">Practical Considerations</h3>
<p>So, what does any of this mean? How can I use this? First, it’s important to note what autoencoders are useful for. The main uses today for autoencoders are their generative and denoising capabilities, which is done with variational and denoising autoencoders. A third application is dimensionality reduction for data visualization, as autoencoders find interesting lower-dimensional embeddings of the data.</p>
<h2 id="variational-autoencoders">Variational Autoencoders</h2>
<p>To learn more about the statistical background to VAEs, <a href="http://blog.evjang.com/2016/08/variational-bayes.html">Eric Jang’s post</a> is a great resource to get started.</p>
<p>Variational Autoencoders are a relatively recent application of neural networks to generate ‘samples’ based on the
representations of the input space that they have ‘learned.’ Eric’s article goes in depth into the methods that are
applied in these models, but the key takeaway is the goal of learning an approximation of an underlying distribution
in the data that allows you to generate samples that are close to the data input into your model. This is done by
optimizing the “encoding” $z \sim Q(Z|X)$ and “decoding” $x \sim P(X|Z)$ distributions to minimize the variational
lower bound $\mathcal{L} = \log p(x) - KL(Q(Z|X)||P(Z|X)) = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z))$.</p>
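<p>When $Q(Z|X)$ is a diagonal Gaussian and the prior $P(Z)$ is a standard normal, the KL term in this bound has a well-known closed form, $-\frac{1}{2}\sum_j (1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)$. The sketch below uses toy encoder outputs, not values from a trained model.</p>

```python
import numpy as np

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), as used in the VAE
# objective; mu and log_var stand in for the encoder's outputs.
mu = np.array([0.5, -0.3])
log_var = np.array([-0.1, 0.2])

kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
assert kl >= 0.0   # KL divergence is always non-negative
print(kl)
```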
<h2 id="adversarial-autoencoders">Adversarial Autoencoders</h2>
<p><a href="https://arxiv.org/abs/1511.05644">https://arxiv.org/abs/1511.05644</a></p>
<h3 id="references">References</h3>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>http://www.deeplearning.net/tutorial/dA.html <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>http://www.trivedigaurav.com/blog/quoc-les-lectures-on-deep-learning/ <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>http://ai.stanford.edu/~quocle/tutorial2.pdf <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>http://info.usherbrooke.ca/hlarochelle/publications/vincent10a.pdf <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sat, 29 Oct 2016 19:45:00 +0000
http://yaledatascience.github.io/2016/10/29/autoencoders.html
<h1 id="learning-to-learn-rnn-based-optimization">Learning-To-Learn: RNN-based optimization</h1>
<p>by <a href="http://runopti.github.io/blog/">Yutaro Yamada</a></p>
<p>Around the middle of June, this paper came up: <a href="https://arxiv.org/pdf/1606.04474v1.pdf">Learning to learn by gradient descent by gradient descent</a>. For someone who’s interested in optimization and neural networks, I think this paper is particularly interesting. The main idea is to use a neural network to produce the update step for gradient descent, rather than hand-designing the update rule.</p>
<h2 id="summary-of-the-paper">Summary of the paper</h2>
<p>Usually, when we want to design learning algorithms for an arbitrary problem, we first analyze the problem, and use the insight from the problem to design learning algorithms. This paper takes a one-level-above approach to algorithm design by considering a class of optimization problems, instead of focusing on one particular optimization problem.</p>
<p>The question is how to learn an optimization algorithm that works on a “class” of optimization problems. The answer is to parameterize the optimizer. This way, we effectively cast algorithm design as a learning problem in which we want to learn the parameters of the optimizer. (The parameters of the function being optimized are called the optimizee parameters.)</p>
<p>But how do we model the optimizer? We use a Recurrent Neural Network (RNN), so the parameters of the optimizer are just the parameters of the RNN. The parameters of the original function in question (i.e. the cost function of “one instance” of a problem drawn from the class of optimization problems) are referred to as the “optimizee parameters”, and are updated using the output of our optimizer, just as we update parameters using the gradient in SGD. The final optimizee parameters <script type="math/tex">\theta^*</script> will be a function of the optimizer parameters and the function in question. In summary:</p>
<script type="math/tex; mode=display">\theta^* (\phi, f) \text{: the final optimizee parameters}</script>
<script type="math/tex; mode=display">\phi \text{: the optimizer parameters}</script>
<script type="math/tex; mode=display">f\text{: the function in question}</script>
<p><script type="math/tex">\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi) \text{: the update equation of the optimizee parameters}</script>
where <script type="math/tex">g_t</script> is modeled by an RNN, so <script type="math/tex">\phi</script> is the set of RNN parameters. Because LSTMs generally handle long-range dependencies better than vanilla RNNs, the paper uses an LSTM. For comparison, regular gradient descent algorithms use <script type="math/tex">g_t(\nabla f(\theta_t), \phi) = -\alpha \nabla f(\theta_t)</script>.</p>
<p>The RNN’s output is a function of the current hidden state <script type="math/tex">h_t</script>, the current gradient <script type="math/tex">\nabla f(\theta_t)</script>, and its parameters <script type="math/tex">\phi</script>.</p>
<p>The “goodness” of our optimizer can be measured by the expected loss over the distribution of a function <script type="math/tex">f</script>, which is</p>
<script type="math/tex; mode=display">L(\phi) = \mathbb{E}_f [f(\theta^* (\phi, f))]</script>
<p>(I’m ignoring <script type="math/tex">w_t</script> in the above expression of <script type="math/tex">L(\phi)</script> because in the paper they set <script type="math/tex">w_t = 1</script>.)</p>
<p>For example, suppose we have a function like <script type="math/tex">f(\theta) = a \theta^2 + b\theta + c</script>. If <script type="math/tex">a,b,c</script> are drawn from a Gaussian distribution with some fixed <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>, this defines a distribution over functions <script type="math/tex">f</script>. (Here, the class of optimization problems consists of quadratics whose coefficients <script type="math/tex">a,b,c</script> are drawn from a Gaussian.) In this example, the optimizee parameter is <script type="math/tex">\theta</script>. The optimizer (i.e. the RNN) is trained by optimizing functions randomly drawn from this distribution, each time seeking the best parameter <script type="math/tex">\theta</script>. To measure how good our optimizer is, we can take the expected value of <script type="math/tex">f</script> under this distribution, and use gradient descent to optimize <script type="math/tex">L(\phi)</script>.</p>
<p>After understanding the above basics, all that is left is some implementation/architecture details for computational efficiency and learning capability.</p>
<p>(By the way, there is a typo in page 3 under Equation 3; <script type="math/tex">\nabla_{\theta} h(\theta)</script> should be <script type="math/tex">\nabla_{\theta} f(\theta)</script>. Otherwise it doesn’t make sense.)</p>
<h3 id="coordinatewise-lstm-optimizer">Coordinatewise LSTM optimizer</h3>
<p><img src="http://runopti.github.io/blog/2016/10/17/learningtolearn-1/compgraph.png" alt="compgraph" /></p>
<p>The Figure is from the <a href="https://arxiv.org/pdf/1606.04474v1.pdf">paper</a> : Figure 2 on page 4</p>
<p>To make the learning problem computationally tractable, we update the optimizee parameters <script type="math/tex">\theta</script> coordinate-wise, much like other successful optimization methods such as Adam, RMSprop, and AdaGrad.</p>
<p>To this end, we create <script type="math/tex">n</script> LSTM cells, where <script type="math/tex">n</script> is the number of dimensions of the parameter of the objective function. We set up the architecture so that the parameters of the LSTM cells are shared, but each cell has its own hidden state. This can be achieved by the code below:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">lstm = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
cell_list = []
for i in range(number_of_coordinates):
    # weights are shared across coordinates; each cell keeps its own hidden state
    cell_list.append(tf.nn.rnn_cell.MultiRNNCell([lstm] * num_layers))  # num_layers = 2 according to the paper.</code></pre></figure>
<h3 id="information-sharing-between-coordinates">Information sharing between coordinates</h3>
<p>The coordinate-wise architecture above treats each dimension independently, which ignores correlations between coordinates. To address this issue, the paper introduces more sophisticated methods. The following two models allow different LSTM cells to communicate with each other.</p>
<ol>
<li>Global averaging cells: a designated subset of each cell’s outputs is averaged across coordinates, and the averaged value is made available to every cell.</li>
<li>NTM-BFGS optimizer: a more sophisticated version of 1, with an external memory that is shared between coordinates.</li>
</ol>
<h2 id="implementation-notes">Implementation Notes</h2>
<h3 id="quadratic-function-31-in-the-paper">Quadratic function (3.1 in the paper)</h3>
<p>Let’s say the objective function is <script type="math/tex">f(\theta) = \lVert W \theta - y \rVert^2</script>, where the elements of <script type="math/tex">W</script> and <script type="math/tex">y</script> are drawn from the Gaussian distribution.</p>
<p><script type="math/tex">g</script> (as in <script type="math/tex">\theta_{t+1} = \theta_t + g</script>) has to be the same size as the parameter size. So, it will be something like:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">g, state = lstm(input_t, hidden_state)  # input_t is the gradient of the objective w.r.t. the optimizee parameters at time t</code></pre></figure>
<p>And the update equation will be:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">param</span> <span class="o">=</span> <span class="n">param</span> <span class="o">+</span> <span class="n">g</span></code></pre></figure>
<p>The objective function is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
L(\phi) &= \mathbb{E}_f [ \sum_{t=1}^T w_t f(\theta_t) ] \\
\text{where, }\theta_{t+1} &= \theta_t + g_t \\
\left[
\begin{array}{c}
g_t \\
h_{t+1}
\end{array}
\right]
&= RNN(\nabla_t, h_t, \phi)
\end{aligned} %]]></script>
<p>The loss <script type="math/tex">L(\phi)</script> can be computed with a double for-loop: in the outer loop, a different function is randomly sampled from the distribution of <script type="math/tex">f</script>; in the inner loop, <script type="math/tex">\theta_t</script> is computed with the update equation above. So, overall, what we need to implement is the two-layer coordinate-wise LSTM cell. The actual implementation is <a href="https://github.com/runopti/Learning-To-Learn">here</a>.</p>
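<p>As a toy illustration of this double loop (not the actual implementation), here is a version in which the RNN is replaced by a single learned scalar step size <script type="math/tex">\phi</script>, so that <script type="math/tex">g_t = -\phi \nabla f(\theta_t)</script>; <script type="math/tex">\phi</script> is then meta-trained by finite-difference gradient descent on <script type="math/tex">L(\phi)</script>. All sizes and constants are illustrative choices.</p>

```python
import numpy as np

def sample_quadratic(rng, n=3):
    """Draw one problem instance f(theta) = ||W theta - y||^2 from the class."""
    W = rng.normal(size=(n, n))
    y = rng.normal(size=n)
    f = lambda th: float(np.sum((W @ th - y) ** 2))
    grad = lambda th: 2.0 * W.T @ (W @ th - y)
    return f, grad

def meta_loss(phi, seed, n_funcs=20, T=10):
    """L(phi) = E_f[sum_t f(theta_t)], with the RNN replaced by g_t = -phi * grad."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_funcs):                    # outer loop: sample a function f
        f, grad = sample_quadratic(rng)
        theta = np.zeros(3)
        for _ in range(T):                      # inner loop: unroll the optimizer
            theta = theta - phi * grad(theta)   # update of the optimizee parameters
            total += f(theta)
    return total / n_funcs

# Meta-train phi by finite-difference gradient descent on L(phi).
phi, eps, lr = 0.01, 1e-4, 1e-6
for i in range(100):
    g = (meta_loss(phi + eps, seed=i) - meta_loss(phi - eps, seed=i)) / (2.0 * eps)
    phi = float(np.clip(phi - lr * g, 1e-4, 0.02))   # keep the step size in a stable range
```

<p>In the paper the scalar <code>phi</code> is replaced by the full LSTM, and the meta-gradient is computed by backpropagation through the unrolled updates rather than finite differences.</p>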
<h1 id="results">Results</h1>
<p><img src="http://runopti.github.io/blog/2016/10/17/learningtolearn-1/output_38_1.png" alt="results" /></p>
<p>I compared the results with SGD; for now, SGD tends to work better than the learned optimizer. The optimization still needs more improvement…</p>
Mon, 17 Oct 2016 09:46:43 +0000
http://yaledatascience.github.io/2016/10/17/learningtolearn-1.html
<h1 id="december-2015-info-session">December 2015 Info Session</h1>
<h2 id="the-gang-is-back">The gang is back</h2>
<p>Hello! In the following months, the Yale Data Science group will be back, and better than ever. We’ve entered
a new partnership with the upcoming Department of Statistics and Data Science, and there will be many
future collaborations to look forward to.</p>
<p>Until then, be sure to check out the new <a href="https://facebook.com/YaleDataScience">Facebook page</a>, as well as
the <a href="https://www.facebook.com/events/524410961054701/">event page</a>, where you can fill out an attendance form so that we know who’s interested and how much food to get! The winter info session will offer some advice on
what courses to enroll in next semester, as well as a brief preview of what the team has in store for everyone.</p>
<p>Until then, good luck with your final exams!</p>
Wed, 09 Dec 2015 00:00:00 +0000
http://yaledatascience.github.io/2015/12/09/winter-info-session.html
<h1 id="spring-2015-preview">Spring 2015 Preview</h1>
<p>Hey there, data lovers! Good news: Yale Data Science is back in a big way. We’ve revamped our approach for this semester and can’t wait to get going. Here’s a preview of one of our new focuses: data science vignettes. This post will take you from a question to a cool result, with some data scraping, modeling, and visualization along the way. Added bonus: you can gain insight like this quickly! This took only a couple hours of development. Data science!</p>
<p>As always, <strong><a href="mailto:yaledatascience@gmail.com">email us</a></strong> if you have questions!</p>
<h3 id="code"><a href="https://github.com/YaleDataScience/enroll">Code</a></h3>
<h3 id="introduction">Introduction</h3>
<p>Ever taken a course that you REALLY REALLY want other people to take? Ever been a professor who hasn’t been happy with your course’s enrollment and can pay a bunch of students to write reviews? Well listen up.</p>
<p>There has been a lot of <a href="http://haufler.org/2014/01/19/i-hope-i-dont-get-kicked-out-of-yale-for-this/">effort</a> put into using numerical ratings to improve our understanding of Yale’s courses. However, the review comments - which provide the richest information - have flown under the radar. For a course with high ratings, it’s probably obvious that words like “good” and “recommend” will come up frequently. Similarly, for a course with a high workload, we’d expect to see terms like “hard” and “no sleep”.</p>
<p>But what do highly shopped courses look like? This is the most interesting question, since the actual action someone will take after looking at reviews is to add it to their OCS worksheet (or not). By the end of this post, you’ll know what stuff to write to get people to sign up on OCS. If you aren’t interested in the methods, just skip on down to the <strong><a href="#result">results</a></strong>.</p>
<h3 id="disclaimer">Disclaimer</h3>
<p>Yale’s course catalog has generated quite a bit of <a href="http://yaledailynews.com/blog/2014/01/22/ybb-closure-prompts-questions-about-data-rules/">data controversy</a> over the years. We don’t want to add to that. We won’t display evaluations of individual courses or professors in ways that the University did not intend. If the names of any individuals or courses came up during the course of our analysis, they have been censored. We won’t host any of Yale’s data in our <a href="https://github.com/YaleDataScience/enroll">Github repo</a> in accordance with University policy, but we can tell you how to get it yourselves. And we will. Right now.</p>
<h3 id="the-dataset">The Dataset</h3>
<p>For a given course, we’re interested in the relationship between two data sources: the content of its text reviews and the number of people who have added it to their OCS worksheet. We need to pull both of these from the web.</p>
<h5 id="text-reviews">Text Reviews</h5>
<p>Peter Xu and Harry Yu of <a href="http://coursetable.com">CourseTable</a> developed a <a href="http://coursetable.com/UploadDataFile">crawler</a> to read data from OCS. It’s simple to use (note: you must be a Yale student to do so). It pulls down data as a SQLite database. You can extract the course evaluation table as a comma-separated value file either from the <a href="http://stackoverflow.com/questions/5776660/export-from-sqlite-to-csv-using-shell-script">command line</a> or using a tool like <a href="http://www.speqmath.com/tutorials/sqlite_export/">SQLite Export</a>.</p>
<h5 id="course-demand">Course Demand</h5>
<p>Yale has recently made an effort to up its data presentation game when it comes to Shopping Period. Most notably, they constantly update demand figures on <a href="https://ivy.yale.edu/course-stats/">this site</a>. We developed the Python script <strong><a href="https://github.com/YaleDataScience/enroll/blob/master/py/ocs_demand.py">ocs_demand.py</a></strong> to extract the required data. It makes heavy use of the <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup package</a> to deconstruct HTML source code. On a Unix machine, use the following command to get what you want:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>python ocs_demand.py | sort | uniq > ocs_demand.tsv
</code></pre>
</div>
<p>(Note: there’s some weird stuff going on with the course name field that we aren’t sure how to fix. We can work around it pretty easily.)</p>
<h5 id="bringing-it-together">Bringing It Together</h5>
<p>We will make the following assumption: people view all of the reviews for a course as a giant blob of text, and not individual tidbits. This seems reasonable to us, since people generally scroll through them quickly. This assumption also simplifies inference, so if you disagree with it and want to put in some extra work, that’s great.</p>
<p>So what we want now is a table with one row per course and the following columns: course identifier (a string), OCS demand (an integer), and the concatenated review text (a string). The Python program <strong><a href="https://github.com/YaleDataScience/enroll/blob/master/py/data_aggregate.py">data_aggregate.py</a></strong> will handle this. As input, it takes two CSVs extracted from the CourseTable crawler - one with review data and one with course name data - and the TSV from the OCS demand script. Its design is as follows:</p>
<ol>
<li>Create a dictionary mapping course IDs to their concatenated review text. The review text is treated for natural language processing purposes via tokenizing, lowercasing, stopwording, and stemming. More on this later.</li>
<li>Create a dictionary mapping full course names (e.g. STAT 365) to their course IDs (e.g. 17). Note that many different course names may be mapped to the same ID.</li>
<li>Create a dictionary mapping course IDs to their OCS demand. This is where the dictionary from step 2 comes into play: the demand figures are associated with course names at first, and we need them associated with course IDs.</li>
<li>Combine the dictionaries from steps 1 and 3 to output the desired table.</li>
</ol>
<p>To run it on my machine, I used the following command:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>python data_aggregate.py ~/workspace/enroll/data/ocs_comments.csv ~/workspace/enroll/data/course_names.csv ~/workspace/enroll/data/ocs_demand.tsv ~/workspace/enroll/data/enroll_data.csv --stop --wstem
</code></pre>
</div>
<p>Again, we can’t provide you with data. See if you can get this to work on your own.</p>
<h3 id="the-model">The Model</h3>
<p>Let’s recap. Our goal is to see what review content will generate demand for the course on OCS. We have a table giving an ID, the demand, and the review text for every course. A simple approach would be to see which words show up frequently when the demand is high. The more sophisticated approach is to use a topic model.</p>
<p><strong>Topic models</strong> comprise an important subject in machine learning, natural language processing, and graphical modeling. Essentially, a topic model is a statistical method used to identify latent clusters of elements within a collection. If you have a collection of documents composed of text, a topic model might be used to identify groupings of words, otherwise known as topics. Every topic model relies on co-occurrence; words that frequently occur together in documents are presumably of the same topic.</p>
<p>Perhaps the best known topic model is <strong>latent Dirichlet allocation</strong> - frequently referred to as LDA - which was introduced in 2003 by several big names: David Blei, Andrew Ng, and Michael (I.) Jordan. LDA is one of those algorithms that looks like magic when you first see it in application, and then still looks like magic after you study it. Here are some resources that can explain LDA better than we can, and we encourage you to read them.</p>
<ul>
<li><a href="http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf">The original paper</a></li>
<li><a href="http://videolectures.net/mlss09uk_blei_tm/">A video by David Blei</a></li>
<li><a href="http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/">An intuitive explanation by Edwin Chen</a></li>
<li><a href="http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/131214.pdf">A chapter from BRML</a></li>
</ul>
<p>As with many unsupervised algorithms, LDA is frequently used as a form of feature engineering. That is, it takes in “raw” data and returns something more insightful for input into a different model. This prompted David Blei and John McAuliffe to develop <strong><a href="https://www.cs.princeton.edu/~blei/papers/BleiMcAuliffe2007.pdf">supervised latent Dirichlet allocation</a></strong>, or sLDA. By jointly modeling a response variable associated with each document, we can ensure that the topic distribution of a given document will correlate well with its response. Their example had to do with movie reviews and the star ratings associated with them. LDA might pull out topics relating to genre, but by jointly modeling the star ratings, sLDA will identify topics relating to film quality.</p>
<h3 id="implementation">Implementation</h3>
<p>Earlier, we mentioned that we somehow preprocessed the data for modeling. Here’s the motivation. Suppose the words “recommend” and “Recommending” show up in two different documents. Shouldn’t we treat them as the same object? What if the word “and” shows up? Do we even care about that? What about punctuation?</p>
<p>All of these cases are taken care of. Words are separated from one another by whitespace and punctuation, which is then discarded. We also discard words found on a list of common, unmeaningful English words. Words are then lowercased and weakly stemmed (meaning trailing <em>-ing</em>’s, <em>-s</em>’s, and <em>-e</em>’s are removed). The LDA authors themselves suggest only weakly stemming words, rather than using aggressive stemmers like the Porter algorithm.</p>
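<p>As a rough Python sketch of that preprocessing (the stopword list and the weak stemmer here are illustrative stand-ins, not the exact ones we used):</p>

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "was"}   # tiny stand-in list

def weak_stem(word):
    """Strip only a trailing -ing, -s, or -e, per the weak stemming described above."""
    for suffix in ("ing", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())   # tokenize; punctuation is discarded
    return [weak_stem(t) for t in tokens if t not in STOPWORDS]
```

<p>With this treatment, “recommend” and “Recommending” collapse to the same token, as desired.</p>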
<p>One last step of preprocessing: we’re only going to consider courses whose demand is above 50. Check out the following <a href="http://docs.ggplot2.org/current/geom_histogram.html">Gaussian kernel density estimate</a> of the distribution of course demand. Yale offers a lot of small courses which aren’t geared towards large crowds. We aren’t interested in these.</p>
<p><img src="/images/posts/demandkde.png" alt="Demand KDE" style="width: 640px;" /></p>
<p>To do the modeling, we’re going to work in R. The main script is <strong><a href="https://github.com/YaleDataScience/enroll/blob/master/enroll.R">enroll.R</a></strong>; note that you’ll need to modify the file paths. The <a href="http://cran.r-project.org/web/packages/lda/lda.pdf"><em>lda</em> package</a> implements sLDA quickly and is endorsed by David Blei. It also comes with tools for text processing, namely the <em>lexicalize</em> function.</p>
<p>LDA and sLDA are bag-of-word models, meaning we only care about a word’s membership to a document, not its place within it. We can, however, recover some of the minute structure of the text by expanding our definition of a “word” from a unigram to an n-gram, or a sequence of n words found in the document. We’ll be using unigrams, bigrams, and trigrams (as implied <a href="http://dl.acm.org/citation.cfm?id=146685">here</a>).</p>
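<p>Expanding the vocabulary from unigrams to n-grams can be sketched in a few lines (illustrative code, not the RWeka tokenizer we actually used):</p>

```python
def ngrams(tokens, n_max=3):
    """Expand a token list into all unigrams, bigrams, ..., n_max-grams."""
    out = []
    for n in range(1, n_max + 1):
        out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out
```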
<p>The <em>lexicalize</em> function only supports unigram dictionaries and document-term matrices, so we modified it in <strong><a href="https://github.com/YaleDataScience/enroll/blob/master/nlexicalize.R">nlexicalize.R</a></strong>. It requires RWeka and rjava, which are sometimes difficult to deal with; if those won’t work for you, then just use the standard unigram implementation.</p>
<p>Picking the right parameters for your topic model can seem like more art than science, particularly choosing the number of topics. In fact, that may be the case for any model with an informative prior. With LDA, we only have some approximate measures to do model comparison (e.g. perplexity). However, the supervised version has an obvious objective metric: response prediction performance. First, we partition a training set and a test set. To identify the “best” model, we’ll use a grid search. For every parameter set in the grid, we learn an sLDA model on the training set, predict the response on the test set, and compute the RMSE. Using cross-validation or a tighter grid would give better results, but they’d also take wayyyyy too long. We’re impatient.</p>
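<p>The grid-search loop itself is generic. Here is a Python sketch of the skeleton (our actual run used the R <em>lda</em> package, so the <code>fit</code> callback below is a hypothetical stand-in for training an sLDA model):</p>

```python
import itertools
import math

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def grid_search(train, test, fit, grid):
    """Fit one model per parameter combination, score on the held-out set, keep the best."""
    best_score, best_params = float("inf"), None
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        model = fit(train, **params)          # fit returns a predictor x -> y_hat
        score = rmse([model(x) for x, _ in test], [y for _, y in test])
        if score < best_score:
            best_score, best_params = score, params
    return best_score, best_params
```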
<p>We can visualize the model’s performance under different parameter sets by inspecting the plot produced by the script. Here’s an example:</p>
<p><img src="/images/posts/sldaperf.png" alt="sLDA Performance" style="width: 640px;" /></p>
<p>Ah! That’s confusing. Not really. Each color represents a different number of topics, and each point represents the RMSE for a different trial (i.e. parameter set). It looks like a 12 topic model always performs well, and the best Dirichlet priors are 10 for document/topic smoothing and 1 for topic/term smoothing.</p>
<p>We then train a final model using those values, and here’s what we found.</p>
<h3 id="results">Results<a name="result"></a></h3>
<p>Back to the original question: what can you write in a review to get people to sign up for a course on OCS? More specifically: <strong><em>what kind of language or topics separate a course attracting a decent crowd from one that attracts 400 people</em></strong>?</p>
<p>We present our results by analyzing each topic and assessing their effect on a course’s OCS demand. The latter is straightforward: in the linear model, every topic has a coefficient which represents its effect on the response. Say topic X has a large coefficient. Then a course whose reviews are highly weighted on topic X will be expected to have a large demand. Since sLDA is non-deterministic, these coefficients vary from trial to trial. However, we have found that in a 12 topic model, typically four topics strongly affect enrollment negatively and four topics strongly affect enrollment positively.</p>
<p>If you read up on LDA, you’ll recall that a topic is represented by a probability distribution over every term found in the collection of documents. A topic can be represented simply by the highest weighted terms in the distribution. <a href="http://en.wikipedia.org/wiki/Tag_cloud">Word clouds</a> essentially capture the same information, but in a much more visually appealing way. The R package <em><a href="http://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf">wordcloud</a></em> is a simple way to generate word clouds directly from the results of LDA or sLDA. Due to the appearance of names of professors and courses in most of the word clouds, we will only include one example below. However, <strong>enroll.R</strong> generates all of them for your final model.</p>
<p>Recall that LDA and sLDA are non-deterministic, and thus the topics change from trial to trial. However, their general content is stable. Most notably, the most positive topic is always the one represented by the word cloud below. No surprises here. (Note: the terms have not been unstemmed, so they might be missing an <em>s</em>, an <em>ing</em>, or an <em>e</em> at the end.)</p>
<p><img src="/images/posts/cloud.png" alt="Topic cloud" style="width: 640px;" /></p>
<p>Next, we present a list of terms given high weights in topics that strongly affect course demand. Specifically, they are within the top 25 scoring terms in the four strongly negative or three strongly positive topics (leaving out the one presented above). Once again, they have been censored for names of professors and courses. Terms were also manually unstemmed and stopwords were added where it was obvious.</p>
<table>
<thead>
<tr>
<th style="text-align: left">High scoring terms; positive topics</th>
<th style="text-align: right">High scoring terms; negative topics</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">improve</td>
<td style="text-align: right">required a lot</td>
</tr>
<tr>
<td style="text-align: left">grade</td>
<td style="text-align: right">midterm final</td>
</tr>
<tr>
<td style="text-align: left">proud</td>
<td style="text-align: right">if you’re willing</td>
</tr>
<tr>
<td style="text-align: left">enjoyed lecture</td>
<td style="text-align: right">great however</td>
</tr>
<tr>
<td style="text-align: left">take class probably</td>
<td style="text-align: right">readings are long</td>
</tr>
<tr>
<td style="text-align: left">manageable workload</td>
<td style="text-align: right">course interesting</td>
</tr>
<tr>
<td style="text-align: left">gut</td>
<td style="text-align: right">clear and helpful</td>
</tr>
<tr>
<td style="text-align: left">amazing class</td>
<td style="text-align: right">really interesting subject</td>
</tr>
<tr>
<td style="text-align: left">aquired</td>
<td style="text-align: right">blend</td>
</tr>
<tr>
<td style="text-align: left">low stress</td>
<td style="text-align: right">highly recommend class</td>
</tr>
<tr>
<td style="text-align: left">would highly recommend</td>
<td style="text-align: right">incredibly</td>
</tr>
<tr>
<td style="text-align: left">not much work</td>
<td style="text-align: right">lecture disorganized</td>
</tr>
<tr>
<td style="text-align: left">provide good background</td>
<td style="text-align: right">section</td>
</tr>
<tr>
<td style="text-align: left">acquired</td>
<td style="text-align: right">always good</td>
</tr>
<tr>
<td style="text-align: left">hard to know</td>
<td style="text-align: right">use in the future</td>
</tr>
<tr>
<td style="text-align: left">one paper</td>
<td style="text-align: right">make or break</td>
</tr>
<tr>
<td style="text-align: left">grain of salt</td>
<td style="text-align: right">need qr take</td>
</tr>
<tr>
<td style="text-align: left">knowledge</td>
<td style="text-align: right">sense of accomplishment</td>
</tr>
</tbody>
</table>
<p>Similar terms show up on both sides, which is likely due to the fact that they occurred very frequently throughout the entire corpus of reviews. So take that with a “<em>grain of salt</em>”. Also note that words like “<em>terrible</em>” aren’t showing up. Recall that we only took courses with over 50 people signed up, implying that they’re already popular.</p>
<p>Ok so let’s draw some conclusions. We’d expect to see many of these terms. Intuition tells us that people like courses which are high quality (“<em>amazing class</em>”, “<em>enjoyed lecture</em>”) and not-too-hard (“<em>manageable workload</em>”, “<em>low stress</em>”). Similarly, people don’t like low quality (“<em>lecture disorganized</em>”) or overly difficult (“<em>required a lot</em>”, “<em>readings are long</em>”) courses.</p>
<p>For those who say there are no guts at Yale, you may want to check the data (“<em>gut</em>”, “<em>not much work</em>”). Perhaps Yale guts aren’t as gutty as other schools’ guts. But a gut is a gut, any way you gut it.</p>
<p>There are also some surprises. It seems like even if people are adamant about how rewarding a class is (“<em>clear and helpful</em>”, “<em>really interesting subject</em>”, “<em>use in future</em>”, “<em>sense of accomplishment</em>”), people won’t take it. On the other hand, people seem to take some courses even if the reviewers are hesitant about it (“<em>hard to know</em>”, “<em>grain of salt</em>”).</p>
<p>Want to draw some more conclusions? Try running our <strong><a href="https://github.com/YaleDataScience/enroll">code</a></strong> for yourself!</p>
<h3 id="whats-next">What’s Next?</h3>
<p>The results section here just scraped the surface of what you can find from this model. For example, try adjusting the Dirichlet priors or adding topics. A model with 25 topics and smoothing priors around 0.1 or 0.01 will give topics related to individual courses.</p>
<p>If you want to go further, there’s always more to be done. Here are some ideas.</p>
<ul>
<li>Data scraping: do something similar for the course descriptions provided by professors</li>
<li>Speed: create a better algorithm to match OCS demand with review text</li>
<li>Debugging: what (probably ineffectual) mistake was made in the n-gram procedure and how can it be fixed?</li>
<li>Modeling: construct a model using each individual review as a separate document</li>
<li>Modeling (harder): jointly model individual reviews for a course and the course’s overall demand</li>
<li>Visualization: generate word clouds using unstemmed words</li>
</ul>
<p>See you at the next Yale Data Science meeting!</p>
<p>♥ YDS</p>
Tue, 13 Jan 2015 00:00:00 +0000
http://yaledatascience.github.io/2015/01/13/spring-2015-preview.html
<h1 id="lets-go-first-meeting-on-10314">Let’s Go - First Meeting on 10/3/14</h1>
<p>The results are in, and it’s time to have our first meeting. We’ll be getting together on <strong>Friday, October 3rd</strong> from <strong>3:00-5:00 p.m.</strong> in <strong><a href="http://map.yale.edu/map/#building:HLH17">Hillhouse 17, Room 111</a></strong>. This is also the permanent meeting time for the group - at least for the semester. We’ll go over the group’s purpose, wax poetic about data science, get to know each other a bit, then jump into an activity. Can’t wait to see you all there!</p>
<h2 id="slide-deck">Slide Deck</h2>
<p>Coming at you soon.</p>
<h2 id="resources">Resources</h2>
<p><a href="https://www.kaggle.com/c/titanic-gettingStarted/data">Data Set</a></p>
<h2 id="faq">FAQ</h2>
<p><strong>Will I miss out on anything if I have to leave early?</strong><br />
<em>We’re planning on front-loading the meeting, so probably not.</em></p>
<p><strong>I don’t have any experience with data science. Will I get left in the dust?</strong><br />
<em>Definitely not. We’re a group for all levels of experience, and we’re focused on education. The activities we have planned will suit toddlers and Leo Breiman (may he rest in peace) alike.</em></p>
<p><strong>I spilled a hot beverage on my computer yesterday and won’t have a replacement before the meeting. Should I come?</strong><br />
<em>Yes! The room we’re in has a ton of workstations loaded with R, Python, and everything else you’ve ever dreamed of.</em></p>
<p><strong>Will there be food/hot beverages?</strong><br />
<em>Probably.</em></p>
Tue, 30 Sep 2014 00:00:00 +0000
http://yaledatascience.github.io/2014/09/30/lets-go.html
<h1 id="survey-and-first-meeting">Survey and First Meeting</h1>
<p>In preparation for our inaugural meeting, we ask that you fill out a brief <a href="https://docs.google.com/a/yale.edu/forms/d/1kLlXPVgHLrGOP87r1xqk_0nuT1f_yQphNacMwwzsftI/viewform?usp=send_form">survey</a>. This way, we can get a better sense of how many people to expect as well as their background. Filling out the survey also allows you to vote for the day(s) of the week that work best for you.</p>
<p>After reviewing submissions, we’ll send out information about our first official meeting.</p>
<p>Feel free to contact us with any questions.</p>
Mon, 15 Sep 2014 00:00:00 +0000
http://yaledatascience.github.io/2014/09/15/survey-and-first-meeting.html