alex2304 commented Dec 4, 2017

Hello, we would like to contribute several new features to MeTa: implementations of three methods for estimating the constant mu used by the current implementation of Dirichlet prior smoothing.

The ranker based on Dirichlet prior smoothing implemented in MeTa uses a smoothing parameter mu. Currently, the only options are to pass your own value for the parameter or to use the default, mu = 2000. However, it is possible to find the optimal value of the parameter for a particular set of documents (see H. Wallach, 2008, p. 18), which provides the most effective smoothing. In our contribution, we implemented three methods for estimating this optimal value of mu from the given parameters of the document set.
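To make the role of mu concrete, here is a minimal Python sketch of the standard Dirichlet prior smoothing formula, p(w | d) = (c(w, d) + mu * p(w | C)) / (|d| + mu). It is not MeTa's code: the function name and argument names are hypothetical and chosen only for illustration.

```python
def dirichlet_smoothed_prob(term_count, doc_length, collection_prob, mu=2000.0):
    """Smoothed probability of a term in a document.

    term_count      -- c(w, d), occurrences of the term in the document
    doc_length      -- |d|, total number of terms in the document
    collection_prob -- p(w | C), probability of the term in the whole collection
    mu              -- smoothing constant (2000 is the default mentioned above)
    """
    return (term_count + mu * collection_prob) / (doc_length + mu)


# A term that is absent from a 100-term document still gets non-zero
# probability, pulled toward its collection-level frequency.
unseen = dirichlet_smoothed_prob(0, 100, 0.01)
seen = dirichlet_smoothed_prob(3, 100, 0.01)
```

A larger mu moves every document's estimate closer to the collection model, while mu = 0 reduces to the unsmoothed maximum-likelihood estimate, which is why the choice of mu matters for a given collection.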

The implemented methods were originally described in (H. Wallach, 2008, pp. 26-30). They are based on several modifications of the Fixed-Point Iteration method that improve its performance.

Given the project architecture, we implemented each new method as a separate ranker (see the picture of the class hierarchy). We also added the ability to select the new rankers by specifying the following in the .toml config file:

[ranker]
method = "dirichlet-digamma-rec"

Full list of available methods:

  • dirichlet-digamma-rec - Fixed-Point Iteration by (Minka, 2003) using the digamma recurrence relation
  • digamma-log-approx - Fixed-Point Iteration by (Minka, 2003) using a logarithmic approximation of digamma differences
  • digamma-mackay-peto - Fixed-Point Iteration by (MacKay and Peto, 1995) with efficient computation of an inner parameter
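As a rough illustration of the idea behind the first variant, the sketch below runs Minka's fixed-point update for the parameters of a Dirichlet-multinomial model and evaluates each digamma difference through the recurrence Psi(x + 1) = Psi(x) + 1/x. It is a simplified Python sketch, not the code from this pull request: the names `digamma_diff` and `estimate_alpha`, the dense list-of-lists input, and the convergence settings are all assumptions, and mu is taken to be the sum of the estimated parameters.

```python
def digamma_diff(n, a):
    """Psi(n + a) - Psi(a) for an integer n >= 0, via Psi(x + 1) = Psi(x) + 1/x."""
    return sum(1.0 / (a + k) for k in range(n))


def estimate_alpha(counts, alpha_init=1.0, max_iter=1000, tol=1e-6):
    """Fixed-point estimate of Dirichlet parameters from term counts.

    counts -- list of documents, each a list of per-term counts
              (hypothetical dense layout; every document must be non-empty)
    Returns the list of estimated alpha_k; mu is their sum.
    """
    vocab_size = len(counts[0])
    alpha = [alpha_init] * vocab_size
    doc_lengths = [sum(doc) for doc in counts]

    for _ in range(max_iter):
        mu = sum(alpha)
        # Denominator is shared by all terms: sum over documents of
        # Psi(|d| + mu) - Psi(mu).
        denom = sum(digamma_diff(n, mu) for n in doc_lengths)
        new_alpha = []
        for k in range(vocab_size):
            # Numerator: sum over documents of Psi(c(k, d) + alpha_k) - Psi(alpha_k).
            numer = sum(digamma_diff(doc[k], alpha[k]) for doc in counts)
            new_alpha.append(alpha[k] * numer / denom)
        change = sum(abs(a - b) for a, b in zip(alpha, new_alpha))
        alpha = new_alpha
        if change < tol:
            break
    return alpha


# Toy collection: two terms, four documents of ten tokens each.
toy_counts = [[9, 1], [1, 9], [8, 2], [2, 8]]
toy_alpha = estimate_alpha(toy_counts)
toy_mu = sum(toy_alpha)
```

Because the counts are integers, the recurrence turns every digamma difference into a short finite sum, so no special-function library is needed. The log-approximation and MacKay-Peto variants listed above replace or reorganize these sums and are not shown here.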

We also verified that the methods work as expected, i.e. that the estimated parameter mu is indeed optimal. To do this, we generated synthetic data from a Dirichlet distribution with predefined parameters and then compared the estimates with the predefined values, as was done in H. Wallach, 2008. Following that work, we used three metrics to evaluate the methods' performance:

  • Execution time
  • Kullback-Leibler Divergence between "true" and computed distributions
  • Relative error of mu
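The last two metrics can be sketched in a few lines of Python. This is only an illustration of the standard definitions; the helper names are hypothetical, and exactly which distributions were compared in the experiments is not specified here.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for two discrete distributions.

    Assumes p and q have the same length and q[i] > 0 wherever p[i] > 0.
    Terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def relative_error(true_value, estimate):
    """Relative error |estimate - true| / |true| of an estimated parameter."""
    return abs(estimate - true_value) / abs(true_value)


# Example: a "true" distribution against a slightly different estimate,
# and a true mu of 2000 against an estimate of 1900.
divergence = kl_divergence([0.5, 0.5], [0.6, 0.4])
mu_error = relative_error(2000.0, 1900.0)
```

A divergence of zero means the two distributions are identical, and the relative error expresses the deviation of the estimated mu as a fraction of the true value.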

The parameters of the synthetic data we used and the results of the method comparison are presented here.
