BLLIP Parser
Several parsing models are available for BLLIP Parser. This document is designed to help you determine which one will perform best for your task. Each parsing model discussed here pairs a Charniak parser model with a Johnson reranker model that are designed to work together (such a pair is called a unified parsing model).
Depending on the text that you’d like to parse, there are different optimal parsing models. Here are the current recommendations:
- News text: WSJ+Gigaword-v2
- Web text: SANCL2012-Uniform
- Biomedical (PubMed) text: GENIA+PubMed
- WSJ section 23 evaluations to replicate papers: For purely supervised parser or parser/reranker results, use either WSJ-PTB3 (for Penn Treebank WSJ) or OntoNotes-WSJ (for the OntoNotes version of WSJ). Use WSJ+Gigaword to replicate self-training results, though WSJ+Gigaword-v2 performs slightly better.
- Everything else: In general, it’s probably best to use SANCL2012-Uniform or WSJ+Gigaword-v2 depending on how well-formed your text is (SANCL2012-Uniform for more informal web/email text).
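
The recommendations above can be written as a small lookup, which may be handy when a script has to choose a model automatically. This is only a sketch: the model names come from the list above, but the category keys (`"news"`, `"web"`, and so on), the `recommend_model` function, and its `well_formed` flag are hypothetical names introduced for illustration and are not part of BLLIP Parser.

```python
# Model names are taken from the recommendations above.
# The dictionary keys are hypothetical labels chosen for this sketch.
RECOMMENDED_MODELS = {
    "news": "WSJ+Gigaword-v2",
    "web": "SANCL2012-Uniform",
    "biomedical": "GENIA+PubMed",
    # WSJ section 23 evaluations (replicating published results)
    "wsj-ptb-eval": "WSJ-PTB3",
    "wsj-ontonotes-eval": "OntoNotes-WSJ",
    "wsj-self-training-eval": "WSJ+Gigaword",
}


def recommend_model(text_type, well_formed=True):
    """Return the recommended model name for a kind of text.

    Unknown text types fall back to the "everything else" rule:
    WSJ+Gigaword-v2 for well-formed text, SANCL2012-Uniform for
    more informal web/email-style text.
    """
    key = text_type.strip().lower()
    if key in RECOMMENDED_MODELS:
        return RECOMMENDED_MODELS[key]
    return "WSJ+Gigaword-v2" if well_formed else "SANCL2012-Uniform"


print(recommend_model("biomedical"))
print(recommend_model("forum posts", well_formed=False))
```

The returned string is just a model name. How it is used to obtain and load the model depends on how you run BLLIP Parser, so that step is left out of the sketch.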
