BLLIP Parser

Several parsing models are available for BLLIP Parser. This document helps you determine which one is likely to perform best for your task. Each parsing model discussed here is a unified parsing model: a Charniak parser model paired with a Johnson reranker model, designed to work together.

The best parsing model depends on the kind of text you’d like to parse. Here are the current recommendations:

News text: WSJ+Gigaword-v2
Web text: SANCL2012-Uniform
Biomedical (PubMed) text: GENIA+PubMed
WSJ section 23 evaluations to replicate papers: For purely supervised parser or parser/reranker results, use either WSJ-PTB3 (for Penn Treebank WSJ) or OntoNotes-WSJ (for the OntoNotes version of WSJ). Use WSJ+Gigaword to replicate self-training results, though WSJ+Gigaword-v2 performs slightly better.
Everything else: In general, choose between WSJ+Gigaword-v2 and SANCL2012-Uniform based on how well-formed your text is: WSJ+Gigaword-v2 for well-formed text, SANCL2012-Uniform for more informal text such as web pages or email.
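
The recommendations above can be captured as a small lookup, which may be handy if you pick a model programmatically. This is only a sketch: the domain labels ("news", "web", "biomedical"), the `recommend_model` function, and its `well_formed` flag are hypothetical names introduced for illustration and are not part of BLLIP Parser. Only the model names come from the list above.

```python
# Model names are taken from the recommendations above.
# The dictionary keys are hypothetical domain labels chosen for this sketch.
RECOMMENDED_MODELS = {
    "news": "WSJ+Gigaword-v2",
    "web": "SANCL2012-Uniform",
    "biomedical": "GENIA+PubMed",
}


def recommend_model(domain, well_formed=True):
    """Return the recommended unified parsing model name for a text domain.

    Unknown domains fall back to the "everything else" advice:
    WSJ+Gigaword-v2 for well-formed text, SANCL2012-Uniform otherwise.
    """
    if domain in RECOMMENDED_MODELS:
        return RECOMMENDED_MODELS[domain]
    return "WSJ+Gigaword-v2" if well_formed else "SANCL2012-Uniform"


if __name__ == "__main__":
    for label in ("news", "web", "biomedical", "legal"):
        print(label, "->", recommend_model(label))
    print("email ->", recommend_model("email", well_formed=False))
```

The replication models (WSJ-PTB3, OntoNotes-WSJ, WSJ+Gigaword) are left out of the lookup on purpose, since they are meant for reproducing published results rather than for general use. The returned string is only a model name; how you download or load that model depends on the interface you use to run the parser.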