Prior to the release of 3.2, using PDF Highlighter with NLP (natural language processing) tools was inefficient and couldn’t guarantee 100% precision. You had to parse the NLP output to extract key terms and phrases, and then feed Highlighter a query file containing potentially hundreds of long phrases (e.g. whole sentences) you wanted marked in the PDF. As a result, highlighting a single document with a few hundred pages could take minutes to complete, which is far from ideal.

From its early days, aside from highlighting for query terms, Highlighter has supported document highlighting based on Adobe’s PDF Highlight File Format. A highlight file contains the character offsets and lengths of the terms to be marked in the PDF. Because output from NLP tools contains text offsets, it seems reasonable to use this offset information to create a highlight file; unfortunately, this wasn’t possible for several reasons, most notably:

  • There are quirks in what the Adobe highlight file format considers a character and how many positions a character takes.
  • Highlight files require a page index and a text position on the page. However, you generally feed the NLP tool text extracted from a multi-page document, and the NLP output doesn’t contain page information.

Highlighter 3.2 goes beyond Adobe’s highlight file format by supporting an extended syntax that allows absolute text positions (offsets in the document instead of in the page) and a highlighting color specified for each term. This is exactly what we needed for the efficient use of Highlighter with NLP tools!

The new PDF highlighting workflow for NLP output looks like this:

  1. Get the text from the PDF document using the /extract service provided by Highlighter. This ensures that the text sent to the NLP tool is the same text that Highlighter sees, so the text positions match.
  2. Pass the text through the NLP tool.
  3. Transform the NLP output to the highlights XML file. You will need to write some custom code for this. It’s simple enough, and we give an example for CoreNLP below.
  4. Send the PDF and XML files to Highlighter and, depending on the request settings, receive the viewing URL or the new PDF with highlighted text.

Now, going from NLP output to a highlighted PDF takes a fraction of a second, not minutes!

In the exercise below, we will show you how to set up Highlighter and highlight PDF documents based on CoreNLP output files.

Setting Up PDF Highlighter

First, we’ll need to install PDF Highlighter.

Get the PDF Highlighter installer from its download page and install it. Highlighter needs a license key to run, so make sure to request one while on the download page.

After installation, copy the license file (highlighter.lic) to the <highlighter>/conf/ directory.

Next, in <highlighter>/conf/, create an application.conf file with the following content:

highlighter {
  service {
    allowFileUris = true
  }
  viewer {
    hitNavigationStrategy = "page-to-page"
  }
}

With this config file, we:

  • Allowed Highlighter to accept PDF and highlight files referenced by file path. By default, Highlighter expects HTTP references only, so this option simplifies our testing.
  • Changed the hit navigation strategy in the Highlighting PDF Viewer to page-to-page. Because the number of highlighted terms for NLP is generally high, hit-to-hit navigation in the viewer would be cumbersome.

After you have installed the license key and configuration file, restart Highlighter (“Highlighter Service” in Windows Services or the “highlighter-service” service on Linux).
To ensure everything is running properly, open http://localhost:8998/status on your local Highlighter installation. You should see “status: OK” in the results.
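
The same check can be scripted, for example before starting a batch job. Below is a minimal Python sketch. It assumes the status page body contains the text “status: OK” as seen in the browser (the exact response format may differ), and the function names are made up for illustration.

```python
from urllib.request import urlopen


def looks_ok(body: str) -> bool:
    """Return True if the status page text reports OK (assumed wording)."""
    # Lowercase and drop spaces, double quotes, and line breaks so that both
    # 'status: OK' and '"status": "OK"' are recognized.
    compact = "".join(ch for ch in body.lower() if ch not in ' "\n\r\t')
    return "status:ok" in compact


def highlighter_is_up(base_url: str = "http://localhost:8998", timeout: float = 5.0) -> bool:
    """Fetch <base_url>/status and check the body; any network error counts as down."""
    try:
        with urlopen(f"{base_url}/status", timeout=timeout) as response:
            return looks_ok(response.read().decode("utf-8", errors="replace"))
    except OSError:  # URLError, HTTPError, and socket timeouts all derive from OSError
        return False
```

Calling highlighter_is_up() with no arguments checks the default local installation used in this exercise.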

PDF Highlighter exposes an /extract method that returns the text of the referenced document. We will use this service to extract the PDF text to be analyzed with CoreNLP. For the complete list of available services, open http://localhost:8998/apidocs/ to see the live web service API documentation.

Instead of writing your own web service client, you can download the Highlighter Java Client and run:

java -cp highlighter-client-0.1.0-jar-with-dependencies.jar com.jobjects.highlighter.ExtractText <highlighter-server-url> <input-pdf> <output-txt>

The ExtractText tool will send the PDF path to Highlighter, get the response, concatenate text from all pages, and save it to the output file.
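
To see why concatenation matters for offsets, here is a small Python illustration. It is not the ExtractText implementation: it assumes the page texts are already available as a list of strings and that pages are joined with no separator, which may not match what the tool actually does. The function names are hypothetical.

```python
from typing import List, Tuple


def join_pages(pages: List[str]) -> Tuple[str, List[int]]:
    """Concatenate page texts; also return the document offset where each page starts."""
    starts = []
    total = 0
    for page_text in pages:
        starts.append(total)       # this page begins after all previous pages
        total += len(page_text)
    return "".join(pages), starts


def to_document_offset(page_starts: List[int], page_index: int, offset_in_page: int) -> int:
    """Convert a (page, offset-in-page) pair to an absolute document offset."""
    return page_starts[page_index] + offset_in_page
```

Once the text is a single string, an offset reported by an NLP tool is already an absolute document offset, so no page index is needed to locate a term.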

Now, let’s run it for our sample PDF:

java -cp highlighter-client-0.1.0-jar-with-dependencies.jar com.jobjects.highlighter.ExtractText http://localhost:8998 /absolute/path/to/GlobalEconomicProspects.pdf GlobalEconomicProspects.txt

We use an absolute path to the PDF document (it could be a URL as well) because the Highlighter server must be able to open it. Depending on the document’s size, text extraction may take several seconds to complete. (Note that Highlighter can be set up to index document folders in the background. In that case, getting the extracted text for an already indexed document is an instant operation.)

With document text in hand, we can proceed to text analysis…

Running CoreNLP

If you don’t already have it, download stanford-corenlp-full-2017-06-09.zip and extract it to an empty folder, then open a command line and cd to that folder.

Run analysis with:

java -cp "*" -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -outputDirectory . -file /path/to/GlobalEconomicProspects.txt

Depending on the document’s size and enabled annotators, this can take minutes to complete.

We told CoreNLP to use the current folder as the output directory, so it will create GlobalEconomicProspects.txt.xml there.
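
Before going further, it can help to check what the analysis found. The Python sketch below counts tokens per named-entity label. It assumes the usual CoreNLP XML layout, in which each token element has a NER child and tokens outside any entity are labeled “O”; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_entity_labels(corenlp_xml: str) -> Counter:
    """Count tokens per NER label, skipping the 'O' (no entity) label."""
    root = ET.fromstring(corenlp_xml)
    # Tokens without a NER child are treated as 'O'.
    labels = (token.findtext("NER", default="O") for token in root.iter("token"))
    return Counter(label for label in labels if label != "O")
```

To run it on the file from this exercise, read GlobalEconomicProspects.txt.xml into a string first; for very large outputs, ET.parse on the file path avoids holding the raw text in memory as well.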

Transforming CoreNLP Output to Highlight File

In general, you will need to write a custom program or script that reads the data you need from the CoreNLP output XML and creates a highlights XML file that PDF Highlighter can digest.

In this exercise, we created a Java tool that can create a highlight file for named entities and sentiment data. You can download this tool (for simplicity, choose jar with dependencies included) from its project page.

To see all available options, run:

java -cp highlight-example-corenlp-1.0-jar-with-dependencies.jar CoreNlpToHighlightsXml 
Missing required option: c
Missing required option: h
usage: CoreNlpToHighlightsXml
 -c,--corenlp-file <arg>      CoreNLP XML file
 -date <arg>                  Date term color
 -h,--highlights-file <arg>   Highlights XML output file
 -location <arg>              Location term color
 -negative <arg>              Negative sentiment term color
 -number <arg>                Number term color
 -org <arg>                   Organization term color
 -person <arg>                Person term color
 -positive <arg>              Positive sentiment term color

For each entity or sentiment type we want to highlight, we need to specify the desired color as a hexadecimal RGB value. For example, to highlight personal names, organizations, and locations, use:

java -cp highlight-example-corenlp-1.0-jar-with-dependencies.jar CoreNlpToHighlightsXml -person FF00FF -org FF0000 -location 0000FF -c GlobalEconomicProspects.txt.xml -h GlobalEconomicProspects.highlights-ner.xml

Or, if we want to highlight sentiment-related keywords:

java -cp highlight-example-corenlp-1.0-jar-with-dependencies.jar CoreNlpToHighlightsXml -negative FF0000 -positive 00FF00 -c GlobalEconomicProspects.txt.xml -h GlobalEconomicProspects.highlights-sentiment.xml

The resulting highlights XML looks like this:

<xml>
<body units="characters" positions="internal">
<loc pos="2" len="5" color="FF0000" word="World" nlp-pos="NNP" nlp-ner="ORGANIZATION" nlp-sentiment="Neutral" />
<loc pos="8" len="4" color="FF0000" word="Bank" nlp-pos="NNP" nlp-ner="ORGANIZATION" nlp-sentiment="Neutral" />
<loc pos="13" len="5" color="FF0000" word="Group" nlp-pos="NNP" nlp-ner="ORGANIZATION" nlp-sentiment="Neutral" />
<loc pos="100" len="9" color="FF0000" word="Uncertain" nlp-pos="NNP" nlp-ner="ORGANIZATION" nlp-sentiment="Neutral" />
<loc pos="110" len="5" color="FF0000" word="Times" nlp-pos="NNP" nlp-ner="ORGANIZATION" nlp-sentiment="Neutral" />
…
<loc pos="1051219" len="9" color="FF0000" word="Prospects" nlp-pos="NNS" nlp-ner="ORGANIZATION" nlp-sentiment="Neutral" />
</body>
</xml>

Note that the “word”, “nlp-pos”, “nlp-ner”, and “nlp-sentiment” attributes of the highlight locations are not actually used by PDF Highlighter. CoreNlpToHighlightsXml adds them only for reference and result verification; they can be safely removed to make the XML file smaller.
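
If you would rather write the transformation yourself, the core of it is small. The following Python sketch is not the source of CoreNlpToHighlightsXml. It handles named entities only, and it assumes that “pos” is the token’s CharacterOffsetBegin from the CoreNLP XML and that “len” is CharacterOffsetEnd minus CharacterOffsetBegin, which matches the sample above but should be verified against your own output. The function name and the colors mapping are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def corenlp_to_highlights(corenlp_xml: str, colors: Dict[str, str]) -> str:
    """Build highlights XML with one <loc> per token whose NER label has a color."""
    body = ET.Element("body", {"units": "characters", "positions": "internal"})
    for token in ET.fromstring(corenlp_xml).iter("token"):
        color = colors.get(token.findtext("NER", default="O"))
        if color is None:
            continue  # no color requested for this label, so no highlight
        begin = int(token.findtext("CharacterOffsetBegin"))
        end = int(token.findtext("CharacterOffsetEnd"))
        # Assumption: CoreNLP character offsets map directly to pos/len.
        ET.SubElement(body, "loc", {"pos": str(begin), "len": str(end - begin), "color": color})
    xml_root = ET.Element("xml")
    xml_root.append(body)
    return ET.tostring(xml_root, encoding="unicode")
```

For example, passing {"PERSON": "FF00FF", "ORGANIZATION": "FF0000", "LOCATION": "0000FF"} as colors mirrors the named-entity command shown earlier; write the returned string to a file to use it as the highlights XML.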

At this point, we are ready to highlight our PDF…

Highlighting PDF

The easiest way to do ad-hoc highlighting is to use the live API documentation provided with Highlighter. Open http://localhost:8998/apidocs/ in your browser, click on the “/highlight-for-xml” section, then click the “Try it out” button. You should see something like this:

In the “uri” field, enter the absolute path to the PDF file, and in the “xml” field, enter the absolute path to the highlights XML file generated in the previous step. Click “Execute”.

If successful, you should see a JSON response containing, among other elements, a field named documentUrl. This is the URL to the result document in the PDF Highlighting Viewer.
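
When the call is made from a script instead of the browser, the viewer URL can be pulled out of the response text. This is a minimal Python sketch that assumes documentUrl is a top-level field of the JSON object; the function name is made up for illustration.

```python
import json


def viewer_url(response_text: str) -> str:
    """Return the documentUrl field from the JSON response, or raise if it is missing."""
    data = json.loads(response_text)
    url = data.get("documentUrl")  # assumed to be a top-level field
    if not url:
        raise ValueError("response has no documentUrl field")
    return url
```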

Copy the URL and open it in a web browser. You should see the PDF with marked entities:

That’s it!

If you prefer to get a copy of the PDF document with the highlights added to it (so it can be opened in any PDF viewer), change the desired response content type to application/pdf. (Unfortunately, the live API documentation is unable to receive and save the generated document, so you will need to use a different REST client or curl, or write a client application.)
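
A small client for this could look like the Python sketch below. Treat it as a starting point under stated assumptions, not as the documented API: it assumes /highlight-for-xml accepts a GET request with “uri” and “xml” query parameters (the field names shown in the live documentation form) and that the response type is selected with an Accept header. Check http://localhost:8998/apidocs/ for the actual method and parameters. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_pdf_request(base_url: str, pdf_path: str, highlights_path: str) -> Request:
    """Prepare a request asking for the highlighted document as application/pdf."""
    # Assumption: both file references are passed as query parameters.
    query = urlencode({"uri": pdf_path, "xml": highlights_path})
    return Request(f"{base_url}/highlight-for-xml?{query}",
                   headers={"Accept": "application/pdf"})


def save_highlighted_pdf(request: Request, output_path: str) -> None:
    """Send the request and write the response body to output_path."""
    with urlopen(request) as response, open(output_path, "wb") as out:
        out.write(response.read())
```

Building the request separately from sending it makes it easy to print and inspect the URL before anything goes to the server.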

We hope you find this useful. If you have any questions or need assistance with PDF Highlighter integration, feel free to get in touch with us.


