


The files created through this process (, , and ) are tab-separated, UTF-8-encoded text files. The data are tokenized by character, so a word might span more than one token. Tags are in what Wikipedia calls "IOB2" format.

The ID encodes the year, month, day, page number, box number, and token number. The token number counts left to right (or top to bottom) within the box. Because the layout is drawn from a printed newspaper, the previous token is not always the immediately preceding token in the file.

Four fields give the coordinates of the box tightly surrounding the line containing the token on the page, assuming a DPI of 216. (This means several tokens may have the same location. The token number indicates the token offset within the line.)

Note that although we refer to these files as CoNLL files (after the format used in the CoNLL 20 shared tasks), the term CoNLL file has been overloaded in the literature and is now used to refer to a family of related formats. In particular, the original CoNLL files do not conform to the specification given here.
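Because the data are tokenized by character and tagged in IOB2, recovering whole entity strings means joining consecutive characters that share one span. The following is a minimal sketch of that grouping, assuming the tokens and tags have already been read from the file into two parallel lists; the function name `iob2_spans` and the `LOC` label in the usage note are illustrative and are not part of the distribution.

```python
def iob2_spans(tokens, tags):
    """Group character tokens into (label, text) spans using IOB2 tags."""
    spans = []
    label, chars = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if label is not None:
                spans.append((label, "".join(chars)))
            label, chars = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # An I- tag continues the open span of the same label.
            chars.append(token)
        else:
            # "O", or an I- tag with no matching open span: close and skip.
            if label is not None:
                spans.append((label, "".join(chars)))
            label, chars = None, []
    if label is not None:
        spans.append((label, "".join(chars)))
    return spans
```

For example, `iob2_spans(list("我在北京"), ["O", "O", "B-LOC", "I-LOC"])` groups the last two characters into a single `LOC` span. Stray `I-` tags that do not continue an open span are dropped here, which is one possible choice rather than a requirement of the format.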
Renmin trainslation pdf
We leave it to the user of the collection to convert the PDF files to images in a form that meets their needs; this ensures that the image file format, DPI, etc., work with the user's OCR system. The CoNLL files created by create-renmin-collection.py should have the following MD5 checksums: File The scripts use relative paths to find the location of the encoded files and the other scripts, so they should not be moved from the directory in which they were created. This script takes a single argument, the name of the directory into which the collection is to be placed.
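To compare a generated CoNLL file against its published MD5 checksum, the digest can be computed with Python's standard `hashlib` module. This is a small sketch; the helper name `md5_of_file` and the chunk size are illustrative choices, and the expected checksum values themselves must come from the distribution.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can then be compared directly with the checksum listed for the corresponding file.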
Renmin trainslation download
The create-renmin-collection.py program will download the Renmin PDF pages used by the collection, and use them to decode the tokens listed in the encrypted train, dev, and test files. We assume a DPI of 216; if you use a different DPI, you will need to convert the offsets in the output files. To generate images of the newspaper pages, you will need the ability to convert from PDF to the image format of your choice. You will need Python 3.6 or later to run the scripts in this distribution. You will need HTTP connectivity to this site. The Renmin source data are available from
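If page images are rendered at a DPI other than 216, the box coordinates scale linearly with the DPI ratio. The sketch below shows that conversion under the assumption that the four coordinate fields have already been parsed into numbers; the function name `rescale_box` is hypothetical, and rounding to whole pixels is a choice made here, not something the distribution prescribes.

```python
def rescale_box(box, target_dpi, source_dpi=216):
    """Scale four box coordinates from `source_dpi` to `target_dpi`."""
    # Multiply before dividing so exact ratios stay exact.
    return tuple(round(value * target_dpi / source_dpi) for value in box)
```

For instance, `rescale_box((216, 432, 648, 864), 72)` maps coordinates given at 216 DPI onto an image rendered at 72 DPI. The scaling is the same whether the four fields are corner coordinates or an origin plus width and height.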
Renmin trainslation code
This program, together with the two other scripts it calls (renmin-downloader.py and renmin-reconstructor.py), is in the code directory.
