Building JOCo corpus

Jena Organization Corpus (JOCo) [Händschke et al., 2018]

Introduction

JOCo is a corpus of annual reports and corporate social responsibility reports of American, British, and German enterprises. It is intended for use in a shared community effort to improve natural language processing techniques for the economic language domain. JOCo is provided by the Chair of Organization, Leadership, and Human Resource Management (Prof. Dr. Peter Walgenbach) and the Chair of Computational Linguistics (Prof. Dr. Udo Hahn) at Friedrich-Schiller-Universität Jena, Germany.

Terms of Use

All users of JOCo must apply for a license in order to receive a copy; that is, you should not have access to this data without first filling out an application.

Prepare an environment

# %config InlineBackend.figure_format='retina'
from ekorpkit import eKonf

eKonf.setLogger("INFO")
print("version:", eKonf.__version__)
print("is notebook?", eKonf.is_notebook())
print("is colab?", eKonf.is_colab())
print("environment variables:")
eKonf.print(eKonf.env().dict())
version: 0.1.39+2.gd02be16
is notebook? True
is colab? False
environment variables:
{'CUDA_DEVICE_ORDER': None,
 'CUDA_VISIBLE_DEVICES': None,
 'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
 'EKORPKIT_DATA_DIR': None,
 'EKORPKIT_LOG_LEVEL': 'WARNING',
 'EKORPKIT_PROJECT': 'ekorpkit-book',
 'EKORPKIT_WORKSPACE_ROOT': '/workspace',
 'KMP_DUPLICATE_LIB_OK': 'TRUE',
 'NUM_WORKERS': 230}
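
The values printed above appear to come from the process environment. As a minimal sketch that uses only the standard library and does not call ekorpkit, the same variables can be inspected before the package is imported. The helper name `read_ekorpkit_env` is hypothetical, and the list of variable names is simply copied from the output above; which of them ekorpkit actually consults is an assumption.

```python
import os

# Variable names copied from the printed output above. Whether ekorpkit
# reads each of them is an assumption based on that output.
EKORPKIT_VARS = [
    "EKORPKIT_CONFIG_DIR",
    "EKORPKIT_DATA_DIR",
    "EKORPKIT_LOG_LEVEL",
    "EKORPKIT_PROJECT",
    "EKORPKIT_WORKSPACE_ROOT",
    "NUM_WORKERS",
]


def read_ekorpkit_env(environ=None):
    """Return a dict of the listed variables, with None for unset ones.

    Hypothetical helper for illustration; not part of ekorpkit.
    """
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in EKORPKIT_VARS}


if __name__ == "__main__":
    for name, value in read_ekorpkit_env().items():
        print(f"{name} = {value!r}")
```

Unset variables show up as `None`, which matches how `EKORPKIT_DATA_DIR` is reported in the output above.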

Build a corpus

ekorpkit \
    dir.workspace=/workspace \
    verbose=true \
    print_config=false \
    num_workers=200 \
    cmd=fetch_builtin_corpus \
    +corpus/builtin=joco \
    corpus.builtin.io.force.summarize=true \
    corpus.builtin.io.force.preprocess=true \
    corpus.builtin.io.force.build=true \
    corpus.builtin.io.force.download=false
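
When the build has to be launched from a script instead of a shell, the same command can be assembled as an argument list. The following is a sketch under stated assumptions: the override names and values are copied verbatim from the command above, while the `OVERRIDES` mapping and the `build_command` helper are hypothetical names introduced here for illustration. It only constructs and prints the command; it does not run it.

```python
import shlex

# Overrides copied from the command above, in the same order.
OVERRIDES = {
    "dir.workspace": "/workspace",
    "verbose": "true",
    "print_config": "false",
    "num_workers": "200",
    "cmd": "fetch_builtin_corpus",
    "+corpus/builtin": "joco",
    "corpus.builtin.io.force.summarize": "true",
    "corpus.builtin.io.force.preprocess": "true",
    "corpus.builtin.io.force.build": "true",
    "corpus.builtin.io.force.download": "false",
}


def build_command(overrides, program="ekorpkit"):
    """Return the argument list: the program name, then key=value pairs.

    Hypothetical helper for illustration; not part of ekorpkit.
    """
    return [program] + [f"{key}={value}" for key, value in overrides.items()]


if __name__ == "__main__":
    # Print a shell-safe, single-line form of the command for inspection.
    print(shlex.join(build_command(OVERRIDES)))
```

Passing the resulting list to `subprocess.run` would execute the build, assuming the `ekorpkit` command-line entry point is installed and on the `PATH`.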

Load a corpus

corpus_cfg = eKonf.compose("corpus")
corpus_cfg.name = "joco"
corpus_cfg.verbose = False
joco = eKonf.instantiate(corpus_cfg)
print(joco)
Corpus : joco
data_dir = "../data/esg_en"