Predicting Sentiments#

from ekorpkit import eKonf

eKonf.setLogger("WARNING")
print("version:", eKonf.__version__)
print("is notebook?", eKonf.is_notebook())
print("is colab?", eKonf.is_colab())
print("environment variables:")
eKonf.print(eKonf.env().dict())
INFO:ekorpkit.base:IPython version: (6, 9, 0), client: jupyter_client
INFO:ekorpkit.base:Google Colab not detected.
version: 0.1.35+2.g81d5295
is notebook? True
is colab? False
environment variables:
{'CUDA_DEVICE_ORDER': None,
 'CUDA_VISIBLE_DEVICES': None,
 'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
 'EKORPKIT_DATA_DIR': None,
 'EKORPKIT_LOG_LEVEL': 'WARNING',
 'EKORPKIT_PROJECT': 'ekorpkit-book',
 'EKORPKIT_WORKSPACE_ROOT': '/workspace',
 'KMP_DUPLICATE_LIB_OK': 'TRUE',
 'NUM_WORKERS': 230}

Prepare the EDGAR sample dataframe#

df_cfg = eKonf.compose('pipeline=blank')
df_cfg.name = 'edgar_sample'
df_cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/edgar.zip'
df_cfg.data_dir = df_cfg.path.cached_path
df_cfg.data_dir += "/edgar"
df_cfg.data_file = 'edgar.parquet'
df_cfg.data_columns = ['id', 'filename', 'item', 'cik', 'company', 'text']
df = eKonf.instantiate(df_cfg)
df.head()
INFO:ekorpkit.base:IPython version: (6, 9, 0), client: jupyter_client
WARNING:ekorpkit.pipelines.pipe:No pipeline specified
| | id | filename | item | text | cik | company | filing_type | filing_date | period_of_report | sic | state_of_inc | state_location | fiscal_year_end | filing_html_index | htm_filing_link | complete_text_filing_link |
| 1410 | 1534 | 1999/320193_10K_1999_0000912057-99-010244.json | item_1 | ITEM 1. \nBUSINESS GENERAL Apple Computer, Inc... | 320193 | APPLE COMPUTER INC | 10-K | 1999-12-22 | 1999-09-25 | 3571 | CA | CA | 0930 | https://www.sec.gov/Archives/edgar/data/320193... | None | https://www.sec.gov/Archives/edgar/data/320193... |
| 1560 | 1697 | 1999/21344_10K_1999_0000021344-00-000009.json | item_1 | ITEM 1. \nBUSINESS The Coca-Cola Company (toge... | 21344 | COCA COLA CO | 10-K | 2000-03-09 | 1999-12-31 | 2080 | DE | GA | 1231 | https://www.sec.gov/Archives/edgar/data/21344/... | None | https://www.sec.gov/Archives/edgar/data/21344/... |
| 2746 | 2977 | 1999/70858_10K_1999_0000950168-00-000621.json | item_1 | Item 1. \nBUSINESS General Bank of America Cor... | 70858 | BANK OF AMERICA CORP /DE/ | 10-K | 2000-03-20 | 1999-12-31 | 6021 | DE | NC | 1231 | https://www.sec.gov/Archives/edgar/data/70858/... | None | https://www.sec.gov/Archives/edgar/data/70858/... |
| 3762 | 4088 | 1999/80424_10K_1999_0000080424-99-000027.json | item_1 | Item 1. \nBusiness. \n--------- General Develo... | 80424 | PROCTER & GAMBLE CO | 10-K | 1999-09-15 | 1999-06-30 | 2840 | OH | OH | 0630 | https://www.sec.gov/Archives/edgar/data/80424/... | None | https://www.sec.gov/Archives/edgar/data/80424/... |
| 4806 | 5211 | 1999/1018724_10K_1999_0000891020-00-000622.json | item_1 | ITEM 1. \nBUSINESS This Annual Report on Form ... | 1018724 | AMAZON COM INC | 10-K | 2000-03-29 | 1999-12-31 | 5961 | DE | WA | 1231 | https://www.sec.gov/Archives/edgar/data/101872... | None | https://www.sec.gov/Archives/edgar/data/101872... |

Prepare financial_phrasebank dataset#

ds_cfg = eKonf.compose('dataset')
ds_cfg.name = 'financial_phrasebank'
ds_cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/financial_phrasebank.zip'
ds_cfg.data_dir = ds_cfg.path.cached_path
ds = eKonf.instantiate(ds_cfg)
print(ds)
Dataset : financial_phrasebank

Compose a config for the LM sentiment analyser class#

model_cfg = eKonf.compose('model/sentiment=lm')

Instantiating a sentiment analyser class and predicting sentiments of the EDGAR dataset#

cfg = eKonf.compose(config_group='pipeline')
cfg.verbose = True
cfg.name = 'edgar_sentiments'
cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/edgar.zip'
cfg.data_dir = cfg.path.cached_path
cfg.data_dir += "/edgar"
cfg.data_file = 'edgar.parquet'
cfg.data_columns = ['id', 'filename', 'item', 'cik', 'company', 'text']
cfg._pipeline_ = ['predict']
cfg.predict.model = model_cfg
cfg.predict.output_dir = "../data/predict"
cfg.predict.output_file = f'{cfg.name}-lm.parquet'
cfg.num_workers = 100
df = eKonf.instantiate(cfg)
df.head()
INFO:ekorpkit.base:instantiating ekorpkit.pipelines.pipe.pipeline...
INFO:ekorpkit.base:instantiating ekorpkit.pipelines.data.Data...
INFO:ekorpkit.base:Applying pipe: functools.partial(<function predict at 0x7f4c5c602820>)
INFO:ekorpkit.base:instantiating ekorpkit.models.sentiment.lbsa.SentimentAnalyser...
INFO:ekorpkit.base:Calling load_candidates
INFO:ekorpkit.base:Using batcher with minibatch size: 16
| | id | filename | item | text | cik | company | filing_type | filing_date | period_of_report | sic | state_of_inc | state_location | fiscal_year_end | filing_html_index | htm_filing_link | complete_text_filing_link | num_tokens | polarity | polarity_label | uncertainty |
| 1410 | 1534 | 1999/320193_10K_1999_0000912057-99-010244.json | item_1 | ITEM 1. \nBUSINESS GENERAL Apple Computer, Inc... | 320193 | APPLE COMPUTER INC | 10-K | 1999-12-22 | 1999-09-25 | 3571 | CA | CA | 0930 | https://www.sec.gov/Archives/edgar/data/320193... | None | https://www.sec.gov/Archives/edgar/data/320193... | 3901 | -0.117647 | neutral | 0.011280 |
| 1560 | 1697 | 1999/21344_10K_1999_0000021344-00-000009.json | item_1 | ITEM 1. \nBUSINESS The Coca-Cola Company (toge... | 21344 | COCA COLA CO | 10-K | 2000-03-09 | 1999-12-31 | 2080 | DE | GA | 1231 | https://www.sec.gov/Archives/edgar/data/21344/... | None | https://www.sec.gov/Archives/edgar/data/21344/... | 6755 | 0.066667 | neutral | 0.014805 |
| 2746 | 2977 | 1999/70858_10K_1999_0000950168-00-000621.json | item_1 | Item 1. \nBUSINESS General Bank of America Cor... | 70858 | BANK OF AMERICA CORP /DE/ | 10-K | 2000-03-20 | 1999-12-31 | 6021 | DE | NC | 1231 | https://www.sec.gov/Archives/edgar/data/70858/... | None | https://www.sec.gov/Archives/edgar/data/70858/... | 3491 | -0.219512 | negative | 0.008595 |
| 3762 | 4088 | 1999/80424_10K_1999_0000080424-99-000027.json | item_1 | Item 1. \nBusiness. \n--------- General Develo... | 80424 | PROCTER & GAMBLE CO | 10-K | 1999-09-15 | 1999-06-30 | 2840 | OH | OH | 0630 | https://www.sec.gov/Archives/edgar/data/80424/... | None | https://www.sec.gov/Archives/edgar/data/80424/... | 1259 | 0.166667 | neutral | 0.011915 |
| 4806 | 5211 | 1999/1018724_10K_1999_0000891020-00-000622.json | item_1 | ITEM 1. \nBUSINESS This Annual Report on Form ... | 1018724 | AMAZON COM INC | 10-K | 2000-03-29 | 1999-12-31 | 5961 | DE | WA | 1231 | https://www.sec.gov/Archives/edgar/data/101872... | None | https://www.sec.gov/Archives/edgar/data/101872... | 12335 | -0.104167 | neutral | 0.020025 |
print(cfg.predict.output_dir)
print(cfg.predict.output_file)
../data/predict
edgar_sentiments-lm.parquet
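
The `polarity`, `polarity_label`, and `uncertainty` columns above come from a lexicon-based analyser. The following is a minimal sketch of how such scores are commonly computed, not ekorpkit's actual implementation. It assumes that `lm` refers to a Loughran-McDonald-style word list, that polarity is the difference between positive and negative word counts divided by their sum, and that uncertainty is the share of tokens found in an uncertainty list. The word lists, the `score_text` helper, and the `threshold` of 0.2 are all hypothetical and chosen only for illustration.

```python
import re

# Tiny illustrative word lists (hypothetical); a real lexicon is far larger.
POSITIVE = {"gain", "improve", "strong"}
NEGATIVE = {"loss", "decline", "litigation"}
UNCERTAIN = {"may", "could", "approximately"}


def score_text(text, threshold=0.2):
    """Score one text with simple word counts.

    polarity    = (pos - neg) / (pos + neg), or 0.0 when no sentiment word occurs
    uncertainty = uncertainty-word count / total token count
    The label threshold is an assumed value, not taken from ekorpkit.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    unc = sum(t in UNCERTAIN for t in tokens)

    total = pos + neg
    polarity = (pos - neg) / total if total else 0.0

    if polarity > threshold:
        label = "positive"
    elif polarity < -threshold:
        label = "negative"
    else:
        label = "neutral"

    return {
        "num_tokens": len(tokens),
        "polarity": polarity,
        "polarity_label": label,
        "uncertainty": unc / len(tokens) if tokens else 0.0,
    }


result = score_text("Strong gain; results improve despite one loss")
print(result)
```

Under these assumptions a long filing with few sentiment words relative to its length will tend to land near zero polarity, which is in line with most of the rows above being labelled `neutral`.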

Instantiating a transformer classification model with the financial_phrasebank dataset#

overrides=[
    '+model/transformer=classification',
    '+model/transformer/pretrained=finbert',
]
model_cfg = eKonf.compose('model/transformer=classification', overrides)
model_cfg.dataset = ds_cfg
model_cfg.verbose = False
model_cfg.config.num_train_epochs = 2
model_cfg.config.max_seq_length = 256
model_cfg.config.train_batch_size = 32
model_cfg.config.eval_batch_size = 32
model_cfg.labels = ['positive','neutral','negative']
model_cfg._method_ = ['train']
eKonf.instantiate(model_cfg)
INFO:ekorpkit.base:Calling train
/opt/conda/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
wandb: Currently logged in as: entelecheia. Use `wandb login --relogin` to force relogin
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
wandb version 0.12.21 is available! To upgrade, please run: $ pip install wandb --upgrade
Tracking run with wandb version 0.12.19
Run data is saved locally in /workspace/projects/ekorpkit-book/outputs/ekorpkit-book/finbert/wandb/run-20220708_000030-2krugt98
Finishing last run (ID:2krugt98) before initializing another...
Waiting for W&B process to finish... (success).

Run history:

Training loss:
acc: ▁█
eval_loss: █▁
global_step: ▁▂█
lr:
mcc: ▁█
train_loss: ▁█

Run summary:

Training loss: 0.45026
acc: 0.85912
eval_loss: 0.41092
global_step: 92
lr: 2e-05
mcc: 0.73629
train_loss: 0.24922

Synced stellar-darkness-7: https://wandb.ai/entelecheia/ekorpkit-book-ekorpkit-book/runs/2krugt98
Synced 4 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at: /workspace/projects/ekorpkit-book/outputs/ekorpkit-book/finbert/wandb/run-20220708_000030-2krugt98/logs
Successfully finished last run (ID:2krugt98). Initializing new run:
Run data is saved locally in /workspace/projects/ekorpkit-book/outputs/ekorpkit-book/finbert/wandb/run-20220708_000053-n5z5mrt4
<ekorpkit.models.transformer.simple.SimpleClassification at 0x7f4c2026ba00>
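
The run summary above reports `acc` and `mcc` on the evaluation split. Assuming `mcc` stands for the Matthews correlation coefficient, the sketch below shows how both metrics can be computed from true and predicted labels in plain Python. The helper names and the four example labels are made up for illustration; they are not the evaluation data or the code used by the training run.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    Uses the count-based form
        (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where c is the number of correct predictions, s the number of samples,
    and t_k / p_k the number of true / predicted samples of class k.
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)
    classes = set(t_counts) | set(p_counts)

    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in classes)
    denom = sqrt(
        (s * s - sum(v * v for v in p_counts.values()))
        * (s * s - sum(v * v for v in t_counts.values()))
    )
    # Convention: return 0.0 when the denominator vanishes
    # (for example, when every prediction is the same class).
    return numerator / denom if denom else 0.0


# Made-up labels for illustration only.
y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral"]
print("acc:", accuracy(y_true, y_pred))
print("mcc:", multiclass_mcc(y_true, y_pred))
```

Unlike accuracy, this coefficient accounts for how predictions are distributed across classes, which matters for datasets where one label dominates.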
model_cfg._method_ = []

cfg = eKonf.compose('pipeline')
cfg.name = 'edgar_sentiments'
cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/edgar.zip'
cfg.data_dir = cfg.path.cached_path
cfg.data_dir += "/edgar"
cfg.data_file = 'edgar.parquet'
cfg.data_columns = ['id', 'filename', 'item', 'cik', 'company', 'text']
cfg._pipeline_ = ['predict']
cfg.predict.model = model_cfg
cfg.predict.output_dir = "./data/predict"
cfg.predict.output_file = f'{cfg.name}-finbert.parquet'
cfg.num_workers = 1
df = eKonf.instantiate(cfg)
df.head()
INFO:ekorpkit.base:Applying pipe: functools.partial(<function predict at 0x7f4c5c602820>)
INFO:ekorpkit.base:No method defined to call
Token indices sequence length is longer than the specified maximum sequence length for this model (4252 > 512). Running this sequence through the model will result in indexing errors
| | id | filename | item | text | cik | company | filing_type | filing_date | period_of_report | sic | state_of_inc | state_location | fiscal_year_end | filing_html_index | htm_filing_link | complete_text_filing_link | pred_labels | raw_preds | pred_probs |
| 1410 | 1534 | 1999/320193_10K_1999_0000912057-99-010244.json | item_1 | ITEM 1. \nBUSINESS GENERAL Apple Computer, Inc... | 320193 | APPLE COMPUTER INC | 10-K | 1999-12-22 | 1999-09-25 | 3571 | CA | CA | 0930 | https://www.sec.gov/Archives/edgar/data/320193... | None | https://www.sec.gov/Archives/edgar/data/320193... | positive | [2.291260004043579, -0.967369019985199, -2.501... | 0.055811 |
| 1560 | 1697 | 1999/21344_10K_1999_0000021344-00-000009.json | item_1 | ITEM 1. \nBUSINESS The Coca-Cola Company (toge... | 21344 | COCA COLA CO | 10-K | 2000-03-09 | 1999-12-31 | 2080 | DE | GA | 1231 | https://www.sec.gov/Archives/edgar/data/21344/... | None | https://www.sec.gov/Archives/edgar/data/21344/... | neutral | [2.1693193912506104, -0.7590945959091187, -2.5... | 0.030452 |
| 2746 | 2977 | 1999/70858_10K_1999_0000950168-00-000621.json | item_1 | Item 1. \nBUSINESS General Bank of America Cor... | 70858 | BANK OF AMERICA CORP /DE/ | 10-K | 2000-03-20 | 1999-12-31 | 6021 | DE | NC | 1231 | https://www.sec.gov/Archives/edgar/data/70858/... | None | https://www.sec.gov/Archives/edgar/data/70858/... | neutral | [2.2866339683532715, -0.997267484664917, -2.47... | 0.066551 |
| 3762 | 4088 | 1999/80424_10K_1999_0000080424-99-000027.json | item_1 | Item 1. \nBusiness. \n--------- General Develo... | 80424 | PROCTER & GAMBLE CO | 10-K | 1999-09-15 | 1999-06-30 | 2840 | OH | OH | 0630 | https://www.sec.gov/Archives/edgar/data/80424/... | None | https://www.sec.gov/Archives/edgar/data/80424/... | neutral | [2.303800106048584, -1.0463552474975586, -2.49... | 0.183069 |
| 4806 | 5211 | 1999/1018724_10K_1999_0000891020-00-000622.json | item_1 | ITEM 1. \nBUSINESS This Annual Report on Form ... | 1018724 | AMAZON COM INC | 10-K | 2000-03-29 | 1999-12-31 | 5961 | DE | WA | 1231 | https://www.sec.gov/Archives/edgar/data/101872... | None | https://www.sec.gov/Archives/edgar/data/101872... | positive | [-1.7280608415603638, 2.051792621612549, 0.995... | 0.017812 |
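
The `raw_preds` column holds one raw score (logit) per class. As a rough illustration of how logits are conventionally turned into a label and a probability, the sketch below applies a softmax and takes the arg-max. It assumes that logit positions follow the order of `model_cfg.labels`; the logit values are invented, and the `softmax` and `decode` helpers are hypothetical. The pipeline may aggregate or map scores differently, so this sketch is not expected to reproduce the `pred_labels` and `pred_probs` values in the table.

```python
import math

# Label order assumed to match model_cfg.labels from the training config.
LABELS = ["positive", "neutral", "negative"]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Invented logits for illustration only.
label, prob = decode([2.0, -1.0, -2.5])
print(label, round(prob, 4))
```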
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1585 entries, 1410 to 1291201
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         1585 non-null   int64  
 1   filename                   1585 non-null   object 
 2   item                       1585 non-null   object 
 3   text                       1585 non-null   object 
 4   cik                        1585 non-null   object 
 5   company                    1585 non-null   object 
 6   filing_type                1585 non-null   object 
 7   filing_date                1585 non-null   object 
 8   period_of_report           1585 non-null   object 
 9   sic                        1585 non-null   object 
 10  state_of_inc               1585 non-null   object 
 11  state_location             1585 non-null   object 
 12  fiscal_year_end            1585 non-null   object 
 13  filing_html_index          1585 non-null   object 
 14  htm_filing_link            1228 non-null   object 
 15  complete_text_filing_link  1585 non-null   object 
 16  pred_labels                1585 non-null   object 
 17  raw_preds                  1585 non-null   object 
 18  pred_probs                 1585 non-null   float64
dtypes: float64(1), int64(1), object(17)
memory usage: 247.7+ KB
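
The tokenizer warning in the prediction output (4252 > 512) indicates that the filings are much longer than the model's maximum sequence length, so each document has to be truncated or split before it can be classified. One common way to handle this is to cut the token sequence into overlapping windows and score each window separately. The sketch below illustrates that idea on a plain list of tokens. The `chunk_tokens` helper and the `stride` value are hypothetical, and whether the pipeline above truncates or windows its inputs is not stated in the output.

```python
def chunk_tokens(tokens, max_len=256, stride=128):
    """Split a token list into overlapping windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so they overlap by
    max_len - stride tokens when stride < max_len.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if not tokens:
        return []

    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end
        start += stride
    return chunks


# A stand-in for a tokenized document; real inputs would be subword tokens.
tokens = [f"tok{i}" for i in range(600)]
windows = chunk_tokens(tokens, max_len=256, stride=128)
print(len(windows), [len(w) for w in windows])
```

Per-window predictions then need to be combined into one document-level label, for example by majority vote or by averaging probabilities; which rule is appropriate depends on the task.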