Predicting Sentiments#

from ekorpkit import eKonf

eKonf.setLogger("WARNING")
print("version:", eKonf.__version__)
print("is notebook?", eKonf.is_notebook())
print("is colab?", eKonf.is_colab())
print("environment variables:")
eKonf.print(eKonf.env().dict())

INFO:ekorpkit.base:IPython version: (6, 9, 0), client: jupyter_client
INFO:ekorpkit.base:Google Colab not detected.

version: 0.1.35+2.g81d5295
is notebook? True
is colab? False
environment variables:
{'CUDA_DEVICE_ORDER': None,
 'CUDA_VISIBLE_DEVICES': None,
 'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
 'EKORPKIT_DATA_DIR': None,
 'EKORPKIT_LOG_LEVEL': 'WARNING',
 'EKORPKIT_PROJECT': 'ekorpkit-book',
 'EKORPKIT_WORKSPACE_ROOT': '/workspace',
 'KMP_DUPLICATE_LIB_OK': 'TRUE',
 'NUM_WORKERS': 230}

Prepare `edgar` sample dataframe#

df_cfg = eKonf.compose('pipeline=blank')
df_cfg.name = 'edgar_sample'
df_cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/edgar.zip'
df_cfg.data_dir = df_cfg.path.cached_path
df_cfg.data_dir += "/edgar"
df_cfg.data_file = 'edgar.parquet'
df_cfg.data_columns = ['id', 'filename', 'item', 'cik', 'company', 'text']
df = eKonf.instantiate(df_cfg)
df.head()

INFO:ekorpkit.base:IPython version: (6, 9, 0), client: jupyter_client
WARNING:ekorpkit.pipelines.pipe:No pipeline specified

	id	filename	item	text	cik	company	filing_type	filing_date	period_of_report	sic	state_of_inc	state_location	fiscal_year_end	filing_html_index	htm_filing_link	complete_text_filing_link
1410	1534	1999/320193_10K_1999_0000912057-99-010244.json	item_1	ITEM 1. \nBUSINESS GENERAL Apple Computer, Inc...	320193	APPLE COMPUTER INC	10-K	1999-12-22	1999-09-25	3571	CA	CA	0930	https://www.sec.gov/Archives/edgar/data/320193...	None	https://www.sec.gov/Archives/edgar/data/320193...
1560	1697	1999/21344_10K_1999_0000021344-00-000009.json	item_1	ITEM 1. \nBUSINESS The Coca-Cola Company (toge...	21344	COCA COLA CO	10-K	2000-03-09	1999-12-31	2080	DE	GA	1231	https://www.sec.gov/Archives/edgar/data/21344/...	None	https://www.sec.gov/Archives/edgar/data/21344/...
2746	2977	1999/70858_10K_1999_0000950168-00-000621.json	item_1	Item 1. \nBUSINESS General Bank of America Cor...	70858	BANK OF AMERICA CORP /DE/	10-K	2000-03-20	1999-12-31	6021	DE	NC	1231	https://www.sec.gov/Archives/edgar/data/70858/...	None	https://www.sec.gov/Archives/edgar/data/70858/...
3762	4088	1999/80424_10K_1999_0000080424-99-000027.json	item_1	Item 1. \nBusiness. \n--------- General Develo...	80424	PROCTER & GAMBLE CO	10-K	1999-09-15	1999-06-30	2840	OH	OH	0630	https://www.sec.gov/Archives/edgar/data/80424/...	None	https://www.sec.gov/Archives/edgar/data/80424/...
4806	5211	1999/1018724_10K_1999_0000891020-00-000622.json	item_1	ITEM 1. \nBUSINESS This Annual Report on Form ...	1018724	AMAZON COM INC	10-K	2000-03-29	1999-12-31	5961	DE	WA	1231	https://www.sec.gov/Archives/edgar/data/101872...	None	https://www.sec.gov/Archives/edgar/data/101872...

Prepare `financial_phrasebank` dataset#

ds_cfg = eKonf.compose('dataset')
ds_cfg.name = 'financial_phrasebank'
ds_cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/financial_phrasebank.zip'
ds_cfg.data_dir = ds_cfg.path.cached_path
ds = eKonf.instantiate(ds_cfg)
print(ds)

Dataset : financial_phrasebank

Compose a config for the LM sentiment analyser class#

model_cfg = eKonf.compose('model/sentiment=lm')
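A lexicon-based analyser scores a document by counting words that appear in sentiment word lists. The sketch below illustrates the general idea only. It assumes that `lm` refers to a Loughran-McDonald style financial lexicon, and the tiny word lists, the `score_text` helper, and the `threshold` value are all hypothetical and are not taken from ekorpkit.

```python
import re

# Hypothetical mini-lexicon for illustration only; real financial
# lexicons contain thousands of entries.
POSITIVE = {"gain", "improve", "strong", "profitable"}
NEGATIVE = {"loss", "decline", "litigation", "impairment"}
UNCERTAIN = {"may", "could", "uncertain", "approximately"}


def score_text(text, threshold=0.2):
    """Return lexicon-based sentiment features for one document.

    `threshold` is an assumed cut-off for turning the polarity score
    into a label; it is not ekorpkit's actual setting.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    unc = sum(t in UNCERTAIN for t in tokens)

    # Polarity is the balance of positive and negative matches.
    matched = pos + neg
    polarity = (pos - neg) / matched if matched else 0.0

    if polarity > threshold:
        label = "positive"
    elif polarity < -threshold:
        label = "negative"
    else:
        label = "neutral"

    return {
        "num_tokens": len(tokens),
        "polarity": polarity,
        "polarity_label": label,
        "uncertainty": unc / len(tokens) if tokens else 0.0,
    }


result = score_text(
    "Impairment charges and litigation led to a loss despite strong demand."
)
print(result)
```

Here one positive match and three negative matches give a polarity of (1 - 3) / 4 = -0.5, which falls below the assumed threshold and is labelled negative.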

Instantiating a sentiment analyser class and predicting sentiments for the `edgar` dataset#

cfg = eKonf.compose(config_group='pipeline')
cfg.verbose = True
cfg.name = 'edgar_sentiments'
cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/edgar.zip'
cfg.data_dir = cfg.path.cached_path
cfg.data_dir += "/edgar"
cfg.data_file = 'edgar.parquet'
cfg.data_columns = ['id', 'filename', 'item', 'cik', 'company', 'text']
cfg._pipeline_ = ['predict']
cfg.predict.model = model_cfg
cfg.predict.output_dir = "../data/predict"
cfg.predict.output_file = f'{cfg.name}-lm.parquet'
cfg.num_workers = 100
df = eKonf.instantiate(cfg)
df.head()

INFO:ekorpkit.base:instantiating ekorpkit.pipelines.pipe.pipeline...
INFO:ekorpkit.base:instantiating ekorpkit.pipelines.data.Data...
INFO:ekorpkit.base:Applying pipe: functools.partial(<function predict at 0x7f4c5c602820>)
INFO:ekorpkit.base:instantiating ekorpkit.models.sentiment.lbsa.SentimentAnalyser...
INFO:ekorpkit.base:Calling load_candidates
INFO:ekorpkit.base:Using batcher with minibatch size: 16

	id	filename	item	text	cik	company	filing_type	filing_date	period_of_report	sic	state_of_inc	state_location	fiscal_year_end	filing_html_index	htm_filing_link	complete_text_filing_link	num_tokens	polarity	polarity_label	uncertainty
1410	1534	1999/320193_10K_1999_0000912057-99-010244.json	item_1	ITEM 1. \nBUSINESS GENERAL Apple Computer, Inc...	320193	APPLE COMPUTER INC	10-K	1999-12-22	1999-09-25	3571	CA	CA	0930	https://www.sec.gov/Archives/edgar/data/320193...	None	https://www.sec.gov/Archives/edgar/data/320193...	3901	-0.117647	neutral	0.011280
1560	1697	1999/21344_10K_1999_0000021344-00-000009.json	item_1	ITEM 1. \nBUSINESS The Coca-Cola Company (toge...	21344	COCA COLA CO	10-K	2000-03-09	1999-12-31	2080	DE	GA	1231	https://www.sec.gov/Archives/edgar/data/21344/...	None	https://www.sec.gov/Archives/edgar/data/21344/...	6755	0.066667	neutral	0.014805
2746	2977	1999/70858_10K_1999_0000950168-00-000621.json	item_1	Item 1. \nBUSINESS General Bank of America Cor...	70858	BANK OF AMERICA CORP /DE/	10-K	2000-03-20	1999-12-31	6021	DE	NC	1231	https://www.sec.gov/Archives/edgar/data/70858/...	None	https://www.sec.gov/Archives/edgar/data/70858/...	3491	-0.219512	negative	0.008595
3762	4088	1999/80424_10K_1999_0000080424-99-000027.json	item_1	Item 1. \nBusiness. \n--------- General Develo...	80424	PROCTER & GAMBLE CO	10-K	1999-09-15	1999-06-30	2840	OH	OH	0630	https://www.sec.gov/Archives/edgar/data/80424/...	None	https://www.sec.gov/Archives/edgar/data/80424/...	1259	0.166667	neutral	0.011915
4806	5211	1999/1018724_10K_1999_0000891020-00-000622.json	item_1	ITEM 1. \nBUSINESS This Annual Report on Form ...	1018724	AMAZON COM INC	10-K	2000-03-29	1999-12-31	5961	DE	WA	1231	https://www.sec.gov/Archives/edgar/data/101872...	None	https://www.sec.gov/Archives/edgar/data/101872...	12335	-0.104167	neutral	0.020025

print(cfg.predict.output_dir)
print(cfg.predict.output_file)

../data/predict
edgar_sentiments-lm.parquet
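To reload the saved predictions later, the two printed values can be joined into a single path. This is a minimal sketch using only the standard library. It assumes the relative directory is resolved against the notebook's working directory, which the source does not confirm, and the commented-out `pandas.read_parquet` call additionally assumes a parquet engine such as pyarrow is installed.

```python
from pathlib import Path

# Values printed by the cell above.
output_dir = "../data/predict"
output_file = "edgar_sentiments-lm.parquet"

# Join directory and file name into one path object.
output_path = Path(output_dir) / output_file
print(output_path.as_posix())

# Assumption: pandas and a parquet engine are available.
# import pandas as pd
# if output_path.exists():
#     predictions = pd.read_parquet(output_path)
```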

Instantiating a transformer classficiation model with `financial_phrasebank` dataset#

overrides=[
    '+model/transformer=classification',
    '+model/transformer/pretrained=finbert',
]
model_cfg = eKonf.compose('model/transformer=classification', overrides)
model_cfg.dataset = ds_cfg
model_cfg.verbose = False
model_cfg.config.num_train_epochs = 2
model_cfg.config.max_seq_length = 256
model_cfg.config.train_batch_size = 32
model_cfg.config.eval_batch_size = 32
model_cfg.labels = ['positive','neutral','negative']
model_cfg._method_ = ['train']
eKonf.instantiate(model_cfg)

INFO:ekorpkit.base:Calling train

/opt/conda/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

wandb: Currently logged in as: entelecheia. Use `wandb login --relogin` to force relogin

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
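The second option named in the warning can be applied from Python. A minimal sketch, assuming it runs near the top of the notebook, before `tokenizers` is first used:

```python
import os

# Disable tokenizer parallelism explicitly so that forked worker
# processes do not trigger the warning above.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```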

wandb version 0.12.21 is available! To upgrade, please run: $ pip install wandb --upgrade

Tracking run with wandb version 0.12.19

Run data is saved locally in /workspace/projects/ekorpkit-book/outputs/ekorpkit-book/finbert/wandb/run-20220708_000030-2krugt98

Syncing run stellar-darkness-7 to Weights & Biases (docs)

Finishing last run (ID:2krugt98) before initializing another...

Waiting for W&B process to finish... (success).

Run history:

Training loss	▁
acc	▁█
eval_loss	█▁
global_step	▁▂█
lr	▁
mcc	▁█
train_loss	▁█

Run summary:

Training loss	0.45026
acc	0.85912
eval_loss	0.41092
global_step	92
lr	2e-05
mcc	0.73629
train_loss	0.24922
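The run summary above reports `acc` and `mcc`, which presumably stand for accuracy and the Matthews correlation coefficient. The sketch below shows how these two metrics can be computed for a multiclass problem in plain Python. The helper names and the six example labels are hypothetical and are not the evaluation data from this run.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    Uses c (correct predictions), s (sample count), and the
    per-class counts of true and predicted labels.
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)
    labels = set(t_counts) | set(p_counts)

    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_tt = s * s - sum(t_counts[k] ** 2 for k in labels)
    cov_pp = s * s - sum(p_counts[k] ** 2 for k in labels)

    denom = sqrt(cov_tt * cov_pp)
    # By convention, return 0.0 when the score is undefined.
    return cov_tp / denom if denom else 0.0


# Hypothetical labels for illustration only.
y_true = ["positive", "neutral", "negative", "neutral", "positive", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral", "negative", "neutral"]

print("acc:", accuracy(y_true, y_pred))
print("mcc:", multiclass_mcc(y_true, y_pred))
```

Unlike accuracy, this coefficient accounts for how the labels are distributed across classes, which makes it a useful companion metric when one class (often `neutral`) dominates.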

Synced stellar-darkness-7: https://wandb.ai/entelecheia/ekorpkit-book-ekorpkit-book/runs/2krugt98
Synced 4 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)

Find logs at: /workspace/projects/ekorpkit-book/outputs/ekorpkit-book/finbert/wandb/run-20220708_000030-2krugt98/logs

Successfully finished last run (ID:2krugt98). Initializing new run:

wandb version 0.12.21 is available! To upgrade, please run: $ pip install wandb --upgrade

Tracking run with wandb version 0.12.19

Run data is saved locally in /workspace/projects/ekorpkit-book/outputs/ekorpkit-book/finbert/wandb/run-20220708_000053-n5z5mrt4

Syncing run prime-snowflake-8 to Weights & Biases (docs)

<ekorpkit.models.transformer.simple.SimpleClassification at 0x7f4c2026ba00>

model_cfg._method_ = []

cfg = eKonf.compose('pipeline')
cfg.name = 'edgar_sentiments'
cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/edgar.zip'
cfg.data_dir = cfg.path.cached_path
cfg.data_dir += "/edgar"
cfg.data_file = 'edgar.parquet'
cfg.data_columns = ['id', 'filename', 'item', 'cik', 'company', 'text']
cfg._pipeline_ = ['predict']
cfg.predict.model = model_cfg
cfg.predict.output_dir = "./data/predict"
cfg.predict.output_file = f'{cfg.name}-finbert.parquet'
cfg.num_workers = 1
df = eKonf.instantiate(cfg)
df.head()

INFO:ekorpkit.base:Applying pipe: functools.partial(<function predict at 0x7f4c5c602820>)
INFO:ekorpkit.base:No method defined to call

Token indices sequence length is longer than the specified maximum sequence length for this model (4252 > 512). Running this sequence through the model will result in indexing errors

	id	filename	item	text	cik	company	filing_type	filing_date	period_of_report	sic	state_of_inc	state_location	fiscal_year_end	filing_html_index	htm_filing_link	complete_text_filing_link	pred_labels	raw_preds	pred_probs
1410	1534	1999/320193_10K_1999_0000912057-99-010244.json	item_1	ITEM 1. \nBUSINESS GENERAL Apple Computer, Inc...	320193	APPLE COMPUTER INC	10-K	1999-12-22	1999-09-25	3571	CA	CA	0930	https://www.sec.gov/Archives/edgar/data/320193...	None	https://www.sec.gov/Archives/edgar/data/320193...	positive	[2.291260004043579, -0.967369019985199, -2.501...	0.055811
1560	1697	1999/21344_10K_1999_0000021344-00-000009.json	item_1	ITEM 1. \nBUSINESS The Coca-Cola Company (toge...	21344	COCA COLA CO	10-K	2000-03-09	1999-12-31	2080	DE	GA	1231	https://www.sec.gov/Archives/edgar/data/21344/...	None	https://www.sec.gov/Archives/edgar/data/21344/...	neutral	[2.1693193912506104, -0.7590945959091187, -2.5...	0.030452
2746	2977	1999/70858_10K_1999_0000950168-00-000621.json	item_1	Item 1. \nBUSINESS General Bank of America Cor...	70858	BANK OF AMERICA CORP /DE/	10-K	2000-03-20	1999-12-31	6021	DE	NC	1231	https://www.sec.gov/Archives/edgar/data/70858/...	None	https://www.sec.gov/Archives/edgar/data/70858/...	neutral	[2.2866339683532715, -0.997267484664917, -2.47...	0.066551
3762	4088	1999/80424_10K_1999_0000080424-99-000027.json	item_1	Item 1. \nBusiness. \n--------- General Develo...	80424	PROCTER & GAMBLE CO	10-K	1999-09-15	1999-06-30	2840	OH	OH	0630	https://www.sec.gov/Archives/edgar/data/80424/...	None	https://www.sec.gov/Archives/edgar/data/80424/...	neutral	[2.303800106048584, -1.0463552474975586, -2.49...	0.183069
4806	5211	1999/1018724_10K_1999_0000891020-00-000622.json	item_1	ITEM 1. \nBUSINESS This Annual Report on Form ...	1018724	AMAZON COM INC	10-K	2000-03-29	1999-12-31	5961	DE	WA	1231	https://www.sec.gov/Archives/edgar/data/101872...	None	https://www.sec.gov/Archives/edgar/data/101872...	positive	[-1.7280608415603638, 2.051792621612549, 0.995...	0.017812
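The tokenizer warning above (4252 > 512) indicates that these filings are far longer than the model's maximum input length. One generic way to handle such documents is to split the token sequence into overlapping windows and classify each window separately. The source does not show whether ekorpkit truncates or windows the input, so the sketch below is an illustration only. The `sliding_windows` helper and the `stride` value are hypothetical, `max_len=256` mirrors the `max_seq_length` set earlier, and the positions normally reserved for special tokens are ignored for simplicity.

```python
def sliding_windows(token_ids, max_len=256, stride=128):
    """Split a token sequence into overlapping windows.

    Each window holds at most `max_len` tokens and starts `stride`
    tokens after the previous one, so neighbouring windows overlap
    by `max_len - stride` tokens.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")

    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


# Stand-in for a tokenized filing of the length reported in the warning.
tokens = list(range(4252))
chunks = sliding_windows(tokens, max_len=256, stride=128)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Per-window predictions then need to be combined into one document-level label, for example by majority vote or by averaging the window scores.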

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1585 entries, 1410 to 1291201
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         1585 non-null   int64  
 1   filename                   1585 non-null   object 
 2   item                       1585 non-null   object 
 3   text                       1585 non-null   object 
 4   cik                        1585 non-null   object 
 5   company                    1585 non-null   object 
 6   filing_type                1585 non-null   object 
 7   filing_date                1585 non-null   object 
 8   period_of_report           1585 non-null   object 
 9   sic                        1585 non-null   object 
 10  state_of_inc               1585 non-null   object 
 11  state_location             1585 non-null   object 
 12  fiscal_year_end            1585 non-null   object 
 13  filing_html_index          1585 non-null   object 
 14  htm_filing_link            1228 non-null   object 
 15  complete_text_filing_link  1585 non-null   object 
 16  pred_labels                1585 non-null   object 
 17  raw_preds                  1585 non-null   object 
 18  pred_probs                 1585 non-null   float64
dtypes: float64(1), int64(1), object(17)
memory usage: 247.7+ KB
