Predicting Sentiments#
from ekorpkit import eKonf
eKonf.setLogger("WARNING")
print("version:", eKonf.__version__)
print("is notebook?", eKonf.is_notebook())
print("is colab?", eKonf.is_colab())
print("environment variables:")
eKonf.print(eKonf.env().dict())
INFO:ekorpkit.base:IPython version: (6, 9, 0), client: jupyter_client
INFO:ekorpkit.base:Google Colab not detected.
version: 0.1.35+2.g81d5295
is notebook? True
is colab? False
environment variables:
{'CUDA_DEVICE_ORDER': None,
'CUDA_VISIBLE_DEVICES': None,
'EKORPKIT_CONFIG_DIR': '/workspace/projects/ekorpkit-book/config',
'EKORPKIT_DATA_DIR': None,
'EKORPKIT_LOG_LEVEL': 'WARNING',
'EKORPKIT_PROJECT': 'ekorpkit-book',
'EKORPKIT_WORKSPACE_ROOT': '/workspace',
'KMP_DUPLICATE_LIB_OK': 'TRUE',
'NUM_WORKERS': 230}
Prepare edgar sample dataframe#
df_cfg = eKonf.compose('pipeline=blank')
df_cfg.name = 'edgar_sample'
df_cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/edgar.zip'
df_cfg.data_dir = df_cfg.path.cached_path
df_cfg.data_dir += "/edgar"
df_cfg.data_file = 'edgar.parquet'
df_cfg.data_columns = ['id', 'filename', 'item', 'cik', 'company', 'text']
df = eKonf.instantiate(df_cfg)
df.head()
INFO:ekorpkit.base:IPython version: (6, 9, 0), client: jupyter_client
WARNING:ekorpkit.pipelines.pipe:No pipeline specified
| | id | filename | item | text | cik | company | filing_type | filing_date | period_of_report | sic | state_of_inc | state_location | fiscal_year_end | filing_html_index | htm_filing_link | complete_text_filing_link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1410 | 1534 | 1999/320193_10K_1999_0000912057-99-010244.json | item_1 | ITEM 1. \nBUSINESS GENERAL Apple Computer, Inc... | 320193 | APPLE COMPUTER INC | 10-K | 1999-12-22 | 1999-09-25 | 3571 | CA | CA | 0930 | https://www.sec.gov/Archives/edgar/data/320193... | None | https://www.sec.gov/Archives/edgar/data/320193... |
1560 | 1697 | 1999/21344_10K_1999_0000021344-00-000009.json | item_1 | ITEM 1. \nBUSINESS The Coca-Cola Company (toge... | 21344 | COCA COLA CO | 10-K | 2000-03-09 | 1999-12-31 | 2080 | DE | GA | 1231 | https://www.sec.gov/Archives/edgar/data/21344/... | None | https://www.sec.gov/Archives/edgar/data/21344/... |
2746 | 2977 | 1999/70858_10K_1999_0000950168-00-000621.json | item_1 | Item 1. \nBUSINESS General Bank of America Cor... | 70858 | BANK OF AMERICA CORP /DE/ | 10-K | 2000-03-20 | 1999-12-31 | 6021 | DE | NC | 1231 | https://www.sec.gov/Archives/edgar/data/70858/... | None | https://www.sec.gov/Archives/edgar/data/70858/... |
3762 | 4088 | 1999/80424_10K_1999_0000080424-99-000027.json | item_1 | Item 1. \nBusiness. \n--------- General Develo... | 80424 | PROCTER & GAMBLE CO | 10-K | 1999-09-15 | 1999-06-30 | 2840 | OH | OH | 0630 | https://www.sec.gov/Archives/edgar/data/80424/... | None | https://www.sec.gov/Archives/edgar/data/80424/... |
4806 | 5211 | 1999/1018724_10K_1999_0000891020-00-000622.json | item_1 | ITEM 1. \nBUSINESS This Annual Report on Form ... | 1018724 | AMAZON COM INC | 10-K | 2000-03-29 | 1999-12-31 | 5961 | DE | WA | 1231 | https://www.sec.gov/Archives/edgar/data/101872... | None | https://www.sec.gov/Archives/edgar/data/101872... |
Prepare financial_phrasebank dataset#
ds_cfg = eKonf.compose('dataset')
ds_cfg.name = 'financial_phrasebank'
ds_cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/financial_phrasebank.zip'
ds_cfg.data_dir = ds_cfg.path.cached_path
ds = eKonf.instantiate(ds_cfg)
print(ds)
Dataset : financial_phrasebank
Compose a config for the LM sentiment analyser class#
model_cfg = eKonf.compose('model/sentiment=lm')
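To make the idea behind a lexicon-based analyser concrete, here is a minimal sketch of dictionary scoring. It is not ekorpkit's implementation: the word lists are a tiny hypothetical stand-in for a sentiment dictionary, the polarity formula `(pos - neg) / (pos + neg)` is one common definition, and the `threshold` of 0.2 for labelling is an assumed value.

```python
import re

# Hypothetical mini-lexicon for illustration only; a real sentiment
# dictionary is far larger.
POSITIVE = {"gain", "improve", "strong", "profit"}
NEGATIVE = {"loss", "decline", "adverse", "litigation"}


def lexicon_polarity(text, threshold=0.2):
    """Return (num_tokens, polarity, label) for a text.

    polarity = (pos - neg) / (pos + neg), or 0.0 when no lexicon word occurs.
    The labelling threshold is an assumed value, not taken from ekorpkit.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    total = pos + neg
    polarity = (pos - neg) / total if total else 0.0
    if polarity > threshold:
        label = "positive"
    elif polarity < -threshold:
        label = "negative"
    else:
        label = "neutral"
    return len(tokens), polarity, label


print(lexicon_polarity("Strong profit was offset by a loss."))
```

Because the score only counts dictionary hits, long filings with few matching words tend to land near zero and are labelled neutral under a threshold like this.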
Instantiating a sentiment analyser class and predicting sentiments of the edgar dataset#
cfg = eKonf.compose(config_group='pipeline')
cfg.verbose = True
cfg.name = 'edgar_sentiments'
cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/edgar.zip'
cfg.data_dir = cfg.path.cached_path
cfg.data_dir += "/edgar"
cfg.data_file = 'edgar.parquet'
cfg.data_columns = ['id', 'filename', 'item', 'cik', 'company', 'text']
cfg._pipeline_ = ['predict']
cfg.predict.model = model_cfg
cfg.predict.output_dir = "../data/predict"
cfg.predict.output_file = f'{cfg.name}-lm.parquet'
cfg.num_workers = 100
df = eKonf.instantiate(cfg)
df.head()
INFO:ekorpkit.base:instantiating ekorpkit.pipelines.pipe.pipeline...
INFO:ekorpkit.base:instantiating ekorpkit.pipelines.data.Data...
INFO:ekorpkit.base:Applying pipe: functools.partial(<function predict at 0x7f4c5c602820>)
INFO:ekorpkit.base:instantiating ekorpkit.models.sentiment.lbsa.SentimentAnalyser...
INFO:ekorpkit.base:Calling load_candidates
INFO:ekorpkit.base:Using batcher with minibatch size: 16
| | id | filename | item | text | cik | company | filing_type | filing_date | period_of_report | sic | state_of_inc | state_location | fiscal_year_end | filing_html_index | htm_filing_link | complete_text_filing_link | num_tokens | polarity | polarity_label | uncertainty |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1410 | 1534 | 1999/320193_10K_1999_0000912057-99-010244.json | item_1 | ITEM 1. \nBUSINESS GENERAL Apple Computer, Inc... | 320193 | APPLE COMPUTER INC | 10-K | 1999-12-22 | 1999-09-25 | 3571 | CA | CA | 0930 | https://www.sec.gov/Archives/edgar/data/320193... | None | https://www.sec.gov/Archives/edgar/data/320193... | 3901 | -0.117647 | neutral | 0.011280 |
1560 | 1697 | 1999/21344_10K_1999_0000021344-00-000009.json | item_1 | ITEM 1. \nBUSINESS The Coca-Cola Company (toge... | 21344 | COCA COLA CO | 10-K | 2000-03-09 | 1999-12-31 | 2080 | DE | GA | 1231 | https://www.sec.gov/Archives/edgar/data/21344/... | None | https://www.sec.gov/Archives/edgar/data/21344/... | 6755 | 0.066667 | neutral | 0.014805 |
2746 | 2977 | 1999/70858_10K_1999_0000950168-00-000621.json | item_1 | Item 1. \nBUSINESS General Bank of America Cor... | 70858 | BANK OF AMERICA CORP /DE/ | 10-K | 2000-03-20 | 1999-12-31 | 6021 | DE | NC | 1231 | https://www.sec.gov/Archives/edgar/data/70858/... | None | https://www.sec.gov/Archives/edgar/data/70858/... | 3491 | -0.219512 | negative | 0.008595 |
3762 | 4088 | 1999/80424_10K_1999_0000080424-99-000027.json | item_1 | Item 1. \nBusiness. \n--------- General Develo... | 80424 | PROCTER & GAMBLE CO | 10-K | 1999-09-15 | 1999-06-30 | 2840 | OH | OH | 0630 | https://www.sec.gov/Archives/edgar/data/80424/... | None | https://www.sec.gov/Archives/edgar/data/80424/... | 1259 | 0.166667 | neutral | 0.011915 |
4806 | 5211 | 1999/1018724_10K_1999_0000891020-00-000622.json | item_1 | ITEM 1. \nBUSINESS This Annual Report on Form ... | 1018724 | AMAZON COM INC | 10-K | 2000-03-29 | 1999-12-31 | 5961 | DE | WA | 1231 | https://www.sec.gov/Archives/edgar/data/101872... | None | https://www.sec.gov/Archives/edgar/data/101872... | 12335 | -0.104167 | neutral | 0.020025 |
print(cfg.predict.output_dir)
print(cfg.predict.output_file)
../data/predict
edgar_sentiments-lm.parquet
Instantiating a transformer classification model with the financial_phrasebank dataset#
overrides = [
    '+model/transformer=classification',
    '+model/transformer/pretrained=finbert',
]
model_cfg = eKonf.compose('model/transformer=classification', overrides)
model_cfg.dataset = ds_cfg
model_cfg.verbose = False
model_cfg.config.num_train_epochs = 2
model_cfg.config.max_seq_length = 256
model_cfg.config.train_batch_size = 32
model_cfg.config.eval_batch_size = 32
model_cfg.labels = ['positive','neutral','negative']
model_cfg._method_ = ['train']
eKonf.instantiate(model_cfg)
INFO:ekorpkit.base:Calling train
/opt/conda/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
wandb: Currently logged in as: entelecheia. Use `wandb login --relogin` to force relogin
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
wandb version 0.12.21 is available! To upgrade, please run:
$ pip install wandb --upgrade
Tracking run with wandb version 0.12.19
Run data is saved locally in
/workspace/projects/ekorpkit-book/outputs/ekorpkit-book/finbert/wandb/run-20220708_000030-2krugt98
Finishing last run (ID:2krugt98) before initializing another...
Waiting for W&B process to finish... (success).
Run history:
Training loss | ▁ |
acc | ▁█ |
eval_loss | █▁ |
global_step | ▁▂█ |
lr | ▁ |
mcc | ▁█ |
train_loss | ▁█ |
Run summary:
Training loss | 0.45026 |
acc | 0.85912 |
eval_loss | 0.41092 |
global_step | 92 |
lr | 2e-05 |
mcc | 0.73629 |
train_loss | 0.24922 |
Synced stellar-darkness-7: https://wandb.ai/entelecheia/ekorpkit-book-ekorpkit-book/runs/2krugt98
Synced 4 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
Find logs at:
/workspace/projects/ekorpkit-book/outputs/ekorpkit-book/finbert/wandb/run-20220708_000030-2krugt98/logs
Successfully finished last run (ID:2krugt98). Initializing new run:
Tracking run with wandb version 0.12.19
Run data is saved locally in
/workspace/projects/ekorpkit-book/outputs/ekorpkit-book/finbert/wandb/run-20220708_000053-n5z5mrt4
<ekorpkit.models.transformer.simple.SimpleClassification at 0x7f4c2026ba00>
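The run summary above reports `acc` and `mcc` for the evaluation set. As a rough illustration of what these two numbers measure, the sketch below computes accuracy and the multiclass Matthews correlation coefficient from label lists. The function name and the toy labels are hypothetical, and this is not the code the training loop uses.

```python
import math
from collections import Counter


def accuracy_and_mcc(y_true, y_pred):
    """Compute accuracy and the multiclass Matthews correlation coefficient."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)  # how often each class truly occurs
    pred_counts = Counter(y_pred)  # how often each class is predicted
    labels = set(true_counts) | set(pred_counts)
    # Covariance-style terms of the multiclass MCC definition.
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = n * n - sum(c * c for c in true_counts.values())
    cov_pp = n * n - sum(c * c for c in pred_counts.values())
    denom = math.sqrt(cov_tt * cov_pp)
    mcc = cov_tp / denom if denom else 0.0
    return correct / n, mcc


# Toy labels, made up for illustration.
y_true = ["positive", "neutral", "negative", "neutral", "positive", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral", "negative", "neutral"]
acc, mcc = accuracy_and_mcc(y_true, y_pred)
print(f"acc={acc:.4f} mcc={mcc:.4f}")
```

Unlike accuracy, MCC accounts for how predictions are distributed across classes, so a model that always predicts the majority class scores 0 rather than a deceptively high value.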
model_cfg._method_ = []
cfg = eKonf.compose('pipeline')
cfg.name = 'edgar_sentiments'
cfg.path.cache.uri = 'https://github.com/entelecheia/ekorpkit-book/raw/main/assets/data/edgar.zip'
cfg.data_dir = cfg.path.cached_path
cfg.data_dir += "/edgar"
cfg.data_file = 'edgar.parquet'
cfg.data_columns = ['id', 'filename', 'item', 'cik', 'company', 'text']
cfg._pipeline_ = ['predict']
cfg.predict.model = model_cfg
cfg.predict.output_dir = "../data/predict"
cfg.predict.output_file = f'{cfg.name}-finbert.parquet'
cfg.num_workers = 1
df = eKonf.instantiate(cfg)
df.head()
INFO:ekorpkit.base:Applying pipe: functools.partial(<function predict at 0x7f4c5c602820>)
INFO:ekorpkit.base:No method defined to call
Token indices sequence length is longer than the specified maximum sequence length for this model (4252 > 512). Running this sequence through the model will result in indexing errors
| | id | filename | item | text | cik | company | filing_type | filing_date | period_of_report | sic | state_of_inc | state_location | fiscal_year_end | filing_html_index | htm_filing_link | complete_text_filing_link | pred_labels | raw_preds | pred_probs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1410 | 1534 | 1999/320193_10K_1999_0000912057-99-010244.json | item_1 | ITEM 1. \nBUSINESS GENERAL Apple Computer, Inc... | 320193 | APPLE COMPUTER INC | 10-K | 1999-12-22 | 1999-09-25 | 3571 | CA | CA | 0930 | https://www.sec.gov/Archives/edgar/data/320193... | None | https://www.sec.gov/Archives/edgar/data/320193... | positive | [2.291260004043579, -0.967369019985199, -2.501... | 0.055811 |
1560 | 1697 | 1999/21344_10K_1999_0000021344-00-000009.json | item_1 | ITEM 1. \nBUSINESS The Coca-Cola Company (toge... | 21344 | COCA COLA CO | 10-K | 2000-03-09 | 1999-12-31 | 2080 | DE | GA | 1231 | https://www.sec.gov/Archives/edgar/data/21344/... | None | https://www.sec.gov/Archives/edgar/data/21344/... | neutral | [2.1693193912506104, -0.7590945959091187, -2.5... | 0.030452 |
2746 | 2977 | 1999/70858_10K_1999_0000950168-00-000621.json | item_1 | Item 1. \nBUSINESS General Bank of America Cor... | 70858 | BANK OF AMERICA CORP /DE/ | 10-K | 2000-03-20 | 1999-12-31 | 6021 | DE | NC | 1231 | https://www.sec.gov/Archives/edgar/data/70858/... | None | https://www.sec.gov/Archives/edgar/data/70858/... | neutral | [2.2866339683532715, -0.997267484664917, -2.47... | 0.066551 |
3762 | 4088 | 1999/80424_10K_1999_0000080424-99-000027.json | item_1 | Item 1. \nBusiness. \n--------- General Develo... | 80424 | PROCTER & GAMBLE CO | 10-K | 1999-09-15 | 1999-06-30 | 2840 | OH | OH | 0630 | https://www.sec.gov/Archives/edgar/data/80424/... | None | https://www.sec.gov/Archives/edgar/data/80424/... | neutral | [2.303800106048584, -1.0463552474975586, -2.49... | 0.183069 |
4806 | 5211 | 1999/1018724_10K_1999_0000891020-00-000622.json | item_1 | ITEM 1. \nBUSINESS This Annual Report on Form ... | 1018724 | AMAZON COM INC | 10-K | 2000-03-29 | 1999-12-31 | 5961 | DE | WA | 1231 | https://www.sec.gov/Archives/edgar/data/101872... | None | https://www.sec.gov/Archives/edgar/data/101872... | positive | [-1.7280608415603638, 2.051792621612549, 0.995... | 0.017812 |
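The tokenizer warning in the output above (4252 > 512) indicates that the filing sections are much longer than what the model accepts in one pass. How the pipeline handles this (for example by truncating) depends on the model configuration and is not shown here. One common workaround is to split the token sequence into overlapping windows and predict on each window. The sketch below assumes a window of 256 tokens (matching `max_seq_length` above) and an overlap of 64; `chunk_tokens` is a hypothetical helper, and it ignores the special tokens a real tokenizer adds.

```python
def chunk_tokens(token_ids, max_len=256, stride=64):
    """Split token ids into windows of at most max_len, overlapping by stride."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the current window already reaches the end
    return chunks


# Stand-in ids; 4252 is the sequence length reported in the warning.
token_ids = list(range(4252))
chunks = chunk_tokens(token_ids)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Per-window predictions then need to be combined into one document-level label, for instance by averaging scores or taking a majority vote; which rule works best is an empirical question.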
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1585 entries, 1410 to 1291201
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1585 non-null int64
1 filename 1585 non-null object
2 item 1585 non-null object
3 text 1585 non-null object
4 cik 1585 non-null object
5 company 1585 non-null object
6 filing_type 1585 non-null object
7 filing_date 1585 non-null object
8 period_of_report 1585 non-null object
9 sic 1585 non-null object
10 state_of_inc 1585 non-null object
11 state_location 1585 non-null object
12 fiscal_year_end 1585 non-null object
13 filing_html_index 1585 non-null object
14 htm_filing_link 1228 non-null object
15 complete_text_filing_link 1585 non-null object
16 pred_labels 1585 non-null object
17 raw_preds 1585 non-null object
18 pred_probs 1585 non-null float64
dtypes: float64(1), int64(1), object(17)
memory usage: 247.7+ KB
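The `raw_preds` column holds one raw score (logit) per class. A standard way to turn such scores into class probabilities and a label is a softmax followed by an argmax, sketched below. The logits are hypothetical, and the assumption that the score order matches `['positive', 'neutral', 'negative']` from `model_cfg.labels` is mine; how ekorpkit derives `pred_labels` and `pred_probs` is not shown in this notebook.

```python
import math

# Assumed to follow the order of model_cfg.labels.
LABELS = ["positive", "neutral", "negative"]


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


logits = [2.0, -1.0, -2.5]  # hypothetical scores, not taken from the table
label, prob = predict_label(logits)
print(label, round(prob, 4))
```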