The eKorpkit Corpus

The eKorpkit Corpus is a large, diverse, bilingual (Korean and English) language modelling dataset.

ekorpkit corpus

| Name | Language | Size | Weight | # Docs | # Sents | # Words |
|---|---|---|---|---|---|---|
| mc4_ko | ko | 90.76 GiB | 20.22% | 15,618,718 | 665,858,888 | 8,007,674,274 |
| courtlistener | en | 47.92 GiB | 10.68% | 3,489,298 | 335,079,871 | 8,324,277,457 |
| pmc_comm | en | 45.26 GiB | 10.08% | 51,276,102 | 297,884,818 | 7,365,607,900 |
| edgar | en | 36.94 GiB | 8.23% | 213,376 | 177,270,203 | 6,053,677,897 |
| c4_realnewslike | en | 33.79 GiB | 7.53% | 13,813,090 | 155,883,681 | 6,040,207,703 |
| pubmed | en | 27.51 GiB | 6.13% | 22,498,747 | 190,907,356 | 4,281,121,705 |
| bigpatent | en | 22.46 GiB | 5.00% | 1,244,053 | 2,488,106 | 4,613,882,925 |
| aihub_formal1 | ko | 19.16 GiB | 4.27% | 1,073,944 | 93,148,022 | 1,993,574,713 |
| enwiki | en | 13.85 GiB | 3.09% | 6,200,658 | 129,066,417 | 2,400,717,561 |
| pmc_noncomm | en | 11.88 GiB | 2.65% | 14,142,294 | 79,748,279 | 1,923,415,913 |
| kcbert | ko | 11.45 GiB | 2.55% | 82,990,213 | 82,990,213 | 1,088,177,367 |
| nikl_news | ko | 11.19 GiB | 2.49% | 4,104,534 | 42,527,395 | 1,138,897,337 |
| oscar_ko | ko | 11.05 GiB | 2.46% | 3,673,262 | 61,833,262 | 1,122,638,494 |
| aida_paper | ko | 8.77 GiB | 1.95% | 481,389 | 38,808,105 | 1,025,422,060 |
| kcc | ko | 6.80 GiB | 1.51% | 46,529,987 | 46,529,987 | 703,222,627 |
| nikl_written | ko | 6.45 GiB | 1.44% | 20,128 | 27,231,846 | 679,547,033 |
| namuwiki | ko | 6.43 GiB | 1.43% | 571,026 | 67,315,244 | 691,537,393 |
| aihub_patent1 | ko | 6.40 GiB | 1.42% | 155,939 | 29,206,198 | 673,134,598 |
| earnings_call | en | 6.30 GiB | 1.40% | 159,380 | 32,391,491 | 1,160,525,933 |
| sec_report | ko | 4.70 GiB | 1.05% | 817,040 | 32,644,657 | 495,245,547 |
| hacker_news | en | 3.80 GiB | 0.85% | 818,299 | 41,573,998 | 662,524,112 |
| philpapers | en | 2.19 GiB | 0.49% | 31,016 | 139,518 | 365,576,851 |
| nih_exporter | en | 2.10 GiB | 0.47% | 1,017,230 | 13,540,126 | 326,974,102 |
| bigkinds | ko | 1.99 GiB | 0.44% | 871,304 | 7,759,115 | 197,746,184 |
| youtube_subtitles | en | 1.61 GiB | 0.36% | 150,749 | 16,074,289 | 303,286,377 |
| respec | en | 1.08 GiB | 0.24% | 1,119,640 | 7,083,257 | 169,590,880 |
| nikl_spoken | ko | 1002.49 MiB | 0.22% | 25,614 | 19,042,013 | 116,067,432 |
| kowiki | ko | 715.39 MiB | 0.16% | 563,959 | 5,671,388 | 70,263,451 |
| us_equities_news | en | 714.16 MiB | 0.16% | 220,976 | 1,834,664 | 131,179,752 |
| aihub_law_case | ko | 689.96 MiB | 0.15% | 77,202 | 1,095,140 | 66,686,761 |
| aihub_formal2 | ko | 650.03 MiB | 0.14% | 95,990 | 1,650,141 | 64,523,191 |
| gd_review | en | 642.76 MiB | 0.14% | 1,929,910 | 6,733,680 | 112,977,678 |
| aihub_patent2 | ko | 457.18 MiB | 0.10% | 147,674 | 1,879,909 | 46,045,036 |
| enron_mail | en | 428.36 MiB | 0.09% | 247,586 | 7,908,959 | 65,258,456 |
| aihub_paper | ko | 370.11 MiB | 0.08% | 98,344 | 1,802,883 | 35,556,261 |
| kaist | ko | 304.92 MiB | 0.07% | 11,157 | 1,926,901 | 30,929,508 |
| reuters_financial | en | 288.63 MiB | 0.06% | 101,055 | 1,983,069 | 49,495,061 |
| aihub_book | ko | 236.66 MiB | 0.05% | 180,001 | 1,201,956 | 23,052,720 |
| aihub_koen_formal | ko | 206.37 MiB | 0.04% | 1,350,000 | 1,350,000 | 20,659,619 |
| aihub_koen_ssci | ko | 186.49 MiB | 0.04% | 1,361,845 | 1,361,845 | 19,104,237 |
| aihub_koen_sci | ko | 164.42 MiB | 0.04% | 1,344,631 | 1,344,631 | 17,720,448 |
| fomc | en | 112.66 MiB | 0.02% | 2,822 | 950,620 | 18,640,148 |
| esg_report | ko | 24.17 MiB | 0.01% | 15,561 | 119,031 | 2,488,545 |
| aihub_law_kb | ko | 9.99 MiB | 0.00% | 17,373 | 46,140 | 934,632 |
| bok_minutes | ko | 9.54 MiB | 0.00% | 163 | 33,027 | 918,203 |
| pathobook | en | 4.28 MiB | 0.00% | 28 | 33,603 | 648,221 |
| English | en | 258.83 GiB | 57.66% | | | |
| Korean | ko | 190.04 GiB | 42.34% | | | |
| Total | | 448.87 GiB | 100.00% | | | |
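
The Weight column appears to be each corpus's share of the total size (448.87 GiB). The source does not state this definition, so treat it as an assumption; it does reproduce the listed percentages. The following minimal sketch checks the arithmetic for a few rows copied from the table. The helper names `to_gib` and `weight_percent` are hypothetical and are not part of the ekorpkit API.

```python
# Assumption: Weight = size of the corpus / total size, as a percentage.
# Helper names are illustrative only; they are not ekorpkit functions.

MIB_PER_GIB = 1024


def to_gib(size: str) -> float:
    """Convert a size string such as '90.76 GiB' or '1002.49 MiB' to GiB."""
    value_text, unit = size.split()
    value = float(value_text.replace(",", ""))
    if unit == "GiB":
        return value
    if unit == "MiB":
        return value / MIB_PER_GIB
    raise ValueError(f"unsupported unit: {unit}")


def weight_percent(size: str, total: str) -> float:
    """Return the share of `total` taken by `size`, rounded to two decimals."""
    return round(100 * to_gib(size) / to_gib(total), 2)


TOTAL = "448.87 GiB"

# A few rows copied verbatim from the table above.
sample_sizes = {
    "mc4_ko": "90.76 GiB",
    "courtlistener": "47.92 GiB",
    "nikl_spoken": "1002.49 MiB",
    "English": "258.83 GiB",
    "Korean": "190.04 GiB",
}

weights = {name: weight_percent(size, TOTAL) for name, size in sample_sizes.items()}

for name, weight in weights.items():
    print(f"{name:15s} {weight:6.2f}%")
```

Because the sizes in the table are already rounded to two decimals, recomputing a weight this way can differ from the listed value in the last digit for some rows.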