CaCl2

March 2nd, 2021

Introduction

What is CaCl2?

CaCl2 (Chinese Lexicon V2; Simplified Chinese: CA中文语言词库) originates from a Chinese natural language processing (NLP) research project sponsored by a Chinese company.
The CaCl2 project is an important part of the CaOCl (Open Chinese Lexical Analyzer) project.

How does CaCl2 work?

CaCl2 analyses large volumes of existing textual data obtained from the Internet, reformats the data into a massive set of entries, and catalogues and classifies the entries according to the financial industry classification standard [see Reference 1].
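
The cataloguing step can be pictured with a small sketch. It assumes that each entry carries a six-digit industry code like the ones listed in the release tables of this document (480000, 480100, 490200), and that the first two digits identify the class-1 industry. That rule is inferred from those codes and is not stated by the project; the function names and sample entries are hypothetical.

```python
from collections import defaultdict


def class1_code(code):
    """Return the class-1 parent of a six-digit industry code.

    Assumption: the first two digits identify the class-1 industry,
    e.g. 480100 (Banking-Bank) belongs to 480000 (Banking).
    """
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a six-digit industry code, got {code!r}")
    return code[:2] + "0000"


def group_by_class1(entries):
    """Group (word, industry_code) pairs by their class-1 industry code."""
    groups = defaultdict(list)
    for word, code in entries:
        groups[class1_code(code)].append(word)
    return dict(groups)


# Hypothetical sample entries, not taken from the real dictionaries.
sample_entries = [
    ("存款准备金", "480100"),
    ("市盈率", "490100"),
    ("保单", "490200"),
    ("利率", "480000"),
]
grouped = group_by_class1(sample_entries)
```

Here grouped holds two keys, 480000 and 490000, each mapped to the words classified under that class-1 industry.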

What can the CaCl2 lexicon do?

In natural language processing (NLP) tasks, the CaCl2 lexicon helps break language down into shorter, elemental pieces (i.e. tokenization).
The CaCl2 lexicon can also be used for higher-level NLP tasks such as word segmentation, document summarization, contextual extraction and content categorization.
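
To illustrate how a lexicon drives tokenization, here is a minimal sketch of forward maximum matching, a simple dictionary-based segmentation technique. It is not how jieba or any CaCl2 tooling works internally; the tiny lexicon and the function name are made up for the example.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Segment text greedily, taking the longest lexicon entry at each position."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until it is in the lexicon;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini lexicon standing in for a real CaCl2 dictionary.
lexicon = {"商业银行", "存款", "准备金", "存款准备金", "利率"}
tokens = forward_max_match("商业银行上调存款准备金利率", lexicon)
# tokens == ["商业银行", "上", "调", "存款准备金", "利率"]
```

Because the industry term 存款准备金 is in the lexicon, it is kept as one token instead of being split into 存款 and 准备金. Words missing from the lexicon, such as 上调, fall back to single characters, which is why broad coverage matters.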

What does CaCl2 aim at?

The CaCl2 project aims to build a consistent, complete and accurate collection of industrial lexicons (dictionaries) for the Internet. We make our best effort to achieve high data integrity and to provide a firm foundation for Chinese NLP work, so that users can devote more attention to their business and research.

Statistics

Entries

Date        All         Candidate  Released   Preview
2021-02-01  21,000,000  3,000,000  2,553,806  280,000

Dictionaries

Date        Class    Industries  Released  Preview  Closing
2021-02-01  Class-1  28          2         26       0
2021-02-01  Class-2  104         5         99       0

**For detailed statistics, please refer to Statistics

Getting Started

1. Clone cacl2 or download dictionaries from GitHub

Clone cacl2

git clone https://github.com/limccn/cacl2.git

Or download a dictionary

wget https://github.com/limccn/cacl2/raw/master/archive/v0.2/[put dictionary code here].zip
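
The same download can be scripted. The sketch below builds the archive URL from a dictionary code and optionally fetches it. It assumes that the archives follow the path pattern shown in the wget command and that the /raw/ form of the GitHub URL serves the file itself (the /blob/ form is an HTML view). The function names are hypothetical.

```python
import os
import urllib.request

# Assumption: "/raw/" serves the file itself, "/blob/" is GitHub's HTML view.
ARCHIVE_URL = "https://github.com/limccn/cacl2/raw/master/archive"


def dictionary_url(code, version="v0.2"):
    """Build the download URL for one dictionary archive."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a six-digit dictionary code, got {code!r}")
    return f"{ARCHIVE_URL}/{version}/{code}.zip"


def download_dictionary(code, dest_dir=".", version="v0.2"):
    """Fetch one archive into dest_dir and return the local path (needs network access)."""
    target = os.path.join(dest_dir, f"{code}.zip")
    urllib.request.urlretrieve(dictionary_url(code, version), target)
    return target


url = dictionary_url("480000")
```

Calling download_dictionary("480000") would then save 480000.zip in the current directory.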

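2. Import dictionaries into your project or research environment

CaCl2 dictionaries are well formatted and can be used with many lexical tools. The two snippets below load a dictionary with jieba in Python and register dictionaries through an ext_dict entry in an XML properties file.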

import os
import jieba
# BASE_PATH_TO_DICT is the directory that holds the unzipped dictionary files
dict_name = '480000.txt'
jieba.load_userdict(os.path.join(BASE_PATH_TO_DICT, dict_name))

<properties>   
  <entry key="ext_dict">480000.txt;480100.txt;</entry>  
</properties> 
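
When several dictionaries are used together, it can be convenient to merge them into one file first. The sketch below does that with the standard library only. It assumes one entry per line with the word in the first whitespace-separated field (the layout jieba accepts for user dictionaries); the actual CaCl2 file layout should be checked before relying on it. The function name and sample contents are hypothetical.

```python
import tempfile
from pathlib import Path


def merge_dictionaries(paths, out_path):
    """Merge dictionary files into one, keeping the first line seen for each word."""
    seen = set()
    kept = []
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            word = fields[0]
            if word not in seen:
                seen.add(word)
                kept.append(line.strip())
    Path(out_path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return len(kept)


# Demonstration with two small stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "480000.txt")
    second = Path(tmp, "480100.txt")
    first.write_text("存款\n利率\n", encoding="utf-8")
    second.write_text("利率\n存款准备金 5 n\n\n", encoding="utf-8")
    merged = Path(tmp, "merged.txt")
    count = merge_dictionaries([first, second], merged)
    merged_lines = merged.read_text(encoding="utf-8").splitlines()
```

The merged file can then be passed to jieba.load_userdict in place of a single dictionary.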

3. Run, test and enjoy CaCl2!

Open-Source Schedule

Released

Code    Name                   Entries  Date     Version  Format  Download
480000  Banking-Common         40612    2021-02  v0.2     txt     480000.zip
480100  Banking-Bank           224433   2021-02  v0.2     txt     480100.zip
490000  Financials-Common      341235   2021-02  v0.2     txt     490000.zip
490100  Financials-Securities  311121   2021-02  v0.2     txt     490100.zip
490200  Financials-Insurance   31020    2021-02  v0.2     txt     480200.zip

Scheduled Release

Code    Name               Entries  Schedule  Version  Format  Download
490300  Financials-Others  10,000   2Q 2021   v0.2     txt     490300.zip

Technical Preview

Before a dictionary is finally released, we publish a technical preview dictionary containing 10,000 entries for every class-1 industry.
If you need further information about all entries, please refer to Statistics.

**For the original raw data, please refer to /dicts
**For detailed class-1 and class-2 industry dictionaries, please refer to Statistics

Comparison and Test

1. Comparison

(@CaoWJ)Dictionary

Compare Lexicon

(@CaoWJ)

Word segmentation

(@CaoWJ)

Document summarization

2. Test and Score

2.1 Industrial test dataset

Word segmentation tests on textual data from different industries

2.1.1 Word segmentation using the financial industry (banking only) dictionary

Financial industry (banking only) word segmentation

2.1.2 Word segmentation using the financial industry (excluding banking) dictionary

Financial industry (excluding banking) word segmentation

2.2 Standard test dataset

Word segmentation tests using standard Chinese test datasets

Score for CTB5

Score for ICWB
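
This document does not state which metric the scores use. A common convention for word segmentation benchmarks is word-level precision, recall and F1, where a predicted word counts as correct only if both of its boundaries match the gold standard. The sketch below computes those values under that assumption; the function names and the sample sentence are hypothetical.

```python
def to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold, predicted):
    """Return word-level (precision, recall, F1) for one segmented sentence."""
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same text")
    gold_spans = to_spans(gold)
    pred_spans = to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted segmentations of the same sentence.
gold = ["商业银行", "上调", "存款准备金", "利率"]
predicted = ["商业银行", "上", "调", "存款准备金", "利率"]
precision, recall, f1 = segmentation_prf(gold, predicted)
# 3 of 5 predicted words and 3 of 4 gold words match: P = 0.6, R = 0.75
```

For a whole test set, the span counts are usually summed over all sentences before the ratios are computed, rather than averaging per-sentence scores.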

History and changelogs

1. Regular releases

Version  Date  Changelogs
0.2      2021  Latest
0.1.1    2020  Catalogued and classified all entries into 28 class-1 industries and 240 class-2 industries
0.1      2019  First released version; contains over 20 million entries, with data mainly obtained from Baidu Baike and Wikipedia

2. Monthly/Quarterly releases

Version     Cycle    Date        Changelogs
v0.2.21.01  monthly  2021-02-01  Release: banking and financials dictionaries
v0.2.20.12  monthly  2021-01-01  v0.2 initial version

FAQ

Disclosure

CaCl2 and its data come from information published on the Internet. CaCl2 does not guarantee the integrity or correctness of the data, and nothing in CaCl2 constitutes investment advice.
As contributors, we hold no positions in any stocks mentioned, and we have no business relationship with any company whose stock is mentioned in this article.

Reference

1. Industry Classification Standard of SWSI, 2014
