Skip to content

Keith's Blog

Preprocess

Preprocess

Normalize text (especially CJK character)

text = re.sub(r"\xa0", " ", text)
text = re.sub(r"\xA0", " ", text)
text = re.sub(r"\t", " ", text)

import unicodedata
text = "株式会社ＫＡＤＯＫＡＷＡ\u3000Ｆｕｔｕｒｅ\u3000Ｐｕｂｌｉｓｈｉｎｇ"
text = unicodedata.normalize("NFKC", text)
# it will become 株式会社KADOKAWA Future Publishing

For unicodedata.normalize, more explanation is here

ｱｲｳｴｵ ==(NFC)==> ｱｲｳｴｵ
ｱｲｳｴｵ ==(NFD)==> ｱｲｳｴｵ
ｱｲｳｴｵ ==(NFKC)==> アイウエオ
ｱｲｳｴｵ ==(NFKD)==> アイウエオ
パピプペポ ==(NFC)==> パピプペポ
パピプペポ ==(NFD)==> パピプペポ
パピプペポ ==(NFKC)==> パピプペポ
パピプペポ ==(NFKD)==> パピプペポ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFC)==> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFD)==> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFKC)==> パピプペポ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFKD)==> パピプペポ
ａｂｃＡＢＣ ==(NFC)==> ａｂｃＡＢＣ
ａｂｃＡＢＣ ==(NFD)==> ａｂｃＡＢＣ
ａｂｃＡＢＣ ==(NFKC)==> abcABC
ａｂｃＡＢＣ ==(NFKD)==> abcABC
１２３ ==(NFC)==> １２３
１２３ ==(NFD)==> １２３
１２３ ==(NFKC)==> 123
１２３ ==(NFKD)==> 123
＋－．～）｝ ==(NFC)==> ＋－．～）｝
＋－．～）｝ ==(NFD)==> ＋－．～）｝
＋－．～）｝ ==(NFKC)==> +-.~)}
＋－．～）｝ ==(NFKD)==> +-.~)}

Comments