OCR with HuggingFace's LayoutLMv2 model

The models available on HuggingFace perform well, so I'd like to try applying one to OCR.

Someone has already tried it and published the code.

https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

I'll run everything on Google Colab. The steps are written out below.

Install the required libraries.

!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q pyyaml==5.1
!pip install -q torch==1.8.0+cu101 torchvision==0.9.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!python -m pip install -q 'git+https://github.com/facebookresearch/detectron2.git'
!pip install -q datasets
! sudo apt install tesseract-ocr
! pip install -q pytesseract

The code below downloads the data we need.

import requests, zipfile, io

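def download_data():
    # Download the sample archive (one RVL-CDIP example per class) and unpack it in the current directory
    url = "https://www.dropbox.com/s/kuw05qmc4uy474d/RVL_CDIP_one_example_per_class.zip?dl=1"
    r = requests.get(url)
    r.raise_for_status()  # stop early if the download failed
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()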

download_data()

Load the image to run OCR on. The last line displays the loaded image in the notebook.

from PIL import Image, ImageDraw, ImageFont

image = Image.open("/content/RVL_CDIP_one_example_per_class/resume/0000157402.tif")
image = image.convert("RGB")
image

The code below runs OCR on the image the conventional way, with Tesseract.

import pytesseract
import numpy as np

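# Run Tesseract and get one row per detected element (text, position, confidence)
ocr_df = pytesseract.image_to_data(image, output_type='data.frame')
# Drop rows with missing values (rows without recognized text) and renumber the rest
ocr_df = ocr_df.dropna().reset_index(drop=True)
# Round the float columns to integers
float_cols = ocr_df.select_dtypes('float').columns
ocr_df[float_cols] = ocr_df[float_cols].round(0).astype(int)
# Turn whitespace-only strings into NaN so they can be skipped
ocr_df = ocr_df.replace(r'^\s*$', np.nan, regex=True)
# Join the remaining words into a single string
words = ' '.join([word for word in ocr_df.text if str(word) != 'nan'])
words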

The OCR result is below.

ot uv STATEMENT OF JEAN D, GIBBONS My name 4s Jean Dickinson Gibbons. My current position is Professor of Statistics and Chairman of the Applied Statistics Program at the Graduate School of the University of Alabana, I am currently a Fellow of both the American Statistical Association and the International Statistical Institute and a menber of the Committee on National Statistics of the National, Acadeny of Scdences. I received the bachelor's and master's degrees in mathematics from Duke University and the Ph.D. degree in statistics from Virginia Polytechnic Institute and State University. My previous faculty appointments were at the University of Pennsylvania and the University of Cincinnati. I was a senior Fulbright-Hays scholar at the Indian Statistical Institute in 1973. Twas Associate Editor of The Anertcan Statistician for eight years, currently act as editortal collaborator on many statistical journals, includ~ Technometrics, and serve as a reviewer for grant proposals to the National Science Foundation, I am @ member of several professional societies and have served two terms on the Board of Directors of the American Statistical Ascocdation. My publications include four scholarly books on statistics and over 30 articles in refereed professional and learned journals in my field, T was named Outstanding Scholar in 1981 and Board of Visitors Research Professor in 1974 at the University of Alabama, My current curriculua vita ig attached to this statement. stsepotzs

Next, we run OCR with the model provided on HuggingFace.

The model used is LayoutLMv2, a Transformer-based model whose input is the page image together with the text and word positions from the OCR result.
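LayoutLMv2 expects each word position as a bounding box on a 0 to 1000 scale relative to the page size. The sketch below shows that conversion for a Tesseract-style box (left, top, width, height in pixels). The function name `normalize_box` and the sample numbers are illustrative assumptions, not taken from the linked notebook.

```python
def normalize_box(left, top, width, height, page_width, page_height):
    """Convert a pixel box (left, top, width, height) to [x0, y0, x1, y1] on a 0-1000 scale."""
    x0 = int(1000 * left / page_width)
    y0 = int(1000 * top / page_height)
    x1 = int(1000 * (left + width) / page_width)
    y1 = int(1000 * (top + height) / page_height)
    return [x0, y0, x1, y1]


# Hypothetical word box on a 762x1000 pixel page
print(normalize_box(left=76, top=100, width=152, height=20,
                    page_width=762, page_height=1000))
```

Scaling by page size makes the coordinates comparable across scans of different resolutions.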

As is common with Transformers, it is pre-trained by masking some of the tokens and predicting them.
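To make the masking idea concrete, here is a simplified sketch in plain Python. It only shows how training pairs of masked input and target labels can be built. The function name `mask_tokens`, the probability, and the seed are assumptions for illustration, and always substituting `[MASK]` is a simplification of what real pre-training code does.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with mask_token.

    Returns (masked_tokens, labels); labels[i] is the original token
    where a mask was applied and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model has to recover this token
        else:
            masked.append(token)
            labels.append(None)    # no prediction target at this position
    return masked, labels


tokens = "my current position is professor of statistics".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```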

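Because line information is available, some text lines are also covered up in the image, and for each token that was not masked the model learns whether the line it belongs to is covered.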

Finally, the paired page image is randomly swapped for a different one, and the model learns whether the image matches the input text.
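The matching task can be sketched as a way of building positive and negative pairs. This is a minimal illustration, not the actual pre-training code: `make_matching_pairs`, the `replace_prob` value, and the toy document list are all assumptions.

```python
import random


def make_matching_pairs(documents, replace_prob=0.15, seed=0):
    """Build (text, image_id, label) triples.

    documents: list of (text, image_id) tuples, each taken from one page.
    label 1 means the image belongs to the text, 0 means it was swapped.
    """
    rng = random.Random(seed)
    pairs = []
    for index, (text, image_id) in enumerate(documents):
        # Candidate images from every other document
        others = [img for j, (_, img) in enumerate(documents) if j != index]
        if others and rng.random() < replace_prob:
            pairs.append((text, rng.choice(others), 0))
        else:
            pairs.append((text, image_id, 1))
    return pairs


docs = [("statement of jean d. gibbons", "page_a.tif"),
        ("invoice number 1234", "page_b.tif"),
        ("dear sir or madam", "page_c.tif")]
for pair in make_matching_pairs(docs, replace_prob=0.5):
    print(pair)
```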

The advantage of this approach is that it is pre-trained without labels, using the IIT-CDIP dataset.

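The model can be used as a pre-trained baseline, and it has produced strong results on well-known downstream tasks (such as extracting questions and their corresponding answers from an image).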

FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.834 → 0.852), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672)
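To see the size of each gain at a glance, the scores listed above can be turned into absolute differences. The snippet uses only the numbers quoted here; the helper name `improvements` is made up for this example.

```python
# (earlier score, LayoutLMv2 score) for each task, as listed above
results = {
    "FUNSD": (0.7895, 0.8420),
    "CORD": (0.9493, 0.9601),
    "SROIE": (0.9524, 0.9781),
    "Kleister-NDA": (0.834, 0.852),
    "RVL-CDIP": (0.9443, 0.9564),
    "DocVQA": (0.7295, 0.8672),
}


def improvements(scores):
    """Return {task: absolute gain}, rounded to 4 decimal places."""
    return {task: round(new - old, 4) for task, (old, new) in scores.items()}


# Print the tasks from largest to smallest gain
for task, gain in sorted(improvements(results).items(),
                         key=lambda item: item[1], reverse=True):
    print(f"{task}: +{gain:.4f}")
```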

from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor

feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

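The code below passes the image through the processor and lets you check the OCR result.

encoded_inputs = processor(image, return_tensors="pt")  # run OCR on the image and tokenize the words
processor.tokenizer.decode(encoded_inputs.input_ids.squeeze().tolist())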

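The OCR result is below. Note that the feature extractor inside the processor runs Tesseract by default (apply_ocr=True), so this is the same text as the earlier Tesseract result, lowercased and split into tokens.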

[CLS] ot uv statement of jean d, gibbons my name 4s jean dickinson gibbons. my current position is professor of statistics and chairman of the applied statistics program at the graduate school of the university of alabana, i am currently a fellow of both the american statistical association and the international statistical institute and a menber of the committee on national statistics of the national, acadeny of scdences. i received the bachelor's and master's degrees in mathematics from duke university and the ph. d. degree in statistics from virginia polytechnic institute and state university. my previous faculty appointments were at the university of pennsylvania and the university of cincinnati. i was a senior fulbright - hays scholar at the indian statistical institute in 1973. twas associate editor of the anertcan statistician for eight years, currently act as editortal collaborator on many statistical journals, includ ~ technometrics, and serve as a reviewer for grant proposals to the national science foundation, i am @ member of several professional societies and have served two terms on the board of directors of the american statistical ascocdation. my publications include four scholarly books on statistics and over 30 articles in refereed professional and learned journals in my field, t was named outstanding scholar in 1981 and board of visitors research professor in 1974 at the university of alabama, my current curriculua vita ig attached to this statement. stsepotzs [SEP]
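The two results can also be compared programmatically instead of by eye. The sketch below uses only the standard library. The helper names `normalize` and `similarity` are assumptions, and the two sample strings are short excerpts from the outputs shown in this post.

```python
import difflib
import re


def normalize(text):
    """Drop [CLS]/[SEP] markers, lowercase, and remove all whitespace."""
    text = re.sub(r"\[(CLS|SEP)\]", "", text)
    return re.sub(r"\s+", "", text.lower())


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two OCR outputs."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b),
                                      autojunk=False)
    return matcher.ratio()


tesseract_text = "I was a senior Fulbright-Hays scholar"
decoded_text = "[CLS] i was a senior fulbright - hays scholar [SEP]"
print(similarity(tesseract_text, decoded_text))
```

Whitespace is removed before comparing because the tokenizer's decode step inserts spaces around punctuation such as hyphens.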

The benefit of using a model provided by HuggingFace is that you can train it yourself, which lets you improve accuracy or specialize it for a particular domain.
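A first step toward that kind of training is turning class names into integer labels. The sketch below assumes the downloaded sample data keeps one subfolder per class (as the path .../RVL_CDIP_one_example_per_class/resume/... suggests). The function name `build_label_maps` is hypothetical, and the training loop itself is left to the linked notebook.

```python
from pathlib import Path


def build_label_maps(dataset_dir):
    """Map each class subfolder name to an integer id, and back again."""
    labels = sorted(p.name for p in Path(dataset_dir).iterdir() if p.is_dir())
    label2id = {label: index for index, label in enumerate(labels)}
    id2label = {index: label for label, index in label2id.items()}
    return label2id, id2label


# Only runs when the sample data has already been extracted here
data_dir = Path("RVL_CDIP_one_example_per_class")
if data_dir.is_dir():
    label2id, id2label = build_label_maps(data_dir)
    print(label2id)
```

Sorting the folder names keeps the mapping stable between runs, which matters when a saved model is loaded again later.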

That was easy to try. I'd like to try it on Japanese too.

If you prepare the data, you can.
