Trying Hugging Face pipelines: sentiment analysis, language modeling, named entity recognition, summarization, and translation

I want an easy way to run various natural language processing tasks.

Hugging Face pipelines make that easy.

Environment Setup

We will set up the environment on Google Colab.

Documentation on Hugging Face pipelines is available here:

https://huggingface.co/transformers/main_classes/pipelines.html

The tasks that pipelines can run are listed below.

(Figure: list of Hugging Face pipeline tasks)

First, install the required library.

! pip install transformers
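After installation, a quick import check confirms the library is available (the version number will vary with your environment):

```python
import transformers

# Print the installed version to confirm the install succeeded
print(transformers.__version__)
```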

Sentiment Analysis

Create a sentiment-analysis pipeline.

According to the page below, it can classify text as positive or negative.

https://huggingface.co/transformers/task_summary.html?highlight=sentiment%20analysis#sequence-classification

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Sentiment analysis pipeline
sentiment = pipeline('sentiment-analysis')

Prepare a positive sentence and a negative sentence and analyze them.

result = sentiment("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

result = sentiment("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

The sentiment analysis results look like this:

The negative sentence is classified as negative and the positive sentence as positive.

label: NEGATIVE, with score: 0.9991
label: POSITIVE, with score: 0.9999
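As a side note, a pipeline also accepts a list of sentences and returns one result dict per input, which avoids calling it in a loop. A minimal sketch using the same `sentiment-analysis` pipeline:

```python
from transformers import pipeline

# The pipeline accepts a list of inputs and returns a list of results
sentiment = pipeline("sentiment-analysis")
results = sentiment(["I hate you", "I love you"])
for r in results:
    print(f"label: {r['label']}, score: {round(r['score'], 4)}")
```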

Question Answering

This task extracts the answer to a question from a context passage.

https://huggingface.co/transformers/task_summary.html?highlight=named#extractive-question-answering

Create a question-answering pipeline and prepare the context passage from which answers will be extracted.

nlp = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
"""

Pass a question and the context to the pipeline we created.

We print the answer, its confidence score, and the character positions in the text it was extracted from.

result = nlp(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

result = nlp(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

We can confirm that appropriate answers are extracted.

Answer: 'the task of extracting an answer from a text given a question', score: 0.6226, start: 34, end: 95
Answer: 'SQuAD dataset', score: 0.5053, start: 147, end: 160
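As a sanity check, `start` and `end` are character offsets into the context, so slicing the context with them should reproduce the answer string. A minimal sketch (a shortened context is used here for brevity):

```python
from transformers import pipeline

nlp = pipeline("question-answering")

context = (
    "Extractive Question Answering is the task of extracting an answer from a text "
    "given a question. An example of a question answering dataset is the SQuAD dataset."
)

result = nlp(question="What is a good example of a question answering dataset?", context=context)
# The returned answer should equal the span sliced out by start/end
span = context[result["start"]:result["end"]]
print(span, "==", result["answer"])
```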

Language Modeling

Masked Language Modeling

This task masks some of the tokens and predicts what the masked tokens are.

https://huggingface.co/transformers/task_summary.html?highlight=named#masked-language-modeling

First, create the pipeline.

nlp = pipeline("fill-mask")

`nlp.tokenizer.mask_token` specifies the token to be masked.

Feed the pipeline a sentence containing the mask token.

for element in nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."):
    print(element)

The output looks like this:

`score` is the confidence of the predicted token, `token_str` is the predicted token, and `token` is the token's ID.

{'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.', 'score': 0.17927460372447968, 'token': 3944, 'token_str': ' tool'}
{'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.', 'score': 0.1134939044713974, 'token': 7208, 'token_str': ' framework'}
{'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.', 'score': 0.05243545398116112, 'token': 5560, 'token_str': ' library'}
{'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.', 'score': 0.03493543714284897, 'token': 8503, 'token_str': ' database'}
{'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.', 'score': 0.02860247902572155, 'token': 17715, 'token_str': ' prototype'}
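The number of candidates returned can be limited with the `top_k` argument (named `topk` in some older transformers versions; the default is 5). A sketch:

```python
from transformers import pipeline

fill = pipeline("fill-mask")
# top_k limits how many candidate tokens are returned
preds = fill(
    f"HuggingFace is creating a {fill.tokenizer.mask_token} that the community uses to solve NLP tasks.",
    top_k=2,
)
for p in preds:
    print(p["token_str"], round(p["score"], 3))
```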

Causal Language Modeling

Use this type of model for text generation tasks.

For example, given an input sentence like:

Hugging Face is based in

the model predicts the next tokens, producing something like:

Hugging Face is based in Tokyo

https://huggingface.co/transformers/task_summary.html?highlight=named#causal-language-modeling

Create the pipeline and feed it a prompt to predict the continuation. `max_length` specifies the maximum length of the generated sequence.

Setting `do_sample=True` samples from the predicted probability distribution, which adds variety to the output.

from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=15, do_sample=False))

We get a prediction like this:

[{'generated_text': 'As far as I am concerned, I will be the first to admit that'}]

Set `do_sample=True` and check how the output varies.

for i in range(3):
    print(text_generator("As far as I am concerned, I will", max_length=15, do_sample=True))

Multiple different outputs are produced:

[{'generated_text': 'As far as I am concerned, I will just say that I was on'}]
[{'generated_text': 'As far as I am concerned, I will only be putting up with it'}]
[{'generated_text': 'As far as I am concerned, I will not lose that game.\n'}]
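Instead of looping, `num_return_sequences` can request several samples in a single call (combined with `do_sample=True`). A sketch:

```python
from transformers import pipeline

text_generator = pipeline("text-generation")
# num_return_sequences draws several independent samples in one call
outputs = text_generator(
    "As far as I am concerned, I will",
    max_length=15,
    do_sample=True,
    num_return_sequences=3,
)
for out in outputs:
    print(out["generated_text"])
```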

Named Entity Recognition

Named entity recognition is useful for extracting entities such as product names and manufacturers from text.

For example, given a sentence like:

I browse the web on my iPhone 10 during my commute.

extracting "iPhone 10" as a product name yields useful information for marketing.

Create a named entity recognition pipeline. The model used here can extract the following entity types:

  • O, Outside of a named entity
  • B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity
  • I-MIS, Miscellaneous entity
  • B-PER, Beginning of a person’s name right after another person’s name
  • I-PER, Person’s name
  • B-ORG, Beginning of an organisation right after another organisation
  • I-ORG, Organisation
  • B-LOC, Beginning of a location right after another location
  • I-LOC, Location

https://huggingface.co/transformers/task_summary.html?highlight=named#named-entity-recognition

You can also build the pipeline by passing in a model and tokenizer explicitly.

# Named entity recognition pipeline, passing in a specific model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
ner = pipeline('ner', model=model, tokenizer=tokenizer)

Run named entity recognition. Here, ARTICLE is the news text defined in the Summarization section below.

for each_ner in ner(ARTICLE):
    print(each_ner)

Let's look at part of the output.

"New York" is recognized as I-LOC, a location.

"CNN" is recognized as I-ORG, an organization.

{'word': 'New', 'score': 0.9959530830383301, 'entity': 'I-LOC', 'index': 1, 'start': 1, 'end': 4}
{'word': 'York', 'score': 0.9968130588531494, 'entity': 'I-LOC', 'index': 2, 'start': 5, 'end': 9}
{'word': 'CNN', 'score': 0.9826537370681763, 'entity': 'I-ORG', 'index': 4, 'start': 11, 'end': 14}
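The pipeline tags individual tokens, so a multi-word entity like "New York" comes back as separate rows. A minimal sketch of merging adjacent tokens that share a label into whole mentions, using the token-level output shown above as input (recent transformers versions can also do this for you via the pipeline's aggregation options):

```python
# Token-level NER output, as printed above (scores omitted for brevity)
tokens = [
    {"word": "New", "entity": "I-LOC", "start": 1, "end": 4},
    {"word": "York", "entity": "I-LOC", "start": 5, "end": 9},
    {"word": "CNN", "entity": "I-ORG", "start": 11, "end": 14},
]

entities = []
for tok in tokens:
    # Extend the previous mention when the label matches and the spans are adjacent
    if entities and entities[-1]["entity"] == tok["entity"] and tok["start"] <= entities[-1]["end"] + 1:
        entities[-1]["word"] += " " + tok["word"]
        entities[-1]["end"] = tok["end"]
    else:
        entities.append(dict(tok))

print(entities)
```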

Summarization

Create a summarization pipeline.

https://huggingface.co/transformers/task_summary.html?highlight=named#summarization

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
 A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
 Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
 In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
 Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
 2010 marriage license application, according to court documents.
 Prosecutors said the marriages were part of an immigration scam.
 On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
 After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
 Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
 All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
 Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
 Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
 The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
 Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
 Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
 If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
 """

Feed in the source text to generate a summary.

print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

A summary like the following is produced:

[{'summary_text': ' In total, Barrientos has been married to 10 men since 1999 . 
She faces two counts of "offering a false instrument for filing in the first degree" 
Prosecutors say the marriages were part of an immigration scam .'}]

Translation

Create a translation pipeline and translate an English sentence into German.

translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

The output looks like this:

[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
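The `translation_en_to_de` task loads a default checkpoint, but a dedicated translation model can also be passed explicitly. A sketch assuming the Helsinki-NLP OPUS-MT English-to-German checkpoint is available on the Hub:

```python
from transformers import pipeline

# Load an explicit translation model instead of the task default
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
result = translator("Hugging Face is a technology company based in New York and Paris", max_length=40)
print(result[0]["translation_text"])
```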

Pipelines make it easy to run many different language tasks.

Japanese support is still incomplete in places, but it is easy to try things out.

If you want to run positive/negative sentiment analysis on Japanese data with Hugging Face, see below.
