This tutorial explains how to use the manifestoberta large language model to automate the annotation of text data with the manifesto codebook. It shows how to use Python to import your text dataset, download the model from Hugging Face, and clean the output into a dataset ready for further analyses. You will also find some exemplary descriptive analyses for a closer look at how well the prediction performed on your data and how the categories are distributed.
Setup
First, you need to install the Python libraries needed for running the model. Run the following code only if you have not installed these libraries yet.
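For example, from the command line (assuming pip and the standard PyPI package names for the libraries imported below; note that the docx module is provided by the python-docx package):

pip install torch transformers pandas numpy scipy seaborn matplotlib python-docx nltk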
Next, load/import the libraries.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import pandas as pd
import os
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from docx import Document
import nltk
# Download the punkt tokenizer for sentence splitting
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize
Set your working directory. It should be the directory where your data set is stored.
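For example, using the os module imported above (the path is a placeholder you need to adapt):

# Set the working directory to the folder containing your data (placeholder path)
os.chdir("/path/to/your/data")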
Importing text data
First, you need to transform your text data (e.g. a Word or PDF file) into a dataset (e.g. .csv or .xlsx), so it has to be split into sentences. Here, we take the example of the Labour Party’s manifesto for the 2024 election in the United Kingdom. The example document is in Word/docx format.
Importantly, if you use manifestos for substantial analyses, they might need more preprocessing than we do here. Our usual routines in the Manifesto Project include manually tagging which parts of the manifesto are not to be coded (such as headings, tables of contents, or text in the margins). If you have Word documents, you can perform this manually beforehand as well. Or, if you have a large set of PDF data, you can opt for a more automated approach after the sentence splitting, e.g. removing tables of contents via regular expressions or removing sentences with fewer than three words (see the sketch below).
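As a minimal sketch of such automated filtering (assuming the sentence DataFrame df with a 'text' column as created in the next step; the three-word threshold is just an example):

# Keep only sentences with at least three whitespace-separated words
df = df[df['text'].str.split().str.len() >= 3].reset_index(drop=True)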
To look at our example manifesto, Labour 2024, we first write a function that reads a Word document and extracts its text.
def read_word_file(file_path):
    doc = Document(file_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)
Next, we write a function to split the text into sentences and create a DataFrame.
def text_to_dataframe(text):
    sentences = sent_tokenize(text)
    df = pd.DataFrame({'id': range(1, len(sentences) + 1), 'text': sentences})
    return df
We can then define the file path to our text file, read it with the function defined above, and split it into a DataFrame with sentences as rows.
# Path to your Word file
file_path = '51320_2024.docx'
# Read the Word file
text = read_word_file(file_path)
# Convert text to DataFrame
df = text_to_dataframe(text)
The dataset should have at least a) a running ID variable and b) a variable containing the text to be classified, e.g. split into sentences.
Load Model from Hugging Face
There are two manifestoberta models ready for use. The sentence model classifies statements into one of the 56 substantial categories of the Handbook 4 coding scheme. The context model variant additionally incorporates the surrounding sentences of a statement, which improves classification results compared to the sentence model. The context model is superior in performance; however, its use is only advised when there is context to a text, e.g. when the texts to be classified are sentences or paragraphs split from larger documents. When the texts are standalone, such as social media posts, the sentence model is the preferred choice. Here, we download the context model from Hugging Face, because we want to classify a manifesto - so sentences do have a context: their surrounding sentences.
model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1", trust_remote_code = True)
Then, we import the AutoTokenizer class from the Hugging Face Transformers library. The tokenizer is an essential component in natural language processing as it converts raw text into a format that the model can process.
We use the from_pretrained method to load a pre-trained tokenizer specifically for the “xlm-roberta-large” model. By loading this tokenizer, we ensure that the text is tokenized in a way that is consistent with how the model was trained.
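This is the same tokenizer call used for the sentence model further below:

# Load the pre-trained tokenizer matching the model's base architecture
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")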
Define a function to get top n (here: top 3) classes and probabilities for each class
Classes, for manifestoberta, are the 56 manifesto categories. For each text in your dataset, the model calculates the probability with which it belongs to each of these categories. The “top 3 classes” are then the three categories the model predicts as most likely for your text.
def get_top_classes(logits, top_n=3):
    probabilities = torch.softmax(logits, dim=1)[0].tolist()
    classes_and_probs = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
    top_classes = dict(sorted(classes_and_probs.items(), key=lambda item: item[1], reverse=True)[:top_n])
    return top_classes
Run manifestoberta on your dataset: The context model
At first, we create two empty lists, which we gradually fill with our predictions for each text.

1. We iterate through each text (i.e. each row/sentence) with a for-loop.
2. We then construct a context string by concatenating the previous sentence, the current sentence, and the next sentence. The strip() function removes any leading or trailing whitespace from the resulting string. The tokenizer converts the sentence and context into tokens that the model can process.
3. In the next step, the tokenized inputs are passed to the model. The model processes these inputs and outputs logits, which are raw prediction scores for each possible class. The logits represent the unnormalized probabilities of each class.
4. Then, we extract the top predicted classes from the logits and format them into a comma-separated string. The same is done for the associated probabilities, which are converted into percentage strings. These formatted strings are then appended to lists that store all the predicted classes and their probabilities for later use.
predicted_classes = []
predicted_probabilities = []

# 1. Create for-loop for each sentence and context
for index, row in df.iterrows():
    sentence = row['text']  # Current sentence, name of your text variable
    previous_sentence = df.iloc[index - 1]['text'] if index > 0 else ""  # Sentence before
    next_sentence = df.iloc[index + 1]['text'] if index < len(df) - 1 else ""  # Sentence after

    # 2. Construct the context
    context = f"{previous_sentence} {sentence} {next_sentence}".strip()
    inputs = tokenizer(sentence,
                       context,
                       return_tensors="pt",
                       max_length=300,  # Change to 200 when using the sentence model
                       padding="max_length",
                       truncation=True
                       )

    # 3. Pass inputs to model
    logits = model(**inputs).logits

    # 4. Get top classes and probabilities
    top_classes = get_top_classes(logits)
    top_classes_str = ', '.join(top_classes.keys())
    top_probs_str = ', '.join([str(prob) + '%' for prob in top_classes.values()])
    predicted_classes.append(top_classes_str)
    predicted_probabilities.append(top_probs_str)
Then, we can add the predicted classes and probabilities to the DataFrame.
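A minimal sketch (the column names predicted_class and predicted_probability are our own choice, not fixed by the model):

# Attach the predictions as new columns in the DataFrame
df['predicted_class'] = predicted_classes
df['predicted_probability'] = predicted_probabilities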
Run manifestoberta on your dataset: The sentence model
If you prefer to run a sentence model, the steps would be very similar to the ones above, with slight modifications.
Loading the sentence model and a tokenizer:
model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2023-1-1")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
Define the top_n classes function in the same way as for the context model:
def get_top_classes(logits, top_n=3):
    probabilities = torch.softmax(logits, dim=1)[0].tolist()
    classes_and_probs = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
    top_classes = dict(sorted(classes_and_probs.items(), key=lambda item: item[1], reverse=True)[:top_n])
    return top_classes
And lastly, run the sentence model (including tokenizing and appending the logits as described above):
predicted_classes = []
predicted_probabilities = []
for index, row in df.iterrows():
    text_input = row['text']  # name of your text variable
    inputs = tokenizer(text_input,
                       return_tensors="pt",
                       max_length=200,
                       padding="max_length",
                       truncation=True
                       )
    logits = model(**inputs).logits

    # Get top classes and probabilities
    top_classes = get_top_classes(logits)
    top_classes_str = ', '.join(top_classes.keys())
    top_probs_str = ', '.join([str(prob) + '%' for prob in top_classes.values()])
    predicted_classes.append(top_classes_str)
    predicted_probabilities.append(top_probs_str)
Observe dataset classified by the context model
After successfully running the context model on our Labour 2024 manifesto, we can now inspect the dataset with the newly created columns:
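For example (assuming the column names chosen above):

# Show the first rows, including the new prediction columns
df.head()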