Automated Coding with Manifestoberta

Denise Al-Gaddooa & Felicia Riethmüller

2024-08-15

This tutorial explains how to use the manifestoberta large language model to automate the annotation of text data with the manifesto codebook. It shows how to use Python to import your text dataset, download the model from Hugging Face, and clean the output into a dataset ready for further analysis. It also includes some example descriptive analyses for a closer look at how well the prediction performed on your data and how the categories are distributed.

Setup

First, you need to install the python libraries needed for running the model. Run this code only if you have not installed these libraries yet.

pip install pandas openpyxl transformers torch numpy scipy seaborn matplotlib python-docx nltk

Next, load/import the libraries.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import pandas as pd
import os
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from docx import Document
import nltk

# Download the punkt tokenizer for sentence splitting
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

Set your working directory. It should be the directory where your data set is stored.

os.chdir(r"your working directory")

Importing text data

First, you need to transform your text data (e.g. a Word or PDF file) into a dataset (e.g. .csv or .xlsx), which means splitting it into sentences. Here, we take the example of the Labour Party’s manifesto for the 2024 election in the United Kingdom. The example document is in Word (.docx) format.

Importantly, if you use manifestos for substantive analyses, they might need more preprocessing than we do here. Our usual routines in the Manifesto Project include manually tagging the parts of the manifesto that are not to be coded (such as headings, tables of contents, and text in the margins). If you have Word documents, you can do this manually beforehand as well. If you have a large set of PDF files, you can instead opt for a more automated approach after the sentence splitting, e.g. removing tables of contents via regular expressions or removing sentences with fewer than three words.
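The automated approach mentioned above could look roughly like the following sketch. The function name keep_sentence, the regular expression, and the toy sentences are illustrative assumptions and not part of the Manifesto Project routines; the pattern only catches simple table-of-contents lines that end in a page number and will need adjusting for your own documents.

```python
import re

# Hypothetical pattern for table-of-contents lines such as
# "Introduction ..... 3" or "Kickstart economic growth 12":
# no sentence punctuation, then dots/spaces, then a 1-3 digit page number.
TOC_PATTERN = re.compile(r"^[^.!?]{1,80}[\s.]+\d{1,3}$")

def keep_sentence(text, min_words=3):
    """Return True if the sentence should stay in the dataset."""
    stripped = text.strip()
    # Drop very short fragments (headings, stray labels)
    if len(stripped.split()) < min_words:
        return False
    # Drop lines that look like table-of-contents entries
    if TOC_PATTERN.match(stripped):
        return False
    return True

# Made-up toy sentences, only for illustration
sentences = [
    "Contents",
    "Kickstart economic growth 12",
    "Introduction ..... 3",
    "We will build 300 new schools.",
]
kept = [s for s in sentences if keep_sentence(s)]
print(kept)
```

On a DataFrame with a text column, the same filter could be applied with df[df['text'].map(keep_sentence)].reset_index(drop=True). Resetting the index matters if later steps address rows by position.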

For looking at our example manifesto of Labour 2024, we first write a function to read a Word document and extract text.

def read_word_file(file_path):
    doc = Document(file_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return '\n'.join(full_text)

Next, we write a function to split the text into sentences and create a DataFrame.

def text_to_dataframe(text):
    sentences = sent_tokenize(text)
    df = pd.DataFrame({'id': range(1, len(sentences) + 1), 'text': sentences})
    return df

We can then define the path to our Word file, read it with the function defined above, and split it into a DataFrame with sentences as rows.

# Path to your Word file
file_path = '51320_2024.docx'

# Read the Word file
text = read_word_file(file_path)

# Convert text to DataFrame
df = text_to_dataframe(text)

The dataset should have at least a) a running ID variable and b) a variable containing the text to be classified, e.g. split into sentences.
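If you bring your own dataset instead of the Word example, a quick check of these two requirements can save a failed model run later. The following is a minimal sketch; the helper check_dataset and the toy data are assumptions made for illustration, and the column names id and text simply mirror the ones used in this tutorial.

```python
import pandas as pd

def check_dataset(df, id_col="id", text_col="text"):
    """Return a list of problems found; an empty list means the data looks usable."""
    problems = []
    for col in (id_col, text_col):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems
    if not df[id_col].is_unique:
        problems.append("ids are not unique")
    texts = df[text_col]
    # Missing values or whitespace-only strings cannot be classified meaningfully
    if texts.isna().any() or (texts.astype(str).str.strip() == "").any():
        problems.append("empty or missing texts")
    return problems

# Made-up toy data with one whitespace-only text
toy = pd.DataFrame({"id": [1, 2, 3],
                    "text": ["First sentence.", "  ", "Third sentence."]})
problems = check_dataset(toy)
print(problems)
```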

Load Model from Huggingface

There are two manifestoberta models ready for use. The sentence model classifies statements into one of the 56 substantive categories available in the Handbook 4 coding scheme. The context model additionally incorporates the surrounding sentences of a statement to improve the classification results compared to the sentence model. The context model performs better, but its use is only advised when the text has context, e.g. when the texts to be classified are sentences or paragraphs split from larger texts. When the texts are standalone, such as social media posts, the sentence model is the preferred choice. Here, we download the context model from Hugging Face because we want to classify a manifesto, where each sentence does have a context: its surrounding sentences.

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1", trust_remote_code = True)

Then, we import the AutoTokenizer class from the Hugging Face Transformers library. The tokenizer is an essential component in natural language processing as it converts raw text into a format that the model can process.

We use the from_pretrained method to load a pre-trained tokenizer specifically for the “xlm-roberta-large” model. By loading this tokenizer, we ensure that the text is tokenized in a way that is consistent with how the model was trained.

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large", clean_up_tokenization_spaces=True)

Define a function to get top n (here: top 3) classes and probabilities for each class

Classes, for manifestoberta, are the 56 manifesto categories. For each text in your dataset, the model will calculate a probability with which it can be assigned to each of these categories. The “top 3 classes” then are the three most likely categories the model predicts for your text.

def get_top_classes(logits, top_n=3):
    probabilities = torch.softmax(logits, dim=1)[0].tolist()
    classes_and_probs = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
    top_classes = dict(sorted(classes_and_probs.items(), key=lambda item: item[1], reverse=True)[:top_n])
    return top_classes
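To see what the function does without loading the model, the same steps can be traced by hand on a toy example. This is only a sketch: the four labels and the logits are made up (the real model returns 56 scores), and the softmax is written out in plain Python instead of calling torch.softmax.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four made-up labels
toy_labels = ["A", "B", "C", "D"]
toy_logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(toy_logits)

# Sort labels by probability (highest first) and keep the top 3 as percentages
ranked = sorted(zip(toy_labels, probs), key=lambda pair: pair[1], reverse=True)
top_3 = {label: round(p * 100, 2) for label, p in ranked[:3]}
print(top_3)
```

The label with the largest logit always ends up with the largest probability, so the ranking of the classes is the same before and after the softmax; the softmax only rescales the scores so they can be read as shares of 100%.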

Run manifestoberta on your dataset: The context model

First, we create two empty lists, which we gradually fill with the predictions for each text.

1. We iterate through each text (i.e. each row/sentence) with a for-loop.
2. We construct a context string by concatenating the previous sentence, the current sentence, and the next sentence. The strip() function removes any leading or trailing whitespace from the resulting string. The tokenizer then converts the sentence and its context into tokens that the model can process.
3. The tokenized inputs are passed to the model. The model processes these inputs and outputs logits, which are raw, unnormalized prediction scores for each possible class.
4. We extract the top predicted classes from the logits and format them into a comma-separated string. The same is done for the associated probabilities, which are converted into percentage strings. These formatted strings are appended to the lists that store all predicted classes and their probabilities for later use.

predicted_classes = []
predicted_probabilities = []

# 1. Loop over each sentence and its context
# (index is used as a row position below, so df needs its default 0..n-1 index)
for index, row in df.iterrows():
    sentence = row['text']  # Current sentence, name of your text variable
    previous_sentence = df.iloc[index - 1]['text'] if index > 0 else ""  # Sentence before
    next_sentence = df.iloc[index + 1]['text'] if index < len(df) - 1 else ""  # Sentence after
    
    # 2. Construct the context
    context = f"{previous_sentence} {sentence} {next_sentence}".strip()
    
    inputs = tokenizer(sentence,
                       context,
                       return_tensors="pt",
                       max_length=300,  # Change to 200 when using the sentence model
                       padding="max_length",
                       truncation=True
                       )

    # 3. Pass inputs to model (no gradients are needed for prediction)
    with torch.no_grad():
        logits = model(**inputs).logits
    
    # 4. Get top classes and probabilities
    top_classes = get_top_classes(logits)
    top_classes_str = ', '.join(top_classes.keys())
    top_probs_str = ', '.join([str(prob) + '%' for prob in top_classes.values()])
    
    predicted_classes.append(top_classes_str)
    predicted_probabilities.append(top_probs_str)

Then, we can add the predicted classes and probabilities to the DataFrame.

df['Predicted Classes'] = predicted_classes
df['Predicted Probabilities'] = predicted_probabilities
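The context construction from step 2 can be checked in isolation, without the model. The sketch below uses a hypothetical helper build_contexts and three made-up sentences; it mirrors the logic of the loop above, including the empty string used at the start and end of the document.

```python
def build_contexts(sentences):
    """Pair each sentence with a context made of its direct neighbours."""
    pairs = []
    for i, sentence in enumerate(sentences):
        previous_sentence = sentences[i - 1] if i > 0 else ""
        next_sentence = sentences[i + 1] if i < len(sentences) - 1 else ""
        # strip() removes the extra space left by an empty neighbour
        context = f"{previous_sentence} {sentence} {next_sentence}".strip()
        pairs.append((sentence, context))
    return pairs

toy_sentences = ["One.", "Two.", "Three."]
pairs = build_contexts(toy_sentences)
for sentence, context in pairs:
    print(repr(sentence), "->", repr(context))
```

Note that the first and last sentences only have one neighbour, so their context is shorter than that of sentences in the middle of the document.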

Run manifestoberta on your dataset: The sentence model

If you prefer to run a sentence model, the steps would be very similar to the ones above, with slight modifications.

Loading the sentence model and a tokenizer:

model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2023-1-1")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

Define the top_n classes function in the same way as for the context model:

def get_top_classes(logits, top_n=3):
    probabilities = torch.softmax(logits, dim=1)[0].tolist()
    classes_and_probs = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
    top_classes = dict(sorted(classes_and_probs.items(), key=lambda item: item[1], reverse=True)[:top_n])
    return top_classes

And lastly, run the sentence model (including tokenizing and appending the predictions as described above):

predicted_classes = []
predicted_probabilities = []

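for index, row in df.iterrows():  # df is the DataFrame created above

    text_input = row['text']  # name of your text variable

    inputs = tokenizer(text_input,
                       return_tensors="pt",
                       max_length=200,
                       padding="max_length",
                       truncation=True
                       )

    logits = model(**inputs).logits
    top_classes = get_top_classes(logits)

    # Get top classes and probabilities
    top_classes_str = ', '.join(top_classes.keys())
    top_probs_str = ', '.join([str(prob) + '%' for prob in top_classes.values()])

    predicted_classes.append(top_classes_str)
    predicted_probabilities.append(top_probs_str)

# Add the predictions to the DataFrame, as for the context model
df['Predicted Classes'] = predicted_classes
df['Predicted Probabilities'] = predicted_probabilities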

Observe dataset classified by the context model

After successfully running the context model on our Labour 2024 manifesto, we can inspect the dataset with the newly created columns.

Clean Output

To work further with the model output, it is helpful to create three separate columns each from the predicted classes and their probabilities (and to convert the latter to numeric values).

# Creating a DataFrame
df = pd.DataFrame(df)

# Separate "Predicted Classes" into "class_1", "class_2", "class_3"
df[['class_1', 'class_2', 'class_3']] = df['Predicted Classes'].str.split(', ', expand=True)

# Separate "Predicted Probabilities" into "prob_class_1", "prob_class_2", "prob_class_3"
df[['prob_class_1', 'prob_class_2', 'prob_class_3']] = df['Predicted Probabilities'].str.split(', ', expand=True)

# Remove '%' and convert to numeric
# (DataFrame.map requires pandas >= 2.1; on older versions use .applymap instead)
df[['prob_class_1', 'prob_class_2', 'prob_class_3']] = df[['prob_class_1', 'prob_class_2', 'prob_class_3']].map(lambda x: float(x.replace('%', '')))

Examine predicted classes and probabilities

Next, you can examine how the three predicted classes and their probabilities are distributed in your data.

# Calculate some descriptive statistics
# Function to calculate confidence intervals
def confidence_interval(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    sem = stats.sem(data)  # Standard error of the mean
    margin = sem * stats.t.ppf((1 + confidence) / 2., n - 1)
    return mean - margin, mean + margin

# Descriptive statistics
descriptives = {
    "mean_prob_class_1": df['prob_class_1'].mean(),
    "median_prob_class_1": df['prob_class_1'].median(),
    "min_prob_class_1": df['prob_class_1'].min(),
    "max_prob_class_1": df['prob_class_1'].max(),
    "sd_prob_class_1": df['prob_class_1'].std(),
    "ci_lower_prob_class_1": confidence_interval(df['prob_class_1'])[0],
    "ci_upper_prob_class_1": confidence_interval(df['prob_class_1'])[1],
    
    "mean_prob_class_2": df['prob_class_2'].mean(),
    "median_prob_class_2": df['prob_class_2'].median(),
    "min_prob_class_2": df['prob_class_2'].min(),
    "max_prob_class_2": df['prob_class_2'].max(),
    "sd_prob_class_2": df['prob_class_2'].std(),
    "ci_lower_prob_class_2": confidence_interval(df['prob_class_2'])[0],
    "ci_upper_prob_class_2": confidence_interval(df['prob_class_2'])[1],
    
    "mean_prob_class_3": df['prob_class_3'].mean(),
    "median_prob_class_3": df['prob_class_3'].median(),
    "min_prob_class_3": df['prob_class_3'].min(),
    "max_prob_class_3": df['prob_class_3'].max(),
    "sd_prob_class_3": df['prob_class_3'].std(),
    "ci_lower_prob_class_3": confidence_interval(df['prob_class_3'])[0],
    "ci_upper_prob_class_3": confidence_interval(df['prob_class_3'])[1],
    
    "n_total": len(df)
}
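For reference, the interval returned by confidence_interval above is the standard t-based interval for a mean. The symbols are introduced here for illustration and do not appear in the code: \bar{x} is the sample mean, s the sample standard deviation (stats.sem uses one delta degree of freedom by default), n the number of sentences, and c the confidence level (0.95 by default).

```latex
\[
\bar{x} \;\pm\; t_{\frac{1+c}{2},\; n-1} \cdot \frac{s}{\sqrt{n}}
\]
```

Here t_{q, n-1} denotes the q-quantile of the t-distribution with n - 1 degrees of freedom, which is what stats.t.ppf returns.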

Likewise, we can calculate the distribution for each manifesto category. Here, we take the category with the highest probability (top-1 class) as an example.

# Group by 'class_1' and calculate the statistics
grouped = df.groupby('class_1').agg(
    mean_prob_class_1=('prob_class_1', 'mean'),
    median_prob_class_1=('prob_class_1', 'median'),
    min_prob_class_1=('prob_class_1', 'min'),
    max_prob_class_1=('prob_class_1', 'max'),
    sd_prob_class_1=('prob_class_1', 'std'),
    n=('prob_class_1', 'count')
).reset_index()

# Calculate confidence intervals if n >= 3, else set to NaN
grouped['ci_lower_prob_class_1'] = grouped.apply(
    lambda row: confidence_interval(df[df['class_1'] == row['class_1']]['prob_class_1'])[0] if row['n'] >= 3 else np.nan,
    axis=1
)
grouped['ci_upper_prob_class_1'] = grouped.apply(
    lambda row: confidence_interval(df[df['class_1'] == row['class_1']]['prob_class_1'])[1] if row['n'] >= 3 else np.nan,
    axis=1
)

Heat map

You can use these statistics to create a heat map of the predicted probabilities for class 1. To do so, we first reshape the data frame into long format.

descriptives_by_category_long = grouped.melt(
    id_vars=['class_1'],
    value_vars=['mean_prob_class_1', 'ci_lower_prob_class_1', 'ci_upper_prob_class_1'],
    var_name='Statistic',
    value_name='Value'
)

# Map labels
statistic_labels = {
    'mean_prob_class_1': 'Mean',
    'ci_lower_prob_class_1': 'Lower CI',
    'ci_upper_prob_class_1': 'Upper CI'
}

# Pivot the DataFrame for the heatmap
pivot_table = descriptives_by_category_long.pivot(index='class_1', columns='Statistic', values='Value')

# Plotting the heatmap
plt.figure(figsize=(22, 10))
heatmap = sns.heatmap(
    pivot_table,
    annot=True,
    fmt=".2f",
    cmap=sns.color_palette("YlGnBu", as_cmap=True),
    cbar_kws={'label': 'Value'}
)

# Customizing the plot
heatmap.set_xticklabels([statistic_labels.get(label.get_text(), label.get_text()) for label in heatmap.get_xticklabels()])
plt.xticks(rotation=45, ha='right', fontsize=10)
plt.yticks(fontsize=10)
heatmap.set_xlabel('')  # Remove x axis title
heatmap.set_ylabel('')  # Remove y axis title
plt.title('Predicted Probabilities for Class 1')
plt.show()

Check difference of probabilities

Additionally, you can check how close the probabilities of the three predicted classes are. The more similar the probabilities, the less reliable the prediction. Therefore, we define a function that checks whether the difference between the first predicted class (i.e., the class with the highest probability) and the third predicted class (i.e., the class with the lowest probability among the top three) is smaller than a threshold you define (here, we take a threshold of 6 percentage points as an example).

# Define the function to check probabilities
def check_probs(row, threshold=6):
    return row.max() - row.min() < threshold

# Apply the function to create the new column "check_probs"
df['check_probs'] = df[['prob_class_1', 'prob_class_2', 'prob_class_3']].apply(check_probs, axis=1)
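A small worked example shows what the flag means in practice. The sketch below uses a hypothetical helper is_close_call and made-up probabilities; it applies the same rule as check_probs to plain Python lists.

```python
def is_close_call(probabilities, threshold=6):
    """True if highest and lowest top-n probability differ by less than the threshold."""
    return max(probabilities) - min(probabilities) < threshold

# Made-up top-3 probabilities in percent
clear_case = [85.2, 6.1, 2.3]    # spread of about 83 points: one dominant class
close_call = [31.0, 28.5, 26.0]  # spread of 5 points: three near-equal candidates

print(is_close_call(clear_case))
print(is_close_call(close_call))
```

Rows flagged as True are candidates for manual review. On the DataFrame, df['check_probs'].sum() gives the number of flagged sentences.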

Aggregate sentence-per-sentence codings to saliences

If you are interested in “filling” missing data points of the Manifesto Dataset, or in calculating salience in general, you need a measure of the share of all sentences that each category makes up (note: sentences, not quasi-sentences here!). For that, we count the sentences per category and divide each count by the total number of sentences. As the basis, we take the class_1 variable, i.e. the category manifestoberta considers the most likely.

# Calculate the total number of rows
total_rows = len(df)

# Group by 'class_1' and calculate the relative frequency
relative_freq = df['class_1'].value_counts() / total_rows

# Convert the series to a DataFrame with one row
relative_freq_df = pd.DataFrame([relative_freq]).reset_index(drop=True)

We then have a dataset that follows the logic of the Manifesto Dataset, with one row per document and one column per category.
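The arithmetic behind the salience calculation can be traced on a toy example. The ten codings below are made up for illustration, and the sketch uses collections.Counter instead of pandas so that each step is visible.

```python
from collections import Counter

# Made-up top-1 codings for a ten-sentence document
toy_codes = ["504", "504", "411", "504", "411", "202", "504", "411", "504", "202"]

counts = Counter(toy_codes)   # sentences per category
total = len(toy_codes)        # total number of sentences

# Share of each category among all sentences (between 0 and 1)
shares = {code: n / total for code, n in counts.items()}
print(shares)
```

As in the pandas code above, the shares are proportions between 0 and 1 that sum to 1 across categories; multiply by 100 if you need percentages.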

Save Output

Lastly, you can save the resulting dataset in any format you like, for instance as a csv file.

df.to_csv('output_file.csv', index=False)