LLMs for Axial Coding: ECIR 2026 preprint available

Social scientists have spent decades manually coding large textual corpora: labeling text segments to capture their essence and clustering those labels into groups. That is time that could have been spent doing research, annotating fire hydrants for Google Street View, or with loved ones.

For this reason and more, in our latest paper we turn to LLMs for automated axial coding of lengthy transcripts (political debates). We extend an ensemble-based open coding pipeline with two axial coding (grouping) strategies: “traditional” clustering with subsequent LLM labeling, and direct LLM-based grouping.
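For the curious, here is a minimal sketch of what the two strategies boil down to. This is not the paper's implementation: the embedding model, cluster count, prompts, and the call_llm stub are illustrative stand-ins for whatever models and prompts you actually use.

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    def call_llm(prompt: str) -> str:
        """Placeholder: wire this up to your own LLM client."""
        raise NotImplementedError

    def axial_by_clustering(codes: list[str], n_groups: int = 10) -> dict[int, dict]:
        """Strategy 1: embed open codes, cluster them, then have an LLM label each cluster."""
        embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(codes)
        clusters = KMeans(n_clusters=n_groups, random_state=0).fit_predict(embeddings)
        groups = {}
        for cluster_id in range(n_groups):
            members = [c for c, k in zip(codes, clusters) if k == cluster_id]
            label = call_llm(
                "Give a short category name for these qualitative codes:\n- "
                + "\n- ".join(members)
            )
            groups[cluster_id] = {"label": label, "codes": members}
        return groups

    def axial_by_llm(codes: list[str]) -> str:
        """Strategy 2: hand the full list of codes to the LLM and let it group and label directly."""
        prompt = (
            "Group the following qualitative codes into named categories "
            "and list which codes belong to each:\n- " + "\n- ".join(codes)
        )
        return call_llm(prompt)

The clustering variant assigns every code to some group by construction, while the direct variant leaves both grouping and labeling entirely to the model.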

We find a clear trade-off: traditional clustering methods achieve high coverage and structural separation, while direct LLM grouping produces more concise, interpretable labels that are closer to human-assigned group labels, but at much lower coverage. Traditional clustering ensures broad representation; LLMs supply the interpretive layer that makes categories human-readable.

Get the preprint here:

  • [PDF] A. Parfenova, D. Graus, and J. Pfeffer, “From Quotes to Concepts: Axial Coding of Political Debates with Ensemble LMs,” in European Conference on Information Retrieval (ECIR), Delft, The Netherlands, 2026.
    [Bibtex]
    @inproceedings{parfenova2026quotes,
      title = {From Quotes to Concepts: Axial Coding of Political Debates with Ensemble LMs},
      author = {Parfenova, Angelina and Graus, David and Pfeffer, Juergen},
      booktitle = {European Conference on Information Retrieval (ECIR)},
      year = {2026},
      address = {Delft, The Netherlands},
      note = {To appear}
    }

This was work led by Angelina Parfenova. Our full dataset of 5k Dutch parliamentary debate utterances with LLM-assigned codes and categories is publicly available here: https://github.com/Likich/axial_coding_dataset.
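If you want to poke at the data, something like the snippet below should get you started; the file name and column names are placeholders rather than taken from the repository, so check its README for the actual layout.

    import pandas as pd

    # Hypothetical path and columns; adjust to whatever the repository actually ships.
    df = pd.read_csv("axial_coding_dataset/utterances.csv")
    print(df.columns)  # expect something like: utterance text, open code, category
    print(df.head())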
