Evolving Testing Methods

Using Generative AI in Item Development: What Does the Future Hold?

London (UK), July 2024 - Generative artificial intelligence (AI) tools have produced both excitement and uncertainty across many industries, and the global assessments landscape is no exception. As AI has become more powerful and accessible, high-stakes testing methods have evolved in parallel.

With AI dominating media headlines around the world, Neil Wilkinson, Manager of Content for EMEA and India at Pearson VUE, explores some of the key considerations and what the future holds for item development and AI.

Back in July 2021, the Association of Test Publishers (ATP) released a white paper titled Artificial Intelligence and the Testing Industry: A Primer, exploring a range of potential applications for AI in the credentialing field as well as "the appropriate responsible use of AI".

Shifting forward to today, an increasing number of exam content creators have started actively exploring how AI technologies might be used to develop test items for a range of assessments.

With AI understandably giving everyone a great deal to think about, questions inevitably come up, including:

  • How might the quality of AI-generated items compare with that of human-written items in the future?
  • What ethical or legal implications need to be considered when integrating AI into the item-writing process?
  • What is the potential cost of incorporating AI into existing processes? With accuracy depending largely on the quality and quantity of data, how much human effort is required to refine and improve the generated content?

Generative AI allows users to submit written text (prompts) specifying a task. This could include writing a multiple-choice item, crafting a scenario related to a particular profession, suggesting plausible but incorrect response options, or editing existing text according to style guidelines. While AI can do these things, the obvious question is "How well?", closely followed by "How will this fit into our existing test development processes?"

Opinions across the global assessments landscape vary widely on the potential of automatic item generation, whether using AI or other methods such as template-based approaches. While simple requests for items may produce flawed and relatively low-level content, it is possible to get well-constructed items across a range of cognitive levels using the right prompt.

The same instructions provided to human item writers regarding format, structure, distractors, and other item elements are just as necessary for generative AI item development as they are for conventional item-writing processes.
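To make this concrete, here is a minimal sketch of how such guidance might be embedded in a prompt, assuming the OpenAI Python SDK; the guideline text, topic, and model choice are illustrative assumptions and do not reflect the actual prompts used in our studies:

    # A minimal sketch only: the guideline text, topic, and model are
    # hypothetical, and any real workflow would keep a human in the loop.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # The same instructions given to human item writers (format, structure,
    # distractor rules) are embedded directly in the prompt.
    guidelines = (
        "Write one multiple-choice item with a clear stem, one correct "
        "answer (the key), and three plausible but incorrect options "
        "(distractors). Avoid negative stems and 'all of the above'."
    )
    task = (
        "Topic: network security fundamentals for an entry-level IT "
        "certification. Cognitive level: apply."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": guidelines},
            {"role": "user", "content": task},
        ],
    )

    print(response.choices[0].message.content)  # draft item for expert review

The output is only a draft: as with human-written content, it would still pass through editorial and subject matter expert review.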

Including experts on item development and evaluation throughout the whole process is key, as is an organized, scientific approach to understanding the results generated.

In 2023, Pearson VUE conducted several studies looking into the quality and characteristics of items generated by popular, free-to-use AI platforms.

We created a series of prompts based on our item-writing guidelines, which also included comprehensive instructions, examples of cognitive levels, sample item formats, and style guidelines.

Our research helped us to understand the current capabilities (and potential limitations) of AI for item development:

  1. We found that using large language models (LLMs) to create quality draft test items can reduce item-writing time considerably.
  2. AI-generated items appear comparable to unedited human-written items across a range of categories, including:

    • Does the item address the appropriate topic?
    • Is the item appropriate for the exam?
    • How much editing does the item require?
    • Does the item contain factual errors?
    • Is the item key correct?
    • Are the incorrect options plausible yet clearly incorrect?

  3. The cognitive level, or type of thinking (remember, apply, analyze) required to answer the item, did not always match the requested level for either AI-generated or human-written items. Asking the AI to make an item more difficult resulted in longer, wordier items that were not perceived as more difficult. And when using AI to generate numerous items at once from a test content area, duplicate content was produced and coverage of topics within that content area was uneven.

Further research exploring the instructions (prompts) used to generate items is underway. Studies we will present in 2024 explore which elements of the prompt are necessary to produce quality items and whether quality is maintained across different professional fields. We will also measure the time saved when using AI, to investigate any productivity gains. And we are working on methods for inserting certain types of content into the AI item-generation process.

As high-stakes testing methods continue to evolve, questions will inevitably arise around content ownership when using generative AI and how test owners can prevent bias or discrimination. These concerns are not unique to the testing industry, and any industry using generative AI will have to navigate them accordingly. As the ATP white paper highlighted, "When inherent bias and a diverse user population are not accounted for in developing and using AI, there are great risks related to bias and discrimination in outcomes".

Exploring the opportunities and challenges of applying this technology to item quantity, quality, and difficulty will be key to shaping the future of our industry:

  • AI will likely reduce the time needed to produce draft test items, sample content, and quality written items.
  • AI may start to play more of a part in the editorial process, providing suggestions to focus item writing on specific areas or forming part of the feedback process.
  • Subject matter experts and experienced test developers will continue to review and verify test items, providing technical and ethical oversight through a human-controlled approach.

We embrace the assistive capabilities that AI brings to the testing industry and the immense potential generative AI offers to subject matter experts and content developers in the future.

Our development roadmap is rooted in continuous improvement to help ensure exam integrity, while our focus on providing a superior testing experience for all candidates, test owners, and programs remains unchanged.

As an industry, we need to collectively ensure we're using AI tools with ethical and technical oversight, and that we're transparent about their application.