Stress Testing Factual Consistency Metrics for Long-Document Summarization

Notes

Was recommended this paper because it does dataset perturbation, which is what I am also interested in doing: perturb → see the effects in terms of preference.

The main metric here is the factuality score, which is a log-likelihood score. Figure 2 shows the change in factuality score (perturbed minus original), and the shifts in the figure are obvious. With that subtraction order, a -ve shift means the original scored higher, and a +ve shift means the perturbed summary scored higher.
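To make the "perturbed minus original" bookkeeping concrete, here is a minimal sketch. The function name `score_shifts` and the score values are made up for illustration; they are not taken from the paper.

```python
from statistics import mean


def score_shifts(original_scores, perturbed_scores):
    """Per-summary change in factuality score: perturbed minus original."""
    if len(original_scores) != len(perturbed_scores):
        raise ValueError("score lists must have the same length")
    return [p - o for o, p in zip(original_scores, perturbed_scores)]


# Made-up log-likelihood-style scores for three summaries (illustration only).
original = [-1.0, -2.0, -0.5]
perturbed = [-1.5, -1.0, -0.5]

shifts = score_shifts(original, perturbed)
mean_shift = mean(shifts)

print(shifts)      # one shift per summary
print(mean_shift)  # average shift across summaries
```

Plotting the distribution of `shifts` per perturbation type would give something in the spirit of Figure 2.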

Perturbations used:

Paraphrased: the summary is rewritten with alternate phrasings and syntactic structures
Simplified: complex or compound constructions are rewritten into shorter, more readable sentences
Synonym replaced: content words are substituted with close synonyms to test for lexical invariance
Less diverse: the summary is rewritten with reduced vocabulary variation
Negated: logically equivalent negations are introduced to probe sensitivity to syntactic polarity
Summarized: the summary is further compressed for brevity
Added source text: a factual sentence is inserted directly from the source
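As a rough illustration of two of these perturbations, the sketch below uses simple rule-based stand-ins. This is an assumption for illustration, not the paper's generation procedure: the synonym table, function names, and example sentences are all hypothetical.

```python
import random
import re

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"big": "large", "shows": "demonstrates", "method": "approach"}


def replace_synonyms(summary, synonyms=SYNONYMS):
    """Swap whole words found in the synonym table; leave everything else unchanged.

    Note: the replacement is inserted as written in the table, so the
    capitalization of a replaced word is not preserved.
    """
    def swap(match):
        word = match.group(0)
        return synonyms.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, summary)


def add_source_sentence(summary, source_sentences, rng):
    """Append one sentence copied verbatim from the source document."""
    return summary.rstrip() + " " + rng.choice(source_sentences)


summary = "The paper shows a big gain."
source = ["The gain holds on three datasets.", "Training took two days."]

synonym_version = replace_synonyms(summary)
added_version = add_source_sentence(summary, source, random.Random(0))

print(synonym_version)
print(added_version)
```

Both perturbations keep the summary's facts intact, so a robust factuality metric would ideally score the perturbed versions about the same as the original.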

Another finding is that claims that are more general (highly similar to many other claims in the original document) are harder to fact-check.

Generality is measured as the mean pairwise cosine similarity between each summary sentence and the entire document.
Higher cosine sim → more general
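A minimal sketch of this generality measure, assuming "pairwise" means each summary sentence is compared against every document sentence and the similarities are averaged. The embedding model is left out: the 2-D vectors below are toy values, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def generality_scores(summary_vecs, doc_vecs):
    """For each summary sentence, the mean cosine similarity to all document sentences."""
    return [
        sum(cosine(s, d) for d in doc_vecs) / len(doc_vecs)
        for s in summary_vecs
    ]


# Toy sentence embeddings (illustration only): two document sentences point
# the same way, one points in an orthogonal direction.
doc_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
summary_vecs = [[1.0, 0.0], [0.0, 1.0]]

scores = generality_scores(summary_vecs, doc_vecs)
print(scores)  # the first summary sentence matches more of the document
```

Under this reading, the first summary sentence is the more general one because it is similar to more of the document's sentences.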

Thoughts

The perturbation methods should be kept in mind, but running this entire thing on a big SFT dataset seems impossible.
