
Baseline and Cross-Model Results

This page summarizes the current set of paper results. The plots are generated in paper-results/, and the exact lag-level metrics are stored in benchmark-results/.

The comparison includes five model families:

  • Linear: linear baseline
  • CNN: convolutional neural network decoder baseline
  • PopT: foundation model
  • BrainBERT: foundation model
  • DIVER: foundation model

The summary below reports each model's peak score across lags for every task and evaluation condition. Values in parentheses give the lag at which the peak occurs and, for models that are not the row's best, the relative difference from the best model in that row: for example, the per-subject content_noncontent Linear score of 0.548 is (0.548 - 0.572) / 0.572 ≈ -4% relative to the CNN's row-best 0.572.

The paper summary currently covers eight tasks. Results for scalar LLM surprise regression, LLM token decoding, and LLM embedding pretraining remain available in benchmark-results/, but they are not included in this cross-model figure set.

Metrics by task:

  • content_noncontent, gpt_surprise_multiclass, iu_boundary, pos, and sentence_onset: ROC-AUC
  • word_embedding: average word-level ROC-AUC
  • whisper_embedding: pairwise accuracy
  • volume_level: correlation
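
For concreteness, here is a minimal sketch of how these per-task metrics could be computed with scikit-learn and scipy. The function names, array shapes, and the exact pairwise-accuracy definition are illustrative assumptions, not the repository's actual evaluation code:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

ROC_AUC_TASKS = {"content_noncontent", "gpt_surprise_multiclass",
                 "iu_boundary", "pos", "sentence_onset"}

def pairwise_accuracy(pred, target):
    """Fraction of item pairs where matching each predicted embedding to
    its own target gives a smaller total distance than swapping them."""
    n, correct, total = len(pred), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            matched = (np.linalg.norm(pred[i] - target[i])
                       + np.linalg.norm(pred[j] - target[j]))
            swapped = (np.linalg.norm(pred[i] - target[j])
                       + np.linalg.norm(pred[j] - target[i]))
            correct += matched < swapped
            total += 1
    return correct / total

def score_task(task, y_true, y_pred):
    if task in ROC_AUC_TASKS:
        # Binary ROC-AUC; for multiclass labels, y_pred is a probability
        # matrix and sklearn falls back to macro one-vs-rest AUC.
        return roc_auc_score(y_true, y_pred, multi_class="ovr")
    if task == "word_embedding":
        # Average of per-word binary ROC-AUCs, one column per word.
        return np.mean([roc_auc_score(y_true[:, k], y_pred[:, k])
                        for k in range(y_true.shape[1])])
    if task == "whisper_embedding":
        return pairwise_accuracy(y_pred, y_true)
    if task == "volume_level":
        # Pearson correlation between predicted and true volume.
        return pearsonr(y_pred, y_true)[0]
    raise ValueError(f"unknown task: {task}")
```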

Cross-Model Summary

Best lag summary

| condition | task | Linear | CNN | PopT | BrainBERT | DIVER |
|---|---|---|---|---|---|---|
| per_subject | content_noncontent | 0.548 (0 ms; -4%) | 0.572 (0 ms) | 0.499 (0 ms; -13%) | 0.526 (0 ms; -8%) | 0.553 (0 ms; -3%) |
| per_subject | gpt_surprise_multiclass | 0.504 (0 ms; -3%) | 0.518 (0 ms) | 0.502 (0 ms; -3%) | 0.507 (0 ms; -2%) | 0.514 (0 ms; -1%) |
| per_subject | iu_boundary | 0.607 (0 ms; -23%) | 0.785 (0 ms) | 0.514 (0 ms; -34%) | 0.649 (0 ms; -17%) | 0.684 (0 ms; -13%) |
| per_subject | pos | 0.501 (0 ms; -10%) | 0.529 (0 ms; -5%) | 0.499 (0 ms; -10%) | 0.524 (0 ms; -6%) | 0.555 (0 ms) |
| per_subject | sentence_onset | 0.674 (0 ms; -17%) | 0.809 (0 ms) | 0.503 (0 ms; -38%) | 0.670 (0 ms; -17%) | 0.655 (0 ms; -19%) |
| per_subject | volume_level | 0.064 (0 ms; -88%) | 0.206 (0 ms; -62%) | 0.047 (0 ms; -91%) | 0.308 (0 ms; -43%) | 0.545 (0 ms) |
| per_subject | whisper_embedding | 0.511 (0 ms; -23%) | 0.663 (0 ms) | 0.502 (0 ms; -24%) | 0.573 (0 ms; -14%) | 0.638 (0 ms; -4%) |
| per_subject | word_embedding | 0.594 (0 ms; -3%) | 0.565 (0 ms; -8%) | 0.495 (0 ms; -19%) | 0.558 (0 ms; -9%) | 0.612 (0 ms) |
| super_subject | content_noncontent | 0.565 (0 ms; -14%) | 0.564 (0 ms; -14%) | 0.574 (0 ms; -12%) | 0.601 (0 ms; -8%) | 0.656 (0 ms) |
| super_subject | gpt_surprise_multiclass | 0.500 (0 ms; -9%) | 0.518 (0 ms; -5%) | 0.528 (0 ms; -4%) | 0.522 (0 ms; -5%) | 0.548 (0 ms) |
| super_subject | iu_boundary | 0.739 (0 ms; -12%) | 0.839 (0 ms) | 0.732 (0 ms; -13%) | 0.770 (0 ms; -8%) | 0.793 (0 ms; -6%) |
| super_subject | pos | 0.500 (0 ms; -24%) | 0.537 (0 ms; -18%) | 0.559 (0 ms; -15%) | 0.578 (0 ms; -12%) | 0.656 (0 ms) |
| super_subject | sentence_onset | 0.890 (0 ms) | 0.875 (0 ms; -2%) | 0.731 (0 ms; -18%) | 0.768 (0 ms; -14%) | 0.794 (0 ms; -11%) |
| super_subject | volume_level | 0.402 (0 ms; -33%) | 0.583 (0 ms; -3%) | 0.309 (0 ms; -49%) | 0.489 (0 ms; -19%) | 0.602 (0 ms) |
| super_subject | whisper_embedding | 0.632 (0 ms; -20%) | 0.684 (0 ms; -14%) | 0.684 (0 ms; -14%) | 0.715 (0 ms; -10%) | 0.792 (0 ms) |
| super_subject | word_embedding | 0.681 (0 ms; -9%) | 0.572 (0 ms; -24%) | 0.601 (0 ms; -20%) | 0.655 (0 ms; -12%) | 0.748 (0 ms) |

At the super-subject level, DIVER has the highest peak score on six of the eight summarized tasks. The CNN remains strongest for intonation-unit boundary detection, while the linear baseline is strongest for sentence onset. In per-subject evaluations, the CNN leads five tasks and DIVER leads three.
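
These win counts can be re-derived from the summary table itself. A minimal sketch, assuming paper-results/best_lag_summary.csv has one row per condition/task pair and one numeric peak-score column per model (the real column layout may differ):

```python
import pandas as pd

MODELS = ["Linear", "CNN", "PopT", "BrainBERT", "DIVER"]

df = pd.read_csv("paper-results/best_lag_summary.csv")
# For each condition/task row, find the model with the highest peak score,
# then count wins per model within each evaluation condition.
df["winner"] = df[MODELS].idxmax(axis=1)
print(df.groupby("condition")["winner"].value_counts())
```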

Source Files

  • Cross-model configuration: benchmark-results/paper_results.yml
  • Cross-model summary table: paper-results/best_lag_summary.csv
  • Exact lag-level metrics: benchmark-results/<model>/<task>/<condition>/lag_performance.csv
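
As an example of tying these files together, the sketch below rebuilds one summary cell from the lag-level metrics. It assumes lag_performance.csv has lag_ms and score columns; the actual schema is not documented here, so treat those column names as placeholders:

```python
from pathlib import Path
import pandas as pd

def best_lag(model, task, condition, root="benchmark-results"):
    """Return (peak score, lag in ms) for one model/task/condition cell."""
    path = Path(root) / model / task / condition / "lag_performance.csv"
    df = pd.read_csv(path)
    peak = df.loc[df["score"].idxmax()]
    return float(peak["score"]), float(peak["lag_ms"])

best_score, _ = best_lag("DIVER", "pos", "super_subject")
score, lag = best_lag("BrainBERT", "pos", "super_subject")
# Relative difference from the row's best model, as shown in parentheses
# in the summary table: (score - best) / best.
print(f"{score:.3f} ({lag:.0f} ms; {100 * (score - best_score) / best_score:+.0f}%)")
```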