Common mistakes

Read this once before you write code against the corpus. Two minutes here saves a debugging session later.

1. Don't derive `has_violence` from typology

The typology tag is not the source of truth for this flag:

```python
# WRONG: infers the flag from the clip-level typology tag
has_violence = typology in ("SV", "IT")

# CORRECT: uses the event-level ground truth
has_violence = any(e["tier1_category"] != "NONE" for e in events)
```

`has_violence` in `weak_label` is derived from strong-label events, not from typology. NEG clips can have `max_intensity = 3` (raised voices, distress) and still be `has_violence: false`, because every one of their events is labelled `tier1_category: "NONE"` by design. That's the whole point of NEG: hard negatives that sound intense but aren't violent.
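
As a minimal sketch of the rule above, the helper below derives the flag from a list of already-parsed event dicts. The name `derive_has_violence` and the sample events are illustrative and not part of the corpus tooling; only the `tier1_category` field and its `"NONE"` value come from the text above.

```python
from typing import Iterable, Mapping


def derive_has_violence(events: Iterable[Mapping[str, object]]) -> bool:
    """Return True if any event carries a tier-1 category other than "NONE"."""
    return any(event["tier1_category"] != "NONE" for event in events)


# Hypothetical NEG-style clip: intense, but every event is labelled "NONE".
neg_events = [{"tier1_category": "NONE"}, {"tier1_category": "NONE"}]
# Hypothetical violent clip: "PHYS" is a placeholder for any non-"NONE" value.
violent_events = [{"tier1_category": "NONE"}, {"tier1_category": "PHYS"}]

print(derive_has_violence(neg_events))      # False
print(derive_has_violence(violent_events))  # True
```

Note that a clip with no events at all also comes out as `False` under this definition.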

2. Don't hardcode speaker directory paths

There's already more than one. Delivery-003 has three speaker directories under `data/he/`:

```
data/he/agg_m_30-45_001/   # She-Proves, Azure pair
data/he/agg_m_30-45_002/   # She-Proves, Google Chirp 3 HD pair  (new in delivery-003)
data/he/ben_m_40-55_003/   # Elephant in the Room, Azure pair  (new in delivery-003)
```

Code that hardcodes `data/he/agg_m_30-45_001/` will miss two-thirds of the clips. Use `manifest.csv` (the `wav_path` column is repo-relative POSIX), or derive the directory from the first entry in `speakers[]`:

```python
# root is a pathlib.Path for the repository checkout; meta is the parsed {clip_id}.json
speaker_dir = root / "data" / meta["language"] / meta["speakers"][0]["speaker_id"].lower()
```
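
The manifest route can be sketched as follows. This is a minimal example, not the project's own loader: `iter_wav_paths` is a hypothetical helper, and the `clip_id` column in the sample is assumed for illustration (only `wav_path` is confirmed above). The in-memory sample stands in for opening `data/he/manifest.csv`.

```python
import csv
import io
from pathlib import Path, PurePosixPath
from typing import Iterator, TextIO


def iter_wav_paths(manifest: TextIO, root: Path) -> Iterator[Path]:
    """Yield one path per manifest row, resolved against the repo root."""
    for row in csv.DictReader(manifest):
        # wav_path is repo-relative POSIX, so split it into parts and
        # re-join under root; the result is then valid on any OS.
        yield root.joinpath(*PurePosixPath(row["wav_path"]).parts)


# Stand-in for: open(root / "data" / "he" / "manifest.csv", newline="")
sample = io.StringIO(
    "clip_id,wav_path\n"
    "sp_sv_a_0001_00,data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav\n"
)
for path in iter_wav_paths(sample, Path("repo")):
    print(path.as_posix())  # repo/data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav
```

Because every path comes from the manifest, new speaker directories in later deliveries are picked up without code changes.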

3. Audio peak is ~0.79, not 1.0

Clips are normalized to a –2.0 dBFS peak target, not to full scale (0 dBFS = linear 1.0). At –2.0 dBFS the linear peak is 10^(–2/20) ≈ 0.794, so loading a clip and expecting full-range float values will surprise you:

```python
import numpy as np
import soundfile as sf

wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
print(np.abs(wav).max())  # ~0.7943, not 1.0
```

The –2 dBFS target leaves 2 dB of headroom below full scale and sits 1 dB under the safety limiter at –1.0 dBFS. The configured target is recorded in `generation_metadata.loudness_target_peak_dbfs`; the measured peak is recorded in `preprocessing_applied.normalized_dbfs`.
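
The relation between a peak level in dBFS and the linear sample value is amplitude = 10^(dBFS / 20). The sketch below is only that arithmetic; the function names are illustrative and not part of any corpus tooling.

```python
import math


def dbfs_to_linear(dbfs: float) -> float:
    """Convert a peak level in dBFS to linear amplitude (1.0 = full scale)."""
    return 10.0 ** (dbfs / 20.0)


def linear_to_dbfs(amplitude: float) -> float:
    """Inverse conversion; amplitude must be greater than zero."""
    return 20.0 * math.log10(amplitude)


print(round(dbfs_to_linear(-2.0), 4))  # 0.7943  (normalization target)
print(round(dbfs_to_linear(-1.0), 4))  # 0.8913  (safety limiter)
print(round(dbfs_to_linear(0.0), 4))   # 1.0     (full scale)
```

If you compare a measured peak against the target, do it with a small tolerance rather than exact equality, since the stored value is a measurement.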

4. UPPERCASE in JSON, lowercase on disk

The same speaker has two surface forms:

| Surface | Form | Example |
| --- | --- | --- |
| JSON field (`speaker_id`, `speakers[].speaker_id`) | UPPERCASE | `AGG_M_30-45_001` |
| Filesystem directory | lowercase | `agg_m_30-45_001/` |
| `clip_id` (everywhere) | lowercase | `sp_sv_a_0001_00` |

If you build a dict keyed on speaker IDs from JSON and then try to look up paths with the same string, you'll get a `FileNotFoundError` on any case-sensitive filesystem. Always call `.lower()` when converting from JSON to a path.
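
A small helper keeps the case conversion in one place. This is a sketch: `speaker_dir` is a hypothetical name, and it assumes the `data/<language>/<speaker_id>/` layout shown in mistake 2, with `"he"` as the language value.

```python
from pathlib import Path


def speaker_dir(root: Path, language: str, speaker_id: str) -> Path:
    """Map a speaker_id as it appears in JSON to its on-disk directory."""
    # JSON carries the UPPERCASE form; directories use the lowercase form.
    return root / "data" / language / speaker_id.lower()


print(speaker_dir(Path("."), "he", "AGG_M_30-45_001").as_posix())
# data/he/agg_m_30-45_001
```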

5. NEG is not "violent at low intensity"

The four violence typologies are not an ordered scale.

| Typology | Meaning |
| --- | --- |
| `SV` | Severe Violence: physical attacks, life-threatening |
| `IT` | Intimate Terrorism: sustained coercive control, repeated abuse |
| `NEG` | Negative confusor: sounds intense, no violence (hard negative) |
| `NEU` | Neutral: mundane conversation |

A NEG clip is not "a milder SV." It is acoustic distress that a naive model would mistake for violence. Treating NEG as a positive class will tank your precision.

6. All clips are `split: train` in delivery-003

The `split` column exists in `manifest.csv`, but there are only 4 unique speaker personas across all 20 clips. Speaker-disjoint train/val/test partitioning isn't feasible at this scale, so every clip is assigned `split: train`. Don't treat the `split` column as a usable partition; treat the whole corpus as an unpartitioned pool for now. SynthBanshee will assign meaningful splits once the speaker pool grows.

7. `quality_flags` doesn't mean "broken"

About 15 of 20 clips in delivery-003 carry at least one `quality_flags` entry, usually `emotion_downgrade` (the TTS produced slightly less intense prosody than the SSML asked for at high-intensity turns). These clips are still validated and spec-compliant; the flag is a soft hint, not a failure. Don't filter them out reflexively.

The hard line is `synthbanshee validate`: a clip either passes or it doesn't. If it's in the corpus, it passed.
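
Before deciding how to treat flagged clips, it can help to tally the flags instead of filtering on them. The sketch below assumes each parsed clip metadata dict exposes `quality_flags` as a top-level list of strings; that location is an assumption, so adjust the lookup to the actual schema. `count_quality_flags` and the sample dicts are illustrative.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_quality_flags(clips: Iterable[Mapping[str, object]]) -> Counter:
    """Tally how often each quality flag appears across clip metadata dicts."""
    counts: Counter = Counter()
    for meta in clips:
        # Treat a missing or null field the same as an empty list.
        counts.update(meta.get("quality_flags") or [])
    return counts


# Made-up metadata, standing in for parsed {clip_id}.json files.
sample_clips = [
    {"quality_flags": ["emotion_downgrade"]},
    {"quality_flags": ["emotion_downgrade", "vic_f0_high"]},
    {"quality_flags": []},
]
print(count_quality_flags(sample_clips).most_common())
# [('emotion_downgrade', 2), ('vic_f0_high', 1)]
```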

8. The 2 Google clips have a `vic_f0_high` flag, and that's expected

`sp_sv_a_0003_00` and `sp_it_a_0003_00` use the Google Chirp 3 HD female voice (`he-IL-Chirp3-HD-Achernar`), whose fundamental-frequency baseline runs higher than the Azure reference voice the QA thresholds were calibrated against. The flag fires correctly; the audio is fine. Don't exclude these clips on the basis of this flag, because your model needs the backend diversity. If you compute F0-derived features, calibrate per backend.
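
One way to calibrate per backend is to standardize each F0-derived feature within its own backend group before pooling. The sketch below illustrates that idea only; it is not a method prescribed by the corpus. The backend labels and F0 values are made up, and how you determine the backend of each clip is left as an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def zscore_per_backend(rows: Sequence[Tuple[str, float]]) -> List[float]:
    """Standardize values within each backend group, preserving input order.

    rows holds (backend, value) pairs, e.g. a mean F0 in Hz per clip.
    """
    by_backend = defaultdict(list)
    for backend, value in rows:
        by_backend[backend].append(value)
    stats = {b: (mean(v), pstdev(v)) for b, v in by_backend.items()}

    scores = []
    for backend, value in rows:
        mu, sigma = stats[backend]
        # A group with zero spread (e.g. a single clip) maps to 0.0.
        scores.append(0.0 if sigma == 0 else (value - mu) / sigma)
    return scores


# Made-up numbers, chosen only to make the arithmetic easy to follow.
rows = [("azure", 180.0), ("azure", 200.0), ("google", 230.0), ("google", 250.0)]
print(zscore_per_backend(rows))  # [-1.0, 1.0, -1.0, 1.0]
```

After standardization the higher baseline of one backend no longer shifts the pooled feature distribution. With only two clips from a backend the group statistics are very rough, so treat this as a starting point.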

9. Timestamps already account for silence padding

Every clip has ≥0.5 s of silence at head and tail. Onset/offset timestamps in `.txt` and `.jsonl` are already shifted to refer to positions in the final processed WAV. You don't need to add the pad: read the timestamp, slice the WAV, done.
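
Converting a timestamp to a sample index is a multiplication by the sample rate. A minimal sketch, assuming onset and offset are expressed in seconds; `event_slice` is a hypothetical helper, and the 16 kHz rate and zero-filled list are stand-ins for a real clip (use the rate your audio loader returns).

```python
def event_slice(onset_s: float, offset_s: float, sample_rate: int) -> slice:
    """Return the sample range covering [onset_s, offset_s) in the processed WAV."""
    if not 0 <= onset_s <= offset_s:
        raise ValueError("expected 0 <= onset_s <= offset_s")
    # No pad correction here: the timestamps already refer to the final file.
    return slice(round(onset_s * sample_rate), round(offset_s * sample_rate))


sr = 16_000                # illustrative sample rate
wav = [0.0] * (3 * sr)     # stand-in for 3 s of loaded samples
segment = wav[event_slice(0.5, 1.25, sr)]
print(len(segment) / sr)   # 0.75
```

The same slice object works on a NumPy array returned by an audio loader.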

10. The `.json` and `.jsonl` files aren't the same thing

| File | Contains | When to load |
| --- | --- | --- |
| `{clip_id}.json` | `ClipMetadata`: one object per clip (weak labels, speakers, provenance, acoustic scene) | Always |
| `{clip_id}.jsonl` | `EventLabel` records: one JSON object per line, one per labelled event in the clip | When you need per-event strong labels (onset/offset/category) |

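Calling `json.loads()` on the whole contents of a `.jsonl` file raises `json.JSONDecodeError` as soon as the file holds more than one event. Read it line by line and parse each line separately.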
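
A minimal sketch of line-by-line parsing follows. `iter_events` is a hypothetical helper, and the two sample records are made up and carry only the `tier1_category` field mentioned earlier; real `EventLabel` records have more fields. The in-memory text stands in for an opened `{clip_id}.jsonl` file.

```python
import io
import json
from typing import Iterator, TextIO


def iter_events(lines: TextIO) -> Iterator[dict]:
    """Parse a JSON Lines stream, yielding one dict per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. at the end of the file
            yield json.loads(line)


text = '{"tier1_category": "NONE"}\n{"tier1_category": "NONE"}\n'

# Parsing the whole file at once fails on the second object.
try:
    json.loads(text)
except json.JSONDecodeError as err:
    print("whole-file parse failed:", err.msg)  # Extra data

events = list(iter_events(io.StringIO(text)))
print(len(events))  # 2
```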

Quick verification

Use these snippets to confirm a fresh clone is intact:

```bash
find data/he -name "*.wav" | wc -l     # expect 20
wc -l data/he/manifest.csv              # expect 21 (header + 20 rows)
```

```python
import pandas as pd

df = pd.read_csv("data/he/manifest.csv")
assert len(df) == 20
assert set(df["tier"]) == {"A", "B"}
assert set(df["violence_typology"]) == {"SV", "IT", "NEG", "NEU"}
assert df["has_violence"].sum() == 10   # 5 SV + 5 IT
```