Start here
Your first 10 minutes with the corpus. By the end you'll have cloned it, verified the clone, loaded one clip with its labels, and seen what's in a transcript.
1. Clone
The repository does not use Git LFS. Total size is a few hundred megabytes for delivery-003. The audio lives in data/he/ and the SSML caches live in assets/.
2. Verify the clone
find data/he -name "*.wav" | wc -l # expect 20
wc -l data/he/manifest.csv # expect 21 (header + 20 rows)
If those numbers don't match, the clone is incomplete — git lfs pull is not the answer (we don't use LFS). Re-clone.
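The same check can be sketched in Python, which also reports which manifest rows point at missing files. `check_clone` is a hypothetical helper, not part of SynthBanshee. It assumes the manifest has the wav_path column shown in the manifest section of this page and that those paths are relative to the repo root.

```python
import csv
from pathlib import Path


def check_clone(repo_root="."):
    """Compare WAV files on disk against rows in data/he/manifest.csv."""
    repo_root = Path(repo_root)
    data_dir = repo_root / "data" / "he"

    # Count every .wav under data/he, at any depth (same as the find command).
    wav_count = sum(1 for _ in data_dir.rglob("*.wav"))

    with open(data_dir / "manifest.csv", newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))  # header row is consumed as field names

    # Assumption: wav_path is relative to the repo root.
    missing = [r["wav_path"] for r in rows
               if not (repo_root / r["wav_path"]).is_file()]
    return wav_count, len(rows), missing


# Usage from the repo root (this page expects 20 WAVs and 20 rows):
# wav_count, row_count, missing = check_clone(".")
```

If `missing` is non-empty or the two counts differ, treat the clone as incomplete and re-clone.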
3. Install the minimal Python deps
That's enough for everything on this page. pydantic is only needed if you want strict schema validation; jsonlines only if you prefer it to the one-liner that reads .jsonl directly.
When you'd want the full SynthBanshee install
If you want from synthbanshee.labels.schema import ClipMetadata for strict Pydantic validation, or synthbanshee qa-report to re-run QA over the data directory:
Otherwise, json.loads() is fine and is what the examples below use.
4. Load one clip end-to-end
The path on disk is lowercase even though the speaker ID in JSON is UPPERCASE. That's Gotcha #4.
import json
from pathlib import Path
import soundfile as sf
root = Path(".") # repo root
clip_dir = root / "data/he/agg_m_30-45_001"
clip_id = "sp_sv_a_0001_00"
# Audio
wav, sr = sf.read(clip_dir / f"{clip_id}.wav")
assert sr == 16000 and wav.ndim == 1 # always 16 kHz mono
print(f"duration={len(wav)/sr:.1f}s peak={abs(wav).max():.3f}")
# duration=110.5s peak=0.794
Why is the peak 0.794 and not 1.0?
Clips are normalized to a –2.0 dBFS peak target, which is roughly 0.79 linear amplitude. Use generation_metadata.loudness_target_peak_dbfs to read the configured target and preprocessing_applied.normalized_dbfs to read the measured output peak. Full detail: Audio Format.
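The relationship between the dBFS target and the linear peak is the standard amplitude conversion, 10^(dBFS/20). A minimal sketch follows; the function names are illustrative and not part of the corpus tooling.

```python
import math


def dbfs_to_linear(dbfs):
    """Convert a peak level in dBFS to linear amplitude (full scale = 1.0)."""
    return 10.0 ** (dbfs / 20.0)


def linear_to_dbfs(amplitude):
    """Convert a positive linear peak amplitude back to dBFS."""
    return 20.0 * math.log10(amplitude)


print(f"{dbfs_to_linear(-2.0):.3f}")   # 0.794
print(f"{linear_to_dbfs(0.794):.1f}")  # -2.0
```

To compare a loaded clip against its metadata, pass `abs(wav).max()` to `linear_to_dbfs` and check the result against preprocessing_applied.normalized_dbfs.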
Clip-level labels (weak labels):
meta = json.loads((clip_dir / f"{clip_id}.json").read_text())
wl = meta["weak_label"]
print(f"typology={meta['violence_typology']} has_violence={wl['has_violence']} "
      f"intensity_max={wl['max_intensity']} categories={wl['violence_categories']}")
# typology=SV has_violence=True intensity_max=5 categories=['DIST', 'PHYS', 'VERB']
Event-level labels (strong labels):
events = [
    json.loads(line)
    for line in (clip_dir / f"{clip_id}.jsonl").read_text().splitlines()
    if line.strip()
]
for evt in events[:3]:
    print(f"[{evt['onset']:5.1f}s – {evt['offset']:5.1f}s] "
          f"{evt['speaker_role']} {evt['tier1_category']}/{evt['tier2_subtype']} "
          f"I{evt['intensity']}")
# [ 0.8s – 10.1s] AGG VERB/VERB_SHOUT I2
# [ 10.5s – 18.7s] VIC VERB/VERB_SHOUT I2
# [ 18.3s – 29.7s] AGG VERB/VERB_THREAT I3
The full 14-event escalation arc for this clip is the one visualised on the home page — verbal → distress → physical → settle.
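Once the events are loaded as a list of dicts, a per-category summary is a short loop. The sketch below uses only the field names shown above. `summarize_events` is a hypothetical helper, and the sample list reuses the three rounded events printed above rather than the real file contents.

```python
from collections import defaultdict


def summarize_events(events):
    """Event count, summed duration, and peak intensity per tier1_category."""
    summary = defaultdict(lambda: {"count": 0, "seconds": 0.0, "max_intensity": 0})
    for evt in events:
        entry = summary[evt["tier1_category"]]
        entry["count"] += 1
        # Durations are summed per event, so overlapping events count twice.
        entry["seconds"] += evt["offset"] - evt["onset"]
        entry["max_intensity"] = max(entry["max_intensity"], evt["intensity"])
    return dict(summary)


# Rounded values copied from the printed output above (illustrative only).
sample = [
    {"onset": 0.8, "offset": 10.1, "speaker_role": "AGG",
     "tier1_category": "VERB", "tier2_subtype": "VERB_SHOUT", "intensity": 2},
    {"onset": 10.5, "offset": 18.7, "speaker_role": "VIC",
     "tier1_category": "VERB", "tier2_subtype": "VERB_SHOUT", "intensity": 2},
    {"onset": 18.3, "offset": 29.7, "speaker_role": "AGG",
     "tier1_category": "VERB", "tier2_subtype": "VERB_THREAT", "intensity": 3},
]

for category, stats in summarize_events(sample).items():
    print(category, stats["count"], round(stats["seconds"], 1), stats["max_intensity"])
```

Events may overlap in time (the second and third above do), so the summed seconds can exceed wall-clock coverage.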
5. Read a transcript
.txt files are turn-major with a small header block per turn. They are UTF-8 encoded Hebrew text, intended both for human reading and as an ASR reference.
[CLIP_ID: sp_sv_a_0001_00]
[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.76 | OFFSET: 10.07]
מה זה הארוחה הזאת? שאלתי אותך דבר אחד פשוט, לעשות ארוחת ערב נורמלית.
[ACTION: VERB_SHOUT | INTENSITY: 2]
[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 10.49 | OFFSET: 18.74]
עבדתי עד שש היום. עשיתי מה שהספקתי...
Hebrew is right-to-left; some terminals mis-render it
macOS Terminal.app handles it correctly; older Windows consoles don't. If transcripts look reversed or garbled, view the .txt in an editor (VS Code, BBEdit) rather than cat.
Timestamps in the header are already relative to the final processed WAV — they include the 0.5 s silence pad at the head. No shift needed.
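For programmatic use, the bracketed lines can be parsed with a small amount of code. The sketch below assumes the layout is exactly what the excerpt above shows: one [CLIP_ID: …] line, then per turn a [SPEAKER: … | ROLE: … | ONSET: … | OFFSET: …] line, the text, and an optional [ACTION: … | INTENSITY: …] line. `parse_bracket` and `parse_transcript` are hypothetical helpers, not a SynthBanshee API, and real files may contain header fields not handled here.

```python
import re

BRACKET = re.compile(r"^\[(.+)\]$")


def parse_bracket(line):
    """Turn '[K: v | K2: v2]' into a dict; return None for plain text lines."""
    match = BRACKET.match(line.strip())
    if not match:
        return None
    fields = {}
    for part in match.group(1).split("|"):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def parse_transcript(text):
    """Return (clip_id, turns) from transcript text in the assumed layout."""
    clip_id, turns = None, []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = parse_bracket(line)
        if fields is None:
            if turns:  # spoken text belongs to the most recent turn
                turns[-1]["text"].append(line.strip())
        elif "CLIP_ID" in fields:
            clip_id = fields["CLIP_ID"]
        elif "SPEAKER" in fields:
            turns.append({
                "speaker": fields["SPEAKER"],
                "role": fields["ROLE"],
                "onset": float(fields["ONSET"]),
                "offset": float(fields["OFFSET"]),
                "text": [],
                "action": None,
                "intensity": None,
            })
        elif "ACTION" in fields and turns:
            turns[-1]["action"] = fields["ACTION"]
            turns[-1]["intensity"] = int(fields["INTENSITY"])
    return clip_id, turns


# The excerpt shown above, reused as sample input.
excerpt = """[CLIP_ID: sp_sv_a_0001_00]
[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.76 | OFFSET: 10.07]
מה זה הארוחה הזאת? שאלתי אותך דבר אחד פשוט, לעשות ארוחת ערב נורמלית.
[ACTION: VERB_SHOUT | INTENSITY: 2]
[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 10.49 | OFFSET: 18.74]
עבדתי עד שש היום. עשיתי מה שהספקתי...
"""

clip_id, turns = parse_transcript(excerpt)
for turn in turns:
    print(turn["role"], turn["onset"], turn["offset"], turn["action"])
```

When reading a real file, open it with an explicit encoding, for example `Path(...).read_text(encoding="utf-8")`, so the Hebrew text decodes the same way on every platform.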
6. Work from the manifest, not from hardcoded paths
data/he/manifest.csv is one row per clip. It's the fastest entry point for filtering and the safest way to find files (because hardcoded speaker directories will miss two-thirds of the clips).
import pandas as pd
import soundfile as sf
df = pd.read_csv("data/he/manifest.csv")
df.columns.tolist()
# ['clip_id', 'project', 'violence_typology', 'tier', 'duration_seconds',
# 'speaker_ids', 'voice_families', 'has_violence', 'max_intensity',
# 'quality_flags', 'split', 'wav_path', 'strong_labels_path']
# Filter
violent = df[df["has_violence"]] # 10 clips
elephant = df[df["project"] == "elephant_in_the_room"] # 8 clips
sv_high = df[(df["violence_typology"] == "SV") & (df["max_intensity"] >= 4)]
# Load audio for any manifest row — wav_path is already repo-relative POSIX
row = df.iloc[0]
wav, sr = sf.read(row["wav_path"])
speakers = row["speaker_ids"].split("|") # pipe-delimited!
voices = row["voice_families"].split("|") # same order as speaker_ids
speaker_ids and voice_families are pipe-delimited strings
They are not CSV-nested lists. Split on |.
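Because the two columns share the same order, they can be zipped into (speaker, voice) pairs after splitting. The sketch below uses only the standard library. `speaker_voice_pairs` is a hypothetical helper, and the sample row uses made-up placeholder values rather than real manifest content.

```python
import csv
import io


def speaker_voice_pairs(row):
    """Split the pipe-delimited columns and pair them up by position."""
    speakers = row["speaker_ids"].split("|")
    voices = row["voice_families"].split("|")
    if len(speakers) != len(voices):
        raise ValueError(f"{row['clip_id']}: {len(speakers)} speakers, {len(voices)} voices")
    return list(zip(speakers, voices))


# Placeholder row: the IDs and voice names are invented for illustration.
sample_csv = (
    "clip_id,speaker_ids,voice_families\n"
    "clip_x,SPK_A|SPK_B,voice_1|voice_2\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(speaker_voice_pairs(rows[0]))
# [('SPK_A', 'voice_1'), ('SPK_B', 'voice_2')]
```

The length check is a cheap guard: a mismatch usually means a value was split on the wrong delimiter.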
7. Where to go next
| You're about to… | Read |
|---|---|
| Write data-loading code | Common mistakes (2 min read; saves debugging) |
| Look up a field in .json | Schema Reference |
| Understand has_violence semantics | Label Taxonomy |
| Look up a term (F0, SSML, IR, BEN…) | Glossary |
| Work specifically with phone-app data | She-Proves guide |
| Work specifically with Tier B / room-augmented audio | Elephant in the Room guide |
| Verify a clip is spec-compliant | synthbanshee validate <path> (requires SynthBanshee installed) |