She-Proves Guide

She-Proves is a smartphone app that passively monitors audio for domestic-violence incidents and preserves evidence for legal use. Optimisation target: high recall — better to flag for review than to miss.

This page covers only what distinguishes She-Proves clips from the rest of the corpus. For shared concepts (schema, labels, audio format), follow the cross-links.

Scene profile


Project code	`she_proves` (clip-id prefix `sp_*`)
Tier	A — clean audio, no room/device augmentation
Duration	3–6 min (delivery-003 clips are shorter: 1m 32s to 2m 26s, see the full clip listing)
Pre-incident window	≥ 60% of clip is normal speech before the first violence event
Device	`phone_in_pocket`, `phone_on_table`, `phone_in_hand` (planned; not active in delivery-003)
Room	Apartment (living room, bedroom, kitchen) — planned; not active in delivery-003

The long pre-incident window is intentional. In deployment the app is always listening; incidents are rare. A model trained only on escalation segments will miss the gradual-buildup signal that precedes most domestic-violence events.
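The 60% rule can be checked mechanically once event onsets are available. The sketch below is a minimal illustration, not part of the delivery tooling: the function names are hypothetical, and it assumes you have already extracted the clip duration and the start times (in seconds) of the violence events from the labels.

```python
from typing import List, Optional


def pre_incident_fraction(duration_s: float, violence_onsets_s: List[float]) -> Optional[float]:
    """Fraction of the clip that elapses before the first violence event.

    Returns None when the clip has no violence events (NEG / NEU clips).
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if not violence_onsets_s:
        return None
    first_onset = min(violence_onsets_s)
    if not 0 <= first_onset <= duration_s:
        raise ValueError("first onset lies outside the clip")
    return first_onset / duration_s


def meets_pre_incident_window(duration_s: float,
                              violence_onsets_s: List[float],
                              minimum: float = 0.60) -> bool:
    """True if the clip has no violence events or satisfies the >= 60% rule."""
    fraction = pre_incident_fraction(duration_s, violence_onsets_s)
    return fraction is None or fraction >= minimum


# Hypothetical numbers, for illustration only (not taken from delivery-003):
# a 120 s clip whose first violence event starts at 80 s passes the check,
# while the same clip with an onset at 60 s (50% of the clip) does not.
print(meets_pre_incident_window(120.0, [80.0, 101.5]))
print(meets_pre_incident_window(120.0, [60.0]))
```

A check like this is mainly useful as a loader-time sanity test: a violent clip that fails it has either a labelling problem or a scene that does not follow the profile above.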

What 'Tier A' means here

Tier A audio is the direct TTS-mixer output after preprocessing: peak-normalised, silence-padded, 16 kHz mono PCM. No room IR, no microphone profile, no background noise. The acoustic_scene fields room_type, device, ir_source, and snr_db_actual are all null for every Tier A clip.

Delivery-003 applies no Tier-A device augmentation yet: the device profiles (phone_in_pocket, phone_on_table, phone_in_hand) exist in the pipeline but are not applied at this stage. When a future delivery adds them, the acoustic_scene block will start carrying device while room_type stays null.
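The null-field rule is easy to enforce in a loader. The following is a sketch under assumptions: it takes the per-clip metadata as an already-parsed dict named meta with an acoustic_scene key, and it treats a missing block or missing field as null. The function name and the allow_device switch are hypothetical, added only to illustrate how the check could be relaxed once device profiles are applied.

```python
from typing import Any, Dict, List

# Fields that are null for every Tier A clip, per the description above.
TIER_A_NULL_FIELDS = ("room_type", "device", "ir_source", "snr_db_actual")


def tier_a_violations(meta: Dict[str, Any], allow_device: bool = False) -> List[str]:
    """Return the acoustic_scene fields that are unexpectedly non-null.

    Assumption: a missing acoustic_scene block, or a missing field inside it,
    counts as null. Set allow_device=True for a future delivery in which
    Tier A clips carry a device profile but still no room.
    """
    scene = meta.get("acoustic_scene") or {}
    fields = [f for f in TIER_A_NULL_FIELDS if not (allow_device and f == "device")]
    return [f for f in fields if scene.get(f) is not None]


# Illustrative metadata dicts (hand-written, not copied from the corpus).
delivery_003_style = {
    "project": "she_proves",
    "acoustic_scene": {"room_type": None, "device": None,
                       "ir_source": None, "snr_db_actual": None},
}
future_style = {
    "project": "she_proves",
    "acoustic_scene": {"room_type": None, "device": "phone_in_pocket",
                       "ir_source": None, "snr_db_actual": None},
}

print(tier_a_violations(delivery_003_style))              # no violations
print(tier_a_violations(future_style))                    # reports "device"
print(tier_a_violations(future_style, allow_device=True)) # no violations
```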

Speaker pairs

Two pairs in delivery-003, one per TTS backend. Both pairs play the AGG (aggressor, male) + VIC (victim, female) roles.

Azure pair (10 clips)

Speaker directory: data/he/agg_m_30-45_001/

Role	speaker_id	TTS voice
AGG	`AGG_M_30-45_001`	`he-IL-AvriNeural`
VIC	`VIC_F_25-40_002`	`he-IL-HilaNeural`

Google Chirp HD pair (2 clips)

Speaker directory: data/he/agg_m_30-45_002/

Role	speaker_id	TTS voice
AGG	`AGG_M_30-45_002`	`he-IL-Chirp3-HD-Achird`
VIC	`VIC_F_25-40_003`	`he-IL-Chirp3-HD-Achernar`

The Google pair was added in delivery-003 specifically to introduce backend diversity. Both of its clips carry the vic_f0_high flag; see Audio Format.

Gotcha #2: don't hardcode agg_m_30-45_001/ — three speaker directories exist now, including one for Elephant. Filter on manifest.csv["project"] == "she_proves" or on meta["project"].

Clips in delivery-003

12 clips · ~24 min · 6 violent (SV + IT), 6 non-violent (NEG + NEU)

Full clip listing

Azure pair, data/he/agg_m_30-45_001/ — 10 clips:

Clip ID	Typology	violent	Duration
`sp_sv_a_0001_00`	SV	✓	1m 50.5s
`sp_sv_a_0002_00`	SV	✓	1m 32.1s
`sp_it_a_0001_00`	IT	✓	2m 23.8s
`sp_it_a_0002_00`	IT	✓	2m 19.7s
`sp_neg_a_0001_00`	NEG	—	1m 58.8s
`sp_neg_a_0002_00`	NEG	—	1m 47.8s
`sp_neg_a_0003_00`	NEG	—	2m 26.3s
`sp_neu_a_0001_00`	NEU	—	1m 59.2s
`sp_neu_a_0002_00`	NEU	—	2m 09.0s
`sp_neu_a_0003_00`	NEU	—	1m 45.1s

Google Chirp HD pair, data/he/agg_m_30-45_002/ — 2 clips:

Clip ID	Typology	violent	Duration	Flags
`sp_sv_a_0003_00`	SV	✓	1m 42.8s	`vic_f0_high`
`sp_it_a_0003_00`	IT	✓	1m 53.9s	`vic_f0_high`

The waveform on the home page is sp_sv_a_0001_00 — a worked example of an SV escalation arc in this project's data.

Loading just the She-Proves clips

from pathlib import Path

import pandas as pd
import soundfile as sf

root = Path(".")  # corpus root; manifest and wav paths are resolved against it
df = pd.read_csv(root / "data/he/manifest.csv")
sp = df[df["project"] == "she_proves"]                   # 12 rows

# Tag backend per row (Google clips have "Chirp" in voice_families)
sp = sp.assign(backend=sp["voice_families"].str.contains("Chirp").map({True: "google", False: "azure"}))
print(sp.groupby("backend")["clip_id"].count())
# azure     10
# google     2

# Load audio for each row; sf.read returns a (samples, sample_rate) tuple
audio = {row.clip_id: sf.read(root / row.wav_path) for row in sp.itertuples()}
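Before training on the loaded clips, it can be worth confirming that each file really is 16 kHz mono, as the Tier A description states. The sketch below uses only the standard-library wave module; the function name is hypothetical, and it assumes the clips are integer-PCM WAV files (wave.open raises wave.Error on other encodings, which is itself a useful signal).

```python
import wave
from typing import List


def wav_format_problems(path, expected_rate: int = 16_000,
                        expected_channels: int = 1) -> List[str]:
    """Return human-readable mismatches between a WAV header and the expected format."""
    problems = []
    with wave.open(str(path), "rb") as wf:
        if wf.getframerate() != expected_rate:
            problems.append(f"sample rate {wf.getframerate()} Hz, expected {expected_rate} Hz")
        if wf.getnchannels() != expected_channels:
            problems.append(f"{wf.getnchannels()} channels, expected {expected_channels}")
    return problems


# Usage with the `sp` frame and `root` path from the loading snippet above:
#
#     for row in sp.itertuples():
#         for problem in wav_format_problems(root / row.wav_path):
#             print(row.clip_id, problem)
```

Only the header is read, so the check stays cheap even when run over the whole manifest.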

Training-time notes (specific to this project)

NEG clips are your hardest negatives. sp_neg_a_* clips have raised voices, distress, and arguments, yet has_violence is false. A recall-tuned detector that fires on these will tank precision. See Gotcha #1 and Gotcha #5.
Use the pre-incident window. At least the first 60% of each violent clip looks like NEU-grade speech. Train across the full clip, not only on escalation segments: the early-warning signal lives in the buildup.
Per-turn intensity is a useful auxiliary objective. EventLabel.intensity gives turn-level supervision beyond binary has_violence. An intensity regressor trained alongside the classifier can improve the classifier.
Only 2 voice families per gender in this delivery (low_voice_diversity_* is flagged at the corpus level). Expect your acoustic features to over-fit to AvriNeural and HilaNeural — track per-voice eval separately when the corpus grows.
No device/room augmentation yet on She-Proves clips. When the phone_in_pocket profile activates in a future delivery, your model will see substantially more high-frequency roll-off and handling noise than what's in delivery-003.
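Several of the notes above come down to never reporting a single pooled number. As a minimal sketch, the helper below computes the clip-level flag rate per group: for SV and IT that rate is recall, for NEG and NEU it is the false-positive rate. The function name, the row layout, and the typology and predicted keys are assumptions made for illustration; swap the group key for backend or a voice column to get the per-voice breakdown.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Mapping


def flag_rate_by_group(rows: Iterable[Mapping], group_key: str) -> Dict[Hashable, float]:
    """Fraction of clips flagged as violent, computed separately per group.

    Each row needs a boolean 'predicted' entry and a value under group_key.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [flagged, total]
    for row in rows:
        bucket = counts[row[group_key]]
        bucket[0] += int(bool(row["predicted"]))
        bucket[1] += 1
    return {group: flagged / total for group, (flagged, total) in counts.items()}


# Made-up predictions, for illustration only (not model output on delivery-003).
example_rows = [
    {"typology": "SV", "predicted": True},
    {"typology": "SV", "predicted": True},
    {"typology": "IT", "predicted": True},
    {"typology": "IT", "predicted": False},
    {"typology": "NEG", "predicted": True},
    {"typology": "NEG", "predicted": False},
    {"typology": "NEU", "predicted": False},
]

rates = flag_rate_by_group(example_rows, "typology")
for typology, rate in rates.items():
    print(f"{typology}: {rate:.2f}")
```

Reading the NEG rate next to the NEU rate shows directly how much of the false-positive budget is being spent on heated but non-violent arguments rather than on ordinary speech.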

Still a small test batch

12 clips and 4 voices is enough to wire up your data loaders, label parsers, and evaluation harness. It is not enough to train a production model. Build the plumbing; wait for the real batch.