Elephant in the Room Guide
Elephant in the Room (הפיל שבחדר) is a Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat. Optimisation target: high precision — false alarms erode trust with the security team and the workers they protect.
This page covers only what differs between Elephant clips and the rest of the corpus. For shared concepts (schema, labels, audio format) follow the cross-links.
Scene profile
| Property | Value |
|---|---|
| Project code | elephant_in_the_room (clip-id prefix el_*) |
| Tier | B (room IR + device profile + background noise applied) |
| Duration | 1–4 min |
| Alert window | Final 40% of the clip, where violence events concentrate |
| Device | pi_budget_mic |
| Room types | clinic_office, welfare_office, open_office (only clinic_office in delivery-003) |
The alert-in-final-40% constraint mirrors real-world deployment: the device picks up normal consultation audio for most of the session before any threat emerges. The model must distinguish escalation from a baseline of routine professional interaction.
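One way to carry this constraint into evaluation is to tag each analysis window by whether it overlaps the alert window. The sketch below is illustrative only: the helper name `tag_windows`, the 4 s window and the 2 s hop are assumptions, since this guide does not prescribe window sizes.

```python
def tag_windows(duration, win=4.0, hop=2.0, alert_fraction=0.40):
    """Return (start, end, in_alert) tuples for fixed-length windows over a clip.

    win and hop are illustrative values, not taken from the dataset spec.
    """
    alert_start = (1.0 - alert_fraction) * duration
    windows = []
    start = 0.0
    while start + win <= duration:
        end = start + win
        # A window counts as "in alert" if any part of it lies past alert_start.
        windows.append((start, end, end > alert_start))
        start += hop
    return windows


# 147.0 s matches the longest SV clip in delivery-003 (2m 27.0s).
windows = tag_windows(duration=147.0)
n_alert = sum(1 for _, _, in_alert in windows if in_alert)
print(f"{len(windows)} windows, {n_alert} overlap the alert window")
```

Windows outside the alert span are the "routine consultation" baseline, so per-window false-alarm rates can be reported separately for the two regions.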
What Tier B adds (and why)
Tier B clips run through three augmentation stages after preprocessing. This is what separates them from She-Proves (Tier A) clips.
| Stage | What it adds | Where to find it in metadata |
|---|---|---|
| Room IR convolution | Reverb of a real-sounding room | acoustic_scene.room_type, ir_source |
| Device profile | Frequency response of a budget Pi microphone | acoustic_scene.device |
| Background noise injection | HVAC hum + occasional ACOU_* events | acoustic_scene.background_events |
After augmentation the clip is renormalised to the same peak target (–2.0 dBFS) as Tier A, so the two tiers are comparable on the loudness dimension.
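Peak renormalisation of this kind can be sketched in a few lines. This is a minimal illustration, assuming floating-point samples where full scale is 1.0; the function name `normalise_peak` is hypothetical and the pipeline's actual implementation is not shown in this guide.

```python
import numpy as np


def normalise_peak(wav, target_dbfs=-2.0):
    """Scale a float waveform so its absolute peak sits at target_dbfs."""
    peak = np.max(np.abs(wav))
    if peak == 0.0:
        return wav.copy()  # silent clip: nothing to scale
    target_linear = 10.0 ** (target_dbfs / 20.0)  # -2 dBFS is roughly 0.794
    return wav * (target_linear / peak)


# Synthetic stand-in for an augmented clip (1 s of a 220 Hz tone at 16 kHz).
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
wav = 0.3 * np.sin(2 * np.pi * 220.0 * t)

out = normalise_peak(wav)
peak_dbfs = 20.0 * np.log10(np.max(np.abs(out)))
print(f"peak after normalisation: {peak_dbfs:.2f} dBFS")
```

Matching peaks does not imply matching perceived loudness: after reverb and noise are added, a Tier B clip with the same peak can still have a different speech level than a Tier A clip.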
What pyroomacoustics_ism is
The image-source method (ISM) synthesises a room impulse response by simulating a virtual point source reflecting off the walls of a modelled room. pyroomacoustics is the Python library that implements it. The resulting IR, when convolved with clean speech, makes the speech sound like it was recorded in the modelled room — without needing a real recording.
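To make the idea concrete, here is a toy image-source computation in plain NumPy. It is not pyroomacoustics' implementation: it assumes a shoebox room, one frequency-independent reflection coefficient (`beta`) for every wall, and nearest-sample delays. The room dimensions and positions are made up for illustration; only the 1.2 m source-to-microphone distance echoes the example metadata on this page.

```python
import numpy as np


def toy_ism_rir(room, src, mic, fs=16000, max_order=2, beta=0.7, c=343.0):
    """Toy image-source RIR for a shoebox room (all walls share coefficient beta)."""
    room, src, mic = (np.asarray(v, dtype=float) for v in (room, src, mic))
    taps = []
    idx = range(-max_order, max_order + 1)
    for nx in idx:
        for ny in idx:
            for nz in idx:
                n = np.array([nx, ny, nz])
                order = int(np.abs(n).sum())  # number of wall reflections
                if order > max_order:
                    continue
                # Mirror the source: an even index keeps the offset, an odd one flips it.
                image = np.where(n % 2 == 0, n * room + src, n * room + (room - src))
                dist = np.linalg.norm(image - mic)
                # Delay in seconds, gain with spherical spreading and wall losses.
                taps.append((dist / c, beta ** order / (4.0 * np.pi * dist)))
    length = int(np.ceil(max(delay for delay, _ in taps) * fs)) + 1
    rir = np.zeros(length)
    for delay, gain in taps:
        rir[int(round(delay * fs))] += gain
    return rir


rir = toy_ism_rir(room=[4.0, 3.0, 2.7], src=[1.0, 1.5, 1.2], mic=[2.2, 1.5, 1.2])
dry = np.random.default_rng(0).standard_normal(16000)  # stand-in for clean speech
wet = np.convolve(dry, rir)  # the "speech in the modelled room" signal
print(len(rir), len(wet))
```

The strongest tap is the direct path; the later, weaker taps are the wall reflections that give the convolved signal its room character.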
The acoustic_scene field
For Tier A clips this is all null / empty. For Elephant clips it's fully populated:

```json
"acoustic_scene": {
  "room_type": "clinic_office",
  "device": "pi_budget_mic",
  "ir_source": "pyroomacoustics_ism",
  "snr_db_actual": 11.2,
  "speaker_distance_meters": 1.2,
  "background_events": [
    {"type": "hvac_hum", "onset": 0.000, "offset": 147.031, "level_db": -37.4},
    {"type": "ACOU_SLAM", "onset": 72.164, "offset": 72.476, "level_db": 9.9},
    {"type": "ACOU_FALL", "onset": 97.570, "offset": 98.473, "level_db": 9.6}
  ]
}
```
| Field | What it tells you |
|---|---|
| room_type | Modelled room (clinic_office / welfare_office / open_office) |
| device | Microphone profile applied (pi_budget_mic) |
| ir_source | How the room IR was generated (currently always pyroomacoustics_ism) |
| snr_db_actual | Measured speech-to-noise ratio in dB after mixing; your ground truth for SNR-stratified eval |
| speaker_distance_meters | Simulated distance from speaker to microphone |
| background_events | List of non-speech acoustic events: hvac_hum (constant low-level), ACOU_SLAM / ACOU_FALL (brief, high-level) |
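If you prefer typed access over raw dictionaries, the fields above map naturally onto dataclasses. This is a sketch: the class names `AcousticScene` and `BackgroundEvent` are hypothetical, and it assumes the block contains exactly the six keys listed here (an extra or missing key raises a `TypeError`).

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class BackgroundEvent:
    type: str
    onset: float
    offset: float
    level_db: float


@dataclass
class AcousticScene:
    room_type: str
    device: str
    ir_source: str
    snr_db_actual: float
    speaker_distance_meters: float
    background_events: List[BackgroundEvent]

    @classmethod
    def from_dict(cls, d):
        # A null or absent list (as described for Tier A) becomes an empty list.
        events = [BackgroundEvent(**e) for e in d.get("background_events") or []]
        return cls(**{**d, "background_events": events})


# The example block from this page, wrapped in braces to make it valid JSON.
raw = json.loads("""
{"acoustic_scene": {
  "room_type": "clinic_office",
  "device": "pi_budget_mic",
  "ir_source": "pyroomacoustics_ism",
  "snr_db_actual": 11.2,
  "speaker_distance_meters": 1.2,
  "background_events": [
    {"type": "hvac_hum", "onset": 0.000, "offset": 147.031, "level_db": -37.4},
    {"type": "ACOU_SLAM", "onset": 72.164, "offset": 72.476, "level_db": 9.9},
    {"type": "ACOU_FALL", "onset": 97.570, "offset": 98.473, "level_db": 9.6}
  ]
}}
""")

scene = AcousticScene.from_dict(raw["acoustic_scene"])
brief = [e.type for e in scene.background_events if e.type.startswith("ACOU_")]
print(scene.room_type, scene.snr_db_actual, brief)
```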
ACOU_* events are double-recorded
Each ACOU_SLAM / ACOU_FALL event lives in both acoustic_scene.background_events (with level_db mixing metadata) and the .jsonl strong-label file (as a regular EventLabel with tier1_category: "ACOU"). The two views are deliberate — the first carries audio-level provenance, the second is the supervision target. If you train an event detector, use the .jsonl view.
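Because the same event lives in two places, a consistency check is cheap insurance when wiring up loaders. The sketch below assumes strong-label records expose `onset` and `offset` in seconds on the same timeline as `background_events`; the helper name and the 50 ms tolerance are arbitrary choices, not part of the dataset spec.

```python
def unmatched_acou_events(background_events, strong_labels, tol=0.05):
    """Return ACOU_* background events with no ACOU strong label at the same times."""
    acou_labels = [e for e in strong_labels if e.get("tier1_category") == "ACOU"]
    missing = []
    for bg in background_events:
        if not bg["type"].startswith("ACOU_"):
            continue  # hvac_hum and similar are not supervision targets
        hit = any(
            abs(lab["onset"] - bg["onset"]) <= tol
            and abs(lab["offset"] - bg["offset"]) <= tol
            for lab in acou_labels
        )
        if not hit:
            missing.append(bg)
    return missing


# Background events copied from the example above; the strong labels are
# hand-written stand-ins with the ACOU_FALL label deliberately left out.
background = [
    {"type": "hvac_hum", "onset": 0.000, "offset": 147.031, "level_db": -37.4},
    {"type": "ACOU_SLAM", "onset": 72.164, "offset": 72.476, "level_db": 9.9},
    {"type": "ACOU_FALL", "onset": 97.570, "offset": 98.473, "level_db": 9.6},
]
strong = [{"tier1_category": "ACOU", "onset": 72.164, "offset": 72.476}]

missing = unmatched_acou_events(background, strong)
print([m["type"] for m in missing])
```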
Speaker pair
One pair in delivery-003. Roles match Israeli welfare/clinic demographics: BEN (client/service-user, male) + SW (social worker, female).
Speaker directory: data/he/ben_m_40-55_003/
| Role | speaker_id | TTS voice |
|---|---|---|
| BEN | BEN_M_40-55_003 | he-IL-AvriNeural |
| SW | SW_F_30-45_001 | he-IL-HilaNeural |
Both speakers use the Azure backend. See Glossary — Speaker roles if BEN and SW are new abbreviations.
Clips in delivery-003
8 clips · ~18 min · 4 violent (SV + IT), 4 non-violent (NEG + NEU) · all room_type: clinic_office, all device: pi_budget_mic, SNR ~11 dB
Full clip listing
All in data/he/ben_m_40-55_003/:
| Clip ID | Typology | violent | Duration |
|---|---|---|---|
| el_sv_b_0001_00 | SV | ✓ | 2m 27.0s |
| el_sv_b_0002_00 | SV | ✓ | 2m 18.5s |
| el_it_b_0001_00 | IT | ✓ | 2m 30.0s |
| el_it_b_0002_00 | IT | ✓ | 2m 31.6s |
| el_neg_b_0001_00 | NEG | ✗ | 1m 53.8s |
| el_neg_b_0002_00 | NEG | ✗ | 2m 54.6s |
| el_neu_b_0001_00 | NEU | ✗ | 1m 56.9s |
| el_neu_b_0002_00 | NEU | ✗ | 1m 19.7s |
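The clip IDs follow a regular pattern, so typology and the violent flag can be recovered from the ID when the manifest is not at hand. The pattern below is inferred from the eight IDs in this listing, not from a published spec. The meaning of the `b` segment and of the trailing two digits is an assumption, so the sketch keeps them as raw strings. Treat the manifest as the authoritative source.

```python
import re

# Pattern inferred from the eight listed IDs (an assumption, not a spec).
CLIP_ID = re.compile(
    r"^el_(?P<typology>[a-z]+)_(?P<seg3>[a-z])_(?P<index>\d{4})_(?P<suffix>\d{2})$"
)
# Per the delivery summary: SV and IT are violent, NEG and NEU are not.
VIOLENT_TYPOLOGIES = {"SV", "IT"}


def parse_clip_id(clip_id):
    """Split an Elephant clip ID into typology, violent flag, index and raw suffix."""
    m = CLIP_ID.match(clip_id)
    if m is None:
        raise ValueError(f"not an Elephant clip id: {clip_id!r}")
    typology = m["typology"].upper()
    return {
        "typology": typology,
        "violent": typology in VIOLENT_TYPOLOGIES,
        "index": int(m["index"]),
        "suffix": m["suffix"],
    }


for cid in ("el_sv_b_0001_00", "el_neg_b_0002_00"):
    print(cid, parse_clip_id(cid))
```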
Loading and inspecting an Elephant clip
```python
import json
from pathlib import Path

import pandas as pd
import soundfile as sf

root = Path(".")
df = pd.read_csv(root / "data/he/manifest.csv")
el = df[df["project"] == "elephant_in_the_room"]  # 8 rows

# Pick one clip
row = el.iloc[0]
wav_path = root / row.wav_path
wav, sr = sf.read(str(wav_path))  # str() keeps older soundfile versions happy
meta = json.loads(wav_path.with_suffix(".json").read_text(encoding="utf-8"))

# Acoustic scene
scene = meta["acoustic_scene"]
print(f"{scene['room_type']}  {scene['device']}  SNR {scene['snr_db_actual']} dB  "
      f"dist {scene['speaker_distance_meters']} m")

# Background acoustic events
for evt in scene["background_events"]:
    print(f"  {evt['type']:10s} {evt['onset']:6.1f}s – {evt['offset']:6.1f}s @ {evt['level_db']:+5.1f} dB")

# Alert window (final 40%), for sliding-window evaluation
duration = meta["duration_seconds"]
alert_start = 0.60 * duration
labels_text = (root / row.strong_labels_path).read_text(encoding="utf-8")
events = [json.loads(line) for line in labels_text.splitlines() if line.strip()]
alert_events = [e for e in events if e["onset"] >= alert_start]
print(f"alert window: {alert_start:.1f}s – {duration:.1f}s  "
      f"{len(alert_events)} events fire in window "
      f"(of {len(events)} total)")
```
Training-time notes (specific to this project)
- NEG clips are essential for precision. el_neg_b_* is intense speech in a clinic room with background noise but no violence. If your detector fires on these, security stops trusting it. Train hard against these.
- The alert-in-final-40% structure is exploitable. Consider a sliding-window detector that biases toward the final 40% of each clip, or use the window structure as a positional feature. Don't reward early firing.
- SNR ~11 dB is challenging. Verify your features (MFCCs, log-mel, etc.) are robust here before comparing with She-Proves Tier A results. SNR is recorded per clip (acoustic_scene.snr_db_actual); use it for SNR-stratified eval.
- ACOU_* events double as strong labels. You can train an event detector on ACOU_SLAM / ACOU_FALL separately from the speech-violence detector and ensemble them.
- What delivery-003 doesn't cover: only clinic_office, only one speaker pair (BEN+SW), only Azure backend, SNR essentially constant at ~11 dB. Plan for room diversity, SNR stratification, and speaker-disjoint splits when scaling.
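SNR-stratified precision, as suggested in the notes above, can be wired up before more varied data arrives. The sketch below is a minimal illustration: the bucket edges, the function names and the `(snr_db_actual, predicted_violent, is_violent)` tuple layout are all assumptions, and the sample rows are invented rather than real results. With delivery-003 every clip lands in a single bucket, so this is plumbing only.

```python
import bisect
from collections import defaultdict


def snr_bucket(snr_db, edges=(5.0, 10.0, 15.0, 20.0)):
    """Map an SNR in dB to a bucket label; lower edges are inclusive."""
    labels = (
        [f"<{edges[0]:g}"]
        + [f"{lo:g}-{hi:g}" for lo, hi in zip(edges, edges[1:])]
        + [f">={edges[-1]:g}"]
    )
    return labels[bisect.bisect_right(edges, snr_db)]


def precision_by_snr(results, edges=(5.0, 10.0, 15.0, 20.0)):
    """results: iterable of (snr_db_actual, predicted_violent, is_violent) per clip."""
    tp, fp = defaultdict(int), defaultdict(int)
    for snr_db, predicted, actual in results:
        if not predicted:
            continue  # precision only looks at clips where the detector fired
        bucket = snr_bucket(snr_db, edges)
        if actual:
            tp[bucket] += 1
        else:
            fp[bucket] += 1
    return {b: tp[b] / (tp[b] + fp[b]) for b in set(tp) | set(fp)}


# Invented clip-level outcomes, all near 11 dB like delivery-003.
results = [
    (11.2, True, True),
    (11.0, True, False),   # a false alarm, e.g. on a NEG clip
    (11.4, False, False),
    (11.1, True, True),
]
report = precision_by_snr(results)
print(report)
```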
Still a small test batch
8 clips, 1 room type, 1 speaker pair, 1 SNR is enough to wire up data loaders and acoustic-scene parsing. It is not enough to train a production model. Build the plumbing; wait for the real batch.