Elephant in the Room Guide

Elephant in the Room (הפיל שבחדר) is a Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat. The optimisation target is high precision: false alarms erode trust with the security team and with the workers the device protects.

This page covers only what differs between Elephant clips and the rest of the corpus. For shared concepts (schema, labels, audio format) follow the cross-links.


Scene profile

| Property | Value |
| --- | --- |
| Project code | elephant_in_the_room (clip-id prefix el_*) |
| Tier | B (room IR + device profile + background noise applied) |
| Duration | 1–4 min |
| Alert window | Final 40% of the clip (violence events concentrate here) |
| Device | pi_budget_mic |
| Room types | clinic_office, welfare_office, open_office (only clinic_office in delivery-003) |

The alert-in-final-40% constraint mirrors real-world deployment: the device picks up normal consultation audio for most of the session before any threat emerges. The model must distinguish escalation from a baseline of routine professional interaction.
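
As a rough illustration of how the final-40% constraint might be turned into per-window targets, the sketch below cuts a clip into sliding windows and flags the ones that overlap the alert window. The function name, the 2 s window and the 1 s hop are hypothetical choices, not values taken from the corpus.

```python
def alert_window_flags(duration_s, win_s=2.0, hop_s=1.0, alert_fraction=0.40):
    """Return (start, end, relative_position, in_alert_window) per window.

    win_s, hop_s and the function itself are illustrative assumptions;
    only the 40% alert fraction comes from the scene profile.
    """
    alert_start = (1.0 - alert_fraction) * duration_s
    n_windows = int((duration_s - win_s) // hop_s) + 1 if duration_s >= win_s else 0
    windows = []
    for i in range(n_windows):
        start = i * hop_s
        end = start + win_s
        centre = start + win_s / 2
        # relative_position in [0, 1] can serve as a positional feature
        windows.append((start, end, centre / duration_s, end > alert_start))
    return windows


windows = alert_window_flags(147.0)
print(len(windows), "windows;", sum(w[3] for w in windows), "overlap the alert window")
```

Windows that only partly overlap the alert window are flagged here; a stricter rule (for example, requiring the window centre to fall inside it) may suit some evaluations better.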


What Tier B adds (and why)

Tier B clips run through three augmentation stages after preprocessing. This is what separates them from She-Proves (Tier A) clips.

| Stage | What it adds | Where to find it in metadata |
| --- | --- | --- |
| Room IR convolution | Reverb of a real-sounding room | acoustic_scene.room_type, ir_source |
| Device profile | Frequency response of a budget Pi microphone | acoustic_scene.device |
| Background noise injection | HVAC hum + occasional ACOU_* events | acoustic_scene.background_events |

After augmentation the clip is renormalised to the same peak target (-2.0 dBFS) as Tier A, so the two tiers are comparable on the loudness dimension.
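
For readers who want to check or reproduce the peak target, peak normalisation to a dBFS level can be sketched as follows. This assumes floating-point audio where full scale is 1.0; the function name is hypothetical and the corpus pipeline's own implementation is not shown in this guide.

```python
import numpy as np


def peak_normalise(wav, target_dbfs=-2.0):
    """Scale a float waveform so its absolute peak sits at target_dbfs.

    Assumes full scale = 1.0 (as returned by soundfile for float reads).
    """
    wav = np.asarray(wav, dtype=np.float64)
    peak = np.max(np.abs(wav))
    if peak == 0.0:
        return wav.copy()  # silent clip: nothing to scale
    target_linear = 10.0 ** (target_dbfs / 20.0)  # -2.0 dBFS is about 0.794
    return wav * (target_linear / peak)


# Usage: a quiet 440 Hz tone is brought up to the -2.0 dBFS peak target
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
tone = 0.3 * np.sin(2 * np.pi * 440.0 * t)
normalised = peak_normalise(tone)
print(f"peak after normalisation: {20 * np.log10(np.max(np.abs(normalised))):.2f} dBFS")
```

Note that a shared peak level does not guarantee equal perceived loudness; it only fixes the maximum sample magnitude.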

What pyroomacoustics_ism is

The image-source method (ISM) synthesises a room impulse response by mirroring a point source across the walls of a modelled room: each mirrored "image" source stands in for one reflection path, and the IR is the sum of their delayed, attenuated arrivals at the microphone. pyroomacoustics is the Python library that implements it. The resulting IR, when convolved with clean speech, makes the speech sound as if it had been recorded in the modelled room, without needing a real recording.
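
To make the idea concrete, here is a toy image-source computation for a shoebox room written in plain NumPy. It is not the pyroomacoustics implementation and not the corpus pipeline: it uses one uniform wall reflection coefficient (`beta`), integer-sample delays and simple 1/(4πd) spreading, and the room dimensions and positions are invented for illustration.

```python
import itertools

import numpy as np


def toy_ism_rir(room_dim, source, mic, fs=16000, max_order=2,
                beta=0.8, c=343.0, length_s=0.2):
    """Toy image-source RIR for a shoebox room (illustrative assumptions only)."""
    mic = np.asarray(mic, dtype=float)
    rir = np.zeros(int(length_s * fs))
    # Per axis: (image coordinate, number of wall reflections on that axis)
    per_axis = []
    for length, s in zip(room_dim, source):
        images = []
        for n in range(-max_order, max_order + 1):
            images.append((2 * n * length + s, abs(2 * n)))
            images.append((2 * n * length - s, abs(2 * n - 1)))
        per_axis.append(images)
    for (x, rx), (y, ry), (z, rz) in itertools.product(*per_axis):
        order = rx + ry + rz
        if order > max_order:
            continue  # keep only paths with at most max_order reflections
        dist = np.linalg.norm(np.array([x, y, z]) - mic)
        idx = int(round(dist / c * fs))  # propagation delay in samples
        if idx < len(rir):
            rir[idx] += beta ** order / (4 * np.pi * max(dist, 1e-3))
    return rir


# Hypothetical 4 x 3 x 2.5 m office, speaker 1.2 m from the microphone
rir = toy_ism_rir((4.0, 3.0, 2.5), (1.0, 1.0, 1.2), (2.2, 1.0, 1.2))
dry = np.random.default_rng(0).standard_normal(16000)  # stand-in for clean speech
wet = np.convolve(dry, rir)  # "speech in the room"
print(f"direct path arrives at sample {int(np.flatnonzero(rir)[0])}")
```

The first non-zero tap is the direct path; later taps are the wall reflections that give the clip its reverberant character.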


The acoustic_scene field

For Tier A clips every field here is null or empty. For Elephant clips it's fully populated:

"acoustic_scene": {
  "room_type":                "clinic_office",
  "device":                   "pi_budget_mic",
  "ir_source":                "pyroomacoustics_ism",
  "snr_db_actual":            11.2,
  "speaker_distance_meters":  1.2,
  "background_events": [
    {"type": "hvac_hum",  "onset":  0.000, "offset": 147.031, "level_db": -37.4},
    {"type": "ACOU_SLAM", "onset": 72.164, "offset":  72.476, "level_db":   9.9},
    {"type": "ACOU_FALL", "onset": 97.570, "offset":  98.473, "level_db":   9.6}
  ]
}

| Field | What it tells you |
| --- | --- |
| room_type | Modelled room (clinic_office / welfare_office / open_office) |
| device | Microphone profile applied (pi_budget_mic) |
| ir_source | How the room IR was generated (currently always pyroomacoustics_ism) |
| snr_db_actual | Measured speech-to-noise ratio in dB after mixing; your ground truth for SNR-stratified eval |
| speaker_distance_meters | Simulated distance from speaker to microphone |
| background_events | List of non-speech acoustic events: hvac_hum (constant low-level), ACOU_SLAM / ACOU_FALL (brief, high-level) |
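
One way to work with this field in code is to load the events into small typed records and separate the brief ACOU_* events from continuous background such as hvac_hum. The class and function names below are hypothetical, and the sketch assumes each event carries exactly the four keys shown in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackgroundEvent:
    type: str
    onset: float      # seconds from clip start
    offset: float     # seconds from clip start
    level_db: float

    @property
    def duration(self):
        return self.offset - self.onset


def split_background_events(scene):
    """Return (impulsive ACOU_* events, other background events).

    Tolerates Tier A clips, where background_events is null or empty.
    """
    events = [BackgroundEvent(**e) for e in (scene.get("background_events") or [])]
    impulsive = [e for e in events if e.type.startswith("ACOU_")]
    continuous = [e for e in events if not e.type.startswith("ACOU_")]
    return impulsive, continuous


scene = {
    "room_type": "clinic_office",
    "background_events": [
        {"type": "hvac_hum", "onset": 0.000, "offset": 147.031, "level_db": -37.4},
        {"type": "ACOU_SLAM", "onset": 72.164, "offset": 72.476, "level_db": 9.9},
        {"type": "ACOU_FALL", "onset": 97.570, "offset": 98.473, "level_db": 9.6},
    ],
}
impulsive, continuous = split_background_events(scene)
for evt in impulsive:
    print(f"{evt.type}: {evt.duration:.3f}s at {evt.level_db:+.1f} dB")
```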

ACOU_* events are double-recorded

Each ACOU_SLAM / ACOU_FALL event lives in both acoustic_scene.background_events (with level_db mixing metadata) and the .jsonl strong-label file (as a regular EventLabel with tier1_category: "ACOU"). The two views are deliberate — the first carries audio-level provenance, the second is the supervision target. If you train an event detector, use the .jsonl view.
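
Because the same event appears in two places, a loader can sanity-check that the two views agree. The sketch below is one possible check; it assumes each strong label exposes `onset` and `offset` in seconds alongside `tier1_category` (only `onset` and `tier1_category` are confirmed on this page), and the 50 ms tolerance is an arbitrary choice.

```python
def unmatched_acou_events(background_events, strong_labels, tol_s=0.05):
    """Return ACOU_* background events with no matching strong label.

    Assumes strong labels are dicts with "onset", "offset" (seconds) and
    "tier1_category"; tol_s is a hypothetical matching tolerance.
    """
    acou_labels = [lab for lab in strong_labels if lab.get("tier1_category") == "ACOU"]
    missing = []
    for evt in background_events:
        if not evt["type"].startswith("ACOU_"):
            continue  # hvac_hum and similar are not supervision targets
        matched = any(
            abs(lab["onset"] - evt["onset"]) <= tol_s
            and abs(lab["offset"] - evt["offset"]) <= tol_s
            for lab in acou_labels
        )
        if not matched:
            missing.append(evt)
    return missing


background = [
    {"type": "hvac_hum", "onset": 0.0, "offset": 147.031, "level_db": -37.4},
    {"type": "ACOU_SLAM", "onset": 72.164, "offset": 72.476, "level_db": 9.9},
    {"type": "ACOU_FALL", "onset": 97.570, "offset": 98.473, "level_db": 9.6},
]
labels = [
    {"tier1_category": "ACOU", "onset": 72.164, "offset": 72.476},
]
print(unmatched_acou_events(background, labels))  # the ACOU_FALL entry has no label here
```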


Speaker pair

One pair in delivery-003. Roles match Israeli welfare/clinic demographics: BEN (client/service-user, male) + SW (social worker, female).

Speaker directory: data/he/ben_m_40-55_003/

| Role | speaker_id | TTS voice |
| --- | --- | --- |
| BEN | BEN_M_40-55_003 | he-IL-AvriNeural |
| SW | SW_F_30-45_001 | he-IL-HilaNeural |

Both speakers use the Azure backend. See Glossary — Speaker roles if BEN and SW are new abbreviations.


Clips in delivery-003

8 clips · ~18 min · 4 violent (SV + IT), 4 non-violent (NEG + NEU) · all room_type: clinic_office, all device: pi_budget_mic, SNR ~11 dB

Full clip listing

All in data/he/ben_m_40-55_003/:

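| Clip ID | Typology | violent | Duration |
| --- | --- | --- | --- |
| el_sv_b_0001_00 | SV | yes | 2m 27.0s |
| el_sv_b_0002_00 | SV | yes | 2m 18.5s |
| el_it_b_0001_00 | IT | yes | 2m 30.0s |
| el_it_b_0002_00 | IT | yes | 2m 31.6s |
| el_neg_b_0001_00 | NEG | no | 1m 53.8s |
| el_neg_b_0002_00 | NEG | no | 2m 54.6s |
| el_neu_b_0001_00 | NEU | no | 1m 56.9s |
| el_neu_b_0002_00 | NEU | no | 1m 19.7s |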
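
If only a clip ID is at hand, the typology and the violent / non-violent split can be recovered from it. The sketch below is a convenience under stated assumptions: reading the `b` segment as the tier and the trailing two digits as a variant suffix is a guess from the IDs listed here, and the function name is hypothetical. Where the manifest or metadata carries an explicit label, prefer that.

```python
import re

VIOLENT = {"SV", "IT"}          # violent typologies in delivery-003
NON_VIOLENT = {"NEG", "NEU"}    # non-violent typologies in delivery-003

# Assumed layout: el_<typology>_<tier>_<4-digit index>_<2-digit suffix>
CLIP_ID_RE = re.compile(
    r"^el_(?P<typology>[a-z]+)_(?P<tier>[a-z])_(?P<index>\d{4})_(?P<suffix>\d{2})$"
)


def parse_clip_id(clip_id):
    """Split an Elephant clip ID into its parts and derive the binary label."""
    match = CLIP_ID_RE.match(clip_id)
    if match is None:
        raise ValueError(f"not an Elephant clip ID: {clip_id!r}")
    typology = match["typology"].upper()
    if typology not in VIOLENT | NON_VIOLENT:
        raise ValueError(f"unknown typology {typology!r} in {clip_id!r}")
    return {
        "typology": typology,
        "tier": match["tier"].upper(),
        "index": int(match["index"]),
        "suffix": match["suffix"],
        "violent": typology in VIOLENT,
    }


print(parse_clip_id("el_sv_b_0001_00"))
print(parse_clip_id("el_neu_b_0002_00"))
```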

Loading and inspecting an Elephant clip

import json
from pathlib import Path

import pandas as pd
import soundfile as sf

root = Path(".")  # corpus root; paths in the manifest are joined onto it below
df = pd.read_csv(root / "data/he/manifest.csv")
el = df[df["project"] == "elephant_in_the_room"]    # 8 rows

# Pick one clip
row = el.iloc[0]
wav, sr = sf.read(root / row.wav_path)
meta = json.loads((root / row.wav_path).with_suffix(".json").read_text(encoding="utf-8"))

# Acoustic scene
scene = meta["acoustic_scene"]
print(f"{scene['room_type']}  {scene['device']}  SNR {scene['snr_db_actual']} dB  "
      f"dist {scene['speaker_distance_meters']} m")

# Background acoustic events
for evt in scene["background_events"]:
    print(f"  {evt['type']:10s}  {evt['onset']:6.1f}s – {evt['offset']:6.1f}s  @ {evt['level_db']:+5.1f} dB")

# Alert window (final 40%) — for sliding-window evaluation
duration = meta["duration_seconds"]
alert_start = 0.60 * duration

strong_labels_text = (root / row.strong_labels_path).read_text(encoding="utf-8")
events = [json.loads(line) for line in strong_labels_text.splitlines() if line.strip()]
alert_events = [e for e in events if e["onset"] >= alert_start]
print(f"alert window: {alert_start:.1f}s – {duration:.1f}s   "
      f"{len(alert_events)} events fire in window  "
      f"(of {len(events)} total)")

Training-time notes (specific to this project)

  • NEG clips are essential for precision. el_neg_b_* is intense speech in a clinic room with background noise but no violence. If your detector fires on these, security stops trusting it. Train hard against these.
  • The alert-in-final-40% structure is exploitable. Consider a sliding-window detector that biases toward the final 40% of each clip, or use the relative position within the clip as a feature. Don't reward early firing.
  • SNR ~11 dB is challenging. Verify your features (MFCCs, log-mel, etc.) are robust here before comparing with She-Proves Tier A results. SNR is recorded per clip (acoustic_scene.snr_db_actual) — use it for SNR-stratified eval.
  • ACOU_* events double as strong labels. You can train an event detector on ACOU_SLAM / ACOU_FALL separately from the speech-violence detector and ensemble them.
  • What delivery-003 doesn't cover: only clinic_office, only one speaker pair (BEN+SW), only Azure backend, SNR essentially constant at ~11 dB. Plan for room diversity, SNR stratification, and speaker-disjoint splits when scaling.
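
The SNR-stratified evaluation mentioned in the notes above could be sketched as clip-level precision per SNR bucket. The result-record layout (`snr_db`, `predicted`, `violent`), the bucket edges and the function names are all hypothetical; with delivery-003 every clip lands in the same bucket, so this only becomes informative once a wider SNR range is delivered.

```python
import bisect
from collections import defaultdict


def snr_bucket(snr_db, edges):
    """Map an SNR value to a label such as '10-15 dB' given sorted bin edges."""
    i = bisect.bisect_right(edges, snr_db)
    if i == 0:
        return f"<{edges[0]:g} dB"
    if i == len(edges):
        return f">={edges[-1]:g} dB"
    return f"{edges[i - 1]:g}-{edges[i]:g} dB"


def precision_by_snr(results, edges=(5.0, 10.0, 15.0)):
    """Clip-level precision per SNR bucket.

    results: iterable of dicts with snr_db (float, e.g. snr_db_actual),
    predicted (bool, detector fired) and violent (bool, ground truth).
    Buckets in which the detector never fired are omitted.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0})
    for r in results:
        if not r["predicted"]:
            continue  # precision only looks at clips where an alert fired
        bucket = snr_bucket(r["snr_db"], edges)
        counts[bucket]["tp" if r["violent"] else "fp"] += 1
    return {b: c["tp"] / (c["tp"] + c["fp"]) for b, c in counts.items()}


results = [
    {"snr_db": 11.2, "predicted": True, "violent": True},
    {"snr_db": 11.2, "predicted": True, "violent": True},
    {"snr_db": 11.0, "predicted": True, "violent": False},   # false alarm on a NEG clip
    {"snr_db": 11.1, "predicted": False, "violent": False},
]
print(precision_by_snr(results))
```

A false alarm on a NEG clip lowers precision in its bucket directly, which is why those clips matter so much for this project.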

Still a small test batch

Eight clips, one room type, one speaker pair and one SNR level are enough to wire up data loaders and acoustic-scene parsing. They are not enough to train a production model. Build the plumbing; wait for the real batch.