Add handling of long voice/audio messages to the agent bot

2026-06-03 00:18:30 +04:00 · 2026-06-03 00:18:30 +04:00 · 9949935bcc
commit 9949935bcc
parent 35fc6ebf62
4 changed files with 302 additions and 9 deletions
--- a/Dev_Docs/Pending_Features/2026-06-03_0013_длинные_voice_audio_telegram_бота.md
+++ b/Dev_Docs/Pending_Features/2026-06-03_0013_длинные_voice_audio_telegram_бота.md
@ -0,0 +1,13 @@
+# Long voice/audio in the agent's Telegram bot
+
+- short feature description:
+  The bot now handles long voice/audio messages more carefully: it accounts for the Telegram Bot API limit on downloading oversized files, supports an alternative `TELEGRAM_API_BASE_URL` for a local `telegram-bot-api` server, re-encodes long audio locally with `ffmpeg`, splits it into chunks, and sends the chunks to OpenAI transcription sequentially.
+- what to verify:
+  1. A short `voice` message is still transcribed without noticeable delay.
+  2. A long `audio`/`voice` message that fits within the Telegram download limit is re-encoded, split into chunks, and yields a single coherent transcript.
+  3. A very large file sent through the regular `https://api.telegram.org` produces a clear message about the Telegram limit.
+  4. After switching to a local `telegram-bot-api` server, the same large file downloads and is transcribed.
+- expected result:
+  The bot does not crash on long audio; it returns either a transcript or a clear explanation of which limit is in the way and what needs to be enabled.
+- status:
+  pending
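
The chunking arithmetic behind check 2 can be sketched as follows. This is a minimal sketch, assuming the defaults from this commit (24 MiB upload limit, 900 s maximum chunk, 2 s overlap). `choose_chunk_seconds` and `plan_chunk_offsets` are hypothetical standalone names that loosely mirror `_choose_transcription_chunk_seconds` and the offset loop in `py_bot_service.py`; they are not code from the patch, and the real loop additionally calls `ffmpeg` for every offset.

```python
import math


def choose_chunk_seconds(duration_s, total_bytes,
                         max_upload_bytes=24 * 1024 * 1024,
                         max_chunk_s=900, overlap_s=2):
    """Pick a chunk length that respects both the time and the size limit."""
    safe = max(60, max_chunk_s - overlap_s)
    if duration_s <= 0 or total_bytes <= 0:
        return safe
    bytes_per_second = total_bytes / max(duration_s, 1.0)
    # Keep 10% headroom under the upload limit, as the patch does.
    size_limited = int(max_upload_bytes * 0.9 / bytes_per_second)
    return max(60, min(safe, size_limited if size_limited > 0 else safe))


def plan_chunk_offsets(duration_s, chunk_s, overlap_s=2):
    """Start offsets (in seconds) of consecutive chunks that overlap by overlap_s."""
    total = max(1, math.ceil(duration_s))
    step = max(1, chunk_s - overlap_s)
    return list(range(0, total, step))


# One hour of mono audio at 24 kbps is roughly 3000 bytes per second.
chunk = choose_chunk_seconds(3600, 3600 * 3000)
offsets = plan_chunk_offsets(3600, chunk)
print(chunk, offsets)
```

With these assumed inputs the time limit, not the size limit, decides the chunk length, so a one-hour recording should end up as five parts.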
--- a/SHiNE-agent-bot-coder/README.md
+++ b/SHiNE-agent-bot-coder/README.md
@ -32,8 +32,15 @@
   - `ALLOWED_TELEGRAM_USERNAME`: the user whose messages are executed as commands.
   - `ALLOWED_TELEGRAM_PLAYERS`: whitelist of players in the format `username:Name,username2:Name2`.
   - `ALLOWED_TELEGRAM_CHANNEL_USERNAME`: the channel from which `channel_post` updates are accepted; regular group/supergroup messages are handled as `message`.
+   - `TELEGRAM_API_BASE_URL`: base URL of the Bot API; defaults to `https://api.telegram.org`. For very large voice/audio files you can run a local `telegram-bot-api` server and point the bot at it.
   - `TELEGRAM_FILE_DOWNLOAD_TIMEOUT_SECONDS`: timeout for downloading voice/audio from Telegram, 300 seconds by default.
   - `OPENAI_TRANSCRIBE_TIMEOUT_SECONDS`: timeout for voice/audio transcription in OpenAI, 900 seconds by default.
+   - `OPENAI_TRANSCRIBE_MAX_UPLOAD_BYTES`: safe size limit, in bytes, for a single chunk sent to OpenAI transcription; `24 MiB` by default.
+   - `OPENAI_TRANSCRIBE_MAX_CHUNK_SECONDS`: maximum length of a single chunk when the audio is long, `900` seconds by default.
+   - `OPENAI_TRANSCRIBE_OVERLAP_SECONDS`: overlap between adjacent chunks for smoother stitching of the text, `2` seconds by default.
+   - `OPENAI_TRANSCRIBE_REENCODE_BITRATE_KBPS`: bitrate, in kbps, for local re-encoding of long audio with `ffmpeg`; `24` by default.
+   - `OPENAI_TRANSCRIBE_FFMPEG_TIMEOUT_SECONDS`: timeout, in seconds, for local processing of long audio with `ffmpeg`/`ffprobe`; `1800` by default.
+   - `FFMPEG_BIN` and `FFPROBE_BIN`: paths to the local `ffmpeg`/`ffprobe` binaries if they are not on `PATH`.
   - `OPENAI_TTS_MODEL`: speech synthesis model, `gpt-4o-mini-tts` by default.
   - `OPENAI_TTS_VOICE`: speech synthesis voice, `alloy` by default.
   - `OPENAI_TTS_RESPONSE_FORMAT`: audio format for Telegram voice, `opus` by default.
@ -47,6 +54,11 @@
 python3 SHiNE-agent-bot-coder/py_bot_service.py --selftest-codex "Ответь одной строкой: Codex работает"
 ```

+## Long voice/audio
+- If the audio is short, the bot sends it to OpenAI as before.
+- If the audio is large or long, the bot re-encodes it locally with `ffmpeg`, splits it into chunks if needed, and transcribes the chunks sequentially.
+- Very large files hit not only the OpenAI limit but also the regular cloud Telegram Bot API limit on file downloads by bots (about 20 MB). For such cases, run a local `telegram-bot-api` server and point the bot at it via `TELEGRAM_API_BASE_URL`.
+
 ## Running as a systemd service
 Files to install:
 - `scripts/systemd/shine-agent-bot-coder.service`
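
How `TELEGRAM_API_BASE_URL` affects the bot can be sketched as below. This is a minimal sketch, not the patch itself: `build_bot_api_urls` and `cloud_download_likely_too_big` are hypothetical helper names that follow the logic added to `TelegramApi.__init__` and `_telegram_cloud_download_is_likely_too_big`. The token `123:ABC` and the local address `http://127.0.0.1:8081` are placeholders assumed for illustration.

```python
CLOUD_API_ROOT = "https://api.telegram.org"
CLOUD_DOWNLOAD_LIMIT_BYTES = 20 * 1024 * 1024  # approximate cloud Bot API download limit


def build_bot_api_urls(token, base_url=CLOUD_API_ROOT):
    """Return (method URL prefix, file download URL prefix) for a bot token."""
    root = (base_url or CLOUD_API_ROOT).strip().rstrip("/") or CLOUD_API_ROOT
    return f"{root}/bot{token}/", f"{root}/file/bot{token}/"


def cloud_download_likely_too_big(file_size, base_url=CLOUD_API_ROOT):
    """True only when the cloud API is in use and the reported size exceeds its limit."""
    if file_size <= 0:
        return False  # Telegram did not report a size, so let the download attempt decide
    using_cloud = base_url.rstrip("/") == CLOUD_API_ROOT
    return using_cloud and file_size > CLOUD_DOWNLOAD_LIMIT_BYTES


# Placeholder token and local server address, assumed for illustration only.
api_base, file_base = build_bot_api_urls("123:ABC", "http://127.0.0.1:8081/")
print(api_base)
print(file_base)
print(cloud_download_likely_too_big(25 * 1024 * 1024))
print(cloud_download_likely_too_big(25 * 1024 * 1024, "http://127.0.0.1:8081"))
```

The size check is skipped for any non-cloud base URL, so a large file is rejected early only when the bot talks to `https://api.telegram.org`.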
--- a/SHiNE-agent-bot-coder/py_bot_service.py
+++ b/SHiNE-agent-bot-coder/py_bot_service.py
@ -9,6 +9,7 @@ import mimetypes
 import os
 import random
 import re
+import shutil
 import string
 import subprocess
 import tempfile
@ -178,8 +179,11 @@ class JsonLineStore:


 class TelegramApi:
-    def __init__(self, token: str):
-        self.base = f"https://api.telegram.org/bot{token}/"
+    def __init__(self, token: str, base_url: str = "https://api.telegram.org"):
+        self.token = token
+        self.api_root = (base_url or "https://api.telegram.org").rstrip("/")
+        self.base = f"{self.api_root}/bot{token}/"
+        self.file_base = f"{self.api_root}/file/bot{token}/"

    def call(self, method: str, payload: dict[str, Any] | None = None, timeout: int = 60) -> dict[str, Any]:
        data = None
@ -325,10 +329,18 @@ class BotConfig:
        self.allowed_players = parse_allowed_players(env.get("ALLOWED_TELEGRAM_PLAYERS", DEFAULT_ALLOWED_PLAYERS))
        self.allowed_channel_username = normalize_username(env.get("ALLOWED_TELEGRAM_CHANNEL_USERNAME", "shine_writing"))
        self.bot_username = env.get("BOT_USERNAME", "aidar_su_bot")
+        self.telegram_api_base_url = env.get("TELEGRAM_API_BASE_URL", "https://api.telegram.org").strip() or "https://api.telegram.org"
        self.openai_api_key = env.get("OPENAI_API_KEY", "").strip()
        self.openai_transcribe_model = env.get("OPENAI_TRANSCRIBE_MODEL", "gpt-4o-mini-transcribe")
        self.telegram_file_download_timeout_seconds = int(env.get("TELEGRAM_FILE_DOWNLOAD_TIMEOUT_SECONDS", "300"))
        self.openai_transcribe_timeout_seconds = int(env.get("OPENAI_TRANSCRIBE_TIMEOUT_SECONDS", "900"))
+        self.openai_transcribe_max_upload_bytes = max(1_000_000, int(env.get("OPENAI_TRANSCRIBE_MAX_UPLOAD_BYTES", str(24 * 1024 * 1024))))
+        self.openai_transcribe_max_chunk_seconds = max(60, int(env.get("OPENAI_TRANSCRIBE_MAX_CHUNK_SECONDS", "900")))
+        self.openai_transcribe_overlap_seconds = max(0, int(env.get("OPENAI_TRANSCRIBE_OVERLAP_SECONDS", "2")))
+        self.openai_transcribe_reencode_bitrate_kbps = max(12, int(env.get("OPENAI_TRANSCRIBE_REENCODE_BITRATE_KBPS", "24")))
+        self.openai_transcribe_ffmpeg_timeout_seconds = max(30, int(env.get("OPENAI_TRANSCRIBE_FFMPEG_TIMEOUT_SECONDS", "1800")))
+        self.ffmpeg_bin = env.get("FFMPEG_BIN", "ffmpeg").strip() or "ffmpeg"
+        self.ffprobe_bin = env.get("FFPROBE_BIN", "ffprobe").strip() or "ffprobe"
        self.openai_tts_model = env.get("OPENAI_TTS_MODEL", "gpt-4o-mini-tts")
        self.openai_tts_voice = env.get("OPENAI_TTS_VOICE", "alloy")
        self.openai_tts_response_format = env.get("OPENAI_TTS_RESPONSE_FORMAT", "opus")
@ -359,7 +371,7 @@ class BotConfig:
 class ShinePyBotService:
    def __init__(self, config: BotConfig):
        self.cfg = config
-        self.telegram = TelegramApi(config.telegram_bot_token)
+        self.telegram = TelegramApi(config.telegram_bot_token, config.telegram_api_base_url)

        self.queue_file = config.data_dir / "py_queue.jsonl"
        self.state_file = config.data_dir / "py_state.json"
@ -1016,6 +1028,8 @@ class ShinePyBotService:
                    message_id,
                    actor_username,
                    message["voice"].get("file_id"),
+                    duration_seconds=message["voice"].get("duration"),
+                    telegram_file_size=message["voice"].get("file_size"),
                    media_type="voice",
                    update_type=update_type,
                    chat_username=chat_username,
@ -1030,6 +1044,8 @@ class ShinePyBotService:
                    message_id,
                    actor_username,
                    message["audio"].get("file_id"),
+                    duration_seconds=message["audio"].get("duration"),
+                    telegram_file_size=message["audio"].get("file_size"),
                    media_type="audio",
                    update_type=update_type,
                    chat_username=chat_username,
@ -1081,6 +1097,8 @@ class ShinePyBotService:
        username: str,
        file_id: str | None,
        *,
+        duration_seconds: int | None = None,
+        telegram_file_size: int | None = None,
        media_type: str = "voice",
        update_type: str = "message",
        chat_username: str = "",
@ -1103,11 +1121,15 @@ class ShinePyBotService:
            "authorSignature": author_signature,
            "fileId": file_id,
            "mediaType": media_type,
+            "durationSeconds": duration_seconds,
+            "fileSize": telegram_file_size,
        })
        job = self._build_job_base(chat_id, message_id, username, str(history_path))
        job["type"] = "voice"
        job["telegram_file_id"] = file_id
        job["telegram_media_type"] = media_type
+        job["telegram_duration_seconds"] = duration_seconds or 0
+        job["telegram_file_size"] = telegram_file_size or 0
        job["update_type"] = update_type
        job["chat_type"] = chat_type
        job["chat_username"] = chat_username
@ -2201,6 +2223,20 @@ class ShinePyBotService:
        job_id = str(job.get("id") or "")[:8]
        job_num = job.get("num", "?")
        media_type = (job.get("telegram_media_type") or "voice").strip()
+        duration_seconds = int(job.get("telegram_duration_seconds") or 0)
+        telegram_file_size = int(job.get("telegram_file_size") or 0)
+        if self._telegram_cloud_download_is_likely_too_big(telegram_file_size):
+            limit_mb = self._bytes_to_mb(20 * 1024 * 1024)
+            actual_mb = self._bytes_to_mb(telegram_file_size)
+            raise VoiceTranscriptionError(
+                (
+                    f"Telegram не даст этому боту скачать такой файл через обычный Bot API "
+                    f"(примерно {actual_mb} MB при лимите около {limit_mb} MB). "
+                    f"Для очень длинных аудио нужен локальный `telegram-bot-api` сервер или другой способ доставки файла."
+                ),
+                stage="telegram_get_file_too_big",
+                retryable=False,
+            )
        started_at = time.time()
        print(f"[py-bot] transcribe start job={job_id} num={job_num} media={media_type}", flush=True)
        file_bytes, filename = self._download_telegram_file(file_id)
@ -2208,7 +2244,29 @@ class ShinePyBotService:
            f"[py-bot] transcribe downloaded job={job_id} filename={filename} size={len(file_bytes)} bytes",
            flush=True,
        )
-        text = self._openai_transcribe(file_bytes, filename).strip()
+        prepared_parts = self._prepare_audio_parts_for_transcription(
+            file_bytes,
+            filename,
+            duration_seconds=duration_seconds,
+            job_id=job_id,
+            job_num=job_num,
+        )
+        print(
+            f"[py-bot] transcribe prepared job={job_id} parts={len(prepared_parts)} duration={duration_seconds}s",
+            flush=True,
+        )
+        parts_text: list[str] = []
+        prompt_tail = ""
+        for index, (part_bytes, part_name) in enumerate(prepared_parts, start=1):
+            print(
+                f"[py-bot] transcribe part job={job_id} index={index}/{len(prepared_parts)} filename={part_name} size={len(part_bytes)} bytes",
+                flush=True,
+            )
+            part_text = self._openai_transcribe(part_bytes, part_name, prompt=prompt_tail).strip()
+            if part_text:
+                parts_text.append(part_text)
+                prompt_tail = self._transcription_prompt_tail("\n".join(parts_text))
+        text = "\n".join(parts_text).strip()
        if not text:
            raise VoiceTranscriptionError(
                "сервис распознавания вернул пустой текст. Возможно, в записи нет слышимой речи или качество звука слишком низкое.",
@ -2229,10 +2287,18 @@ class ShinePyBotService:
                detail=str(e),
            ) from e
        except Exception as e:
+            detail = str(e)
+            if "file is too big" in detail.lower():
+                raise VoiceTranscriptionError(
+                    "Telegram считает файл слишком большим для скачивания через текущий Bot API. Для такого аудио нужен локальный `telegram-bot-api` сервер или другой способ передать файл боту.",
+                    stage="telegram_get_file_too_big",
+                    retryable=False,
+                    detail=detail,
+                ) from e
            raise VoiceTranscriptionError(
                "не удалось получить информацию о файле из Telegram.",
                stage="telegram_get_file",
-                detail=str(e),
+                detail=detail,
            ) from e
        info = result.get("result") or {}
        file_path = info.get("file_path")
@ -2243,7 +2309,7 @@ class ShinePyBotService:
                retryable=True,
                detail=json.dumps(info, ensure_ascii=False)[:1000],
            )
-        file_url = f"https://api.telegram.org/file/bot{self.cfg.telegram_bot_token}/{file_path}"
+        file_url = self.telegram.file_base + file_path.lstrip("/")
        req = request.Request(file_url, method="GET")
        try:
            with request.urlopen(req, timeout=self.cfg.telegram_file_download_timeout_seconds) as resp:
@ -2284,7 +2350,206 @@ class ShinePyBotService:
            normalized = original_name
        return data, normalized

-    def _openai_transcribe(self, file_bytes: bytes, filename: str) -> str:
+    def _prepare_audio_parts_for_transcription(
+        self,
+        file_bytes: bytes,
+        filename: str,
+        *,
+        duration_seconds: int,
+        job_id: str,
+        job_num: Any,
+    ) -> list[tuple[bytes, str]]:
+        needs_duration_chunking = duration_seconds > self.cfg.openai_transcribe_max_chunk_seconds
+        if len(file_bytes) <= self.cfg.openai_transcribe_max_upload_bytes and not needs_duration_chunking:
+            return [(file_bytes, filename)]
+        ffmpeg_path = shutil.which(self.cfg.ffmpeg_bin)
+        ffprobe_path = shutil.which(self.cfg.ffprobe_bin)
+        if not ffmpeg_path or not ffprobe_path:
+            raise VoiceTranscriptionError(
+                "для длинного аудио нужен локальный `ffmpeg`/`ffprobe`, но они не найдены в системе.",
+                stage="audio_prepare_tools_missing",
+                retryable=False,
+            )
+        with tempfile.TemporaryDirectory(prefix="shine-audio-") as tmpdir:
+            tmp = Path(tmpdir)
+            input_suffix = Path(filename).suffix or ".ogg"
+            input_path = tmp / f"source{input_suffix}"
+            input_path.write_bytes(file_bytes)
+            prepared_path = tmp / "prepared.ogg"
+            self._ffmpeg_reencode_audio(input_path, prepared_path)
+            prepared_bytes = prepared_path.read_bytes()
+            prepared_duration = self._ffprobe_duration_seconds(prepared_path)
+            if (
+                len(prepared_bytes) <= self.cfg.openai_transcribe_max_upload_bytes
+                and prepared_duration <= self.cfg.openai_transcribe_max_chunk_seconds
+            ):
+                return [(prepared_bytes, prepared_path.name)]
+            chunk_length = self._choose_transcription_chunk_seconds(prepared_duration, len(prepared_bytes))
+            print(
+                f"[py-bot] audio chunking job={job_id} num={job_num} duration={prepared_duration:.1f}s total_bytes={len(prepared_bytes)} chunk_seconds={chunk_length}",
+                flush=True,
+            )
+            chunks: list[tuple[bytes, str]] = []
+            offset = 0
+            index = 1
+            total_duration = max(1, int(prepared_duration + 0.999))
+            while offset < total_duration:
+                chunk_path = tmp / f"chunk_{index:03d}.ogg"
+                self._ffmpeg_extract_audio_chunk(prepared_path, chunk_path, offset, chunk_length)
+                chunk_bytes = chunk_path.read_bytes()
+                if not chunk_bytes:
+                    break
+                if len(chunk_bytes) > self.cfg.openai_transcribe_max_upload_bytes:
+                    raise VoiceTranscriptionError(
+                        "локальная нарезка аудио дала слишком большой кусок для OpenAI; нужно уменьшить размер чанка.",
+                        stage="audio_chunk_too_large",
+                        retryable=False,
+                    )
+                chunks.append((chunk_bytes, chunk_path.name))
+                step = max(1, chunk_length - self.cfg.openai_transcribe_overlap_seconds)
+                offset += step
+                index += 1
+            if not chunks:
+                raise VoiceTranscriptionError(
+                    "не удалось подготовить куски аудио для распознавания.",
+                    stage="audio_chunk_empty",
+                    retryable=False,
+                )
+            return chunks
+
+    def _ffmpeg_reencode_audio(self, input_path: Path, output_path: Path) -> None:
+        cmd = [
+            self.cfg.ffmpeg_bin,
+            "-y",
+            "-i",
+            str(input_path),
+            "-vn",
+            "-ac",
+            "1",
+            "-ar",
+            "16000",
+            "-c:a",
+            "libopus",
+            "-b:a",
+            f"{self.cfg.openai_transcribe_reencode_bitrate_kbps}k",
+            str(output_path),
+        ]
+        self._run_subprocess_checked(cmd, "audio_reencode_ffmpeg")
+
+    def _ffmpeg_extract_audio_chunk(self, input_path: Path, output_path: Path, offset_seconds: int, chunk_seconds: int) -> None:
+        cmd = [
+            self.cfg.ffmpeg_bin,
+            "-y",
+            "-ss",
+            str(offset_seconds),
+            "-t",
+            str(chunk_seconds),
+            "-i",
+            str(input_path),
+            "-vn",
+            "-acodec",
+            "copy",
+            str(output_path),
+        ]
+        self._run_subprocess_checked(cmd, "audio_chunk_ffmpeg")
+
+    def _ffprobe_duration_seconds(self, audio_path: Path) -> float:
+        cmd = [
+            self.cfg.ffprobe_bin,
+            "-v",
+            "error",
+            "-show_entries",
+            "format=duration",
+            "-of",
+            "default=noprint_wrappers=1:nokey=1",
+            str(audio_path),
+        ]
+        try:
+            result = subprocess.run(
+                cmd,
+                check=True,
+                capture_output=True,
+                text=True,
+                timeout=self.cfg.openai_transcribe_ffmpeg_timeout_seconds,
+            )
+        except subprocess.TimeoutExpired as e:
+            raise VoiceTranscriptionError(
+                f"`ffprobe` не успел определить длительность аудио за {self.cfg.openai_transcribe_ffmpeg_timeout_seconds} секунд.",
+                stage="audio_probe_timeout",
+                retryable=False,
+            ) from e
+        except subprocess.CalledProcessError as e:
+            detail = (e.stderr or e.stdout or "").strip()
+            raise VoiceTranscriptionError(
+                "не удалось определить длительность аудио через `ffprobe`.",
+                stage="audio_probe_failed",
+                retryable=False,
+                detail=detail[:1500],
+            ) from e
+        raw = (result.stdout or "").strip()
+        try:
+            return max(0.0, float(raw))
+        except ValueError as e:
+            raise VoiceTranscriptionError(
+                "`ffprobe` вернул некорректную длительность аудио.",
+                stage="audio_probe_invalid",
+                retryable=False,
+                detail=raw[:300],
+            ) from e
+
+    def _run_subprocess_checked(self, cmd: list[str], stage: str) -> None:
+        try:
+            subprocess.run(
+                cmd,
+                check=True,
+                capture_output=True,
+                text=True,
+                timeout=self.cfg.openai_transcribe_ffmpeg_timeout_seconds,
+            )
+        except subprocess.TimeoutExpired as e:
+            raise VoiceTranscriptionError(
+                f"локальная обработка аудио не успела завершиться за {self.cfg.openai_transcribe_ffmpeg_timeout_seconds} секунд.",
+                stage=f"{stage}_timeout",
+                retryable=False,
+            ) from e
+        except subprocess.CalledProcessError as e:
+            detail = (e.stderr or e.stdout or "").strip()
+            raise VoiceTranscriptionError(
+                "локальная обработка аудио через `ffmpeg` завершилась с ошибкой.",
+                stage=f"{stage}_failed",
+                retryable=False,
+                detail=detail[:1500],
+            ) from e
+
+    def _choose_transcription_chunk_seconds(self, duration_seconds: float, total_bytes: int) -> int:
+        max_chunk = self.cfg.openai_transcribe_max_chunk_seconds
+        safe_seconds = max(60, max_chunk - self.cfg.openai_transcribe_overlap_seconds)
+        if duration_seconds <= 0 or total_bytes <= 0:
+            return safe_seconds
+        bytes_per_second = total_bytes / max(duration_seconds, 1.0)
+        if bytes_per_second <= 0:
+            return safe_seconds
+        size_limited = int((self.cfg.openai_transcribe_max_upload_bytes * 0.9) / bytes_per_second)
+        return max(60, min(safe_seconds, size_limited if size_limited > 0 else safe_seconds))
+
+    @staticmethod
+    def _transcription_prompt_tail(text: str, limit: int = 1000) -> str:
+        source = compact_spaces(text)
+        if len(source) <= limit:
+            return source
+        return source[-limit:]
+
+    def _telegram_cloud_download_is_likely_too_big(self, file_size: int) -> bool:
+        if file_size <= 0:
+            return False
+        using_cloud_api = self.cfg.telegram_api_base_url.rstrip("/") == "https://api.telegram.org"
+        return using_cloud_api and file_size > 20 * 1024 * 1024
+
+    @staticmethod
+    def _bytes_to_mb(value: int) -> str:
+        return f"{value / (1024 * 1024):.1f}"
+
+    def _openai_transcribe(self, file_bytes: bytes, filename: str, prompt: str = "") -> str:
        boundary = "----shine-boundary-" + "".join(random.choices("abcdef0123456789", k=16))
        mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"

@ -2298,6 +2563,9 @@ class ShinePyBotService:
        body = bytearray()
        body.extend(text_part("model", self.cfg.openai_transcribe_model))
        body.extend(text_part("response_format", "text"))
+        prompt = compact_spaces(prompt)
+        if prompt:
+            body.extend(text_part("prompt", prompt[:1000]))
        body.extend(
            (
                f"--{boundary}\r\n"
--- a/VERSION.properties
+++ b/VERSION.properties
@ -1,2 +1,2 @@
-client.version=1.2.114
-server.version=1.2.106
+client.version=1.2.115
+server.version=1.2.107
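
The sequential transcription loop in `py_bot_service.py` passes the tail of the text recognized so far as the `prompt` for the next part. A minimal sketch of that flow follows, with a fake transcriber standing in for the OpenAI call. `transcribe_parts`, `prompt_tail`, and `fake_transcribe` are hypothetical names, and the whitespace collapsing is only an assumption about what `compact_spaces` does, since that helper is not shown in this diff.

```python
def prompt_tail(text, limit=1000):
    """Collapse whitespace and keep at most the last `limit` characters."""
    source = " ".join(text.split())  # assumed behaviour of compact_spaces
    return source if len(source) <= limit else source[-limit:]


def transcribe_parts(parts, transcribe):
    """Transcribe (bytes, filename) parts in order, feeding earlier text as the prompt."""
    texts = []
    tail = ""
    for part_bytes, part_name in parts:
        text = transcribe(part_bytes, part_name, tail).strip()
        if text:
            texts.append(text)
            tail = prompt_tail("\n".join(texts))
    return "\n".join(texts).strip()


def fake_transcribe(part_bytes, part_name, prompt):
    """Stand-in for the real API call: just decodes the bytes as text."""
    return part_bytes.decode("utf-8")


parts = [(b"first part", "chunk_001.ogg"), (b"", "chunk_002.ogg"), (b"second part", "chunk_003.ogg")]
result = transcribe_parts(parts, fake_transcribe)
print(result)
```

Parts that come back empty are skipped, and the texts are joined with newlines. As in the patch, words inside the overlap between adjacent chunks are not de-duplicated.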