Methodology

How Job Radar collects, enriches, scores, and displays vacancy data. Full transparency on every step.

Data Sources

Source	Type	Coverage	Frequency
79 Telegram channels	Public + Private	RU analytics, BI, data, product	2×/day
hh.ru API	Job board	30+ search queries (BI, Analytics, Head of)	1×/day
eFinancialCareers	Web scraping	Finance + analytics	1×/day
LinkedIn (EU)	Guest API	Head of BI/Analytics, Director Data — Europe	1×/day
EuroTechJobs	Web scraping	EU tech jobs (BI, Analytics, Data)	1×/day
Relocate.me	Web scraping	Relocation jobs (BI, Data)	1×/day

Telegram channels include job boards (@analysts_hunter, @foranalysts, @evacuatejobs), recruiter channels, and BI/analytics community groups. Private channels are accessed via the Telethon library (a Python client for the Telegram API).

Enrichment Pipeline

Each vacancy passes through a 14-step enrichment pipeline. Every step is idempotent — it only processes records that need updating.
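The idempotency property can be illustrated with a minimal sketch. The record layout and the names run_step and detect_remote are hypothetical, not Job Radar's actual code. The point is only that a step touches records whose target field is still empty, so running it a second time changes nothing.

```python
from typing import Callable, Optional

def run_step(records: list[dict], field: str,
             extract: Callable[[dict], Optional[str]]) -> int:
    """Fill `field` only where it is missing; return how many records changed."""
    updated = 0
    for record in records:
        if record.get(field) is not None:
            continue  # already enriched: skipping makes re-runs no-ops
        value = extract(record)
        if value is not None:
            record[field] = value
            updated += 1
    return updated

# Hypothetical extractor, for illustration only.
def detect_remote(record: dict) -> Optional[str]:
    return "remote" if "remote" in record["text"].lower() else None

vacancies = [
    {"text": "Senior BI Analyst, remote", "work_format": None},
    {"text": "Data Engineer, Berlin office", "work_format": "office"},
]
first_run = run_step(vacancies, "work_format", detect_remote)   # updates one record
second_run = run_step(vacancies, "work_format", detect_remote)  # nothing left to do
```

The second call finds no record that needs updating, which is what makes it safe to run the full pipeline twice a day.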

Collect

monitor_telegram.py · monitor_telethon.py · monitor_job_boards.py

Fetch new posts from all sources into raw storage (monitor_items table).

Sync & Transform

sync_to_vacancies.py

Convert raw posts → structured vacancies. Includes LLM-based garbage filtering (chat messages, ads, non-vacancies are excluded).

Regex Extraction

enrich_regex.py

Fast pattern-based extraction: salary (60+ patterns for RUB/USD/EUR/GBP/KZT), experience level, language requirements, tech stack, work format. First pass before LLM.
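As a rough illustration of pattern-based salary extraction, the sketch below handles a single case: a "min - max currency" range. The actual 60+ patterns are not published here, so the regex, the symbol table, and the function name extract_salary are assumptions.

```python
import re
from typing import Optional

# One illustrative pattern: "<min> - <max> <currency>", e.g. "3000-4500 EUR".
SALARY_RANGE = re.compile(
    r"(?P<min>\d[\d\s]*)\s*[-–]\s*(?P<max>\d[\d\s]*)\s*"
    r"(?P<cur>RUB|USD|EUR|GBP|KZT|₽|\$|€|£|₸)",
    re.IGNORECASE,
)
SYMBOLS = {"₽": "RUB", "$": "USD", "€": "EUR", "£": "GBP", "₸": "KZT"}

def extract_salary(text: str) -> Optional[dict]:
    """Return {'min', 'max', 'currency'} for the first range found, else None."""
    match = SALARY_RANGE.search(text)
    if match is None:
        return None  # left for the LLM pass
    currency = match.group("cur").upper()
    return {
        # Thousands separators written as spaces ("250 000") are removed.
        "min": int(re.sub(r"\s", "", match.group("min"))),
        "max": int(re.sub(r"\s", "", match.group("max"))),
        "currency": SYMBOLS.get(currency, currency),
    }

print(extract_salary("Salary: 250 000 - 300 000 ₽ net"))
```

Text that matches no pattern returns None and would fall through to the LLM extraction step.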

LLM Extraction

extract_salary_llm.py

GPT-4.1-mini extracts salary, language, and location from raw vacancy text. Catches what regex misses — natural language salary mentions, implicit requirements.

Role Standardization

enrich_dds_role_title.py

Standardize role titles to English + classify into 15 role families (Analytics, Engineering, DS/ML, Product, QA, etc.) via LLM.

Company Extraction

enrich_company.py

5-phase pipeline; the main stages are deterministic rules → hh.ru metadata → LLM fallback → validation. Handles agencies, marketplaces, and franchise points.

Location

enrich_location.py

Standardize to "City, CC" format. 3 phases: hh.ru metadata → regex patterns → LLM extraction.
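The three-phase order can be read as a fallback chain: each phase runs only if the previous one produced nothing. In the sketch below the phase functions, the meta_location field, and the single city pattern are stand-ins, and the LLM phase is a stub that always gives up.

```python
import re
from typing import Optional

# Tiny lookup standing in for the real regex rules (assumed, for illustration).
CITY_PATTERNS = [
    (re.compile(r"\bberlin\b", re.IGNORECASE), "Berlin, DE"),
]

def from_metadata(vacancy: dict) -> Optional[str]:
    # Phase 1: structured metadata, when the source provides it.
    return vacancy.get("meta_location")

def from_regex(vacancy: dict) -> Optional[str]:
    # Phase 2: pattern match on the raw text.
    for pattern, location in CITY_PATTERNS:
        if pattern.search(vacancy.get("text", "")):
            return location
    return None

def from_llm(vacancy: dict) -> Optional[str]:
    # Phase 3: placeholder for an LLM call; returns nothing in this sketch.
    return None

PHASES = [from_metadata, from_regex, from_llm]

def resolve_location(vacancy: dict) -> Optional[str]:
    """Return the first non-empty result in 'City, CC' format, or None."""
    for phase in PHASES:
        result = phase(vacancy)
        if result:
            return result
    return None
```

Ordering the phases from cheapest and most reliable to most expensive means the LLM is consulted only for vacancies the earlier phases could not resolve.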

Skills

enrich_skills.py

Extract tech skills from hh.ru key_skills metadata and LLM analysis of raw content.

Work Format

enrich_work_format.py

Classify: remote, office, hybrid, remote_ru (remote but Russia-only). Regex + LLM with 97% coverage.
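A minimal sketch of the regex half of this classifier follows. The four labels come from the description above; the patterns themselves and the name classify_work_format are assumptions. Rules are checked from most to least specific so that a Russia-only remote role is not labelled plain remote.

```python
import re
from typing import Optional

# Ordered from most to least specific; patterns are illustrative only.
RULES = [
    ("remote_ru", re.compile(
        r"remote.{0,40}(russia only|from russia|within russia)", re.I | re.S)),
    ("hybrid", re.compile(r"\bhybrid\b", re.I)),
    ("remote", re.compile(r"\bremote\b", re.I)),
    ("office", re.compile(r"\b(on-?site|office)\b", re.I)),
]

def classify_work_format(text: str) -> Optional[str]:
    """Return the first matching label, or None to defer to the LLM pass."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return None

print(classify_work_format("Fully remote, Russia only"))
```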

Company Normalization

normalize_company.py

Alias mapping (brand unification) + trim legal suffixes (ООО, LLC, Inc).
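The two operations can be sketched as follows. The alias table holds made-up sample entries and the suffix regex covers only the three suffixes named above; neither reflects the real configuration.

```python
import re

# Assumed sample data; the real alias table is not shown in this document.
ALIASES = {"yandex": "Yandex", "яндекс": "Yandex"}

# A legal suffix counts only as a standalone token, not as part of a word.
LEGAL_SUFFIX = re.compile(
    r"(?:^|[\s,]+)(?:ООО|LLC|Inc\.?)(?=$|[\s,])", re.IGNORECASE)

def normalize_company(raw: str) -> str:
    name = LEGAL_SUFFIX.sub(" ", raw)               # trim legal suffixes
    name = re.sub(r"\s+", " ", name).strip(" ,\"«»")  # tidy spaces and quotes
    return ALIASES.get(name.lower(), name)          # unify brand aliases

print(normalize_company("ООО «Яндекс»"))
print(normalize_company("Acme Analytics, Inc."))
```

Stripping the suffix before the alias lookup keeps the alias table small, since it only needs one entry per brand spelling rather than one per legal form.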

Domain Normalization

normalize_domains.py

Map 448+ raw dds_industry values → 30 canonical categories (fintech, ecommerce, healthtech, etc.).

Role Subfamily

classify_subfamilies.py

Classify into ~60 subspecialties within 15 role families. Deterministic pattern matching.

Salary Estimation

estimate_salary.py

For vacancies without a stated salary: estimate using the median of similar vacancies (same level + dds_industry + region). Marked as "Estimated" in the Salary Type filter.
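The cohort-median idea can be sketched in a few lines. The field name salary_eur and the minimum cohort size are assumptions; the source does not say how many reference vacancies are required before an estimate is produced.

```python
from statistics import median
from typing import Optional

MIN_COHORT = 3  # assumed threshold, not taken from the source
COHORT_KEYS = ("level", "dds_industry", "region")

def estimate_salary(target: dict, vacancies: list[dict]) -> Optional[float]:
    """Median salary of vacancies sharing level, industry, and region."""
    cohort = [
        v["salary_eur"] for v in vacancies
        if v.get("salary_eur") is not None
        and all(v.get(k) == target.get(k) for k in COHORT_KEYS)
    ]
    if len(cohort) < MIN_COHORT:
        return None  # insufficient reference data: stays "No salary"
    return median(cohort)

reference = [
    {"level": "senior", "dds_industry": "fintech", "region": "EU", "salary_eur": 3000},
    {"level": "senior", "dds_industry": "fintech", "region": "EU", "salary_eur": 4000},
    {"level": "senior", "dds_industry": "fintech", "region": "EU", "salary_eur": 5000},
    {"level": "senior", "dds_industry": "fintech", "region": "RU", "salary_eur": 2000},
]
target = {"level": "senior", "dds_industry": "fintech", "region": "EU"}
print(estimate_salary(target, reference))
```

A median is less sensitive than a mean to a single unusually high or low salary in a small cohort.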

QA Check

qa_enrichment_integrity.py

Verify all enrichment fields are preserved across scoring cycles. Catch regressions before they reach the frontend.

Salary Types

Every vacancy has a salary classification based on how the data was obtained:

Type	Sources	What it means
Stated	REGEX, LLM, TG, WEB, FIX	Salary explicitly mentioned in the vacancy text. Extracted by regex patterns (60+ rules) or LLM analysis. Most reliable.
Estimated	LITE, RESEARCH, EST	No salary in text. Estimated by comparing with similar vacancies: same level + dds_industry + region. Median of matched cohort.
No salary	—	Neither stated nor estimable (insufficient reference data).

Salary Conversion

All salary filters use EUR equivalent for uniform comparison:

Currency	Conversion	Note
RUB ₽	÷ 95	RUB → EUR, approximate Central Bank rate
USD $	× 0.92	USD → EUR
GBP £	× 1.15	GBP → EUR
KZT ₸	× 0.002	Tenge → EUR

Annual salaries (USD, EUR, or GBP amounts above a threshold) are automatically converted to monthly figures.
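The conversion rules can be sketched as below. The rates are the ones listed in the conversion table; the annual threshold value is hypothetical, since the real cutoff is not documented here, and the function name to_monthly_eur is an assumption.

```python
# Rates from the conversion table: RUB is divided, the others are multiplied.
TO_EUR = {
    "EUR": lambda amount: amount,
    "RUB": lambda amount: amount / 95,
    "USD": lambda amount: amount * 0.92,
    "GBP": lambda amount: amount * 1.15,
    "KZT": lambda amount: amount * 0.002,
}
ANNUAL_CURRENCIES = {"USD", "EUR", "GBP"}
ANNUAL_THRESHOLD = 20_000  # hypothetical cutoff; the real value is not stated

def to_monthly_eur(amount: float, currency: str) -> float:
    """Convert a salary figure to a monthly EUR equivalent."""
    if currency in ANNUAL_CURRENCIES and amount > ANNUAL_THRESHOLD:
        amount = amount / 12  # treat large figures as annual
    return round(TO_EUR[currency](amount), 2)

print(to_monthly_eur(190_000, "RUB"))
print(to_monthly_eur(120_000, "USD"))
```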

Scoring Model

Each vacancy is scored 0–100 based on 6 weighted components:

Component	Weight	What it measures
Level	18%	Seniority match
Stack	18%	Tech stack fit
Format	18%	Remote/office/hybrid
Salary	18%	Compensation level
Language	18%	Language requirements
Domain	10%	Industry preference

Rating	Score	Meaning
[+]	≥ 70	Strong match — review recommended
[~]	50–69	Partial match — worth a look
[−]	< 50	Low relevance
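The weighting and rating thresholds can be expressed as a short sketch. The weights and cutoffs are the ones given in this section; the assumption that each component is first reduced to a match value between 0 and 1 is mine, as are the function names.

```python
WEIGHTS = {
    "level": 0.18, "stack": 0.18, "format": 0.18,
    "salary": 0.18, "language": 0.18, "domain": 0.10,
}

def total_score(components: dict) -> float:
    """Weighted sum on a 0-100 scale; each component is assumed to be in [0, 1]."""
    raw = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return round(100 * raw, 1)

def rating(score: float) -> str:
    if score >= 70:
        return "[+]"  # strong match
    if score >= 50:
        return "[~]"  # partial match
    return "[−]"      # low relevance

example = {"level": 1.0, "stack": 0.5, "format": 1.0,
           "salary": 0.5, "language": 1.0, "domain": 0.0}
print(total_score(example), rating(total_score(example)))
```

In this example three components match fully and two match halfway, giving 18 + 9 + 18 + 9 + 18 + 0 = 72, which lands in the strong-match band.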

Role Families

15 role families for classification:

Family	Examples
Analytics	Business Analyst, BI Analyst, Data Analyst, Financial Analyst
Data Engineering	Data Engineer, ETL Developer, DWH Architect
DS/ML	Data Scientist, ML Engineer, NLP Researcher
Engineering	Backend Developer, DevOps, Platform Engineer, SRE
Product	Product Manager, Product Owner, Product Analyst
QA	QA Engineer, SDET, Manual QA, AQA
Management	Project Manager, Team Lead, CTO, Head of
Marketing	Performance Marketing, Growth, UA Manager
Sales	Account Executive, BDM, Sales Director
Finance	FP&A, Controller, Risk Analyst
HR	Recruiter, HRBP, Talent Acquisition
Design	UX/UI Designer, Product Designer
Procurement	Procurement Manager, Supply Chain
Multiple	Cross-functional roles
Other	Roles outside standard categories

Data Quality

Field Coverage (current)

Field	Coverage	Method
Level	100%	Regex + LLM
Salary	100%	Regex → LLM → Estimation
Company	100%	5-phase extraction pipeline
Work Format	97%	Regex + LLM classification
Location	72%	3-phase: metadata → regex → LLM
Language	58%	Regex pattern matching
Skills	56%	hh.ru metadata + LLM
Domain	41%	448→30 normalization map

Lower coverage for location, language, skills, and domain (dds_industry) is expected: many Telegram posts are too short to extract these fields reliably.

Garbage Filtering

Non-vacancy items (chat messages, discussions, ads) are automatically flagged and excluded. Currently ~5% of raw posts are classified as garbage.

Deduplication

Reposts are detected via content similarity and marked with the is_repost flag. Source URL deduplication prevents the same vacancy from appearing twice.
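Both mechanisms can be sketched together. The similarity measure used here (difflib's SequenceMatcher) and the 0.9 threshold are assumptions, since the source does not say how content similarity is computed. Only the is_repost flag name comes from the text above.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # assumed cutoff; not stated in the source

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def deduplicate(posts: list[dict]) -> list[dict]:
    """Drop exact source-URL duplicates, then flag near-identical texts."""
    seen_urls: set[str] = set()
    originals: list[str] = []
    kept = []
    for post in posts:
        url = post.get("url")
        if url is not None:
            if url in seen_urls:
                continue  # same source URL: never shown twice
            seen_urls.add(url)
        text = normalize(post["text"])
        is_repost = any(
            SequenceMatcher(None, text, seen).ratio() >= SIMILARITY_THRESHOLD
            for seen in originals
        )
        if not is_repost:
            originals.append(text)
        kept.append({**post, "is_repost": is_repost})
    return kept

posts = [
    {"url": "https://t.me/a/1", "text": "Senior BI Analyst at Acme, remote, Power BI and SQL required"},
    {"url": "https://t.me/b/7", "text": "Senior BI  Analyst at Acme, remote, Power BI and SQL required!"},
    {"url": "https://t.me/a/1", "text": "Senior BI Analyst at Acme, remote, Power BI and SQL required"},
    {"url": "https://t.me/c/3", "text": "Data Engineer, Berlin office, Airflow and Spark"},
]
result = deduplicate(posts)
```

Pairwise comparison is quadratic in the number of posts, so a production version would likely need a cheaper first-pass filter; this sketch only shows the two rules side by side.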

Schedule

Time	Pipeline	Steps
07:00	Job boards (hh.ru)	Collect from hh.ru API
09:07	Morning full pipeline	All 14 steps: collect → enrich → score → QA
21:07	Evening pipeline	All 14 steps: collect → enrich → score → QA