Methodology

How Job Radar collects, enriches, scores, and displays vacancy data. Full transparency on every step.

Data Sources

Source | Type | Coverage | Frequency
79 Telegram channels | Public + Private | RU analytics, BI, data, product | 2×/day
hh.ru API | Job board | 30+ search queries (BI, Analytics, Head of) | 1×/day
eFinancialCareers | Web scraping | Finance + analytics | 1×/day
LinkedIn (EU) | Guest API | Head of BI/Analytics, Director Data (Europe) | 1×/day
EuroTechJobs | Web scraping | EU tech jobs (BI, Analytics, Data) | 1×/day
Relocate.me | Web scraping | Relocation jobs (BI, Data) | 1×/day

Telegram channels include job boards (@analysts_hunter, @foranalysts, @evacuatejobs), recruiter channels, and BI/analytics community groups. Private channels are accessed via the Telethon API.

Enrichment Pipeline

Each vacancy passes through a 14-step enrichment pipeline. Every step is idempotent — it only processes records that need updating.

1. Collect (monitor_telegram.py · monitor_telethon.py · monitor_job_boards.py)
   Fetch new posts from all sources into raw storage (monitor_items table).
2. Sync & Transform (sync_to_vacancies.py)
   Convert raw posts → structured vacancies. Includes LLM-based garbage filtering (chat messages, ads, and other non-vacancies are excluded).
3. Regex Extraction (enrich_regex.py)
   Fast pattern-based extraction: salary (60+ patterns for RUB/USD/EUR/GBP/KZT), experience level, language requirements, tech stack, work format. First pass before the LLM.
4. LLM Extraction (extract_salary_llm.py)
   GPT-4.1-mini extracts salary, language, and location from raw vacancy text. Catches what regex misses: natural-language salary mentions and implicit requirements.
5. Role Standardization (enrich_dds_role_title.py)
   Standardize role titles to English and classify them into 15 role families (Analytics, Engineering, DS/ML, Product, QA, etc.) via LLM.
6. Company Extraction (enrich_company.py)
   5-phase pipeline whose phases include deterministic rules → hh.ru metadata → LLM fallback → validation. Handles agencies, marketplaces, franchise points.
7. Location (enrich_location.py)
   Standardize to "City, CC" format. 3 phases: hh.ru metadata → regex patterns → LLM extraction.
8. Skills (enrich_skills.py)
   Extract tech skills from hh.ru key_skills metadata and LLM analysis of raw content.
9. Work Format (enrich_work_format.py)
   Classify: remote, office, hybrid, remote_ru (remote but Russia-only). Regex + LLM with 97% coverage.
10. Company Normalization (normalize_company.py)
   Alias mapping (brand unification) and trimming of legal suffixes (ООО, LLC, Inc).
11. Domain Normalization (normalize_domains.py)
   Map 448+ raw dds_industry values to 30 canonical categories (fintech, ecommerce, healthtech, etc.).
12. Role Subfamily (classify_subfamilies.py)
   Classify into ~60 subspecialties within 15 role families. Deterministic pattern matching.
13. Salary Estimation (estimate_salary.py)
   For vacancies without a stated salary, estimate it as the median of similar vacancies (same level + dds_industry + region). Marked as "Estimated" in the Salary Type filter.
14. QA Check (qa_enrichment_integrity.py)
   Verify all enrichment fields are preserved across scoring cycles. Catch regressions before they reach the frontend.
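
The idempotency property stated for the pipeline (a step only processes records that still need updating) can be illustrated with the company normalization step. The following is a minimal sketch, not the actual normalize_company.py: the vacancies table, the company_raw and company_norm columns, the ALIASES map, and the run_step helper are all hypothetical names chosen for the example.

```python
import re
import sqlite3

# Hypothetical alias map (brand unification); the real mapping is not published.
ALIASES = {"яндекс": "Yandex", "yandex": "Yandex"}

# Legal forms named in the methodology: ООО, LLC, Inc (optionally followed by a dot).
LEGAL_FORMS = re.compile(r"\b(?:ООО|LLC|Inc)\b\.?", re.IGNORECASE)


def normalize_company(raw: str) -> str:
    """Trim legal forms and quotes, then apply the alias map."""
    name = LEGAL_FORMS.sub(" ", raw)
    name = re.sub(r"[\"«»,]", " ", name)
    name = " ".join(name.split())  # collapse whitespace
    return ALIASES.get(name.lower(), name)


def run_step(conn: sqlite3.Connection) -> int:
    """Process only rows that have not been normalized yet; return how many were touched."""
    rows = conn.execute(
        "SELECT id, company_raw FROM vacancies "
        "WHERE company_norm IS NULL AND company_raw IS NOT NULL"
    ).fetchall()
    for vacancy_id, raw in rows:
        conn.execute(
            "UPDATE vacancies SET company_norm = ? WHERE id = ?",
            (normalize_company(raw), vacancy_id),
        )
    conn.commit()
    return len(rows)


def demo() -> tuple[int, int, list[str]]:
    """Run the step twice on an in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vacancies (id INTEGER PRIMARY KEY, company_raw TEXT, company_norm TEXT)"
    )
    conn.executemany(
        "INSERT INTO vacancies (company_raw) VALUES (?)",
        [("ООО Яндекс",), ("Acme, LLC",), ("Acme Inc.",)],
    )
    first = run_step(conn)   # all three rows need work
    second = run_step(conn)  # nothing left to do on the second run
    names = [r[0] for r in conn.execute("SELECT company_norm FROM vacancies ORDER BY id")]
    return first, second, names
```

Running the step a second time touches no rows, which is the behavior the pipeline description calls idempotent. Selecting on a NULL output column is only one way to detect "needs updating"; a version stamp or content hash would work as well.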

Salary Types

Every vacancy has a salary classification based on how the data was obtained:

Type | Sources | What it means
Stated | REGEX, LLM, TG, WEB, FIX | Salary explicitly mentioned in the vacancy text. Extracted by regex patterns (60+ rules) or LLM analysis. Most reliable.
Estimated | LITE, RESEARCH, EST | No salary in text. Estimated by comparing with similar vacancies: same level + dds_industry + region. Median of matched cohort.
No salary | - | Neither stated nor estimable (insufficient reference data).
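
The "Estimated" and "No salary" rows above can be sketched as a cohort median. This is an illustration under assumptions, not the code of estimate_salary.py: the salary_eur field name and the min_cohort cutoff (how much reference data counts as "sufficient") are hypothetical.

```python
from statistics import median
from typing import Optional


def estimate_salary(target: dict, reference: list[dict], min_cohort: int = 3) -> Optional[float]:
    """Return the median salary of vacancies matching level + dds_industry + region.

    Returns None when the cohort is too small, which corresponds to "No salary".
    min_cohort is an assumed cutoff; the methodology does not state one.
    """
    key = (target["level"], target["dds_industry"], target["region"])
    cohort = [
        v["salary_eur"]
        for v in reference
        if v.get("salary_eur") is not None
        and (v["level"], v["dds_industry"], v["region"]) == key
    ]
    if len(cohort) < min_cohort:
        return None
    return median(cohort)


# Made-up reference vacancies with stated salaries (EUR per month).
REFERENCE = [
    {"level": "senior", "dds_industry": "fintech", "region": "EU", "salary_eur": 3000},
    {"level": "senior", "dds_industry": "fintech", "region": "EU", "salary_eur": 4000},
    {"level": "senior", "dds_industry": "fintech", "region": "EU", "salary_eur": 5000},
    {"level": "junior", "dds_industry": "fintech", "region": "EU", "salary_eur": 1500},
]
```

With this data, a senior fintech vacancy in the EU gets the cohort median, while a junior one falls back to None because only one reference vacancy matches.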

Salary Conversion

All salary filters use the EUR equivalent for uniform comparison:

Currency | Conversion | Note
RUB ₽ | ÷ 95 | Central Bank approximate rate
USD $ | × 0.92 | USD → EUR
GBP £ | × 1.15 | GBP → EUR
KZT ₸ | × 0.002 | Tenge → EUR

Annual salaries (USD/EUR/GBP above threshold) are automatically converted to monthly.
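
As a worked example of the conversion rules above, the sketch below applies the listed rates and the annual-to-monthly rule. The rates come from the table; the ANNUAL_THRESHOLD value is an assumption, since the actual threshold is not stated, and the function name is hypothetical.

```python
# Multipliers into EUR, taken from the conversion table (RUB is "÷ 95").
TO_EUR = {"RUB": 1 / 95, "USD": 0.92, "GBP": 1.15, "KZT": 0.002, "EUR": 1.0}

# Currencies for which large amounts are treated as annual figures.
ANNUAL_CURRENCIES = {"USD", "EUR", "GBP"}

# Assumed cutoff: the methodology only says "above threshold".
ANNUAL_THRESHOLD = 20_000


def to_monthly_eur(amount: float, currency: str) -> float:
    """Convert a salary amount to a monthly EUR equivalent."""
    if currency not in TO_EUR:
        raise ValueError(f"unsupported currency: {currency}")
    if currency in ANNUAL_CURRENCIES and amount > ANNUAL_THRESHOLD:
        amount = amount / 12  # treat as an annual salary
    return round(amount * TO_EUR[currency], 2)
```

For instance, 190 000 RUB becomes 2000 EUR per month, and 120 000 USD is first divided by 12 and then converted, giving 9200 EUR per month. Fixed rates keep filters stable between runs at the cost of drifting from the market rate.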

Scoring Model

Each vacancy is scored 0–100 based on 6 weighted components:

Component | Weight | What it measures
Level | 18% | Seniority match
Stack | 18% | Tech stack fit
Format | 18% | Remote/office/hybrid
Salary | 18% | Compensation level
Language | 18% | Language requirements
Domain | 10% | Industry preference

Rating | Score | Meaning
[+] | ≥ 70 | Strong match: review recommended
[~] | 50–69 | Partial match: worth a look
[−] | < 50 | Low relevance
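
The weights and rating bands above combine as a weighted sum. The sketch below assumes each component produces a fit value between 0 and 1; how those per-component fits are computed is not described here, so the inputs are hypothetical.

```python
# Component weights from the scoring model; they sum to 100.
WEIGHTS = {
    "level": 18,
    "stack": 18,
    "format": 18,
    "salary": 18,
    "language": 18,
    "domain": 10,
}


def score_vacancy(fits: dict[str, float]) -> int:
    """Weighted sum of per-component fits (each clamped to 0..1), giving a 0-100 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        fit = min(max(fits.get(name, 0.0), 0.0), 1.0)  # missing component counts as 0
        total += weight * fit
    return round(total)


def rating(score: int) -> str:
    """Map a 0-100 score to the rating bands from the table."""
    if score >= 70:
        return "[+]"
    if score >= 50:
        return "[~]"
    return "[−]"
```

A vacancy that fully matches on level, stack, format, and domain, half matches on salary, and misses on language scores 18 × 3 + 9 + 0 + 10 = 73, which lands in the [+] band.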

Role Families

15 role families for classification:

Family | Examples
Analytics | Business Analyst, BI Analyst, Data Analyst, Financial Analyst
Data Engineering | Data Engineer, ETL Developer, DWH Architect
DS/ML | Data Scientist, ML Engineer, NLP Researcher
Engineering | Backend Developer, DevOps, Platform Engineer, SRE
Product | Product Manager, Product Owner, Product Analyst
QA | QA Engineer, SDET, Manual QA, AQA
Management | Project Manager, Team Lead, CTO, Head of
Marketing | Performance Marketing, Growth, UA Manager
Sales | Account Executive, BDM, Sales Director
Finance | FP&A, Controller, Risk Analyst
HR | Recruiter, HRBP, Talent Acquisition
Design | UX/UI Designer, Product Designer
Procurement | Procurement Manager, Supply Chain
Multiple | Cross-functional roles
Other | Roles outside standard categories

Data Quality

Field Coverage (current)

Field | Coverage | Method
Level | 100% | Regex + LLM
Salary | 100% | Regex → LLM → Estimation
Company | 100% | 5-phase extraction pipeline
Work Format | 97% | Regex + LLM classification
Location | 72% | 3-phase: metadata → regex → LLM
Language | 58% | Regex pattern matching
Skills | 56% | hh.ru metadata + LLM
Domain | 41% | 448→30 normalization map

Lower coverage for location, language, skills, and domain (dds_industry) is expected: many Telegram posts are too short to extract these fields reliably.

Garbage Filtering

Non-vacancy items (chat messages, discussions, ads) are automatically flagged and excluded. Currently ~5% of raw posts are classified as garbage.

Deduplication

Reposts are detected via content similarity and marked with is_repost flag. Source URL deduplication prevents the same vacancy from appearing twice.
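
The two deduplication rules above can be sketched as follows. The similarity measure is not specified in this document, so the example swaps in difflib.SequenceMatcher from the Python standard library as a stand-in; the 0.9 threshold, the url and text field names, and the mark_duplicates helper are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Assumed cutoff for "similar enough to be a repost".
SIMILARITY_THRESHOLD = 0.9


def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def mark_duplicates(posts: list[dict]) -> list[dict]:
    """Drop posts with an already-seen source URL and flag near-identical texts as reposts."""
    seen_urls: set[str] = set()
    kept: list[dict] = []
    for post in posts:
        url = post.get("url")
        if url and url in seen_urls:
            continue  # same source URL: the vacancy is already stored
        if url:
            seen_urls.add(url)
        text = normalize_text(post["text"])
        is_repost = any(
            SequenceMatcher(None, text, normalize_text(prev["text"])).ratio()
            >= SIMILARITY_THRESHOLD
            for prev in kept
        )
        kept.append({**post, "is_repost": is_repost})
    return kept


# Made-up posts: one exact URL duplicate, one near-identical repost, one distinct vacancy.
POSTS = [
    {"url": "https://t.me/a/1", "text": "Senior BI Analyst, remote, 300 000 RUB. Contact @hr"},
    {"url": "https://t.me/a/1", "text": "Senior BI Analyst, remote, 300 000 RUB. Contact @hr"},
    {"url": "https://t.me/b/7", "text": "Senior BI Analyst, remote, 300 000 RUB. Contact @hr."},
    {"url": "https://t.me/c/3", "text": "Data Engineer, office, Almaty"},
]
```

Pairwise comparison is quadratic in the number of posts, so a production pipeline would more likely compare hashes or shingles; the sketch only shows how the is_repost flag and URL check interact.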

Schedule

Time | Pipeline | Steps
07:00 | Job boards (hh.ru) | Collect from hh.ru API
09:07 | Morning full pipeline | All 14 steps: collect → enrich → score → QA
21:07 | Evening pipeline | All 14 steps: collect → enrich → score → QA