Schemas

This file contains the data schemas used in the peviitor project.

Job Model Schema

| Field | Type | Required | Description and rules |
| --- | --- | --- | --- |
| url | string | Yes | Full URL of the job detail page. Must be unique, a valid HTTP/HTTPS URL, and the canonical job detail page. |
| title | string | Yes | Exact position title. Max 200 characters, no HTML, whitespace trimmed. Diacritics accepted (ăâîșțĂÂÎȘȚ). |
| company | string | No | Name of the hiring company: the real, full legal name, not just a brand or a code. Must exactly match the company name in the Company model (case-insensitive, diacritics required). |
| cif | string | No | CIF/CUI of the hiring company. Needed to tell apart companies that share a name: for example, Systematic SRL exists under the same name in three counties (Bihor, Arad, Timiș). |
| location | string[] | No | Location or detailed address. Romanian cities/addresses, diacritics accepted (e.g. "București", "Cluj-Napoca"). Multi-valued, stored as an array. |
| tags | string[] | No | Skills/education/experience tags. Lowercase, max 20 entries, standardized values only, no diacritics. |
| workmode | string | No | Allowed values only: "remote", "on-site", "hybrid". |
| date | date | No | Scrape/indexing date: UTC ISO8601 timestamp of the scrape (e.g. "2026-01-18T10:00:00Z"). |
| status | string | No | One of "scraped", "tested", "published", "verified". Starts as "scraped" and progresses: scraped → tested → published → verified. |
| vdate | date | No | Verified date (ISO8601). Set only when status = "verified". |
| expirationdate | date | No | Estimated job expiration date. At most vdate + 30 days, or extracted from the job page. |
| salary | string | No | Salary range plus currency, in the format "MIN-MAX CURRENCY" (e.g. "5000-8000 RON", "4000 EUR"). Must be a string, not an array. |

Note: Fields marked as string[] are multi-valued arrays. In SOLR/OpenSearch these are stored as arrays (e.g., ["București", "Cluj-Napoca"]).
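
The Job rules above can be checked mechanically before a record is indexed. The following is a minimal sketch, not the project's actual validator: the function names (`validate_job`, `next_status`, `default_expiration`) are hypothetical, the salary pattern is one assumed reading of "MIN-MAX CURRENCY" (three-letter uppercase currency, MAX optional), and the vdate rule is assumed to refer to the `status` field. The "standardized values only" rule for tags is not covered because the list of standard values is not defined in this file.

```python
import re
from datetime import datetime, timedelta
from typing import List, Optional
from urllib.parse import urlparse

WORKMODES = {"remote", "on-site", "hybrid"}
STATUS_ORDER = ["scraped", "tested", "published", "verified"]
# Assumed reading of "MIN-MAX CURRENCY"; also accepts a single amount ("4000 EUR").
SALARY_RE = re.compile(r"^\d+(-\d+)? [A-Z]{3}$")
DIACRITICS = set("ăâîșțĂÂÎȘȚ")


def validate_job(job: dict) -> List[str]:
    """Return a list of rule violations; an empty list means the job passed."""
    errors = []

    url = urlparse(job.get("url", ""))
    if url.scheme not in ("http", "https") or not url.netloc:
        errors.append("url must be a valid HTTP/HTTPS URL")

    title = job.get("title", "")
    if not title or title != title.strip() or len(title) > 200:
        errors.append("title must be non-empty, trimmed, max 200 chars")
    if re.search(r"<[^>]+>", title):
        errors.append("title must not contain HTML")

    tags = job.get("tags", [])
    if len(tags) > 20:
        errors.append("tags: max 20 entries")
    for tag in tags:
        if tag != tag.lower() or DIACRITICS & set(tag):
            errors.append(f"tag {tag!r} must be lowercase without diacritics")

    if "workmode" in job and job["workmode"] not in WORKMODES:
        errors.append("workmode must be remote, on-site or hybrid")

    salary = job.get("salary")
    if salary is not None and not (isinstance(salary, str) and SALARY_RE.match(salary)):
        errors.append('salary must be a string like "5000-8000 RON"')

    if "vdate" in job and job.get("status") != "verified":
        errors.append("vdate may be set only when status is verified")

    return errors


def next_status(current: str) -> Optional[str]:
    """Next step in scraped -> tested -> published -> verified, or None at the end."""
    i = STATUS_ORDER.index(current)
    return STATUS_ORDER[i + 1] if i + 1 < len(STATUS_ORDER) else None


def default_expiration(vdate: str) -> str:
    """Latest allowed expirationdate: vdate + 30 days, as a UTC ISO8601 string."""
    dt = datetime.fromisoformat(vdate.replace("Z", "+00:00"))
    return (dt + timedelta(days=30)).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Returning a list of messages rather than raising on the first failure lets a scraper report every problem with a record in one pass.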

Company Model Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | CIF/CUI of the company (e.g. "12345678"). Exact CIF/CUI, 8 digits, no RO prefix. |
| company | string | Yes | Exact name used for job matching. Legal name from the Trade Register, diacritics required (e.g. "Tehnologia Informației"). |
| brand | string | No | Commercial brand name (e.g. "ORANGE", "EPAM"). Used for display purposes. |
| group | string | No | Parent company group (e.g. "Orange Group", "EPAM Systems"). |
| status | string | No | Allowed values only: "activ", "suspendat", "inactiv", "radiat". If the company's status is not "activ", remove its jobs and also remove the company. |
| location | string[] | No | Location or detailed address. Romanian cities/addresses, diacritics accepted (e.g. "București", "Cluj-Napoca"). Multi-valued, stored as an array. |
| website | string[] | No | Official company website. Must be a valid HTTP/HTTPS URL, preferably canonical, without a trailing slash (e.g. "https://www.example.ro"). Multi-valued, stored as an array. |
| career | string[] | No | Official company career page. Must be a valid HTTP/HTTPS URL, preferably canonical, without a trailing slash, pointing to the jobs/careers section (e.g. "https://www.example.ro/careers"). Multi-valued, stored as an array. |
| lastScraped | string | No | Date of the last scrape in ISO8601 format (e.g. "2026-02-20"). Used for tracking. |
| scraperFile | string | No | Name of the scraper file used (e.g. "epam.md", "orange.md"). Used for reference. |

Note: Fields marked as string[] are multi-valued arrays. In SOLR/OpenSearch these are stored as arrays (e.g., ["https://example.com", "https://careers.example.com"]).
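
To illustrate how the Job and Company models relate, here is a hedged sketch of matching a job to its company. The helper names are hypothetical and the sample CIF values are made up. The sketch assumes that a job's cif, when present, takes priority over the name, that an ambiguous name with no cif yields no match, and that a company with no status is treated as "activ"; none of these choices is specified in this file.

```python
import re
from typing import List, Optional


def normalize_cif(raw: str) -> Optional[str]:
    """Strip whitespace and an optional RO prefix; return None unless 8 digits remain."""
    digits = re.sub(r"^RO", "", raw.strip().upper()).strip()
    return digits if re.fullmatch(r"\d{8}", digits) else None


def normalize_url(url: str) -> str:
    """Trim whitespace and the trailing slash, as the website/career rules ask."""
    return url.strip().rstrip("/")


def find_company(job: dict, companies: List[dict]) -> Optional[dict]:
    """Match by cif when the job has one, else by case-insensitive company name.

    casefold() ignores case but keeps diacritics, so "Informatiei" does not
    match "Informației". A name shared by several companies returns None.
    """
    if job.get("cif"):
        cif = normalize_cif(job["cif"])
        return next((c for c in companies if c["id"] == cif), None)
    name = job.get("company", "").casefold()
    matches = [c for c in companies if c["company"].casefold() == name]
    return matches[0] if len(matches) == 1 else None


def should_keep_jobs(company: dict) -> bool:
    """Jobs are kept only for active companies (missing status assumed "activ")."""
    return company.get("status", "activ") == "activ"


# Sample data with made-up CIF values.
companies = [
    {"id": "12345678", "company": "Systematic SRL", "status": "activ"},
    {"id": "87654321", "company": "Systematic SRL", "status": "radiat"},
    {"id": "11112222", "company": "Tehnologia Informației"},
]
```

With this data, a job that names only "Systematic SRL" stays unmatched until a cif is supplied, which is the situation the cif field in the Job model is meant to resolve.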