Research methodology
How Four-Leaf Research collects, classifies, and analyzes job-market data. We publish this page separately so readers can audit the methods once rather than decoding them from every report.
Data collection
Four-Leaf runs a nightly scraper against public Applicant Tracking System (ATS) feeds for a curated index of AI-native and high-growth employers. The sources covered are:
- Greenhouse public JSON API (one feed per company slug).
- Ashby public JSON API (one feed per org slug).
- Lever public posting API.
Each record is stored with the raw payload plus normalized fields: title, company, location, remote flag, description, apply URL, and posting timestamp. We do not attempt to resolve the same role across ATS vendors; a role listed under two vendors would be counted twice. In practice, the 16 companies in the current index each use a single ATS vendor, so cross-vendor duplication is not a concern for this snapshot.
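For concreteness, the normalization step can be sketched as below. The field names (`title`, `location.name`, `content`, `absolute_url`, `updated_at`) follow our reading of the public Greenhouse boards payload and should be treated as assumptions; Ashby and Lever map onto the same schema with their own field names.

```python
from dataclasses import dataclass

@dataclass
class Posting:
    title: str
    company: str
    location: str
    remote: bool
    description: str
    apply_url: str
    posted_at: str
    raw: dict          # original payload, kept for auditability

def normalize_greenhouse(company: str, job: dict) -> Posting:
    """Map one Greenhouse job payload onto the normalized schema."""
    location = (job.get("location") or {}).get("name", "")
    return Posting(
        title=job.get("title", ""),
        company=company,
        location=location,
        # Vendor remote flags vary; "remote" here is a hypothetical
        # payload field, passed through as-is when present.
        remote=bool(job.get("remote", False)),
        description=job.get("content", ""),
        apply_url=job.get("absolute_url", ""),
        posted_at=job.get("updated_at", ""),
        raw=job,
    )
```

Keeping `raw` alongside the normalized fields is what makes later audits possible: any classification can be re-derived from the stored payload.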
The 16-company index
The current quarterly index covers: Airbnb, Anthropic, Coinbase, Datadog, Discord, DoorDash, Figma, Instacart, Notion, Pinterest, Plaid, Ramp, Scale AI, Spotify, Stripe, Vercel. The index is curated for signal density: AI-native and high-growth technology employers whose hiring volume is informative about where the industry is moving. It is not a representative sample of the US labor market. Quarterly updates may add or remove companies based on ATS availability; the complete list per quarter is published in each report’s stats JSON file.
Role-family classification
Each role is mapped to one family using a first-match-wins keyword heuristic against the job title. The families are:
- data_ml: machine learning, data science, data engineering, analytics engineering, applied science.
- research: research scientist, research engineer, and explicit research roles that don’t match data_ml.
- engineering: software, infrastructure, SRE, DevOps, platform, security, QA.
- product: product manager, product owner, TPM, technical program.
- design: UX, UI, brand, creative.
- sales_gtm, marketing, operations, finance_legal, people, support: standard GTM and back-office families.
- other: anything the heuristic cannot confidently classify. A sizeable “other” bucket reflects titles like “Staff Business Insights Partner” where function isn’t obvious from keywords alone.
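A minimal sketch of the first-match-wins heuristic. The keyword lists are abbreviated for illustration; the full per-family lists live in the pipeline code.

```python
import re

# Checked top to bottom; the first family whose keyword matches wins.
ROLE_FAMILIES = [
    ("data_ml", ["machine learning", "data scien", "data engineer",
                 "analytics engineer", "applied scien"]),
    ("research", ["research"]),
    ("engineering", ["software", "infrastructure", "sre", "devops",
                     "platform", "security", "qa"]),
    ("product", ["product manager", "product owner", "tpm",
                 "technical program"]),
    ("design", ["ux", "ui", "brand", "creative"]),
]

def classify_role(title: str) -> str:
    t = title.lower()
    for family, keywords in ROLE_FAMILIES:
        for kw in keywords:
            # Leading word boundary so short keywords like "ui"
            # don't fire inside words like "build".
            if re.search(r"\b" + re.escape(kw), t):
                return family
    return "other"
```

Ordering matters: `data_ml` is checked before `research` so that “Applied Science Researcher” lands in `data_ml`, matching the family definitions above.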
Seniority classification
Seniority bands are assigned from title keywords using a first-match-wins order: intern, exec, director, principal, staff, senior, manager, mid, entry, other. Exec catches VP, Chief, Head of. “II” or “III” suffixes (but not bare “I”, which is ambiguous) map to mid. Titles without any keyword fall into “other”, which is the largest bucket at most companies.
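The band assignment can be sketched in the same style. Note the “entry” keywords are not enumerated on this page, so the ones below are illustrative placeholders.

```python
import re

# First-match-wins, checked in this order.
SENIORITY_BANDS = [
    ("intern", r"\bintern"),
    ("exec", r"\b(vp|vice president|chief|head of)\b"),
    ("director", r"\bdirector\b"),
    ("principal", r"\bprincipal\b"),
    ("staff", r"\bstaff\b"),
    ("senior", r"\b(senior|sr\.?)\b"),
    ("manager", r"\bmanager\b"),
    ("mid", r"\b(ii|iii)\b"),    # bare "I" is ambiguous and ignored
    ("entry", r"\b(junior|jr\.?|entry[- ]level)\b"),  # illustrative keywords
]

def classify_seniority(title: str) -> str:
    t = title.lower()
    for band, pattern in SENIORITY_BANDS:
        if re.search(pattern, t):
            return band
    return "other"
```

The order encodes precedence: “Senior Staff Engineer” resolves to `staff`, not `senior`, because `staff` is checked first.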
Salary extraction
The salary_text field in most ATS payloads is empty or unreliable. We extract salary ranges directly from description text using a regular expression that matches common US dollar formats:
- $120,000 - $180,000
- $120K - $180K (or $120k - $180k)
- $120,000 to $180,000
- USD 120,000 - 180,000
Three guardrails reduce false positives:
- Sanity bounds. Ranges below $30,000 or above $2,000,000 are rejected.
- Plausibility. The max cannot exceed 5x the min, to catch malformed parses like “$100K bonus - $500K total comp.”
- Context filter. A 40-character window before each match is scanned for words like revenue, raised, valuation, funding, ARR, budget, spend, bonus, equity refresh, sign-on; matches in those contexts are skipped.
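Put together, the extractor looks roughly like this. The regular expression is a simplified stand-in for the production one, but the three guardrails are applied as described.

```python
import re

# Simplified pattern for US dollar ranges: "$120,000 - $180,000",
# "$120K - $180k", "$120,000 to $180,000", "USD 120,000 - 180,000".
SALARY_RE = re.compile(
    r"(?:USD\s*)?\$?\s*(\d{1,3}(?:,\d{3})+|\d+)\s*([kK])?"
    r"\s*(?:-|–|to)\s*"
    r"(?:USD\s*)?\$?\s*(\d{1,3}(?:,\d{3})+|\d+)\s*([kK])?"
)

CONTEXT_WORDS = ("revenue", "raised", "valuation", "funding", "arr",
                 "budget", "spend", "bonus", "equity refresh", "sign-on")

def extract_salary(text: str):
    """Return (min, max) in dollars, or None if no plausible range survives."""
    for m in SALARY_RE.finditer(text):
        lo = int(m.group(1).replace(",", "")) * (1000 if m.group(2) else 1)
        hi = int(m.group(3).replace(",", "")) * (1000 if m.group(4) else 1)
        # Context filter: 40-char window before the match.
        window = text[max(0, m.start() - 40):m.start()].lower()
        if any(w in window for w in CONTEXT_WORDS):
            continue
        # Sanity bounds: reject below $30,000 or above $2,000,000.
        if not (30_000 <= lo <= 2_000_000 and hi <= 2_000_000):
            continue
        # Plausibility: max cannot exceed 5x min.
        if hi > 5 * lo:
            continue
        return lo, hi
    return None
```

Returning the first surviving match (rather than all matches) mirrors the assumption that the headline comp range appears before any secondary figures.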
Companies whose JDs link out to a separate comp page, or pass ranges out-of-band to recruiters, will show 0% salary disclosure under this method. We report those companies as “not in JD” rather than inferring values. When reading the scorecard, treat 0% as “not disclosed at the JD surface,” not as “not disclosed at all.”
Remote-work cross-validation
The boolean remote flag from ATS feeds is unreliable. At some employers the flag is set to true for every role by default; at others it is always false even for fully-remote roles. For every report that covers remote work, we publish two signals:
- ATS flag. The boolean passed through from the source ATS.
- Text signal. A case-insensitive scan of title, location, and description for “remote,” “work from home,” “WFH,” “remote-first,” “fully remote,” “anywhere in the,” or “distributed team.”
Neither signal alone is authoritative. We publish both and flag the largest divergences so readers can interpret the remote share directionally rather than as a precise figure.
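The text signal is a straightforward case-insensitive substring scan; a sketch (some phrases subsume others, but the list mirrors the published one):

```python
REMOTE_PHRASES = (
    "remote", "work from home", "wfh", "remote-first",
    "fully remote", "anywhere in the", "distributed team",
)

def remote_text_signal(title: str, location: str, description: str) -> bool:
    """True if any remote-work phrase appears in the scanned JD surfaces."""
    haystack = " ".join((title, location, description)).lower()
    return any(phrase in haystack for phrase in REMOTE_PHRASES)
```

A divergence report is then just the set of postings where this boolean disagrees with the ATS flag.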
Skill mention detection
Skill mentions are counted once per JD regardless of how many times the skill appears. The taxonomy is fixed per quarter and published inside each report’s stats JSON (see topSkillsByRole). Skill detection is only run against engineering, data/ML, and research roles; cross-cutting mentions in sales or marketing JDs are not reported to avoid false positives (a JD asking for a “Python fan” in a salesops role would inflate the Python count).
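Counting once per JD is a per-document membership test, not a frequency count. A sketch, assuming plain-text JDs and a fixed skill list:

```python
import re

def count_skill_mentions(jds, skills):
    """For each skill, count how many JDs mention it at least once."""
    counts = {s: 0 for s in skills}
    for jd in jds:
        text = jd.lower()
        for s in skills:
            # Word boundaries so "Rust" doesn't fire inside "frustrated";
            # incremented at most once per JD regardless of repeats.
            if re.search(r"\b" + re.escape(s.lower()) + r"\b", text):
                counts[s] += 1
    return counts
```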
Years-of-experience extraction
Minimum years of experience are extracted from phrases like “5+ years of experience,” “minimum of 7 years,” or “at least 3 years.” Values outside 0–25 are rejected as parse errors. If a JD lists multiple ranges, we take the first match, which is usually the headline requirement.
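A sketch of the extraction, with the 0–25 bound and first-match rule applied (the pattern is a simplified stand-in):

```python
import re

# Matches "5+ years", "minimum of 7 years", "at least 3 years", etc.
# The lookbehind keeps "100 years" from parsing as a truncated "00".
YOE_RE = re.compile(r"(?<!\d)(\d{1,2})\s*\+?\s*years?\b", re.IGNORECASE)

def extract_min_years(text: str):
    """Return the first in-range minimum-years figure, or None."""
    for m in YOE_RE.finditer(text):
        years = int(m.group(1))
        if 0 <= years <= 25:
            return years  # first in-range match wins
    return None
```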
Known limitations
- Snapshot, not panel. A single snapshot can’t support trend claims. Trend analysis becomes available once we publish two or more quarterly issues built on the same pipeline.
- Curated index bias. The 16 companies are AI-native and high-growth by selection. Headline numbers reflect that selection and shouldn’t be read as broader industry averages.
- US-centric salary parser. Only USD ranges in the formats listed above are extracted. Non-US roles often disclose compensation elsewhere.
- Heuristic classification. Role and seniority are inferred from title keywords. Edge cases (e.g. “Member of Technical Staff” at research labs) fall into “other” and may undercount the families those roles actually belong to.
- Posting duplication. A single req listed in multiple locations appears as multiple rows. We do not dedupe across locations because that would lose signal about geographic footprint.
Reproducibility
Every research report is accompanied by a CSV dataset and a JSON stats file under /research/*.csv and /research/*-stats.json. Both files are versioned in the Four-Leaf repository as of the snapshot date, so any reader can reproduce the headline numbers from the dataset with a spreadsheet or a short script. The analysis script for each report is checked into scripts/research/.
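For example, a headline role-family tally can be reproduced from the CSV in a few lines. The `role_family` column name here is an assumption; check the header row of the actual dataset.

```python
import csv
from collections import Counter

def headline_counts(csv_path: str) -> Counter:
    """Tally postings per role family from a report CSV."""
    with open(csv_path, newline="") as f:
        return Counter(row["role_family"] for row in csv.DictReader(f))
```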
Licensing and citation
All Four-Leaf research output, datasets, and this methodology page are released under the Creative Commons Attribution 4.0 International license. Commercial reuse is permitted with attribution. Attribution format:
Four-Leaf Team. (2026). [Report title]. Four-Leaf Research. Retrieved from four-leaf.ai/research.
Corrections
Found a bug in the methodology or a stat that doesn’t reproduce? Email [email protected]. We publish corrections as report revisions with the dateModified field updated accordingly.