@jscpd/tokenizer

Tokenizer package for @jscpd — converts source code into a list of tokens for duplicate detection.

Supports 223 programming languages and formats via a self-contained reprism-based grammar engine. Grammars are loaded lazily for fast startup, and the hot paths run in O(n) time for high-throughput scanning.
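As a rough illustration of what lazy grammar loading can look like, the sketch below registers grammar factories up front and builds each grammar only on first request. Every name here (`Grammar`, `LazyGrammarRegistry`, `stats`) is hypothetical and chosen for the example; this is not the package's internal implementation or API.

```typescript
// Hypothetical types and names for illustration only; not the @jscpd/tokenizer API.
type Grammar = { name: string; keywords: string[] };
type GrammarFactory = () => Grammar;

class LazyGrammarRegistry {
  private readonly factories = new Map<string, GrammarFactory>();
  private readonly loaded = new Map<string, Grammar>();

  // Registering stores only the factory; no grammar is built yet.
  register(name: string, factory: GrammarFactory): void {
    this.factories.set(name, factory);
  }

  // Build the grammar on first request, then serve it from the cache.
  get(name: string): Grammar | undefined {
    const cached = this.loaded.get(name);
    if (cached !== undefined) {
      return cached;
    }
    const factory = this.factories.get(name);
    if (factory === undefined) {
      return undefined;
    }
    const grammar = factory();
    this.loaded.set(name, grammar);
    return grammar;
  }

  loadedCount(): number {
    return this.loaded.size;
  }
}

// Count how many times a factory actually runs.
const stats = { builds: 0 };
const registry = new LazyGrammarRegistry();

registry.register('javascript', () => {
  stats.builds += 1;
  return { name: 'javascript', keywords: ['let', 'const', 'function'] };
});
registry.register('css', () => {
  stats.builds += 1;
  return { name: 'css', keywords: [] };
});

// Only the requested grammar is built, and only once.
const first = registry.get('javascript');
const again = registry.get('javascript');
```

With this shape, startup cost is limited to registering factories, and a scan that only touches JavaScript files never pays for the other grammars.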

Special tokenization modes handle multi-language files:

  • Vue SFC (.vue) — <template>, <script>, and <style> blocks each tokenized by their own language
  • Svelte (.svelte) — per-block tokenization for HTML, JS, and CSS sections
  • Astro (.astro) — frontmatter and template blocks tokenized independently
  • Markdown (.md) — fenced code blocks tokenized by the declared language

This enables cross-format clone detection: a <script lang="ts"> block in a .vue file can match a plain .ts file.
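The per-block idea can be sketched as follows: split a single-file component into its top-level blocks, pick a format for each from its `lang` attribute, and hand each block to the tokenizer under that format. `splitSfcBlocks`, `SfcBlock`, the alias table, and the default format names are assumptions made for this example, not the package's actual code, and the regex is deliberately naive (it does not handle nested `<template>` tags).

```typescript
// Hypothetical helper for illustration; not part of @jscpd/tokenizer.
type SfcTag = 'template' | 'script' | 'style';

interface SfcBlock {
  tag: SfcTag;
  format: string;    // format name the block would be tokenized as
  content: string;   // text between the opening and closing tags
  startLine: number; // 1-based line of the opening tag in the source file
}

// Assumed default format per block when no lang attribute is present.
const DEFAULT_FORMAT: Record<SfcTag, string> = {
  template: 'markup',
  script: 'javascript',
  style: 'css',
};

// Assumed mapping from short lang values to full format names.
const LANG_ALIASES: Record<string, string> = {
  ts: 'typescript',
  js: 'javascript',
};

function splitSfcBlocks(source: string): SfcBlock[] {
  const blocks: SfcBlock[] = [];
  // Naive match of <template>, <script>, and <style> blocks.
  const pattern = /<(template|script|style)\b([^>]*)>([\s\S]*?)<\/\1>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const tag = match[1] as SfcTag;
    const attrs = match[2] ?? '';
    const lang = /\blang=["']([^"']+)["']/.exec(attrs)?.[1];
    const format = lang === undefined
      ? DEFAULT_FORMAT[tag]
      : LANG_ALIASES[lang] ?? lang;
    // Count newlines before the match to recover the original line number.
    const startLine = source.slice(0, match.index).split('\n').length;
    blocks.push({ tag, format, content: match[3] ?? '', startLine });
  }
  return blocks;
}

const sample = [
  '<template>',
  '  <p>{{ msg }}</p>',
  '</template>',
  '<script lang="ts">',
  'export const msg: string = "hi";',
  '</script>',
  '<style>',
  'p { color: red; }',
  '</style>',
].join('\n');

const blocks = splitSfcBlocks(sample);
```

Because the script block comes out tagged as `typescript`, its tokens are comparable with those of a standalone `.ts` file, which is what makes the cross-format match possible. Keeping `startLine` lets a detected clone be reported at its position in the original `.vue` file rather than relative to the block.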

Installation

npm install @jscpd/tokenizer --save

Usage

import { IOptions, ITokensMap } from '@jscpd/core';
import { Tokenizer } from '@jscpd/tokenizer';

const tokenizer = new Tokenizer();
const options: IOptions = {};

// Arguments: source id, source text, format name, options
const maps: ITokensMap[] = tokenizer.generateMaps('source_id', 'let a = "11"', 'javascript', options);

Supported formats

The full list of 223 supported formats is available in FORMATS.md at the repository root, or at runtime:

jscpd --list

License

MIT © Andrey Kucherenko