filefacts v1.0.0 is the stable line for the parser behind cleave's feature extraction. The headline for ML work is simple: more samples now land in the right format bucket, more package identity becomes structured data, and skipped or truncated source analysis is visible instead of silent.
For supply-chain models, the point is not prettier names. It is fewer collapsed classes, more provenance features, better evidence offsets, and explicit failure signals when static analysis is bounded for corpus-scale safety.
- Package identity. Android and Alpine `.apk`, npm `.tgz`, Cargo `.crate`, RubyGems, Debian packages, NuGet, VSIX, IPA, conda, egg, and Arch/FreeBSD/macOS packages now get distinct types.
- Debian and RubyGems metadata. Names, versions, maintainers or authors, dependencies, licenses, platforms, installed size, and dependency-shape metrics become structured features.
- PE/.NET features. CLR managed resources now report count, maximum entropy, and maximum size. `.reloc` overhang measures payload bytes hidden past real relocation data.
- Version and signature signals. VERSIONINFO identity text gets entropy and symbol-ratio metrics, and certificate table size is now a metric for direct thresholding.
- Evidence offsets. ELF dynamic imports carry `.dynstr` offsets when available. Mach-O dylibs and code signatures carry file offsets. Source member symbols now carry byte offsets.
- Corpus-scale AST safety. Deep ASTs, large query outputs, tree-sitter guard skips, and source extractor panics now produce `ast.depth_capped`, `source.query_limited.*`, and `source.ast_unavailable`.
- Quieter file typing. Unsupported OCaml, Vim, Lisp, SQL, Smali, patches, CSS-like files, and TypeScript baselines are treated as text instead of weakly guessed as JavaScript, Kotlin, or Batch.
- Better content detection. AppleScript and pacman/AUR install scriptlets are detected more reliably without depending on extensions.
- Schema notes. `FileType::Pkg` split into specific package variants, `Symbol::Member` gained an optional `offset`, source error stages were added, and `pe.cert_table_size` moved from values to metrics.
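
To make the `.reloc` overhang idea concrete, here is a minimal sketch of one plausible way to measure it. It is not filefacts's implementation, and the function name `reloc_overhang` is hypothetical. It assumes the caller already has the raw bytes of the `.reloc` section, walks the base relocation blocks as the PE format lays them out (an 8-byte header whose second field is the block size), and counts whatever follows the last well-formed block, ignoring trailing zero padding.

```rust
/// Size of a base relocation block header: VirtualAddress (u32) + SizeOfBlock (u32).
const BLOCK_HEADER_LEN: usize = 8;

/// Hypothetical helper, not part of the filefacts API.
/// Returns the number of bytes in `section` that follow the last well-formed
/// base relocation block, not counting trailing zero padding.
pub fn reloc_overhang(section: &[u8]) -> usize {
    let mut pos = 0usize;

    // Walk blocks while there is room for at least one more header.
    while section.len() - pos >= BLOCK_HEADER_LEN {
        // SizeOfBlock is the second little-endian u32 of the header.
        let block_len = u32::from_le_bytes([
            section[pos + 4],
            section[pos + 5],
            section[pos + 6],
            section[pos + 7],
        ]) as usize;

        // A real block covers its own header, keeps 4-byte alignment,
        // and fits inside what is left of the section.
        let remaining = section.len() - pos;
        if block_len < BLOCK_HEADER_LEN || block_len % 4 != 0 || block_len > remaining {
            break;
        }
        pos += block_len;
    }

    // Everything after the last valid block is a candidate overhang.
    // Trailing zeros are treated as alignment padding and not counted.
    let tail = &section[pos..];
    tail.iter().rposition(|&b| b != 0).map_or(0, |i| i + 1)
}
```

This is a heuristic: appended bytes that happen to parse as a valid block header would be walked rather than counted, so a stricter variant might bound the walk by the size recorded in the base relocation data directory.
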
brew install atomdrift/tap/filefacts
cargo add filefacts
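
For feature pipelines, the explicit failure signals are most useful when folded into coverage flags, so a model can tell "no findings" apart from "analysis was bounded". The sketch below shows one way to do that. It assumes the signal names listed above arrive as plain strings; how they are read out of filefacts output is left out, and `AstCoverage` and `ast_coverage` are hypothetical names, not part of the filefacts API.

```rust
/// Hypothetical coverage flags derived from emitted signal names.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct AstCoverage {
    /// `ast.depth_capped` was present.
    pub depth_capped: bool,
    /// At least one `source.query_limited.*` signal was present.
    pub query_limited: bool,
    /// `source.ast_unavailable` was present.
    pub ast_unavailable: bool,
}

impl AstCoverage {
    /// True when no bounding or failure signal was seen.
    pub fn is_complete(&self) -> bool {
        !(self.depth_capped || self.query_limited || self.ast_unavailable)
    }
}

/// Folds signal names into coverage flags; unrelated names are ignored.
pub fn ast_coverage<'a, I>(signals: I) -> AstCoverage
where
    I: IntoIterator<Item = &'a str>,
{
    let mut cov = AstCoverage::default();
    for name in signals {
        match name {
            "ast.depth_capped" => cov.depth_capped = true,
            "source.ast_unavailable" => cov.ast_unavailable = true,
            // Any suffix under this prefix counts as a query limit.
            n if n.starts_with("source.query_limited.") => cov.query_limited = true,
            _ => {}
        }
    }
    cov
}
```

Keeping the three flags separate, rather than collapsing them into a single "partial" bit, lets downstream code decide which kinds of bounded analysis to mask, down-weight, or drop.
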