siegfried 1.10.0 released
Saturday, 25 March, 2023
Version 1.10.0 of siegfried is now available. Get it here.
The major changes in this release are the inclusion of a format classification field in results, a “droid” multi setting for roy, and improvements to the multi-sequence matching algorithm.
New format classification field in results
A new “class” field now appears in results (for the YAML, JSON and CSV outputs). It contains values from the format classification field in the PRONOM database which groups formats into categories such as “audio” and “database”. You can also omit the field when building a signature file with roy build -noclass
. For the background to this change, see the discussion page.
DROID multi setting for roy build command
Roy’s multi flag has a new “droid” mode: roy build -multi droid
.
This mode aims to more closely match DROID results by applying priority relationships after, rather than during, matching. This setting is more likely to show hybrid files than the default. For example, assume there is a file that is both a valid PDF and valid HTML document: in its default mode, siegfried, once it had positively matched either of those formats, would ignore the other because there is no priority relationship between them (e.g. having matched a PDF it will only consider more specific types of PDF). With the “droid” multi setting, both results would be returned as equally valid. For more information on this change see this issue.
Improvements to the multi-sequence matching algorithm
Siegfried uses a modified form of the Aho Corasick multiple-string matching algorithm for byte matching. This release includes a new dynamic version of the algorithm that pauses matching after all strings with maxium offsets have been tested and resumes matching with only the subset of strings that might still result in positive matches. By narrowing the search space, this improves performace for wildcard searches. This change has modestly increased performance for most of the benchmarks and creates scope for further optimizations in future releases.
CHANGELOG v1.10.0 (2023-03-25)
- format classification included as “class” field in PRONOM results. Requested by Robin François. Implemented by Ross Spencer
-noclass
flag added to roy build command. Use this flag to build signatures that omit the new “class” field from results.- glob paths can be used in place of file or directory paths for identification (e.g.
sf *.jpg
). Implemented by Ross Spencer -multi droid
setting for roy build command. Applies priorities after rather than during identificaiton for more DROID-like results. Reported by David Clipsham/update
command for server mode. Requested by Luis Faria- new algorithm for dynamic multi-sequence matching for improved wildcard performance
- update PRONOM to v111
- update LOC to 2023-01-27
- update tika-mimetypes to v2.7.0
- archivematica extensions built into wikidata signatures. Reported by Ross Spencer
- trailing slash for folder paths in URI field in droid output. Reported by Philipp Wittwer
- crash when using
sf -replay
with droid output