JavaScript module and CLI tool for working with web archive data using the WACZ format specification.

harvard-lil/js-wacz

js-wacz

JavaScript module and CLI tool for working with web archive data using the WACZ format specification, similar to Webrecorder's py-wacz.

It can be used to combine a set of .warc / .warc.gz files into a single .wacz file:

... programmatically (Node.js):

import { WACZ } from '@harvard-lil/js-wacz'

const archive = new WACZ({ 
  input: 'collection/*.warc.gz', 
  output: 'collection.wacz',
})

await archive.process() // "collection.wacz" is ready!

... or via the command line:

js-wacz create -f "collection/*.warc.gz" -o "collection.wacz"

js-wacz makes use of workers to process as many WARC files in parallel as the host machine can handle.
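
As a rough illustration of spreading WARC files across workers, the sketch below divides a list of file paths into one batch per worker. It is not js-wacz's actual scheduling logic: `partitionForWorkers` and its parameters are hypothetical names, and the worker count is supplied by the caller (on Node.js it could come from `os.availableParallelism()` or `os.cpus().length`).

```javascript
// Hypothetical helper: round-robin distribution of files across workers.
// Assumption: this only illustrates the idea, it is not js-wacz's internals.
function partitionForWorkers (files, workerCount) {
  if (!Number.isInteger(workerCount) || workerCount < 1) {
    throw new RangeError('workerCount must be a positive integer')
  }
  // Never create more batches than there are files.
  const batchCount = Math.min(workerCount, files.length)
  const batches = Array.from({ length: batchCount }, () => [])
  files.forEach((file, index) => {
    batches[index % batchCount].push(file)
  })
  return batches
}

const example = partitionForWorkers(['a.warc.gz', 'b.warc.gz', 'c.warc.gz'], 2)
console.log(example) // [ [ 'a.warc.gz', 'c.warc.gz' ], [ 'b.warc.gz' ] ]
```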

Perma Tools


Summary


Install

js-wacz requires Node.js 18+.

npm can be used to install this package and make the js-wacz command accessible system-wide:

npm install -g @harvard-lil/js-wacz

👆 Back to summary


CLI: create command

The create command helps combine one or multiple .warc or .warc.gz files into a single .wacz file.

js-wacz create -f "collection/*.warc.gz" -o "collection.wacz"

js-wacz accepts the following options and arguments for customizing how the WACZ file is assembled.

--file, -f

This is the only required argument, which indicates what file(s) should be processed and added to the resulting WACZ file.

The target can be a single file, or a glob pattern such as folder/*.warc.gz.

# Single file:
js-wacz create --file archive.warc
# Collection:
js-wacz create --file "collection/*.warc"

Note: When using globs, make sure to surround the path with quotation marks.

--output, -o

Specify where the resulting .wacz file should be created, and what its filename should be.

Defaults to archive.wacz in the current directory if not provided.

js-wacz create --file cool-beans.warc --output cool-beans.wacz

--pages, -p

Path to a folder containing pages.jsonl files (pages.jsonl, extraPages.jsonl ...).

If not provided, js-wacz attempts to detect pages in WARC records and builds its own pages.jsonl index.

# Assuming the following file exists: collection/pages/pages.jsonl
js-wacz create -f "collection/*.warc.gz" --pages collection/pages/
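
For cases where a pages.jsonl file has to be produced by hand before being passed via --pages, the following sketch serializes page entries as JSON Lines. It assumes the layout described in the WACZ specification (a header line, then one JSON object per page carrying at least `url` and `ts`); `buildPagesJsonl` is a hypothetical helper and not part of js-wacz.

```javascript
// Hypothetical helper: serializes page entries as JSON Lines.
// Assumption: header line first, then one object per page (url, ts, optional title).
function buildPagesJsonl (pages) {
  const header = { format: 'json-pages-1.0', id: 'pages', title: 'All Pages' }
  const lines = [header, ...pages].map((entry) => JSON.stringify(entry))
  return lines.join('\n') + '\n'
}

const jsonl = buildPagesJsonl([
  { url: 'https://lil.law.harvard.edu', ts: '2023-02-22T12:00:00Z', title: 'Example page' }
])
console.log(jsonl)
```

The resulting string could then be written to collection/pages/pages.jsonl (for instance with `fs.writeFile`) before running the command with --pages collection/pages/.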

--cdxj

Pass a directory of existing CDXJ files, rather than indexing from WARCs. Must be used in combination with --pages.

js-wacz create -f "collection/*.warc.gz" --pages collection/pages/ --cdxj collection/indexes/

--url

If provided, will be used as the mainPageUrl attribute for datapackage.json.

Must be a valid URL.

js-wacz create -f "collection/*.warc.gz" --url "https://lil.law.harvard.edu"

--ts

If provided, will be used as the mainPageDate attribute for datapackage.json.

Can be any value that can be parsed by JavaScript's Date() constructor.

js-wacz create -f "collection/*.warc.gz" --ts "2023-02-22T12:00:00.000Z"
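
Because --url must be a valid URL and --ts must be parsable by Date(), it can be useful to check values before invoking js-wacz. The sketch below shows one way to do so with built-in globals; `isValidUrl` and `isParsableDate` are hypothetical helpers, and js-wacz's own validation may be stricter or differ in detail.

```javascript
// Hypothetical pre-flight checks for --url and --ts values.
function isValidUrl (value) {
  try {
    return Boolean(new URL(value)) // the URL constructor throws on invalid input
  } catch {
    return false
  }
}

function isParsableDate (value) {
  // An unparsable input yields an "Invalid Date", whose time value is NaN.
  return !Number.isNaN(new Date(value).getTime())
}

console.log(isValidUrl('https://lil.law.harvard.edu')) // true
console.log(isValidUrl('lil.law.harvard.edu')) // false (no scheme)
console.log(isParsableDate('2023-02-22T12:00:00.000Z')) // true
console.log(isParsableDate('not a date')) // false
```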

--title

If provided, will be used as the title attribute for datapackage.json.

js-wacz create -f "collection/*.warc.gz" --title "My collection."

--desc

If provided, will be used as the description attribute for datapackage.json.

js-wacz create -f "collection/*.warc.gz" --desc "My cool collection of web archives."

--signing-url

If provided, will be used as an API endpoint for applying a cryptographic signature to the resulting WACZ file.

This endpoint is expected to be authsign-compatible.

js-wacz create -f "collection/*.warc.gz" --signing-url "https://example.com/sign"

--signing-token

Used together with --signing-url, in case the signing server requires authentication.

js-wacz create -f "collection/*.warc.gz" --signing-url "https://example.com/sign" --signing-token "FOO-BAR"

--log-level

Can be used to determine how verbose js-wacz needs to be.

  • Possible values are: silent, trace, debug, info, warn, error
  • Default is: info

js-wacz create -f "collection/*.warc.gz" --log-level trace
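
When the create command is driven from a script, the options listed in this section can be assembled into an argument list. The sketch below is one possible approach: `buildCreateArgs` and the keys of its options object are hypothetical names chosen for illustration, and only flags documented in this section are emitted.

```javascript
// Hypothetical helper: maps an options object to js-wacz CLI arguments.
// The object keys are illustrative; only the flags come from the documentation.
function buildCreateArgs (options) {
  if (!options.file) {
    throw new Error('--file is required')
  }
  if (options.cdxj !== undefined && options.pages === undefined) {
    throw new Error('--cdxj must be used in combination with --pages')
  }
  const flags = {
    file: '--file',
    output: '--output',
    pages: '--pages',
    cdxj: '--cdxj',
    url: '--url',
    ts: '--ts',
    title: '--title',
    desc: '--desc',
    signingUrl: '--signing-url',
    signingToken: '--signing-token',
    logLevel: '--log-level'
  }
  const args = ['create']
  for (const [key, flag] of Object.entries(flags)) {
    if (options[key] !== undefined) {
      args.push(flag, String(options[key]))
    }
  }
  return args
}

console.log(buildCreateArgs({ file: 'collection/*.warc.gz', output: 'collection.wacz' }))
// [ 'create', '--file', 'collection/*.warc.gz', '--output', 'collection.wacz' ]
```

The returned array could be passed to `execFile('js-wacz', args)` from `node:child_process`. Since `execFile` does not spawn a shell by default, the glob pattern reaches js-wacz unexpanded and needs no extra quoting.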

👆 Back to summary


Programmatic use

js-wacz's CLI and underlying logic are decoupled, and it can therefore be consumed as a JavaScript module (currently only with Node.js).

Example: Creating a signed WACZ programmatically

import { WACZ } from '@harvard-lil/js-wacz'

try {
  const archive = new WACZ({ 
    input: 'collection/*.warc.gz',
    output: 'collection.wacz',
    signingUrl: 'https://example.com/sign',
    signingToken: 'FOO-BAR',
  })

  await archive.process()

  // collection.wacz is ready
} catch(err) {
  // ...
}

Although a process() convenience method is made available, every step of said process can be run individually and the archive's state inspected / edited throughout.

Notable affordances

  • WACZ.addPage() allows for manually adding an entry to pages.jsonl.
  • WACZ.addFileToZip() allows for manually adding any additional data to the final WACZ file.
  • The datapackageExtras option allows for adding an arbitrary JSON-serializable object to datapackage.json under extras.

References:

👆 Back to summary


Feature parity with py-wacz

js-wacz aims for partial feature parity with Webrecorder's py-wacz.

This section lists notable differences in implementation that might affect interoperability.

Main differences in currently implemented features:

  • CLI: create --detect-pages: --detect-pages is implied in js-wacz unless --pages is provided.
  • CLI: create --file: this argument can be implied in py-wacz, but it must always be given explicitly in js-wacz.

👆 Back to summary


Development

Standard JS

This codebase uses the Standard JS coding style.

  • npm run lint can be used to check formatting.
  • npm run lint-autofix can be used to check formatting and automatically edit files accordingly when possible.
  • Most IDEs can be configured to automatically check and enforce this coding style.

JSDoc

JSDoc is used for both documentation and loose type checking purposes on this project.
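
To illustrate what JSDoc-based loose type checking can look like, here is a small sketch. The function is a hypothetical example rather than code from this repository, and the `// @ts-check` directive is just one common way for editors to check JSDoc types in plain JavaScript; the project's exact setup is not described here.

```javascript
// @ts-check

/**
 * Hypothetical example: tells whether a filename looks like a WARC file.
 * @param {string} filename - Name or path of the file to inspect.
 * @returns {boolean} True if the name ends with .warc or .warc.gz.
 */
function isWarcFilename (filename) {
  return /\.warc(\.gz)?$/i.test(filename)
}

console.log(isWarcFilename('archive.warc.gz')) // true
console.log(isWarcFilename('collection.wacz')) // false
```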

Testing

This project uses Node.js' built-in test runner.

npm run test

Tests-specific environment variables

The following environment variables allow for testing features requiring access to a third-party server.

These are optional, and can be added to a local .env file which will be automatically interpreted by the test runner.

| Name | Description |
| --- | --- |
| TEST_SIGNING_URL | URL of an authsign-compatible endpoint for signing WACZ files. To run such an endpoint locally, use npm run dev-signer, which will overwrite .env and set this variable to http://localhost:5000/sign; see .services/signer. |
| TEST_SIGNING_TOKEN | If required by the server at TEST_SIGNING_URL, an authentication token. |
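
As a sketch of how these optional variables might be consumed, the helper below turns them into signing options and returns null when no signing URL is configured. `signingOptionsFromEnv` is a hypothetical name and is not taken from the project's test suite.

```javascript
// Hypothetical helper: reads the optional signing variables from the environment.
// Returns null when TEST_SIGNING_URL is not set, so callers can skip signing tests.
function signingOptionsFromEnv (env = process.env) {
  const signingUrl = env.TEST_SIGNING_URL
  if (!signingUrl) {
    return null
  }
  const options = { signingUrl }
  if (env.TEST_SIGNING_TOKEN) {
    options.signingToken = env.TEST_SIGNING_TOKEN
  }
  return options
}

console.log(signingOptionsFromEnv({ TEST_SIGNING_URL: 'http://localhost:5000/sign' }))
// { signingUrl: 'http://localhost:5000/sign' }
```

With Node.js' built-in test runner, a null result could be used to skip a test through the `skip` option, for example `test('signing', { skip: options === null }, fn)`.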

Available CLI

# Runs test suite
npm run test

# Runs linter
npm run lint

# Runs linter and attempts to automatically fix issues
npm run lint-autofix

# Step-by-step NPM publishing helper
npm run publish-util

# Runs a local instance of wacz-signer for test purposes (see "Testing" section)
npm run dev-signer

👆 Back to summary
