Usage

Warning

This code base has been developed to work with the JASMIN Lotus cluster and will need adaptation to work in any other environment.

The command line tool expects to be run from within a collection, i.e. a group of files that share a common structure or processing class. The facet scanner maps file paths to handlers. Each handler knows how to read its files and extract facets that are useful when searching the data.
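As a rough illustration of the path-to-handler idea, the sketch below picks a handler by the longest matching collection root. It is not the project's actual implementation: the handler functions, the `HANDLER_MAP` name, the example paths and the facet names are all hypothetical.

```python
from pathlib import PurePosixPath

def filename_handler(path):
    """Hypothetical handler: reads facets from dot-separated parts of the file name."""
    parts = PurePosixPath(path).stem.split(".")
    return dict(zip(("model", "variable"), parts))

def default_handler(path):
    """Hypothetical fallback: records only the file extension."""
    return {"extension": PurePosixPath(path).suffix.lstrip(".")}

# Hypothetical mapping from a collection root to the handler for that collection.
HANDLER_MAP = {
    "/data/collection_a": filename_handler,
}

def get_handler(path):
    """Return the handler whose collection root is the longest prefix of path."""
    parents = PurePosixPath(path).parents
    matches = [root for root in HANDLER_MAP if PurePosixPath(root) in parents]
    if not matches:
        return default_handler
    return HANDLER_MAP[max(matches, key=len)]
```

Choosing the longest matching root lets a nested collection override the handler of its parent directory.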

The command line entry point:

usage: facet_scanner [-h] [--rerun] [--num-files NUM_FILES] [--conf CONF]
                     path processing_path

Process path for facets and update the index

positional arguments:
  path                  Path to process
  processing_path       Path to output intermediate files

optional arguments:
  -h, --help            show this help message and exit
  --rerun               Disable paging to disk on rerun
  --num-files NUM_FILES
                        Number of files per lotus job
  --conf CONF
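The help text above corresponds to an argument parser along the following lines. This is a reconstruction from the help output only: the `store_true` action for `--rerun` and the `int` type for `--num-files` are assumptions, and the example paths are made up.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented facet_scanner options."""
    parser = argparse.ArgumentParser(
        prog="facet_scanner",
        description="Process path for facets and update the index",
    )
    parser.add_argument("path", help="Path to process")
    parser.add_argument("processing_path", help="Path to output intermediate files")
    # Assumed to be a boolean switch, since the usage line shows no value for it.
    parser.add_argument("--rerun", action="store_true",
                        help="Disable paging to disk on rerun")
    # Assumed to be an integer count.
    parser.add_argument("--num-files", type=int,
                        help="Number of files per lotus job")
    parser.add_argument("--conf")
    return parser

# Example invocation with hypothetical paths and page size.
args = build_parser().parse_args(
    ["/data/collection_a", "/tmp/pages", "--num-files", "500"]
)
```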

The script works as follows:

  1. Runs an Elasticsearch query to return all the files under the given path.

  2. Saves each page of results (page size set by --num-files) as a page file in the intermediate directory processing_path.

  3. Once all pages have been saved, submits each page file to Lotus using facet_scanner/scripts/lotus_facet_scanner.py.

  4. Each Lotus job runs the facet extraction on the files listed in its page file and writes the results to Elasticsearch.
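The paging and per-page processing described in the steps above can be sketched as follows. This is a minimal stand-in under stated assumptions, not the project's code: a plain list of paths replaces the Elasticsearch results, the Lotus submission is left out, and the function names, the page file naming scheme and the one-path-per-line format are all hypothetical.

```python
from pathlib import Path

def write_pages(file_paths, processing_path, num_files):
    """Split file_paths into pages of at most num_files and write one file per page."""
    out_dir = Path(processing_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    page_files = []
    for page_no, start in enumerate(range(0, len(file_paths), num_files)):
        page_file = out_dir / f"page_{page_no:05d}.txt"  # hypothetical naming scheme
        page = file_paths[start:start + num_files]
        page_file.write_text("\n".join(page) + "\n")  # assumed: one path per line
        page_files.append(page_file)
    return page_files

def process_page(page_file, extract):
    """Call extract on every path listed in the page, then delete the page file."""
    page_file = Path(page_file)
    for file_path in page_file.read_text().splitlines():
        extract(file_path)
    # Only reached if no extraction raised, so a failed job leaves its page behind.
    page_file.unlink()
```

Deleting the page file as the last action is what makes the intermediate directory usable as a record of unfinished work.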

Some jobs may not complete in Lotus. The Lotus script deletes its page file only when the facet extraction completes successfully, so any page file left behind marks a job that did not finish.

Once all the scheduled jobs have finished, check the intermediate directory to see which page files were not processed. In most cases, re-running the facet_scanner script with the --rerun flag will clear them. The --rerun flag skips steps 1 and 2 above and goes straight to submitting the jobs to Lotus.
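One way to inspect the intermediate directory before deciding on a rerun is sketched below. The function name is hypothetical, and the sketch assumes page files are plain text with one file path per line, which the description above does not confirm.

```python
from pathlib import Path

def unfinished_pages(processing_path):
    """Map each leftover page file name to the number of paths it still lists."""
    leftovers = {}
    for page_file in sorted(Path(processing_path).iterdir()):
        if page_file.is_file():
            leftovers[page_file.name] = len(page_file.read_text().splitlines())
    return leftovers
```

An empty result would suggest that every job completed and no rerun is needed.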