Usage
Warning
This code base has been developed to work with the JASMIN Lotus cluster and will need adaptation to work in any other environment.
The command line tool expects to be run from within a collection, i.e. a group of files which share a common structure or processing class. The facet scanner maps file paths to handlers; these handlers know how to interact with the files to extract facets that are useful when searching the data.
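The path-to-handler idea can be illustrated with a minimal sketch. Everything named here (`ExampleHandler`, `HANDLER_MAP`, `get_handler`, `get_facets` and the `/example/collection` path) is hypothetical and chosen for illustration only; the real facet scanner may organise its handlers quite differently.

```python
from pathlib import PurePosixPath


class ExampleHandler:
    """Hypothetical handler that derives facets from the file name alone."""

    def get_facets(self, path):
        name = PurePosixPath(path)
        return {"filename": name.stem, "format": name.suffix.lstrip(".")}


# Hypothetical mapping from a collection root to the handler class for it.
HANDLER_MAP = {"/example/collection": ExampleHandler}


def get_handler(path, handler_map=HANDLER_MAP):
    """Return a handler instance for the longest matching root, or None."""
    matches = [
        root for root in handler_map
        if path == root or path.startswith(root.rstrip("/") + "/")
    ]
    if not matches:
        return None
    return handler_map[max(matches, key=len)]()


handler = get_handler("/example/collection/a/file.nc")
facets = handler.get_facets("/example/collection/a/file.nc")
```

Choosing the longest matching root is one possible way to let a more specific handler override a general one; it is an assumption of this sketch, not documented behaviour.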
The command line entry point:
usage: facet_scanner [-h] [--rerun] [--num-files NUM_FILES] [--conf CONF]
                     path processing_path

Process path for facets and update the index

positional arguments:
  path                  Path to process
  processing_path       Path to output intermediate files

optional arguments:
  -h, --help            show this help message and exit
  --rerun               Disable paging to disk on rerun
  --num-files NUM_FILES
                        Number of files per lotus job
  --conf CONF
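The help text above is consistent with an argparse parser along the following lines. This is a reconstruction from the help output, not the project's source: the `int` type for `--num-files`, the absence of defaults, and the example paths are all assumptions.

```python
import argparse


def build_parser():
    """Rebuild a parser matching the documented help text (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="facet_scanner",
        description="Process path for facets and update the index",
    )
    parser.add_argument("path", help="Path to process")
    parser.add_argument(
        "processing_path", help="Path to output intermediate files"
    )
    parser.add_argument(
        "--rerun", action="store_true", help="Disable paging to disk on rerun"
    )
    # Assumption: the value is parsed as an integer.
    parser.add_argument(
        "--num-files", type=int, help="Number of files per lotus job"
    )
    # The help output gives no description for --conf, so none is added here.
    parser.add_argument("--conf")
    return parser


# Hypothetical invocation; both paths are placeholders.
args = build_parser().parse_args(
    ["--num-files", "500", "/example/collection", "/example/pages"]
)
```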
The script works by:
1. Run elasticsearch query to return all the files under the given path
2. Save each page (size given by --num-files) into an intermediate directory processing_path
3. Once this process has completed, each page file is submitted to lotus using facet_scanner/scripts/lotus_facet_scanner.py
4. This runs the facet extraction on the files listed in the page file using lotus, writing to elasticsearch
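The paging step can be sketched as below. This is a minimal illustration, assuming the query results are already available as an iterable of file paths; the function name `write_pages`, the `page_N.txt` naming scheme and the one-path-per-line format are hypothetical, and the Elasticsearch query and lotus submission are deliberately left out.

```python
import os
import tempfile
from itertools import islice


def write_pages(file_paths, processing_path, num_files):
    """Write file_paths to page files holding at most num_files paths each."""
    os.makedirs(processing_path, exist_ok=True)
    iterator = iter(file_paths)
    page_files = []
    while True:
        page = list(islice(iterator, num_files))
        if not page:
            break
        # Hypothetical naming scheme: page_0.txt, page_1.txt, ...
        page_file = os.path.join(
            processing_path, f"page_{len(page_files)}.txt"
        )
        with open(page_file, "w") as handle:
            handle.write("\n".join(page) + "\n")
        page_files.append(page_file)
    return page_files


# Example run in a temporary directory with placeholder paths.
with tempfile.TemporaryDirectory() as tmp:
    example_paths = [f"/example/collection/file_{i}.nc" for i in range(5)]
    pages = write_pages(example_paths, tmp, num_files=2)
    page_sizes = []
    for page_file in pages:
        with open(page_file) as handle:
            page_sizes.append(len(handle.read().splitlines()))
```

Each page file would then correspond to one lotus job, which is why the page size is controlled by --num-files.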
Some files may not complete in lotus. The lotus script deletes the page file on successful completion of the facet extraction, so once all the scheduled jobs have finished, any page files remaining in the intermediate directory show which files did not run. In most cases, simply re-running the facet_scanner script with the --rerun flag will clear them. The --rerun flag skips steps 1 and 2 above and goes straight to submitting the jobs to lotus.
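Checking the intermediate directory for leftover page files could look like the sketch below. The helper name `remaining_page_files` and the example file names are hypothetical; the only behaviour taken from the text above is that a page file still on disk indicates a job that did not complete.

```python
import os
import tempfile


def remaining_page_files(processing_path):
    """Return the sorted names of regular files still in processing_path."""
    return sorted(
        entry.name
        for entry in os.scandir(processing_path)
        if entry.is_file()
    )


# Example: simulate two page files left behind by unfinished jobs.
with tempfile.TemporaryDirectory() as processing_path:
    for name in ("page_3.txt", "page_0.txt"):
        open(os.path.join(processing_path, name), "w").close()
    leftover = remaining_page_files(processing_path)
```

An empty result would suggest that every page was processed; a non-empty one is the cue to re-run with --rerun.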