Welcome to CEDA Facet Scanner’s documentation!

This documentation describes the CEDA facet scanner, the package used to extract facets from collections of datasets so that they can be fed into OpenSearch.

The extracted data is fed into Elasticsearch.

Installation

Install the requirements:

pip install -r requirements.txt

Install the library:

pip install git+https://github.com/cedadev/facet-scanner

Basic Usage

This code can be used to bulk process a dataset for testing and initialisation:

usage: facet_scanner [-h] [--rerun] [--num-files NUM_FILES] [--conf CONF]
                     path processing_path

Process path for facets and update the index

positional arguments:
  path                  Path to process
  processing_path       Path to output intermediate files

optional arguments:
  -h, --help            show this help message and exit
  --rerun               Disable paging to disk on rerun
  --num-files NUM_FILES
                        Number of files per lotus job
  --conf CONF
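
As an illustration, an invocation matching the help text above could be assembled from Python. This is only a minimal sketch: the helper build_command and the example paths are hypothetical and not part of the package, and only the flags listed in the help text are used.

```python
from typing import List, Optional


def build_command(path: str, processing_path: str,
                  num_files: Optional[int] = None,
                  rerun: bool = False,
                  conf: Optional[str] = None) -> List[str]:
    """Assemble a facet_scanner argument list (hypothetical helper)."""
    cmd = ["facet_scanner"]
    if rerun:
        cmd.append("--rerun")
    if num_files is not None:
        cmd += ["--num-files", str(num_files)]
    if conf is not None:
        cmd += ["--conf", conf]
    # Positional arguments go last: the path to scan, then the output path.
    cmd += [path, processing_path]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace them with real locations.
    command = build_command("/path/to/dataset", "/path/to/intermediate",
                            num_files=1000)
    print(" ".join(command))
    # To launch it (requires facet_scanner to be installed):
    #   import subprocess
    #   subprocess.run(command, check=True)
```

Returning a list rather than a single string keeps each argument separate, so paths containing spaces need no extra quoting when passed to subprocess.run.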

The script takes the supplied path and queries Elasticsearch for all files beneath it. The --num-files flag sets the page size, which determines how many files go into each lotus batch job.
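
The effect of the page size can be sketched as follows. This is an illustrative assumption about the batching behaviour, not the package's actual implementation; the function name paginate and the file paths are hypothetical.

```python
from typing import Iterable, Iterator, List


def paginate(file_paths: Iterable[str], num_files: int) -> Iterator[List[str]]:
    """Yield successive pages of at most num_files paths (hypothetical helper)."""
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    page: List[str] = []
    for file_path in file_paths:
        page.append(file_path)
        if len(page) == num_files:
            # A full page corresponds to one batch job.
            yield page
            page = []
    if page:
        # The final page may hold fewer than num_files entries.
        yield page


if __name__ == "__main__":
    files = [f"/path/to/dataset/file_{i}.nc" for i in range(5)]
    for job_number, batch in enumerate(paginate(files, num_files=2)):
        print(job_number, batch)
```

With five files and a page size of two, this produces three batches, the last containing a single file. A smaller page size therefore means more, shorter jobs.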
