EMPIARreader Examples
Using the Python interface
For this example, we open the EMPIAR entry 10943 and load an image dataset from its available directories.
from empiarreader import EmpiarSource, EmpiarCatalog
test_entry = 10943
Every EMPIAR entry has an associated XML file that lists the entry's default image directories. This information can be accessed by loading the entry into an EmpiarCatalog.
test_catalog = EmpiarCatalog(test_entry)
To get the dataset from the catalog, specify which directory to load. In this case there is only one, so we choose the key at position 0.
test_catalog_dir = list(test_catalog.keys())[0]
dataset_from_catalog = test_catalog[test_catalog_dir]
However, the intended target is not always the directory listed in the XML. We can instead specify exactly which directory to take the images from. EMPIARreader can load the dataset from an EmpiarSource, using the EMPIAR entry number and the directory of the images. In this case, we also restrict the selection to the MRC files in that directory whose names match the regular expression .*EER\.mrc.
ds = EmpiarSource(
    test_entry,
    directory="data/MotionCorr/job003/Tiff/EER/Images-Disc1/GridSquare_11149061/Data",
    filename=".*EER\\.mrc",
    regexp=True,
)
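To see what the filename pattern selects, the regular expression can be tried on a few names with Python's re module. This is a minimal sketch: the file names are invented for illustration, and it assumes the pattern has to match the whole name (re.fullmatch), which may differ from how EmpiarSource applies it internally.

```python
import re

# Same pattern as above; "\\." in a normal string equals r"\." in a raw string.
pattern = re.compile(r".*EER\.mrc")

# Hypothetical file names, not taken from the EMPIAR entry.
candidates = [
    "FoilHole_1_EER.mrc",   # ends in EER.mrc -> selected
    "FoilHole_1_EER.mrcs",  # extra trailing character -> rejected by fullmatch
    "FoilHole_1.mrc",       # no "EER" before the extension -> rejected
    "FoilHole_1_EERxmrc",   # the escaped dot must be a literal "." -> rejected
]

# Assumption: the whole name must match the pattern.
matches = [name for name in candidates if pattern.fullmatch(name)]
print(matches)
```

Escaping the dot matters: an unescaped "." would match any character, so a name such as FoilHole_1_EERxmrc would also be selected.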
The dataset is loaded lazily (using Dask), so each image is only read when read_partition is called. To choose an image, pick its partition index; in this case, we read partition 10.
part = ds.read_partition(10)
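The idea behind lazy, per-partition loading can be illustrated without Dask or network access. The class below is a hypothetical stand-in written only for this sketch; it is not EMPIARreader's implementation, and the "images" are tiny nested lists rather than real micrographs.

```python
class LazyPartitions:
    """Toy dataset: each partition is produced only when it is requested."""

    def __init__(self, loaders):
        self._loaders = loaders  # one zero-argument callable per partition
        self.loaded = []         # indices of the partitions actually read

    @property
    def npartitions(self):
        return len(self._loaders)

    def read_partition(self, index):
        # The loader runs here, not when the dataset object is created.
        self.loaded.append(index)
        return self._loaders[index]()


def make_loader(index):
    # Pretend to read image `index`; returns a 2x2 grid filled with the index.
    return lambda: [[index, index], [index, index]]


toy_ds = LazyPartitions([make_loader(i) for i in range(20)])
toy_part = toy_ds.read_partition(10)

print(toy_ds.npartitions, toy_ds.loaded)
```

Only the requested partition is materialised, which is why picking a single image from a large directory stays cheap.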
This example can be visualised in the Jupyter Notebook provided in the repository.
Using the command line interface
You can use the EMPIARreader CLI to search the EMPIAR archive one directory at a time until you find what you are looking for, and then download those files to disk. First, choose an EMPIAR entry; in this example, EMPIAR entry 10934 is used. Here we use a glob wildcard (--select "*") to list every subdirectory and file in a readable format:
empiarreader search --entry 10934 --select "*" --verbose
which returns:
Matching path #0: https://ftp.ebi.ac.uk/empiar/world_availability/10934//10934.xml
Matching path #1: https://ftp.ebi.ac.uk/empiar/world_availability/10934//data/
Subdirectories are: https://ftp.ebi.ac.uk/empiar/world_availability/10934
Subdirectories are: https://ftp.ebi.ac.uk/empiar/world_availability/10934//data
We’ve found the XML file containing the metadata for the entry and a subdirectory called data. To look inside it, add the --dir argument, and repeat recursively until you find the directory you are interested in:
empiarreader search --entry 10934 --select "*" --dir "data" --verbose
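The paths printed by the search suggest that the value of --dir is appended to the entry's base URL. The sketch below shows that mapping under this assumption. The helper directory_url is hypothetical and not part of EMPIARreader, and it collapses the doubled slash that appears in the output above.

```python
# Base URL as it appears in the search output above.
BASE_URL = "https://ftp.ebi.ac.uk/empiar/world_availability"


def directory_url(entry, directory=""):
    """Build the archive URL for a directory of an entry (illustrative helper)."""
    # Drop empty segments so leading, trailing or doubled slashes are ignored.
    segments = [part for part in directory.split("/") if part]
    return "/".join([BASE_URL, str(entry)] + segments) + "/"


print(directory_url(10934))
print(directory_url(10934, "data"))
```

Descending one level at a time then amounts to extending the directory string with the next subdirectory name found by the previous search.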
Once you have found one or more files that you want to download from a directory in the EMPIAR archive, you can create a list of URLs using the --save_search argument:
empiarreader search --entry 10934 --dir \
"data/CL44-1_20201106_111915/Images-Disc1/GridSquare_6089277/Data" \
--select "*gain.tiff.bz2" --save_search saved_search.txt
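To check what a glob such as "*gain.tiff.bz2" would select before running the search, the pattern can be tried locally with Python's fnmatch module. This is only a sketch: the file names are invented, and the CLI's own matching may differ in details such as case sensitivity.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing, not taken from the EMPIAR entry.
names = [
    "FoilHole_1_gain.tiff.bz2",
    "FoilHole_1_fractions.tiff",
    "20201106_gain.tiff.bz2",
    "notes.txt",
]

# "*" matches any run of characters, so only the ending of the name is constrained.
selected = [name for name in names if fnmatchcase(name, "*gain.tiff.bz2")]
print(selected)
```

Unlike a regular expression, the dots in a glob pattern are literal and only "*", "?" and bracket expressions are special.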
Using the workflow described above, a user can quickly search and identify datasets that fulfill their criteria. These can then be downloaded using the download utility of the CLI. A user simply needs to specify the file list and a directory to download the files into:
empiarreader download --download saved_search.txt --save_dir new_dir --verbose
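Before starting a large download, it can be useful to preview where each file would end up. The sketch below assumes that the saved search file contains one URL per line, which is an assumption about its format, and that each file is stored under its own name in the target directory. The function planned_downloads and the example.org URLs are hypothetical and serve only as an illustration; nothing is actually downloaded.

```python
import tempfile
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def planned_downloads(list_file, save_dir):
    """Pair every URL in list_file with the local path it would be saved to."""
    plan = []
    for line in Path(list_file).read_text().splitlines():
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        # The file name is the last component of the URL path.
        filename = PurePosixPath(urlparse(url).path).name
        plan.append((url, Path(save_dir) / filename))
    return plan


# Write a small stand-in for saved_search.txt (assumed format: one URL per line).
with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "saved_search.txt"
    list_file.write_text(
        "https://example.org/archive/10934/data/a_gain.tiff.bz2\n"
        "\n"
        "https://example.org/archive/10934/data/b_gain.tiff.bz2\n"
    )
    plan = planned_downloads(list_file, "new_dir")

for url, target in plan:
    print(url, "->", target)
```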