Dataset Curation

This research involves processing images from imported datasets. Each dataset contains a large number of CT images and several metadata files. The following image file naming scheme has been designed to keep image references reasonably short while remaining unique system-wide. Every image in the system, and every image derived from it, is identified by a short, user-informative code.

Image datasets on this research system will be curated to a standard format as follows:

1.        In most cases the images themselves will be saved in the same file structure in which they were provided.

2.        Cases and names within a dataset will be labeled as follows:

a.        Each dataset will be given a unique code, usually two letters. For example, the NLST dataset may receive the code NL.

b.        Each case within a dataset will be assigned a unique numeric ID

c.        Each image (CT image series) will be assigned a short code that is unique within its case, e.g. “s1” for the first series in a case.

d.        Nodules or other image features will be assigned a code that is unique within a specific image.

3.        Referencing data entities in a dataset

a.        The following rules must be followed to maintain file system and study integrity.

b.        To maintain system integrity, a case reference must always include the dataset code. For example, NL00023 refers to case 23 in the NLST dataset.

c.        When referencing an image, the case reference must be included as a prefix. For example, “NL00023-s2” references the second image in NLST case 23. The main exception is case-image csv lists, in which the first two columns are labeled “case” and “image” for more convenient processing (a format mainly used in the previous legacy system). In such lists the two column values should be combined with a “-” to form a valid image reference.

d.        Derived images computed from an input image, or from another derived image, should always include the image reference as a prefix, followed by a type identifier and a file type extension. For example, if the user assigns the code “att” to an attention image stored in NIfTI format, the derived image reference would be “NL00023-s2.att.nifti.gz”, which indicates an att-type image derived from NL00023-s2.

4.        Referencing dataset images from the dataset original file structure.

a.        In the image curation process, the images (typically DICOM directories) often have names that are difficult to use (e.g. DICOM series IDs). For small datasets we may simply rename them to the correct system image names. Large datasets that have a hierarchical image file structure will instead contain a file called image-path.csv in the dataset directory. This is a csv file with three columns: “case”, “image” and “path”. It contains a row for every image in the dataset; the “path” column provides the file path, within the dataset image hierarchy, to the image identified by the “case” and “image” values. Applications should use this file to locate the input file or directory for a standard image reference.
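The reference rules in item 3 can be sketched in Python as follows. This is only an illustration, not part of the system: the function names are hypothetical, the five-digit zero padding is inferred from the example “NL00023”, and the pattern assumes uppercase dataset codes and image codes that contain neither “-” nor “.”.

```python
import re
from typing import NamedTuple, Optional

# Assumption: case numbers are zero-padded to five digits (inferred from "NL00023").
CASE_DIGITS = 5

# Assumption: dataset codes are uppercase letters; image codes contain no "-" or ".".
_REF = re.compile(
    r"^(?P<ds>[A-Z]+)(?P<case>\d+)"
    r"(?:-(?P<image>[^-.]+)(?:\.(?P<derived>.+))?)?$"
)


class Ref(NamedTuple):
    dataset: str                    # e.g. "NL"
    case: int                       # e.g. 23
    image: Optional[str] = None     # e.g. "s2"
    derived: Optional[str] = None   # e.g. "att.nifti.gz"


def case_ref(dataset: str, case: int) -> str:
    """Case reference: dataset code followed by the padded case number."""
    return f"{dataset}{case:0{CASE_DIGITS}d}"


def image_ref(dataset: str, case: int, image: str) -> str:
    """Image reference: case reference, "-", image code."""
    return f"{case_ref(dataset, case)}-{image}"


def derived_ref(image_reference: str, type_code: str, extension: str) -> str:
    """Derived image reference: image reference, type code, file extension."""
    return f"{image_reference}.{type_code}.{extension}"


def parse_ref(text: str) -> Ref:
    """Split a case, image, or derived image reference into its parts."""
    match = _REF.match(text)
    if match is None:
        raise ValueError(f"not a valid reference: {text!r}")
    return Ref(match["ds"], int(match["case"]), match["image"], match["derived"])
```

For example, derived_ref(image_ref("NL", 23, "s2"), "att", "nifti.gz") builds the derived image reference used in item 3d, and parse_ref recovers the dataset code, case number, and image code from it.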

Curated dataset structure

All datasets for the main servers are located on the attached RAID storage device in the directory /mnt/RAID/data/<ds>, where <ds> is the name of the dataset.

A text file “README” will provide a one-line description of each dataset and its curated code. Once curated, each dataset will be located in a directory named with its system code; for example, NLST would be in a directory “NL”. The dataset directory typically contains a directory called images and, if needed, an image-path.csv file, along with other metadata files and directories.
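As a rough sketch of how an application might use image-path.csv to locate an image from a standard image reference, consider the following. The function names are hypothetical, and the code assumes that the “path” values are relative to the dataset directory and that the “case” column already includes the dataset code (e.g. NL00023).

```python
import csv
from pathlib import Path


def load_image_paths(dataset_dir):
    """Read <dataset_dir>/image-path.csv into {(case, image): path}."""
    dataset_dir = Path(dataset_dir)
    table = {}
    with open(dataset_dir / "image-path.csv", newline="") as f:
        for row in csv.DictReader(f):
            # Assumption: "path" is relative to the dataset directory.
            table[(row["case"], row["image"])] = dataset_dir / row["path"]
    return table


def resolve_image(image_reference, table):
    """Map a reference such as "NL00023-s2" to its file or directory path."""
    case, image = image_reference.split("-", 1)
    return table[(case, image)]  # raises KeyError for an unknown image
```

Loading the table once and reusing it avoids re-reading the csv file for every image in a large dataset.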

Image Lists

It is anticipated that much of the research effort will involve image subsets drawn from one or more datasets. To manage these subsets, a csv file will be used in which the first two columns are labeled “case” and “image”, providing convenient image references. The main system apps will accept such a file as input, run different algorithms, and return results in a csv file with the same first two columns as the input image list plus additional columns containing the algorithm results.
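The image list convention can be illustrated with a minimal Python sketch. It is not one of the system apps: the function name, the algorithm callback, and the result column names are all hypothetical. It reads an image list, applies a callback to each image reference, and writes a result csv whose first two columns match the input.

```python
import csv


def run_on_image_list(list_path, result_path, algorithm, result_columns):
    """Apply `algorithm` to every image in a case-image csv list.

    `algorithm` (hypothetical) takes an image reference such as "NL00023-s2"
    and returns a dict whose keys are drawn from `result_columns`.
    """
    with open(list_path, newline="") as src, \
            open(result_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["case", "image", *result_columns])
        writer.writeheader()
        for row in reader:
            # Combine the two columns with "-" to form a valid image reference.
            reference = f"{row['case']}-{row['image']}"
            values = algorithm(reference)
            writer.writerow({"case": row["case"], "image": row["image"], **values})
```

Because the output keeps the “case” and “image” columns, a result file can itself serve as the image list for a later processing step.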

Results and data lists

Result csv files from applying algorithms to curated image datasets, together with the image lists used to create them, will be archived in “/mnt/raid/results/<ds>/”, where <ds> is the dataset code. These files require only a minute amount of storage compared to the static (read-only) dataset files.

Curation commands (of interest to dataset creators)

Several commands (Python programs) are available to assist with dataset curation; these commands may also be useful for creating study-specific image data subsets.

Program            Description
vdseries <name>    Reviews the DICOM series for a dataset; <name> is the path to the dataset. Checks that each image directory has only one series ID and that all images have different series IDs.
mklinks<ds>*       Creates image links for vdimage (locally customized for the dataset).
vdcmtags <list>    Creates a csv with the main DICOM tag values [given an image <list>].
vdsels<ds>*        Selects the cases with analyzable scans from a raw image dataset (requires adequate length).
vdselc<ds>*        Selects a single image to represent each case for each scan date (considers slice thickness and recon kernel).

*Currently these functions are customized for each dataset.
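The two checks described for vdseries can be sketched in Python as follows. This is not the vdseries implementation, and the function name is hypothetical. Reading the DICOM headers is left out: the sketch assumes the series IDs have already been collected into a mapping from each image directory to the series IDs of its files.

```python
from collections import defaultdict


def check_series(series_ids_by_dir):
    """Return a list of problem descriptions (empty if the dataset is clean).

    `series_ids_by_dir` maps each image directory name to the list of
    series IDs found in its DICOM files (collected elsewhere).
    """
    problems = []
    owners = defaultdict(list)  # series ID -> directories that contain it
    for image_dir, series_ids in sorted(series_ids_by_dir.items()):
        distinct = sorted(set(series_ids))
        # Check 1: each image directory holds exactly one series ID.
        if len(distinct) != 1:
            problems.append(
                f"{image_dir}: expected 1 series ID, found {len(distinct)}"
            )
        for series_id in distinct:
            owners[series_id].append(image_dir)
    # Check 2: no series ID is shared between image directories.
    for series_id, dirs in sorted(owners.items()):
        if len(dirs) > 1:
            problems.append(
                f"series {series_id} appears in several directories: "
                + ", ".join(dirs)
            )
    return problems
```

Returning the problems as a list, rather than stopping at the first one, lets a curator see every issue in a dataset in a single pass.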

Example Curation (UN dataset)

  1. Partition the data into directories, each with a unique DICOM series ID. Associate a unique image code with each series ID and label the image directory with that image code (image codes should be time-sequential). Validate with vdseries.
  2. Create an initial image list (UNlist) with a simple “ls -1” of the dataset directory.
  3. Extract the useful DICOM tags of all images (vdcmtags UNlist UNdcmtags.csv).
  4. Extract all CT image codes from the dataset (vselsUN); this generates UN-Stags.csv and UN-Slist.
  5. Extract a single CT image for each case and time instance (vselcUN); this reads UN-Stags.csv, creates UN-Ctags.csv and UN-Clist, and outputs any issues to stdout.
  6. [Quality check the selected case image dataset and run analysis algorithms.]