Dataset Curation

This research will involve processing images from imported datasets. Each dataset contains a large number of CT images and a number of metadata files. The following scheme for image file naming has been designed to keep image references reasonably short but also system-wide unique. Every image in the system, and each of its derived images, is identified by a short, user-informative code.

Image datasets on this research system will be curated to a standard format as follows:

1.        In most cases the images themselves will be saved in the same file structure in which they were provided

2.        Cases and names within a dataset will be labeled as follows:

a.        Each dataset will be provided with a unique code, usually two letters. For example, the NLST dataset may receive the code NL

b.        Each case within a dataset will be assigned a unique numeric ID

c.        Each image (CT image series) will be assigned a short code that is unique within its case, e.g. “s1” for the first series in a case

d.        Nodules or other image features will be assigned a code that is unique within a specific image

3.        Referencing data entities in a dataset

a.        The following rules must be followed to maintain file system and study integrity.

b.        When referencing a case in a dataset, the dataset code must always be included to maintain system integrity. For example, NL00023 refers to case 23 in the NLST dataset.

c.        When referencing an image, the case reference must also be included as a prefix. For example, “NL00023-s2” references the second image in NLST case 23. The main exception to this is in case-image csv lists, where the first two columns are labeled “case” and “image” for more convenient processing. In such lists the two column values should be combined with a “-” to form a valid image reference.

d.        Derived images computed from an input image, or from another derived image, should always include the image reference as a prefix. A derived image reference may also include a type identifier and a file type extension. For example, if the user assigns the code “att” to attention images, a nifti attention image derived from NL00023-s2 would have the reference “NL00023-s2.att.nifti.gz”, which indicates an att type image derived from NL00023-s2.

4.        Referencing dataset images from the dataset’s original file structure.

a.        In the image curation process, the images (typically DICOM directories) often have names that are difficult to use (e.g. DICOM series IDs). For small datasets we may simply rename them to the correct system image names. Large datasets that contain a hierarchical image file structure will instead contain a file called image-path.csv in the dataset directory. This is a csv file with three columns: “case”, “image” and “path”. It contains a row for every image in the dataset; the “path” column provides the file path, within the dataset image hierarchy, to the image file that corresponds to the case and image values in the first two columns. Applications reference this file to access the input file or directory for a given standard image reference.
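
The referencing rules above can be sketched in Python as follows. This is only an illustration, not part of the system: the function names are hypothetical, and the five-digit zero padding of the case ID is an assumption inferred from the example “NL00023”.

```python
import re
from typing import NamedTuple, Optional

# Assumption: case IDs are zero-padded to five digits, as in "NL00023".
CASE_DIGITS = 5


class ImageRef(NamedTuple):
    dataset: str            # dataset code, e.g. "NL"
    case_id: int            # numeric case ID, e.g. 23
    image: str              # case-specific image code, e.g. "s2"
    derived: Optional[str]  # derived-image suffix, e.g. "att.nifti.gz", or None


def case_ref(dataset: str, case_id: int) -> str:
    """Build a case reference; the dataset code is always included."""
    return f"{dataset}{case_id:0{CASE_DIGITS}d}"


def image_ref(dataset: str, case_id: int, series_number: int) -> str:
    """Build an image reference with the case reference as a prefix."""
    return f"{case_ref(dataset, case_id)}-s{series_number}"


def derived_ref(image: str, type_code: str, extension: str) -> str:
    """Build a derived-image reference with the image reference as a prefix."""
    return f"{image}.{type_code}.{extension}"


# dataset letters, case digits, "-", image code, optional ".suffix"
_REF_PATTERN = re.compile(r"^([A-Za-z]+)(\d+)-([A-Za-z]+\d+)(?:\.(.+))?$")


def parse_ref(ref: str) -> ImageRef:
    """Split an image or derived-image reference into its components."""
    match = _REF_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"not a valid image reference: {ref!r}")
    dataset, digits, image, derived = match.groups()
    return ImageRef(dataset, int(digits), image, derived)
```

For example, image_ref("NL", 23, 2) builds “NL00023-s2”, and derived_ref("NL00023-s2", "att", "nifti.gz") builds “NL00023-s2.att.nifti.gz”.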

Curated dataset structure (not yet fully implemented)

All datasets for the main servers are located on the attached RAID storage device with the reference /mnt/RAID/data/

A text file “README” will provide a one-line description of each of the datasets and its curated code. Once curated, each dataset will be located in a directory named with its system code; for example, NLST would be in a directory “NL”. The dataset directory typically contains a directory called images and, if needed, an image-path.csv file, along with other metadata files and directories.
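
As a rough sketch of how an application might locate an image within this layout, the Python function below looks up a case and image pair in image-path.csv. The function name is hypothetical. It also makes two assumptions that the text does not settle: that “path” values are relative to the images directory, and that a dataset without an image-path.csv stores each image under its system image name directly in images.

```python
import csv
from pathlib import Path

DATA_ROOT = Path("/mnt/RAID/data")


def resolve_image_path(dataset_code: str, case: str, image: str,
                       data_root: Path = DATA_ROOT) -> Path:
    """Return the location of an image inside a curated dataset directory."""
    dataset_dir = data_root / dataset_code
    images_dir = dataset_dir / "images"
    table = dataset_dir / "image-path.csv"

    if not table.is_file():
        # Assumption: a small dataset whose images were renamed to
        # their system image names during curation.
        return images_dir / f"{case}-{image}"

    with table.open(newline="") as handle:
        for row in csv.DictReader(handle):
            if row["case"] == case and row["image"] == image:
                # Assumption: "path" is relative to the images directory.
                return images_dir / row["path"]

    raise KeyError(f"{case}-{image} is not listed in {table}")
```

The dataset code is passed explicitly rather than parsed from the case reference, because dataset codes are only usually, not always, two letters long.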

Image Lists

It is anticipated that much of the research effort will involve image subsets from one or more datasets. To manage these subsets, a csv file will be used whose first two columns are labeled “case” and “image”, providing convenient image references. The main system apps will accept this file to run different algorithms and will return results in a csv file with the same first two columns as the input image specification and additional columns containing the algorithm results.
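
The image list convention can be illustrated with a minimal Python sketch. The function run_on_image_list and its algorithm callback are hypothetical stand-ins for a system app. The sketch assumes the algorithm takes an image reference and returns a dictionary of named results.

```python
import csv
from typing import Callable, Dict, List


def run_on_image_list(list_csv: str, result_csv: str,
                      algorithm: Callable[[str], Dict[str, object]]) -> None:
    """Run an algorithm on every image in a case-image list and save the results."""
    with open(list_csv, newline="") as handle:
        rows = list(csv.DictReader(handle))

    results: List[Dict[str, object]] = []
    result_columns: List[str] = []
    for row in rows:
        # Combine the two columns with "-" to form the image reference.
        reference = f"{row['case']}-{row['image']}"
        output = algorithm(reference)
        for name in output:
            if name not in result_columns:
                result_columns.append(name)
        results.append({"case": row["case"], "image": row["image"], **output})

    # The output keeps "case" and "image" as its first two columns,
    # followed by one column per algorithm result.
    with open(result_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle,
                                fieldnames=["case", "image", *result_columns],
                                restval="")
        writer.writeheader()
        writer.writerows(results)
```

Because the results file starts with the same “case” and “image” columns, it can itself serve as an image list for a later processing step.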