Research Medical Image Dataset Conventions
Naming conventions
1. Case identifiers: Each case identifier in an image dataset consists of two fields: a dataset prefix followed by a case number. For example, for NLST we may choose NL as the prefix, and case 23 would become “NL00023”. Each case code is unique system-wide.
2. Image identifiers: In general, a short code will be used to refer to each scan (CT series). For example, “s1” may be used to identify the first scan for a case. The full image reference (or image-code) requires both the case id and the scan id, for example “NL00023-s1”.
3. By convention, scan ids may increase numerically in time order.
4. CSV image sets: In general, a set of images and associated information will be organized as a csv format file in which one column has the heading “image” and the contents of that column specify the image-code (one per row). For most applications the if= option specifies the input csv file, and the -csv option must be specified to indicate this file format. Two other formats are supported for most commands: (a) a list of image codes (one per row) with no title row, which is the default for the if= argument without any option specification; and (b) for single-image running, the -ic option, which indicates that the argument to if= is the image-code itself.
5. Pulmonary nodules: Multiple entities within an image such as pulmonary nodules or CAC deposits should have their own numerical extension (and column in relevant csv files).
6. Image indexing: For locations within a 3D image, programs should use the following indexing scheme: x increases from left to right (patient right to left), y increases from top to bottom (patient anterior to posterior), and z increases from top to bottom (patient cranial to caudal); i.e., the direction in which most chest CT scans are recorded. Incidentally, by most conventions, the array dimensions are ordered (z,y,x) within programs.
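The naming and indexing conventions above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used by the actual commands. The function names, the five-digit zero padding (inferred from the “NL00023” example), the mode names standing in for the -csv and -ic options, and the extra “case” column in the sample csv are all assumptions.

```python
import csv
import io


def make_case_id(prefix, number, width=5):
    """Dataset prefix plus zero-padded case number, e.g. ("NL", 23) -> "NL00023"."""
    # width=5 is an assumption inferred from the "NL00023" example.
    return f"{prefix}{number:0{width}d}"


def make_image_code(case_id, scan_id):
    """Full image reference: case id and scan id joined by a hyphen."""
    return f"{case_id}-{scan_id}"


def split_image_code(image_code):
    """Split an image code such as "NL00023-s1" into (case id, scan id)."""
    case_id, sep, scan_id = image_code.partition("-")
    if not (sep and case_id and scan_id):
        raise ValueError(f"not an image code: {image_code!r}")
    return case_id, scan_id


def read_image_set(source, mode="list"):
    """Return the image codes from one of the three input forms.

    mode="csv":  source is a text stream whose title row has an "image" column
    mode="list": source is a text stream, one image code per row, no title row
    mode="ic":   source is the image code itself
    The mode names are hypothetical stand-ins for the command-line options.
    """
    if mode == "ic":
        return [source]
    if mode == "csv":
        reader = csv.DictReader(source)
        if "image" not in (reader.fieldnames or []):
            raise ValueError('a csv image set needs an "image" column')
        return [row["image"] for row in reader]
    if mode == "list":
        return [line.strip() for line in source if line.strip()]
    raise ValueError(f"unknown mode: {mode!r}")


def array_index(x, y, z):
    """Reorder an (x, y, z) location into the (z, y, x) order used for arrays."""
    return (z, y, x)


# Example: a csv image set held in memory (the "case" column is illustrative).
csv_text = io.StringIO("case,image\nNL00023,NL00023-s1\nNL00023,NL00023-s2\n")
for code in read_image_set(csv_text, mode="csv"):
    print(code, split_image_code(code))
```

Keeping the three input forms behind one function means the rest of a program only ever sees a list of image codes, whichever form the caller supplied.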
Image data storage
CT image series represent a three-dimensional image structure. Most research image datasets will have the images in DICOM format, which means that each CT image series is represented by a directory containing a set of 2D image slice files. All that needs to be correctly labeled is the name of the DICOM directory. The names of the files are generally irrelevant since they are only used by standard DICOM importing software, which uses internal metadata to order the images. Thus, the DICOM study needs a unique name for its directory, while single-file formats such as nifti and v4 require a unique file name (and perhaps a standard file extension). All cases are addressed by using the standard image name for the input image and an additional single extension for any derived images.
For example, for case “NL00023” image “s1” the input DICOM directory would be “NL00023-s1/”, while in nifti it would be “NL00023-s1.nii.gz” and in v4 it would be “NL00023-s1.vx”. Derived image files could be distinguished by a different directory, but they should also have their own extension to allow maximum grouping without the need for any file name changes. Thus, an attention image file could have the name “NL00023-s1.att.nii.gz”.
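The naming pattern in these examples can be sketched as two small helpers. This is a minimal illustration, not an existing API. The function names input_name and derived_name are hypothetical, and the check that an instance extension contains no dot is an assumption made so that it stays a single extension.

```python
def input_name(case_id, scan_id, file_ext=None):
    """Name of an input image.

    With no file extension the result is a DICOM directory name (trailing "/");
    otherwise it is a single-file name such as "NL00023-s1.vx".
    """
    code = f"{case_id}-{scan_id}"
    return f"{code}/" if file_ext is None else f"{code}.{file_ext}"


def derived_name(case_id, scan_id, instance_ext, file_ext=None):
    """Name of a derived image: the image code plus one instance extension,
    optionally followed by a file extension."""
    # Assumption: the instance extension is one non-empty token with no dots.
    if not instance_ext or "." in instance_ext:
        raise ValueError(f"bad instance extension: {instance_ext!r}")
    name = f"{case_id}-{scan_id}.{instance_ext}"
    return name if file_ext is None else f"{name}.{file_ext}"


print(input_name("NL00023", "s1"))                  # DICOM directory
print(input_name("NL00023", "s1", "vx"))            # v4 input file
print(derived_name("NL00023", "s1", "att", "vx"))   # attention image in v4
```

Because a derived name only appends to the image code, input and derived files for one scan sort together and can share a directory without renaming.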
The syntax for a derived image is:
<dataset-case>-<image>.<instance_extension>[.<file_extension> ]
Discussion
The majority of research image datasets involving CT images have the following attributes:
1. Numerical case identifiers
2. Frequently they have a prefix such as NLST, which is very useful
3. Many datasets use DICOM, and some use the DICOM series ID as the image identifier. While the DICOM series id is worldwide unique (when correctly generated), it has the disadvantage of being very long, which messes up many data files that use it. Thus, a short code that is not globally unique is much preferred. In extreme situations, when the input is in DICOM format, it is possible to use these identifiers in the image field, but that has a number of very bad side effects. A much better solution is to create a conversion csv table with the column headings “case”, “image”, “dicom_series_id” and have applications use that file to access images when necessary. In general, the goal will be to curate each dataset so that DICOM series ids are never needed.
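A conversion table of this kind might be read as in the sketch below. It is only an illustration under stated assumptions: the function name load_series_table is hypothetical, the “image” column is assumed to hold the short scan id (such as “s1”) rather than the full image-code, and the series ids in the sample are made-up placeholders, not real DICOM UIDs.

```python
import csv
import io

REQUIRED_COLUMNS = {"case", "image", "dicom_series_id"}


def load_series_table(stream):
    """Read the conversion csv into a dict keyed by (case, image)."""
    reader = csv.DictReader(stream)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"conversion table is missing columns: {sorted(missing)}")
    return {(row["case"], row["image"]): row["dicom_series_id"] for row in reader}


def series_id_for(table, image_code):
    """Look up the DICOM series id for an image code such as "NL00023-s1"."""
    case_id, _, scan_id = image_code.partition("-")
    return table[(case_id, scan_id)]


# Sample table held in memory; the series ids are fictitious placeholders.
table_text = io.StringIO(
    "case,image,dicom_series_id\n"
    "NL00023,s1,1.2.3.4.5.100\n"
    "NL00023,s2,1.2.3.4.5.200\n"
)
table = load_series_table(table_text)
print(series_id_for(table, "NL00023-s2"))
```

With the long identifiers confined to this one file, every other data file can carry only the short image codes.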
For most datasets the image data will be in DICOM format. Some datasets are in nifti, which is a significant problem as the DICOM tags are not available (unless they are in a separate data list, which is not usually the case). I have a large number
Many applications use nifti for intermediate 3D image files. Nifti is an old data format designed especially for neuro applications (also convenient for laptop-centered image viewers); it has achieved some acceptance but not for the right reasons. It has a number of issues, including limited metadata (not DICOM). v4 is available on all Python systems and, in most cases, is a better format to use. Most of my image datasets are archived in v4 format. In time, our system will be made format agnostic. For CT images there is no point in maintaining the DICOM arbitrary image location system. This is critical for neuro (as are the affine transforms, which you will be sure to have problems with), but for CT it is never needed. When needed, for example for some viewers, we can just define one at the origin of our image indexing system.