Indexing of Medical Sequence Databases

Review Literature on Potential Candidates for Indexing of Medical Sequence Databases

Posted by Nivin Anton Alexis Lawrence on January 9, 2017

Recently by the same author:


Introduction to Functional Programming [Part 2]

Think Functional


Nivin Anton Alexis Lawrence

Human

Problem Definition

Develop a database that can store large set of medical sequences and also allows to query for similar or exact sequence. Criteria that needs to be addressed:

  1. faster query response.
  2. can take a hit on accuracy.
  3. eventually supports textual and temporal data types.

Data Sets

Datasets is going to be a set of events that occured for an patient over a period of time. So, each record can be treated as a event that consists of textual and temporal (time interval) informations. Events belonging to a patient is merged together to form a sequence.

Scheme of a event:

 Entity ID 
 ICD 9 (Full 3-5)
 ICD 9 3D
 ICD9 4D
 Begin-Visit
 End-Visit
 Visit Gap
 Visit Duration

Details on ICD Code:

  • Digits 1-3 will now refer to the category
  • Digit 1 is always alphabetic
  • Digits 2-3 are always numeric
  • Digits 4-6 will cover clinical details such as severity, etiology, and anatomic site (among others), and are either alphabetic or numeric
  • Digit 7 will serve as an extension when necessary, and will be either alphabetic or numeric
  • ICD-9’s 13,000 codes and limited space for additions.
  • ICD-10 will have roughly 68,000 available codes.

Potential Candidates for indexing the data

Suffix Tree

  1. Ukkonen Algorithm

TOP-Q Motive: is a new low overhead buffer management method which can be used with Ukkonen’s construction algorithm. The goal of these researchers is to invent a buffer management technique that does not require modifying an existing in-memory construction algorithm.

Pros: 
    1. construction of suffix tree happens in O(N) with help of suffix links. 
    2. number of nodes or space required is minimum compared to other techniques. 
Cons:
    1. suitable only if ram size is sufficient enough to fit the entire tree. [1:5]
    2. performance is bad when string size increases due to locality of memory. [1:5]
    3. when alphabet size increases from 4 symbols to 20 symbols (cache misses increases) [2:9]

Overall:
Even TOP-Q, performs bad when compared with TDD or Hunt algorithms.  [2:11]Note that the case where Ukkonen’s and McCreight’smethods will have an advantage over TDD is for short input
strings over a small alphabet size with high skew (repeat
sequences). [6:10]

Constructions: 
    1. The algorithm doesn’t populate all the suffixes. 
    2. It used suffix link if it can extend two suffix to form a new suffix. 
    3. http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
  1. TDD
  2. ST Merge
  3. TRELLS
  4. B2ST
  5. WaveFront
  6. ERA
  7. VLDB