Problem Definition
Develop a database that can store large set of medical sequences and also allows to query for similar or exact sequence. Criteria that needs to be addressed:
- faster query response.
- can take a hit on accuracy.
- eventually supports textual and temporal data types.
Data Sets
Datasets is going to be a set of events that occured for an patient over a period of time. So, each record can be treated as a event that consists of textual and temporal (time interval) informations. Events belonging to a patient is merged together to form a sequence.
Scheme of a event:
Entity ID
ICD 9 (Full 3-5)
ICD 9 3D
ICD9 4D
Begin-Visit
End-Visit
Visit Gap
Visit Duration
Details on ICD Code:
- Digits 1-3 will now refer to the category
- Digit 1 is always alphabetic
- Digits 2-3 are always numeric
- Digits 4-6 will cover clinical details such as severity, etiology, and anatomic site (among others), and are either alphabetic or numeric
- Digit 7 will serve as an extension when necessary, and will be either alphabetic or numeric
- ICD-9’s 13,000 codes and limited space for additions.
- ICD-10 will have roughly 68,000 available codes.
Potential Candidates for indexing the data
Suffix Tree
- Ukkonen Algorithm
TOP-Q Motive: is a new low overhead buffer management method which can be used with Ukkonen’s construction algorithm. The goal of these researchers is to invent a buffer management technique that does not require modifying an existing in-memory construction algorithm.
Pros:
1. construction of suffix tree happens in O(N) with help of suffix links.
2. number of nodes or space required is minimum compared to other techniques.
Cons:
1. suitable only if ram size is sufficient enough to fit the entire tree. [1:5]
2. performance is bad when string size increases due to locality of memory. [1:5]
3. when alphabet size increases from 4 symbols to 20 symbols (cache misses increases) [2:9]
Overall:
Even TOP-Q, performs bad when compared with TDD or Hunt algorithms. [2:11]Note that the case where Ukkonen’s and McCreight’smethods will have an advantage over TDD is for short input
strings over a small alphabet size with high skew (repeat
sequences). [6:10]
Constructions:
1. The algorithm doesn’t populate all the suffixes.
2. It used suffix link if it can extend two suffix to form a new suffix.
3. http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
- TDD
- ST Merge
- TRELLS
- B2ST
- WaveFront
- ERA
- VLDB