Project Pages - An Integrated Scientific Blogging Template

Nivin Anton Alexis Lawrence

Problem Definition
Data Sets
- Scheme of a event:
  - Details on ICD Code:
Potential Candidates for indexing the data
- Suffix Tree

Problem Definition

Develop a database that can store large set of medical sequences and also allows to query for similar or exact sequence. Criteria that needs to be addressed:

faster query response.
can take a hit on accuracy.
eventually supports textual and temporal data types.

Data Sets

Datasets is going to be a set of events that occured for an patient over a period of time. So, each record can be treated as a event that consists of textual and temporal (time interval) informations. Events belonging to a patient is merged together to form a sequence.

Scheme of a event:

 Entity ID 
 ICD 9 (Full 3-5)
 ICD 9 3D
 ICD9 4D
 Begin-Visit
 End-Visit
 Visit Gap
 Visit Duration

Details on ICD Code:

Digits 1-3 will now refer to the category
Digit 1 is always alphabetic
Digits 2-3 are always numeric
Digits 4-6 will cover clinical details such as severity, etiology, and anatomic site (among others), and are either alphabetic or numeric
Digit 7 will serve as an extension when necessary, and will be either alphabetic or numeric
ICD-9’s 13,000 codes and limited space for additions.
ICD-10 will have roughly 68,000 available codes.

Potential Candidates for indexing the data

Suffix Tree

Ukkonen Algorithm

TOP-Q Motive: is a new low overhead buffer management method which can be used with Ukkonen’s construction algorithm. The goal of these researchers is to invent a buffer management technique that does not require modifying an existing in-memory construction algorithm.

Pros: 
    1. construction of suffix tree happens in O(N) with help of suffix links. 
    2. number of nodes or space required is minimum compared to other techniques. 
Cons:
    1. suitable only if ram size is sufficient enough to fit the entire tree. [1:5]
    2. performance is bad when string size increases due to locality of memory. [1:5]
    3. when alphabet size increases from 4 symbols to 20 symbols (cache misses increases) [2:9]

Overall:
Even TOP-Q, performs bad when compared with TDD or Hunt algorithms.  [2:11]Note that the case where Ukkonen’s and McCreight’smethods will have an advantage over TDD is for short input
strings over a small alphabet size with high skew (repeat
sequences). [6:10]

Constructions: 
    1. The algorithm doesn’t populate all the suffixes. 
    2. It used suffix link if it can extend two suffix to form a new suffix. 
    3. http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english

TDD
ST Merge
TRELLS
B2ST
WaveFront
ERA
VLDB

Indexing of Medical Sequence Databases

Review Literature on Potential Candidates for Indexing of Medical Sequence Databases

Recently by the same author:

Introduction to Functional Programming [Part 2]

Think Functional

Nivin Anton Alexis Lawrence

Human