DGHS Project

Looking for help?
< All Topics
Print

Dedup

DB design:

CREATE TABLE dedup (
catchment_id varchar,
health_id1 varchar,
health_id2 varchar,
rule_name varchar,
rule_value varchar,
created_at timeuuid,
PRIMARY KEY (catchment_id, health_id1, health_id2, rule_name, rule_value));

Example:

Feed – p1 (create), p2 (create), p3 (create), p1 (updated occupation), p1 (updated nid2), p2 (voided, merged-with p3), h4 (create), p5 (create). pn represents HID.
All have same NID, nid1 but different catchments, c1, c2, c3, c4, c5. Only, h4, h5 have same phone #. c1 – 101112, c2 – 202122, c3 – 303133, so on.
Rules: same nid and same phone # are dedups. Dedups cannot be resolved for just one rule, but for all.

 

Scenarios Action Event in Feed Scheduled job runs Dupes in C1 Dupes in C2 Dupes in C3 Dupes in C4 Dupes in C5
1 Patient created (P1) created:p1 No dupes found.
2 Patient created (P2) created:p2 Finds P1 with same NID P2-P1 (NID) P2-P1(NID)
3 Patient created (P3) created:p3 Finds that P3 shares NID with P1 and P2 P2-P1(NID)
P3-P1(NID) +
P2-P1(NID)
P3-P2(NID) +
P3-P1(NID) +
P3-P2(NID) +
4 Patient updated (P1) updated:p1 (occupation) All entries related to p1 are deleted and recreated. P2-P1(NID)
P3-P1(NID)
P2-P1(NID)
P3-P2(NID)
P3-P1(NID)
P3-P2(NID)
5 Patient updated (P1) updated:p1 (NID) All entries related to p1 are deleted and recreated. Empty P3-P2(NID) P3-P2(NID)
6 Resolution. P2-P3 conflict for all rules.
p2 – voided and merged with p3
voided:P2 (merged with p3) Since P2 does not exist anymore, we must remove all pairs with P2 from all dedup tables. Empty Empty Empty
7 updated:P3 Run deduplication rule anyway. Found no duplicates Empty Empty Empty
8 Patient created (P4) created: P4 P4 matches P3 in NID. (P1’s NID has changed in earlier step.) P4-P3(NID) + P4-P3(NID) +
9 Patient created (P5) created:P5 P5 is detected for NID duplication, matching with P3, P4. And Also with P4 for Ph. num P4-P3(NID)
P5-P3 (NID) +
P4-P3(NID)
P5-P4 (NID+Ph.No) +
P5-P3(NID) +
P5-P4 (NID+Ph.No) +
10 P3-P4 resolved for all rules,
both are unique.
ignore-duplicate: P3, P4 As soon as approver says these must remain separate, we must go back and take them out of all catchments P5-P3 (NID) P5-P4 (NID+Ph.No) P5-P3(NID)
P5-P4 (NID+Ph.No)

Scenario 2 (DB Calls):

#1 select * from nid_mapping where nid = nid1

#2 select catchment from patient where hid in (h1, h2)

#3. Dedup DB:

Catchment HID1 HID2 Rule Name Rule Value Created At
A10B11 h1 h2 nid nid1 t1
A10B11C12 h1 h2 nid nid1 t2
A20B21 h2 h1 nid nid1 t3
A20B21C22 h2 h1 nid nid1 t4

Searches all patients by phone # ph2. No matching patient. No action taken.

Scenario 3:

#1 select * from nid_mapping where nid = nid1

#2 select catchment from patient where hid in (h1, h2, h3)

#3. Dedup DB:

Catchment HID1 HID2 Rule Name Rule Value Created At
A10B11 h1 h2 nid nid1 t1
A10B11C12 h1 h2 nid nid1 t2
A10B11 h1 h3 nid nid1 t9
A10B11C12 h1 h3 nid nid1 t10
A20B21 h2 h1 nid nid1 t3
A20B21C22 h2 h1 nid nid1 t4
A20B21 h2 h3 nid nid1 t11
A20B21C22 h2 h3 nid nid1 t12
A30B31 h3 h1 nid nid1 t5
A30B31C32 h3 h1 nid nid1 t6
A30B31 h3 h2 nid nid1 t7
A30B31C32 h3 h2 nid nid1 t8

Searches all patients by phone # ph3. No matching patient. No action taken.

Scenario 4:

#1 select hid2 from dedup where catchment in (1011, 101112) and hid1 = h1.

#2 select catchment from patient where hid = .. (for each hid returned in #1)

#3

(Batch begin)

delete from dedup where catchment_id in (1011, 101112) and hid1=h1

delete from dedup where catchment_id in (…) and hid1=.. and hid2=h1 (repeated as many times required)

repeat scenario 2.

(Batch ends)

#4 Dedup db looks same as in scenario 3.

Note: If catchment is updated, it will be similar to scenario 3.

Scenario 5:

Similar to scenario 4.

Dedup db:

Catchment HID1 HID2 Rule Name Rule Value Created At
A20B21 h2 h3 nid nid1 t11
A20B21C22 h2 h3 nid nid1 t12
A30B31 h3 h2 nid nid1 t7
A30B31C32 h3 h2 nid nid1 t8

 

Scenario 6:

Similar to scenario 4. Dedup table becomes empty.

Scenario 8:

Similar to scenario 2.

Dedup table:

Catchment HID1 HID2 Rule Name Rule Value Created At
A30B31 h3 h4 nid nid1 t..
A30B31C32 h3 h4 nid nid1 t..
A40B41 h4 h3 nid nid1 t..
A40B41C42 h4 h3 nid nid1 t..

Scenario 9:

Similar to scenario 2.

Dedup table:

Catchment HID1 HID2 Rule Name Rule Value Created At
A30B31 h3 h4 nid nid1 t..
A30B31C32 h3 h4 nid nid1 t..
A30B31 h3 h5 nid nid1 t..
A30B31C32 h3 h5 nid nid1 t..
A40B41 h4 h3 nid nid1 t..
A40B41C42 h4 h3 nid nid1 t..
A40B41 h4 h5 phone # ph1 t..
A40B41C42 h4 h5 phone # ph1 t..
A40B41 h4 h5 nid nid1 t..
A40B41C42 h4 h5 nid nid1 t..
A50B51 h5 h3 nid nid1 t..
A50B51C52 h5 h3 nid nid1 t..
A50B51 h5 h4 phone # ph1 t..
A50B51C52 h5 h4 phone # ph1 t..

 

Scenario 10:

p3 and p4 are no more dupes.

#1. select catchment from patient where hid in (h3, h4)

#2. delete from dedup where catchment in (..) and hid1 in (h3, h4) and hid2 in (h3, h4) [in won’t work. so run separate queries in batch]

Dedup table:

Catchment HID1 HID2 Rule Name Rule Value Created At
A30B31 h3 h5 nid nid1 t..
A30B31C32 h3 h5 nid nid1 t..
A40B41 h4 h5 phone # ph1 t..
A40B41C42 h4 h5 phone # ph1 t..
A40B41 h4 h5 nid nid1 t..
A40B41C42 h4 h5 nid nid1 t..
A50B51 h5 h3 nid nid1 t..
A50B51C52 h5 h3 nid nid1 t..
A50B51 h5 h4 phone # ph1 t..
A50B51C52 h5 h4 phone # nid1 t..

 

Find duplicates by catchment:

select * from dedup where catchment = 102030.

Then sort the result by timeuuid (time component only) in-memory and return.

 

Table of Contents