Dedup
DB design:
CREATE TABLE dedup (
catchment_id varchar,
health_id1 varchar,
health_id2 varchar,
rule_name varchar,
rule_value varchar,
created_at timeuuid,
PRIMARY KEY (catchment_id, health_id1, health_id2, rule_name, rule_value));
Example:
Feed – p1 (create), p2 (create), p3 (create), p1 (updated occupation), p1 (updated nid2), p2 (voided, merged-with p3), h4 (create), p5 (create). pn represents HID.
All have same NID, nid1 but different catchments, c1, c2, c3, c4, c5. Only, h4, h5 have same phone #. c1 – 101112, c2 – 202122, c3 – 303133, so on.
Rules: same nid and same phone # are dedups. Dedups cannot be resolved for just one rule, but for all.
Scenarios | Action | Event in Feed | Scheduled job runs | Dupes in C1 | Dupes in C2 | Dupes in C3 | Dupes in C4 | Dupes in C5 |
1 | Patient created (P1) | created:p1 | No dupes found. | |||||
2 | Patient created (P2) | created:p2 | Finds P1 with same NID | P2-P1 (NID) | P2-P1(NID) | |||
3 | Patient created (P3) | created:p3 | Finds that P3 shares NID with P1 and P2 | P2-P1(NID) P3-P1(NID) + |
P2-P1(NID) P3-P2(NID) + |
P3-P1(NID) + P3-P2(NID) + |
||
4 | Patient updated (P1) | updated:p1 (occupation) | All entries related to p1 are deleted and recreated. | P2-P1(NID) P3-P1(NID) |
P2-P1(NID) P3-P2(NID) |
P3-P1(NID) P3-P2(NID) |
||
5 | Patient updated (P1) | updated:p1 (NID) | All entries related to p1 are deleted and recreated. | Empty | P3-P2(NID) | P3-P2(NID) | ||
6 | Resolution. P2-P3 conflict for all rules. p2 – voided and merged with p3 |
voided:P2 (merged with p3) | Since P2 does not exist anymore, we must remove all pairs with P2 from all dedup tables. | Empty | Empty | Empty | ||
7 | updated:P3 | Run deduplication rule anyway. Found no duplicates | Empty | Empty | Empty | |||
8 | Patient created (P4) | created: P4 | P4 matches P3 in NID. (P1’s NID has changed in earlier step.) | P4-P3(NID) + | P4-P3(NID) + | |||
9 | Patient created (P5) | created:P5 | P5 is detected for NID duplication, matching with P3, P4. And Also with P4 for Ph. num | P4-P3(NID) P5-P3 (NID) + |
P4-P3(NID) P5-P4 (NID+Ph.No) + |
P5-P3(NID) + P5-P4 (NID+Ph.No) + |
||
10 | P3-P4 resolved for all rules, both are unique. |
ignore-duplicate: P3, P4 | As soon as approver says these must remain separate, we must go back and take them out of all catchments | P5-P3 (NID) | P5-P4 (NID+Ph.No) | P5-P3(NID) P5-P4 (NID+Ph.No) |
Scenario 2 (DB Calls):
#1 select * from nid_mapping where nid = nid1
#2 select catchment from patient where hid in (h1, h2)
#3. Dedup DB:
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A10B11 | h1 | h2 | nid | nid1 | t1 |
A10B11C12 | h1 | h2 | nid | nid1 | t2 |
A20B21 | h2 | h1 | nid | nid1 | t3 |
A20B21C22 | h2 | h1 | nid | nid1 | t4 |
Searches all patients by phone # ph2. No matching patient. No action taken.
Scenario 3:
#1 select * from nid_mapping where nid = nid1
#2 select catchment from patient where hid in (h1, h2, h3)
#3. Dedup DB:
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A10B11 | h1 | h2 | nid | nid1 | t1 |
A10B11C12 | h1 | h2 | nid | nid1 | t2 |
A10B11 | h1 | h3 | nid | nid1 | t9 |
A10B11C12 | h1 | h3 | nid | nid1 | t10 |
A20B21 | h2 | h1 | nid | nid1 | t3 |
A20B21C22 | h2 | h1 | nid | nid1 | t4 |
A20B21 | h2 | h3 | nid | nid1 | t11 |
A20B21C22 | h2 | h3 | nid | nid1 | t12 |
A30B31 | h3 | h1 | nid | nid1 | t5 |
A30B31C32 | h3 | h1 | nid | nid1 | t6 |
A30B31 | h3 | h2 | nid | nid1 | t7 |
A30B31C32 | h3 | h2 | nid | nid1 | t8 |
Searches all patients by phone # ph3. No matching patient. No action taken.
Scenario 4:
#1 select hid2 from dedup where catchment in (1011, 101112) and hid1 = h1.
#2 select catchment from patient where hid = .. (for each hid returned in #1)
#3
(Batch begin)
delete from dedup where catchment_id in (1011, 101112) and hid1=h1
delete from dedup where catchment_id in (…) and hid1=.. and hid2=h1 (repeated as many times required)
repeat scenario 2.
(Batch ends)
#4 Dedup db looks same as in scenario 3.
Note: If catchment is updated, it will be similar to scenario 3.
Scenario 5:
Similar to scenario 4.
Dedup db:
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A20B21 | h2 | h3 | nid | nid1 | t11 |
A20B21C22 | h2 | h3 | nid | nid1 | t12 |
A30B31 | h3 | h2 | nid | nid1 | t7 |
A30B31C32 | h3 | h2 | nid | nid1 | t8 |
Scenario 6:
Similar to scenario 4. Dedup table becomes empty.
Scenario 8:
Similar to scenario 2.
Dedup table:
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A30B31 | h3 | h4 | nid | nid1 | t.. |
A30B31C32 | h3 | h4 | nid | nid1 | t.. |
A40B41 | h4 | h3 | nid | nid1 | t.. |
A40B41C42 | h4 | h3 | nid | nid1 | t.. |
Scenario 9:
Similar to scenario 2.
Dedup table:
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A30B31 | h3 | h4 | nid | nid1 | t.. |
A30B31C32 | h3 | h4 | nid | nid1 | t.. |
A30B31 | h3 | h5 | nid | nid1 | t.. |
A30B31C32 | h3 | h5 | nid | nid1 | t.. |
A40B41 | h4 | h3 | nid | nid1 | t.. |
A40B41C42 | h4 | h3 | nid | nid1 | t.. |
A40B41 | h4 | h5 | phone # | ph1 | t.. |
A40B41C42 | h4 | h5 | phone # | ph1 | t.. |
A40B41 | h4 | h5 | nid | nid1 | t.. |
A40B41C42 | h4 | h5 | nid | nid1 | t.. |
A50B51 | h5 | h3 | nid | nid1 | t.. |
A50B51C52 | h5 | h3 | nid | nid1 | t.. |
A50B51 | h5 | h4 | phone # | ph1 | t.. |
A50B51C52 | h5 | h4 | phone # | ph1 | t.. |
Scenario 10:
p3 and p4 are no more dupes.
#1. select catchment from patient where hid in (h3, h4)
#2. delete from dedup where catchment in (..) and hid1 in (h3, h4) and hid2 in (h3, h4) [in won’t work. so run separate queries in batch]
Dedup table:
Catchment | HID1 | HID2 | Rule Name | Rule Value | Created At |
A30B31 | h3 | h5 | nid | nid1 | t.. |
A30B31C32 | h3 | h5 | nid | nid1 | t.. |
A40B41 | h4 | h5 | phone # | ph1 | t.. |
A40B41C42 | h4 | h5 | phone # | ph1 | t.. |
A40B41 | h4 | h5 | nid | nid1 | t.. |
A40B41C42 | h4 | h5 | nid | nid1 | t.. |
A50B51 | h5 | h3 | nid | nid1 | t.. |
A50B51C52 | h5 | h3 | nid | nid1 | t.. |
A50B51 | h5 | h4 | phone # | ph1 | t.. |
A50B51C52 | h5 | h4 | phone # | nid1 | t.. |
Find duplicates by catchment:
select * from dedup where catchment = 102030.
Then sort the result by timeuuid (time component only) in-memory and return.