120 lines
3.4 KiB
Plaintext
120 lines
3.4 KiB
Plaintext
An [[wp:Join_(SQL)#Inner_join|inner join]] is an operation that combines two data tables into one table, based on matching column values. The simplest way of implementing this operation is the [[wp:Nested loop join|nested loop join]] algorithm, but a more scalable alternative is the [[wp:hash join|hash join]] algorithm.
|
|
|
|
;Task
|
|
|
|
Implement the "hash join" algorithm, and demonstrate that it passes the test-case listed below.
|
|
|
|
You should represent the tables as data structures that feel natural in your programming language.
|
|
|
|
;Guidance
|
|
|
|
The "hash join" algorithm consists of two steps:
|
|
|
|
# '''Hash phase:''' Create a [[wp:Multimap|multimap]] from one of the two tables, mapping from each join column value to all the rows that contain it.<br>
|
|
#* The multimap must support hash-based lookup which scales better than a simple linear search, because that's the whole point of this algorithm.
|
|
#* Ideally we should create the multimap for the ''smaller'' table, thus minimizing its creation time and memory size.
|
|
# '''Join phase:''' Scan the other table, and find matching rows by looking in the multimap created before.
|
|
|
|
<br>
|
|
In pseudo-code, the algorithm could be expressed as follows:
|
|
|
|
'''let''' ''A'' = the first input table (or ideally, the larger one)
|
|
'''let''' ''B'' = the second input table (or ideally, the smaller one)
|
|
'''let''' ''j<sub>A</sub>'' = the join column ID of table ''A''
|
|
'''let''' ''j<sub>B</sub>'' = the join column ID of table ''B''
|
|
'''let''' ''M<sub>B</sub>'' = a multimap for mapping from single values to multiple rows of table ''B'' (starts out empty)
|
|
'''let''' ''C'' = the output table (starts out empty)
|
|
|
|
'''for each''' row ''b'' '''in''' table ''B''''':'''
|
|
'''place''' ''b'' '''in''' multimap ''M<sub>B</sub>'' under key ''b''(''j<sub>B</sub>'')
|
|
|
|
'''for each''' row ''a'' '''in''' table ''A''''':'''
|
|
'''for each''' row ''b'' '''in''' multimap ''M<sub>B</sub>'' under key ''a''(''j<sub>A</sub>'')''':'''
|
|
'''let''' ''c'' = the concatenation of row ''a'' and row ''b''
|
|
'''place''' row ''c'' in table ''C''
|
|
|
|
;Test case
|
|
|
|
{| class="wikitable"
|
|
|-
|
|
! Input
|
|
! Output
|
|
|-
|
|
|
|
|
|
|
{| style="border:none; border-collapse:collapse;"
|
|
|-
|
|
| style="border:none" | ''A'' =
|
|
| style="border:none" |
|
|
|
|
{| class="wikitable"
|
|
|-
|
|
! Age !! Name
|
|
|-
|
|
| 27 || Jonah
|
|
|-
|
|
| 18 || Alan
|
|
|-
|
|
| 28 || Glory
|
|
|-
|
|
| 18 || Popeye
|
|
|-
|
|
| 28 || Alan
|
|
|}
|
|
|
|
| style="border:none; padding-left:1.5em;" rowspan="2" |
|
|
| style="border:none" | ''B'' =
|
|
| style="border:none" |
|
|
|
|
{| class="wikitable"
|
|
|-
|
|
! Character !! Nemesis
|
|
|-
|
|
| Jonah || Whales
|
|
|-
|
|
| Jonah || Spiders
|
|
|-
|
|
| Alan || Ghosts
|
|
|-
|
|
| Alan || Zombies
|
|
|-
|
|
| Glory || Buffy
|
|
|}
|
|
|
|
|-
|
|
| style="border:none" | ''j<sub>A</sub>'' =
|
|
| style="border:none" | <code>Name</code> (i.e. column 1)
|
|
|
|
| style="border:none" | ''j<sub>B</sub>'' =
|
|
| style="border:none" | <code>Character</code> (i.e. column 0)
|
|
|}
|
|
|
|
|
|
|
|
|
{| class="wikitable" style="margin-left:1em"
|
|
|-
|
|
! A.Age !! A.Name !! B.Character !! B.Nemesis
|
|
|-
|
|
| 27 || Jonah || Jonah || Whales
|
|
|-
|
|
| 27 || Jonah || Jonah || Spiders
|
|
|-
|
|
| 18 || Alan || Alan || Ghosts
|
|
|-
|
|
| 18 || Alan || Alan || Zombies
|
|
|-
|
|
| 28 || Glory || Glory || Buffy
|
|
|-
|
|
| 28 || Alan || Alan || Ghosts
|
|
|-
|
|
| 28 || Alan || Alan || Zombies
|
|
|}
|
|
|
|
|}
|
|
|
|
The order of the rows in the output table is not significant.<br>
|
|
If you're using numerically indexed arrays to represent table rows (rather than referring to columns by name), you could represent the output rows in the form <code style="white-space:nowrap">[[27, "Jonah"], ["Jonah", "Whales"]]</code>.
|
|
|
|
<br><hr>
|
|
|