RosettaCodeData/Task/Hash-join/00-TASK.txt

120 lines
3.4 KiB
Plaintext

An [[wp:Join_(SQL)#Inner_join|inner join]] is an operation that combines two data tables into one table, based on matching column values. The simplest way of implementing this operation is the [[wp:Nested loop join|nested loop join]] algorithm, but a more scalable alternative is the [[wp:hash join|hash join]] algorithm.
;Task
Implement the "hash join" algorithm, and demonstrate that it passes the test-case listed below.
You should represent the tables as data structures that feel natural in your programming language.
;Guidance
The "hash join" algorithm consists of two steps:
# '''Hash phase:''' Create a [[wp:Multimap|multimap]] from one of the two tables, mapping from each join column value to all the rows that contain it.<br>
#* The multimap must support hash-based lookup which scales better than a simple linear search, because that's the whole point of this algorithm.
#* Ideally we should create the multimap for the ''smaller'' table, thus minimizing its creation time and memory size.
# '''Join phase:''' Scan the other table, and find matching rows by looking in the multimap created before.
<br>
In pseudo-code, the algorithm could be expressed as follows:
'''let''' ''A'' = the first input table (or ideally, the larger one)
'''let''' ''B'' = the second input table (or ideally, the smaller one)
'''let''' ''j<sub>A</sub>'' = the join column ID of table ''A''
'''let''' ''j<sub>B</sub>'' = the join column ID of table ''B''
'''let''' ''M<sub>B</sub>'' = a multimap for mapping from single values to multiple rows of table ''B'' (starts out empty)
'''let''' ''C'' = the output table (starts out empty)
'''for each''' row ''b'' '''in''' table ''B''''':'''
'''place''' ''b'' '''in''' multimap ''M<sub>B</sub>'' under key ''b''(''j<sub>B</sub>'')
'''for each''' row ''a'' '''in''' table ''A''''':'''
'''for each''' row ''b'' '''in''' multimap ''M<sub>B</sub>'' under key ''a''(''j<sub>A</sub>'')''':'''
'''let''' ''c'' = the concatenation of row ''a'' and row ''b''
'''place''' row ''c'' in table ''C''
;Test case
{| class="wikitable"
|-
! Input
! Output
|-
|
{| style="border:none; border-collapse:collapse;"
|-
| style="border:none" | ''A'' =
| style="border:none" |
{| class="wikitable"
|-
! Age !! Name
|-
| 27 || Jonah
|-
| 18 || Alan
|-
| 28 || Glory
|-
| 18 || Popeye
|-
| 28 || Alan
|}
| style="border:none; padding-left:1.5em;" rowspan="2" |
| style="border:none" | ''B'' =
| style="border:none" |
{| class="wikitable"
|-
! Character !! Nemesis
|-
| Jonah || Whales
|-
| Jonah || Spiders
|-
| Alan || Ghosts
|-
| Alan || Zombies
|-
| Glory || Buffy
|}
|-
| style="border:none" | ''j<sub>A</sub>'' =
| style="border:none" | <code>Name</code> (i.e. column 1)
| style="border:none" | ''j<sub>B</sub>'' =
| style="border:none" | <code>Character</code> (i.e. column 0)
|}
|
{| class="wikitable" style="margin-left:1em"
|-
! A.Age !! A.Name !! B.Character !! B.Nemesis
|-
| 27 || Jonah || Jonah || Whales
|-
| 27 || Jonah || Jonah || Spiders
|-
| 18 || Alan || Alan || Ghosts
|-
| 18 || Alan || Alan || Zombies
|-
| 28 || Glory || Glory || Buffy
|-
| 28 || Alan || Alan || Ghosts
|-
| 28 || Alan || Alan || Zombies
|}
|}
The order of the rows in the output table is not significant.<br>
If you're using numerically indexed arrays to represent table rows (rather than referring to columns by name), you could represent the output rows in the form <code style="white-space:nowrap">[[27, "Jonah"], ["Jonah", "Whales"]]</code>.
<br><hr>