I will share with you a story about hashes: what they're good at, what they're bad at, and, most importantly, how to use them in a not-so-typical way.
I was faced with the challenge of searching a database of questions (about 2 million records) and finding the duplicates among them. It may look like a pretty simple problem, but doing this efficiently was not trivial. I will explain the algorithms used, discuss their benefits, and show you how I tweaked them to fit our needs. My main topic will be MinHash and LSH, with a brief refresher on general hashing algorithms.