I will share a story about hashes: what they're good at, what they're bad at, and, most importantly, how to use them in a not-so-typical way.

I was faced with the challenge of searching a database of questions (about 2 million records) for duplicates. It may look like a pretty simple problem, but doing it efficiently was not trivial. I will explain the algorithms I used, discuss their benefits, and show how I tweaked them to fit our needs. My main topic will be MinHash and LSH, with a short reminder about general hashing algorithms.



Przemyslaw Peron at 18:59 on 4 Nov 2017

I was expecting a somewhat clearer explanation of the ideas.

It would help to have concrete metrics: what was the volume of data, what were the processing times, and how much time was saved with each algorithm and each improvement?

Dyszczo at 21:52 on 5 Nov 2017

It was OK, but when a lecture is a case study of a problem and its solution, I need more research, e.g. which other approaches were considered. A simple "I haven't come up with anything else" is not enough for me.