Who's the king; who's the ruler. How to answer the question before it was posted.

Hubert Bryłkowski

Friday 3 November 2017 from 17:00 to 18:00

Talk in English - UK at php Central Europe Conference 2017
Track Name: Guru
Short URL: https://joind.in/talk/b4818

I will share with you a story about hashes: what they're good at, what they're bad at, and, most importantly, how to use them in a not-so-typical way.

I was faced with the challenge of searching a database of questions (about 2 million records) and finding the duplicates among them. It may look like a pretty simple problem, but doing this efficiently was not trivial. I will explain the algorithms used, discuss their benefits, and show you how I tweaked them to fit our needs. My main topic will be MinHash and LSH (locality-sensitive hashing), with a brief refresher on general hashing algorithms.
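For readers unfamiliar with the two techniques named in the abstract, here is a minimal sketch of MinHash signatures combined with LSH banding, written in Python. It is not the speaker's implementation: the function names, the 3-word shingles, the 64-value signature, the 16 bands, and the use of BLAKE2b as the hash family are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64  # signature length (assumed value)
BANDS = 16       # number of LSH bands (assumed); 64 / 16 = 4 rows per band


def shingles(text, k=3):
    """Return the set of k-word shingles of a lower-cased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """One member of a seeded hash family, returning a 64-bit integer."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return tuple(
        min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_HASHES)
    )


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures):
    """LSH banding: documents sharing any whole band become candidates."""
    rows = NUM_HASHES // BANDS
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs
```

The point of the banding step is that only documents landing in the same bucket are compared, so the full pairwise comparison over all records is avoided. Candidate pairs would then typically be confirmed with `estimated_jaccard` or an exact comparison against a chosen threshold.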

Comments

Przemyslaw Peron at 18:59 on 4 Nov 2017 (via joind.in Android app)

I was expecting a somewhat clearer explanation of the ideas.

Matthieu Napoli at 23:33 on 4 Nov 2017 (via Web2 LIVE)

It would help to have concrete metrics: what was the volume of data, what were the processing times, and how much time was saved with each algorithm and each improvement?

Dyszczo at 21:52 on 5 Nov 2017 (via Web2 LIVE)

It was OK, but when a lecture is a case study of a problem and its solution, I need more research, e.g. which other approaches were considered. A simple "I haven't come up with anything else" is not enough for me.