Talk:GLAM/Newsletter/September 2014/Contents/Special story
Hm, almost 6 months to hash the whole of Commons? And what about the rest of the images on the Internet? :) I guess it's useful for internal comparisons or for comparing one file at a time against the whole Commons dataset (i.e. to produce attribution), but not much for chasing copyvios etc., right? --Nemo 21:50, 11 October 2014 (UTC)
- We're cautious not to overload the servers, but we'll start adding more processing nodes soon enough. Our aim is to do about a million images per day, which will allow us to tackle other open resources besides Commons - thanks very much for noticing and pointing out that we can't settle on just Commons :) And yes, this is definitely not for chasing copyright violations; we're not particularly interested in that aspect. We feel that if we can help ensure that re-used images are attributed accurately, the need to chase down copyright violations might also diminish. --Jonasoberg (talk) 22:52, 11 October 2014 (UTC)
An immediate use case that comes to mind is a version of the "Duplicates of this file" section of the image page that actually gives what the user expects, rather than only "exact" duplicates. Such a db could also be useful for tagging images that are similar to previously deleted images as needing review. Bawolff (talk) 22:41, 11 October 2014 (UTC)
- That wouldn't be terribly difficult, it seems. Remember, though, that we're talking about more-or-less exact copies, where nothing has changed except the image size and perhaps the file format. For more advanced matching, capturing cases where images have been cropped or otherwise changed, it may be easy enough to just direct the user to a Google reverse image search with the parameter "site:commons.wikimedia.org". As long as we stay in the more-or-less-exact-copies space, our db has a public API, so once we have a bit more of Commons hashed it wouldn't be a problem to send queries through it. (For the moment, we'd have to make do with our API/service. We'll eventually dedicate the hashes back to Commons -- we're carefully following the Structured Data discussions -- but searching them requires a few more tricks, since we define matches as images whose hashes have a Hamming distance of at most 10 bits. It's a near-match search rather than an exact search, so just having the hashes in Wikidata or on Commons doesn't mean that you can easily search them.) --Jonasoberg (talk) 22:52, 11 October 2014 (UTC)
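As an aside for readers curious what such a Hamming-distance comparison looks like in practice, here is a minimal sketch. It uses the Python imagehash and Pillow libraries purely as stand-ins (they are not the hashing pipeline or API the project uses); the only detail taken from the discussion above is the 10-bit match threshold.
<syntaxhighlight lang="python">
# Minimal sketch of near-match image comparison with perceptual hashes.
# Stand-in only: uses the third-party "imagehash" and "Pillow" packages,
# not the project's own hashing service described above.
from PIL import Image
import imagehash

MAX_DISTANCE = 10  # a match = hashes within a Hamming distance of 10 bits


def hashes_match(path_a, path_b, max_distance=MAX_DISTANCE):
    """Return True if the two images' perceptual hashes are near-identical."""
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    # Subtracting two ImageHash objects yields their Hamming distance in bits.
    return (hash_a - hash_b) <= max_distance


# Hypothetical example files:
# hashes_match("original.jpg", "thumbnail.jpg")
</syntaxhighlight>
A rescaled or re-encoded copy of an image will typically stay within that threshold, while a crop or heavier edit usually will not, which is consistent with the "more-or-less exact copies" framing above.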