Dear Redump team,
I recently got into emulation and found myself with a huge number of ROMs and many duplicates in different languages, for example:
- 2 Games in 1 - Disney-Pixar Les Indestructibles + Disney-Pixar Le Monde de Nemo (France) (Disc 1) (Les Indestructibles)
- 2 Games in 1 - Disney-Pixar Die Unglaublichen + Disney-Pixar Findet Nemo (Germany) (Disc 1)
Since I develop automation tools professionally, I wanted to automate the process of sorting and verifying them.
So far I have managed to use your and No-Intro's DAT files for hash verification at large scale (15k+ files) in a 'short' time.
That allows validating releases. My tool also has a filter based on region, language and release type.
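For illustration, the verification and filtering steps could be sketched roughly as follows in Python. This is a minimal sketch, not the actual tool: it assumes a Logiqx-style XML DAT with `<game>` and `<rom sha1="...">` elements, and the sample DAT, the function names and the payload bytes are all made up for the example.

```python
import hashlib
import re
import xml.etree.ElementTree as ET

# Made-up stand-in for a DAT file; a real DAT holds thousands of <game> entries.
payload = b"example disc image bytes"
SAMPLE_DAT = f"""<datafile>
  <game name="Example Game (France)">
    <rom name="Example Game (France).iso" size="{len(payload)}"
         sha1="{hashlib.sha1(payload).hexdigest()}"/>
  </game>
</datafile>"""


def build_sha1_index(dat_xml: str) -> dict:
    """Map each lowercase SHA-1 found in the DAT to its game name."""
    index = {}
    for game in ET.fromstring(dat_xml).iter("game"):
        for rom in game.iter("rom"):
            sha1 = rom.get("sha1")
            if sha1:
                index[sha1.lower()] = game.get("name")
    return index


def sha1_of(data: bytes) -> str:
    """Hash in-memory bytes; large files would be read and hashed in chunks."""
    return hashlib.sha1(data).hexdigest()


def name_tags(name: str) -> list:
    """Return the parenthesised tags of a release name, in order."""
    return re.findall(r"\(([^()]*)\)", name)


index = build_sha1_index(SAMPLE_DAT)
print(index.get(sha1_of(payload)))   # name of the matching DAT entry, or None
print(name_tags("2 Games in 1 - Disney-Pixar Die Unglaublichen + "
                "Disney-Pixar Findet Nemo (Germany) (Disc 1)"))
```

A region or language filter can then be a simple membership test on the tags returned by `name_tags`.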
However, I ran into issues with duplicate detection: if I natively compared all serials against each other, I would need external tools (not preferred), because the language I chose to develop in does not natively support, for example, RVZ files.
I therefore set that idea aside for now. Then I saw that you list serials on your site, which, as far as I can tell, would allow releases to be identified, but I could not find an offline list of them (hence this message). I did not want to scrape your site's HTML pages with automated requests, as that did not seem very nice.
The other solution I found, which seems less reliable, is to use a language index I have built from Wikidata for the most common systems, with language translations. However, that is not as reliable as having serials.
My matching is currently solid: ROMs map to hashes, and those hashes map reliably to your DAT files. If the DAT entries could in turn be matched to serials, that would allow duplicate detection despite different names, and one could then decide what to keep based on preference, identified by the naming scheme (French, English or whatever). Mapping ROM names to Wikidata by name, on the other hand, seems less dependable.
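To make the idea concrete, the grouping step could look roughly like the Python sketch below. Everything in it is hypothetical: the records, the placeholder serials and the preference order are invented, and it simply assumes that a serial (or some other shared key) is available per verified release and that localized duplicates share it, which is exactly the open question here.

```python
from collections import defaultdict

# Hypothetical records: (release name, serial, region). The serials are
# placeholders, not real identifiers.
RELEASES = [
    ("Example Game (France)", "XXXX-00001", "France"),
    ("Example Game (Germany)", "XXXX-00001", "Germany"),
    ("Other Game (Europe)", "XXXX-00002", "Europe"),
]

# Hypothetical user preference, most wanted region first.
PREFERENCE = ["Germany", "France", "Europe"]


def group_by_serial(releases):
    """Collect all releases that share the same serial."""
    groups = defaultdict(list)
    for name, serial, region in releases:
        groups[serial].append((name, region))
    return groups


def pick_preferred(candidates, preference):
    """Return the name of the candidate whose region ranks best."""
    def rank(item):
        _, region = item
        # Regions missing from the preference list sort last.
        return preference.index(region) if region in preference else len(preference)
    return min(candidates, key=rank)[0]


kept = {
    serial: pick_preferred(candidates, PREFERENCE)
    for serial, candidates in group_by_serial(RELEASES).items()
}
print(kept)
```

Any group with more than one entry is a duplicate candidate; everything except the preferred entry could then be moved aside rather than deleted.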
With that many ROMs, nobody would want to sort them manually.
Since I am not very knowledgeable about ROMs and there seem to be many exceptions, maybe you could give me some insight into how to reliably identify duplicates like the ones mentioned above. I would appreciate any help or direction you could point me to.
Best regards,
CA