We've made significant changes to our previously published Killed in Gaza list, which contains the names of those known to have been killed in Gaza since October 7th. This post describes our new methodology in more detail and explains what to expect from the changes.
Prior Method
Our prior list relied heavily on an existing library (arabic-names-to-en), which first tried to translate a name segment using a dictionary mapping and then fell back to a character-by-character lookup. Volunteers then reviewed the output visually, and we incorporated their manual changes. For a list of more than 14,000 names, this proved hard to manage.
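To illustrate the two-stage approach described above, here is a minimal sketch in Python. It is not the arabic-names-to-en library's code: the function names and the tiny NAME_DICT and CHAR_MAP mappings are hypothetical and exist only to show the dictionary-first, character-fallback idea.

```python
# Hypothetical, tiny mappings for illustration only; not the library's data.
NAME_DICT = {"محمد": "Mohammed", "أحمد": "Ahmed"}
CHAR_MAP = {"ع": "a", "ل": "l", "ي": "i", "ا": "a", "م": "m", "ح": "h", "د": "d"}


def translate_segment(segment: str) -> str:
    """Translate one name segment: dictionary first, then per character."""
    if segment in NAME_DICT:
        return NAME_DICT[segment]
    # Fallback: map each character independently.
    # Characters missing from CHAR_MAP pass through unchanged.
    return "".join(CHAR_MAP.get(ch, ch) for ch in segment).capitalize()


def translate_name(full_name: str) -> str:
    """Translate a full name by translating each whitespace-separated segment."""
    return " ".join(translate_segment(part) for part in full_name.split())
```

In this sketch, translate_name("محمد علي") finds the first segment in the dictionary and builds the second one character by character. Per-character output like this is hard to keep consistent across thousands of names, which is why it needed manual review.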
New Method
We've since built our own dictionary mapping with more name coverage, and the process now looks like this:
- we clean formatting issues from the Arabic names in the original list (using dict_ar_ar.csv)
- we look up / translate each name part into English (using dict_ar_en.csv)
- we run final transformations when converting to JSON (see JSON export script)
The final step includes a fallback to the old library for any remaining Arabic translations that are not yet in our curated dict_ar_ar.csv. Currently, fewer than 2% of the names are partially handled by this fallback mechanism, and we'll be working to reduce that number.
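The steps above can be sketched roughly as follows. This is a simplified illustration, not our actual scripts: the two-column CSV layouts, the sample rows, and the function names load_mapping and translate are assumptions, and the fallback is a placeholder callable standing in for the old library.

```python
import csv
import io

# Hypothetical two-column layouts and sample rows; the real files may differ.
AR_AR_CSV = "original,cleaned\nاحمد,أحمد\n"
AR_EN_CSV = "arabic,english\nأحمد,Ahmed\nمحمد,Mohammed\nعلي,Ali\n"


def load_mapping(csv_text):
    """Read a two-column CSV (with a header row) into a dict."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def translate(name, ar_ar, ar_en, fallback):
    """Return (english_name, used_fallback) for one Arabic name."""
    used_fallback = False
    parts = []
    for part in name.split():
        part = ar_ar.get(part, part)  # step 1: clean known formatting issues
        if part in ar_en:  # step 2: curated dictionary lookup
            parts.append(ar_en[part])
        else:  # fallback for parts not yet in the dictionary
            parts.append(fallback(part))
            used_fallback = True
    return " ".join(parts), used_fallback


ar_ar = load_mapping(AR_AR_CSV)
ar_en = load_mapping(AR_EN_CSV)
names = ["احمد علي", "محمد يوسف"]
# Placeholder fallback that returns the part unchanged (assumption).
results = [translate(n, ar_ar, ar_en, fallback=lambda p: p) for n in names]
fallback_share = sum(used for _, used in results) / len(results)
```

Tracking used_fallback per name is one way to measure the share of names that still depend on the fallback, which is the number we want to drive down.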
Notable Changes
We've avoided what we believe would have been breaking changes to the dataset per our versioning guide, but we did add 21 new records from the original official list released in November 2023. The IDs that were introduced from that November list include:
- 401771530
- 401844790
- 405424524
- 407194836
- 411518053
- 425923364
- 436788202
- 437391725
- 438240293
- 438445371
- 441199296
- 800328817
- 802335927
- 803827518
- 804662112
- 804669000
- 901494161
- 930025457
- 932076094
- 942125832
- 95270068
The list before this change can be found on GitHub.
Here are some additional details about the current list and the latest revision:
- there are 14,140 names
- English name changes between this and the last published list, measured using Levenshtein distance:
  - 24% of names had no change
  - 60% of names differed by 1 to 4 edits, inclusive
  - 15% of names differed by 5 to 9 edits, inclusive
  - 1.9% of names differed by 10 or more edits
- 92 records (0.65%) had age changes from the prior release (all 1 year less than before)
- 29 names have "unknown" for part or all of the name, and those are now represented in the English translation as "?"
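For readers who want to run a similar comparison, the edit-distance breakdown above could be computed along these lines. This is a sketch under assumptions: it compares names case-sensitively, assumes the old and new names are already paired up (for example by record ID), and the function names are hypothetical rather than taken from our scripts.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bucket(distance: int) -> str:
    """Map an edit distance to the ranges used in the list above."""
    if distance == 0:
        return "no change"
    if distance <= 4:
        return "1-4 edits"
    if distance <= 9:
        return "5-9 edits"
    return "10+ edits"


def summarize(old_names, new_names):
    """Return the share of name pairs falling into each bucket."""
    counts = Counter(bucket(levenshtein(o, n))
                     for o, n in zip(old_names, new_names))
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

For example, summarize(["Mohamed", "Ali"], ["Mohammed", "Ali"]) puts one pair in "1-4 edits" and the other in "no change".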
We're continually working to improve translations and the list in general. If you have ideas or want to contribute a change, please see our contributing guide.