2 posts tagged with "killed-in-gaza"

· 6 min read

On April 3rd, we received an updated list of names of those killed in Gaza up to March 29th. We've incorporated new records and existing record changes from that update.

The new list came as a PDF, a format that differed slightly from the initial lists, which were distributed as CSV. It also identified the source of each record as either the Ministry of Health ("سجالت وزارة الصحة") or a submission from the public ("تبيلغ ذوي الشهداء"). You can download the Ministry of Health PDF here. You can also download our earlier Killed in Gaza list in CSV format here to compare how individual records may have changed.

We've added a new source field to the records to indicate the reported source of the record as noted above.

Change Summary

The following tables summarize the demographic changes in our Killed in Gaza list following its merge with the above-noted Ministry list:

Demographics of Our List Before Merge

Group              Count    % of total
Senior Men           328    2.3
Senior Women         282    2.0
Male (no age)        565    4.0
Female (no age)      432    3.1
Total Persons     14,140

Demographics of Newly Added Records

Group              Count    % of total
Sr. Men              233    3.6
Sr. Women            124    1.9
Male (no age)         51    0.8
Female (no age)       31    0.5
Total Persons      6,445

Demographics of Our Updated List After Merge

Group              Count    % of total
Sr. Men              566    2.8
Sr. Women            410    2.0
Male (no age)        182    0.9
Female (no age)      162    0.8
Total Persons     20,390

Demographics of Removed Records

195 records in our prior list were not present in the latest list release (by identification number) so we removed them.

Group              Count    % of total
Sr. Men                4    2.1
Sr. Women              2    1.0
Male (no age)         43    22.1
Female (no age)       12    6.2
Total Persons        195
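
The comparison by identification number, and the totals in the tables above, can be cross-checked with a short sketch. It assumes each list is reduced to a collection of identification numbers; the function name and the sample IDs are hypothetical and not taken from our scripts.

```python
def split_by_id(prior_ids, latest_ids):
    """Partition identification numbers into removed, retained, and added sets."""
    prior, latest = set(prior_ids), set(latest_ids)
    removed = prior - latest    # in the prior list, absent from the latest release
    retained = prior & latest   # present in both: candidates for reconciliation
    added = latest - prior      # present only in the latest release
    return removed, retained, added


# Toy identifiers (hypothetical, not real records).
removed, retained, added = split_by_id(["1", "2", "3"], ["2", "3", "4", "5"])
print(sorted(removed), sorted(retained), sorted(added))

# Totals reported in the tables: prior - removed + newly added = updated.
assert 14_140 - 195 + 6_445 == 20_390
```

The last line only restates the arithmetic of the published totals; it says nothing about how individual records were matched.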

We believe the higher ratio of men in this revised list reflects the addition of community-reported sourcing. These records likely include more people who are lost or missing and whose remains were not received by health authorities, whereas most records in the initial list distributed in January were for people whose remains had been received.

Merge Methodology / Commentary

Incorporating the new list required a few steps:

  1. Parse the tabular data from the PDF
  2. Clean the parsed data for data format inconsistencies
  3. Render the data in a format comparable to our existing list
  4. Reconcile record conflicts & changes
  5. Merge and rewrite our existing source list

Commentary on our approach for some of the steps follows.

Cleaning the Data

At this stage we worked to determine common issues with the parsed data and found the following cases:

  • date of birth formats were not standardized (e.g., long year vs. short year)
    • we worked to normalize these, and if the format was hard to decipher we validated against the provided age
  • age was sometimes repeated in both the age and date of birth columns (no date of birth)
    • we removed any age values from the date of birth column
  • identification number field sometimes had non-number values or was clearly invalid
    • we dropped these records
  • date of birth field was sometimes filled with hashes (#)
    • we removed these and left the date of birth empty

This was an iterative process: we gathered stats, updated the cleaning logic, and reviewed the output in our standard format to decide what to refine on the next pass.
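
A minimal sketch of the cleaning rules listed above, under stated assumptions: the field names (`id`, `age`, `dob`), the day/month/year date layout, and the 2024 reference year are all hypothetical choices for illustration, and "clearly invalid" identification numbers are reduced here to "not purely digits".

```python
import re
from datetime import date

DIGITS = re.compile(r"^\d+$")
HASHES = re.compile(r"^#+$")
DATE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})$")  # assumed D/M/Y layout
REFERENCE_YEAR = 2024  # assumed year for checking a short year against the age


def expand_year(two_digit, age):
    """Pick the century for a short year, using the provided age when available."""
    candidates = [y for y in (1900 + two_digit, 2000 + two_digit) if y <= REFERENCE_YEAR]
    if age is None:
        return candidates[-1]  # no age to validate against: take the latest non-future year
    return min(candidates, key=lambda y: abs((REFERENCE_YEAR - y) - age))


def normalize_dob(text, age):
    """Return an ISO date string, or None when the value cannot be parsed."""
    match = DATE.match(text)
    if not match:
        return None
    day, month, year = (int(group) for group in match.groups())
    if len(match.group(3)) == 2:
        year = expand_year(year, age)
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None  # e.g. day 31 in a 30-day month


def clean_record(raw):
    """Return a cleaned copy of one parsed row, or None when the ID is unusable."""
    id_value = (raw.get("id") or "").strip()
    if not DIGITS.match(id_value):
        return None  # non-numeric identification number: drop the record
    age_text = (raw.get("age") or "").strip()
    age = int(age_text) if DIGITS.match(age_text) else None
    dob_text = (raw.get("dob") or "").strip()
    if HASHES.match(dob_text) or dob_text == age_text:
        dob_text = ""  # hashes, or the age repeated in the date of birth column
    return {"id": id_value, "age": age, "dob": normalize_dob(dob_text, age)}
```

For example, `clean_record({"id": "123456789", "age": "13", "dob": "1/2/10"})` resolves the short year to 2010 rather than 1910, because 2010 is the candidate closest to the stated age.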

Reconciling Conflicts & Changes

We assessed record conflicts based on the provided identification number only. If our existing list had a record with the same identification number, we checked the field changes (the "diff") to determine whether the change was acceptable using the following methodology:

  • if the age changed by only one year, we allowed the change, as it's likely a reference date or rounding issue (the initial Ministry list was provided in a form that calculated age against the current day rather than a fixed reference date, and our prior list fixed that date to January 5, 2024 per source dating)
  • if a comparison of names using Levenshtein Distance led to a change amounting to less than 30% of the original name's length, we allowed the change, but only if the new name didn't rely more on our fallback auto translation library than it did before
  • if an age or date of birth was not on the existing record and it was on the incoming one, we accepted it

This process helped us narrow in on specific record sets to refine our approach.
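
The three rules above can be sketched as a single check. This is an illustration under assumptions, not our merge script: the record fields (`name`, `age`, `fallback_parts`) are hypothetical, `fallback_parts` stands for the number of name parts produced by the fallback translation library, and cases the rules do not mention (such as both records carrying different dates of birth) are simply not rejected here.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def name_change_ratio(old_name, new_name):
    """Edit distance expressed as a fraction of the original name's length."""
    if not old_name:
        return 1.0
    return levenshtein(old_name, new_name) / len(old_name)


def accept_change(existing, incoming, max_ratio=0.30):
    """Decide whether an incoming record may overwrite the existing one."""
    # Age: a shift of at most one year is treated as a reference date issue.
    if existing["age"] is not None and incoming["age"] is not None:
        if abs(existing["age"] - incoming["age"]) > 1:
            return False
    # Name: the change must stay under 30% of the original name's length...
    if name_change_ratio(existing["name"], incoming["name"]) >= max_ratio:
        return False
    # ...and must not rely more on the fallback translation than before.
    if incoming["fallback_parts"] > existing["fallback_parts"]:
        return False
    # An age or date of birth that fills a gap is accepted without further checks.
    return True
```

Dividing by the length of the original name, rather than the longer of the two names, follows the wording of the rule ("less than 30% of the original name's length").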

Where there were changes in names for existing records, matched by identification number and within our accepted threshold of 30%, the breakdown was as follows:

change % upper bound    number of occurrences

(the change threshold upper bound means that 20% would include a 12% or 18% change to the original name)
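
As a small illustration of the upper-bound convention described above, the sketch below assigns a percentage change to its bucket. The bucket width of 10 percentage points is an assumption made for the example, and the function name is hypothetical.

```python
import math


def change_bucket(percent, width=10):
    """Return the upper bound of the bucket that contains a percentage change."""
    if percent <= 0:
        return 0  # unchanged names get their own bucket
    return math.ceil(percent / width) * width


for change in (12, 18, 20, 21):
    print(change, "->", change_bucket(change))
```

With this convention a change exactly equal to a bound (20%) stays in that bound's bucket, while anything above it (21%) moves to the next one.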

In terms of overall types of record changes across those already in our list at the time of merge, the breakdown was as follows:

Fields affected               Number of occurrences
None (Duplicate)              4,089
Age and Name                  2,113
Only Age                      1,557
Age, Birth Date, and Name     10
Age and Birth Date            12
Birth Date and Name           1

· 3 min read

We've made some significant changes to our previously published Killed in Gaza list, which has the names of those known to have been killed in Gaza since October 7th. This post provides more detail on our new methodology and what to expect from the changes.

Prior Method

Our prior list relied heavily on an existing library (arabic-names-to-en) which first tried to translate a name segment using a dictionary mapping, then fell back to a character-by-character lookup. We then had some volunteers do a visual review and incorporated manual changes. For a list of over 14 thousand names, this proved hard to manage.

New Method

We've since built our own dictionary mapping with more name coverage, and the process now looks like this:

  1. we clean Arabic names in the original list of formatting issues (using dict_ar_ar.csv)
  2. we look up / translate each name part into English (using dict_ar_en.csv)
  3. we run final transformations when converting to JSON (see JSON export script)

The final step includes a fallback to the old library for remaining Arabic translations that are not yet in our curated dict_ar_ar.csv. Currently fewer than 2% of the names are partially handled by this fallback mechanism, and we'll be working to reduce that number.
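
The lookup-then-fallback flow can be sketched as follows. This is a simplified illustration, not the export script: the two dictionaries are shown as in-memory mappings rather than loaded from the CSV files, their contents are toy entries, the fallback is a stand-in function rather than the old library, and all three steps are folded into one function for brevity.

```python
def translate_name(arabic_name, ar_ar, ar_en, fallback):
    """Translate a name part by part; also count the parts that needed the fallback."""
    translated = []
    fallback_count = 0
    for part in arabic_name.split():
        part = ar_ar.get(part, part)           # clean up spelling / formatting variants
        if part in ar_en:
            translated.append(ar_en[part])     # curated dictionary lookup
        else:
            translated.append(fallback(part))  # not in the dictionary yet
            fallback_count += 1
    return " ".join(translated), fallback_count


# Toy mappings (hypothetical entries, not the contents of the real CSV files).
AR_AR = {"احمد": "أحمد"}
AR_EN = {"أحمد": "Ahmed", "محمد": "Mohammed"}


def stand_in_fallback(part):
    """Placeholder for the old library: marks the part instead of transliterating it."""
    return "[" + part + "]"


full = translate_name("احمد محمد", AR_AR, AR_EN, stand_in_fallback)
partial = translate_name("محمد علي", AR_AR, AR_EN, stand_in_fallback)
print(full, partial)
```

Returning the fallback count alongside the translation makes it straightforward to measure what share of names still depends on the fallback.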

Notable Changes

We've avoided what we believe would have been breaking changes to the dataset per our versioning guide, but we did add 21 new records from the original official list released in November 2023. The IDs that were introduced from that November list include:

  • 401771530
  • 401844790
  • 405424524
  • 407194836
  • 411518053
  • 425923364
  • 436788202
  • 437391725
  • 438240293
  • 438445371
  • 441199296
  • 800328817
  • 802335927
  • 803827518
  • 804662112
  • 804669000
  • 901494161
  • 930025457
  • 932076094
  • 942125832
  • 95270068

The list before this change can be found on GitHub.

Here are some additional details about the current list & the latest revision:

  • there are 14,140 names
  • English name changes between this and the last published list, measured using Levenshtein distance:
    • 24% of names had no change
    • 60% of names differed by 1 to 4 edits, inclusive
    • 15% of names differed by 5 to 9 edits, inclusive
    • 1.9% of names differed by 10 or more edits
  • 92 records (0.65%) had age changes from the prior release (all 1 year less than before)
  • 29 names have "unknown" for part or all of the name, and those are now represented in the English translation as ?

We're continually working to improve translations and the list in general. If you have ideas or want to contribute a change, please see our contributing guide.