The PDB Serialized Hash Table Format
====================================

.. contents::
   :local:

.. _hash_intro:

Introduction
============

One of the design goals of the PDB format is to provide accelerated access to
debug information. For this reason, hash tables are in several places
serialized and embedded directly in the file, rather than requiring a consumer
to read a list of values and reconstruct the hash table on the fly.

The serialization format supports hash tables of arbitrarily large size and
capacity, as well as arbitrary value types and hash functions. The only
supported key type is uint32. The only requirement is that the producer and
consumer agree on the hash function. As such, the hash function is not
discussed further in this document; it is assumed that for a particular
instance of a PDB file hash table, the appropriate hash function is being used.

On-Disk Format
==============

.. code-block:: none

  .--------------------.-- +0
  |        Size        |
  .--------------------.-- +4
  |      Capacity      |
  .--------------------.-- +8
  | Present Bit Vector |
  .--------------------.-- +N
  | Deleted Bit Vector |
  .--------------------.-- +M                  ─╮
  |        Key         |                        │
  .--------------------.-- +M+4                 │
  |       Value        |                        │
  .--------------------.-- +M+4+sizeof(Value)   │
           ...                                  ├─ |Capacity| Bucket entries
  .--------------------.                        │
  |        Key         |                        │
  .--------------------.                        │
  |       Value        |                        │
  .--------------------.                       ─╯

- **Size** - The number of values contained in the hash table.

- **Capacity** - The number of buckets in the hash table. Producers should
  keep the number of entries no greater than ``2/3*Capacity+1``, which
  corresponds to a load factor of roughly two thirds.

- **Present Bit Vector** - A serialized bit vector which records which buckets
  have valid values. If the bucket has a value, the corresponding bit is set.
  If the bucket doesn't have a value (either because the bucket is empty or
  because the value is a tombstone value), the bit is unset.

- **Deleted Bit Vector** - A serialized bit vector which records which buckets
  have tombstone values. If the entry in this bucket is deleted, the bit is
  set; otherwise it is unset.

- **Keys and Values** - A list of ``Capacity`` hash buckets, where each bucket
  holds the key (always a uint32) followed by the value. The state of each
  bucket (valid, empty, deleted) can be determined by examining the present
  and deleted bit vectors.
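
A minimal sketch of how a consumer might walk the header fields described
above. It reads ``Size``, ``Capacity`` and both bit vectors, reports the offset
``+M`` at which the bucket entries begin, and classifies each bucket as valid,
deleted or empty. It assumes little-endian uint32 fields and the bit vector
encoding from the "Present and Deleted Bit Vectors" section, and it treats bits
beyond the stored words as unset. The names ``read_bit_vector``, ``test_bit``,
``read_header`` and ``bucket_state`` are hypothetical and not part of any PDB
library.

.. code-block:: python

  import struct

  def read_bit_vector(data, offset):
      """Read Word Count plus that many uint32 words; return (words, end offset)."""
      (word_count,) = struct.unpack_from("<I", data, offset)
      words = struct.unpack_from(f"<{word_count}I", data, offset + 4)
      return list(words), offset + 4 + 4 * word_count

  def test_bit(words, k):
      """True if bit k is set. Assumption: bits past the last word are unset."""
      word = k // 32
      return word < len(words) and bool(words[word] >> (k % 32) & 1)

  def read_header(data):
      """Parse everything that precedes the bucket entries."""
      size, capacity = struct.unpack_from("<II", data, 0)
      present, n = read_bit_vector(data, 8)   # present vector starts at +8
      deleted, m = read_bit_vector(data, n)   # deleted vector starts at +N
      return {"size": size, "capacity": capacity,
              "present": present, "deleted": deleted,
              "buckets_offset": m}            # bucket entries start at +M

  def bucket_state(header, k):
      """Combine the two bit vectors into one of three bucket states."""
      if test_bit(header["present"], k):
          return "valid"
      if test_bit(header["deleted"], k):
          return "deleted"
      return "empty"

  # Hand-built header: Size=2, Capacity=4, buckets 0 and 2 present, bucket 1 deleted.
  blob = struct.pack("<II", 2, 4)
  blob += struct.pack("<II", 1, 0b0101)   # present bit vector: one word
  blob += struct.pack("<II", 1, 0b0010)   # deleted bit vector: one word

  header = read_header(blob)
  states = [bucket_state(header, k) for k in range(header["capacity"])]
  print(header["buckets_offset"])  # 24
  print(states)                    # ['valid', 'deleted', 'valid', 'empty']

The sketch stops at ``+M`` on purpose: decoding the bucket entries themselves
requires knowing the size of the value type, which depends on the particular
hash table instance.
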

.. _hash_bit_vectors:

Present and Deleted Bit Vectors
===============================

The bit vectors indicating the status of each bucket are serialized as follows:

.. code-block:: none

  .--------------------.-- +0
  |     Word Count     |
  .--------------------.-- +4
  |       Word_0       |        ─╮
  .--------------------.-- +8    │
  |       Word_1       |         │
  .--------------------.-- +12   ├─ |Word Count| values
           ...                   │
  .--------------------.         │
  |       Word_N       |         │
  .--------------------.        ─╯

The words, when viewed as a contiguous block of bytes, represent a bit vector
with the following layout:

.. code-block:: none

    .------------.         .------------.------------.
    |   Word_N   |   ...   |   Word_1   |   Word_0   |
    .------------.         .------------.------------.
    |            |         |            |            |
  +N*32      +(N-1)*32    +64          +32          +0

where the k'th bit of this bit vector represents the status of the k'th bucket
in the hash table.
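
To make the bit numbering concrete, the following sketch shows the producer
side: it packs a set of bucket indices into the serialized form above and then
decodes it again. Bit ``k`` is stored in ``Word_(k / 32)`` at bit position
``k % 32``. The little-endian byte order and the choice to emit only as many
words as the highest set bit requires are assumptions of this sketch, and
``encode_bit_vector`` and ``decode_set_bits`` are hypothetical helper names.

.. code-block:: python

  import struct

  def encode_bit_vector(set_bits):
      """Serialize bucket indices as Word Count followed by uint32 words."""
      # Assumption: use the fewest words that cover the highest set bit.
      word_count = (max(set_bits) // 32 + 1) if set_bits else 0
      words = [0] * word_count
      for k in set_bits:
          words[k // 32] |= 1 << (k % 32)   # bit k lives in Word_(k // 32)
      return struct.pack(f"<I{word_count}I", word_count, *words)

  def decode_set_bits(data):
      """Return the sorted bucket indices whose bits are set."""
      (word_count,) = struct.unpack_from("<I", data, 0)
      words = struct.unpack_from(f"<{word_count}I", data, 4)
      return [32 * i + b
              for i, w in enumerate(words)
              for b in range(32)
              if w >> b & 1]

  encoded = encode_bit_vector({0, 2, 33})
  print(encoded.hex())             # 020000000500000002000000
  print(decode_set_bits(encoded))  # [0, 2, 33]

Here buckets 0 and 2 land in ``Word_0`` (value 5) and bucket 33 lands in bit 1
of ``Word_1`` (value 2), preceded by a word count of 2.
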