XRayFDRFormat.rst 19 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401
  1. ======================================
  2. XRay Flight Data Recorder Trace Format
  3. ======================================
  4. :Version: 1 as of 2017-07-20
  5. .. contents::
  6. :local:
  7. Introduction
  8. ============
  9. When gathering XRay traces in Flight Data Recorder mode, each thread of an
  10. application will claim buffers to fill with trace data, which at some point
  11. is finalized and flushed.
  12. A goal of the profiler is to minimize overhead, the flushed data directly
  13. corresponds to the buffer.
  14. This document describes the format of a trace file.
  15. General
  16. =======
  17. Each trace file corresponds to a sequence of events in a particular thread.
  18. The file has a header followed by a sequence of discriminated record types.
  19. The endianness of byte fields matches the endianess of the platform which
  20. produced the trace file.
  21. Header Section
  22. ==============
  23. A trace file begins with a 32 byte header.
  24. +-------------------+-----------------+----------------------------------------+
  25. | Field | Size (bytes) | Description |
  26. +===================+=================+========================================+
  27. | version | ``2`` | Anticipates versioned readers. This |
  28. | | | document describes the format when |
  29. | | | version == 1 |
  30. +-------------------+-----------------+----------------------------------------+
  31. | type | ``2`` | An enumeration encoding the type of |
  32. | | | trace. Flight Data Recorder mode |
  33. | | | traces have type == 1 |
  34. +-------------------+-----------------+----------------------------------------+
  35. | bitfield | ``4`` | Holds parameters that are not aligned |
  36. | | | to bytes. Further described below. |
  37. +-------------------+-----------------+----------------------------------------+
  38. | cycle_frequency | ``8`` | The frequency in hertz of the CPU |
  39. | | | oscillator used to measure duration of |
  40. | | | events in ticks. |
  41. +-------------------+-----------------+----------------------------------------+
  42. | buffer_size | ``8`` | The size in bytes of the data portion |
  43. | | | of the trace following the header. |
  44. +-------------------+-----------------+----------------------------------------+
  45. | reserved | ``8`` | Reserved for future use. |
  46. +-------------------+-----------------+----------------------------------------+
  47. The bitfield parameter of the file header is composed of the following fields.
  48. +-------------------+----------------+-----------------------------------------+
  49. | Field | Size (bits) | Description |
  50. +===================+================+=========================================+
  51. | constant_tsc | ``1`` | Whether the platform's timestamp |
  52. | | | counter used to record ticks between |
  53. | | | events ticks at a constant frequency |
  54. | | | despite CPU frequency changes. |
  55. | | | 0 == non-constant. 1 == constant. |
  56. +-------------------+----------------+-----------------------------------------+
  57. | nonstop_tsc | ``1`` | Whether the tsc continues to count |
  58. | | | despite whether the CPU is in a low |
  59. | | | power state. 0 == stop. 1 == non-stop. |
  60. +-------------------+----------------+-----------------------------------------+
  61. | reserved | ``30`` | Not meaningful. |
  62. +-------------------+----------------+-----------------------------------------+
  63. Data Section
  64. ============
  65. Following the header in a trace is a data section with size matching the
  66. buffer_size field in the header.
  67. The data section is a stream of elements of different types.
  68. There are a few categories of data in the sequence.
  69. - ``Function Records``: Function Records contain the timing of entry into and
  70. exit from function execution. Function Records have 8 bytes each.
  71. - ``Metadata Records``: Metadata records serve many purposes. Mostly, they
  72. capture information that may be too costly to record for each function, but
  73. that is required to contextualize the fine-grained timings. They also are used
  74. as markers for user-defined Event Data payloads. Metadata records have 16
  75. bytes each.
  76. - ``Event Data``: Free form data may be associated with events that are traced
  77. by the binary and encode data defined by a handler function. Event data is
  78. always preceded with a marker record which indicates how large it is.
  79. - ``Function Arguments``: The arguments to some functions are included in the
  80. trace. These are either pointer addresses or primitives that are read and
  81. logged independently of their types in a high level language. To the tracer,
  82. they are all numbers. Function Records that have attached arguments will
  83. indicate their presence on the function entry record. We only support logging
  84. contiguous function argument sequences starting with argument zero, which will
  85. be the "this" pointer for member function invocations. For example, we don't
  86. support logging the first and third argument.
  87. A reader of the memory format must maintain a state machine. The format makes no
  88. attempt to pad for alignment, and it is not seekable.
  89. Function Records
  90. ----------------
  91. Function Records have an 8 byte layout. This layout encodes information to
  92. reconstruct a call stack of instrumented function and their durations.
  93. +---------------+--------------+-----------------------------------------------+
  94. | Field | Size (bits) | Description |
  95. +===============+==============+===============================================+
  96. | discriminant | ``1`` | Indicates whether a reader should read a |
  97. | | | Function or Metadata record. Set to ``0`` for |
  98. | | | Function records. |
  99. +---------------+--------------+-----------------------------------------------+
  100. | action | ``3`` | Specifies whether the function is being |
  101. | | | entered, exited, or is a non-standard entry |
  102. | | | or exit produced by optimizations. |
  103. +---------------+--------------+-----------------------------------------------+
  104. | function_id | ``28`` | A numeric ID for the function. Resolved to a |
  105. | | | name via the xray instrumentation map. The |
  106. | | | instrumentation map is built by xray at |
  107. | | | compile time into an object file and pairs |
  108. | | | the function ids to addresses. It is used for |
  109. | | | patching and as a lookup into the binary's |
  110. | | | symbols to obtain names. |
  111. +---------------+--------------+-----------------------------------------------+
  112. | tsc_delta | ``32`` | The number of ticks of the timestamp counter |
  113. | | | since a previous record recorded a delta or |
  114. | | | other TSC resetting event. |
  115. +---------------+--------------+-----------------------------------------------+
  116. On little-endian machines, the bitfields are ordered from least significant bit
  117. bit to most significant bit. A reader can read an 8 bit value and apply the mask
  118. ``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned
  119. shift right by ``0x04`` to obtain the function_id field.
  120. On big-endian machine, the bitfields are written in order from most significant
  121. bit to least significant bit. A reader would read an 8 bit value and unsigned
  122. shift right by 7 bits for the discriminant. The function_id field could be
  123. obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``.
  124. Function action types are as follows.
  125. +---------------+--------------+-----------------------------------------------+
  126. | Type | Number | Description |
  127. +===============+==============+===============================================+
  128. | Entry | ``0`` | Typical function entry. |
  129. +---------------+--------------+-----------------------------------------------+
  130. | Exit | ``1`` | Typical function exit. |
  131. +---------------+--------------+-----------------------------------------------+
  132. | Tail_Exit | ``2`` | An exit from a function due to tail call |
  133. | | | optimization. |
  134. +---------------+--------------+-----------------------------------------------+
  135. | Entry_Args | ``3`` | A function entry that records arguments. |
  136. +---------------+--------------+-----------------------------------------------+
  137. Entry_Args records do not contain the arguments themselves. Instead, metadata
  138. records for each of the logged args follow the function record in the stream.
  139. Metadata Records
  140. ----------------
  141. Interspersed throughout the buffer are 16 byte Metadata records. For typically
  142. instrumented binaries, they will be sparser than Function records, and they
  143. provide a fuller picture of the binary execution state.
  144. Metadata record layout is partially record dependent, but they share a common
  145. structure.
  146. The same bit field rules described for function records apply to the first byte
  147. of MetadataRecords. Within this byte, little endian machines use lsb to msb
  148. ordering and big endian machines use msb to lsb ordering.
  149. +---------------+--------------+-----------------------------------------------+
  150. | Field | Size | Description |
  151. +===============+==============+===============================================+
  152. | discriminant | ``1 bit`` | Indicates whether a reader should read a |
  153. | | | Function or Metadata record. Set to ``1`` for |
  154. | | | Metadata records. |
  155. +---------------+--------------+-----------------------------------------------+
  156. | record_kind | ``7 bits`` | The type of Metadata record. |
  157. +---------------+--------------+-----------------------------------------------+
  158. | data | ``15 bytes`` | A data field used differently for each record |
  159. | | | type. |
  160. +---------------+--------------+-----------------------------------------------+
  161. Here is a table of the enumerated record kinds.
  162. +--------+---------------------------+
  163. | Number | Type |
  164. +========+===========================+
  165. | 0 | NewBuffer |
  166. +--------+---------------------------+
  167. | 1 | EndOfBuffer |
  168. +--------+---------------------------+
  169. | 2 | NewCPUId |
  170. +--------+---------------------------+
  171. | 3 | TSCWrap |
  172. +--------+---------------------------+
  173. | 4 | WallTimeMarker |
  174. +--------+---------------------------+
  175. | 5 | CustomEventMarker |
  176. +--------+---------------------------+
  177. | 6 | CallArgument |
  178. +--------+---------------------------+
  179. NewBuffer Records
  180. -----------------
  181. Each buffer begins with a NewBuffer record immediately after the header.
  182. It records the thread ID of the thread that the trace belongs to.
  183. Its data segment is as follows.
  184. +---------------+--------------+-----------------------------------------------+
  185. | Field | Size (bytes) | Description |
  186. +===============+==============+===============================================+
  187. | thread_Id | ``2`` | Thread ID for buffer. |
  188. +---------------+--------------+-----------------------------------------------+
  189. | reserved | ``13`` | Unused. |
  190. +---------------+--------------+-----------------------------------------------+
  191. WallClockTime Records
  192. ---------------------
  193. Following the NewBuffer record, each buffer records an absolute time as a frame
  194. of reference for the durations recorded by timestamp counter deltas.
  195. Its data segment is as follows.
  196. +---------------+--------------+-----------------------------------------------+
  197. | Field | Size (bytes) | Description |
  198. +===============+==============+===============================================+
  199. | seconds | ``8`` | Seconds on absolute timescale. The starting |
  200. | | | point is unspecified and depends on the |
  201. | | | implementation and platform configured by the |
  202. | | | tracer. |
  203. +---------------+--------------+-----------------------------------------------+
  204. | microseconds | ``4`` | The microsecond component of the time. |
  205. +---------------+--------------+-----------------------------------------------+
  206. | reserved | ``3`` | Unused. |
  207. +---------------+--------------+-----------------------------------------------+
  208. NewCpuId Records
  209. ----------------
  210. Each function entry invokes a routine to determine what CPU is executing.
  211. Typically, this is done with readtscp, which reads the timestamp counter at the
  212. same time.
  213. If the tracing detects that the execution has switched CPUs or if this is the
  214. first instrumented entry point, the tracer will output a NewCpuId record.
  215. Its data segment is as follows.
  216. +---------------+--------------+-----------------------------------------------+
  217. | Field | Size (bytes) | Description |
  218. +===============+==============+===============================================+
  219. | cpu_id | ``2`` | CPU Id. |
  220. +---------------+--------------+-----------------------------------------------+
  221. | absolute_tsc | ``8`` | The absolute value of the timestamp counter. |
  222. +---------------+--------------+-----------------------------------------------+
  223. | reserved | ``5`` | Unused. |
  224. +---------------+--------------+-----------------------------------------------+
  225. TSCWrap Records
  226. ---------------
  227. Since each function record uses a 32 bit value to represent the number of ticks
  228. of the timestamp counter since the last reference, it is possible for this value
  229. to overflow, particularly for sparsely instrumented binaries.
  230. When this delta would not fit into a 32 bit representation, a reference absolute
  231. timestamp counter record is written in the form of a TSCWrap record.
  232. Its data segment is as follows.
  233. +---------------+--------------+-----------------------------------------------+
  234. | Field | Size (bytes) | Description |
  235. +===============+==============+===============================================+
  236. | absolute_tsc | ``8`` | Timestamp counter value. |
  237. +---------------+--------------+-----------------------------------------------+
  238. | reserved | ``7`` | Unused. |
  239. +---------------+--------------+-----------------------------------------------+
  240. CallArgument Records
  241. --------------------
  242. Immediately following an Entry_Args type function record, there may be one or
  243. more CallArgument records that contain the traced function's parameter values.
  244. The order of the CallArgument Record sequency corresponds one to one with the
  245. order of the function parameters.
  246. CallArgument data segment:
  247. +---------------+--------------+-----------------------------------------------+
  248. | Field | Size (bytes) | Description |
  249. +===============+==============+===============================================+
  250. | argument | ``8`` | Numeric argument (may be pointer address). |
  251. +---------------+--------------+-----------------------------------------------+
  252. | reserved | ``7`` | Unused. |
  253. +---------------+--------------+-----------------------------------------------+
  254. CustomEventMarker Records
  255. -------------------------
  256. XRay provides the feature of logging custom events. This may be leveraged to
  257. record tracing info for RPCs or similarly trace data that is application
  258. specific.
  259. Custom Events themselves are an unstructured (application defined) segment of
  260. memory with arbitrary size within the buffer. They are preceded by
  261. CustomEventMarkers to indicate their presence and size.
  262. CustomEventMarker data segment:
  263. +---------------+--------------+-----------------------------------------------+
  264. | Field | Size (bytes) | Description |
  265. +===============+==============+===============================================+
  266. | event_size | ``4`` | Size of preceded event. |
  267. +---------------+--------------+-----------------------------------------------+
  268. | absolute_tsc | ``8`` | A timestamp counter of the event. |
  269. +---------------+--------------+-----------------------------------------------+
  270. | reserved | ``3`` | Unused. |
  271. +---------------+--------------+-----------------------------------------------+
  272. EndOfBuffer Records
  273. -------------------
  274. An EndOfBuffer record type indicates that there is no more trace data in this
  275. buffer. The reader is expected to seek past the remaining buffer_size expressed
  276. before the start of buffer and look for either another header or EOF.
  277. Format Grammar and Invariants
  278. =============================
  279. Not all sequences of Metadata records and Function records are valid data. A
  280. sequence should be parsed as a state machine. The expectations for a valid
  281. format can be expressed as a context free grammar.
  282. This is an attempt to explain the format with statements in EBNF format.
  283. - Format := Header ThreadBuffer* EOF
  284. - ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End
  285. - BodySequence := NewCPUId | TSCWrap | Function | CustomEvent
  286. - Function := (Function_Entry_Args CallArgument*) | Function_Other_Type
  287. - CustomEvent := CustomEventMarker CustomEventUnstructuredMemory
  288. - End := EndOfBuffer RemainingBufferSizeToSkip
  289. Function Record Order
  290. ---------------------
  291. There are a few clarifications that may help understand what is expected of
  292. Function records.
  293. - Functions with an Exit are expected to have a corresponding Entry or
  294. Entry_Args function record precede them in the trace.
  295. - Tail_Exit Function records record the Function ID of the function whose return
  296. address the program counter will take. In other words, the final function that
  297. would be popped off of the call stack if tail call optimization was not used.
  298. - Not all functions marked for instrumentation are necessarily in the trace. The
  299. tracer uses heuristics to preserve the trace for non-trivial functions.
  300. - Not every entry must have a traced Exit or Tail Exit. The buffer may run out
  301. of space or the program may request for the tracer to finalize toreturn the
  302. buffer before an instrumented function exits.