Much enjoyed this empirical work and will ponder what (if anything) can be inferred from the statistical distributions.
I have two comments:
1) Crash reports have largely unexplored potential, for instance as evidence of (buggy) 0-day exploits in computer security. See the Websense 2014 report on Windows Error Reporting data (kept under tight wraps by Microsoft), "USING ANOMALIES IN CRASH REPORTS TO DETECT UNKNOWN THREATS": http://www.websense.com/assets/reports/websense-crash-report-en.pdf
They give two examples: the discovery of a new APT attack on a global telecommunication company and a government entity, and a previously unreported campaign against point-of-sale (POS) systems.
2) On the function name lengths, deduplication goals and tokens, see also Hatton, "Power-laws and the Conservation of Information in discrete token systems: Part 1 General Theory":
"Using variational principles suggested in [...] and using the principle of the Conservation of Information, it is predicted that the probability p_i of a component appearing with t_i tokens in any software system, whatever its implementation details, obeys the following distribution with respect to the size of its unique alphabet of tokens a_i,
p_i ∼ (a_i)^(−β)   (26)
Overwhelming evidence for this behaviour has been presented derived from some 55.5 million lines of code in six languages with an associated p-value of < 2.2 × 10^(−16).
The behaviour exemplified by (26) has been demonstrated to be persistent through the life of single software systems as exemplified by three very disparate systems."
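In case it helps when comparing your data against (26), here is a minimal sketch of one crude way to estimate β: a least-squares fit of log frequency against log alphabet size. This is only an illustration, not the procedure used in the quoted paper. The function name `estimate_beta`, the `min_size` cutoff and the synthetic input are all my own assumptions.

```python
import math
from collections import Counter


def estimate_beta(alphabet_sizes, min_size=1):
    """Estimate beta in p_i ~ a_i^(-beta) by a log-log least-squares fit.

    alphabet_sizes: one unique-token-alphabet size per component (hypothetical input).
    min_size: ignore components with a smaller alphabet (assumed cutoff).
    """
    counts = Counter(a for a in alphabet_sizes if a >= min_size)
    if len(counts) < 2:
        raise ValueError("need at least two distinct alphabet sizes")
    total = sum(counts.values())

    # Empirical probability of each alphabet size, on log-log axes.
    xs = [math.log(a) for a in counts]
    ys = [math.log(c / total) for c in counts.values()]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # The fitted slope is -beta.
    return -sxy / sxx


# Synthetic data: counts exactly proportional to a^(-2) for a in {1, 2, 4, 8}.
sizes = []
for a in (1, 2, 4, 8):
    sizes.extend([a] * (6400 // (a * a)))

beta = estimate_beta(sizes)
print(beta)  # approximately 2 for this synthetic input
```

A straight-line fit on log-log axes is known to be a biased estimator for power-law exponents, so a maximum-likelihood fit would generally be preferable for real data. The sketch is only meant to make the shape of (26) concrete.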
Best regards. I will study your paper in more depth over the holidays.