AceTwo
  • Threads: 5
  • Posts: 359
Joined: Mar 13, 2012
Thanked by
camapl
July 17th, 2014 at 1:43:35 PM permalink
I came across Benford's law a few days ago and it blew my mind.
It's rare that I find a new piece of math that leaves me wondering like this.
See Wikipedia link:
http://en.wikipedia.org/wiki/Benford's_law

Basically it states that in many different sets of statistics (for example, the populations of cities), the first digit of each item is not evenly distributed between 1 and 9.
Far more items start with 1 (about 30%), and the frequency gradually falls for each subsequent digit, down to about 5% for 9.
We are talking about huge differences between digits.
The reason this happens is not obvious. There is an explanation for populations that relates to exponential growth, but for other statistics the reason is less clear.

Apparently it also applies to financial data, for example the sales invoices of a business.
It is being used in forensic accounting to detect fraud (i.e. when the data are not distributed according to Benford's law).

When I read this, being naturally suspicious of anything I don't understand, I decided to test it.
I am a professional accountant, so I put the individual sales invoices of a client of mine into a spreadsheet (around 10,000 transactions) and calculated the first-digit frequencies.
When I saw the results, it was 'frightening'. The fit of these 10,000 entries with Benford's law was unreal; it was a very close fit.
For example, the law says 30.1% for 1, and my data showed 30.6%.
All digits from 1 to 9 closely fit the law.
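
(A minimal sketch of the same check, for anyone who wants to run it outside Excel; the invoice_amounts list below is a placeholder for whatever column of figures you have.)

import math
from collections import Counter

def first_digit(x):
    # Shift the decimal point until exactly one digit is left of it, then keep that digit.
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def benford_check(amounts):
    counts = Counter(first_digit(a) for a in amounts if a != 0)
    n = sum(counts.values())
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)   # Benford's predicted share for digit d
        observed = counts[d] / n           # share actually seen in the data
        print(f"{d}: Benford {expected:5.1%}   data {observed:5.1%}")

# benford_check(invoice_amounts)  # invoice_amounts = your column of figures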

I still do not understand why this law should apply to financial data like this.
AxiomOfChoice
  • Threads: 32
  • Posts: 5761
Joined: Sep 12, 2012
Thanked by
camapl
July 17th, 2014 at 1:55:19 PM permalink
Quote: AceTwo

I came across Benford's law a few days ago and it blew my mind.
It's rare that I find a new piece of math that leaves me wondering like this.
See Wikipedia link:
http://en.wikipedia.org/wiki/Benford's_law

Basically it states that in many different sets of statistics (for example, the populations of cities), the first digit of each item is not evenly distributed between 1 and 9.
Far more items start with 1 (about 30%), and the frequency gradually falls for each subsequent digit, down to about 5% for 9.
We are talking about huge differences between digits.
The reason this happens is not obvious. There is an explanation for populations that relates to exponential growth, but for other statistics the reason is less clear.

Apparently it also applies to financial data, for example the sales invoices of a business.
It is being used in forensic accounting to detect fraud (i.e. when the data are not distributed according to Benford's law).

When I read this, being naturally suspicious of anything I don't understand, I decided to test it.
I am a professional accountant, so I put the individual sales invoices of a client of mine into a spreadsheet (around 10,000 transactions) and calculated the first-digit frequencies.
When I saw the results, it was 'frightening'. The fit of these 10,000 entries with Benford's law was unreal; it was a very close fit.
For example, the law says 30.1% for 1, and my data showed 30.6%.
All digits from 1 to 9 closely fit the law.

I still do not understand why this law should apply to financial data like this.



It's because real-world numbers tend to have loose upper bounds on them, and those bounds tend to cut off numbers with high first digits much more often than numbers with low first digits.

Think about it this way. Imagine that you are counting, keeping track of the number of instances of each first digit as you go. So you start with 1, 2, 3, ... up to 9. At this point you have 1 of each number. But your next 10 numbers all begin with a 1. So your counter for 1's will be at 11 before any of your other counters even reaches 2. And then your counter for 2's will go up to 11 before any of your counters for 3 through 9 reaches 2. And so on and so on.

At any given time, your counters are always sorted: the number of 1s is always greater than or equal to the number of 2s, which is greater than or equal to the number of 3s, and so on, all the way up to 9. If you happen to stop right at a power of 10 minus 1 (e.g., 999 or 9999) then your counters will all be equal; otherwise, there will be some inequality weighted towards the low digits.

So, imagine you make money one dollar at a time, until at some arbitrary point (like the end of a quarter) you stop. You can see that the first digit of the amount of money that you made is most likely to be a 1, followed by a 2 which is 2nd most likely, etc, all the way up to 9.
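
(A rough sketch of that counting argument; the 1-to-1,000,000 stopping range is just an illustrative assumption. Tally the first digit of every number counted up to a random stopping point: the low digits always come out ahead, and the exact shares depend on how the stopping points are spread across magnitudes.)

import random
from collections import Counter

def count_leading(n, d):
    # How many integers in 1..n have leading digit d (without looping over every number).
    total, p = 0, 1
    while p <= n:
        lo, hi = d * p, (d + 1) * p - 1
        if lo <= n:
            total += min(hi, n) - lo + 1
        p *= 10
    return total

counts = Counter()
for _ in range(10_000):
    stop = random.randint(1, 1_000_000)   # stop counting at an arbitrary point
    for d in range(1, 10):
        counts[d] += count_leading(stop, d)

total = sum(counts.values())
for d in range(1, 10):
    print(d, f"{counts[d] / total:.1%}")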

This can be useful for fraud detection if the fraudsters don't know this law, because people making up numbers will tend to pick them more or less evenly distributed.
AceTwo
  • Threads: 5
  • Posts: 359
Joined: Mar 13, 2012
July 17th, 2014 at 2:29:27 PM permalink
Quote: AxiomOfChoice


Think about it this way. Imagine that you are counting, keeping track of the number of instances of each first digit as you go. So you start with 1, 2, 3, ... up to 9. At this point you have 1 of each number. But your next 10 numbers all begin with a 1. So your counter for 1's will be at 11 before any of your other counters even reaches 2. And then your counter for 2's will go up to 11 before any of your counters for 3 through 9 reaches 2. And so on and so on.

At any given time, your counters are always sorted: the number of 1s is always greater than or equal to the number of 2s, which is greater than or equal to the number of 3s, and so on, all the way up to 9. If you happen to stop right at a power of 10 minus 1 (e.g., 999 or 9999) then your counters will all be equal; otherwise, there will be some inequality weighted towards the low digits.

So, imagine you make money one dollar at a time, until at some arbitrary point (like the end of a quarter) you stop. You can see that the first digit of the amount of money that you made is most likely to be a 1, followed by a 2 which is 2nd most likely, etc, all the way up to 9.



I can appreciate that if you start counting sequentially, one dollar at a time, and stop randomly at some point, then this law applies.
But why should sales invoice data be thought of as this 'count sequentially and stop randomly at some point' scenario?

Why isn't sales invoice data completely random (or almost completely random)?
If you take completely random data, the law does not apply; each digit appears 1/9 of the time.
AxiomOfChoice
  • Threads: 32
  • Posts: 5761
Joined: Sep 12, 2012
July 17th, 2014 at 2:38:35 PM permalink
Quote: AceTwo

I can appreciate that if you start counting sequentially, one dollar at a time, and stop randomly at some point, then this law applies.
But why should sales invoice data be thought of as this 'count sequentially and stop randomly at some point' scenario?

Why isn't sales invoice data completely random (or almost completely random)?



This is pretty much how sales works. You start earning money and stop at some arbitrary point (end of quarter, end of year, whatever). The fact that you are not counting up by increments of 1 each time is not relevant; the point is that you are counting something, and you have to go past the numbers with small first digits before you get to the ones with large first digits.

Even if you look at individual invoices and not just aggregate data, this is pretty much how pricing works as well. Prices rise with inflation, etc., so you are essentially counting up until you stop at some arbitrary time. If you look at the number of things that someone buys, it's the same. Maybe a small business needs 1 computer, then 2, then 3, as it grows larger and larger. Eventually it gets into the 10s and the 100s and the 1000s.
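
(A rough sketch of that growth picture; the 1-10% growth rates and the 1-300 period horizon are assumptions for illustration. Start a figure small, let it grow by a few percent per period, stop at an arbitrary time, and the leading digits land close to Benford's percentages.)

import math
import random
from collections import Counter

def leading_digit(x):
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

counts = Counter()
for _ in range(20_000):
    value = 1.0
    for _ in range(random.randint(1, 300)):          # stop at an arbitrary time
        value *= 1 + random.uniform(0.01, 0.10)      # grow 1-10% per period
    counts[leading_digit(value)] += 1

total = sum(counts.values())
for d in range(1, 10):
    print(d, f"observed {counts[d] / total:5.1%}   Benford {math.log10(1 + 1 / d):5.1%}")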

Quote:

If you take completely random data, the law does not apply; each digit appears 1/9 of the time.



What do you mean by "take completely random data"? There is no such thing as a uniform probability distribution over all the integers (or all the positive integers, or any other countably infinite set).
AceTwo
  • Threads: 5
  • Posts: 359
Joined: Mar 13, 2012
July 17th, 2014 at 2:45:21 PM permalink
Quote: AxiomOfChoice

What do you mean by "take completely random data"? There is no such thing as a uniform probability distribution over all the integers (or all the positive integers, or any other countably infinite set).


I tested using RAND()*1000 in Excel and the frequency is 1/9 for each digit.

Wow, I see what you mean now; by doing that test I put an upper bound of 1,000 on the data.
You are right: to make it uniformly random you need to put an upper bound on it.
It's the lack of an upper bound in real data that, when the figures grow to an extra digit (say from 3 digits to 4), favours a leading 1.
I think I get it now.
AxiomOfChoice
  • Threads: 32
  • Posts: 5761
Joined: Sep 12, 2012
July 17th, 2014 at 2:47:33 PM permalink
Quote: AceTwo

I tested using RAND()*1000 in Excel and the frequency is 1/9 for each digit.

Wow, I see what you mean now; by doing that test I put an upper bound of 1,000 on the data.
You are right: to make it uniformly random you need to put an upper bound on it.
It's the lack of an upper bound in real data that, when the figures grow to an extra digit (say from 3 digits to 4), favours a leading 1.
I think I get it now.



Yes, exactly, you are picking a random number between 0 and 1000.

Note that real-life data is rarely uniformly distributed.
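
(A quick sketch of the same experiment, assuming RAND()*1000 means uniform draws between 0 and 1000: the nine leading digits come out at roughly 1/9, about 11.1% each, nothing like Benford's 30.1% for a leading 1.)

import random
from collections import Counter

def leading_digit(x):
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

sample = [random.uniform(0, 1000) for _ in range(100_000)]   # like RAND()*1000 in Excel
counts = Counter(leading_digit(x) for x in sample if x > 0)

for d in range(1, 10):
    print(d, f"{counts[d] / len(sample):.1%}")   # roughly 11.1% each
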
FleaStiff
  • Threads: 265
  • Posts: 14484
Joined: Oct 19, 2009
July 17th, 2014 at 2:48:49 PM permalink
Yeah, 1.99 is a more common price than 2.00.
Any set of real-world numbers will tend to be skewed, and a lack of such skewness is often the sign of a thief creating fictitious numbers and trying to make them look random.

It's the same thing as the professor who tells students to toss a coin a thousand times, or else to write down results as if they had actually tossed a coin a thousand times. Those who just pretend to toss a coin usually keep the Hs and Ts fairly evenly mixed and never write down the long runs of heads or tails that real tosses produce.

That is why one woman who was stealing millions from a charity in California always had the cash drawer in the thrift shop "balance" no more than three days a week. She intentionally stole a few pennies from the thrift shop so the auditor would see "reasonable" numbers rather than perfect ones.

The same principle applied during World War Two, when Dutch Underground agents were radioing Morse code messages to their London controllers but, despite all the stress of being agents in the field, never made any mistakes or asked for any re-transmissions. The British soon realized they now had a back channel.

One British officer who became suspicious of a French Maquis radio operator's Morse code simply sent HH at the end of his message and immediately got an HH in reply, since the "French resistance fighter" was in reality a German signals clerk who typed HH at the beginning and end of every message all day long.

People who are "cooking the books" often make the figures just too good to pass any "real world" scrutiny. One reviewer of medical trials got suspicious when a form was submitted that was pristine and written entirely with one pen. If the form had actually been used in the real world, it would have had some coffee-cup rings on it, been filled in with several different inks, and had some messages and phone numbers scribbled in the margin. It was just not a "real world" form; it was too perfect.
24Bingo
  • Threads: 23
  • Posts: 1348
Joined: Jul 4, 2012
July 17th, 2014 at 4:27:55 PM permalink
Quote: AceTwo

The reason this happens is not obvious. There is an explanation for populations that relates to exponential growth, but for other statistics the reason is less clear.



It's pretty simple: the bigger things are, the further apart they tend to be. Things in general, that is. That's why we describe things in percentage terms.

With that in mind, it's pretty easy to see that this creates a distribution that is uniform on a log plot, and a quick look at a slide rule will tell you why that produces a preponderance of low first digits.
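
(A quick numerical check of that slide-rule picture, with an assumed sample spread uniformly on a log scale across six orders of magnitude: each leading digit d captures a slice of width log10(1 + 1/d), which are exactly Benford's percentages.)

import math
import random
from collections import Counter

def leading_digit(x):
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

# A uniform exponent = a uniform position along the slide rule's log scale.
sample = [10 ** random.uniform(0, 6) for _ in range(100_000)]
counts = Counter(leading_digit(x) for x in sample)

for d in range(1, 10):
    print(d, f"observed {counts[d] / len(sample):5.1%}   Benford {math.log10(1 + 1 / d):5.1%}")
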
The trick to poker is learning not to beat yourself up for your mistakes too much, and certainly not too little, but just the right amount.
AxiomOfChoice
  • Threads: 32
  • Posts: 5761
Joined: Sep 12, 2012
July 17th, 2014 at 4:30:52 PM permalink
I like this explanation. Unfortunately I left my slide rule in my other pants.
kubikulann
  • Threads: 27
  • Posts: 905
Joined: Jun 28, 2011
August 1st, 2014 at 12:13:58 PM permalink
Assuming there is such a distribution, it must appear whatever the unit used. For example, converting your data from dollars to euros or pounds should not destroy the effect.

This means that the distribution must be invariant under multiplication by any constant.

The only distribution with that property (on the finite discrete set of first digits) is Benford's law.
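
(A small sketch of that invariance; the log-uniform generator and the exchange rates below are assumptions for illustration. Rescaling a Benford-like sample from dollars to euros or pounds leaves the first-digit frequencies essentially unchanged.)

import random
from collections import Counter

def leading_digit(x):
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def digit_shares(values):
    counts = Counter(leading_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]

dollars = [10 ** random.uniform(0, 6) for _ in range(100_000)]   # Benford-like sample
euros = [0.92 * x for x in dollars]                              # hypothetical exchange rate
pounds = [0.79 * x for x in dollars]                             # hypothetical exchange rate

for name, values in (("dollars", dollars), ("euros", euros), ("pounds", pounds)):
    print(name, [f"{p:.1%}" for p in digit_shares(values)])
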
Reperiet qui quaesiverit
kubikulann
  • Threads: 27
  • Posts: 905
Joined: Jun 28, 2011
August 1st, 2014 at 12:18:24 PM permalink
It has been used to detect fraud, in scientific papers as well as in financial matters.
The problem, from a statistician's point of view, is that classical (frequentist) inference is less efficient in this setting than Bayesian inference.

As an example, some years ago German researchers claimed their analysis showed that Belgium was cheating on European rules. It was subsequently shown that nothing was wrong. Luckily for Belgium, the EU Commission did not put too much faith in the researchers' results.
Reperiet qui quaesiverit
pacomartin
  • Threads: 649
  • Posts: 7895
Joined: Jan 14, 2010
August 1st, 2014 at 12:56:36 PM permalink
Quote: AceTwo

I still do not understand why this law should apply to financial data like this.



If you designate one side of a coin a loss and the other a win, you might ask yourself: on average, how many times must you toss the coin before losing twice in a row?

You might work out all possible combinations as follows.

LL WW WL LW
(1 in 4 for two in a row)

LLL LLW WLL LWL WLW WWL LWW WWW
(3 in 8 for three in a row)

LLLW WWWL
LLWL WWLW
LWLL WLWW
WLLL LWWW
WWLL LWLW
WLLW WLWL
LLWW LWWL
LLLL WWWW
(8 in 16 for four in a row)

The 19 sequences that contain two losses in a row:

LLLLL
LLLLW LLLWL LLWLL LWLLL WLLLL
LLWWW WLLWW WWLLW WWWLL
WWLLL LWWLL LLWWL LLLWW
WLWLL WLLWL WLLLW LWLLW LLWLW

The 13 that do not:

WLWLW WLWWL WWLWL
LWWWW WWWWW LWWWL WWWWL
LWWLW WWWLW LWLWW WWLWW
LWLWL WLWWW

(19 in 32 for five in a row)

2) 1/4 =25.000%
3) 3/8 =37.500%
4) 8/16 =50.000%
5) 19/32 =59.375%
6) 67.188%
7) 73.438%
8) 78.516%
9) 82.617%



So you conclude that on average you must toss the coin 4 times before losing two in a row (based on 8 of the 16 four-toss combinations having at least two losses in a row, i.e. the first point where the chance passes 50%).

You would be WRONG. The correct answer is related to the principle behind Benford's law.

On average you will toss the coin 6 times before you get a run of 2 losses.
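
(A small check of that figure, not from the original post: simulate the tosses, or solve the standard two-state recurrence; both give an expected 6 tosses for a fair coin.)

import random

def tosses_until_two_losses():
    run = tosses = 0
    while run < 2:
        tosses += 1
        run = run + 1 if random.random() < 0.5 else 0   # a win resets the run of losses
    return tosses

trials = 100_000
average = sum(tosses_until_two_losses() for _ in range(trials)) / trials
print(f"simulated average: {average:.2f}")   # about 6.0

# Exact value: let E = expected tosses from scratch, F = expected tosses after one loss.
# E = 1 + (1/2)F + (1/2)E  and  F = 1 + (1/2)*0 + (1/2)E,  which solve to E = 6.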