Market prices must be the most heavily studied type of time series ever. Bitcoins are a rather interesting example: since they pay no dividends and have no backing assets, their price at any moment is entirely due to speculation, namely to the market's prediction of their future price. I could not resist doing my own amateur analysis. (There must be tons of books on this subject, but books and papers are best read after you have spent a couple of weeks banging your head on the wall...)

[b]The Brownian model[/b]

I could not find any significant correlation between future prices and past prices earlier than the current one. In log scale (that is, considering price ratios rather than differences), the change between the mean price in one period (say, one hour) and the price in the next period looks pretty much like a random variable, with zero mean, that is independent of all earlier changes.

Specifically, let's take the weighted mean BTC/USD price at Bitstamp in successive 1-hour intervals. Let P[i] be that price in period number i (counted from some arbitrary starting point) and Z[i] be the log base 10 of P[i]. Thus, an increase of 1.0 in Z means that the price P was multiplied by 10, while a decrease of 1.0 means that P was divided by 10.

As said above, looking at the Bitstamp hourly data since last September I cannot find any significant correlation between the future changes in Z (that is, Z[i+n] - Z[i], for any n > 1) and the past changes (that is, Z[i] - Z[i-1], Z[i-1] - Z[i-2], etc.). Thus, the best predictive model I found that fits that data is a simple Brownian model

(1.1) Z[i+1] = Z[i] + C*RND[i]

where C ~ 0.01, and each RND[i] is an independent random variable with zero mean and unit standard deviation. That is, at each hour the mean price changes by a random factor, on the order of 1 percent, in either direction.

This model implies that, for any n > 0,

(1.2) Z[i+n] = Z[i] + C*sqrt(n)*RND[i,n]

where RND[i,n] is essentially a Gaussian random variable with mean 0 and unit deviation. (Note that these variables are [i]not[/i] independent when n > 1.)

I verified model (1.2) experimentally, by collecting a set S[n] of increments Z[i+n] - Z[i] for each value of n, computing their standard deviation dev[n] (assuming zero mean), and plotting dev[n] as a function of n. See below:

(1.3) [IMAGE]

The empirical deviation dev[n] is the red line with dots, and the mathematical model C*sqrt(n) is the solid green line.

Since ~95% of the probability in a Gaussian variable is within 2 deviations of its mean, we can expect that Z[i+n] will be within the interval

(1.4) Z[i] ± 2*C*sqrt(n)

with 95% probability. Thus, at any time in the future, the Bitstamp 1-hour weighted mean price should be within the two blue lines in the following graph, with 95% probability:

(1.5) [IMAGE]

(Note that this is not the same thing as saying that the [i]entire graph[/i] of Z[i+n], for all n > 1, will stay within that region with 95% probability!) As figure (1.3) shows, these algebraic bounds (smooth blue lines) fit quite well the 5%-95% percentiles of the samples S[n] (the stepped blue lines).

Moreover, at any future time, there is 50% probability that the price will be above (or below) the horizontal red line in figure (1.5), defined by the equation

(1.6) Z[i+n] = Z[i]

This model of course is not very helpful for traders, since it does not give any hint about whether the price will go up or down in the future, near or far.
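Incidentally, the check behind figure (1.3) is easy to reproduce. Below is a minimal Python sketch, not my actual analysis script; the file name "bitstamp_hourly.txt" and its format (one weighted hourly mean price per line) are just placeholder assumptions.

[code]
# Minimal sketch of the check behind figure (1.3).  Assumes a hypothetical
# file "bitstamp_hourly.txt" with one weighted mean BTC/USD price per hour.
import numpy as np

P = np.loadtxt("bitstamp_hourly.txt")   # P[i] = hourly weighted mean price
Z = np.log10(P)                         # Z[i] = log10 of P[i]

# Estimate C from the one-step increments, assuming zero mean as in (1.1):
dZ = np.diff(Z)
C = np.sqrt(np.mean(dZ**2))
print("C =", C)                         # should be on the order of 0.01

# For each span n, collect the set S[n] of increments Z[i+n] - Z[i] and
# compute its deviation dev[n] about zero; model (1.2) predicts C*sqrt(n).
for n in (1, 2, 4, 8, 16, 32, 64):
    S = Z[n:] - Z[:-n]
    dev = np.sqrt(np.mean(S**2))
    print(n, dev, C*np.sqrt(n))
[/code]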
However, I don't think one can get significantly better predictions from the price data alone, without external information (such as regulations, arrests, press coverage, etc.).

[b]Historical trend[/b]

But, you may say, what about the historic trend? Shouldn't the formula be

(2.1) Z[i+1] = Z[i] + T + C'*RND[i]

where T is a "trend" constant; so that, for any n > 0, we will have

(2.2) Z[i+n] = Z[i] + T*n + C'*sqrt(n)*RND[i,n]

and the 2-sigma confidence lines will be

(2.3) Z[i] + T*n ± 2*C'*sqrt(n)

The "trendy" model (2.1--2.2) is equivalent to assuming that the single-step increments Z[i+1] - Z[i] have a non-zero average, namely T; and defining C' as the standard deviation of the increments from that mean, rather than from zero.

I have experimented with a trendy model as well. While it yields seemingly tighter predictions (C' slightly smaller than C), I think that the no-trend model (1.1--1.2) is better, for several reasons:

* The T parameter depends strongly on what part of the data one uses to estimate C' and T. If one starts from 2013-09-01 (or earlier) and ends at 2014-01-17, one gets an increasing trend (positive T). But if one starts at 2013-11-29, or 2014-01-06, the trend will be strongly decreasing (T < 0). And, if one starts looking at 2013-11-22, the trend will be flat (T ~ 0). Therefore, the T parameter cannot be reliably determined, as the sketch after this list illustrates. (In contrast, the value of C (or C') seems to be fairly independent of the period of analysis.)

* Any finite segment of a purely Brownian series, as generated by the trendless model (1.1)--(1.2), will appear to have some general trend, since the sum of its n random increments is unlikely to be zero. Indeed, in a price evolution chart our eyes usually see many sections with increasing or decreasing trends, at all time scales. So the apparent presence of an overall trend in Bitcoin prices over certain time spans is not a sufficient argument to include that trend in the model.

* There seems to be no logical justification for a trend term. Bitcoin owners and fans are understandably fond of plotting the price evolution since the birth of the universe, and pointing out how much the thing has grown. But the traders who will decide its future prices do not care whether it was worth $1 or $1000 a year ago, given that it is worth $900 today, was $950 yesterday, and $850 last week. Most traders know that the remote past does not matter; and they know that most traders know that it does not matter; and they know that most traders know that most traders know that it does not matter; and so on. Which is precisely why most traders know that the remote past does not matter. So, there should be no term that takes into account the remote past.

* In any case, for predictions over relatively short time spans (in the Bitstamp data above, for n ~ 48 or less), the trend term T*n in (2.2) is fairly small compared to the deviation of the random term C*sqrt(n). Therefore, for that order of n, it can be included in the random term with little loss of precision.

* On the other hand, for larger values of n, a nonzero term T*n would eventually overpower the random term C'*sqrt(n)*RND[i,n]. If T is positive, for example, that term would eventually cause the blue curve Z[i] + T*n - 2*C'*sqrt(n) to start rising after going down for a while, and eventually rise above the present value Z[i]. See figure (2.4):

(2.4) [IMAGE]

In other words, a trendy model with positive T would say that the value of Z[i+n] is equally likely to go up or down in the short term, but after so many days it is 90% certain that it will be higher than now, and will continue rising and rising [i]forever[/i], with practically zero chance of going down. While that prediction will please the most "bullish" traders, it seems rather implausible, since the positive T value that yields it is entirely based on ancient data which, as discussed above, should have no influence on the future behavior of the market.

* If a nonzero trend term T*n did exist in reality, it would manifest itself in figure (1.3) as an increasing discrepancy between the model deviation C*sqrt(n) (green solid line) and the sample deviations dev[n] of the increment sets S[n] (red line with dots). That is because dev[n] was computed with an assumed zero mean: the omitted term T*n would cause dev[n] to eventually grow proportionally to n, rather than to sqrt(n). In fact, the match between dev[n] and C*sqrt(n) gets better as n increases, implying that any trend term T*n must be much smaller than C*sqrt(n) for the values of n shown in the plot.
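To illustrate the first point above, here is a sketch (same caveats and placeholder file as before) that re-fits T and C' of model (2.1) while varying the start of the sample; the start offsets in hours are illustrative, not the actual dates analyzed.

[code]
# Sketch of the window-dependence of the trend parameter T in model (2.1).
# T is the mean of the one-step increments, C' their deviation about T.
import numpy as np

Z = np.log10(np.loadtxt("bitstamp_hourly.txt"))  # placeholder input file

def fit_trendy(Z):
    dZ = np.diff(Z)
    return dZ.mean(), dZ.std()   # (T, C')

# Offsets standing in for start dates like 2013-09-01, 2013-11-29, or
# 2014-01-06: the fitted T changes sign with the window, C' hardly moves.
for start in (0, 1000, 2000, 3000):
    T, Cp = fit_trendy(Z[start:])
    print(start, T, Cp)
[/code]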
One argument in favor of the trendy model is that the term T*n could be due to an external cause (like expanding adoption by merchants), not just to speculative trading. In that case, analysis of a longer dataset would yield a more accurate value for the trend parameter T (in this case, positive) than the analysis of a shorter sample. However, the T one obtains from analysis of the data since 2013-11-30 is not only negative, but has a much larger magnitude than the positive T one obtains by starting at 2013-09-01 or earlier. See figure (2.5) below.

(2.5) [IMAGE estimated trend parameter T as a function of sample start date]

[b]Short term trends[/b]

More surprising than the lack of a long-term linear trend component was the apparent absence of any short-term trend, that is, the absence of correlation between successive increments. Visual examination of the charts suggests to many that the market has some inertia, so that if the price increased during the last time step, it is more likely to increase in the next step too; and similarly for decreases. However, statistical analysis shows very little or no correlation between successive increments DZ[i] = Z[i] - Z[i-1] and DZ[i+1] = Z[i+1] - Z[i]. If one subtracts from DZ[i+1] the part that can be ascribed to the influence of DZ[i], the residual DZ'[i+1] still has practically the same deviation as the raw increments DZ[i+1].

The apparent "correlations" and "short-term trends" seen by the eye may have a psychological explanation. Two or more successive increments with the same sign, say up-up or down-down, will usually create a conspicuous step in the plot, which attracts the viewer's attention; whereas mixed-sign sequences like up-down or down-up will often generate a "noise blip" on the plot but no significant step, and therefore will tend to be overlooked.

In fact, in some datasets (especially from minute-by-minute files) there seems to be a weak [i]negative[/i] correlation between successive increments: an increase is a bit more likely to be followed by a decrease than by another increase. This may be the result of certain high-frequency, low-volume robots that have been found to operate in some exchanges. It may be due also to
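For what it is worth, the lag-1 test described above is also only a few lines of code. This sketch (same placeholder file as before) computes the correlation between successive increments and the deviation of the residual DZ'[i+1] after regressing out DZ[i].

[code]
# Sketch of the lag-1 test: correlation between successive increments, and
# the residual deviation after subtracting the part ascribable to DZ[i].
import numpy as np

Z = np.log10(np.loadtxt("bitstamp_hourly.txt"))  # placeholder input file
dZ = np.diff(Z)
x, y = dZ[:-1], dZ[1:]                 # DZ[i] and DZ[i+1]

print("lag-1 correlation:", np.corrcoef(x, y)[0, 1])   # ~0 in hourly data

# Linear regression of DZ[i+1] on DZ[i]; the residual DZ'[i+1] should keep
# practically the same deviation as the raw increments DZ[i+1].
b = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
resid = y - (y.mean() + b*(x - x.mean()))
print("raw deviation:     ", y.std())
print("residual deviation:", resid.std())
[/code]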