Quantitative Trading: Cointegration is not the same as correlation

Monday, November 06, 2006

Cointegration is not the same as correlation

A reader asked me recently why I believe that energy stock prices (e.g. XLE) are correlated with the front-month crude oil futures contract (QM). Actually, I don’t believe they are necessarily correlated – I only think they are “cointegrated”.

What is the difference between correlation and cointegration? If XLE and QM were really correlated, then when XLE goes up one day, QM would likely go up on the same day too, and vice versa. Their daily (or weekly, or monthly) returns would rise or fall in synchrony. But that’s not what my analysis was about. I claim that XLE and QM are cointegrated, meaning that the two price series cannot wander off in opposite directions for very long without eventually coming back to a mean distance. That does not mean the two prices have to move in synchrony on a daily basis at all.
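
To make the distinction concrete, here is a minimal Python sketch on simulated data. Everything in it is an assumption made for illustration (the random seed, the volatilities, the AR(1) coefficient of 0.8, the $1 mean spread); it is not the XLE/QM analysis. It builds one series whose daily changes track A closely and another whose changes barely track A, yet whose price never drifts far from A's.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 2000

# Stock A: a random walk in price.
a_changes = rng.normal(0.0, 0.1, n_days)
price_a = 50.0 + np.cumsum(a_changes)

# Stock B: daily changes follow A's almost exactly (correlated with A).
b_changes = a_changes + rng.normal(0.0, 0.02, n_days)
price_b = 50.0 + np.cumsum(b_changes)

# Stock C: price = A + $1 + a stationary AR(1) deviation (cointegrated with A).
deviation = np.zeros(n_days)
shocks = rng.normal(0.0, 0.5, n_days)
for t in range(1, n_days):
    deviation[t] = 0.8 * deviation[t - 1] + shocks[t]
price_c = price_a + 1.0 + deviation


def change_correlation(p, q):
    """Pearson correlation of daily price changes."""
    return np.corrcoef(np.diff(p), np.diff(q))[0, 1]


corr_ab = change_correlation(price_a, price_b)
corr_ac = change_correlation(price_a, price_c)
spread_ca = price_c - price_a

print(f"corr of daily changes, A vs B: {corr_ab:.2f}")
print(f"corr of daily changes, A vs C: {corr_ac:.2f}")
print(f"C - A spread: mean {spread_ca.mean():.2f}, std {spread_ca.std():.2f}")
```

The sketch uses price changes rather than percentage returns to keep it short; with prices near $50 the two behave similarly.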

Two hypothetical graphs illustrate the difference. In the first graph, stock A and stock B are correlated. You can see that their prices move in the same direction almost every day.

Now consider stock A and stock C.

Stock C clearly doesn’t move in any correlated fashion with stock A: some days they move in the same direction, other days in opposite directions. Most days stock C doesn’t move at all! But notice that the spread in stock prices between C and A always returns to about $1 after a while. This is a manifestation of cointegration between A and C. In this instance, a profitable trade would be to buy A and short C at around day 10, then exit both positions at around day 19. Another profitable trade would be to buy C and short A at around day 31, then close out the positions at around day 40.
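
The trade described above can be sketched as a simple rule on the C minus A spread. The spread values, the entry threshold of $0.50 and the exit band of $0.15 below are all hypothetical numbers chosen for illustration; they are not read off the graphs.

```python
def spread_positions(spread, mean=1.0, entry=0.5, exit_band=0.15):
    """Position in the spread for each day.

    -1 means short C / long A (spread too wide),
    +1 means long C / short A (spread too narrow),
     0 means flat.
    """
    positions = []
    current = 0
    for value in spread:
        deviation = value - mean
        if current == 0:
            if deviation >= entry:
                current = -1
            elif deviation <= -entry:
                current = 1
        elif abs(deviation) <= exit_band:
            # The spread is back near its mean: close both legs.
            current = 0
        positions.append(current)
    return positions


# Hypothetical C - A spread, in dollars.
hypothetical_spread = [1.0, 1.2, 1.6, 1.4, 1.1, 1.0, 0.8, 0.4, 0.7, 1.0]
positions = spread_positions(hypothetical_spread)
print(positions)
```

The rule only enters when flat and only exits near the mean, which mirrors the "enter on a wide spread, exit when it comes back" logic of the two example trades.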

Cointegration is the foundation upon which pair trading (“statistical arbitrage”) is built. If two stocks simply move in a correlated manner, there may never be any widening of the spread. Without a temporary widening of the spread in either direction, there is no opportunity to short (or buy) the spread, and no reason to expect the spread to revert to the mean either.

For further reading:

Alexander, Carol (2001). Market Models: A Guide to Financial Data Analysis. John Wiley & Sons.

51 comments:

NA said...: Interesting post. I've found this cointegration to be between OIH and USO/XLE.

I usually like to short OIH due to its higher intra-day volatility than XLE and go long XLE as a hedge, or vice versa.

It doesn't always cointegrate like you mentioned, but every now and then there is an opportunity, e.g. on short-covering days; Thursday, November 9, 2006 at 8:23:00 AM EST
Ernie Chan said...: Yes, OIH is certainly an alternative to XLE. OIH is the most liquid oil services ETF. Frankly, I don't remember the reason anymore why I chose XLE instead of OIH to do the analysis. They both cointegrate with USO equally well.; Thursday, November 9, 2006 at 8:42:00 AM EST
Anonymous said...: Dear Mr. Chan,

Wonderful blog you have over here. I looked for something like this for a long time.

You might also try DVN as an alternative to XLE.

Cheers,
Max; Saturday, February 17, 2007 at 11:10:00 PM EST
Ernie Chan said...: Max,
Glad you like my articles, and thanks for your suggestion.
Hope to exchange more ideas with you in the future!
Ernie; Sunday, February 18, 2007 at 12:27:00 AM EST
Paul Teetor said...: Ernie and Yaser: For my trading, I've found XLE has a strong advantage over the alternatives: there is a single-stock future (SSF) available for XLE, and using the SSF drops the margin requirement from 50% to 20%. This extra leverage is very useful in spread trading. It is difficult to capture spread profits without that leverage, due to the small size of spread changes.; Thursday, August 23, 2007 at 1:39:00 AM EDT
Camilo Rostoker said...: Hi Ernie et al.,

Just wondering if there are any traders out there that use correlation or cointegration on an intra-day time scale to do day-trading. For example taking data samples as fast as 15 seconds, or maybe longer like every 10 minutes. Is there any useful information in time scales that small? I would think that it would depend highly on the volatility/liquidity of the underlyings so that enough margin could be made on the spread for such a strategy to be profitable. Just wondering if you have any experience or opinions on this.
Cheers,
Jack; Saturday, March 22, 2008 at 6:04:00 PM EDT
Ernie Chan said...: Hi Jack,
Theoretically, cointegration is time-scale independent. So we cannot say that a pair of stocks is cointegrated on a time scale of years but not on a time scale of minutes. However, it is meaningful to ask what the average mean-reversion time is. I have described elsewhere on this blog (see the Ornstein-Uhlenbeck formula) a good way to estimate this, and it will help you determine whether the pair of stocks is suitable for trading at the time scale of interest.
Ernie; Tuesday, March 25, 2008 at 5:43:00 PM EDT
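
One common way to estimate the mean-reversion time mentioned in the reply above is to regress the change in the spread on its lagged level and convert the slope into a half-life, -ln(2)/slope. The sketch below assumes this discrete-time approximation of the Ornstein-Uhlenbeck process and runs it on a simulated AR(1) spread; the function name, the seed and the coefficient 0.9 are illustrative choices, and the blog's own formula may differ in detail.

```python
import numpy as np


def mean_reversion_half_life(spread):
    """Half-life from the regression diff(s)[t] = lam * s[t-1] + mu + noise."""
    spread = np.asarray(spread, dtype=float)
    lagged = spread[:-1]
    change = np.diff(spread)
    design = np.column_stack([lagged, np.ones_like(lagged)])
    lam = np.linalg.lstsq(design, change, rcond=None)[0][0]
    if lam >= 0:
        # No evidence of mean reversion in this sample.
        return float("inf")
    return -np.log(2.0) / lam


# Simulated spread: AR(1) with coefficient 0.9, so the true slope is -0.1.
rng = np.random.default_rng(1)
n = 5000
spread = np.zeros(n)
shocks = rng.normal(0.0, 1.0, n)
for t in range(1, n):
    spread[t] = 0.9 * spread[t - 1] + shocks[t]

half_life = mean_reversion_half_life(spread)
print(f"estimated half-life: {half_life:.1f} periods")
```

With a true slope of -0.1 the estimate should come out near ln(2)/0.1, about 7 periods. The unit is whatever the bar size of the input is, which is what makes the number useful for choosing a trading time scale.
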
Unknown said...: Hi Ernie,

I have been trading pairs in the Indian stock markets. I find your blog very informative and educational. I really appreciate your efforts to share in-depth knowledge on the subject.

Can you explain the cointegration method via a spreadsheet and, if possible, share the spreadsheet? I would appreciate it if you could explain it in a non-quantitative style. I want to learn to interpret the output of the cointegration test, i.e. whether a pair is mean-reverting or not for a given time frame.

Thanks
Bhumir; Sunday, May 31, 2009 at 10:34:00 AM EDT
Ernie Chan said...: Hi Bhumir,
Thank you for your interest in my blog. Unfortunately, a cointegration test cannot easily be performed in Excel. I performed mine using Matlab. If you purchase my book, you will find sample code showing how to compute this.
Ernie; Monday, June 1, 2009 at 3:30:00 PM EDT
Unknown said...: Hi Ernie, I have been reading your book. I must say it's very informative and it has helped me tremendously.

One question though about LeSage's cadf function when testing for co-integration. I notice that if you reverse the order of the y and x parameters (in cadf(y,x,p,nlag)), the resulting t-statistic can be very different for the same two sets of data.

Using your Matlab sample code 7_2.m as an example, if y is GLD and x is GDX, I get a t-statistic of -3.52. If y is GDX and x is GLD, I get -4.11. So what do I make out of this? Which result should I rely on to see if there's co-integration between the two sets of data? Or should I use both results (or the average of the two) as a guideline?

Thanks
Sam; Friday, July 31, 2009 at 12:07:00 AM EDT
Ernie Chan said...: Hi Sam,
Yes, indeed the results are different depending on which series you pick as the independent variable.
My rule of thumb is to be conservative: regard a pair as cointegrating only if both t-stats meet the criterion.
Ernie; Friday, July 31, 2009 at 11:21:00 AM EDT
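
The order dependence discussed above is easy to reproduce. The sketch below is a stripped-down, numpy-only stand-in for a cointegrating ADF test: an OLS regression followed by a Dickey-Fuller t-statistic on the residuals, with no lagged differences. It is not LeSage's cadf function, the data are simulated rather than GLD/GDX, and the statistics must be compared against cointegration critical values from a proper test package, not ordinary t-tables.

```python
import numpy as np


def ols_residuals(dependent, independent):
    """Residuals from regressing `dependent` on `independent` plus a constant."""
    design = np.column_stack([independent, np.ones_like(independent)])
    coefficients = np.linalg.lstsq(design, dependent, rcond=None)[0]
    return dependent - design @ coefficients


def dickey_fuller_t_stat(residuals):
    """t-statistic of gamma in diff(e)[t] = gamma * e[t-1] + noise (no lags)."""
    lagged = residuals[:-1]
    change = np.diff(residuals)
    gamma = (lagged @ change) / (lagged @ lagged)
    errors = change - gamma * lagged
    variance = (errors @ errors) / (len(change) - 1)
    return gamma / np.sqrt(variance / (lagged @ lagged))


# Simulated cointegrated pair: y = 10 + 1.5 * x + stationary AR(1) noise.
rng = np.random.default_rng(7)
n = 1000
x = 40.0 + np.cumsum(rng.normal(0.0, 1.0, n))
noise = np.zeros(n)
shocks = rng.normal(0.0, 3.0, n)
for t in range(1, n):
    noise[t] = 0.8 * noise[t - 1] + shocks[t]
y = 10.0 + 1.5 * x + noise

t_y_on_x = dickey_fuller_t_stat(ols_residuals(y, x))
t_x_on_y = dickey_fuller_t_stat(ols_residuals(x, y))
print(f"y regressed on x: t = {t_y_on_x:.2f}")
print(f"x regressed on y: t = {t_x_on_y:.2f}")
```

The two regressions produce different residual series, so the two t-statistics differ. Under the conservative rule of thumb in the reply above, the pair would be accepted only if both statistics pass the chosen critical value.
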
Peter Magner said...: Hey Ernie, this is Peter from the University of Cape Town, South Africa. I am writing to ask whether correlation tells you anything meaningful when one is testing for integration. I am testing integration across African markets for my thesis and have used the Engle-Granger cointegration test, but I thought it might be nice to include a correlation matrix and I don't want to look stupid. Also, just to confirm, does it matter if I only use A as the independent variable and B as the dependent variable, and do not test both ways?

Thanks
Please respond asap if possible!; Wednesday, October 14, 2009 at 10:55:00 AM EDT
Ernie Chan said...: Peter,
Including a correlation matrix will not convince anybody that the African markets are cointegrated. However, it might serve as a useful comparison in technique.

Indeed cointegration tests are variable-order-dependent, esp. for borderline cases. Try both orders.
Ernie; Wednesday, October 14, 2009 at 8:51:00 PM EDT
Fuzhi Cheng said...: Hi, Ernie:
I have a rather simple question regarding index tracking using a cointegration-optimal portfolio (following an earlier paper by Dunis & Ho: Cointegration portfolio of European Equities for Index Tracking). Suppose I am able to find cointegration in the following manner: ln(index)=2*ln(p1)+3*ln(p2) where p1 and p2 are the prices of constituent stocks in the index. The paper suggests using the "normalized" parameters for weights (can you please explain what normalization means in that paper?). I assume it is 2/5=0.4 and 3/5=0.6 for weights. Suppose assets 1 and 2 each have a return of 5%; then the portfolio constructed with the 0.4, 0.6 weights would give a 5%*0.4+5%*0.6=5% return. However, by the original cointegration result, ln(index)=2*ln(p1)+3*ln(p2), and by first differencing it (so that both sides become returns), the index return should be 2*5%+3*5%=25%. Definitely the portfolio is not tracking the index. I am sure there is something not right here... Thanks for your help.

Fuzhi; Wednesday, August 3, 2011 at 2:31:00 PM EDT
Ernie Chan said...: Hi Fuzhi,
You have to apply the normalized weights before computing returns, otherwise the two sides won't match. It would be like comparing the P&L of $1 capital with the P&L of $1M capital if you don't normalize by capital.
Ernie; Wednesday, August 3, 2011 at 3:15:00 PM EDT
Fuzhi Cheng said...: Ernie:
Appreciate very much your reply. However, I am still a little confused. Could you please explain again how you would normalize the weights if this is the cointegration results you get: ln(index)=2*ln(p1)+3*ln(p2)
where "index" is the index price, "p1" and "p2" are the prices of constituent stocks in the index. Seems all are in percent return terms and have nothing to do with the amount of capital.

Thanks.

Fuzhi; Thursday, August 4, 2011 at 8:06:00 AM EDT
Ernie Chan said...: Fuzhi,
The 2 and 3 represent units of capital. So clearly we need to normalize them so that both sides have the same total capital, typically 1 unit.

In any case, I dislike using logs. I prefer raw prices so that the number of shares is fixed.
Ernie; Thursday, August 4, 2011 at 9:35:00 AM EDT
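
One reading of the reply above, worked through with the coefficients 2 and 3 from the question and a 5% return on each asset: the coefficients are treated as units of capital, so the raw sum is a P&L on 5 units of capital, and dividing by the total capital recovers a return per unit of capital. The variable names are illustrative.

```python
capital_units = [2.0, 3.0]     # cointegration coefficients, read as units of capital
asset_returns = [0.05, 0.05]   # each asset returns 5%

# P&L of the unnormalized portfolio, in units of capital.
pnl = sum(c * r for c, r in zip(capital_units, asset_returns))

# Normalize so that total capital is 1 unit.
total_capital = sum(capital_units)
weights = [c / total_capital for c in capital_units]

return_on_capital = pnl / total_capital
weighted_return = sum(w * r for w, r in zip(weights, asset_returns))

print(f"raw P&L on {total_capital:.0f} units of capital: {pnl:.2f} units")
print(f"weights: {weights}")
print(f"return per unit of capital: {return_on_capital:.2%}")
```

On this reading, the unnormalized figure is a P&L on five units of capital and the 5% is the return per unit of capital, so the two only appear to conflict when compared without normalizing.
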
Fuzhi Cheng said...: Ernie:

Thank you so much for your help.

Fuzhi; Thursday, August 4, 2011 at 11:03:00 AM EDT
Suny said...: Dear Ernie, say I manage to identify a good cointegrated pair. My question now is how to work out a hedge ratio. When I regress the price of A on B, compared to B on A, I end up with two different hedge ratios. Which hedge ratio should I choose? I will need to use the residual to determine a band for entry and exit, and depending on which hedge ratio I use, I end up with two different entry and exit bands.; Monday, October 10, 2011 at 6:40:00 AM EDT
Ernie Chan said...: Suny,
The eigenvector obtained from the Johansen test can be used to determine a unique linear combination (i.e. hedge ratio) of the 2 price series.
Ernie; Monday, October 10, 2011 at 8:59:00 AM EDT
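
For readers who want to see where such an eigenvector comes from, below is a bare-bones numpy sketch of the eigen-decomposition at the heart of the Johansen procedure. It assumes a VAR(1) with a constant (no lagged differences), computes no trace or maximum-eigenvalue statistics and no critical values, and is run on simulated data, so it is an illustration only; a statistics package should be used for a real test. The function name and the simulated hedge ratio of 2 are assumptions.

```python
import numpy as np


def johansen_eigen(prices):
    """Eigenvalues (descending) and eigenvectors (columns) for a (T, n) price array."""
    levels = np.asarray(prices, dtype=float)
    changes = np.diff(levels, axis=0)
    lagged = levels[:-1]
    # Remove the constant from both the changes and the lagged levels.
    r0 = changes - changes.mean(axis=0)
    r1 = lagged - lagged.mean(axis=0)
    n_obs = len(r0)
    s00 = r0.T @ r0 / n_obs
    s01 = r0.T @ r1 / n_obs
    s11 = r1.T @ r1 / n_obs
    # Solve S10 S00^-1 S01 v = lambda S11 v using the Cholesky factor of S11.
    target = s01.T @ np.linalg.solve(s00, s01)
    chol = np.linalg.cholesky(s11)
    half = np.linalg.solve(chol, target)
    symmetric = np.linalg.solve(chol, half.T)
    symmetric = (symmetric + symmetric.T) / 2.0
    eigenvalues, basis = np.linalg.eigh(symmetric)
    vectors = np.linalg.solve(chol.T, basis)
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], vectors[:, order]


# Simulated pair: y = 2 * x + stationary noise, with x a random walk.
rng = np.random.default_rng(0)
n = 2000
x = 50.0 + np.cumsum(rng.normal(0.0, 1.0, n))
y = 2.0 * x + rng.normal(0.0, 1.0, n)

eigenvalues, eigenvectors = johansen_eigen(np.column_stack([y, x]))
best = eigenvectors[:, 0] / eigenvectors[0, 0]   # scale so the weight on y is 1
hedge_ratio = -best[1]
print(f"eigenvalues: {eigenvalues}")
print(f"spread = y - {hedge_ratio:.3f} * x")
```

Because both series enter symmetrically, swapping the column order changes only the scaling of the eigenvector, not the combination it describes, which is why no choice of "independent" variable is needed.
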
Jeet said...: Hi Ernie,
As you said in one of your earlier replies here, S1 ~ S2 and
S2 ~ S1 should both pass the cointegration test.

1. Let us say S1 ~ S2 is cointegrated while the reverse order is not. Does it mean that such a pair is not cointegrated?

2. Let us say we have 5 stocks to trade. Which one should I use as the independent variable, with the other 4 as dependent variables, without trying so many combinations?; Thursday, December 8, 2011 at 4:18:00 AM EST
Ernie Chan said...: Hi Jeet,
1) This indicates the pair is borderline cointegrating. Trade at your own risk!
2) You should use the Johansen test: it will give you all good combinations of symbols with no unique "independent" variable.
Ernie; Thursday, December 8, 2011 at 7:59:00 AM EST
Jeet said...: What should be the logic behind choosing the independent stocks and the dependent stock from a basket of stocks?

The Johansen test might give a result, but I am looking for the logic.; Friday, December 9, 2011 at 4:32:00 AM EST
Ernie Chan said...: Jeet,
Logic can only be found if you have a fundamental economic understanding of the relationship between the assets. For example, if you believe that firms A and B are both big customers of firm C, you might argue that C's price should be a dependent variable.
However, I usually do not find it important to find out why a variable is independent: it makes no difference to the trading model.
Ernie; Friday, December 9, 2011 at 8:12:00 AM EST
Anonymous said...: Dear Mr.Chan,

I used your file to test ex7_2.m for GLD and GDX.
The t-statistic is -9.72, not -3.36.
What's wrong with it??

By the way, your book is very good.
Thanks.; Tuesday, February 7, 2012 at 10:57:00 AM EST
Ernie Chan said...: Anon,
Did you use my data file for the test? Did you set all the parameters for the cadf test to be the same as mine?
Ernie; Tuesday, February 7, 2012 at 12:43:00 PM EST
99 said...: Dear Mr.Chan,

I only used copy and paste (ex7-2, the jplv7 *.m files, and GLD/ GLD).

Parameters?
Do I need to change parameters as well as use those m-files?

Based on the result, only the t-statistic is wrong.
The others are similar.

I tried GLD and GLD, the same data twice, and the t-statistic is also -9. @@||; Monday, February 13, 2012 at 9:37:00 AM EST
Ernie Chan said...: 99,
There is no need to change the input parameters to the cadf function for testing cointegration.

Are you using the same input data as I used? Have you made sure the dates of those price series are ascending (most recent data on last row)?

Ernie; Monday, February 13, 2012 at 11:08:00 AM EST
99 said...: Dear Chan,

Hmm.... I used the GLD/ GDX files from your server (2006/05/23~2007/11/30 data).
I ran those files through "adftest.xls", and the "Dickey Fuller Test Statistic" is right.
Is my Matlab wrong? @.@||; Tuesday, February 14, 2012 at 7:53:00 AM EST
99 said...: Dear Chan,

I sent an email to your Gmail with all the m-files and a screenshot of my Matlab session.
If you have time, could you take a look at it?
Thanks, and sorry to disturb you.; Tuesday, February 14, 2012 at 7:58:00 AM EST
Ernie Chan said...: 99,
I did not receive your email (I checked the spam folder too). Could you please resend it?
Ernie; Tuesday, February 14, 2012 at 10:47:00 AM EST
FasTechs.com, Inc. said...: I am wondering what everyone feels is the most reliable cointegration test in matlab? I have tried egci adf and jci and get widely different results. Then to make matters worse I check with catalystcorner and many pairs show significantly different results there.; Wednesday, March 14, 2012 at 10:15:00 AM EDT
Ernie Chan said...: cbucks,
Have you tried the Johansen test?
Ernie; Wednesday, March 14, 2012 at 11:15:00 AM EDT
cf16 said...: Cointegration is not the same as correlation.
Certainly. It can be proven that the Pearson correlation coefficient will be close to 1 only if the variance of each asset is small relative to the variance of the random walk process that generates the data.; Tuesday, February 19, 2013 at 9:22:00 AM EST
cheerful said...: ADF, Variance Ratio, CADF test

Dear Dr Chan,

I have a set of pair data using 5min and daily data@9am.

5min data: ADF = -2.63 vs 10%critical value=2.59, H2 = 0.35, h=1, CADF= -2.6 vs 10% critical value = -3.

Daily data: ADF = -1.6 vs 10%critical value=2.59, H2 = 0.4, h=0 p=0.8, CADF= -1.9 vs 10% critical value = -3

May I know if I can trade this pair? How to overcome different time frame where one shows a trend and the other shows a weak mean-reversal as above.

Thanks
Leo; Friday, May 30, 2014 at 1:35:00 PM EDT
Ernie Chan said...: Hi Leo,
It looks to me that for both the 5-minute and the daily data, we can't reject the CADF null hypothesis. But that doesn't mean you can't create a profitable mean-reverting strategy: you just have to backtest it with various parameters.

It is common for an instrument to mean-revert in one timeframe while trending in another. You just have to adapt your strategy to the respective timeframes accordingly.

Ernie; Friday, May 30, 2014 at 1:42:00 PM EDT
cheerful said...: Dear Dr Earnest,

1) It looks to me that for both 5 min or daily data, we can't reject the CADF null hypothesis. But that doesn't mean you can't create a profitable mean-reverting strategy: you just have to backtest it with various parameters.
>> Do you mean adjusting the moving average and the number of standard deviations for the Bollinger bands? If the mean reversion is very weak, as the CADF indicates, we can expect a very poor result regardless of how well we optimize the Bollinger bands. May I know how we can overcome this?

2) In example 5.1,
you use audusd data but I could only find the inputData_AUDCAD_20120426 from your "box" link. Could there be a mistake because the results I get after I replace audusd with audcad looked very close.

3) You use the Johansen weights for the hedge ratio. If we normalize them, i.e. weight1 / weight2, we see huge spikes, e.g. 20 times. If we use the second set of Johansen weights, the normalized weights also have spikes. The weights vary drastically from day to day. Looks unstable to me?

Thanks
Leo; Monday, June 23, 2014 at 3:13:00 PM EDT
Ernie Chan said...: Hi Leo,
1) Yes, you can optimize the parameters of the Bollinger band. If the best results are still too weak, you shouldn't be trading these assets using mean reversion models.
2) My box.net has data on all 3 files: AUDCAD, AUDUSD, and USDCAD.
3) I think I mentioned to you (or some other reader?) before that the Johansen test is not guaranteed to pick the same eigenvector as the "best" one every day, as the order of the eigenvalues does change. So if you want to enforce continuity, you will have to make sure that the same eigenvector is used until its eigenvalue is much "worse" than the best one.

Ernie; Monday, June 23, 2014 at 4:08:00 PM EDT
cheerful said...: Dear Dr Ernest,

1) & 2) you are right. This shows highly correlated asset may not be highly cointegrated for mean-reversion.

3) "I think I mentioned to you (or some other reader?)": Yes. I have tried using a moving average of the eigenvector, switching to the other eigenvector set when one is too high, and taking the average of the two eigenvector sets, but all failed. The change is still drastic. Then I tried comparing the new value that is added and the oldest value that is dropped with the average for the period (as you taught me). I don't see much change graphically, but the resulting eigenvector set still changes a lot. What can we do about it? On certain days, for both eigenvector sets, I suddenly have to go long, e.g., 20 times more than usual on an asset.

Thank you
Leo; Monday, June 23, 2014 at 4:37:00 PM EDT
Ernie Chan said...: Leo,
Are you saying that even if you stick with one continuously changing eigenvector, the hedge ratio still changes discontinuously? That seems unlikely.
Ernie; Monday, June 23, 2014 at 5:05:00 PM EDT
Ernie Chan said...: Hi notcher,
As I wrote before, my web host considers that file too large to upload. If you email me, I can send you a link to the box.net folder.
Ernie; Friday, September 26, 2014 at 11:41:00 AM EDT
cheerful said...: Dear Dr Chan,

May I know if you are using signal properties ie ADF and cointegration to check mean-reverting property in the lookback period and if yes, proceed to use mean-reverting strategy for the next period. Hoping that the next period will continue to be mean-reverting until ADF and cointegration in the lookback period indicates non-reverting?

Thank you
Leo; Friday, October 3, 2014 at 12:23:00 PM EDT
Ernie Chan said...: Hi Leo,
Actually, we use the ADF test only as a screen for suitable pairs. Once a pair is deemed suitable for pair trading, we just run a Bollinger-band-type strategy on it.
Ernie; Friday, October 3, 2014 at 1:48:00 PM EDT
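
A minimal sketch of what a Bollinger-band-type strategy on a spread can look like, assuming a trailing-window z-score with entry at one standard deviation and exit when the z-score crosses zero. The lookback of 20, the thresholds and the simulated AR(1) spread are illustrative assumptions, not the parameters used in the reply above, and costs are ignored.

```python
import numpy as np


def bollinger_positions(spread, lookback=20, entry_z=1.0, exit_z=0.0):
    """Spread position per bar: +1 long, -1 short, 0 flat."""
    spread = np.asarray(spread, dtype=float)
    positions = np.zeros(len(spread))
    current = 0.0
    for t in range(lookback, len(spread)):
        window = spread[t - lookback:t]   # trailing window, excludes the current bar
        std = window.std(ddof=1)
        if std > 0:
            z = (spread[t] - window.mean()) / std
            if current == 0:
                if z > entry_z:
                    current = -1.0        # spread rich: short it
                elif z < -entry_z:
                    current = 1.0         # spread cheap: buy it
            elif current == 1.0 and z >= -exit_z:
                current = 0.0
            elif current == -1.0 and z <= exit_z:
                current = 0.0
        positions[t] = current
    return positions


# Simulated mean-reverting spread (AR(1) with coefficient 0.9).
rng = np.random.default_rng(3)
n = 500
spread = np.zeros(n)
shocks = rng.normal(0.0, 1.0, n)
for t in range(1, n):
    spread[t] = 0.9 * spread[t - 1] + shocks[t]

lookback = 20
positions = bollinger_positions(spread, lookback=lookback)
# The position held at the close of bar t earns the next bar's spread change.
daily_pnl = positions[:-1] * np.diff(spread)
print(f"total P&L in spread units: {daily_pnl.sum():.2f}")
```

The cointegration test plays no role inside the loop; as the reply says, it is only a screen applied before a pair is handed to a rule like this one.
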
cheerful said...: Dear Earnest Chan,

Do you use Matlab in production? I know prototyping in Matlab gives a fast development speed, but Matlab is very slow compared to Python, C++ or Java in terms of processing speed.

Thank you
Leo; Thursday, April 14, 2016 at 4:33:00 PM EDT
Ernie Chan said...: Hi Leo,
Yes, I use Matlab for trading some low frequency strategies.

I disagree that Matlab is slower than Python. Please see this academic study: Aruoba, S. Borağan and Fernández-Villaverde, Jesús. 2014. A Comparison of Programming Languages in Economics. NBER Working Paper No. 20263. Available at economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf

Ernie; Thursday, April 14, 2016 at 6:59:00 PM EDT
cheerful said...: Hi Dr Ernie

Would you run on Unix or Windows for production? I am thinking of using Cygwin to get a Unix-like environment on my PC.

That paper is very good.

Thank you
Leo; Friday, April 15, 2016 at 6:55:00 PM EDT
Ernie Chan said...: Hi Leo,

Since I am not a high frequency trader, it doesn't matter to me whether we run it in Linux or Windows. We run it in Windows Server.

Ernie; Friday, April 15, 2016 at 8:23:00 PM EDT
Anonymous said...: Hi,
pg. 111 of "Algorithmic Trading:winning strategies..." refers to 2 files:
1)inputData_USDCAD_20120426
2)inputData_AUDUSD_20120426

But when I go to "http://epchan.com/book2/" I don't see either of them.
Where can I find them?; Sunday, January 28, 2018 at 9:40:00 AM EST
Ernie Chan said...: Hi,
Please email me to get those files.
Ernie; Sunday, January 28, 2018 at 10:17:00 AM EST
Anonymous said...: Hi Ernie,
my email: polargnome@yahoo.com

Thanks; Sunday, January 28, 2018 at 10:59:00 AM EST