Sunday, 18 March 2012

Getting Your Data: The Prerequisites

Scraping Data

(A word of warning before we start: the legality of data 'scraping' is questionable. Most providers do not like you doing this. As I am an 'amateur' I feel I have a little leeway here, but you have been warned!)

As a budding algo trader, your initial problem is getting data to test. There are sources of commercially available end-of-day data; however, these tend to be expensive and seldom in a form I like. I originally used to 'scrape' data from Yahoo, but I found the data was erratic and often 'changed' after the event. The Yahoo data also appeared to be patchy, and I was never truly sure it was any good. Eventually I moved over to using Google's own end-of-day data. I have found this to be very good and consistent, with a suitable length of history.

There is a difficulty with Google data that does not arise with Yahoo's: Google does not make it easy for us to automate downloading of stock prices, as each URL is addressed using Google's own code for the stock. For example, Vodafone's EPIC (exchange symbol) is 'VOD' whilst Google uses the code '834331'.

The code snippet below will fetch the Google id for each EPIC you supply. In this snippet, the EPICs are fetched into an IList<string> using the function EpicList.GetEpicListFromYahoo().

This could just as easily be a LINQ to SQL call against a database. You can quite easily get the list of EPICs from Yahoo to populate your table beforehand.

The FetchGoogleCIDs method eventually saves the resultant data to an XML file named 'GoogleEpicCodes.xml' - this was just done for ease of use.

 // Needs: using System; using System.Collections.Generic; using System.Data; using System.IO; using System.Net;
 public static void FetchGoogleCIDs()
 {
       IList<string> epiclist = EpicList.GetEpicListFromYahoo();
       var dt = new DataTable("Epics");
       dt.Columns.Add("EPIC", typeof(string));
       dt.Columns.Add("GoogleId", typeof(string));
       using (var wc = new WebClient()) // one client for the whole run, disposed when the block exits
       {
         foreach (string s in epiclist)
         {
           Console.WriteLine("Starting on " + s);
           string thisEpic = s.Replace(".L", ""); // strip Yahoo's '.L' suffix
           string urlTemplate =
                   "" + thisEpic;
           string history;
           try
           {
             history = wc.DownloadString(urlTemplate);
           }
           catch (WebException ex)
           {
             // rethrow with the EPIC in the message, keeping the original error as the inner exception
             throw new WebException("Problem fetching GoogleId for " + thisEpic, ex);
           }
           string cid = RegexCID(history);
           if (!string.IsNullOrEmpty(cid))
           {
             DataRow newRow = dt.NewRow();
             newRow["EPIC"] = thisEpic;
             newRow["GoogleId"] = cid;
             dt.Rows.Add(newRow); // a row from NewRow() is not in the table until it is added
           }
         }
       }
       string path = Path.Combine(Environment.CurrentDirectory, "GoogleEpicCodes.xml");
       dt.WriteXml(path, false);
 }

One method not shown above is the one called in this line:

 string cid = RegexCID(history);  

It uses a regex match to extract the relevant Google id for the stock, and is listed below:

 // Needs: using System.Text.RegularExpressions;
 private static string RegexCID(string html)
 {
       // The pattern must define a named group called "cid"; its value is read back below.
       string regex = @"<<cid>.*).>";
       RegexOptions options = RegexOptions.IgnorePatternWhitespace
                   | RegexOptions.Multiline
                   | RegexOptions.IgnoreCase;
       var reg = new Regex(regex, options);
       MatchCollection matches = reg.Matches(html);
       if (matches.Count == 0)
       {
         return string.Empty; // no id found on the page
       }
       GroupCollection groups = matches[0].Groups;
       string match = groups["cid"].Value;
       return match;
 }
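
To make the named-group mechanics in RegexCID concrete, here is a minimal sketch. The pattern, the class name CidRegexDemo and the sample text are all invented for illustration; they are not the real expression or the real Google page markup, only an assumption that the id appears as digits after 'cid='.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CidRegexDemo
{
    // Hypothetical pattern: capture the digits after "cid=" into a group named "cid".
    private static readonly Regex CidPattern =
        new Regex(@"cid=(?<cid>\d+)", RegexOptions.IgnoreCase);

    public static string ExtractCid(string html)
    {
        Match match = CidPattern.Match(html);
        // Groups["cid"] looks the capture up by name rather than by position.
        return match.Success ? match.Groups["cid"].Value : string.Empty;
    }

    public static void Main()
    {
        // Invented sample text, not real Google markup.
        string sample = "<a href=\"/finance/historical?cid=834331\">Historical prices</a>";
        Console.WriteLine(ExtractCid(sample));               // 834331
        Console.WriteLine(ExtractCid("no id here") == "");   // True
    }
}
```

Returning an empty string on no match mirrors RegexCID, so the caller's string.IsNullOrEmpty check works unchanged.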

By iterating through your list of EPICs and applying the above methods, you will quickly derive the list of associated Google ids required for the main price scraping exercise.
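
Once 'GoogleEpicCodes.xml' exists, the ids can be read back into a lookup keyed by EPIC. The sketch below is one way to do that, not part of the original code. It assumes each row is written as an <Epics> element with <EPIC> and <GoogleId> children, which is the shape DataTable.WriteXml produces for a table named "Epics". The class name EpicCodeLookup is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class EpicCodeLookup
{
    // Builds an EPIC -> Google id dictionary from the XML text.
    // Assumes every <Epics> row has both an <EPIC> and a <GoogleId> child.
    public static Dictionary<string, string> Parse(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("Epics")
            .ToDictionary(
                row => (string)row.Element("EPIC"),
                row => (string)row.Element("GoogleId"),
                StringComparer.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // Inline sample in the assumed shape of the saved file.
        string xml =
            "<DocumentElement>" +
            "<Epics><EPIC>VOD</EPIC><GoogleId>834331</GoogleId></Epics>" +
            "</DocumentElement>";
        Dictionary<string, string> lookup = Parse(xml);
        Console.WriteLine(lookup["VOD"]); // 834331
    }
}
```

For the real file, pass the result of File.ReadAllText on the saved path to Parse. A duplicate EPIC in the file would make ToDictionary throw, which is probably what you want to know about anyway.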

Be warned that Yahoo's list of EPICs is horribly mangled, as Yahoo insists on appending '.L' to each one - in the process stocks such as 'BT.A' (British Telecom) become 'BT-A.L'. This can be a source of major problems later on. I actually get my list of EPICs from a source that, luckily, does not mess them around.
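
If you do start from Yahoo's list, the mangling can be reversed before the symbols go anywhere near your table. This is only a sketch under one assumption of mine: that every hyphen in a Yahoo symbol stands for a dot in the real EPIC, as in the 'BT-A.L' example. The class and method names are hypothetical.

```csharp
using System;

public static class YahooSymbols
{
    // Converts a Yahoo-style London symbol back to a plain EPIC.
    // Assumption: a '-' in the Yahoo symbol always replaces a '.' in the EPIC.
    public static string ToEpic(string yahooSymbol)
    {
        string epic = yahooSymbol.Trim();
        // Remove only a trailing ".L", not every ".L" inside the symbol.
        if (epic.EndsWith(".L", StringComparison.OrdinalIgnoreCase))
        {
            epic = epic.Substring(0, epic.Length - 2);
        }
        return epic.Replace('-', '.');
    }

    public static void Main()
    {
        Console.WriteLine(ToEpic("BT-A.L")); // BT.A
        Console.WriteLine(ToEpic("VOD.L"));  // VOD
    }
}
```

Stripping only the trailing suffix is slightly safer than Replace(".L", ""), which would also remove a '.L' appearing in the middle of a symbol.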

My next post will concentrate more on the scraping engine and the utility classes surrounding it.

1 comment:

  1. Have you considered using something like NinjaTrader which uses inbuilt API's to collect data from several data-sources (Kinetik, Google, Yahoo)?