I download pricing data every day for every company in the FTSE100 and FTSE250. In fact, I download six years' worth of data for each EPIC every day (around 750,000 records). You may ask: why delete and re-download everything every day? The reason is that I believe there are occasional errors in the daily data that Google later corrects, and I wouldn't catch these 'corrections' if I only downloaded the latest day. I may be quite wrong in this opinion, but it makes me feel happier.
The actual 'work' of fetching the Google price data is covered by the following code:
protected static DataTable DownLoadPricesFromGoogle(DateTime latestDate,
    DateTime earliestDate, string googleCid)
{
Begin:
    // Google's historical CSV endpoint; dates are passed as MM/DD/YY (%2F is an encoded '/').
    string urlTemplate =
        @"http://finance.google.co.uk/finance/historical?cid=[symbol]&startdate=[earliestMonth]%2F[earliestDay]%2F[earliestYear]&enddate=[latestMonth]%2F[latestDay]%2F[latestYear]&output=csv";
    string latestMonth = latestDate.Month.ToString("00");
    string latestDay = latestDate.Day.ToString("00");
    string latestYear = latestDate.Year.ToString("00").Substring(2, 2);
    string earliestMonth = earliestDate.Month.ToString("00");
    string earliestDay = earliestDate.Day.ToString("00");
    string earliestYear = earliestDate.Year.ToString("00").Substring(2, 2);
    urlTemplate = urlTemplate.Replace("[symbol]", googleCid);
    urlTemplate = urlTemplate.Replace("[latestMonth]", latestMonth);
    urlTemplate = urlTemplate.Replace("[latestDay]", latestDay);
    urlTemplate = urlTemplate.Replace("[latestYear]", latestYear);
    urlTemplate = urlTemplate.Replace("[earliestMonth]", earliestMonth);
    urlTemplate = urlTemplate.Replace("[earliestDay]", earliestDay);
    urlTemplate = urlTemplate.Replace("[earliestYear]", earliestYear);
    string history;
    var wc = new WebClient();
    try
    {
        history = wc.DownloadString(urlTemplate);
        Thread.Sleep(1000); // be polite between requests
    }
    catch (WebException ex)
    {
        // A protocol error (e.g. a 404) means Google has no data for this cid - give up.
        if (ex.Status == WebExceptionStatus.ProtocolError)
        {
            return null;
        }
        // Anything else is treated as transient - wait and retry from the top.
        Thread.Sleep(3000);
        goto Begin;
    }
    finally
    {
        wc.Dispose();
    }
    return Parse(new StringReader(history), true);
}
The 'Begin:' label at the top of the code, together with the 'goto', is a horrible kludge put in to retry the request after a transient failure fetching the page - I have not yet found a better way. The rest of the code is fairly self-explanatory.
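For what it's worth, the same retry behaviour can be expressed without the 'goto' by wrapping the download in a small loop. The sketch below is just one way of doing it, assuming a fixed retry limit (the 'DownloadWithRetry' method and the 'maxAttempts' constant are mine, not part of the original code):

private static string DownloadWithRetry(string url)
{
    const int maxAttempts = 5; // assumption: give up after five transient failures
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        using (var wc = new WebClient())
        {
            try
            {
                string result = wc.DownloadString(url);
                Thread.Sleep(1000);          // be polite between requests
                return result;
            }
            catch (WebException ex)
            {
                if (ex.Status == WebExceptionStatus.ProtocolError)
                    return null;             // hard failure - no data for this cid
                Thread.Sleep(3000);          // transient failure - wait and try again
            }
        }
    }
    return null;                             // exhausted all attempts
}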
The 'Parse' method converts the document stream into a DataTable. To be honest, I should get around to rewriting this, as there is really no need for a DataTable here - a much simpler approach, such as a plain data transport object, would suffice (see the sketch after the parsing code below):
public static DataTable Parse(TextReader stream, bool headers)
{
    var table = new DataTable();
    table.TableNewRow += Table_NewRow;
    var csv = new CsvStream(stream);
    string[] row = csv.GetNextRow();
    if (row == null)
        return null;
    if (headers)
    {
        // Use the first CSV row as column names, generating a name where one is missing.
        foreach (string header in row)
        {
            if (!string.IsNullOrEmpty(header) && !table.Columns.Contains(header))
            {
                // Normalise any header containing 'Date' to exactly "Date".
                if (header.Contains("Date"))
                {
                    table.Columns.Add("Date", typeof(string));
                }
                else
                {
                    table.Columns.Add(header, typeof(string));
                }
            }
            else
            {
                table.Columns.Add(GetNextColumnHeader(table), typeof(string));
            }
        }
        row = csv.GetNextRow();
    }
    while (row != null)
    {
        // Grow the table if a data row is wider than the header row.
        while (row.Length > table.Columns.Count)
            table.Columns.Add(GetNextColumnHeader(table), typeof(string));
        if (String.IsNullOrEmpty(row[1]))
        {
            Debug.WriteLine("`"); // debug marker for rows whose second field is empty
        }
        table.Rows.Add(row);
        row = csv.GetNextRow();
    }
    return table;
}
private static void Table_NewRow(object sender, DataTableNewRowEventArgs e)
{
    Console.WriteLine("Table_NewRow Event: RowState={0}", e.Row.RowState);
}
private static string GetNextColumnHeader(DataTable table)
{
    int c = 1;
    while (true)
    {
        string h = "Column" + c++;
        if (!table.Columns.Contains(h))
            return h;
    }
}
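As hinted above, a plain data transport object would do the same job with far less ceremony. Here is a minimal sketch, assuming Google's usual Date/Open/High/Low/Close/Volume column order and the "d-MMM-yy" date format used elsewhere in this code (the 'PriceBar' class and 'ParseBars' method are hypothetical names, not part of the existing code):

public class PriceBar
{
    public DateTime Date { get; set; }
    public decimal Open { get; set; }
    public decimal High { get; set; }
    public decimal Low { get; set; }
    public decimal Close { get; set; }
    public decimal Volume { get; set; }
}

public static List<PriceBar> ParseBars(TextReader stream)
{
    var bars = new List<PriceBar>();
    var csv = new CsvStream(stream);
    csv.GetNextRow();                      // skip the header row
    string[] row;
    while ((row = csv.GetNextRow()) != null)
    {
        if (row.Length < 6 || string.IsNullOrEmpty(row[1]))
            continue;                      // skip malformed rows
        bars.Add(new PriceBar
        {
            Date = DateTime.ParseExact(row[0], "d-MMM-yy", null),
            Open = Convert.ToDecimal(row[1]),
            High = Convert.ToDecimal(row[2]),
            Low = Convert.ToDecimal(row[3]),
            Close = Convert.ToDecimal(row[4]),
            Volume = Convert.ToDecimal(row[5])
        });
    }
    return bars;
}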
The 'RefreshPriceData' method controls the program flow of the 'scrape' - key points to note are:
- the 'Persistence' class is purely a utility class that cleans up the back-end database before a new set of data is received. To be honest, with the advent of LINQ-to-SQL, I could probably get rid of 'Persistence' and keep everything inline.
- The code checks for the presence of 'GoogleEpicCodes.xml' in the current directory - as previously mentioned, this file is the 'key' for matching Google ids to real EPICs. If the file doesn't exist, it is created by the 'FetchGoogleCIDs' method.
- I use a store of known EPICs in my database to filter which prices I am going after - these EPICs are checked against the data from 'GoogleEpicCodes.xml' to ensure validity and to restrict the run to FTSE100 and FTSE250 companies only - the LINQ code to fetch the EPICs from the SQL database is:
IQueryable<string> epics = (from e in _db.StatsLists
                            where e.FTSE_Index == "FTSE100" || e.FTSE_Index == "FTSE250"
                            orderby e.EPIC
                            select e.EPIC);
- An interesting piece of code surrounds the line:
if (!_dt.Rows.Contains(keysFind))
The reason for this is that I have found occasionally that Google will throw the same price/date pair in twice - I don't know if this is designed to confound scrapers, but this small check before adding the record will save you much grief later. Note that 'Rows.Contains' looks values up against the table's primary key, so the EPIC/Date columns must be set as that key first - see the short illustration after this list.
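For reference, 'DataTable.Rows.Contains(object[])' only works once a primary key has been declared on the table. A minimal, self-contained illustration of the duplicate check (the table, EPIC and values here are made up for the example):

var dt = new DataTable();
dt.Columns.Add("EPIC", typeof(string));
dt.Columns.Add("Date", typeof(DateTime));
dt.Columns.Add("Close", typeof(decimal));
dt.PrimaryKey = new[] { dt.Columns["EPIC"], dt.Columns["Date"] };

var key = new object[] { "VOD", new DateTime(2010, 1, 4) };
if (!dt.Rows.Contains(key))
{
    dt.Rows.Add("VOD", new DateTime(2010, 1, 4), 144.5m); // first occurrence is added
}
if (!dt.Rows.Contains(key))
{
    dt.Rows.Add("VOD", new DateTime(2010, 1, 4), 144.5m); // duplicate is silently skipped
}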
public void RefreshPriceData()
{
    _db = new DataClassesDataContext();
    _db.Connection.ConnectionString = Database.ConnectionString;
    // Delete all prices first
    Persistence.ExecuteDelete();
    string path = Environment.CurrentDirectory;
    path = string.Format("{0}\\", path);
    if (!File.Exists(path + "GoogleEpicCodes.xml"))
    {
        FetchGoogleCIDs();
    }
    DataTable gId = GetGoogleCids(path + "GoogleEpicCodes.xml");
    IQueryable<string> epics = (from e in _db.StatsLists
                                where e.FTSE_Index == "FTSE100" || e.FTSE_Index == "FTSE250"
                                orderby e.EPIC
                                select e.EPIC);
    DataTable _dt = Persistence.CreateDataTable();
    // EPIC + Date form the primary key so that Rows.Contains can spot duplicates.
    var keys = new DataColumn[2];
    keys[0] = _dt.Columns["EPIC"];
    keys[1] = _dt.Columns["Date"];
    _dt.PrimaryKey = keys;
    int count = 0;
    foreach (DataRow row in gId.Rows)
    {
        if (epics.Contains(row[0]))
        {
            Console.WriteLine("Starting on " + row[0]);
            // Fetch the history for this cid - the six-year window matches the
            // six years' worth of data described above.
            DataTable tempdt = DownLoadPricesFromGoogle(DateTime.Today,
                DateTime.Today.AddYears(-6), row[1].ToString());
            if (tempdt == null)
            {
                continue;
            }
            var keysFind = new object[2];
            foreach (DataRow dataRow in tempdt.Rows)
            {
                DataRow newRow = _dt.NewRow();
                newRow["EPIC"] = row[0];
                newRow["Date"] =
                    DateTime.ParseExact(dataRow["Date"].ToString(), "d-MMM-yy", null);
                newRow["Open"] = Convert.ToDecimal(dataRow["Open"].ToString());
                newRow["Volume"] = Convert.ToDecimal(dataRow["Volume"].ToString());
                decimal high = Convert.ToDecimal(dataRow["High"].ToString());
                decimal low = Convert.ToDecimal(dataRow["Low"].ToString());
                decimal close = Convert.ToDecimal(dataRow["Close"].ToString());
                newRow["High"] = high;
                newRow["Low"] = low;
                newRow["Close"] = close;
                keysFind[0] = row[0];
                keysFind[1] =
                    DateTime.ParseExact(dataRow["Date"].ToString(), "d-MMM-yy", null);
                // Google occasionally repeats a price/date pair - skip exact duplicates.
                if (!_dt.Rows.Contains(keysFind))
                {
                    _dt.Rows.Add(newRow);
                }
            }
            // Flush to the database every 30 EPICs to keep the in-memory table small.
            if (count % 30 == 0 && count != 0)
            {
                Persistence.CopyData(_dt);
                _dt.Rows.Clear();
            }
            count++;
        }
    }
    // Flush whatever is left after the last full batch.
    Persistence.CopyData(_dt);
    _dt.Rows.Clear();
}
Implementing this code will allow you to easily 'scrape' the data you need. I run it every night before testing, with each run netting some three-quarters of a million records in about 15 minutes.
Next time we will discuss how to easily insert so many records into an SQL database.