Analyzing site content with GAS

One of the things I wanted to play around with was visualizing this site’s content with d3. But first I needed to create something that would generate data from the site (any domain site, actually). It’s easy enough to modify for non-domain sites, but I’m starting with domains, since that’s what I have.

 
To do this we’ll use a couple of script services.
  • Google Drive – results will either be served up as JSON or JSONP, or written as a JSON file on Google Drive for later consumption.
  • Content service – to serve up either the data results or the file location results
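
As a rough illustration of the JSON versus JSONP choice, here is a minimal sketch. The helper names formatResponse and toTextOutput are my own and not taken from the tagsite code. The first function is plain JavaScript; the second assumes it is running inside Apps Script, where ContentService is available.

```javascript
// Hypothetical helper: decide how a result should be serialized.
// If a callback name is supplied, the JSON is wrapped in a function call (JSONP).
function formatResponse(result, callback) {
  const json = JSON.stringify(result);
  if (callback) {
    return { body: callback + '(' + json + ');', mimeType: 'JAVASCRIPT' };
  }
  return { body: json, mimeType: 'JSON' };
}

// Hypothetical helper, only meaningful inside Apps Script:
// hand the formatted response to the Content service.
function toTextOutput(formatted) {
  return ContentService
    .createTextOutput(formatted.body)
    .setMimeType(ContentService.MimeType[formatted.mimeType]);
}
```

A doGet(e) handler could then end with something like return toTextOutput(formatResponse(result, e.parameter.callback)).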

Objective

 

Ultimately this data will be used for visualization. I’ll cover that in a separate section. First of all I’m going to scrape the site, looking for and counting occurrences of specific tags and reporting them. That way we can generate some visualizations showing which topics are related and where to find them. The web app – tagsite – will take these URL arguments
  • tagdomain (example: tagdomain=mcpher.com) – the name of the domain to which the site is mapped
  • tagsite (example: tagsite=share) – the name of the site
  • tagoutput (example: tagoutput=drive|rest) – whether to write the result to Drive or return it as a REST response. The default is drive
  • tagfile (example: tagfile=tagsite.json) – the name of the file to write to Drive if tagoutput=drive
  • callback (example: callback=somefunction) – the name of a callback function. If specified, JSONP rather than JSON will be returned
  • tag descriptions (example: &d3=d3,js,d3js,d3&excel=excel,xl) – a list of tag=synonym1,synonym2… Each tag given as a parameter counts every occurrence of each of its synonyms. You can use regex syntax for a synonym if you need to
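
To make the parameter handling concrete, here is a minimal sketch of how the arguments above might be separated into options and tag definitions. The function name parseTagParams and the shape of its return value are assumptions for illustration; params stands in for the e.parameter object that Apps Script passes to doGet(e).

```javascript
// Parameter names reserved by the web app (from the list above).
const RESERVED = ['tagdomain', 'tagsite', 'tagoutput', 'tagfile', 'callback'];

// Hypothetical helper: split query parameters into options and tag definitions.
// `params` is a plain object of name -> string value.
function parseTagParams(params) {
  const options = { tagoutput: 'drive' }; // the documented default
  const tags = [];
  Object.keys(params).forEach(function (key) {
    if (RESERVED.indexOf(key) !== -1) {
      options[key] = params[key];
    } else {
      // Anything that is not reserved is treated as tag=synonym1,synonym2...
      tags.push({ name: key, values: params[key].split(',') });
    }
  });
  return { options: options, tags: tags };
}
```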

An example

 

REST
Let’s take an example (it does take a while to run, since there’s a lot of content). This creates relationship data for each page on the site for the given tags, and returns plain JSON.
 
https://script.google.com/macros/s/AKfycbz4Q0o4R3Kq9KubpgOSU5iy4eY6rcN2KcqGzo6GHi6hxZUM0bA/exec?tagdomain=mcpher.com&tagoutput=rest&tagsite=share&d3=d3js,d3.js,d3&vba=vba,vb&excel=excel,xl&gas=gas,script
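
If you would rather assemble a request like this in code than by hand, something along these lines could work on the client side (a browser or Node, where URLSearchParams is available). The function name buildTagsiteUrl is hypothetical, and the base URL in the usage example is a placeholder rather than a real deployment. Commas in the synonym lists end up percent-encoded, which is decoded again when the query string is parsed.

```javascript
// Hypothetical client-side helper: build a tagsite request URL.
// `options` holds the reserved parameters, `tags` maps tag name -> array of synonyms.
function buildTagsiteUrl(baseUrl, options, tags) {
  const query = new URLSearchParams();
  Object.keys(options).forEach(function (key) {
    query.append(key, options[key]);
  });
  Object.keys(tags).forEach(function (name) {
    query.append(name, tags[name].join(','));
  });
  return baseUrl + '?' + query.toString();
}

// Usage example with a placeholder base URL (not a real deployment).
const exampleUrl = buildTagsiteUrl(
  'https://script.google.com/macros/s/DEPLOYMENT_ID/exec',
  { tagdomain: 'mcpher.com', tagoutput: 'rest', tagsite: 'share' },
  { d3: ['d3js', 'd3.js', 'd3'], excel: ['excel', 'xl'] }
);
```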
 
 
Results
You get back an array with one item for each page in the web site. The first element looks like this. Each count is the number of times the corresponding synonym is encountered on that page.

{
    "data": [
        {
            "parent": "gassites",
            "name": "gastags",
            "url": "https://sites.google.com/a/mcpher.com/share/Home/excelquirks/gassites/gastags",
            "tags": {
                "tagmap": [
                    {
                        "name": "gas",
                        "values": [
                            "gas",
                            "script"
                        ],
                        "counts": [
                            0,
                            1
                        ]
                    },
                    {
                        "name": "d3",
                        "values": [
                            "d3js",
                            "d3.js",
                            "d3"
                        ],
                        "counts": [
                            1,
                            1,
                            4
                        ]
                    },
                    {
                        "name": "vba",
                        "values": [
                            "vba",
                            "vb"
                        ],
                        "counts": [
                            0,
                            0
                        ]
                    },
                    {
                        "name": "excel",
                        "values": [
                            "excel",
                            "xl"
                        ],
                        "counts": [
                            2,
                            1
                        ]
                    }
                ]
            }
        },
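
For a feel of how one tagmap entry like those above could be computed, here is a minimal sketch. It is not the tagsite code itself: the function name countSynonyms is hypothetical, and I am assuming that each synonym is treated as a regular expression and matched case-insensitively across the page text.

```javascript
// Hypothetical helper: count how often each synonym of a tag occurs in some text.
// Assumption: synonyms are regex patterns, matched globally and case-insensitively.
function countSynonyms(text, tag) {
  const counts = tag.values.map(function (synonym) {
    const pattern = new RegExp(synonym, 'gi');
    const matches = text.match(pattern); // null when nothing matches
    return matches ? matches.length : 0;
  });
  return { name: tag.name, values: tag.values, counts: counts };
}
```

Note that with this approach a short synonym such as d3 is also counted inside longer ones such as d3js, and an unescaped dot in a synonym matches any character.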
DRIVE
In this case we want to do the same thing, but this time write the result to Google Drive.
 
https://script.google.com/macros/s/AKfycbz4Q0o4R3Kq9KubpgOSU5iy4eY6rcN2KcqGzo6GHi6hxZUM0bA/exec?tagdomain=mcpher.com&tagoutput=drive&tagfile=play.json&tagsite=share&d3=d3js,d3.js,d3&vba=vba,vb&excel=excel,xl&gas=gas,script
 
Results
What gets returned is a description of the Drive file. The “hosted” property is a link to the created JSON file, and is the one you should use for getting data into your web app.

{
    "data": [],
    "file": {
        "url": "https://docs.google.com/a/mcpher.com/file/d/0B92ExLh4POiZTFgwcWtXUG1qVU0/edit?usp=drivesdk",
        "name": "play.json",
        "id": "0B92ExLh4POiZTFgwcWtXUG1qVU0",
        "download": "https://docs.google.com/a/mcpher.com/uc?id=0B92ExLh4POiZTFgwcWtXUG1qVU0&export=download",
        "hosted": "https://googledrive.com/host/0B92ExLh4POiZTFgwcWtXUG1qVU0"
    }
}
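
Since the two output modes return differently shaped responses, a consumer needs to tell them apart. A minimal sketch, assuming the response shapes shown in the two examples; the function name dataLocation and its return shape are hypothetical.

```javascript
// Hypothetical helper: work out where the page data lives for a given response.
// REST mode carries the data inline; Drive mode carries a file description instead.
function dataLocation(response) {
  if (response.file && response.file.hosted) {
    // Drive mode: the data has to be fetched from the hosted link.
    return { inline: false, url: response.file.hosted };
  }
  // REST mode: the data array is already in the response.
  return { inline: true, data: response.data };
}
```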

Dependencies

Normally I reference a shared library for GAS stuff (see Using the mcpher library in your code), but this is very straightforward and all the code is below. No library references are needed.
 
Main code here

Now let’s do something with the data – see Site data to sheets