Parse dblp XML and output sums of publications grouped by year and type
The following Go program parses a gzipped XML file (available here) which contains bibliographic information on computer science publications and has the following indicative structure:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2017-05-28" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>https://doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/Simon83">
<author>Hans Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>https://doi.org/10.1007/BF01257084</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/GoodmanS83">
<author>Nathan Goodman</author>
<author>Oded Shmueli</author>
<title>NP-complete Problems Simplified on Tree Schemas.</title>
<pages>171-178</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#GoodmanS83</url>
<ee>https://doi.org/10.1007/BF00289414</ee>
</article>
</dblp>
The XML contains multiple publication types, denoted by the name of the element (e.g. proceedings, book, phdthesis), and for each of these I have defined a separate struct in my program:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Dblp contains the array of articles in the dblp xml file
type Dblp struct {
XMLName xml.Name `xml:"dblp"`
Dblp []Article
}
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // not currently in use
Year string `xml:"year"`
Author string `xml:"author"` // not currently in use
Title string `xml:"title"` // not currently in use
}
// Article and the following structs describe the elements we want to parse; each embeds the Metadata struct defined above
type Article struct {
XMLName xml.Name `xml:"article"`
Metadata
}
type InProceedings struct {
XMLName xml.Name `xml:"inproceedings"`
Metadata
}
type Proceedings struct {
XMLName xml.Name `xml:"proceedings"`
Metadata
}
type Book struct {
XMLName xml.Name `xml:"book"`
Metadata
}
type InCollection struct {
XMLName xml.Name `xml:"incollection"`
Metadata
}
type PhdThesis struct {
XMLName xml.Name `xml:"phdthesis"`
Metadata
}
type MastersThesis struct {
XMLName xml.Name `xml:"mastersthesis"`
Metadata
}
type WWW struct {
XMLName xml.Name `xml:"www"`
Metadata
}
// Record stores each publication's type and year; it is used as the value type of map m
type Record struct {
UID int
ID int
Type string
Year string
}
// SumRecord stores a publication type and year; it is used as the key of srMap,
// whose int value holds the count of publications of that type in that year
type SumRecord struct {
Type string
Year string
}
The program stores each publication in a map structure and finally exports two csv files:
- result.csv, which contains a uid, an id, the publication type and the year for each publication
- sumresult.csv, which contains the number of publications of each type per year
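For illustration only, a few hypothetical rows showing the column layout of sumresult.csv (the values are made up, not taken from the actual dump):
type,year,sum
article,1996,1543
inproceedings,1996,2210
phdthesis,1996,87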
It is the first "complete" program I've written in Go - I'm currently trying to get a grasp on the language, and I needed to ask two questions on Stack Overflow while writing it (here and here).
The rest of the code:
func main() {
// Start counting time
start := time.Now()
// Initialize counter variables for each publication type
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter int
var i = 1
// Initialize hash map
m := make(map[int]Record)
//Open gzipped dblp xml
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
}
defer gz.Close()
//Directly open xml file for testing purposes if needed - be sure to comment out gzip file opening above
//xmlFile, err := os.Open("dblp.xml")
//xmlFile, err := os.Open("TestDblp.xml")
if err != nil {
fmt.Println(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
// defer the closing of XML file so that we can parse it later on
defer xmlFile.Close()
// Initialize main object from Dblp struct
var articles Dblp
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
err = decoder.Decode(&articles.Dblp)
if err != nil {
fmt.Println(err)
}
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %t", err)
} else if t == nil {
break
}
// Here, we inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "dblp":
case "article":
var p Article
decoder.DecodeElement(&p, &se)
increment(&articleCounter)
m[i] = Record{i, articleCounter, "article", p.Year}
increment(&i)
case "inproceedings":
var p InProceedings
decoder.DecodeElement(&p, &se)
increment(&InProceedingsCounter)
m[i] = Record{i, InProceedingsCounter, "inproceedings", p.Year}
increment(&i)
case "proceedings":
var p Proceedings
decoder.DecodeElement(&p, &se)
increment(&ProceedingsCounter)
m[i] = Record{i, ProceedingsCounter, "proceedings", p.Year}
increment(&i)
case "book":
var p Book
decoder.DecodeElement(&p, &se)
increment(&BookCounter)
m[i] = Record{i, BookCounter, "book", p.Year}
increment(&i)
case "incollection":
var p InCollection
decoder.DecodeElement(&p, &se)
increment(&InCollectionCounter)
m[i] = Record{i, InCollectionCounter, "incollection", p.Year}
increment(&i)
case "phdthesis":
var p PhdThesis
decoder.DecodeElement(&p, &se)
increment(&PhdThesisCounter)
m[i] = Record{i, PhdThesisCounter, "phdthesis", p.Year}
increment(&i)
case "mastersthesis":
var p MastersThesis
decoder.DecodeElement(&p, &se)
increment(&mastersThesisCounter)
m[i] = Record{i, mastersThesisCounter, "mastersthesis", p.Year}
increment(&i)
case "www":
var p WWW
decoder.DecodeElement(&p, &se)
increment(&wwwCounter)
m[i] = Record{i, wwwCounter, "www", p.Year}
increment(&i)
}
}
}
log.Println("Element parsing completed in:", time.Since(start))
// All parsed elements have been added to m := make(map[int]Record)
// We can start processing the map.
// First we create a map and count the number of occurrences of each publication type for a given year.
srMap := make(map[SumRecord]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := SumRecord{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
// Create sum csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
// define column headers
sumheaders := []string{
"type",
"year",
"sum",
}
sumwriter.Write(sumheaders)
var SumString string
// Create sorted map by VALUE (integer)
SortedSrMap := map[int]SumRecord{}
SortedSrMapKeys := []int{}
for key, val := range SortedSrMap {
// SortedSrMap[val] = key
// SortedSrMapKeys = append(SortedSrMapKeys, val)
SumString = strconv.Itoa(key)
fmt.Println("sumstring:", SumString, "value: ", val)
}
sort.Ints(SortedSrMapKeys)
// END Create sorted map by VALUE (integer)
// Export sum csv
for key, val := range srMap {
r := make([]string, 0, 1+len(sumheaders))
SumString = strconv.Itoa(val)
r = append(
r,
key.Type,
key.Year,
SumString,
)
sumwriter.Write(r)
}
sumwriter.Flush()
// CREATE RESULTS CSV
log.Println("Creating results csv file")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
// define column headers
headers := []string{
"uid",
"id",
"type",
"year",
}
// write column headers
writer.Write(headers)
var idString string
var uidString string
// Create sorted map
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make([]string, 0, 1+len(headers)) // capacity of 4, 1 + the number of properties our struct has & the number of column headers we are passing
// convert the Record.ID and UID ints to string in order to pass into append()
idString = strconv.Itoa(m[k].ID)
uidString = strconv.Itoa(m[k].UID)
r = append(
r,
uidString,
idString,
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// END CREATE RESULTS CSV
// Finally report results - update below line with more counters as desired
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:", ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:", PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
//log.Println("map:", m)
//log.Println("map length:", len(m))
//log.Println("sum map length:", len(srMap))
//fmt.Println("sum map contents:", srMap)
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func increment(i *int) {
*i = *i + 1
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for this case
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
Main problems and issues I've identified:
- The parsing is quite slow (it takes about 3 minutes 45 seconds) given the size of the file (a 474 MB gzip). Can I improve something to make it faster?
- Can the code be made less verbose, but not at the expense of making it less readable / understandable to a person just starting out with Go? For example, by generalizing the structs which define the different publication types, as well as the switch/case statements?

Tags: parsing, xml, go

asked Mar 11 at 15:20 by orestisf
2 Answers
The decoder.Decode call is unnecessary and in fact throws an error at the moment.
To your second point: yes, the case statements in particular can most likely all be compressed down to a single function, since the branches differ only in a few variables.
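A minimal sketch of that idea (collector, add and the counters map are my own names, not from the code above); every branch only differs in the element name, so one helper plus a per-type counter map can replace the eight near-identical cases:

type collector struct {
	i        int
	counters map[string]int
	m        map[int]Record
}

func (c *collector) add(name, year string) {
	c.i++              // uid, same role as the old i counter
	c.counters[name]++ // per-type counter, same role as articleCounter etc.
	c.m[c.i] = Record{UID: c.i, ID: c.counters[name], Type: name, Year: year}
}

c := &collector{counters: map[string]int{}, m: map[int]Record{}}

// inside the xml.StartElement branch of the token loop, the switch collapses to:
switch se.Name.Local {
case "article", "inproceedings", "proceedings", "book",
	"incollection", "phdthesis", "mastersthesis", "www":
	var p Metadata // the shared fields already expose the year
	decoder.DecodeElement(&p, &se)
	c.add(se.Name.Local, p.Year)
}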
Indexing into a hash map (map[int]Record) is not ideal; with the two million elements in that table it is probably causing a slowdown too. Instead you can simply append the elements to a slice; then everything is already in insertion order and fine for iteration later on, no sorting necessary at all.
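For example (a hedged sketch; records and typeCount are my own names, with typeCount standing for whichever per-type counter the branch maintains):

records := make([]Record, 0) // grows as needed; pre-allocate capacity if the total is roughly known

// in each branch, instead of m[i] = Record{...}:
records = append(records, Record{
	UID:  len(records) + 1,
	ID:   typeCount,
	Type: se.Name.Local,
	Year: p.Year,
})

// export: the slice is already in insertion order, so the whole sort.Ints(keys) pass disappears
for _, rec := range records {
	writer.Write([]string{strconv.Itoa(rec.UID), strconv.Itoa(rec.ID), rec.Type, rec.Year})
}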
And for increment(&i)... just go ahead and increment the counters directly. Making functions is fine in general, but like this it does not help readability (i += 1 is much clearer).
make([]string, 0, 1+len(headers)) is valid, but you can simply create the slice with all elements in place, like []string{uidString, ..., m[k].Year}. It might be even better if you can reuse that slice across all loop iterations.
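Concretely (same field names as the question; the reuse variant relies on the fact that csv.Writer.Write copies the record into its buffer before returning, so the slice may be reused):

r := []string{
	strconv.Itoa(m[k].UID),
	strconv.Itoa(m[k].ID),
	m[k].Type,
	m[k].Year,
}
writer.Write(r)

// or reuse one slice for every row:
row := make([]string, 4)
for _, k := range keys {
	row[0] = strconv.Itoa(m[k].UID)
	row[1] = strconv.Itoa(m[k].ID)
	row[2] = m[k].Type
	row[3] = m[k].Year
	writer.Write(row)
}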
I can't see any other obvious things to change. There's a possibility that getting rid of DecodeElement and doing the whole decoding yourself might improve things, but I'm skeptical. If I, for example, remove the whole switch block, doing essentially nothing but the XML decoding, it still takes three minutes for me, just one minute less than with that block included. Meaning that with this library it's just not going to get much quicker overall.
Thanks for taking the time to review the code! I've reworked it based on your feedback: removed decoder.Decode; created a single function to process the elements I'm interested in, which does the increments / map / slice appends; for the increment functions, indeed inlining them would make the code a bit more readable, however I want to keep them for now for learning's sake; and I'm working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference between them is the xml.Name element.
– orestisf, Mar 17 at 20:01
From what I know about the encoding/xml package, I don't think there's anything that would make the structs more succinct, unfortunately. You could go to a generic nested struct and do the decoding without the per-type struct definitions, though.
– ferada, Mar 17 at 22:25
I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.
Main points:
Only two structs are now used:
type Metadata struct {
Key string `xml:"key,attr"`
Year string `xml:"year"`
Author string `xml:"author"`
Title string `xml:"title"`
}
type Record struct {
UID int
ID int
Type string
Year string
}
The publications are all processed with the following function:
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
The entire code now looks like this:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // currently not in use
Year string `xml:"year"`
Author string `xml:"author"` // currently not in use
Title string `xml:"title"` // currently not in use
}
// Record is used to store each Article's type and year which will be passed as a value to map m
type Record struct {
UID int
ID int
Type string
Year string
}
type Count int
type Counter interface {
Incr() int
ReturnInt() int
}
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count
func main() {
start := time.Now()
//Open gzipped dblp xml
//xmlFile, err := os.Open("TestDblp.xml.gz")
// Uncomment below for actual xml
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
defer gz.Close()
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
if err != nil {
log.Fatal(err)
}
m := make(map[int]Record)
var p Metadata
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done with parsing.
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %t", err)
} else if t == nil {
break
}
// Let's inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "article":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)
case "inproceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)
case "proceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)
case "book":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)
case "incollection":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)
case "phdthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)
case "mastersthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)
case "www":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
}
}
}
log.Println("XML parsing done in:", time.Since(start))
// All parsed elements have been added to m := make(map[int]Record)
// We create the srMap map and count the number of occurrences of each publication type for a given year.
srMap := make(map[Record]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := Record{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
// Create sumresult.csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
sumheaders := []string{
"publicationType",
"year",
"sum",
}
sumwriter.Write(sumheaders)
// Export sumresult.csv
for key, val := range srMap {
r := make([]string, 0, 1+len(sumheaders))
r = append(
r,
key.Type,
key.Year,
strconv.Itoa(val),
)
sumwriter.Write(r)
}
sumwriter.Flush()
// Create result.csv
log.Println("Creating result.csv")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
headers := []string{
"uid",
"id",
"type",
"year",
}
writer.Write(headers)
// Create sorted map
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make([]string, 0, 1+len(headers))
r = append(
r,
strconv.Itoa(m[k].UID),
strconv.Itoa(m[k].ID),
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// Finally report results
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
log.Println("Distinct publication map length:", len(m))
log.Println("Sum map length:", len(srMap))
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
func (c *Count) Incr() int {
*c = *c + 1
return int(*c)
}
func (c *Count) ReturnInt() int {
return int(*c)
}
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
I feel that the csv generation parts can be further streamlined as they are still a bit messy.
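One possible direction, as a rough sketch (writeCSV is a hypothetical helper of mine, not part of the code above): move the repeated create / write-header / flush steps into one function and have the two export loops only build [][]string rows.

func writeCSV(name string, header []string, rows [][]string) error {
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer f.Close()
	w := csv.NewWriter(f)
	if err := w.Write(header); err != nil {
		return err
	}
	// WriteAll flushes internally and returns any underlying write error.
	return w.WriteAll(rows)
}

The two exports would then reduce to something like writeCSV("sumresult.csv", sumheaders, sumRows) and writeCSV("result.csv", headers, resultRows), with the row slices built in the existing loops.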
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
The decoder.Decode
call is unnecessary and in fact throws an error at
the moment.
To your second point, yes, especially the case
statements can all be
compressed down to a single function most likely, since they all only
have a few variables exchanged.
The indexing into a hash map map[int]Record
is not ideal, in fact
that's probably causing a slowdown too with the two million elements in
that table, instead you can simply append
the elements to a slice and
then it's all sorted and fine for iteration later on, no sorting
necessary at all.
And for increment(&i)
... just go ahead and increment the counters.
If you make functions, okay, but like this it's not helping with
readability (i += 1
is much clearer).
make([]string, 0, 1+len(headers)
- well that's valid, but you can
simply create the array with all elements instead, like
[]string{uidString, ..., m[k].Year
etc. Might be even better if you
can reuse that array for all loop iterations.
Well I can't see any other obvious things to change. There's the
possibility that getting rid of DecodeElement
and doing the whole
decoding yourself might improve things, but I'm skeptical. If I, for
example, remove the whole switch
block, doing nothing but XML
decoding essentially, this still takes three minutes for me, essentially
just one minute less than with that block included! Meaning that with
this library it's just not going to get much quicker overall.
$endgroup$
1
$begingroup$
Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removeddecoder.Decode
. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is thexml.Name
element.
$endgroup$
– orestisf
Mar 17 at 20:01
1
$begingroup$
From what I know about theencoding/xml
package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.
$endgroup$
– ferada
Mar 17 at 22:25
answered Mar 11 at 23:37
ferada
$begingroup$
I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.
Main points:
Only two structs are now used:
type Metadata struct {
Key string `xml:"key,attr"`
Year string `xml:"year"`
Author string `xml:"author"`
Title string `xml:"title"`
}
type Record struct {
UID int
ID int
Type string
Year string
}
The publications are all processed with the following function:
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
The entire code now looks like this:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // currently not in use
Year string `xml:"year"`
Author string `xml:"author"` // currently not in use
Title string `xml:"title"` // currently not in use
}
// Record is used to store each Article's type and year which will be passed as a value to map m
type Record struct {
UID int
ID int
Type string
Year string
}
type Count int
type Counter interface {
Incr() int
ReturnInt() int
}
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count
func main() {
start := time.Now()
//Open gzipped dblp xml
//xmlFile, err := os.Open("TestDblp.xml.gz")
// Swap with the commented line above to use the small test file instead
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
defer gz.Close()
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
if err != nil {
log.Fatal(err)
}
m := make(map[int]Record)
var p Metadata
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done with parsing.
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %t", err)
} else if t == nil {
break
}
// Let's inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "article":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)
case "inproceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)
case "proceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)
case "book":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)
case "incollection":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)
case "phdthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)
case "mastersthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)
case "www":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
}
}
}
log.Println("XML parsing done in:", time.Since(start))
// All parsed elements have been added to the map m.
// srMap counts the occurrences of each publication type for a given year.
srMap := make(map[Record]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := Record{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
// Create sumresult.csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
sumheaders := []string{
"publicationType",
"year",
"sum",
}
sumwriter.Write(sumheaders)
// Export sumresult.csv
for key, val := range srMap {
r := make([]string, 0, 1+len(sumheaders))
r = append(
r,
key.Type,
key.Year,
strconv.Itoa(val),
)
sumwriter.Write(r)
}
sumwriter.Flush()
// Create result.csv
log.Println("Creating result.csv")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
headers := []string{
"uid",
"id",
"type",
"year",
}
writer.Write(headers)
// Create sorted map
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make([]string, 0, 1+len(headers))
r = append(
r,
strconv.Itoa(m[k].UID),
strconv.Itoa(m[k].ID),
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// Finally report results
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
log.Println("Distinct publication map length:", len(m))
log.Println("Sum map length:", len(srMap))
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
func (c *Count) Incr() int {
*c = *c + 1
return int(*c)
}
func (c *Count) ReturnInt() int {
return int(*c)
}
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
I feel that the CSV generation parts can be further streamlined, as they are still a bit messy.
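One possible direction (a sketch, not part of the code above) is a small helper that owns the file and writer lifecycle, so that result.csv and sumresult.csv share the same export path:
// writeCSV creates the file, writes the header and all rows, and flushes.
func writeCSV(name string, header []string, rows [][]string) error {
    f, err := os.Create(name)
    if err != nil {
        return err
    }
    defer f.Close()
    w := csv.NewWriter(f)
    if err := w.Write(header); err != nil {
        return err
    }
    if err := w.WriteAll(rows); err != nil { // WriteAll also calls Flush
        return err
    }
    return w.Error()
}
The two export loops would then only build their [][]string rows and call writeCSV once each.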
$endgroup$
answered 2 days ago
orestisf