Parse dblp XML and output sums of publications grouped by year and type
The following Go program parses a gzipped XML file (available here) which contains bibliographic information on computer science publications and has the following indicative structure:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2017-05-28" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>https://doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/Simon83">
<author>Hans Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>https://doi.org/10.1007/BF01257084</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/GoodmanS83">
<author>Nathan Goodman</author>
<author>Oded Shmueli</author>
<title>NP-complete Problems Simplified on Tree Schemas.</title>
<pages>171-178</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#GoodmanS83</url>
<ee>https://doi.org/10.1007/BF00289414</ee>
</article>
</dblp>
The XML contains multiple publication types, denoted by the name of the element (e.g. proceedings, book, phdthesis), and for each of these I have defined a separate struct in my program:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Dblp contains the array of articles in the dblp xml file
type Dblp struct {
XMLName xml.Name `xml:"dblp"`
Dblp []Article
}
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // not currently in use
Year string `xml:"year"`
Author string `xml:"author"` // not currently in use
Title string `xml:"title"` // not currently in use
}
// Article and the following structs describe the elements we want to parse; each embeds the Metadata struct defined above
type Article struct {
XMLName xml.Name `xml:"article"`
Metadata
}
type InProceedings struct {
XMLName xml.Name `xml:"inproceedings"`
Metadata
}
type Proceedings struct {
XMLName xml.Name `xml:"proceedings"`
Metadata
}
type Book struct {
XMLName xml.Name `xml:"book"`
Metadata
}
type InCollection struct {
XMLName xml.Name `xml:"incollection"`
Metadata
}
type PhdThesis struct {
XMLName xml.Name `xml:"phdthesis"`
Metadata
}
type MastersThesis struct {
XMLName xml.Name `xml:"mastersthesis"`
Metadata
}
type WWW struct {
XMLName xml.Name `xml:"www"`
Metadata
}
// Record stores each publication's type and year; it is used as the value type of map m
type Record struct {
UID int
ID int
Type string
Year string
}
// SumRecord stores a publication type and year; it is used as the key of srMap,
// whose int value holds the count of publications of that type in that year
type SumRecord struct {
Type string
Year string
}
The program stores each publication in a map structure and finally exports two csv files:
- result.csv, which contains a uid, an id, the publication type and the year for each publication
- sumresult.csv, which contains the number of publications of each type per year
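For illustration only, a few hypothetical rows showing the column layout of sumresult.csv (the values are made up, not taken from the actual dump):
type,year,sum
article,1996,1543
inproceedings,1996,2210
phdthesis,1996,87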
It is the first "complete" program I've written in Go - I'm currently trying to get a grasp on the language, and I needed to ask two questions on Stack Overflow while writing it (here and here).
The rest of the code:
func main() {
// Start counting time
start := time.Now()
// Initialize counter variables for each publication type
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter int
var i = 1
// Initialize hash map
m := make(map[int]Record)
//Open gzipped dblp xml
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
}
defer gz.Close()
//Directly open xml file for testing purposes if needed - be sure to comment out gzip file opening above
//xmlFile, err := os.Open("dblp.xml")
//xmlFile, err := os.Open("TestDblp.xml")
if err != nil {
fmt.Println(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
// defer the closing of XML file so that we can parse it later on
defer xmlFile.Close()
// Initialize main object from Dblp struct
var articles Dblp
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
err = decoder.Decode(&articles.Dblp)
if err != nil {
fmt.Println(err)
}
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %t", err)
} else if t == nil {
break
}
// Here, we inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "dblp":
case "article":
var p Article
decoder.DecodeElement(&p, &se)
increment(&articleCounter)
m[i] = Record{i, articleCounter, "article", p.Year}
increment(&i)
case "inproceedings":
var p InProceedings
decoder.DecodeElement(&p, &se)
increment(&InProceedingsCounter)
m[i] = Record{i, InProceedingsCounter, "inproceedings", p.Year}
increment(&i)
case "proceedings":
var p Proceedings
decoder.DecodeElement(&p, &se)
increment(&ProceedingsCounter)
m[i] = Record{i, ProceedingsCounter, "proceedings", p.Year}
increment(&i)
case "book":
var p Book
decoder.DecodeElement(&p, &se)
increment(&BookCounter)
m[i] = Record{i, BookCounter, "book", p.Year}
increment(&i)
case "incollection":
var p InCollection
decoder.DecodeElement(&p, &se)
increment(&InCollectionCounter)
m[i] = Record{i, InCollectionCounter, "incollection", p.Year}
increment(&i)
case "phdthesis":
var p PhdThesis
decoder.DecodeElement(&p, &se)
increment(&PhdThesisCounter)
m[i] = Record{i, PhdThesisCounter, "phdthesis", p.Year}
increment(&i)
case "mastersthesis":
var p MastersThesis
decoder.DecodeElement(&p, &se)
increment(&mastersThesisCounter)
m[i] = Record{i, mastersThesisCounter, "mastersthesis", p.Year}
increment(&i)
case "www":
var p WWW
decoder.DecodeElement(&p, &se)
increment(&wwwCounter)
m[i] = Record{i, wwwCounter, "www", p.Year}
increment(&i)
}
}
}
log.Println("Element parsing completed in:", time.Since(start))
// All parsed elements have been added to m := make(map[int]Record)
// We can start processing the map.
// First we create a map and count the number of occurrences of each publication type for a given year.
srMap := make(map[SumRecord]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := SumRecord{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
// Create sum csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
// define column headers
sumheaders := []string{
"type",
"year",
"sum",
}
sumwriter.Write(sumheaders)
var SumString string
// Create sorted map by VALUE (integer)
SortedSrMap := map[int]SumRecord{}
SortedSrMapKeys := []int{}
for key, val := range SortedSrMap {
// SortedSrMap[val] = key
// SortedSrMapKeys = append(SortedSrMapKeys, val)
SumString = strconv.Itoa(key)
fmt.Println("sumstring:", SumString, "value: ", val)
}
sort.Ints(SortedSrMapKeys)
// END Create sorted map by VALUE (integer)
// Export sum csv
for key, val := range srMap {
r := make([]string, 0, 1+len(sumheaders))
SumString = strconv.Itoa(val)
r = append(
r,
key.Type,
key.Year,
SumString,
)
sumwriter.Write(r)
}
sumwriter.Flush()
// CREATE RESULTS CSV
log.Println("Creating results csv file")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
// define column headers
headers := []string{
"uid",
"id",
"type",
"year",
}
// write column headers
writer.Write(headers)
var idString string
var uidString string
// Create sorted map
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make([]string, 0, 1+len(headers)) // capacity of 4, 1 + the number of properties our struct has & the number of column headers we are passing
// convert the Record.ID and UID ints to string in order to pass into append()
idString = strconv.Itoa(m[k].ID)
uidString = strconv.Itoa(m[k].UID)
r = append(
r,
uidString,
idString,
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// END CREATE RESULTS CSV
// Finally report results - update below line with more counters as desired
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:", ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:", PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
//log.Println("map:", m)
//log.Println("map length:", len(m))
//log.Println("sum map length:", len(srMap))
//fmt.Println("sum map contents:", srMap)
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func increment(i *int) {
*i = *i + 1
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for this case
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
Main problems and issues I've identified:
- The parsing is quite slow (it takes about 3 minutes 45 seconds) given the size of the file (a 474 MB gzip). Can I improve something to make it faster?
- Can the code be made less verbose, but not at the expense of making it less readable / understandable to a person just starting out with Go? For example, by generalizing the structs which define the different publication types, as well as the switch/case statements?

Tags: parsing, xml, go

asked Mar 11 at 15:20 by orestisf
2 Answers
The decoder.Decode call is unnecessary and in fact throws an error at the moment.
To your second point: yes, the case statements in particular can most likely all be compressed down to a single function, since the branches differ only in a few variables.
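A minimal sketch of that idea (collector, add and the counters map are my own names, not from the code above); every branch only differs in the element name, so one helper plus a per-type counter map can replace the eight near-identical cases:

type collector struct {
	i        int
	counters map[string]int
	m        map[int]Record
}

func (c *collector) add(name, year string) {
	c.i++              // uid, same role as the old i counter
	c.counters[name]++ // per-type counter, same role as articleCounter etc.
	c.m[c.i] = Record{UID: c.i, ID: c.counters[name], Type: name, Year: year}
}

c := &collector{counters: map[string]int{}, m: map[int]Record{}}

// inside the xml.StartElement branch of the token loop, the switch collapses to:
switch se.Name.Local {
case "article", "inproceedings", "proceedings", "book",
	"incollection", "phdthesis", "mastersthesis", "www":
	var p Metadata // the shared fields already expose the year
	decoder.DecodeElement(&p, &se)
	c.add(se.Name.Local, p.Year)
}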
Indexing into a hash map (map[int]Record) is not ideal; with the two million elements in that table it is probably causing a slowdown too. Instead you can simply append the elements to a slice; then everything is already in insertion order and fine for iteration later on, no sorting necessary at all.
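For example (a hedged sketch; records and typeCount are my own names, with typeCount standing for whichever per-type counter the branch maintains):

records := make([]Record, 0) // grows as needed; pre-allocate capacity if the total is roughly known

// in each branch, instead of m[i] = Record{...}:
records = append(records, Record{
	UID:  len(records) + 1,
	ID:   typeCount,
	Type: se.Name.Local,
	Year: p.Year,
})

// export: the slice is already in insertion order, so the whole sort.Ints(keys) pass disappears
for _, rec := range records {
	writer.Write([]string{strconv.Itoa(rec.UID), strconv.Itoa(rec.ID), rec.Type, rec.Year})
}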
And for increment(&i)... just go ahead and increment the counters directly. Making functions is fine in general, but like this it does not help readability (i += 1 is much clearer).
make([]string, 0, 1+len(headers)) is valid, but you can simply create the slice with all elements in place, like []string{uidString, ..., m[k].Year}. It might be even better if you can reuse that slice across all loop iterations.
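Concretely (same field names as the question; the reuse variant relies on the fact that csv.Writer.Write copies the record into its buffer before returning, so the slice may be reused):

r := []string{
	strconv.Itoa(m[k].UID),
	strconv.Itoa(m[k].ID),
	m[k].Type,
	m[k].Year,
}
writer.Write(r)

// or reuse one slice for every row:
row := make([]string, 4)
for _, k := range keys {
	row[0] = strconv.Itoa(m[k].UID)
	row[1] = strconv.Itoa(m[k].ID)
	row[2] = m[k].Type
	row[3] = m[k].Year
	writer.Write(row)
}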
I can't see any other obvious things to change. There's a possibility that getting rid of DecodeElement and doing the whole decoding yourself might improve things, but I'm skeptical. If I, for example, remove the whole switch block, doing essentially nothing but the XML decoding, it still takes three minutes for me, just one minute less than with that block included. Meaning that with this library it's just not going to get much quicker overall.
Thanks for taking the time to review the code! I've reworked it based on your feedback: removed decoder.Decode; created a single function to process the elements I'm interested in, which does the increments / map / slice appends; for the increment functions, indeed inlining them would make the code a bit more readable, however I want to keep them for now for learning's sake; and I'm working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference between them is the xml.Name element.
– orestisf, Mar 17 at 20:01
From what I know about the encoding/xml package, I don't think there's anything that would make the structs more succinct, unfortunately. You could go to a generic nested struct and do the decoding without the per-type struct definitions, though.
– ferada, Mar 17 at 22:25
I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.
Main points:
Only two structs are now used:
type Metadata struct {
Key string `xml:"key,attr"`
Year string `xml:"year"`
Author string `xml:"author"`
Title string `xml:"title"`
}
type Record struct {
UID int
ID int
Type string
Year string
}
The publications are all processed with the following function:
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
The entire code now looks like this:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // currently not in use
Year string `xml:"year"`
Author string `xml:"author"` // currently not in use
Title string `xml:"title"` // currently not in use
}
// Record is used to store each Article's type and year which will be passed as a value to map m
type Record struct {
UID int
ID int
Type string
Year string
}
type Count int
type Counter interface {
Incr() int
ReturnInt() int
}
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count
func main() {
start := time.Now()
//Open gzipped dblp xml
//xmlFile, err := os.Open("TestDblp.xml.gz")
// Uncomment below for actual xml
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
defer gz.Close()
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
if err != nil {
log.Fatal(err)
}
m := make(map[int]Record)
var p Metadata
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done with parsing.
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %t", err)
} else if t == nil {
break
}
// Let's inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "article":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)
case "inproceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)
case "proceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)
case "book":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)
case "incollection":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)
case "phdthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)
case "mastersthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)
case "www":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
}
}
}
log.Println("XML parsing done in:", time.Since(start))
// All parsed elements have been added to m := make(map[int]Record)
// We create the srMap map and count the number of occurrences of each publication type for a given year.
srMap := make(map[Record]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := Record{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
// Create sumresult.csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
sumheaders := []string{
"publicationType",
"year",
"sum",
}
sumwriter.Write(sumheaders)
// Export sumresult.csv
for key, val := range srMap {
r := make([]string, 0, 1+len(sumheaders))
r = append(
r,
key.Type,
key.Year,
strconv.Itoa(val),
)
sumwriter.Write(r)
}
sumwriter.Flush()
// Create result.csv
log.Println("Creating result.csv")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
headers := []string{
"uid",
"id",
"type",
"year",
}
writer.Write(headers)
// Create sorted map
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make([]string, 0, 1+len(headers))
r = append(
r,
strconv.Itoa(m[k].UID),
strconv.Itoa(m[k].ID),
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// Finally report results
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
log.Println("Distinct publication map length:", len(m))
log.Println("Sum map length:", len(srMap))
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
func (c *Count) Incr() int {
*c = *c + 1
return int(*c)
}
func (c *Count) ReturnInt() int {
return int(*c)
}
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
I feel that the csv generation parts can be further streamlined as they are still a bit messy.
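One possible direction, as a rough sketch (writeCSV is a hypothetical helper of mine, not part of the code above): move the repeated create / write-header / flush steps into one function and have the two export loops only build [][]string rows.

func writeCSV(name string, header []string, rows [][]string) error {
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer f.Close()
	w := csv.NewWriter(f)
	if err := w.Write(header); err != nil {
		return err
	}
	// WriteAll flushes internally and returns any underlying write error.
	return w.WriteAll(rows)
}

The two exports would then reduce to something like writeCSV("sumresult.csv", sumheaders, sumRows) and writeCSV("result.csv", headers, resultRows), with the row slices built in the existing loops.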
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
The decoder.Decode
call is unnecessary and in fact throws an error at
the moment.
To your second point, yes, especially the case
statements can all be
compressed down to a single function most likely, since they all only
have a few variables exchanged.
The indexing into a hash map map[int]Record
is not ideal, in fact
that's probably causing a slowdown too with the two million elements in
that table, instead you can simply append
the elements to a slice and
then it's all sorted and fine for iteration later on, no sorting
necessary at all.
And for increment(&i)
... just go ahead and increment the counters.
If you make functions, okay, but like this it's not helping with
readability (i += 1
is much clearer).
make([]string, 0, 1+len(headers)
- well that's valid, but you can
simply create the array with all elements instead, like
[]string{uidString, ..., m[k].Year
etc. Might be even better if you
can reuse that array for all loop iterations.
Well I can't see any other obvious things to change. There's the
possibility that getting rid of DecodeElement
and doing the whole
decoding yourself might improve things, but I'm skeptical. If I, for
example, remove the whole switch
block, doing nothing but XML
decoding essentially, this still takes three minutes for me, essentially
just one minute less than with that block included! Meaning that with
this library it's just not going to get much quicker overall.
$endgroup$
1
$begingroup$
Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removeddecoder.Decode
. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is thexml.Name
element.
$endgroup$
– orestisf
Mar 17 at 20:01
1
$begingroup$
From what I know about theencoding/xml
package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.
$endgroup$
– ferada
Mar 17 at 22:25
answered Mar 11 at 23:37
ferada
$begingroup$
I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.
Main points:
Only two structs are now used:
type Metadata struct {
Key string `xml:"key,attr"`
Year string `xml:"year"`
Author string `xml:"author"`
Title string `xml:"title"`
}
type Record struct {
UID int
ID int
Type string
Year string
}
The publications are all processed with the following function:
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
The entire code now looks like this:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // currently not in use
Year string `xml:"year"`
Author string `xml:"author"` // currently not in use
Title string `xml:"title"` // currently not in use
}
// Record is used to store each Article's type and year which will be passed as a value to map m
type Record struct {
UID int
ID int
Type string
Year string
}
type Count int
type Counter interface {
Incr() int
ReturnInt() int
}
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count
func main() {
start := time.Now()
//Open gzipped dblp xml
//xmlFile, err := os.Open("TestDblp.xml.gz")
// Swap with the commented line above to use the small test file instead
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
defer gz.Close()
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
if err != nil {
log.Fatal(err)
}
m := make(map[int]Record)
var p Metadata
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done with parsing.
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %t", err)
} else if t == nil {
break
}
// Let's inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "article":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)
case "inproceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)
case "proceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)
case "book":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)
case "incollection":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)
case "phdthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)
case "mastersthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)
case "www":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
}
}
}
log.Println("XML parsing done in:", time.Since(start))
// All parsed elements have been added to the map m.
// srMap counts the occurrences of each publication type for a given year.
srMap := make(map[Record]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := Record{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
// Create sumresult.csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
sumheaders := []string{
"publicationType",
"year",
"sum",
}
sumwriter.Write(sumheaders)
// Export sumresult.csv
for key, val := range srMap {
r := make([]string, 0, 1+len(sumheaders))
r = append(
r,
key.Type,
key.Year,
strconv.Itoa(val),
)
sumwriter.Write(r)
}
sumwriter.Flush()
// Create result.csv
log.Println("Creating result.csv")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
headers := []string{
"uid",
"id",
"type",
"year",
}
writer.Write(headers)
// Create sorted map
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make([]string, 0, 1+len(headers))
r = append(
r,
strconv.Itoa(m[k].UID),
strconv.Itoa(m[k].ID),
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// Finally report results
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
log.Println("Distinct publication map length:", len(m))
log.Println("Sum map length:", len(srMap))
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
func (c *Count) Incr() int {
*c = *c + 1
return int(*c)
}
func (c *Count) ReturnInt() int {
return int(*c)
}
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
I feel that the CSV generation parts can be further streamlined, as they are still a bit messy.
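One possible direction (a sketch, not part of the code above) is a small helper that owns the file and writer lifecycle, so that result.csv and sumresult.csv share the same export path:
// writeCSV creates the file, writes the header and all rows, and flushes.
func writeCSV(name string, header []string, rows [][]string) error {
    f, err := os.Create(name)
    if err != nil {
        return err
    }
    defer f.Close()
    w := csv.NewWriter(f)
    if err := w.Write(header); err != nil {
        return err
    }
    if err := w.WriteAll(rows); err != nil { // WriteAll also calls Flush
        return err
    }
    return w.Error()
}
The two export loops would then only build their [][]string rows and call writeCSV once each.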
$endgroup$
answered 2 days ago
orestisf