Reading a large file that doesn't fit in memory


We have a large log file that stores user interactions with an application. Each entry follows the schema {userId, timestamp, actionType}, where actionType is one of two possible values: open or close.

Constraints:

The log file is too big to fit in memory on one machine. Also assume that the aggregated data doesn’t fit into memory.

Code has to be able to run on a single machine.

The solution should not use an out-of-the-box implementation of MapReduce or a third-party database; don’t assume we have Hadoop, Spark, or any other distributed computing framework.

There can be multiple entries of each actionType for each user, and there might be missing entries in the log file. So a user might be missing a close record between two open records or vice versa.

Timestamps will come in strictly ascending order.

For this problem, we need to implement one or more classes that compute the average time each user spends between open and close. Keep in mind that some users have missing entries, so we have to choose how to handle those entries in our calculations. The code should apply that choice consistently.

The desired output for the solution should be [{userId, timeSpent}, …] for all the users in the log file.
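One way to reconcile the memory constraints above with a single machine is to split the log on disk by user before aggregating. The following is a minimal sketch of that idea, not part of the original code: `partition_log`, `out_dir` and `num_buckets` are hypothetical names, and it assumes the number of buckets is chosen so that the per-user state of any one bucket fits in memory.

```python
import os
import zlib


def partition_log(log_path, out_dir, num_buckets=16):
    """Split a log into bucket files so that all records of a user share one bucket.

    Line order is preserved inside each bucket, so timestamps stay ascending.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, "bucket_%d.csv" % i) for i in range(num_buckets)]
    buckets = [open(path, "w") for path in paths]
    try:
        with open(log_path) as log:
            for line in log:  # streams the file one line at a time
                line = line.strip()
                if not line:
                    continue
                user_id = line.split(",", 1)[0]
                # crc32 is stable across runs, unlike the built-in hash() for strings
                index = zlib.crc32(user_id.encode("utf-8")) % num_buckets
                buckets[index].write(line + "\n")
    finally:
        for handle in buckets:
            handle.close()
    return paths
```

Each bucket file can then be aggregated on its own and its results written out before the next bucket is read, so only one bucket's user state is held in memory at a time.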

Sample log file (comma-separated, text file)

1,1435456566,open
2,1435457643,open
3,1435458912,open
1,1435459567,close
4,1435460345,open
1,1435461234,open
2,1435462567,close
1,1435463456,open
3,1435464398,close
4,1435465122,close
1,1435466775,close
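To make the "consistent policy" requirement concrete, here is a minimal sketch of one possible policy, which differs from the timeout-based handling in the implementations below: an open is paired with the next close for the same user, a repeated open replaces the unmatched one, and a close with no pending open is ignored. The function name `average_session_times` is hypothetical, and the policy itself is an assumption, one of several defensible choices.

```python
def average_session_times(lines):
    """Return {userId: average time between a matched open and close}.

    Assumed policy: unmatched events are discarded rather than estimated.
    """
    state = {}  # userId -> [pending open timestamp or None, total time, session count]
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue  # skip malformed rows
        user_id, raw_timestamp, action = parts
        try:
            timestamp = int(raw_timestamp)
        except ValueError:
            continue  # skip rows with a non-numeric timestamp
        entry = state.setdefault(user_id, [None, 0, 0])
        if action == "open":
            entry[0] = timestamp  # a repeated open replaces the unmatched one
        elif action == "close" and entry[0] is not None:
            entry[1] += timestamp - entry[0]
            entry[2] += 1
            entry[0] = None
        # a close without a pending open is ignored
    return {
        user: (total / count if count else 0.0)
        for user, (_, total, count) in state.items()
    }
```

Applied to the sample log above, this policy gives 3160.0 for user 1 (sessions of 3001 and 3319), 4924.0 for user 2, 5486.0 for user 3 and 4777.0 for user 4. Because the function accepts any iterable of lines, it can be fed an open file object without loading the whole file.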

Approach

Here is the code I've written in Python and Scala. It does not seem efficient enough to meet the expectations of the given scenario, so I'd like feedback on how I could optimise it for that scenario.

Scala implementation

import java.io.FileInputStream
import java.util.{Scanner, Map, LinkedList}
import java.lang.Long
import scala.collection.mutable

object UserMetrics extends App {
  if (args.length == 0) {
    println("Please provide input data file name for processing")
    sys.exit(1) // stop here; args(0) below would otherwise throw
  }
  val userMetrics = new UserMetrics()
  // Second argument is the optional timeout used when an event is missing
  userMetrics.readInputFile(args(0), if (args.length == 1) 600000 else args(1).toInt)
}

case class UserInfo(userId: Integer, prevTimeStamp: Long, prevStatus: String, timeSpent: Long, occurence: Integer)

class UserMetrics {

  // One single-element list per user holding the latest aggregated state
  val usermap = mutable.Map[Integer, LinkedList[UserInfo]]()

  def readInputFile(stArr: String, timeOut: Int) {
    var inputStream: FileInputStream = null
    var sc: Scanner = null
    try {
      inputStream = new FileInputStream(stArr);
      sc = new Scanner(inputStream, "UTF-8");
      while (sc.hasNextLine()) {
        val line: String = sc.nextLine();
        processInput(line, timeOut)
      }

      // Print the average time spent per user
      for ((key: Integer, userLs: LinkedList[UserInfo]) <- usermap) {
        val userInfo: UserInfo = userLs.get(0)
        val timespent = if (userInfo.occurence > 0) userInfo.timeSpent / userInfo.occurence else 0
        println("{" + key + "," + timespent + "}")
      }

      if (sc.ioException() != null) {
        throw sc.ioException();
      }
    } finally {
      if (inputStream != null) {
        inputStream.close();
      }
      if (sc != null) {
        sc.close();
      }
    }
  }

  def processInput(line: String, timeOut: Int) {
    val strSp = line.split(",")

    val userId: Integer = Integer.parseInt(strSp(0))
    val curTimeStamp = Long.parseLong(strSp(1))
    val status = strSp(2)
    val uInfo: UserInfo = UserInfo(userId, curTimeStamp, status, 0, 0)
    val emptyUserInfo: LinkedList[UserInfo] = new LinkedList[UserInfo]()

    val lsUserInfo: LinkedList[UserInfo] = usermap.getOrElse(userId, emptyUserInfo)

    if (lsUserInfo != null && lsUserInfo.size() > 0) {
      val lastUserInfo: UserInfo = lsUserInfo.get(lsUserInfo.size() - 1)
      val prevTimeStamp: Long = lastUserInfo.prevTimeStamp
      val prevStatus: String = lastUserInfo.prevStatus

      if (prevStatus.equals("open")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          // open followed by open: the close is missing, so cap the interval at timeOut
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          val timeDiff = lastUserInfo.timeSpent + timeSelector
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        } else if (!status.equals(lastUserInfo.prevStatus)) {
          // open followed by close: a complete session
          val timeDiff = lastUserInfo.timeSpent + curTimeStamp - prevTimeStamp
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        }
      } else if (prevStatus.equals("close")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          // close followed by close: the open is missing, so cap the interval at timeOut
          lsUserInfo.remove()
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent + timeSelector, lastUserInfo.occurence + 1))
        } else if (!status.equals(lastUserInfo.prevStatus)) {
          // close followed by open: a new session starts, totals are unchanged
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent, lastUserInfo.occurence))
        }
      }
    } else if (lsUserInfo.size() == 0) {
      lsUserInfo.add(uInfo)
    }
    usermap.put(userId, lsUserInfo)
  }

}

Python Implementation

# Note: written for Python 2 (dict.iteritems, integer division in fileBlockStream)
import sys


def fileBlockStream(fp, number_of_blocks, block):
    #A generator that splits a file into blocks and iterates over the lines of one of the blocks.

    assert 0 <= block and block < number_of_blocks #Assertions to validate number of blocks given
    assert 0 < number_of_blocks

    fp.seek(0,2) #seek to end of file to compute block size
    file_size = fp.tell()

    ini = file_size * block / number_of_blocks #compute start & end point of file block
    end = file_size * (1 + block) / number_of_blocks

    if ini <= 0:
        fp.seek(0)
    else:
        fp.seek(ini-1)
        fp.readline()

    while fp.tell() < end:
        yield fp.readline() #iterate over lines of the particular chunk or block


def computeResultDS(chunk,avgTimeSpentDict,defaultTimeOut):
    #Array value within dictionary stores [No. of pair events, total time spent (Close tm-Open tm), Last Open Tm, Last Close Tm, Next expected Event]
    #Next expected event: 0 - Open; 1 - Close
    countPos,totTmPos,openTmPos,closeTmPos,nextEventPos = 0,1,2,3,4
    for rows in chunk.splitlines():
        if len(rows.split(",")) != 3:
            continue
        userKeyID = rows.split(",")[0]
        try:
            curTimeStamp = int(rows.split(",")[1])
        except ValueError:
            print("Invalid Timestamp for ID:" + str(userKeyID))
            continue
        curEvent = rows.split(",")[2]
        if userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "close":
            #Existing userID with the expected Close event
            curTotalTime = curTimeStamp - avgTimeSpentDict[userKeyID][openTmPos]
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 0 #Change next expected event to Open

        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "open":
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 1 #Change next expected event to Close

        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "open":
            #Open received while a Close was expected: the Close event is missing
            curTotalTime,closeTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][openTmPos],curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos]=totalTime
            avgTimeSpentDict[userKeyID][closeTmPos]=closeTime
            avgTimeSpentDict[userKeyID][openTmPos]=curTimeStamp
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount

        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "close":
            #Close received while an Open was expected: the Open event is missing
            curTotalTime,openTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][closeTmPos],curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos]=totalTime
            avgTimeSpentDict[userKeyID][openTmPos]=openTime
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount

        elif curEvent == "open":
            #Initialize userid with Open event
            avgTimeSpentDict[userKeyID] = [0,0,curTimeStamp,0,1]

        elif curEvent == "close":
            #Initialize userid with missing handler function since there is no Open event for this User
            totaltime,OpenTime = missingHandler(defaultTimeOut,0,curTimeStamp)
            avgTimeSpentDict[userKeyID] = [1,totaltime,OpenTime,curTimeStamp,0]


def missingHandler(defaultTimeOut,curTimeVal,lastTimeVal):
    #Cap the interval between the two timestamps at defaultTimeOut
    if lastTimeVal - curTimeVal > defaultTimeOut:
        return defaultTimeOut,curTimeVal
    else:
        return lastTimeVal - curTimeVal,curTimeVal


def computeAvg(avgTimeSpentDict,defaultTimeOut):
    resDict = {}
    for k,v in avgTimeSpentDict.iteritems():
        if v[0] == 0:
            resDict[k] = 0
        else:
            resDict[k] = v[1]/v[0]
    return resDict


if __name__ == "__main__":
    avgTimeSpentDict = {}
    if len(sys.argv) < 2:
        print("Please provide input data file name for processing")
        sys.exit(1)

    fileObj = open(sys.argv[1])
    number_of_chunks = 4 if len(sys.argv) < 3 else int(sys.argv[2])
    defaultTimeOut = 60000 if len(sys.argv) < 4 else int(sys.argv[3])
    for chunk_number in range(number_of_chunks):
        for chunk in fileBlockStream(fileObj, number_of_chunks, chunk_number):
            computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut)
    print (computeAvg(avgTimeSpentDict,defaultTimeOut))
    avgTimeSpentDict.clear() #Nullify dictionary
    fileObj.close() #Close the file object

Both programs give the desired output, but efficiency is what matters for this particular scenario. Let me know if you have a better approach or any suggestions for improving the existing implementation.

asked 15 hours ago by Wiki_91 (new contributor); edited 13 mins ago by Jamal♦

How is it possible that the aggregated data doesn't fit in memory? It's ~20 bytes per user. Do you really have a userbase of billions?
– Oh My Goodness, 14 hours ago

This is meant to encourage memory-efficient solutions and critical thinking among programmers in one of our internal org forums.
– Wiki_91, 13 hours ago

Is the problem real or imaginary?
– Oh My Goodness, 12 hours ago

Imaginary. We handle volumes this large on a distributed Hadoop cluster with Spark, but this challenge is to avoid that and solve the same problem on a single machine.
– Wiki_91, 11 hours ago

You've taken a real problem and applied made-up constraints, like a programming puzzle would have, and got the worst of both worlds. The arbitrary One True Solution character of a puzzle is combined with the vagueness, length and tedium of a real problem. I suggest removing a bunch of detail to create a short puzzle, or dropping the fake restrictions and adding real context like "actual size of input" and "actual available memory" to describe an authentic engineering problem.
– Oh My Goodness, 11 hours ago

Constraints:

The log file is too big to fit in memory on one machine. Also assume that the aggregated data doesn’t fit into memory.

Code has to be able to run on a single machine.

Should not use an out-of-the box implementation of mapreduce or 3rd party database; don’t assume we have a Hadoop or Spark or other distributed computing framework.

There can be multiple entries of each actionType for each user, and there might be missing entries in the log file. So a user might be missing a close record between two open records or vice versa.

Timestamps will come in strictly ascending order.

The desired output for the solution should be [{userId, timeSpent},….] for all the users in the log file.

Sample log file (comma-separated, text file)

1,1435456566,open 

2,1435457643,open 

3,1435458912,open 

1,1435459567,close 

4,1435460345,open 

1,1435461234,open 

2,1435462567,close 

1,1435463456,open 

3,1435464398,close 

4,1435465122,close 

1,1435466775,close

Approach

Scala implementation

import java.io.FileInputStream

import java.util.{Scanner, Map, LinkedList}

import java.lang.Long

import scala.collection.mutable



object UserMetrics extends App {

  if (args.length == 0) {

    println("Please provide input data file name for processing")

  } 

  val userMetrics = new UserMetrics()

  userMetrics.readInputFile(args(0),if (args.length == 1) 600000 else args(1).toInt)

}



case class UserInfo(userId: Integer, prevTimeStamp: Long, prevStatus: String, timeSpent: Long, occurence: Integer)



class UserMetrics {



  val usermap = mutable.Map[Integer, LinkedList[UserInfo]]()



  def readInputFile(stArr:String, timeOut: Int) {

    var inputStream: FileInputStream = null

    var sc: Scanner = null

    try {

      inputStream = new FileInputStream(stArr);

      sc = new Scanner(inputStream, "UTF-8");

      while (sc.hasNextLine()) {

        val line: String = sc.nextLine();

        processInput(line, timeOut)

      }



      for ((key: Integer, userLs: LinkedList[UserInfo]) <- usermap) {

        val userInfo:UserInfo = userLs.get(0)

        val timespent = if (userInfo.occurence>0) userInfo.timeSpent/userInfo.occurence else 0

        println("{" + key +","+timespent + "}")

      }



      if (sc.ioException() != null) {

        throw sc.ioException();

      }

    } finally {

      if (inputStream != null) {

        inputStream.close();

      }

      if (sc != null) {

        sc.close();

      }

    }

  }



  def processInput(line: String, timeOut: Int) {

    val strSp = line.split(",")



    val userId: Integer = Integer.parseInt(strSp(0))

    val curTimeStamp = Long.parseLong(strSp(1))

    val status = strSp(2)

    val uInfo: UserInfo = UserInfo(userId, curTimeStamp, status, 0, 0)

    val emptyUserInfo: LinkedList[UserInfo] = new LinkedList[UserInfo]()



    val lsUserInfo: LinkedList[UserInfo] = usermap.getOrElse(userId, emptyUserInfo)



    if (lsUserInfo != null && lsUserInfo.size() > 0) {

      val lastUserInfo: UserInfo = lsUserInfo.get(lsUserInfo.size() - 1)

      val prevTimeStamp: Long = lastUserInfo.prevTimeStamp

      val prevStatus: String = lastUserInfo.prevStatus



      if (prevStatus.equals("open")) {

        if (status.equals(lastUserInfo.prevStatus)) {

           val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp

           val timeDiff = lastUserInfo.timeSpent + timeSelector

          lsUserInfo.remove()

          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))

        } else if(!status.equals(lastUserInfo.prevStatus)){

          val timeDiff = lastUserInfo.timeSpent + curTimeStamp - prevTimeStamp

          lsUserInfo.remove()

          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))

        }

      } else if(prevStatus.equals("close")) {

        if (status.equals(lastUserInfo.prevStatus)) {

          lsUserInfo.remove()

          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp

          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent + timeSelector, lastUserInfo.occurence+1))

        }else if(!status.equals(lastUserInfo.prevStatus))

          {     

          lsUserInfo.remove()

          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent, lastUserInfo.occurence))

        }

      }

    }else if(lsUserInfo.size()==0){

      lsUserInfo.add(uInfo)

    }

    usermap.put(userId, lsUserInfo)

  }



}

Python Implementation

import sys



def fileBlockStream(fp, number_of_blocks, block):

    #A generator that splits a file into blocks and iterates over the lines of one of the blocks.



    assert 0 <= block and block < number_of_blocks #Assertions to validate number of blocks given

    assert 0 < number_of_blocks



    fp.seek(0,2) #seek to end of file to compute block size

    file_size = fp.tell() 



    ini = file_size * block / number_of_blocks #compute start & end point of file block

    end = file_size * (1 + block) / number_of_blocks



    if ini <= 0:

        fp.seek(0)

    else:

        fp.seek(ini-1)

        fp.readline()



    while fp.tell() < end:

        yield fp.readline() #iterate over lines of the particular chunk or block



def computeResultDS(chunk,avgTimeSpentDict,defaultTimeOut):

    countPos,totTmPos,openTmPos,closeTmPos,nextEventPos = 0,1,2,3,4

    for rows in chunk.splitlines():

        if len(rows.split(",")) != 3:

            continue

        userKeyID = rows.split(",")[0]

        try:

            curTimeStamp = int(rows.split(",")[1])

        except ValueError:

            print("Invalid Timestamp for ID:" + str(userKeyID))

            continue

        curEvent = rows.split(",")[2]

        if userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "close": 

        #Check if already existing userID with expected Close event 0 - Open; 1 - Close

        #Array value within dictionary stores [No. of pair events, total time spent (Close tm-Open tm), Last Open Tm, Last Close Tm, Next expected Event]

            curTotalTime = curTimeStamp - avgTimeSpentDict[userKeyID][openTmPos]

            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]

            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1

            avgTimeSpentDict[userKeyID][countPos] = eventCount

            avgTimeSpentDict[userKeyID][totTmPos] = totalTime

            avgTimeSpentDict[userKeyID][closeTmPos] = curTimeStamp

            avgTimeSpentDict[userKeyID][nextEventPos] = 0 #Change next expected event to Open



        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "open":

            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp

            avgTimeSpentDict[userKeyID][nextEventPos] = 1 #Change next expected event to Close



        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "open":

            curTotalTime,closeTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][openTmPos],curTimeStamp)

            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]

            avgTimeSpentDict[userKeyID][totTmPos]=totalTime

            avgTimeSpentDict[userKeyID][closeTmPos]=closeTime

            avgTimeSpentDict[userKeyID][openTmPos]=curTimeStamp

            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1

            avgTimeSpentDict[userKeyID][countPos] = eventCount          



        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "close": 

            curTotalTime,openTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][closeTmPos],curTimeStamp)

            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]

            avgTimeSpentDict[userKeyID][totTmPos]=totalTime

            avgTimeSpentDict[userKeyID][openTmPos]=openTime

            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1

            avgTimeSpentDict[userKeyID][countPos] = eventCount



        elif curEvent == "open":

            #Initialize userid with Open event

            avgTimeSpentDict[userKeyID] = [0,0,curTimeStamp,0,1]



        elif curEvent == "close":

            #Initialize userid with missing handler function since there is no Open event for this User

            totaltime,OpenTime = missingHandler(defaultTimeOut,0,curTimeStamp)

            avgTimeSpentDict[userKeyID] = [1,totaltime,OpenTime,curTimeStamp,0]



def missingHandler(defaultTimeOut,curTimeVal,lastTimeVal):

    if lastTimeVal - curTimeVal > defaultTimeOut:

        return defaultTimeOut,curTimeVal

    else:

        return lastTimeVal - curTimeVal,curTimeVal



def computeAvg(avgTimeSpentDict,defaultTimeOut):

    resDict = {}

    for k,v in avgTimeSpentDict.iteritems():

        if v[0] == 0:

            resDict[k] = 0

        else:

            resDict[k] = v[1]/v[0]

    return resDict



if __name__ == "__main__":

    avgTimeSpentDict = {}

    if len(sys.argv) < 2:

        print("Please provide input data file name for processing")

        sys.exit(1)



    fileObj = open(sys.argv[1])

    number_of_chunks = 4 if len(sys.argv) < 3 else int(sys.argv[2])

    defaultTimeOut = 60000 if len(sys.argv) < 4 else int(sys.argv[3])

    for chunk_number in range(number_of_chunks):

        for chunk in fileBlockStream(fileObj, number_of_chunks, chunk_number):

            computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut)

    print (computeAvg(avgTimeSpentDict,defaultTimeOut))

    avgTimeSpentDict.clear() #Nullify dictionary 

    fileObj.close #Close the file object

Both programs give the desired output, but efficiency is what matters for this particular scenario. Let me know if you have anything better or any suggestions on the existing implementation.

edited 13 mins ago

Jamal♦

30.6k11121227

asked 15 hours ago

Wiki_91

New contributor

$begingroup$
how is it possible that the aggregated data doesn't fit in memory? It's ~20 bytes per user - you really have a userbase of billions?
$endgroup$
– Oh My Goodness
14 hours ago

$begingroup$
This is to bring out memory efficient solution and critical thinking among programmers in one of our internal org forum.
$endgroup$
– Wiki_91
13 hours ago

$begingroup$
is the problem real or imaginary?
$endgroup$
– Oh My Goodness
12 hours ago

$begingroup$
Imaginery.. We handle such huge volume in distributed Hadoop cluster with spark. But this challenge is to avoid and handle the same solution in single machine.
$endgroup$
– Wiki_91
11 hours ago

1

$begingroup$
you've taken a real problem and applied made-up constraints, like a programming puzzle would have, and got the worst of both worlds. The arbitrary One True Solution character of a puzzle is combined with the vagueness, length and tedium of a real problem. I suggest to remove a bunch of detail to create a short puzzle, or drop the fake restrictions and add real context like "actual size of input" and "actual available memory" to describe an authentic engineering problem.
$endgroup$
– Oh My Goodness
11 hours ago

|
show 1 more comment

Constraints:

The log file is too big to fit in memory on one machine. Also assume that the aggregated data doesn’t fit into memory.

Code has to be able to run on a single machine.

Should not use an out-of-the box implementation of mapreduce or 3rd party database; don’t assume we have a Hadoop or Spark or other distributed computing framework.

There can be multiple entries of each actionType for each user, and there might be missing entries in the log file. So a user might be missing a close record between two open records or vice versa.

Timestamps will come in strictly ascending order.

The desired output for the solution should be [{userId, timeSpent},….] for all the users in the log file.

Sample log file (comma-separated, text file)

1,1435456566,open 

2,1435457643,open 

3,1435458912,open 

1,1435459567,close 

4,1435460345,open 

1,1435461234,open 

2,1435462567,close 

1,1435463456,open 

3,1435464398,close 

4,1435465122,close 

1,1435466775,close

Approach

Scala implementation

import java.io.FileInputStream

import java.util.{Scanner, Map, LinkedList}

import java.lang.Long

import scala.collection.mutable



object UserMetrics extends App {

  if (args.length == 0) {

    println("Please provide input data file name for processing")

  } 

  val userMetrics = new UserMetrics()

  userMetrics.readInputFile(args(0),if (args.length == 1) 600000 else args(1).toInt)

}



case class UserInfo(userId: Integer, prevTimeStamp: Long, prevStatus: String, timeSpent: Long, occurence: Integer)



class UserMetrics {



  val usermap = mutable.Map[Integer, LinkedList[UserInfo]]()



  def readInputFile(stArr:String, timeOut: Int) {

    var inputStream: FileInputStream = null

    var sc: Scanner = null

    try {

      inputStream = new FileInputStream(stArr);

      sc = new Scanner(inputStream, "UTF-8");

      while (sc.hasNextLine()) {

        val line: String = sc.nextLine();

        processInput(line, timeOut)

      }



      for ((key: Integer, userLs: LinkedList[UserInfo]) <- usermap) {

        val userInfo:UserInfo = userLs.get(0)

        val timespent = if (userInfo.occurence>0) userInfo.timeSpent/userInfo.occurence else 0

        println("{" + key +","+timespent + "}")

      }



      if (sc.ioException() != null) {

        throw sc.ioException();

      }

    } finally {

      if (inputStream != null) {

        inputStream.close();

      }

      if (sc != null) {

        sc.close();

      }

    }

  }



  def processInput(line: String, timeOut: Int) {

    val strSp = line.split(",")



    val userId: Integer = Integer.parseInt(strSp(0))

    val curTimeStamp = Long.parseLong(strSp(1))

    val status = strSp(2)

    val uInfo: UserInfo = UserInfo(userId, curTimeStamp, status, 0, 0)

    val emptyUserInfo: LinkedList[UserInfo] = new LinkedList[UserInfo]()



    val lsUserInfo: LinkedList[UserInfo] = usermap.getOrElse(userId, emptyUserInfo)



    if (lsUserInfo != null && lsUserInfo.size() > 0) {

      val lastUserInfo: UserInfo = lsUserInfo.get(lsUserInfo.size() - 1)

      val prevTimeStamp: Long = lastUserInfo.prevTimeStamp

      val prevStatus: String = lastUserInfo.prevStatus



      if (prevStatus.equals("open")) {

        if (status.equals(lastUserInfo.prevStatus)) {

           val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp

           val timeDiff = lastUserInfo.timeSpent + timeSelector

          lsUserInfo.remove()

          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))

        } else if(!status.equals(lastUserInfo.prevStatus)){

          val timeDiff = lastUserInfo.timeSpent + curTimeStamp - prevTimeStamp

          lsUserInfo.remove()

          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))

        }

      } else if(prevStatus.equals("close")) {

        if (status.equals(lastUserInfo.prevStatus)) {

          lsUserInfo.remove()

          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp

          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent + timeSelector, lastUserInfo.occurence+1))

        }else if(!status.equals(lastUserInfo.prevStatus))

          {     

          lsUserInfo.remove()

          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent, lastUserInfo.occurence))

        }

      }

    }else if(lsUserInfo.size()==0){

      lsUserInfo.add(uInfo)

    }

    usermap.put(userId, lsUserInfo)

  }



}

Python Implementation

import sys



def fileBlockStream(fp, number_of_blocks, block):

    #A generator that splits a file into blocks and iterates over the lines of one of the blocks.



    assert 0 <= block and block < number_of_blocks #Assertions to validate number of blocks given

    assert 0 < number_of_blocks



    fp.seek(0,2) #seek to end of file to compute block size

    file_size = fp.tell() 



    ini = file_size * block / number_of_blocks #compute start & end point of file block

    end = file_size * (1 + block) / number_of_blocks



    if ini <= 0:

        fp.seek(0)

    else:

        fp.seek(ini-1)

        fp.readline()



    while fp.tell() < end:

        yield fp.readline() #iterate over lines of the particular chunk or block



def computeResultDS(chunk,avgTimeSpentDict,defaultTimeOut):

    countPos,totTmPos,openTmPos,closeTmPos,nextEventPos = 0,1,2,3,4

    for rows in chunk.splitlines():

        if len(rows.split(",")) != 3:

            continue

        userKeyID = rows.split(",")[0]

        try:

            curTimeStamp = int(rows.split(",")[1])

        except ValueError:

            print("Invalid Timestamp for ID:" + str(userKeyID))

            continue

        curEvent = rows.split(",")[2]

        if userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "close": 

        #Check if already existing userID with expected Close event 0 - Open; 1 - Close

        #Array value within dictionary stores [No. of pair events, total time spent (Close tm-Open tm), Last Open Tm, Last Close Tm, Next expected Event]

            curTotalTime = curTimeStamp - avgTimeSpentDict[userKeyID][openTmPos]

            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]

            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1

            avgTimeSpentDict[userKeyID][countPos] = eventCount

            avgTimeSpentDict[userKeyID][totTmPos] = totalTime

            avgTimeSpentDict[userKeyID][closeTmPos] = curTimeStamp

            avgTimeSpentDict[userKeyID][nextEventPos] = 0 #Change next expected event to Open



        elif userKeyID in avgTimeSpentDict and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "open":

            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp

            avgTimeSpentDict[userKeyID][nextEventPos] = 1 #Change next expected event to Close



        elif userKeyID in avgTimeSpentDict and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "open":

            curTotalTime,closeTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][openTmPos],curTimeStamp)

            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]

            avgTimeSpentDict[userKeyID][totTmPos]=totalTime

            avgTimeSpentDict[userKeyID][closeTmPos]=closeTime

            avgTimeSpentDict[userKeyID][openTmPos]=curTimeStamp

            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1

            avgTimeSpentDict[userKeyID][countPos] = eventCount          



        elif userKeyID in avgTimeSpentDict and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "close":

            curTotalTime,openTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][closeTmPos],curTimeStamp)

            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]

            avgTimeSpentDict[userKeyID][totTmPos]=totalTime

            avgTimeSpentDict[userKeyID][openTmPos]=openTime

            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1

            avgTimeSpentDict[userKeyID][countPos] = eventCount



        elif curEvent == "open":

            #Initialize userid with Open event

            avgTimeSpentDict[userKeyID] = [0,0,curTimeStamp,0,1]



        elif curEvent == "close":

            #Initialize userid with missing handler function since there is no Open event for this User

            totaltime,OpenTime = missingHandler(defaultTimeOut,0,curTimeStamp)

            avgTimeSpentDict[userKeyID] = [1,totaltime,OpenTime,curTimeStamp,0]



def missingHandler(defaultTimeOut,curTimeVal,lastTimeVal):

    if lastTimeVal - curTimeVal > defaultTimeOut:

        return defaultTimeOut,curTimeVal

    else:

        return lastTimeVal - curTimeVal,curTimeVal



def computeAvg(avgTimeSpentDict,defaultTimeOut):

    resDict = {}

    for k,v in avgTimeSpentDict.items():

        if v[0] == 0:

            resDict[k] = 0

        else:

            resDict[k] = v[1]/v[0]

    return resDict



if __name__ == "__main__":

    avgTimeSpentDict = {}

    if len(sys.argv) < 2:

        print("Please provide input data file name for processing")

        sys.exit(1)



    fileObj = open(sys.argv[1])

    number_of_chunks = 4 if len(sys.argv) < 3 else int(sys.argv[2])

    defaultTimeOut = 60000 if len(sys.argv) < 4 else int(sys.argv[3])

    for chunk_number in range(number_of_chunks):

        for chunk in fileBlockStream(fileObj, number_of_chunks, chunk_number):

            computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut)

    print (computeAvg(avgTimeSpentDict,defaultTimeOut))

    avgTimeSpentDict.clear() #Nullify dictionary 

    fileObj.close() #Close the file object

Both programs give the desired output, but efficiency is what matters for this particular scenario. Let me know if you have anything better or any suggestions on the existing implementation.
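
For comparison, here is a minimal Python 3 sketch of a plain sequential pass. It is not the code under review: the function name `average_session_times`, the `[count, total, open_ts]` state layout, and the rule for unmatched events (an `open` followed by another `open` counts as a session capped at the timeout, and a `close` with no pending `open` is ignored) are assumptions made for illustration. Iterating over a file object already reads one line at a time, so manual block seeking should not be needed to keep memory bounded by the number of distinct users.

```python
def average_session_times(lines, timeout=60000):
    """Stream "userId,timestamp,event" rows and return {userId: average time}.

    Assumed rules (hypothetical, chosen for this sketch only):
    - open -> close counts the full elapsed time as one session;
    - open -> open counts one session capped at `timeout` (missing close);
    - a close with no pending open is ignored.
    """
    state = {}  # userId -> [session_count, total_time, pending_open_timestamp]
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue  # skip malformed rows
        user_id, raw_ts, event = parts
        if event not in ("open", "close"):
            continue  # skip unknown event types
        try:
            ts = int(raw_ts)
        except ValueError:
            continue  # skip rows with a non-numeric timestamp
        entry = state.setdefault(user_id, [0, 0, None])
        open_ts = entry[2]
        if open_ts is not None:
            elapsed = ts - open_ts
            if event == "open":
                elapsed = min(elapsed, timeout)  # missing close: cap the session
            entry[0] += 1
            entry[1] += elapsed
        # Remember the timestamp only while an open is pending.
        entry[2] = ts if event == "open" else None
    # Integer average per user; users with no completed session get 0.
    return {uid: (total // count if count else 0)
            for uid, (count, total, _open_ts) in state.items()}


# Usage (the file name is hypothetical):
# with open("events.log") as log:
#     print(average_session_times(log))
```

Splitting each row once and keeping a small fixed-size list per user avoids the repeated `split` calls and key lookups, which is where most of the per-line cost goes in a pass like this.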

asked 15 hours ago by Wiki_91 (new contributor); edited 13 mins ago by Jamal♦

python scala memory-optimization

How is it possible that the aggregated data doesn't fit in memory? It's ~20 bytes per user - you really have a userbase of billions?
– Oh My Goodness
14 hours ago

This is meant to bring out memory-efficient solutions and critical thinking among programmers in one of our internal org forums.
– Wiki_91
13 hours ago

Is the problem real or imaginary?
– Oh My Goodness
12 hours ago

Imaginary. We handle such huge volumes in a distributed Hadoop cluster with Spark. But this challenge is to avoid that and handle the same problem on a single machine.
– Wiki_91
11 hours ago

You've taken a real problem and applied made-up constraints, like a programming puzzle would have, and got the worst of both worlds. The arbitrary One True Solution character of a puzzle is combined with the vagueness, length and tedium of a real problem. I suggest removing a bunch of detail to create a short puzzle, or dropping the fake restrictions and adding real context like "actual size of input" and "actual available memory" to describe an authentic engineering problem.
– Oh My Goodness
11 hours ago
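
If the per-user state discussed in these comments really cannot fit in memory, one common single-machine technique is to first partition the log on disk by a hash of the user ID, so that each user's records land in exactly one smaller file. The sketch below is an illustration under stated assumptions, not part of the original question: the function name `partition_by_user`, the bucket file names, and the choice of `zlib.crc32` as the hash are all hypothetical.

```python
import os
import zlib


def partition_by_user(lines, num_buckets, directory):
    """Split "userId,timestamp,event" rows into bucket files by user ID.

    Every row of a given user goes to the same bucket, and row order is
    preserved inside each bucket, so timestamps stay ascending per user.
    Returns the list of bucket file paths.
    """
    paths = [os.path.join(directory, "bucket_%d.csv" % i)
             for i in range(num_buckets)]
    files = [open(path, "w") for path in paths]
    try:
        for line in lines:
            user_id = line.split(",", 1)[0].strip()
            if not user_id:
                continue  # skip blank rows
            # crc32 is stable across runs, unlike the built-in hash() for str.
            bucket = zlib.crc32(user_id.encode("utf-8")) % num_buckets
            files[bucket].write(line.rstrip("\n") + "\n")
    finally:
        for handle in files:
            handle.close()
    return paths
```

Each bucket can then be aggregated independently, one file at a time, with only that bucket's users held in memory. The number of buckets is a tuning choice: it has to stay below the operating system's open-file limit, since this sketch keeps all bucket files open during the pass.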
