Function figuring out which close-by value comes from which input valueUpdating list of Dictionaries from...

How do you conduct xenoanthropology after first contact?

XeLaTeX and pdfLaTeX ignore hyphenation

Chess with symmetric move-square

Should I join office cleaning event for free?

If Manufacturer spice model and Datasheet give different values which should I use?

Can a German sentence have two subjects?

Calculus Optimization - Point on graph closest to given point

How is it possible for user's password to be changed after storage was encrypted? (on OS X, Android)

Why doesn't Newton's third law mean a person bounces back to where they started when they hit the ground?

Concept of linear mappings are confusing me

Why is an old chain unsafe?

The magic money tree problem

Is it possible to do 50 km distance without any previous training?

"which" command doesn't work / path of Safari?

Is there a minimum number of transactions in a block?

Are there any consumables that function as addictive (psychedelic) drugs?

Can I interfere when another PC is about to be attacked?

Example of a relative pronoun

Circuitry of TV splitters

Why has Russell's definition of numbers using equivalence classes been finally abandoned? ( If it has actually been abandoned).

Why don't electron-positron collisions release infinite energy?

Banach space and Hilbert space topology

What typically incentivizes a professor to change jobs to a lower ranking university?

Why CLRS example on residual networks does not follows its formula?



Function figuring out which close-by value comes from which input value


Updating list of Dictionaries from other such listCompute the box covering on a graph using CPythonReplacing nodal values in a mesh with >1e6 inputs selectively using a polygonImplement LSD radix sort in pythonRed-black tree appears to be slower than std::multimapFind the longest rectilinear path in a matrix with descending entriesUsing dictionary.get() in if conditionFilling values for nested dictionariesCombine data from a list and map based on criteriaOptimization of BFS for finding minimum distances between elements of a grid






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ margin-bottom:0;
}







2












$begingroup$


I would like to optimize the below traceback function. This function is part of an elementary step of the program and is called a lot...



import numpy as np

def traceback(tuple_node, tuple_node_alt):
"""
Compute which value from tuple_node_alt comes from which value from tuple_node.
Return a dictionnary where the key are the values from tuple_node and the values are the idx at which
the value may be located in tuple_node_alt.
"""
# Compute the tolerances based on the node
tolerances = [0.1 if x <= 100 else 0.2 for x in tuple_node]

# Traceback
distance = dict()
alt_identification = dict()
for k, x in enumerate(tuple_node):
distance[k] = [abs(elt-1) for elt in [alt_x/x for alt_x in tuple_node_alt]]
alt_identification[x] = list(np.where([elt <= tolerances[k]+0.00001 for elt in distance[k]])[0])

# Controls the identification and corrects it
len_values = {key: len(val) for key, val in alt_identification.items()}

if all([x <= 1 for x in len_values.values()]):
return alt_identification
else:
for key, value in alt_identification.items():
if len(value) <= 1:
continue
else:
other_values = [val for k, val in alt_identification.items() if k != key]
if value in other_values:
continue
else:
for val in other_values:
set1 = set(value)
intersec = set1.intersection(set(val))
if len(intersec) == 0:
continue
else:
alt_identification[key] = [v for v in value if v not in intersec]

return alt_identification


The input is composed of 2 tuples which do not need to have the same size. e.g.



tuple_node = (40, 50, 60, 80)
tuple_node_alt = (87, 48, 59, 39)


The goal is to figure out which value from tuple_node_alt may come from which value from tuple_node. If the value from tuple_node_alt is within a 10% margin from a value from tuple_node, it is considered that it comes from this value.



e.g. 39 is within a 10% margin of 40. It comes from 40.
This aprt is perform in the "Traceback" section, where a distance dictionnary is computed and where the idx are computed. With the example above, the output is:



Out[67]: {40: [3], 50: [1], 60: [2], 80: [0]}


However, because of a potential overlapping of the tolerance band, 3 scenarios exists:



Scenario 1: each value has been identified to one alternative value. That's the case above.



Scenario 2:



tuple_node = (40, 50, 60, 80)
tuple_node_alt = (42, 55, 54)


55 and 54 are both in the tolerance band of both 50 and 60. Thus, the output is:



Out[66]: {40: [0], 50: [1, 2], 60: [1, 2], 80: []}


Scenario 3:



tuple_node = (40, 50, 60)
tuple_node_alt = (42, 55, 59)


This is when the control part comes in play. With this input, alt_identification becomes: Out[66]: {40: [0], 50: [1], 60: [1, 2], 80: []}. However, the 55 can not come from 60 since 50 only has one possibility: 55. Thus, this number being already taken, the correct output which is provided through the control & correct section is:



Out[66]: {40: [0], 50: [1], 60: [2], 80: []}


I would really like to optimize this part and to make it a lot more quicker. At the moment, it takes:



# With an input which does not enter the control & correct part.
node = (40, 50, 60, 80)
node_alt = (39, 48, 59, 87)
%timeit traceback(node, node_alt)
22.6 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# With an input which need correction
node = (40, 50, 60, 100)
node_alt = (42, 55, 59, 89)
%timeit traceback(node, node_alt)
28.1 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)









share|improve this question











$endgroup$



















    2












    $begingroup$


    I would like to optimize the below traceback function. This function is part of an elementary step of the program and is called a lot...



    import numpy as np

    def traceback(tuple_node, tuple_node_alt):
    """
    Compute which value from tuple_node_alt comes from which value from tuple_node.
    Return a dictionnary where the key are the values from tuple_node and the values are the idx at which
    the value may be located in tuple_node_alt.
    """
    # Compute the tolerances based on the node
    tolerances = [0.1 if x <= 100 else 0.2 for x in tuple_node]

    # Traceback
    distance = dict()
    alt_identification = dict()
    for k, x in enumerate(tuple_node):
    distance[k] = [abs(elt-1) for elt in [alt_x/x for alt_x in tuple_node_alt]]
    alt_identification[x] = list(np.where([elt <= tolerances[k]+0.00001 for elt in distance[k]])[0])

    # Controls the identification and corrects it
    len_values = {key: len(val) for key, val in alt_identification.items()}

    if all([x <= 1 for x in len_values.values()]):
    return alt_identification
    else:
    for key, value in alt_identification.items():
    if len(value) <= 1:
    continue
    else:
    other_values = [val for k, val in alt_identification.items() if k != key]
    if value in other_values:
    continue
    else:
    for val in other_values:
    set1 = set(value)
    intersec = set1.intersection(set(val))
    if len(intersec) == 0:
    continue
    else:
    alt_identification[key] = [v for v in value if v not in intersec]

    return alt_identification


    The input is composed of 2 tuples which do not need to have the same size. e.g.



    tuple_node = (40, 50, 60, 80)
    tuple_node_alt = (87, 48, 59, 39)


    The goal is to figure out which value from tuple_node_alt may come from which value from tuple_node. If the value from tuple_node_alt is within a 10% margin from a value from tuple_node, it is considered that it comes from this value.



    e.g. 39 is within a 10% margin of 40. It comes from 40.
    This aprt is perform in the "Traceback" section, where a distance dictionnary is computed and where the idx are computed. With the example above, the output is:



    Out[67]: {40: [3], 50: [1], 60: [2], 80: [0]}


    However, because of a potential overlapping of the tolerance band, 3 scenarios exists:



    Scenario 1: each value has been identified to one alternative value. That's the case above.



    Scenario 2:



    tuple_node = (40, 50, 60, 80)
    tuple_node_alt = (42, 55, 54)


    55 and 54 are both in the tolerance band of both 50 and 60. Thus, the output is:



    Out[66]: {40: [0], 50: [1, 2], 60: [1, 2], 80: []}


    Scenario 3:



    tuple_node = (40, 50, 60)
    tuple_node_alt = (42, 55, 59)


    This is when the control part comes in play. With this input, alt_identification becomes: Out[66]: {40: [0], 50: [1], 60: [1, 2], 80: []}. However, the 55 can not come from 60 since 50 only has one possibility: 55. Thus, this number being already taken, the correct output which is provided through the control & correct section is:



    Out[66]: {40: [0], 50: [1], 60: [2], 80: []}


    I would really like to optimize this part and to make it a lot more quicker. At the moment, it takes:



    # With an input which does not enter the control & correct part.
    node = (40, 50, 60, 80)
    node_alt = (39, 48, 59, 87)
    %timeit traceback(node, node_alt)
    22.6 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    # With an input which need correction
    node = (40, 50, 60, 100)
    node_alt = (42, 55, 59, 89)
    %timeit traceback(node, node_alt)
    28.1 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)









    share|improve this question











    $endgroup$















      2












      2








      2





      $begingroup$


      I would like to optimize the below traceback function. This function is part of an elementary step of the program and is called a lot...



      import numpy as np

      def traceback(tuple_node, tuple_node_alt):
      """
      Compute which value from tuple_node_alt comes from which value from tuple_node.
      Return a dictionnary where the key are the values from tuple_node and the values are the idx at which
      the value may be located in tuple_node_alt.
      """
      # Compute the tolerances based on the node
      tolerances = [0.1 if x <= 100 else 0.2 for x in tuple_node]

      # Traceback
      distance = dict()
      alt_identification = dict()
      for k, x in enumerate(tuple_node):
      distance[k] = [abs(elt-1) for elt in [alt_x/x for alt_x in tuple_node_alt]]
      alt_identification[x] = list(np.where([elt <= tolerances[k]+0.00001 for elt in distance[k]])[0])

      # Controls the identification and corrects it
      len_values = {key: len(val) for key, val in alt_identification.items()}

      if all([x <= 1 for x in len_values.values()]):
      return alt_identification
      else:
      for key, value in alt_identification.items():
      if len(value) <= 1:
      continue
      else:
      other_values = [val for k, val in alt_identification.items() if k != key]
      if value in other_values:
      continue
      else:
      for val in other_values:
      set1 = set(value)
      intersec = set1.intersection(set(val))
      if len(intersec) == 0:
      continue
      else:
      alt_identification[key] = [v for v in value if v not in intersec]

      return alt_identification


      The input is composed of 2 tuples which do not need to have the same size. e.g.



      tuple_node = (40, 50, 60, 80)
      tuple_node_alt = (87, 48, 59, 39)


      The goal is to figure out which value from tuple_node_alt may come from which value from tuple_node. If the value from tuple_node_alt is within a 10% margin from a value from tuple_node, it is considered that it comes from this value.



      e.g. 39 is within a 10% margin of 40. It comes from 40.
      This aprt is perform in the "Traceback" section, where a distance dictionnary is computed and where the idx are computed. With the example above, the output is:



      Out[67]: {40: [3], 50: [1], 60: [2], 80: [0]}


      However, because of a potential overlapping of the tolerance band, 3 scenarios exists:



      Scenario 1: each value has been identified to one alternative value. That's the case above.



      Scenario 2:



      tuple_node = (40, 50, 60, 80)
      tuple_node_alt = (42, 55, 54)


      55 and 54 are both in the tolerance band of both 50 and 60. Thus, the output is:



      Out[66]: {40: [0], 50: [1, 2], 60: [1, 2], 80: []}


      Scenario 3:



      tuple_node = (40, 50, 60)
      tuple_node_alt = (42, 55, 59)


      This is when the control part comes in play. With this input, alt_identification becomes: Out[66]: {40: [0], 50: [1], 60: [1, 2], 80: []}. However, the 55 can not come from 60 since 50 only has one possibility: 55. Thus, this number being already taken, the correct output which is provided through the control & correct section is:



      Out[66]: {40: [0], 50: [1], 60: [2], 80: []}


      I would really like to optimize this part and to make it a lot more quicker. At the moment, it takes:



      # With an input which does not enter the control & correct part.
      node = (40, 50, 60, 80)
      node_alt = (39, 48, 59, 87)
      %timeit traceback(node, node_alt)
      22.6 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

      # With an input which need correction
      node = (40, 50, 60, 100)
      node_alt = (42, 55, 59, 89)
      %timeit traceback(node, node_alt)
      28.1 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)









      share|improve this question











      $endgroup$




      I would like to optimize the below traceback function. This function is part of an elementary step of the program and is called a lot...



      import numpy as np

      def traceback(tuple_node, tuple_node_alt):
      """
      Compute which value from tuple_node_alt comes from which value from tuple_node.
      Return a dictionnary where the key are the values from tuple_node and the values are the idx at which
      the value may be located in tuple_node_alt.
      """
      # Compute the tolerances based on the node
      tolerances = [0.1 if x <= 100 else 0.2 for x in tuple_node]

      # Traceback
      distance = dict()
      alt_identification = dict()
      for k, x in enumerate(tuple_node):
      distance[k] = [abs(elt-1) for elt in [alt_x/x for alt_x in tuple_node_alt]]
      alt_identification[x] = list(np.where([elt <= tolerances[k]+0.00001 for elt in distance[k]])[0])

      # Controls the identification and corrects it
      len_values = {key: len(val) for key, val in alt_identification.items()}

      if all([x <= 1 for x in len_values.values()]):
      return alt_identification
      else:
      for key, value in alt_identification.items():
      if len(value) <= 1:
      continue
      else:
      other_values = [val for k, val in alt_identification.items() if k != key]
      if value in other_values:
      continue
      else:
      for val in other_values:
      set1 = set(value)
      intersec = set1.intersection(set(val))
      if len(intersec) == 0:
      continue
      else:
      alt_identification[key] = [v for v in value if v not in intersec]

      return alt_identification


      The input is composed of 2 tuples which do not need to have the same size. e.g.



      tuple_node = (40, 50, 60, 80)
      tuple_node_alt = (87, 48, 59, 39)


      The goal is to figure out which value from tuple_node_alt may come from which value from tuple_node. If the value from tuple_node_alt is within a 10% margin from a value from tuple_node, it is considered that it comes from this value.



      e.g. 39 is within a 10% margin of 40. It comes from 40.
      This aprt is perform in the "Traceback" section, where a distance dictionnary is computed and where the idx are computed. With the example above, the output is:



      Out[67]: {40: [3], 50: [1], 60: [2], 80: [0]}


      However, because of a potential overlapping of the tolerance band, 3 scenarios exists:



      Scenario 1: each value has been identified to one alternative value. That's the case above.



      Scenario 2:



      tuple_node = (40, 50, 60, 80)
      tuple_node_alt = (42, 55, 54)


      55 and 54 are both in the tolerance band of both 50 and 60. Thus, the output is:



      Out[66]: {40: [0], 50: [1, 2], 60: [1, 2], 80: []}


      Scenario 3:



      tuple_node = (40, 50, 60)
      tuple_node_alt = (42, 55, 59)


      This is when the control part comes in play. With this input, alt_identification becomes: Out[66]: {40: [0], 50: [1], 60: [1, 2], 80: []}. However, the 55 can not come from 60 since 50 only has one possibility: 55. Thus, this number being already taken, the correct output which is provided through the control & correct section is:



      Out[66]: {40: [0], 50: [1], 60: [2], 80: []}


      I would really like to optimize this part and to make it a lot more quicker. At the moment, it takes:



      # With an input which does not enter the control & correct part.
      node = (40, 50, 60, 80)
      node_alt = (39, 48, 59, 87)
      %timeit traceback(node, node_alt)
      22.6 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

      # With an input which need correction
      node = (40, 50, 60, 100)
      node_alt = (42, 55, 59, 89)
      %timeit traceback(node, node_alt)
      28.1 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)






      python performance numpy






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Apr 1 at 16:17









      200_success

      131k17157422




      131k17157422










      asked Apr 1 at 11:23









      MathieuMathieu

      1507




      1507






















          1 Answer
          1






          active

          oldest

          votes


















          1












          $begingroup$

          A couple of low-hanging fruit inefficiencies:





          1. distance = dict(). The distance[k] value is computed in a loop, and only every used in the next statement of the loop. It does not need to be stored in a dictionary.


          2. all([ ...list comprehension... ]): You are using list comprehension to build up a list, which you immediately pass to all(...). There is no need to actually create the list. Just use all(...list comprehension...).


          3. set1 = set(value). This is inside a for val in other_values: loop, where value and set1 are not changed. Move the statement out of the for loop, to avoid recreating the same set each iteration.


          4. len_values is only used in the afore mentioned all(...), and only the the values of len_values dictionary are used. As such, the len_value dictionary construction is also unnecessary, and the if statement can be written:



            if all(len(val) <= 1 for val in alt_identification.values()):





          Since you are returning alt_identification from the if statement, and after the if...else statement, you can invert the test, and remove one return statement:



          if any(len(val) > 1 for val in alt_identification.values()):
          for key, value in alt_identification.items():
          # ... omitted for brevity ...

          return alt_identification


          Similarly, the two if condition: continue else: could be re-written if not condition:.





          Other possible improvements:





          • tolerances[k] is only used in next for k loop. The list can be removed and the calculations move into the loop.


          • numpy is only used for a list(np.where([...])[0]) operation, which is fairly obfuscated. A simple list comprehension can be used instead.

          • The values of alt_identification are of type list, and converted (repeatedly) into a set() in the "control & correct" code. They could be stored as set() to avoid repeated conversions.


          Here is my rework of the code, with the changes based on above comments:



          def traceback(tuple_node, tuple_node_alt):

          def close_alternates(x):
          tolerance = (0.1 if x <= 100 else 0.2) + 0.00001
          return set( k for k, alt_x in enumerate(tuple_node_alt)
          if abs(alt_x/x - 1) <= tolerance )

          alt_identification = { x: close_alternates(x) for x in tuple_node }

          if any(len(val) > 1 for val in alt_identification.values()):
          for key, values in alt_identification.items():
          if len(values) > 1:
          other_values = [val for k, val in alt_identification.items() if k != key]
          if values not in other_values:
          for other in other_values:
          alt_identification[key] -= other

          return alt_identification


          I'm getting up to a 2.8x speedup with the above code, on your test data set that require correction.






          share|improve this answer











          $endgroup$














            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "196"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f216646%2ffunction-figuring-out-which-close-by-value-comes-from-which-input-value%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1












            $begingroup$

            A couple of low-hanging fruit inefficiencies:





            1. distance = dict(). The distance[k] value is computed in a loop, and only every used in the next statement of the loop. It does not need to be stored in a dictionary.


            2. all([ ...list comprehension... ]): You are using list comprehension to build up a list, which you immediately pass to all(...). There is no need to actually create the list. Just use all(...list comprehension...).


            3. set1 = set(value). This is inside a for val in other_values: loop, where value and set1 are not changed. Move the statement out of the for loop, to avoid recreating the same set each iteration.


            4. len_values is only used in the afore mentioned all(...), and only the the values of len_values dictionary are used. As such, the len_value dictionary construction is also unnecessary, and the if statement can be written:



              if all(len(val) <= 1 for val in alt_identification.values()):





            Since you are returning alt_identification from the if statement, and after the if...else statement, you can invert the test, and remove one return statement:



            if any(len(val) > 1 for val in alt_identification.values()):
            for key, value in alt_identification.items():
            # ... omitted for brevity ...

            return alt_identification


            Similarly, the two if condition: continue else: could be re-written if not condition:.





            Other possible improvements:





            • tolerances[k] is only used in next for k loop. The list can be removed and the calculations move into the loop.


            • numpy is only used for a list(np.where([...])[0]) operation, which is fairly obfuscated. A simple list comprehension can be used instead.

            • The values of alt_identification are of type list, and converted (repeatedly) into a set() in the "control & correct" code. They could be stored as set() to avoid repeated conversions.


            Here is my rework of the code, with the changes based on above comments:



            def traceback(tuple_node, tuple_node_alt):

            def close_alternates(x):
            tolerance = (0.1 if x <= 100 else 0.2) + 0.00001
            return set( k for k, alt_x in enumerate(tuple_node_alt)
            if abs(alt_x/x - 1) <= tolerance )

            alt_identification = { x: close_alternates(x) for x in tuple_node }

            if any(len(val) > 1 for val in alt_identification.values()):
            for key, values in alt_identification.items():
            if len(values) > 1:
            other_values = [val for k, val in alt_identification.items() if k != key]
            if values not in other_values:
            for other in other_values:
            alt_identification[key] -= other

            return alt_identification


            I'm getting up to a 2.8x speedup with the above code, on your test data set that require correction.






            share|improve this answer











            $endgroup$


















              1












              $begingroup$

              A couple of low-hanging fruit inefficiencies:





              1. distance = dict(). The distance[k] value is computed in a loop, and only every used in the next statement of the loop. It does not need to be stored in a dictionary.


              2. all([ ...list comprehension... ]): You are using list comprehension to build up a list, which you immediately pass to all(...). There is no need to actually create the list. Just use all(...list comprehension...).


              3. set1 = set(value). This is inside a for val in other_values: loop, where value and set1 are not changed. Move the statement out of the for loop, to avoid recreating the same set each iteration.


              4. len_values is only used in the afore mentioned all(...), and only the the values of len_values dictionary are used. As such, the len_value dictionary construction is also unnecessary, and the if statement can be written:



                if all(len(val) <= 1 for val in alt_identification.values()):





              Since you are returning alt_identification from the if statement, and after the if...else statement, you can invert the test, and remove one return statement:



              if any(len(val) > 1 for val in alt_identification.values()):
              for key, value in alt_identification.items():
              # ... omitted for brevity ...

              return alt_identification


              Similarly, the two if condition: continue else: could be re-written if not condition:.





              Other possible improvements:





              • tolerances[k] is only used in next for k loop. The list can be removed and the calculations move into the loop.


              • numpy is only used for a list(np.where([...])[0]) operation, which is fairly obfuscated. A simple list comprehension can be used instead.

              • The values of alt_identification are of type list, and converted (repeatedly) into a set() in the "control & correct" code. They could be stored as set() to avoid repeated conversions.


              Here is my rework of the code, with the changes based on above comments:



              def traceback(tuple_node, tuple_node_alt):

              def close_alternates(x):
              tolerance = (0.1 if x <= 100 else 0.2) + 0.00001
              return set( k for k, alt_x in enumerate(tuple_node_alt)
              if abs(alt_x/x - 1) <= tolerance )

              alt_identification = { x: close_alternates(x) for x in tuple_node }

              if any(len(val) > 1 for val in alt_identification.values()):
              for key, values in alt_identification.items():
              if len(values) > 1:
              other_values = [val for k, val in alt_identification.items() if k != key]
              if values not in other_values:
              for other in other_values:
              alt_identification[key] -= other

              return alt_identification


              I'm getting up to a 2.8x speedup with the above code, on your test data set that require correction.






              share|improve this answer











              $endgroup$
















                1












                1








                1





                $begingroup$

                A couple of low-hanging fruit inefficiencies:





                1. distance = dict(). The distance[k] value is computed in a loop, and only every used in the next statement of the loop. It does not need to be stored in a dictionary.


                2. all([ ...list comprehension... ]): You are using list comprehension to build up a list, which you immediately pass to all(...). There is no need to actually create the list. Just use all(...list comprehension...).


                3. set1 = set(value). This is inside a for val in other_values: loop, where value and set1 are not changed. Move the statement out of the for loop, to avoid recreating the same set each iteration.


                4. len_values is only used in the afore mentioned all(...), and only the the values of len_values dictionary are used. As such, the len_value dictionary construction is also unnecessary, and the if statement can be written:



                  if all(len(val) <= 1 for val in alt_identification.values()):





                Since you are returning alt_identification from the if statement, and after the if...else statement, you can invert the test, and remove one return statement:



                if any(len(val) > 1 for val in alt_identification.values()):
                for key, value in alt_identification.items():
                # ... omitted for brevity ...

                return alt_identification


                Similarly, the two if condition: continue else: could be re-written if not condition:.





                Other possible improvements:





                • tolerances[k] is only used in next for k loop. The list can be removed and the calculations move into the loop.


                • numpy is only used for a list(np.where([...])[0]) operation, which is fairly obfuscated. A simple list comprehension can be used instead.

                • The values of alt_identification are of type list, and converted (repeatedly) into a set() in the "control & correct" code. They could be stored as set() to avoid repeated conversions.


                Here is my rework of the code, with the changes based on above comments:



                def traceback(tuple_node, tuple_node_alt):

                def close_alternates(x):
                tolerance = (0.1 if x <= 100 else 0.2) + 0.00001
                return set( k for k, alt_x in enumerate(tuple_node_alt)
                if abs(alt_x/x - 1) <= tolerance )

                alt_identification = { x: close_alternates(x) for x in tuple_node }

                if any(len(val) > 1 for val in alt_identification.values()):
                for key, values in alt_identification.items():
                if len(values) > 1:
                other_values = [val for k, val in alt_identification.items() if k != key]
                if values not in other_values:
                for other in other_values:
                alt_identification[key] -= other

                return alt_identification


                I'm getting up to a 2.8x speedup with the above code, on your test data set that require correction.






                share|improve this answer











                $endgroup$



                A couple of low-hanging fruit inefficiencies:





                1. distance = dict(). The distance[k] value is computed in a loop, and only every used in the next statement of the loop. It does not need to be stored in a dictionary.


                2. all([ ...list comprehension... ]): You are using list comprehension to build up a list, which you immediately pass to all(...). There is no need to actually create the list. Just use all(...list comprehension...).


                3. set1 = set(value). This is inside a for val in other_values: loop, where value and set1 are not changed. Move the statement out of the for loop, to avoid recreating the same set each iteration.


                4. len_values is only used in the afore mentioned all(...), and only the the values of len_values dictionary are used. As such, the len_value dictionary construction is also unnecessary, and the if statement can be written:



                  if all(len(val) <= 1 for val in alt_identification.values()):





                Since you are returning alt_identification from the if statement, and after the if...else statement, you can invert the test, and remove one return statement:



                if any(len(val) > 1 for val in alt_identification.values()):
                for key, value in alt_identification.items():
                # ... omitted for brevity ...

                return alt_identification


                Similarly, the two if condition: continue else: could be re-written if not condition:.





                Other possible improvements:





                • tolerances[k] is only used in next for k loop. The list can be removed and the calculations move into the loop.


                • numpy is only used for a list(np.where([...])[0]) operation, which is fairly obfuscated. A simple list comprehension can be used instead.

                • The values of alt_identification are of type list, and converted (repeatedly) into a set() in the "control & correct" code. They could be stored as set() to avoid repeated conversions.


                Here is my rework of the code, with the changes based on above comments:



                def traceback(tuple_node, tuple_node_alt):

                def close_alternates(x):
                tolerance = (0.1 if x <= 100 else 0.2) + 0.00001
                return set( k for k, alt_x in enumerate(tuple_node_alt)
                if abs(alt_x/x - 1) <= tolerance )

                alt_identification = { x: close_alternates(x) for x in tuple_node }

                if any(len(val) > 1 for val in alt_identification.values()):
                for key, values in alt_identification.items():
                if len(values) > 1:
                other_values = [val for k, val in alt_identification.items() if k != key]
                if values not in other_values:
                for other in other_values:
                alt_identification[key] -= other

                return alt_identification


                I'm getting up to a 2.8x speedup with the above code, on your test data set that require correction.







                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Apr 2 at 22:26

























                answered Apr 1 at 17:37









                AJNeufeldAJNeufeld

                6,6441722




                6,6441722






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Code Review Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f216646%2ffunction-figuring-out-which-close-by-value-comes-from-which-input-value%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    is 'sed' thread safeWhat should someone know about using Python scripts in the shell?Nexenta bash script uses...

                    How do i solve the “ No module named 'mlxtend' ” issue on Jupyter?

                    Pilgersdorf Inhaltsverzeichnis Geografie | Geschichte | Bevölkerungsentwicklung | Politik | Kultur...