Model to predict based on frequency of occurrence
I have the following dataset:
+-----------+----------+
| Passenger | Trip     |
+-----------+----------+
| John      | London   |
| Jack      | Paris    |
| Joe       | Sydney   |
| John      | London   |
| John      | London   |
| Jill      | New york |
| Jim       | Sydney   |
| Jack      | Paris    |
| James     | Sydney   |
+-----------+----------+
I am trying to use the scikit-learn library to predict each passenger's most likely next trip based on frequency (in this case, John => London).
As a novice, I am unsure which model or function to use.
Update 2:
If I have over 10 million records, how differently should I approach this problem?
Update 3:
The following code worked for the larger dataset:
# For each passenger, keep only their single most frequent destination
series_px = df_px_dest.groupby('Passenger')['Trip'].apply(lambda x: x.value_counts().head(1))
df_px = series_px.to_frame()
df_px.index = df_px.index.set_names(['UID', 'DEST'])
df_px.reset_index(inplace=True)

def getNextPossibleDestByUserID(name, df=df_px):
    # Look up the precomputed top destination for the given passenger
    return df.query('UID == @name')['DEST'].to_string(index=False)
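To check the lookup end to end, here is a self-contained run on the small table above. The construction of `df_px_dest` is my assumption (in the update it comes from the larger dataset); the rest mirrors the code in the update:

```python
import pandas as pd

# Rebuild the question's toy table (this construction is an assumption;
# the real df_px_dest comes from the larger dataset)
df_px_dest = pd.DataFrame({
    'Passenger': ['John', 'Jack', 'Joe', 'John', 'John', 'Jill', 'Jim', 'Jack', 'James'],
    'Trip': ['London', 'Paris', 'Sydney', 'London', 'London', 'New york', 'Sydney', 'Paris', 'Sydney'],
})

# Keep each passenger's single most frequent destination
series_px = df_px_dest.groupby('Passenger')['Trip'].apply(lambda x: x.value_counts().head(1))
df_px = series_px.to_frame()
df_px.index = df_px.index.set_names(['UID', 'DEST'])
df_px.reset_index(inplace=True)

def getNextPossibleDestByUserID(name, df=df_px):
    # Look up the precomputed top destination for the given passenger
    return df.query('UID == @name')['DEST'].to_string(index=False)

print(getNextPossibleDestByUserID('John'))  # John's most frequent trip is London
```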
My next target is to expose this as an API (maybe using Flask); I will probably raise a new question for that!
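For the Flask idea, a minimal sketch of what such an endpoint might look like. This is not part of the question: the route name, the response shape, and the toy `df_px` here are all my assumptions.

```python
from flask import Flask, jsonify
import pandas as pd

app = Flask(__name__)

# Toy lookup table standing in for the real precomputed df_px
df_px = pd.DataFrame({'UID': ['John', 'Jack'], 'DEST': ['London', 'Paris']})

@app.route('/next-destination/<name>')
def next_destination(name):
    # Same precomputed-lookup idea as in the update above
    dest = df_px.query('UID == @name')['DEST']
    if dest.empty:
        return jsonify({'error': 'unknown passenger'}), 404
    return jsonify({'passenger': name, 'destination': dest.iloc[0]})

# Run with app.run(), then GET /next-destination/John
```

Because the heavy groupby happens once at startup, each request is just a cheap dataframe lookup.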
scikit-learn machine-learning-model
edited 6 mins ago
asked Feb 22 at 1:44
Maddy
464
1 Answer
For something like this, you could go with a simpler approach. One idea is to sample randomly among the cities a given passenger has visited, using the number of times each city was visited as the sampling probabilities.
Here's one way to do that. I've added a few more rows to the dataframe so that the idea comes through more clearly. Say you instead have:
Passenger Trip
0 John London
1 Jack Girona
2 Jack Paris
3 Joe Sydney
4 Joe Amsterdam
5 Joe Barcelona
6 Joe Barcelona
7 John London
8 John Paris
9 Jill Newyork
10 Jim Sydney
11 Jack Paris
12 James Sydney
You could define a function like the following to randomly sample from the existing data in the dataframe:
import numpy as np

def random_sample(df, name):
    # Group the dataframe by Passenger and count the different trips
    g = df.groupby('Passenger').Trip.value_counts()
    # Normalise so the probabilities add up to 1
    freq = g[name] / g[name].sum()
    # Draw a random destination according to those probabilities
    random_name = np.random.choice(a=freq.index, size=1, p=freq.values)[0]
    # Return the likelihood of the chosen destination, and the destination itself
    return freq[random_name], random_name
Usage
Say we want to sample a random destination for, say, Joe, and also know its likelihood. The destinations where Joe has been are:
Trip
Barcelona 2
Amsterdam 1
Sydney 1
We could get, for example:
for _ in range(5):
    freq, dest = random_sample(df, 'Joe')
    print('Chosen destination {} with a probability of {}'.format(dest, freq))
Chosen destination Sydney with a probability of 0.25
Chosen destination Barcelona with a probability of 0.5
Chosen destination Barcelona with a probability of 0.5
Chosen destination Barcelona with a probability of 0.5
Chosen destination Sydney with a probability of 0.25
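If you want the full probability distribution per passenger rather than a single random draw (closer to the "likelihood" the question asks about), `value_counts(normalize=True)` gives it directly. This is a sketch of my own, not part of the original answer:

```python
import pandas as pd

# Same extended example data as in the answer above
df = pd.DataFrame({
    'Passenger': ['John', 'Jack', 'Jack', 'Joe', 'Joe', 'Joe', 'Joe',
                  'John', 'John', 'Jill', 'Jim', 'Jack', 'James'],
    'Trip': ['London', 'Girona', 'Paris', 'Sydney', 'Amsterdam', 'Barcelona', 'Barcelona',
             'London', 'Paris', 'Newyork', 'Sydney', 'Paris', 'Sydney'],
})

# Normalised counts: each passenger's trips as probabilities summing to 1
probs = df.groupby('Passenger').Trip.value_counts(normalize=True)
print(probs['Joe'])
# Joe: Barcelona 0.5, Amsterdam 0.25, Sydney 0.25
```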
Thanks Alex for your answer. For a smaller subset this works perfectly fine. My next step is to try it with a large dataset (say, over 10 million records). Will the same approach work in that case? P.S. I have also updated my question with this constraint. – Maddy, 2 days ago
For that you could do the groupby only once, before calling the function, and pass it in as an extra argument. Should be fast enough. Let me know :) – yatu, 20 hours ago
Updated my question with the groupby approach! – Maddy, 5 mins ago
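yatu's suggestion of grouping once up front can be sketched like this. It is a minimal sketch under my own naming: `trip_counts` and `random_sample_fast` are hypothetical names, and the toy dataframe is chosen so the draw is deterministic.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Passenger': ['John', 'John', 'John', 'Jack', 'Jack'],
    'Trip': ['London', 'London', 'London', 'Paris', 'Girona'],
})

# Do the expensive groupby once, up front...
trip_counts = df.groupby('Passenger').Trip.value_counts()

# ...and pass the precomputed counts in as an extra argument,
# so each call avoids re-grouping the full dataframe
def random_sample_fast(counts, name):
    freq = counts[name] / counts[name].sum()
    dest = np.random.choice(a=freq.index, size=1, p=freq.values)[0]
    return freq[dest], dest

prob, dest = random_sample_fast(trip_counts, 'John')  # John only ever went to London
```

With 10 million records the groupby cost is paid once; each subsequent call is a cheap index lookup plus one random draw.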
answered Feb 22 at 11:20
yatu
1214
New contributor