How to deal with possible data leakage in time series data?


I have historical data on consumers who took out a loan at some point in time. The task is to predict, at the time a consumer requests a loan, whether they will default.

My issue is that for some customers in the data set, historical transactions are only available after the loan was issued. I believe that using data from after the loan event for prediction will cause data leakage.

This is a subtle leakage because it does not involve using information not available at prediction time. My concern is more about a behavioral change once the customer is indebted, which creates a shift in the underlying distribution.

To test my hypothesis, I was wondering whether it would be a good approach to compare the two samples, before and after the loan is issued, and check whether they come from the same distribution.

These are my questions:

  1. Is there really data leakage in the scenario I described?

  2. If yes, can I test for it in any way?

  3. Can a two-sample test provide an answer? Which one? Note that the sample is composed of multivariate data.

  4. Can I test this using a machine learning approach? I was thinking of using a mixture model, for instance.

Any suggestions on how best to deal with this problem, other than what I suggested, will be appreciated.

Thanks

















      machine-learning data-leakage














      asked Feb 14 at 12:14, edited Feb 15 at 2:03
      irkinosor






















          1 Answer












There is no need for two-sample tests.

A customer may already have received loans 1 to n - 1. To predict the default risk of the nth request at time t(n), you are allowed to use any information available up until t(n). When a user has no transaction history before t(1), the system cannot predict her default risk from transactions; it can only fall back on attributes such as her age, income, etc. However, for the next loan request at t(2), the system can use the transactions between t(1) and t(2), but still cannot use any transaction that happened after t(2). For any particular prediction at t(n), events that happened after t(n) must never be used.

Regarding "it does not involve using information not available at prediction time": it does involve using information not available at prediction time t(n), since the system is trying to use transactions that occur after t(n).
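The rule that a prediction at t(n) may only use events before t(n) can be sketched as a point-in-time filter applied when building features. This is a minimal illustration, not the poster's actual pipeline: the `Transaction` record, the `features_at_request` helper, the feature names, and the sample dates are all hypothetical, and treating "up until t(n)" as strictly before the request time is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class Transaction:
    # Hypothetical record layout, assumed for illustration only.
    customer_id: str
    timestamp: datetime
    amount: float


def features_at_request(transactions: List[Transaction],
                        customer_id: str,
                        request_time: datetime) -> Dict[str, float]:
    """Build features for one loan request using only data from before t(n)."""
    # Keep only this customer's transactions strictly before the request time.
    amounts = [
        tx.amount for tx in transactions
        if tx.customer_id == customer_id and tx.timestamp < request_time
    ]
    return {
        "n_transactions": float(len(amounts)),
        "total_amount": float(sum(amounts)),
        # 1.0 marks a customer with no usable history before the request,
        # so the model has to rely on static attributes (age, income, ...).
        "no_history": 1.0 if not amounts else 0.0,
    }


transactions = [
    Transaction("c1", datetime(2018, 1, 10), 120.0),
    Transaction("c1", datetime(2018, 3, 5), 80.0),  # after the first request
    Transaction("c1", datetime(2018, 6, 1), 40.0),  # after the second request
]

# Features for the first request, t(1), see only the January transaction.
first_request = features_at_request(transactions, "c1", datetime(2018, 2, 1))
# Features for the second request, t(2), may also use the March transaction.
second_request = features_at_request(transactions, "c1", datetime(2018, 4, 1))
```

Under this scheme, a customer whose transactions all fall after the loan was issued simply gets `no_history` set for that loan, rather than having post-loan behavior folded into the features.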















































                answered 12 mins ago









                P. Esmailian

































