
Why do people prefer Pandas to SQL?


I've been using SQL since 1996, so I may be biased. I've used MySQL and SQLite 3 extensively, but have also used Microsoft SQL Server and Oracle.

The vast majority of the operations I've seen done with Pandas can be done more easily with SQL. This includes filtering a dataset, selecting specific columns for display, applying a function to values, and so on.

SQL has the advantages of an optimizer and data persistence. SQL also has error messages that are clear and understandable. Pandas has a somewhat cryptic API, in which sometimes it's appropriate to use a single [ stuff ], other times you need [[ stuff ]], and sometimes you need a .loc. Part of the complexity of Pandas arises from the fact that there is so much overloading going on.

So I'm trying to understand why Pandas is so popular.
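For illustration, a minimal sketch of the indexing overloading described above, on toy data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Single brackets with a column name return a Series...
s = df["a"]

# ...double brackets (a list of columns) return a DataFrame...
sub = df[["a"]]

# ...and .loc is label-based indexing over rows (and optionally columns).
row = df.loc[0, "a"]
```

Three superficially similar spellings, three different return types, which is the kind of overloading the question is pointing at.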










  • $begingroup$
    Comments are not for extended discussion; this conversation has been moved to chat.
    $endgroup$
    – Sean Owen
    Jul 22 '18 at 3:43
















pandas sql

edited Jul 17 '18 at 13:01
community wiki
vy32
10 Answers
























39













The real first question is why are people more productive with DataFrame abstractions than pure SQL abstractions.



TLDR; SQL is not geared around the (human) development and debugging process, DataFrames are.



The main reason is that DataFrame abstractions allow you to construct SQL statements whilst avoiding verbose and illegible nesting. The pattern of writing nested routines, commenting them out to check them, and then uncommenting them is replaced by single lines of transformation. You can naturally run things line by line in a repl (even in Spark) and view the results.



Consider the example of adding a new transformed (string-mangled) column to a table, then grouping by it and doing some aggregations. The SQL gets pretty ugly. Pandas can solve this, but is missing some things when it comes to truly big data or partitioned data (though perhaps improved recently).
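A minimal pandas sketch of that example, with invented data and column names: derive a string-mangled column, group by it, and aggregate, all in one chain that can be inspected line by line.

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@foo.com", "b@foo.com", "c@bar.org"],
                   "amount": [10, 20, 5]})

# Derive the domain from the email, group by it, and aggregate.
result = (
    df.assign(domain=df["email"].str.split("@").str[1])
      .groupby("domain")["amount"]
      .agg(["sum", "mean"])
)
```

The SQL equivalent needs either a nested subquery or a repeated expression in both the SELECT list and the GROUP BY clause.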



DataFrames should be viewed as a high-level API to SQL routines, even though pandas does not render them to any SQL planner.



--



You can probably have many technical discussions around this, but I'm considering the user perspective below.



One simple reason why you may see a lot more questions around Pandas data manipulation as opposed to SQL is that to use SQL, by definition, means using a database, and a lot of use-cases these days quite simply require bits of data for 'one-and-done' tasks (from .csv, web api, etc.). In these cases loading, storing, manipulating and extracting from a database is not viable.



However, considering cases where the use-case may justify using either Pandas or SQL, you're certainly not wrong. If you want to do many, repetitive data manipulation tasks and persist the outputs, I'd always recommend trying to go via SQL first. From what I've seen the reason why many users, even in these cases, don't go via SQL is two-fold.



Firstly, the major advantage pandas has over SQL is that it's part of the wider Python universe, which means in one fell swoop I can load, clean, manipulate, and visualize my data (I can even execute SQL through Pandas...). The other is, quite simply, that all too many users don't know the extent of SQL's capabilities. Every beginner learns the 'extraction syntax' of SQL (SELECT, FROM, WHERE, etc.) as a means to get data from a DB to the next place. Some may pick up some of the more advanced grouping and iteration syntax. But after that there tends to be a pretty significant gulf in knowledge, until you get to the experts (DBAs, data engineers, etc.).
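A small sketch of "executing SQL through Pandas", using Python's built-in sqlite3 as a stand-in for any SQL source (the table here is invented for illustration):

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for any SQL backend.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, value INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("b", 2)])

# SQL does the extraction; pandas takes over from there.
df = pd.read_sql("SELECT name, value FROM t WHERE value > 1", con)
```

From here the same `df` can be cleaned, plotted, or fed to a model without leaving Python.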



tl;dr: It's often down to the use-case, convenience, or a gap in knowledge around the extent of SQL's capabilities.















  • 2




    $begingroup$
    I think SQL largely being set based plays a large part, when a lot of people from other technical areas are used to handling data line by line. Also consider that data is mostly just data to pandas but different SQL engines support different built in functions which can get wildly annoying fast if you have to chop and change during your workday
    $endgroup$
    – Dave
    Jul 12 '18 at 14:26






  • 3




    $begingroup$
    I wouldn't say it's not viable. If you can get the data into a pandas data frame, you can probably shove it into a PostgreSQL DB. But for one and done, it's probably more effort and time than you would save.
    $endgroup$
    – jpmc26
    Jul 13 '18 at 7:37








  • 2




    $begingroup$
    I agree that some ETL approaches appear to be programmer-centric decisions. That is, they prefer to manipulate the data then present this "perfect" payload to the database. However, as you indicate, if it can be done via several SQL queries, then the extra programmatic layer is unnecessary. Exactly what I faced recently. As the OP and your answer indicates, it could be that "old-school" or DBA-centric people look at it and say, why not do it in SQL (even just several simple queries!). That said, I've found pandas to be very powerful for extremely diverse data sets.
    $endgroup$
    – SaltySub2
    Jul 13 '18 at 8:45








  • 1




    $begingroup$
    @SaltySub Just a point on shifting things out of the programmatic layer into SQL: It's a fair point and can be perfectly valid, but going as far as burying application logic in SQL procedures can bring its own special flavour of headache.
    $endgroup$
    – Electric Head
    Jul 13 '18 at 18:23










  • $begingroup$
    @ElectricHead I agree that there needs to be a right balance. If a series of SQL queries can perform the tasks adequately, it can definitely be easier and more efficient. Conversely, as you indicate, if one has to place a huge amount of logic into SQL procedures, etc. then pandas should be strongly considered. Particularly as above if you are using different database flavours - SQL syntax differences can get very hairy then.
    $endgroup$
    – SaltySub2
    Jul 15 '18 at 3:26





















23













As much as there is overlap in the application of these two things, this is comparing apples to oranges.



pandas is a data analysis toolkit implemented in Python, a general-purpose programming language. SQL is a domain-specific language for querying relational data (usually in a relational database management system; SQLite, MySQL, Oracle, SQL Server, and PostgreSQL are examples).



SQL implies




  • working with data in an RDBMS* which may or may not be appropriate for the workload, even if it's just a small SQLite database,

  • database domain knowledge (as an end user, developer and/or administrator; the suggestion that "SQL is faster" I often see is a massive over-simplification), and

  • overcoming the not-insignificant learning curve in using SQL effectively, particularly in specialist applications such as data analysis (as opposed to creating simple reports of simple data).



* It's worth underlining the fact that SQL is so domain-specific it's becoming much less relevant to working with increasingly common alternatives to relational databases such as NoSQL databases. This represents a fundamental shift in how data is stored and structured, and there is really no universally common way of accessing it like the development of SQL standardisation aimed to achieve.




Python on the other hand (pandas is fairly "pythonic", so it holds true here) is flexible and accessible to people from various backgrounds. It can be used as a "scripting language", as a functional language, and as a fully featured OOP language. Visualisation capabilities and data source interoperability are built into pandas, but you're free to incorporate whatever Python can do into your workflow (which is most things); the scientific Python ecosystem has ballooned and includes great tools such as Jupyter Notebook and essential SciPy-stack libraries such as matplotlib and numpy (which pandas builds on). Significant elements of pandas' data analysis are R-inspired, and you won't generally find statisticians umming and ahhing about whether to use R (or possibly, increasingly, pandas!) over putting everything in a database and writing their analyses in SQL.



I'm not saying pandas is better than SQL or vice versa, but SQL is a very domain-specific tool whereas pandas is part of a giant, flexible and accessible ecosystem. I work with geospatial data systems, of which relational databases are a huge part, and SQL is a powerful and essential tool. However, pandas is an equally if not more essential part of my day-to-day toolkit and SQL is often relegated to fetching data -- perhaps with some pre-processing -- so I can do things with it in pandas.















  • 1




    $begingroup$
    This is the only true answer, it should be the chosen one. SQL and Pandas are two different things, I do not understand what comparison people are trying to make.
    $endgroup$
    – gented
    Jul 13 '18 at 6:51










  • $begingroup$
    I suspect it's an end-user perspective of writing something code-like to fetch and massage some data from somewhere and spit out some numbers. I'm not entirely surprised; I've had first hand experience of how data analysts presented with an old but otherwise unremarkable Oracle database haven't even the first idea of what it is and how to connect to it let alone get data out. I believe it betrays a fundamental lack of understanding of the technology - I've actually added a bit to hopefully emphasise how quickly misunderstanding the scope of SQL falls down.
    $endgroup$
    – Electric Head
    Jul 13 '18 at 18:00










  • $begingroup$
    I'd challenge your bit about being irrelevant to NoSQL situations. Consider for example the strides PostgreSQL has made with its JSON storage.
    $endgroup$
    – jpmc26
    Jul 14 '18 at 20:31










  • $begingroup$
    I tried to choose my words carefully; PostgreSQL is still an RDBMS despite doing a lot of things well (as SQL Server is despite supporting graphs). But, I've relaxed the wording a touch because it's still a good point: there is some crossover and, importantly, SQL APIs do exist for some NoSQL systems. It is crossover though, SQL is not a universal language and not all data is structured relationally.
    $endgroup$
    – Electric Head
    Jul 15 '18 at 8:25










  • $begingroup$
    I think you can do everything in SQL which is possible in pandas. SQL is not flexible but is so much optimised.
    $endgroup$
    – Media
    Jul 18 '18 at 0:55



















14













First, pandas is not actually that popular. I use both pandas and SQL. First I try to understand the task: if it can be done in SQL, I prefer SQL because it is more efficient than pandas. Try working on large data (10,000,000 x 50). Try doing some groupby operation in both SQL and pandas. You will understand.



I use pandas where it comes in handy, like splitting a column's values into an array and doing some stuff on it (such as choosing only some values out of that array). Now this kind of task is relatively hard to code in SQL, but pandas will ease your task.
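A sketch of the split-and-select task this answer describes, on invented data:

```python
import pandas as pd

df = pd.DataFrame({"tags": ["red,green,blue", "green", "blue,red"]})

# Split each cell into a list, explode to one row per element,
# then keep only the values we care about.
exploded = df["tags"].str.split(",").explode()
chosen = exploded[exploded.isin(["red", "blue"])]
```

Two short lines here, versus a recursive or string-splitting contortion in most SQL dialects.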



















  • $begingroup$
    Is this inefficiency specific to pandas? I've done quite a bit of in-memory data manipulation in C# and found it pretty easy and efficient, provided it fit memory and was one-shot (i.e. no need to incrementally update indexes as the data changes).
    $endgroup$
    – CodesInChaos
    Jul 13 '18 at 11:59










  • $begingroup$
    pandas is meant to be convenient over fast, but that's not to say it can't be fast if you use it right. In the end, executing a SQL query on data in a database isn't magic - it requires resources like anything, it's just that (if you do it right!) you're hopefully making use of resources on carefully configured, beefy database servers. Getting your pipeline right in pandas or similar (e.g. streaming data rather than loading it all into memory) is going to determine how successful some efforts are.
    $endgroup$
    – Electric Head
    Jul 13 '18 at 18:05












  • $begingroup$
    @CodesInChaos There is this answer of pandas vs SQl - qr.ae/TUIpzE. There it is described the advantages and disadvantages of using pandas.
    $endgroup$
    – Ankit Seth
    Jul 14 '18 at 2:38





















9













I'm one of those people who would use (in my case) R's dplyr (the language, not necessarily the tool) in every case if I could even though I know my SQL.



The major benefit I see in Pandas/dplyr/data.table pipelines is that the operations are atomic and can be read top to bottom.



In SQL you need to parse the whole script, jumping around (what's being summarized? what's being joined, and how: left, inner, right? are there any filters applied?) to fully grasp what is happening.



In Pandas et al., each step of the pipeline is self-contained: it does something with the input data and returns output data. This sequential process makes it easier to reason about what's happening, since there is a clearly defined state for each operation rather than just at the query level.



And yes, you can do WITH statements and such, but that requires much more code, and it is not as clear which object is being used, compared to piping.
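The atomic, top-to-bottom style described above can be sketched as a pandas method chain (sample data invented):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "LA", "LA"],
                   "sales": [100, 200, 50, 300]})

# Each step is atomic and reads top to bottom,
# unlike a nested SQL query that is read inside-out.
result = (
    df.query("sales > 60")                        # filter (WHERE)
      .assign(taxed=lambda d: d["sales"] * 1.1)   # derived column
      .groupby("city")["taxed"]                   # GROUP BY
      .sum()                                      # aggregate
)
```

Each line can be run and inspected on its own in a REPL, which is the debugging workflow the answer is contrasting with SQL.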



























    3













    I thought I would add that I do a lot of time-series based data analysis, and the pandas resample and reindex methods are invaluable for doing this. Yes, you can do similar things in SQL (I tend to create a DateDimension table for helping with date-related queries), but I just find the pandas methods much easier to use.



    Also, as others have said, the rest of my modelling is in Python, and I often have web calls or CSV files.
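A small illustration of resample on synthetic daily data:

```python
import pandas as pd

# Ten days of daily values, rolled up to weekly sums in one call;
# the SQL equivalent usually needs a date-dimension table or window tricks.
idx = pd.date_range("2018-07-01", periods=10, freq="D")
ts = pd.Series(range(10), index=idx)
weekly = ts.resample("W").sum()
```

`reindex` is similarly terse for filling in missing dates before such a rollup.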



























      3













      Things Pandas can do, that SQL can't do




      1. df.describe()

      2. Plotting, e.g. df['population'].plot(kind='hist')

      3. Use a dataframe directly for training machine learning algorithms


      Things Pandas can do, I wasn't aware that SQL can do as well




      1. Export to CSV: df.to_csv('foobar.csv'). This is important when you want to show something to a business owner who wants to work with Excel. And there is df.to_excel as well. But in SQL you can do SELECT a,b,a+b INTO OUTFILE '/tmp/result.txt' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' FROM test_table; (thank you, vy32!)
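A toy sketch of the first list's items 1 and the CSV export (data invented; the export writes to an in-memory buffer here rather than a file):

```python
import io
import pandas as pd

df = pd.DataFrame({"population": [10, 20, 30, 40]})

# One line of summary statistics (count, mean, std, quartiles, ...).
stats = df.describe()

# CSV export without leaving the client process.
buf = io.StringIO()
df.to_csv(buf, index=False)
```

Note the MySQL SELECT ... INTO OUTFILE alternative writes the file on the server, not the client, as the comments below point out.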















      • 1




        $begingroup$
        Nice. Although most of these seem like functions that could be implemented in SQL. (SQL does have directly CSV export.)
        $endgroup$
        – vy32
        Jul 18 '18 at 19:51










      • $begingroup$
        Could you please send me a query which exports to CSV? (I only know tools which do this for some SQL-based databases, but I've never seen a query ... so I doubt that this is part of the SQL specification)
        $endgroup$
        – Martin Thoma
        Jul 18 '18 at 20:19






      • 1




        $begingroup$
        SELECT a,b,a+b INTO OUTFILE '/tmp/result.txt' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY 'n' FROM test_table; See dev.mysql.com/doc/refman/8.0/en/select-into.html
        $endgroup$
        – vy32
        Jul 19 '18 at 14:17










      • $begingroup$
        Thank you so much, vy! I think I'll adjust my answer when I'm home :-)
        $endgroup$
        – Martin Thoma
        Jul 19 '18 at 16:52










      • $begingroup$
        Sure thing. Remember, the file ends up on the SQL server, not the client.
        $endgroup$
        – vy32
        Jul 20 '18 at 18:21



















      2













      The only thing not covered in these answers that I'd like to mention is that it also depends on how you're using SQL. Take arcpy for example. For some reason none of the arcpy.da functions have an executemany feature. This is really strange, because pretty much every other Python SQL library does. The where clause in the arcpy.da functions is also limited to around 120 characters. This essentially means that if you have a relatively high number of things you're trying to do with your database, your only real choice is to call your chosen arcpy.da function multiple times, changing the where clause each time you do. There are a few tricks you can use to make this process go faster (you can iterate over chunks of your dataset, for example) but literally every single one of these tricks is much, much slower than just using one arcpy.da.SearchCursor to load your entire table into a pandas data frame, and then manipulating it using pandas, numpy, and, if your data is really that massive, dask. I need to emphasize here that pandas isn't just a little faster in this case. It's disgustingly faster. It's so much faster that I was literally laughing at myself for not doing it sooner. Using pandas dropped one script's execution time down from well over an hour (I forget if this was the jump from 3.5 hours or from 1.5 hours) to literally 12 minutes.



      One thing to note is that while I could have done this with SQL, it would have taken me a lot longer to learn. I would have either had to learn operations specifically for SQL in Access (that's where the data for this script ended up, and SQL in Access wasn't as robust as I needed it to be when I was actually looking into doing this), or I would have had to write all my data to an sqlite3 database, manipulate it there, and then put it in Access. While this might have given me similar performance results, it would have made my script harder to modify in the future.



      So yeah, sometimes Pandas is just strictly better than the SQL options you have at your disposal. Everything I would have needed to do in SQL was done with a function in pandas. You can also use SQL syntax with pandas if you want to. There's little reason not to use pandas and SQL in tandem.



      One more thing I want to mention about Pandas and numpy is that both of these libraries are by nature set-based approaches. You can loop through dataframes and series built with these libraries, but it's really hard to modify data in those structures that way, so you'll end up writing more efficient, set-based code with both of these libraries purely because it's so much easier to do. Being "guided", if not rail-roaded, into using set-based approaches is not something I've experienced with SQL.



      One more massive thing I forgot to mention with Pandas. Money. Pandas is a tool that a lot of Data Science jobs want you to know how to use. Pretty much every Data Science job I've looked at has paid more than database management type jobs. The only exception to this that I've noticed is in Data Engineering, but I've seen far less of those job postings. Pandas looks like it makes you more money at a glance.















      • 4




        $begingroup$
        Perhaps sad that when it comes to modern jobs it's about having the right buzzwords in your resume as opposed to the approaches you take to solving a problem (assuming you can learn said buzzword relatively fast). It's like the buzzword is more important than the problem solving. When the problem solving for X should involve learning and using technology A, B, C, not the reverse. I wonder if most development teams now smash out things because of buzzword-ism and trendiness, then think about problem solving as a secondary, or "old-school" thing because you didn't know/use said buzzword.
        $endgroup$
        – SaltySub2
        Jul 13 '18 at 8:49








      • 1




        $begingroup$
        @ElectricHead in my experience if you're writing your own function involving sql in python, it's easier to just misuse your cursor and write bad queries than it is using pandas/numpy. Gotta remember that not all sql modules/libraries are made the same. In my case, with arcpy.da.SearchCursors and the like, there's really not a good way to do something to a bunch of records efficiently because of weird limitations. If I use pandas/numpy there becomes one good way to do things, and that's what I want when using python.
        $endgroup$
        – Steve
        Jul 13 '18 at 18:23






      • 1




        $begingroup$
        Ahhh, OK. You mean a homespun SQL pipeline via a python dbapi implementation vs using numpy/pandas? In which case, yeah gotcha, no argument from me there; care required! It read to me as vs plain SQL which you obviously need to understand set operations with, but will find that out quite quickly when running silly queries from a database client.
        $endgroup$
        – Electric Head
        Jul 13 '18 at 18:33






      • 1




        $begingroup$
        @Steve Yes, won't stop people trying to dynamically modify stuff in loops in pandas or similar though :) I think understanding SQL helps working in pandas effectively (it's not like they hide the similarity in some concepts though).
        $endgroup$
        – Electric Head
        Jul 13 '18 at 18:35






      • 1




        $begingroup$
        @Steve Indeed pandas is powerful too... I guess one of my frustrations is developers and management both, including myself, not spending adequate time evaluating solutions and chasing trends (where money is involved to promote self/company). But even in lean prototyping/ mvp one would have to lay appropriate groundwork for scaling. SQL, noSQL and Pandas... all have their purposes for the appropriate tasks and projects at different stages. For the past year plus, noSQL for a lean prototype/ mvp certainly helped me in more ways than one. SQL would have been overkill for that.
        $endgroup$
        – SaltySub2
        Jul 15 '18 at 3:32





















      2













      I'm fairly new to Pandas/Python, but have 20+ years as a SQL Server DBA, architect, administrator, etc. I love Pandas, and I'm pushing myself to always try to make things work in Pandas before returning to my comfy, cozy SQL world.



      Why RDBMSs are better:
      The advantage of RDBMSs is their years of experience optimizing query speed and data read operations. What's impressive is that they can do this while simultaneously balancing the need to optimize write speed and manage highly concurrent access. Sometimes these additional overheads tilt the advantage to Pandas when it comes to simple, single-user use cases. But even then, a seasoned DBA can tune a database to be highly optimized for read speed over write speed. DBAs can take advantage of things like optimizing data storage, strategic disk page sizing, page filling/padding, data controller and disk partitioning strategies, optimized I/O plans, in-memory data pinning, pre-defined execution plans, indexing, data compression, and much more. I get the impression from many Pandas developers that they don't understand the depth that's available there. What I think usually happens is that if a Pandas developer never has data that's big enough to need these optimizations, they don't appreciate how much time these can save out of the box. The RDBMS world has 30 years of experience optimizing this, so if raw speed on large datasets is needed, RDBMSs are hard to beat.



      Why Python/Pandas is better:
      That said, speed isn't everything, and in many use cases isn't the driving factor. It depends on how you're using the data, whether it's shared, and whether you care about the speed of the processing.
      RDBMSs are generally more rigid in their data structures and put a burden on the developer to be more deterministic with data shapes. Pandas lets you be more loose here. Also, and this is my favorite reason, you're in a true programming language. Programming languages give you infinitely more flexibility to apply advanced logic to the data. Of course there's also the rich ecosystem of modules and 3rd-party frameworks that SQL can't come close to. Being able to go from raw data all the way to web presentation or data visualization in one code base is VERY convenient. It's also much more portable. You can run Python almost anywhere, including public notebooks that can extend the reach of your results to get to people more quickly. Databases don't excel at this.



      My Advice?
      If you find yourself graduating to bigger and bigger datasets, you owe it to yourself to take the plunge and learn how RDBMSs can help. I've seen million-row, multi-table-join, summed-aggregate queries tuned from 5 minutes down to 2 seconds. Having this understanding in your tool belt just makes you a more well-rounded data scientist. You may be able to do everything in Pandas today, but some day you may have an assignment where an RDBMS is the best choice.



























        1













        I'll attempt to answer this question based on my own experience. In contrast to the other answers, I prefer SQL for deep learning and big-data-related things. There are numerous reasons for that. As can be seen here,




        Pandas provides an intuitive, powerful, and fast data analysis experience on tabular data. However, because Pandas uses only one thread of execution and requires all data to be in memory at once, it doesn’t scale well to datasets much beyond the gigabyte scale.




        Sql engines usually keep the keys or special columns in data-structures like $B ^+$ tree in order to facilitate CRUD operations. This data structure keeps the status of all data in the database. This is not pandas can do because it can't access all the data simultaneously. On the other hand, it can't do some operations even with its chunk parameter used in read_csv. As an example, you can't have direct batch operations for large datasets that your memory can't accommodate them. Any other tasks which depend on your entire dataset need extra coding. All of these can be handled in Sql without extra coding, just with a simple query. Simple Sql operations are just used without any fear about the memory.



        Another difference is that CRUD operations in SQL can be applied in a distributed fashion, under different authorisation policies, which is not possible in pandas.



        This is not meant to say which is better; it all depends on your task. For large-scale computation I prefer SQL, and for small tasks I prefer pandas.



        There are other things missing from pandas that are really important for fast data extraction, which I'll come back to later. For now, just take a look here.






































          0













          Pandas is more popular because Python, in the form of Jupyter notebooks, is the most popular toolbox used by data scientists in the neural network area. Python is becoming "the" language. It's even possible to use a SQL backend, but with pandas you're not bound to SQL alone.
































            10 Answers
            10






            active

            oldest

            votes

















            39













            The real first question is why people are more productive with DataFrame abstractions than with pure SQL abstractions.



            TL;DR: SQL is not geared around the (human) development and debugging process; DataFrames are.



            The main reason is that DataFrame abstractions allow you to construct SQL-like statements whilst avoiding verbose and illegible nesting. The pattern of writing nested routines, commenting them out to check them, and then uncommenting them is replaced by single lines of transformation. You can naturally run things line by line in a REPL (even in Spark) and view the results.



            Consider the example of adding a new transformed (string-mangled) column to a table, then grouping by it and doing some aggregations. The SQL gets pretty ugly. Pandas can solve this, but is missing some things when it comes to truly big data or, in particular, partitions (though this has perhaps improved recently).
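            A sketch of that example in pandas (the column names and data are invented for illustration): each step is one readable line, runnable and inspectable on its own, whereas the SQL equivalent either nests a subquery or repeats the string expression inside GROUP BY.

```python
import pandas as pd

# Hypothetical table of emails and spend amounts.
df = pd.DataFrame({"email": ["a@x.com", "b@x.com", "c@y.com"],
                   "spend": [10, 20, 30]})

# 1. Derive a string-mangled column (the part after the "@").
df["domain"] = df["email"].str.split("@").str[1]

# 2. Group by the new column and aggregate.
result = df.groupby("domain")["spend"].sum()
```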



            DataFrames should be viewed as a high-level API to SQL routines, even if with pandas they are not at all rendered to some SQL planner.



            --



            You can probably have many technical discussions around this, but I'm considering the user perspective below.



            One simple reason why you may see a lot more questions around Pandas data manipulation as opposed to SQL is that to use SQL, by definition, means using a database, and a lot of use-cases these days quite simply require bits of data for 'one-and-done' tasks (from .csv, web api, etc.). In these cases loading, storing, manipulating and extracting from a database is not viable.



            However, considering cases where the use-case may justify using either Pandas or SQL, you're certainly not wrong. If you want to do many, repetitive data manipulation tasks and persist the outputs, I'd always recommend trying to go via SQL first. From what I've seen the reason why many users, even in these cases, don't go via SQL is two-fold.



            Firstly, the major advantage pandas has over SQL is that it's part of the wider Python universe, which means that in one fell swoop I can load, clean, manipulate, and visualize my data (I can even execute SQL through pandas...). The other is, quite simply, that all too many users don't know the extent of SQL's capabilities. Every beginner learns the 'extraction syntax' of SQL (SELECT, FROM, WHERE, etc.) as a means to get data from a DB to the next place. Some may pick up some of the more advanced grouping and iteration syntax. But after that there tends to be a pretty significant gulf in knowledge, until you get to the experts (DBAs, data engineers, etc.).
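            Executing SQL through pandas can look like this minimal sketch with an in-memory SQLite database (the table and rows are invented for illustration): one call crosses from the SQL world into a DataFrame, ready for the rest of the Python toolchain.

```python
import sqlite3
import pandas as pd

# Hypothetical database with one small table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("ann", 34), ("bob", 28)])

# pandas runs the query and hands back a DataFrame for cleaning,
# plotting, modelling, etc.
df = pd.read_sql("SELECT * FROM users WHERE age > 30", conn)
```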



            tl;dr: It's often down to the use-case, convenience, or a gap in knowledge around the extent of SQL's capabilities.


























            • 2




              I think SQL largely being set based plays a large part, when a lot of people from other technical areas are used to handling data line by line. Also consider that data is mostly just data to pandas but different SQL engines support different built in functions which can get wildly annoying fast if you have to chop and change during your workday
              – Dave
              Jul 12 '18 at 14:26






            • 3




              I wouldn't say it's not viable. If you can get the data into a pandas data frame, you can probably shove it into a PostgreSQL DB. But for one and done, it's probably more effort and time than you would save.
              – jpmc26
              Jul 13 '18 at 7:37








            • 2




              I agree that some ETL approaches appear to be programmer-centric decisions. That is, they prefer to manipulate the data then present this "perfect" payload to the database. However, as you indicate, if it can be done via several SQL queries, then the extra programmatic layer is unnecessary. Exactly what I faced recently. As the OP and your answer indicates, it could be that "old-school" or DBA-centric people look at it and say, why not do it in SQL (even just several simple queries!). That said, I've found pandas to be very powerful for extremely diverse data sets.
              – SaltySub2
              Jul 13 '18 at 8:45








            • 1




              @SaltySub Just a point on shifting things out of the programmatic layer into SQL: It's a fair point and can be perfectly valid, but going as far as burying application logic in SQL procedures can bring its own special flavour of headache.
              – Electric Head
              Jul 13 '18 at 18:23










            • @ElectricHead I agree that there needs to be a right balance. If a series of SQL queries can perform the tasks adequately, it can definitely be easier and more efficient. Conversely, as you indicate, if one has to place a huge amount of logic into SQL procedures, etc. then pandas should be strongly considered. Particularly as above if you are using different database flavours - SQL syntax differences can get very hairy then.
              – SaltySub2
              Jul 15 '18 at 3:26


















            edited Feb 18 at 19:40


























            community wiki





            2 revs, 2 users 52%
            cvonsteg









            23













            As much as there is overlap in the application of these two things, this is comparing apples to oranges.



            pandas is a data analysis toolkit implemented in Python, a general-purpose programming language. SQL is a domain-specific language for querying relational data (usually in a relational database management system; SQLite, MySQL, Oracle, SQL Server and PostgreSQL are examples).



            SQL implies




            • working with data in an RDBMS* which may or may not be appropriate for the workload, even if it's just a small SQLite database,

            • database domain knowledge (as an end user, developer and/or administrator; the suggestion that "SQL is faster" I often see is a massive over-simplification), and

            • overcoming the not-insignificant learning curve in using SQL effectively, particularly in specialist applications such as data analysis (as opposed to creating simple reports of simple data).



            * It's worth underlining the fact that SQL is so domain-specific that it's becoming much less relevant to working with increasingly common alternatives to relational databases, such as NoSQL databases. These represent a fundamental shift in how data is stored and structured, and there is really no universal way of accessing them of the kind that SQL standardisation aimed to achieve.




            Python on the other hand (pandas is fairly "pythonic" so it holds true here) is flexible and accessible to people from various backgrounds. It can be used as a "scripting language", as a functional language and a fully featured OOP language. Visualisation capabilities and data source interoperability are built into pandas, but you're free to incorporate whatever Python can do into your workflow (which is most things); the scientific Python ecosystem has ballooned and includes great tools such as Jupyter Notebook and essential scipy libraries such as matplotlib and numpy (which pandas builds on). Significant elements of pandas' data analysis are R-inspired, and you won't generally find statisticians umming and ahhing about whether they use R (or possibly increasingly pandas!) over putting everything in a database and writing their analyses in SQL.



            I'm not saying pandas is better than SQL or vice versa, but SQL is a very domain-specific tool whereas pandas is part of a giant, flexible and accessible ecosystem. I work with geospatial data systems, of which relational databases are a huge part, and SQL is a powerful and essential tool. However, pandas is an equally if not more essential part of my day-to-day toolkit and SQL is often relegated to fetching data -- perhaps with some pre-processing -- so I can do things with it in pandas.
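            To make that division of labour concrete, here is a minimal sketch (the table, column names and values are invented for illustration): SQL fetches and pre-filters the rows, and pandas takes over from there.

```python
import sqlite3

import pandas as pd

# Stand-in for a real RDBMS: a tiny in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("A", 1.0), ("A", 3.0), ("B", 2.0), ("B", -9.0)],
)

# SQL is relegated to fetching the data, with a little pre-processing...
df = pd.read_sql_query(
    "SELECT site, value FROM readings WHERE value >= 0", conn
)

# ...and the actual analysis happens in pandas.
means = df.groupby("site")["value"].mean()
print(means.to_dict())  # {'A': 2.0, 'B': 2.0}
```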

















            $endgroup$









            • 1




              $begingroup$
              This is the only true answer, it should be the chosen one. SQL and Pandas are two different things, I do not understand what comparison people are trying to make.
              $endgroup$
              – gented
              Jul 13 '18 at 6:51










            • $begingroup$
              I suspect it's an end-user perspective of writing something code-like to fetch and massage some data from somewhere and spit out some numbers. I'm not entirely surprised; I've had first hand experience of how data analysts presented with an old but otherwise unremarkable Oracle database haven't even the first idea of what it is and how to connect to it let alone get data out. I believe it betrays a fundamental lack of understanding of the technology - I've actually added a bit to hopefully emphasise how quickly misunderstanding the scope of SQL falls down.
              $endgroup$
              – Electric Head
              Jul 13 '18 at 18:00










            • $begingroup$
              I'd challenge your bit about being irrelevant to NoSQL situations. Consider for example the strides PostgreSQL has made with its JSON storage.
              $endgroup$
              – jpmc26
              Jul 14 '18 at 20:31










            • $begingroup$
              I tried to choose my words carefully; PostgreSQL is still an RDBMS despite doing a lot of things well (as SQL Server is despite supporting graphs). But, I've relaxed the wording a touch because it's still a good point: there is some crossover and, importantly, SQL APIs do exist for some NoSQL systems. It is crossover though, SQL is not a universal language and not all data is structured relationally.
              $endgroup$
              – Electric Head
              Jul 15 '18 at 8:25










            • $begingroup$
              I think you can do everything in SQL which is possible in pandas. SQL is not flexible but is so much optimised.
              $endgroup$
              – Media
              Jul 18 '18 at 0:55
















            edited Jul 15 '18 at 7:52
            community wiki Electric Head




















            14












            $begingroup$

            First, pandas is not that popular. I use both pandas and SQL. I first try to understand the task: if it can be done in SQL, I prefer SQL because it is more efficient than pandas. Try working on large data (10,000,000 x 50). Try doing some groupby operation in both SQL and pandas. You will understand.
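            For reference, the two spellings of the same groupby look like this (toy data invented for illustration; at this size the performance difference obviously doesn't show, only the syntax does):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"dept": ["a", "a", "b"], "salary": [10, 20, 30]})

# The pandas version of the aggregation...
pandas_totals = df.groupby("dept")["salary"].sum()

# ...and the equivalent SQL GROUP BY, run against the same rows in SQLite.
conn = sqlite3.connect(":memory:")
df.to_sql("emp", conn, index=False)
sql_totals = pd.read_sql_query(
    "SELECT dept, SUM(salary) AS salary FROM emp GROUP BY dept", conn
)

print(pandas_totals.to_dict())  # {'a': 30, 'b': 30}
print(sql_totals.set_index("dept")["salary"].to_dict())  # {'a': 30, 'b': 30}
```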



            I use pandas where it comes in handy, like splitting a column's values into an array and doing some work on it (such as choosing only certain values out of that array). This kind of task is relatively hard to code in SQL, but pandas will ease your task.
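            A sketch of that split-a-column use case (the column name and tag values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"tags": ["red,green,blue", "green,yellow"]})

# Split each comma-separated string into a Python list...
df["tag_list"] = df["tags"].str.split(",")

# ...then keep only the tags starting with "g" from each list.
df["g_tags"] = df["tag_list"].apply(
    lambda tags: [t for t in tags if t.startswith("g")]
)

print(df["g_tags"].tolist())  # [['green'], ['green']]
```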

















            $endgroup$













            • $begingroup$
              Is this inefficiency specific to pandas? I've done quite a bit of in-memory data manipulation in C# and found it pretty easy and efficient, provided it fit memory and was one-shot (i.e. no need to incrementally update indexes as the data changes).
              $endgroup$
              – CodesInChaos
              Jul 13 '18 at 11:59










            • $begingroup$
              pandas is meant to be convenient over fast, but that's not to say it can't be fast if you use it right. In the end, executing a SQL query on data in a database isn't magic - it requires resources like anything, it's just that (if you do it right!) you're hopefully making use of resources on carefully configured, beefy database servers. Getting your pipeline right in pandas or similar (e.g. streaming data rather than loading it all into memory) is going to determine how successful some efforts are.
              $endgroup$
              – Electric Head
              Jul 13 '18 at 18:05












            • $begingroup$
              @CodesInChaos There is this answer on pandas vs SQL - qr.ae/TUIpzE. It describes the advantages and disadvantages of using pandas.
              $endgroup$
              – Ankit Seth
              Jul 14 '18 at 2:38


















            answered Jul 12 '18 at 9:36
            community wiki Ankit Seth


























            9












            $begingroup$

            I'm one of those people who would use (in my case) R's dplyr (the language, not necessarily the tool) in every case if I could, even though I know my SQL.



            The major benefit I see in Pandas/dplyr/data.table pipelines is that the operations are atomic and can be read top to bottom.



            In SQL you need to parse the whole script, jumping around (what's being summarized, what's being joined and how: left? inner? right? Are there any filters applied?) to fully grasp what is happening.



            In Pandas et al., each step of the pipeline is self-contained: it does something with the input data and returns output data. This sequential process makes it easier to reason about what's happening, since there is a clearly defined state for each operation rather than only at the level of the whole query.



            And yes, you can use WITH statements and the like, but they require much more code, and it is less clear which object is being used compared to piping.
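
As an illustration of the "read top to bottom" point, here is a sketch of a pandas pipeline over made-up sales data; each chained step maps to one clause of the equivalent SQL (WHERE, GROUP BY, ORDER BY):

```python
import pandas as pd

# Made-up sales data, for illustration only.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "amount": [10, 20, 5, 15, 25],
})

# Each step is atomic and the pipeline reads top to bottom:
# filter (WHERE), aggregate (GROUP BY ... SUM), sort (ORDER BY).
result = (
    df[df["amount"] > 5]
      .groupby("region", as_index=False)["amount"]
      .sum()
      .sort_values("amount", ascending=False)
)
print(result)
```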

















            $endgroup$


















                answered Jul 12 '18 at 11:44


























                community wiki





                Robin Gertenbach
























                    3












                    $begingroup$

                    I thought I would add that I do a lot of time-series-based data analysis, and the pandas resample and reindex methods are invaluable for it. Yes, you can do similar things in SQL (I tend to create a DateDimension table to help with date-related queries), but I just find the pandas methods much easier to use.



                    Also, as others have said, the rest of my modelling is in Python, and I often have web calls or CSV files.
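
For instance, here is a minimal sketch of resample and reindex on a made-up irregular series; regularizing it like this is what a calendar/DateDimension join would be used for in SQL:

```python
import pandas as pd

# Made-up irregular daily observations, for illustration only.
ts = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-05"]),
)

# resample: aggregate onto a regular daily grid (empty days sum to 0).
daily = ts.resample("D").sum()

# reindex: force an explicit calendar, filling the gaps.
calendar = pd.date_range("2018-01-01", "2018-01-06")
filled = ts.reindex(calendar, fill_value=0.0)
```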

















                    $endgroup$


























                        edited Jul 13 '18 at 20:28


























                        community wiki





                        Ken Syme
























                            3












                            $begingroup$

                            Things Pandas can do that SQL can't do:




                            1. df.describe()

                            2. Plotting, e.g. df['population'].plot(kind='hist')

                            3. Use a dataframe directly for training machine learning algorithms


                            Things Pandas can do that I wasn't aware SQL can do as well:




                            1. Export to CSV: df.to_csv('foobar.csv'). This is important when you want to show something to a business owner who wants to work with Excel. There is df.to_excel as well. But in SQL (MySQL's dialect, at least), you can do SELECT a,b,a+b INTO OUTFILE '/tmp/result.txt' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' FROM test_table; (thank you, vy32!)
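
A quick sketch of the first and last points with made-up data; describe() is the one-liner that would otherwise take several aggregate queries:

```python
import os
import tempfile

import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({"population": [100, 200, 300, 400]})

# count, mean, std, min, quartiles, and max in one call.
summary = df["population"].describe()

# CSV round trip for the Excel-oriented business owner.
path = os.path.join(tempfile.gettempdir(), "population.csv")
df.to_csv(path, index=False)
back = pd.read_csv(path)
```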

















                            $endgroup$









                            • 1




                              $begingroup$
                              Nice. Although most of these seem like functions that could be implemented in SQL. (SQL does have direct CSV export.)
                              $endgroup$
                              – vy32
                              Jul 18 '18 at 19:51










                            • $begingroup$
                              Could you please send me a query which exports to CSV? (I only know tools which do this for some SQL-based databases, but I've never seen a query ... so I doubt that this is part of the SQL specification)
                              $endgroup$
                              – Martin Thoma
                              Jul 18 '18 at 20:19






                            • 1




                              $begingroup$
                              SELECT a,b,a+b INTO OUTFILE '/tmp/result.txt' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' FROM test_table; See dev.mysql.com/doc/refman/8.0/en/select-into.html
                              $endgroup$
                              – vy32
                              Jul 19 '18 at 14:17










                            • $begingroup$
                              Thank you so much, vy! I think I'll adjust my answer when I'm home :-)
                              $endgroup$
                              – Martin Thoma
                              Jul 19 '18 at 16:52










                            • $begingroup$
                              Sure thing. Remember, the file ends up on the SQL server, not the client.
                              $endgroup$
                              – vy32
                              Jul 20 '18 at 18:21
















                            edited Jul 20 '18 at 20:19


























                            community wiki





                            Martin Thoma




















                            2












                            $begingroup$

                            The only thing not covered in these answers that I'd like to mention is that it also depends on how you're using SQL. Take arcpy, for example. For some reason, none of the arcpy.da functions have an executemany feature. This is really strange, because pretty much every other Python SQL library does. The Where clause in the arcpy.da functions is also limited to around 120 characters. This essentially means that if you have any relatively high number of things you're trying to do with your database, your only real choice is to call your chosen arcpy.da function multiple times, changing the where clause each time you do. There are a few tricks you can use to make this process go faster (you can iterate over chunks of your dataset, for example), but literally every single one of these tricks is much, much slower than just using one arcpy.da.SearchCursor to load your entire table into a pandas data frame, and then manipulating it using pandas, numpy, and, if your data is really that massive, dask. I need to emphasize here that pandas isn't just a little faster in this case. It's disgustingly faster. It's so much faster that I was literally laughing at myself for not doing it sooner. Using pandas dropped one script's execution time from well over an hour (I forget whether the jump was from 3.5 hours or from 1.5 hours) down to literally 12 minutes.



                            One thing to note is that while I could have done this with SQL, it would have taken me a lot longer to learn. I would either have had to learn operations specifically for SQL in Access (that's where the data for this script ended up, and SQL in Access wasn't as robust as I needed it to be when I was actually looking into doing this), or I would have had to write all my data to a sqlite3 database, manipulate it there, and then put it in Access. While this might have given me similar performance results, it would have made my script harder to modify in the future.



                            So yeah, sometimes Pandas is just strictly better than the SQL options you have at your disposal. Everything I would have needed to do in SQL was done with a function in pandas. You can also use SQL syntax with pandas if you want to. There's little reason not to use pandas and SQL in tandem.



                            One more thing I want to mention about Pandas and numpy: both of these libraries are by nature set-based approaches. You can loop through dataframes and series built with these libraries, but it's really hard to modify data in these structures that way, so you'll end up writing more efficient, set-based code with both of these libraries purely because it's so much easier to do. Being "guided", if not railroaded, into using set-based approaches is not something I've experienced with SQL.



                            One more massive thing I forgot to mention about Pandas: money. Pandas is a tool that a lot of Data Science jobs want you to know how to use. Pretty much every Data Science job I've looked at has paid more than database management type jobs. The only exception to this that I've noticed is in Data Engineering, but I've seen far fewer of those job postings. At a glance, Pandas looks like it makes you more money.
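
Coming back to the set-based point: a small sketch, with made-up numbers, of the row-by-row loop that pandas discourages versus the vectorized expression it nudges you toward:

```python
import numpy as np
import pandas as pd

# Made-up table, for illustration only.
n = 100_000
df = pd.DataFrame({"a": np.arange(n), "b": np.arange(n)})

# Row-by-row: fights the library and is dramatically slower.
looped = [row.a + row.b for row in df.itertuples()]

# Set-based: one expression over entire columns.
df["c"] = df["a"] + df["b"]

assert df["c"].tolist() == looped  # identical result, very different cost
```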

















                            $endgroup$









                            • 4




                              $begingroup$
                              Perhaps sad that when it comes to modern jobs it's about having the right buzzwords in your resume as opposed to the approaches you take to solving a problem (assuming you can learn said buzzword relatively fast). It's like the buzzword is more important than the problem solving. When the problem solving for X should involve learning and using technology A, B, C, not the reverse. I wonder if most development teams now smash out things because of buzzword-ism and trendiness, then think about problem solving as a secondary, or "old-school" thing because you didn't know/use said buzzword.
                              $endgroup$
                              – SaltySub2
                              Jul 13 '18 at 8:49








                            • 1




                              $begingroup$
                              @ElectricHead in my experience if you're writing your own function involving sql in python, it's easier to just misuse your cursor and write bad queries than it is using pandas/numpy. Gotta remember that not all sql modules/libraries are made the same. In my case, with arcpy.da.SearchCursors and the like, there's really not a good way to do something to a bunch of records efficiently because of weird limitations. If I use pandas/numpy there becomes one good way to do things, and that's what I want when using python.
                              $endgroup$
                              – Steve
                              Jul 13 '18 at 18:23






                            • 1




                              $begingroup$
                              Ahhh, OK. You mean a homespun SQL pipeline via a python dbapi implementation vs using numpy/pandas? In which case, yeah gotcha, no argument from me there; care required! It read to me as vs plain SQL which you obviously need to understand set operations with, but will find that out quite quickly when running silly queries from a database client.
                              $endgroup$
                              – Electric Head
                              Jul 13 '18 at 18:33






                            • 1




                              $begingroup$
                              @Steve Yes, won't stop people trying to dynamically modify stuff in loops in pandas or similar though :) I think understanding SQL helps working in pandas effectively (it's not like they hide the similarity in some concepts though).
                              $endgroup$
                              – Electric Head
                              Jul 13 '18 at 18:35






                            • 1




                              $begingroup$
                              @Steve Indeed pandas is powerful too... I guess one of my frustrations is developers and management both, including myself, not spending adequate time evaluating solutions and chasing trends (where money is involved to promote self/company). But even in lean prototyping/ mvp one would have to lay appropriate groundwork for scaling. SQL, noSQL and Pandas... all have their purposes for the appropriate tasks and projects at different stages. For the past year plus, noSQL for a lean prototype/ mvp certainly helped me in more ways than one. SQL would have been overkill for that.
                              $endgroup$
                              – SaltySub2
                              Jul 15 '18 at 3:32


















                            2












                            $begingroup$

                            The only thing not covered in these answers that I'd like to mention is that it also depends on how you're using SQL. Take arcpy for example. For some reason none of the arcpy.da functions have an execute many feature. This is really strange because pretty much every other python sql library does. The Where statement in the arcpy.da functions is also limited to around 120 characters. This essentially means that if you have any relatively high number of things you're trying to do with your database your only real choice is to call your chosen arcpy.da function multiple times, changing the where statement each time you do. There are a few tricks you can use to make this process go faster - you can iterate over chunks of your dataset for example - but literally every single of these tricks is much much slower than just using one arcpy.da.searchcursor to load your entire table into a pandas data frame, and then manipulating it using pandas, numpy, and, if your data is really this massive, dask. I need to emphasize here that pandas isn't just a little faster in this case. It's disgustingly faster. It's so much faster that I was literally laughing at myself for not doing it sooner. Using pandas dropped one scripts execution time down from well over an hour - I forget if this was the jump from 3.5 hours or from 1.5 hours - to literally 12 minutes.



                            One thing to note is that while I could have done this with SQL, it would have taken me a lot longer to learn. I would have either had to learn operations specific to SQL in Access (that's where the data for this script ended up, and its SQL dialect wasn't as robust as I needed it to be when I actually looked into doing this), or I would have had to write all my data to a sqlite3 database, manipulate it there, and then put it into Access. While that might have given me similar performance results, it would have made my script harder to modify in the future.



                            So yeah, sometimes Pandas is just strictly better than the SQL options you have at your disposal. Everything I would have needed to do in SQL was done with a function in pandas. You can also use SQL syntax with pandas if you want to. There's little reason not to use pandas and SQL in tandem.
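As a minimal sketch of using the two in tandem (an in-memory sqlite3 database with an invented sales table): the same aggregation written once in SQL syntax, landing straight in a DataFrame via pd.read_sql_query, and once as a pandas function call:

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.0)],
)

# SQL syntax, result delivered as a DataFrame.
by_sql = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", con
)

# The same aggregation as a pandas function call.
df = pd.read_sql_query("SELECT region, amount FROM sales", con)
by_pandas = (
    df.groupby("region", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total"})
)

# Both routes agree on the per-region totals.
assert dict(zip(by_sql["region"], by_sql["total"])) == dict(
    zip(by_pandas["region"], by_pandas["total"])
)
```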



                            One more thing I want to mention about pandas and numpy is that both of these libraries are set-based by nature. You can loop through DataFrames and Series built with these libraries, but it's really hard to modify data in those structures that way, so you'll end up writing more efficient, set-based code with both of them purely because it's so much easier to do. Being "guided", if not rail-roaded, into set-based approaches is not something I've experienced with SQL.
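A toy illustration of that set-based point (hypothetical qty/price columns): both paths compute the same totals, but the vectorized expression is the one pandas makes easy:

```python
import pandas as pd

df = pd.DataFrame({"qty": [2, 5, 3], "price": [10.0, 4.0, 6.0]})

# Row-by-row loop: awkward in pandas, and slow on anything large.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["qty"] * row["price"])

# Set-based (vectorized): one expression over whole columns.
df["total"] = df["qty"] * df["price"]

assert totals_loop == df["total"].tolist()
```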



                            One more massive thing I forgot to mention about Pandas: money. Pandas is a tool that a lot of Data Science jobs want you to know how to use. Pretty much every Data Science job I've looked at has paid more than database-management-type jobs. The only exception I've noticed is Data Engineering, but I've seen far fewer of those job postings. At a glance, Pandas looks like it makes you more money.

















                            $endgroup$









                            • 4




                              $begingroup$
                              Perhaps sad that when it comes to modern jobs, it's about having the right buzzwords in your resume as opposed to the approaches you take to solving a problem (assuming you can learn said buzzword relatively fast). It's as if the buzzword is more important than the problem solving, when solving problem X should drive learning and using technologies A, B, C, not the reverse. I wonder if most development teams now smash things out because of buzzword-ism and trendiness, then treat problem solving as secondary, or "old-school", because you didn't know/use said buzzword.
                              $endgroup$
                              – SaltySub2
                              Jul 13 '18 at 8:49








                            • 1




                              $begingroup$
                              @ElectricHead in my experience if you're writing your own function involving sql in python, it's easier to just misuse your cursor and write bad queries than it is using pandas/numpy. Gotta remember that not all sql modules/libraries are made the same. In my case, with arcpy.da.SearchCursors and the like, there's really not a good way to do something to a bunch of records efficiently because of weird limitations. If I use pandas/numpy there becomes one good way to do things, and that's what I want when using python.
                              $endgroup$
                              – Steve
                              Jul 13 '18 at 18:23






                            • 1




                              $begingroup$
                              Ahhh, OK. You mean a homespun SQL pipeline via a python dbapi implementation vs using numpy/pandas? In which case, yeah gotcha, no argument from me there; care required! It read to me as vs plain SQL which you obviously need to understand set operations with, but will find that out quite quickly when running silly queries from a database client.
                              $endgroup$
                              – Electric Head
                              Jul 13 '18 at 18:33






                            • 1




                              $begingroup$
                              @Steve Yes, won't stop people trying to dynamically modify stuff in loops in pandas or similar though :) I think understanding SQL helps working in pandas effectively (it's not like they hide the similarity in some concepts though).
                              $endgroup$
                              – Electric Head
                              Jul 13 '18 at 18:35














                            answered Jul 12 '18 at 15:06


























                            community wiki





                            Steve















                            2












                            $begingroup$

                            I'm fairly new to Pandas/Python, but have 20+ years as a SQL Server DBA, architect, administrator, etc. I love Pandas, and I push myself to always try to make things work in Pandas before returning to my comfy, cozy SQL world.



                            Why RDBMS's are Better:
                            The advantage of RDBMS's is their years of experience optimizing query speed and data read operations. What's impressive is that they do this while simultaneously balancing the need to optimize write speed and manage highly concurrent access. Sometimes this additional overhead tilts the advantage to Pandas for simple, single-user use cases. But even then, a seasoned DBA can tune a database to be highly optimized for read speed over write speed. DBA's can take advantage of things like optimized data storage, strategic disk page sizing, page filling/padding, data controller and disk partitioning strategies, optimized I/O plans, in-memory data pinning, pre-defined execution plans, indexing, data compression, and much more. I get the impression from many Pandas developers that they don't understand the depth that's available there. What I think usually happens is that if a Pandas developer never has data big enough to need these optimizations, they don't appreciate how much time they can save out of the box. The RDBMS world has 30 years of experience optimizing this, so if raw speed on large datasets is needed, RDBMS's can't be beat.



                            Why Is Python/Pandas Better:
                            That said, speed isn't everything, and in many use cases it isn't the driving factor. It depends on how you're using the data, whether it's shared, and whether you care about the speed of processing.
                            RDBMS's are generally more rigid in their data structures and put a burden on the developer to be more deterministic with data shapes. Pandas lets you be looser here. Also, and this is my favorite reason, you're in a true programming language. Programming languages give you infinitely more flexibility to apply advanced logic to the data. Of course, there's also the rich ecosystem of modules and third-party frameworks that SQL can't come close to. Being able to go from raw data all the way to web presentation or data visualization in one code base is VERY convenient. It's also much more portable: you can run Python almost anywhere, including public notebooks that extend the reach of your results and get them to people more quickly. Databases don't excel at this.



                            My Advice?
                            If you find yourself graduating to bigger and bigger datasets, you owe it to yourself to take the plunge and learn how RDBMS's can help. I've seen million-row, multi-table-join, summed-aggregate queries tuned from 5 minutes down to 2 seconds. Having this understanding in your tool belt just makes you a more well-rounded data scientist. You may be able to do everything in Pandas today, but some day you may have an assignment where an RDBMS is the best choice.
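One of those tuning levers, indexing, is easy to sketch (sqlite3 standing in for SQL Server; the table and index names are invented): the query plan flips from a full table scan to an index search once the index exists:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10000)],
)

query = "SELECT SUM(amount) FROM orders WHERE customer_id = 7"

# Without an index, the lookup scans the whole table.
plan_before = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Add an index on the filter column, then ask the planner again.
con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# The plan detail (last column) now mentions the index.
assert any("SCAN" in row[-1] for row in plan_before)
assert any("USING INDEX" in row[-1] for row in plan_after)
```

On a toy table the difference is invisible; on the million-row joins described above, this kind of change is exactly where the 5-minutes-to-2-seconds tuning comes from.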

















                            $endgroup$


















                              2












                              $begingroup$

                              I'm fairly new to Pandas/Python but have 20+ years as a SQLServer DBA, architect, administrator, etc.. I love Pandas and I'm pushing myself to always try to make things work in Pandas before returning to my comfy, cozy SQL world.



                              Why RDBMS's are Better:
                                I'm fairly new to Pandas/Python but have 20+ years of experience as a SQL Server DBA, architect, and administrator. I love Pandas, and I push myself to try to make things work in Pandas before returning to my comfy, cozy SQL world.



                                Why RDBMS's are Better:
                                The advantage of RDBMS's is their years of experience optimizing query speed and data read operations. What's impressive is that they do this while simultaneously balancing the need to optimize write speed and manage highly concurrent access. Sometimes this additional overhead tilts the advantage to Pandas for simple, single-user use cases. But even then, a seasoned DBA can tune a database to be highly optimized for read speed over write speed. DBAs can take advantage of optimized data storage, strategic disk page sizing, page filling/padding, data-controller and disk-partitioning strategies, optimized I/O plans, in-memory data pinning, pre-defined execution plans, indexing, data compression, and much more. I get the impression from many Pandas developers that they don't understand the depth that's available there. What usually happens is that if a Pandas developer never has data big enough to need these optimizations, they don't appreciate how much time those optimizations can save out of the box. The RDBMS world has had 30 years to optimize all this, so if raw speed on large datasets is needed, RDBMS's can't be beat.
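                                As an illustration of how much one piece of that tuning toolbox can matter, here is a minimal sketch using Python's built-in sqlite3 module (the table and data are made up for illustration): the same filtered aggregate goes from a full table scan to an index search after a single CREATE INDEX statement.

```python
import sqlite3

# Hypothetical table and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east" if i % 2 else "west", float(i)) for i in range(100_000)],
)

query = "SELECT SUM(amount) FROM sales WHERE region = 'east'"

# Without an index, the filter is a full table scan.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# One DDL statement later, the planner can use the index instead.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```

The first plan reports a scan of the table; the second reports a search using idx_sales_region. On a toy in-memory table the timing difference is negligible, but on disk-resident tables with millions of rows this is exactly the kind of change a DBA reaches for first.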



                                Why Python/Pandas is Better:
                                That said, speed isn't everything, and in many use cases it isn't the driving factor. What matters is how you're using the data, whether it's shared, and whether you care about processing speed.
                                RDBMS's are generally more rigid in their data structures and put a burden on the developer to be more deterministic about data shapes. Pandas lets you be looser here. Also, and this is my favorite reason, you're in a true programming language. A programming language gives you infinitely more flexibility to apply advanced logic to the data. Of course, there's also the rich ecosystem of modules and third-party frameworks that SQL can't come close to. Being able to go from raw data all the way to web presentation or data visualization in one code base is very convenient. It's also much more portable: you can run Python almost anywhere, including public notebooks that extend the reach of your results and get them to people more quickly. Databases don't excel at this.
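                                A minimal sketch of that raw-data-to-presentation point, with a made-up CSV standing in for a raw file: read, aggregate, and you're one method call away from a chart, all in one script.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a raw data file.
raw = io.StringIO(
    "region,month,amount\n"
    "east,Jan,100\neast,Feb,120\nwest,Jan,90\nwest,Feb,130\n"
)
df = pd.read_csv(raw)

# The same logic a GROUP BY would express, with arbitrary
# Python available on top of it.
summary = df.groupby("region")["amount"].sum()
print(summary)

# From here, summary.plot.bar() gives a chart, or the frame can
# go straight into a notebook or web framework.
```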



                                My Advice?
                                If you find yourself graduating to bigger and bigger datasets, you owe it to yourself to take the plunge and learn how RDBMS's can help. I've seen million-row, multi-table-join, summed-aggregate queries tuned from 5 minutes down to 2 seconds. Having this understanding in your tool belt makes you a more well-rounded data scientist. You may be able to do everything in Pandas today, but some day you may have an assignment where an RDBMS is the best choice.






                                answered Feb 16 at 19:36 by sisdog (community wiki)
                                    I'll attempt to answer this question based on my own experience. In contrast to the other answers, I prefer SQL for deep learning and big-data-related work, for a number of reasons. As can be seen here,




                                    Pandas provides an intuitive, powerful, and fast data analysis experience on tabular data. However, because Pandas uses only one thread of execution and requires all data to be in memory at once, it doesn’t scale well to datasets much beyond the gigabyte scale.




                                    SQL engines usually keep the keys or special columns in data structures like a $B^+$ tree in order to speed up CRUD operations. Such a data structure tracks the state of all the data in the database. Pandas can't do this, because it can't access all the data simultaneously. It also can't do some operations even with the chunksize parameter of read_csv: for example, you can't run batch operations directly over a dataset that doesn't fit in memory, and any task that depends on the entire dataset needs extra coding. All of this can be handled in SQL without extra code, with a simple query, and simple SQL operations can be used without any fear of running out of memory.
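                                    That "extra coding" can be sketched as follows, with made-up data (the read_csv parameter is chunksize): a grouped sum over chunked input requires manually combining per-chunk partial results, whereas the equivalent SQL is a single SELECT ... GROUP BY.

```python
import io
import pandas as pd

# Made-up CSV standing in for a file too large to load at once.
raw = io.StringIO(
    "region,amount\n" + "\n".join(
        f"{'east' if i % 2 else 'west'},{i}" for i in range(10)
    )
)

# Aggregate each chunk separately...
partials = []
for chunk in pd.read_csv(raw, chunksize=3):
    partials.append(chunk.groupby("region")["amount"].sum())

# ...then combine the partial aggregates by hand. In SQL this whole
# dance is just: SELECT region, SUM(amount) FROM t GROUP BY region.
total = pd.concat(partials).groupby(level=0).sum()
print(total)
```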



                                    Another difference is that CRUD operations in SQL can be applied in a distributed fashion, with differing authorization policies, which is not possible in Pandas.



                                    This is not to say which is better; it all depends on the task. For large-scale computation I prefer SQL, and for small tasks I prefer Pandas.



                                    There are other features missing from Pandas that are really important for fast data extraction, which I'll come back to later. For now, take a look here.






                                    answered Nov 7 '18 at 19:44 by Media (community wiki)
                                            Pandas is more popular because Python, in the form of Jupyter notebooks, is the most popular toolbox used by data scientists in the neural-network area. Python is becoming "the" language. It's even possible to use a SQL backend, but with Pandas you're not bound to SQL alone.
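                                            A minimal sketch of that mix, using the stdlib sqlite3 module and a made-up table: the filtering runs in the SQL backend, and the result arrives as a DataFrame for whatever Python comes next.

```python
import sqlite3
import pandas as pd

# Hypothetical table and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 10), ("b", 20), ("c", 30)])

# The database does the filtering; Pandas takes it from there.
df = pd.read_sql("SELECT name, score FROM scores WHERE score > 15", conn)
print(df)
```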






                                                answered 19 mins ago by user3800527 (community wiki)