Special Edition: Analytics As Applied Accounting
Welcome to my Data Analytics Journal, where I write about data science and analytics.
I have prepared a surprise for you in this special kick-off 2024 newsletter: an interview with the famous, daring, fun-at-parties, one-and-only Lauren Balik! ⭐
It’s either love or hate with her - there’s no in-between - but you can’t ignore her bullshit-free, academically inspiring, and exciting-to-decipher writing. Since meeting Lauren, I’ve been fascinated by her attention to detail, thorough and in-depth research, professional storytelling, and courage.
Lauren Balik is a data leader, data wrangler, blogger, advisor, investor, and owner of Upright Analytics. She is a co-host of the Tech Bros Show with Mary McCarthy.
I am honored Lauren agreed to be interviewed for my humble newsletter, and I am excited to share her always-insightful, fresh, spicy, and detailed perspective on analytics.
Below, you will learn about:
Navigating tensions between Finance and Data teams.
The biggest challenges for data and analytics leadership today.
Centralized vs. decentralized data team setups.
The reckoning and fate of BI tooling.
Will next-generation spreadsheets finally take over Excel?
Why and how Tableau continues to dominate the market despite its challenges and limitations.
Should you consider switching to semantic layers?
And much more! 🔥
Today, you can become a data analyst or engineer in 3-4 months via bootcamps. Every few months, hundreds of people enter the data industry without a proper foundation or an appropriate degree, yet they can secure salaries of $130k or higher.
How does this affect the data industry?
Great question. I think this is largely on the way out.
What is often pitched by bootcamps or cloud vendors is this romanticized version of what an analyst or data engineer does. There is this mythology that a company will have its back to the wall, about to go under, then all of a sudden, this hero analyst crunches a variety of datasets and finds the one true insight to make the company millions of dollars and save the day.
In reality, most data jobs have merely become “human middleware” jobs. These are generally “glue” jobs of gluing systems together and stringing together APIs and data transfer between systems. Oftentimes, when System A and System B cannot be glued together directly, the solution is to dump System A and System B data into a data lake or warehouse and then glue it together afterward.
The number of human glue jobs in the market is determined mostly by interest rates and consumer discretionary spend. When interest rates go up and consumer discretionary spend goes down, companies have less money, and it becomes expensive to have the 15th analytics engineer or 23rd data scientist on staff.
How does this affect the industry? Well, in sales, you have the concept of the ICP, or Ideal Customer Profile, the real or imagined persona of the person who will buy or otherwise transact with your offering. The proliferation of new career people and bootcamp grads has become an ICP. There are entire teams of people with titles like Developer Advocate or Data Advocate at product vendors who try to control a narrative to sell into this ICP. However, this ICP typically does not have budget authority to make big purchases, so the products are all based on install numbers, GitHub stars, and other loose metrics the vendors use to raise more money from venture capital firms.
This is why so much of the narrative is based around new tools and toys – it’s easier for product vendors to sell junk to many 25-year-olds at startups vs. selling more complete solutions to 45-year-old VPs of Engineering or Directors of Analytics at more established companies.
What do you think are the biggest challenges for data analytics leaders today?
Right now, the biggest challenge for analytics leaders is knowing where to draw boundaries. For example, in the last few years, one trend I’ve noticed and been pretty vocal about is the use of analytics teams in doing accounting. What happens is that a privately owned, venture-backed company starts growing, and the number of SKUs or products they sell increases, and their sales deals become more complex. Now, larger, established businesses will have an ERP system, a quote-to-cash system, and purpose-built software for handling the accounting. However, those are all expensive things to buy and manage in terms of both the products and the headcount needed to manage them.
What happens is that the Finance team will often start encroaching on a capital D “Data” team or analytics team since they are already on staff, and begin taking more time and effort from the Data team to custom roll an accounting function, with logic often held together in analytics tools and on cloud data warehouses like Snowflake owning the compute for crunching the accounting numbers.
I have very rarely seen this go well. There are many issues that arise.
First, there is almost nobody in the analytics world who is a Certified Public Accountant, or CPA.
Second, accounting is very fluid, which is something most people outside of finance do not fully appreciate. Projections for the upcoming year, changes in accounting laws, and changes in what is considered COGS are all levers that accounting teams pull in order to present the best possible picture to investors, the market, and the government to which a company pays taxes. It’s very fluid.
When you have this fluidity going on naturally with accounting, then you add an analytics team into the mix to come up with the numbers and logic to arrive at these numbers, you have many moving parts.
I am very much of the opinion that accounting should be able to close their books based on the systems they own and operate without the need for intervention from analytics teams.
Plus, I know many junior people who think they are signing up for an analytics job or data science job but inevitably get sucked into being an accountant-without-the-CPA.
It’s up to Data & Analytics leaders to push back on things like this internally.
What is your take on centralized vs decentralized data team setup? Do you have a preference? Pros/cons for each?
Decentralized, which I take to mean that analysts/data scientists/ops people sit in business units like FP&A or Marketing or Sales, is in virtually all cases the best way to get things done.
This puts the actual analysis close to the business outcomes and puts the analysis of data closest to profit centers, not cost centers, like centralized IT teams.
One of the biggest industry changes that we will see in 2024, at least in the United States, though there is a global impact, is what is called Section 174 of the IRS Code.
https://www.thomsonreuters.com/en-us/posts/tax-and-accounting/5-things-sect-174-capitalization/
I don’t think very many startups are doing this right, and I am pretty doubtful most larger enterprises are thinking about this, either.
Essentially, the amended IRC Sec. 174 eliminates the ability for businesses to deduct their R&D costs as an expense in the year they are incurred. Instead, they must capitalize these expenses and amortize them over a period of 5 years for research performed in the US or 15 years for research performed abroad. In short, technical staff and engineers are becoming a lot more expensive.
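To make the arithmetic concrete, here is a rough sketch of how the 5-year and 15-year periods mentioned above spread out a deduction. The $1M cost figure and the function name `amortization_schedule` are made up for illustration, and the sketch assumes straight-line amortization that starts at the midpoint of the first tax year, so the first and last years each get half a year's worth. Real filings have more moving parts than this.

```python
def amortization_schedule(cost: float, years: int) -> list[float]:
    """Yearly deductions for a capitalized cost.

    Assumption of this sketch: straight-line amortization with a
    mid-year start, so the schedule touches years + 1 tax years.
    """
    full_year = cost / years
    # Half a year's deduction in the first and last tax years,
    # a full year's deduction in every year in between.
    return [full_year / 2] + [full_year] * (years - 1) + [full_year / 2]


# Hypothetical: $1M of engineering salaries classified as R&D.
five_year = amortization_schedule(1_000_000, 5)
fifteen_year = amortization_schedule(1_000_000, 15)

print(f"Year 1 deduction, 5-year period:  ${five_year[0]:,.0f}")
print(f"Year 1 deduction, 15-year period: ${fifteen_year[0]:,.0f}")
```

Under these assumptions, only a tenth of the cost is deductible in the first year on the 5-year schedule, compared with all of it when the cost could be expensed immediately. That gap is what makes the same headcount more expensive in tax terms.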
Many data engineers and even analytics and data science (research) jobs will be impacted, in addition to product management and product engineering.
Whether you run a business unit, a P&L, or a whole team, or you are an individual contributor, you should be asking internally where your salary rolls up. Are you R&D? G&A? S&M? There is going to be a lot of shuffling around and likely layoffs, given that US businesses, and especially smaller businesses and startups, can no longer take R&D expenses in the same tax year. In general, this benefits big fish like Google, Microsoft, Amazon, and others at the expense of startups.
I think this will all lead to more decentralization in many small and mid-size businesses. Large headcount teams will just become too expensive to be considered R&D.
The current market for BI tools is limited. Although there are many cloud analytics dashboard tools, not many can overtake Tableau in its reach and functionality. Why do you think this is the case?
BI tools have a relatively low barrier to entry in the current market. Because of this, pricing is a race to the bottom, especially for cloud-first and cloud-native solutions. If every vendor is cutting on price, it is very difficult for these companies to generate meaningful revenue in the mid-tens or hundreds of millions of dollars in annual recurring revenue (ARR) to generate the returns venture capital needs.
Because of this, many of these BI tools become involved in two distinct games to make them more attractive to potential buyers and thus raise their valuations.
First, many BI tools engage in a game with cloud service providers (AWS, Azure, Google Cloud) and resellers of CSPs like Snowflake and Databricks, in which the BI tools drive incremental consumption revenue back to the CSPs and CSP resellers. Many buyers should be aware of this game before and during a POC or purchasing process.
Some BI tools will make overly complex queries or suggest patterns that increase Snowflake or similar bills an extra 10-20% per year just by using them vs. earlier stage tools like Tableau. Snowflake loves it. The BI tools love it. The customer may even love the experience at first until the costs start kicking in.
Ultimately, it’s up to customers to realize that these games occur. Removing things like overly complicated and overly scheduled materializations can go a long way to saving costs and latency. Removing 5 layers of SQL or dbt SQL before data even hits a BI tool is important as well.
Tableau is so widespread because they won the battle of the User Group and of community, and it’s not even close.
In the mid-to-late 2010s, Tableau had a very low barrier to entry with their Tableau Public offering. There were many programs like Viz of the Day and similar that allowed people to submit data projects and visualizations and grow their careers and presence in the market.
No other company has allowed individual developers to become overnight data superheroes. ThoughtSpot doesn’t have this. Mode doesn’t have this. Chartio didn’t have this. Even dbt Labs, which uses the idea of “community” heavily, doesn’t have this.
Alteryx and Power BI also have similar programs that allow developers to be showcased and have their projects shared with the masses. Alteryx went public and has recently announced a large buyout to take it private again. It’s very well distributed. Power BI is very well distributed.
The common thread here in GTM is that the most successful visual or front-end/customer-facing products with the largest enterprise values all enabled individual developers to become superheroes. These developers then show off their work on social media, they get job opportunities, and they become known in data and analytics pockets as experts.
Could Looker do it?
In 2018/19, I remember being fascinated by Looker, its version control, and embeddings. It felt like the tool of tomorrow. LookML was easy to learn and intuitive. Eventually, it disappointed me with the lack of custom visualizations and charting functions.
And yet, as you know, I’d still choose it today over static Tableau workbooks.
Looker did not necessarily follow the same playbook in their lifecycle. In the early days, a lot of Looker’s growth was fueled by their great professional services and customer success teams. Businesses were basically hiring “SQL experts,” and Looker was the product wrapped around it. Looker had a great semantic layer and was a sticky product because once an organization commits to a semantic layer, they are going to continue with that vendor. All the intermediary SQL is secondary.
Is using semantic layers the path to take?
I’m not sure what to make of the latest wave of semantic layers. I am not sure there is much benefit a semantic layer offers vs. a traditional OLAP cube.
The semantic layer sitting on cloud SQL, on consumption-based data warehouses, is mostly just a way to drive more incremental consumption.
What I think is much more interesting is the work being done outside of the SQL-only world. For example, I recently met with the founder of Brim Data, Steve McCanne, who showed me his view of the world in which data types are a first-class citizen. Right now, all the Avro, Parquet, throw-it-into-warehouse, SQL compute-it-out way of thinking about the world is highly inefficient. What if data types were a first-class citizen? This would remove a lot of the reformatting needed in managing data. Right now, data is being transformed down and rolled back up, passing through 4-5 people at a minimum, and it’s all just a factory line.
I also think another real battle is in change data capture. Right now, the “Modern Data Stack” is merely a tax on latency. If you want better latency, the easiest way is to crank up the computing resources involved, which very quickly becomes expensive. Managing data schemas before dumping the data into location number two is way more efficient than relying on the raw compute resources of location number two to write layers of business logic.
Are dashboards really dead? [referring to ThoughtSpot ebook]
Dashboards are not dead.
The word dashboard in data is literally stolen from a car and airplane dashboard. A dashboard tells you how to get from Point A to Point B. How fast are you going, how much gas is left in the tank, and are there notifications that need to be addressed, like a check engine light?
But the important thing is that you can’t take actions on the car dashboard. You can watch your fuel go from “Full” to “Empty” on the dashboard, but you can’t actually fill your tank with the dashboard.
I am very confident that every dashboard should have a line graph showing a trend over time, a big number showing the KPI for the day/week/quarter/year, and then Top N. If you want to go a step further and add a full table, go ahead. If you want to add a color or notification to show anomalous behavior, go ahead.
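As a loose illustration of those three core pieces (trend over time, one big KPI number, Top N), here is a minimal sketch. The order records, the field layout, and the function name `dashboard_summary` are all hypothetical, and revenue is just one possible KPI.

```python
from collections import defaultdict

# Hypothetical order records: (day, product, revenue).
orders = [
    ("2024-01-01", "widget", 120.0),
    ("2024-01-01", "gadget", 80.0),
    ("2024-01-02", "widget", 150.0),
    ("2024-01-02", "gizmo", 40.0),
    ("2024-01-03", "gadget", 200.0),
]


def dashboard_summary(rows, top_n=2):
    """Compute the three core dashboard elements from raw order rows."""
    by_day = defaultdict(float)
    by_product = defaultdict(float)
    for day, product, revenue in rows:
        by_day[day] += revenue
        by_product[product] += revenue

    # Trend over time: daily totals in date order (feeds the line graph).
    trend = sorted(by_day.items())
    # Big number: total revenue for the whole period.
    kpi = sum(by_day.values())
    # Top N: the highest-revenue products, largest first.
    top = sorted(by_product.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {"trend": trend, "kpi": kpi, "top_n": top}


summary = dashboard_summary(orders)
print(summary["trend"])
print(summary["kpi"])
print(summary["top_n"])
```

A full table or an anomaly flag would be extra keys on the same summary, which is roughly the "go ahead" territory described above.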
One very funny thing is that ThoughtSpot recently bought Mode, and Mode is heavily into the idea of the dashboard. I’ve seen companies where there are more Mode dashboards than employees working at the company by a factor of 2 or 3. So how can ThoughtSpot say the dashboard is dead if they just paid about $200M for a company that is based around the concept of dashboards?
Dashboards will continue to be both dead and undead at the same time, like Michael Jackson in the “Thriller” music video.
Over the last few years, there has been a wave of “next-generation” Excel and spreadsheet tools (e.g., Equals, Quip, Arcwise, Zoho, etc.)
Do you believe they could finally take over Excel?
I do not, and the reason why relates to metadata collection and privacy. Maybe this can be considered a hot take, but it is something that I know prevents distribution at many companies.
Many of these Excel alternatives are simply Excel-in-the-cloud. They collect a lot of data on users, which can include browsing patterns and metadata about what is specifically being run through the platform.
For example, if you have a table called “all_orders,” which means all gross orders your business does, and each row in the table is a gross order, you are passing through usage data about the health of your business (how strong your sales are) to many of these vendors. In the Terms and Privacy Policies of all these vendors, you can see that most of them reserve the right to sell or use your data to “improve the services,” which legally means nothing.
If, for example, your business traditionally does 1000 orders a day on average, then all of a sudden you are only doing 500 a day on average over a few weeks, you are passing this data to these data vendors that help you make your reports. They will be able to detect these patterns and then use this information about your business if they wish.
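To show how little it takes to spot this kind of pattern, here is a minimal sketch that works from nothing but daily row counts of a table like “all_orders”. The function name `volume_drop`, the 14-day window, and the 30% threshold are assumptions chosen for illustration. This is not a description of what any particular vendor does.

```python
from statistics import mean


def volume_drop(daily_row_counts, window=14, threshold=0.3):
    """Return the fractional drop of the recent window versus the
    earlier baseline, or None if there is no drop above the threshold.

    daily_row_counts: one integer per day, oldest first. No row
    contents are needed, only how many rows showed up each day.
    """
    if len(daily_row_counts) < 2 * window:
        return None  # not enough history to compare
    baseline = mean(daily_row_counts[:-window])
    recent = mean(daily_row_counts[-window:])
    if baseline == 0:
        return None  # nothing to compare against
    drop = (baseline - recent) / baseline
    return drop if drop >= threshold else None


# Hypothetical: 30 days at ~1000 orders/day, then two weeks at ~500.
counts = [1000] * 30 + [500] * 14
print(volume_drop(counts))
```

The point of the sketch is that usage metadata alone, without ever reading a single order, is enough to infer that sales have roughly halved.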
This is a large reason why many of these cloud-first platforms struggle with enterprise. Many enterprise customers simply do not want to give away usage patterns about their business health.
The upside of these spreadsheet platforms is that they may offer some niceties over Excel, but in my experience, Excel, which can be run locally, is too widely distributed and too popular to be overtaken.
You dedicate a lot of time to researching different data platforms. What tools, companies, or technologies do you appreciate and recommend?
This is an incredibly silly answer, but I promise it goes somewhere. I have something called “Mascot Theory” that has become popular in investment circles. The idea is that the more a business uses mascots to sell and promote their offerings, the worse their offerings are, and that once mascots start appearing, there are unfavorable terms and hidden costs the customer may not understand when signing up.
Take, for example, Payday Loan or Cash-For-Gold places that you may see across the US, in small towns, in rural areas, and even in big cities. Many of these places have sign spinners out front or even an Uncle Sam on stilts. Most people know, of course, that they aren’t getting good rates or offers from these places, but because these businesses use mascots, it takes away the pain for the customer of knowing you are getting the bad end of the bargain.
It’s the same with banks. If you go to a baseball game and they have Free Bobblehead Night or Free Bat Night, where all the kids get a free bat, these are almost always sponsored by banks and almost always have the sports team’s mascot shooting T-shirts out of a cannon with the bank’s name on it.
Why is this? Well, when you add a fun mascot in the mix, you forget that that bank owns 50% of the equity in your house. You forget that the bank declined you for a credit line increase on your credit card. You forget that the bank credit card has worse interest rates than other credit cards you could be using.
It’s all based on mascots. If thousands of people at the ballgame are laughing and everyone else is playing along with the mascots, and we all collectively look at each other and agree we are all having fun, we forget that the bank that’s putting on the show owns 50% of our houses and charges us high interest rates on credit but pays low returns on our deposits.
How does this relate to the data world? It’s very simple.
Snowflake, for example, has a well-known polar bear mascot. You can hug the Snowflake mascot or take a picture with it at all kinds of events. Snowflake, too, now has superhero mascots floating around in costumes at various events. Some of these are hired actors, and others are various analytics people and engineers they put in capes. So now we’ve got a polar bear mascot, plus superhero mascots in the Snowflake ecosystem.
Salesforce has about a dozen mascots. If you go to a Salesforce event, there are many costumed mascots floating around. In fact, some in the Salesforce community even dress up as their favorite mascots and post these pictures on Instagram and other social media!
Why do these platforms have dozens of mascots? Well, it is because these platforms have the worst unit costs. The more silliness that goes on, the worse the rates tend to be, just like with the bank, just like with the Cash-For-Gold place.
Google Cloud Platform doesn’t have mascots. Databricks doesn’t have mascots. For example, there is no Bricky the Databricks Brickster dressed up like a Databricks brick. AWS has started in on mascots lately, and they now have an S3 data storage mascot.
It’s true!
Of course, you can blow up spend and make all kinds of poor solutions on any platform – there is no shortage of writing bad code or designing inefficient solutions, and every platform vendor is happy to take your money. However, I always lean toward staying away from mascots.
I know this answer will get me in trouble with the Snowflake people, so I’ll say that Snowflake can be a great solution to get started. There is no denying that.
How about data influencers and experts - whom do you enjoy reading and keeping an eye on?
There are no good data influencers. Data influencers can be dropped into a volcano as a sacrifice.
How do you see this new era of GenAI transforming analytics?
I believe most text-to-SQL is going nowhere, and those companies are a dime a dozen and will peter out or pivot entirely in the next year. I am not especially bullish on anything that seeks to improve BI or improve tooling.
What I am interested in is the ability of AI to improve audiences and consumer privacy. Right now, the entire analytics world is set up to:
Collect click and event-based behaviors
Collect purchase behaviors
Throw this all into models and recommendation engines and return back to the consumer what a company *thinks* is what the consumer wants, which is then delivered through ads and promotions.
But all of this is wrong, in my opinion. Companies already know who their top 1%, 2%, 5% spenders are, and these customers are always an outsized portion of revenue.
Instead of companies making best guesses about what these consumers want, both parties should have tighter feedback loops, and these consumers would ideally want to give more data to these brands they like and trust and give less data to brands they don’t like or trust.
Right now, ad networks have the most power, brands have the second-most power, and consumers have the least power. Consumers have their data sold 10 different ways across brokers and networks, and it ends up in hedge funds, where it is used to price equities. It’s all a mess.
The cookie is going away after years of Google playing “will they or won’t they” with the market. Gmail is also making it harder for bulk senders of emails to spam customers in 2024.
All of this means that it is now more expensive to acquire new customers, and these dollars can probably be better allocated to retaining top existing customers. This means more personalization and more experiences. AI will be best deployed in these retention use cases.
Is there anything else you want to share to encourage or inspire people to learn data and analytics?
All of analytics and data engineering is just applied accounting. The more you know about finance, the stronger you are as an analytics professional.
Thank you, Lauren!
Thanks for reading, everyone!