Senior Software Engineer - Web Crawling
Datafiniti - Austin, Texas / Remote - a month ago
SENIOR SOFTWARE ENGINEER, Web Crawling Architecture, $160K-$180K, Austin/Remote
About the company
Datafiniti was founded with the vision of empowering people and organizations with data. If someone was building technology that required data to power it, we wanted to remove the hurdle of data acquisition so that that technology could come to life. Ten years later, we have brought that vision to life and are looking to expand it as far as possible. We currently have a wide variety of customers in finance, retail, proptech, and marketing, who use our technology to develop their products.
We collect a wide variety of information and content from thousands of sources and transform it all into highly-structured, instantly accessible data that can be integrated right away into any application or analysis someone is building. Startups and Fortune 500s alike use our various data sets and APIs to power thousands of solutions, including fraud prevention, investment algorithms, pricing analysis, mobile apps, lead generation, and much, much more.
In order to efficiently support such a wide variety of customers, we focus on building a highly flexible, robust, and scalable data infrastructure. Our technology is capable of ingesting, processing, and serving out billions of data points every day. Our small, close-knit team works together to develop technology and operational capabilities that allow us to meet the needs of an ever-expanding set of use cases.
Over the last two years, we have doubled our customer base and revenue each year and are on course to do so again this year! As we enter a new phase of growth, we are seeking to build a new engineering team from scratch and are specifically looking for people who are eager to "own" the technology and scale it for growth. This is a unique opportunity that offers an experience similar to joining a start-up on Day 1, while already having an established customer base and a strong technology foundation. We’re hoping to bring on new team members who are excited to work on the challenges of our unique business and push the boundaries of what “scale” truly means.
About the role
We wish to hire an experienced software engineer who has a strong passion for developing highly scalable distributed systems, particularly around web crawling or web scraping. A significant part of our technology stack is dedicated to:
- Efficiently fetching content from a large volume of URLs (15 million per day)
- Running self-contained “micro applications” on each URL’s page content to convert it into semi-structured data
- Passing that data to a separate stack responsible for data warehousing
At a high level, this engineer will be responsible for the maintenance, improvement, and expansion of this stack.
Specific responsibilities can be broken down into the following categories:
Maintenance of existing systems
- Maintaining optimal throughput within the crawling engine’s pipeline
- Maintaining and building upon each individual microservice to tackle problems such as:
- Preventing duplicate URLs from being crawled
- Distributing the correct amount of work to different microservices
- Safely running arbitrary code against HTML without locking up an entire process
- Maintaining queryable queues that hold hundreds of millions of URLs while keeping retrieval within milliseconds
- Leveraging one or many different proxy providers to increase the likelihood of a data-yielding response
- Collecting and reporting valuable data points and metrics throughout all parts of the system (90 million data points per minute)
- Streaming large volumes of normalized web data to multiple points of interest
- Maintaining an API that sits in front of the crawling engine
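As an illustration of the duplicate-URL problem listed above, here is a minimal sketch of one common approach: reduce each URL to a canonical key and skip keys that have already been seen. It is not a description of Datafiniti's implementation. The names `canonicalKey` and `SeenUrls` are hypothetical, and the in-memory `Set` is an assumption standing in for whatever shared store a distributed crawler would actually use.

```typescript
// Hypothetical helper: reduce a URL to a canonical key so that trivially
// different spellings of the same address compare equal.
function canonicalKey(rawUrl: string): string {
  const url = new URL(rawUrl); // throws on malformed input
  url.hash = "";               // fragments are never sent to the server
  url.searchParams.sort();     // make query-parameter order irrelevant
  return url.toString();       // the URL parser already lowercases scheme and host
}

// Assumption: an in-memory set stands in for a shared, persistent store.
class SeenUrls {
  private readonly keys = new Set<string>();

  // Returns true the first time a URL is offered, false for any repeat.
  markIfNew(rawUrl: string): boolean {
    const key = canonicalKey(rawUrl);
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }
}

const seen = new SeenUrls();
const candidates = [
  "https://Example.com/products?b=2&a=1",
  "https://example.com/products?a=1&b=2#reviews", // same page as the first
  "https://example.com/products?a=1&b=3",
];
const toCrawl = candidates.filter((u) => seen.markIfNew(u));
console.log(toCrawl.length); // prints 2
```

Which differences count as "the same URL" (tracking parameters, trailing slashes, and so on) is a per-site policy decision that this sketch leaves out.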
New systems to support upcoming business needs
- Architecting and implementing a robust solution to increase the likelihood of getting a successful response from any number of domains. This solution will leverage IPs provided by various proxy providers, along with an elegant approach to rotating these IPs and pairing them with cookies generated for a given domain
- Implementing a solution that provides more insight/vision into the health and status of data acquisition from any domain
- Architecting and implementing an elegant, dynamic, and performant distribution mechanism that provides us with full control over what we’re crawling at any given time
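To make the proxy-and-cookie pairing above more concrete, here is a minimal sketch under stated assumptions, not the design the role will produce. It rotates proxies round-robin per domain and reuses the cookie generated for each (domain, proxy) pair. The `IdentityRotator` class, the `makeCookie` callback, and the proxy hostnames are all hypothetical.

```typescript
// Hypothetical shape: a proxy address plus the cookie generated for one
// domain through that proxy.
interface Identity {
  proxy: string;
  cookie: string;
}

class IdentityRotator {
  private readonly proxies: string[];
  // Assumed callback that obtains a cookie for a domain through a proxy.
  private readonly makeCookie: (domain: string, proxy: string) => string;
  private readonly cursors = new Map<string, number>(); // next proxy index per domain
  private readonly cookies = new Map<string, string>(); // "domain|proxy" -> cookie

  constructor(
    proxies: string[],
    makeCookie: (domain: string, proxy: string) => string,
  ) {
    if (proxies.length === 0) throw new Error("at least one proxy is required");
    this.proxies = proxies;
    this.makeCookie = makeCookie;
  }

  // Round-robin over proxies independently for each domain, reusing the
  // cookie previously generated for that (domain, proxy) pair.
  next(domain: string): Identity {
    const index = this.cursors.get(domain) ?? 0;
    this.cursors.set(domain, (index + 1) % this.proxies.length);
    const proxy = this.proxies[index];
    const key = `${domain}|${proxy}`;
    let cookie = this.cookies.get(key);
    if (cookie === undefined) {
      cookie = this.makeCookie(domain, proxy);
      this.cookies.set(key, cookie);
    }
    return { proxy, cookie };
  }
}

let issued = 0; // counts how many cookies were actually generated
const rotator = new IdentityRotator(
  ["proxy-a.example:8080", "proxy-b.example:8080"],
  (domain) => `session=${domain}-${++issued}`,
);
const first = rotator.next("example.com");
const second = rotator.next("example.com");
const third = rotator.next("example.com"); // wraps around to the first proxy
console.log(third.proxy === first.proxy, third.cookie === first.cookie); // prints true true
```

A production version would also need to retire proxies and cookies that stop yielding successful responses. Plain round-robin is only the simplest rotation policy to illustrate.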
Additional responsibilities include:
- Diagnosing and fixing highly complex technical issues independently
- Supporting the build and deployment pipeline and, when necessary, diagnosing and solving production support issues
- Communicating individual and project-level development statuses, issues, risks, and concerns to technical leadership and management
- Identifying and communicating cross-team dependencies to respective peers
- Writing specification documents that include the feature-set being developed, explaining how these features will be implemented, and gaining stakeholder approval for the feature-set
- Conducting thorough QA as a part of the development lifecycle prior to a production release
Specific technologies required for this role include:
- Node.js (5+ years of experience): All of the microservices that make up our crawling engine are written in Node.js, and each microservice leverages Node’s native clustering capabilities for parallelism
- Express.js or a similar API framework such as Hapi, Koa, or Restify: The API that sits in front of our crawling engine is written in Express.js
- MongoDB: We use MongoDB as our primary user database
- MySQL: We use MySQL in a novel way to maintain our URL queues
- Redis: We use Redis to maintain global state within our distributed system
- AWS and Docker: All of our microservices are containerized via Docker and deployed to AWS
This role will also require a deep understanding of the following concepts or skills:
- Thorough understanding of distributed systems and how to make them reliable, scalable and maintainable
- Experience with web crawling/scraping at a very large scale including the use of proxy rotation
- Deep understanding of the design, implementation, and consumption of REST APIs
- Excellent verbal and written communication skills
- Strong analytical, problem solving, debugging and troubleshooting skills
Additional skills that will be considered a plus but are not required:
- StatsD / Graphite / Grafana
Compensation & Benefits
Compensation and benefits for this role include:
- $160K - $180K annual base salary
- Equity in a company that is doubling its revenue every year
- Comprehensive health insurance (medical, dental, vision, life)
- Unlimited PTO, but 15 days MINIMUM required, preferably more
- 7 federal holidays + 4 quarterly company-wide holidays
Additional benefits include:
- Highly flexible work-life balance: If working in Austin, we ask that you work from the office at least twice a week (for team bonding!), and you are free to work from home on your own schedule otherwise.
- High degree of job autonomy: Team members are encouraged to experiment with their own implementations, propose ideas for company needs, and explore new solutions.
- Career development: Executive leaders work with team members to align personal goals with company goals.
- A supportive team environment: Team culture is focused on providing a supportive and positive environment for everyone.
NOTE: We prefer to hire someone who can work from our Austin office, but are open to hiring remotely for someone who is a great fit for the role.
Job Type: Full-time
Pay: $160,000.00 - $180,000.00 per year
Our office is open again. We are allowing fully vaccinated employees to work from the office without wearing masks. Anyone who is not yet fully vaccinated is asked to continue working from home.